Towards scalable reliability frameworks for error prone CMPs
Source
International Conference on Compilers, Architecture and Synthesis for Embedded Systems
Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems
Grenoble, France
SESSION: Reliability and reconfigurability
Pages 261-270
Year of Publication: 2009
ISBN: 978-1-60558-626-7
Authors
Joseph Sloan  Coordinated Science Laboratory, University of Illinois, Urbana, IL, USA
Rakesh Kumar  Coordinated Science Laboratory, University of Illinois, Urbana, IL, USA
Sponsors
SIGDA: ACM Special Interest Group on Design Automation
ACM: Association for Computing Machinery
SIGBED: ACM Special Interest Group on Embedded Systems
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1629395.1629432

ABSTRACT

As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the computation stack. The error rates would be especially high for post-CMOS and nanoelectronic systems as well as for probabilistic [3] and stochastic architectures [4]. N-modular redundancy (NMR) at the core level has been proposed as a way to attain system reliability goals for multicore architectures. While core-level DMR and TMR have been shown to be effective when errors are rare, a large amount of core-level redundancy will be required to attain system reliability goals in the face of high error rates. This makes voting latency and bandwidth significant performance bottlenecks for such systems. In this paper, we present a scalable NMR framework for error-prone chip multiprocessors (CMPs). The framework supports in-network fault tolerance, where voting logic is integrated into routers to allow for truly distributed voting. The in-network fault tolerance router uses the expected redundancy in vote messages to reduce some of the blocking overhead incurred at the leader, and also provides a mechanism to trade off network bandwidth against latency. Our framework also supports proactive checkpoint deallocation, which allows cores participating in voting to continue execution instead of waiting for notification from the voting logic. Finally, the framework supports dynamic constitution, which allows an arbitrary core on the chip to be part of an NMR group. This allows bypassing faulty cores as well as scheduling for performance. Our experiments show significant performance/bandwidth benefits from these optimizations.
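
As a rough illustration of the voting idea in the abstract, the sketch below shows how identical vote messages could be merged into partial tallies before they reach a leader, and how a strict majority could then be chosen. It is not the authors' router design: the function names (combine_votes, majority, nmr_success_probability), the tally representation, and the independence assumption in the probability estimate are all hypothetical choices made for this example.

```python
from collections import Counter
from math import comb


def combine_votes(*partial_tallies):
    """Merge partial tallies (result -> count), as a router might do in-network.

    Hypothetical helper: identical votes collapse into one entry with a
    count, so fewer distinct messages travel toward the leader.
    """
    merged = Counter()
    for tally in partial_tallies:
        merged.update(tally)  # Counter.update adds counts for matching keys
    return merged


def majority(tally, group_size):
    """Return the result held by a strict majority of the group, else None."""
    if not tally:
        return None
    value, count = tally.most_common(1)[0]
    return value if count > group_size // 2 else None


def nmr_success_probability(n, p_correct):
    """Probability that a strict majority of n replicas is correct.

    Assumes replicas fail independently, each correct with probability
    p_correct (an assumption of this sketch, not a claim from the paper).
    """
    need = n // 2 + 1
    return sum(
        comb(n, k) * p_correct**k * (1 - p_correct) ** (n - k)
        for k in range(need, n + 1)
    )


# Five replicas, one of which produced a corrupted result (0x13).
left = combine_votes(Counter({0x2A: 2}), Counter({0x2A: 1}))
right = combine_votes(Counter({0x2A: 1}), Counter({0x13: 1}))
winner = majority(combine_votes(left, right), group_size=5)
```

Under the same independence assumption, nmr_success_probability(n, p) gives a feel for why higher per-core error rates call for larger groups: for a fixed target, a lower p needs a larger n, which in turn increases the number of vote messages that have to be combined.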


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
[1] International Technology Roadmap for Semiconductors 2005, http://public.itrs.net.
[2] C. Constantinescu, "Trends and challenges in vlsi circuit reliability," Micro, IEEE, vol. 23, no. 4, pp. 14-19, July-Aug. 2003.
[3] L. N. Chakrapani, P. Korkmaz, B. E. S. Akgul, and K. V. Palem, "Probabilistic system-on-a-chip architectures," ACM Trans. Des. Autom. Electron. Syst., vol. 12, no. 3, pp. 1-28, 2007.
[4] Stochastic Processors (or processors that do not always compute correctly by design), NSF Workshop on Science of Power Management. [Online]. Available: http://scipm.cs.vt.edu/Slides/2.RakeshKumar.pdf
[5] P. Shivakumar, M. Kistler, S. Keckler, D. Burger, and L. Alvisi, "Modeling the effect of technology trends on the soft error rate of combinational logic," 2002, pp. 389-398.
[6] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor," in MICRO 36: Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 2003, p. 29.
[7] N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel, "Characterizing the effects of transient faults on a high-performance processor pipeline," in DSN '04: Proceedings of the 2004 International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2004, p. 61.
[8] A. Ionescu, "New functionality and ultra low power: key opportunities for post-cmos era," April 2008, pp. 72-73.
[9] K. Tsukagoshi, N. Yoneya, S. Uryu, Y. Aoyagi, A. Kanda, Y. Ootuka, and B. W. Alphenaar, "Carbon nanotube devices for nanoelectronics," Physica B: Condensed Matter, vol. 323, no. 1-4, pp. 107-114, 2002, proceedings of the Tsukuba Symposium on Carbon Nanotube in Commemoration of the 10th Anniversary of its Discovery.
[10] A. van Roosmalen and G. Zhang, "Reliability challenges in the nanoelectronics era," Microelectronics and Reliability, vol. 46, no. 9-11, pp. 1403-1414, 2006, proceedings of the 17th European Symposium on Reliability of Electron Devices, Failure Physics and Analysis. Wuppertal, Germany 3rd-6th October 2006.
[11] Predictive Technology Model, Arizona State University, School of Engineering. [Online]. Available: http://www.eas.asu.edu/~ptm/
[12] B. C. Paul, S. Fujita, M. Okajima, and T. Lee, "Modeling and analysis of circuit performance of ballistic cnfet," in DAC '06: Proceedings of the 43rd annual conference on Design automation. New York, NY, USA: ACM, 2006, pp. 717-722.
[13] M. L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. New York, NY, USA: John Wiley & Sons, Inc., 2002.
[14] D. Siewiorek, V. Kini, H. Mashburn, S. McConnel, and M. Tsao, "A case study of c.mmp, cm*, and c.vmp: Part i - experiences with fault tolerance in multiprocessor systems," Proceedings of the IEEE, vol. 66, no. 10, pp. 1178-1199, Oct. 1978.
[15] D. Avresky, S. Geoghegan, and Y. Varoglu, "Evaluation of software-implemented fault-tolerance (sift) approach in gracefully degradable multi-computer systems," Reliability, IEEE Transactions on, vol. 55, no. 3, pp. 451-457, Sept. 2006.
[16] A. L. Hopkins, Jr., T. B. Smith, III, and J. Lala, "Ftmp - a highly reliable fault-tolerant multiprocessor for aircraft," Proceedings of the IEEE, vol. 66, no. 10, pp. 1221-1239, Oct. 1978.
[17] T. M. Austin, "Diva: A dynamic approach to microprocessor verification," Journal of Instruction-Level Parallelism, vol. 2, p. 2000, 2000.
[18] D. Jewett, "Integrity s2: a fault-tolerant unix platform," Fault-Tolerant Computing, 1991. FTCS-21.
[19] Digest of Papers., Twenty-First International Symposium, pp. 512-519, Jun 1991.
[20] P. N. Sanda, J. W. Kellington, P. Kudva, R. Kalla, R. B. McBeth, J. Ackaret, R. Lockwood, J. Schumann, and C. R. Jones, "Soft-error resilience of the ibm power6 processor," IBM Journal of Research and Development, vol. 52, no. 3, pp. 275-284, 2008.
[21] J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe, "Reunion: Complexity-effective multicore redundancy," Microarchitecture, IEEE/ACM International Symposium on, vol. 0, pp. 223-234, 2006.
[22] C. LaFrieda, E. Ipek, J. F. Martinez, and R. Manohar, "Utilizing dynamically coupled cores to form a resilient chip multiprocessor," in DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2007, pp. 317-326.
[23] A. Golander, S. Weiss, and R. Ronen, "Ddmr: Dynamic and scalable dual modular redundancy with short validation intervals," Computer Architecture Letters, vol. 7, no. 2, pp. 65-68, Feb. 2008.
[24] D. Sanchez, J. L. Aragón, and J. M. Garcia, "Evaluating dynamic core coupling in a scalable tiled-cmp architecture," in Proc. of the 7th Int. Workshop on Duplicating, Deconstructing, and Debunking (WDDD), in conjunction with ISCA'08, Jun 2008.
[25] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, "The nyu ultracomputer - designing a mimd, shared-memory parallel machine," in ISCA '98: 25 years of the international symposia on Computer architecture (selected papers). New York, NY, USA: ACM, 1998, pp. 239-254.
[26] A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt, and D. A. Connors, "Using process-level redundancy to exploit multiple cores for transient fault tolerance," in DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks. Washington, DC, USA: IEEE Computer Society, 2007, pp. 297-306.
[27] The M5 Simulator System, University of Michigan, http://www.m5sim.org/wiki/index.php/mainpage