Vicis: a reliable network for unreliable silicon
Source: Annual ACM/IEEE Design Automation Conference
Proceedings of the 46th Annual Design Automation Conference
San Francisco, California
SESSION: Network-on-chip advances for power, reliability and the memory bottleneck
Pages 812-817  
Year of Publication: 2009
ISBN: 978-1-60558-497-3
Authors
David Fick  University of Michigan, Ann Arbor, MI
Andrew DeOrio  University of Michigan, Ann Arbor, MI
Jin Hu  University of Michigan, Ann Arbor, MI
Valeria Bertacco  University of Michigan, Ann Arbor, MI
David Blaauw  University of Michigan, Ann Arbor, MI
Dennis Sylvester  University of Michigan, Ann Arbor, MI
Sponsors
EDAC : Electronic Design Automation Consortium
SIGDA: ACM Special Interest Group on Design Automation
IEEE-CAS : Circuits & Systems
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1629911.1630119

ABSTRACT

Process scaling has given designers billions of transistors to work with. As feature sizes near the atomic scale, extensive variation and wearout inevitably make margining uneconomical or impossible. The ElastIC project seeks to address this by creating a large-scale chip-multiprocessor that can self-diagnose, adapt, and heal. The repetitive structure of a network-on-chip (NoC) lends itself naturally to creating large, flexible designs in this environment, but the loss of a single link or router will result in complete network failure. In this work we present Vicis, an ElastIC-style NoC that can tolerate the loss of many network components due to wearout-induced hard faults. Vicis uses the inherent redundancy in the network and its routers to maintain correct operation while incurring a much lower area overhead than previously proposed N-modular redundancy (NMR) based solutions. Each router has a built-in self-test (BIST) that diagnoses the locations of hard faults and runs a number of algorithms to best use ECC, port swapping, and a crossbar bypass bus to mitigate them. The routers also work together to run distributed algorithms that solve network-wide problems, protecting the network against critical failures in individual routers. In this work we show that with stuck-at fault rates as high as 1 in 2000 gates, Vicis will continue to operate with approximately half of its routers still functional and communicating.
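
As a rough illustration of what "routers still functional and communicating" can mean, the sketch below injects random link failures into a small grid of routers and counts the largest group that can still reach one another. It is not the authors' simulator or algorithm: the 2D mesh topology, the independent per-link failure probability `link_fail_prob`, and the function name `largest_connected_group` are all assumptions made for this example, and the sketch ignores the in-router mitigation (ECC, port swapping, crossbar bypass bus) described in the abstract.

```python
import random
from collections import deque


def largest_connected_group(width, height, link_fail_prob, seed=0):
    """Return the size of the largest set of mutually reachable routers.

    Hypothetical model: routers sit on a width x height mesh and each
    link between adjacent routers fails independently with probability
    link_fail_prob. This is an illustrative assumption, not the fault
    model used in the paper.
    """
    rng = random.Random(seed)
    nodes = [(x, y) for x in range(width) for y in range(height)]
    neighbors = {node: [] for node in nodes}

    # Decide, link by link, whether the link survives fault injection.
    for (x, y) in nodes:
        for (nx, ny) in ((x + 1, y), (x, y + 1)):
            if nx < width and ny < height and rng.random() >= link_fail_prob:
                neighbors[(x, y)].append((nx, ny))
                neighbors[(nx, ny)].append((x, y))

    # Breadth-first search over surviving links to size each component.
    seen = set()
    best = 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nb in neighbors[node]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3, 0.5):
        print(p, largest_connected_group(8, 8, p, seed=1))
```

Sweeping `link_fail_prob` in a model like this shows how the communicating portion of a network shrinks as faults accumulate. The quantitative results quoted in the abstract come from the authors' own gate-level fault experiments and cannot be reproduced from this sketch.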

