Vicis: a reliable network for unreliable silicon

Source
Annual ACM IEEE Design Automation Conference
Proceedings of the 46th Annual Design Automation Conference
San Francisco, California
SESSION: Network-on-chip advances for power, reliability and the memory bottleneck
Pages 812-817
Year of Publication: 2009
ISBN: 978-1-60558-497-3

Authors
David Fick, University of Michigan, Ann Arbor, MI
Andrew DeOrio, University of Michigan, Ann Arbor, MI
Jin Hu, University of Michigan, Ann Arbor, MI
Valeria Bertacco, University of Michigan, Ann Arbor, MI
David Blaauw, University of Michigan, Ann Arbor, MI
Dennis Sylvester, University of Michigan, Ann Arbor, MI

ABSTRACT
Process scaling has given designers billions of transistors to work with. As feature sizes near the atomic scale, extensive variation and wearout inevitably make margining uneconomical or impossible. The ElastIC project seeks to address this by creating a large-scale chip-multiprocessor that can self-diagnose, adapt, and heal. Creating large, flexible designs in this environment naturally lends itself to the repetitive nature of network-on-chip (NoC), but the loss of a single link or router will result in complete network failure. In this work we present Vicis, an ElastIC-style NoC that can tolerate the loss of many network components due to wearout-induced hard faults. Vicis uses the inherent redundancy in the network and its routers to maintain correct operation while incurring a much lower area overhead than previously proposed N-modular redundancy (NMR) based solutions. Each router has a built-in self-test (BIST) that diagnoses the locations of hard faults and runs a number of algorithms to best use ECC, port swapping, and a crossbar bypass bus to mitigate them. The routers also work together to run distributed algorithms that solve network-wide problems, protecting the network against critical failures in individual routers. In this work we show that with stuck-at fault rates as high as 1 in 2000 gates, Vicis will continue to operate with approximately half of its routers still functional and communicating.
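
To give a feel for the fault rate quoted in the abstract, the short Python sketch below works through the arithmetic of "1 in 2000 gates". It is only an illustration: it assumes stuck-at faults hit gates independently, and the per-router gate count is a hypothetical placeholder, not a figure from the paper. The function and constant names are likewise invented for this sketch.

```python
def expected_faults(gates: int, fault_rate: float) -> float:
    """Expected number of faulty gates among `gates` gates."""
    return gates * fault_rate


def fault_free_probability(gates: int, fault_rate: float) -> float:
    """Probability that none of `gates` gates is faulty.

    Assumes each gate fails independently with probability `fault_rate`
    (a simplifying assumption made for this sketch).
    """
    return (1.0 - fault_rate) ** gates


# Fault rate taken from the abstract: one stuck-at fault per 2000 gates.
FAULT_RATE = 1 / 2000

# Hypothetical router size, chosen only to make the arithmetic concrete.
HYPOTHETICAL_GATES_PER_ROUTER = 20_000

if __name__ == "__main__":
    n = HYPOTHETICAL_GATES_PER_ROUTER
    print(f"expected faults per router: {expected_faults(n, FAULT_RATE):.1f}")
    print(f"P(router is fault-free):    {fault_free_probability(n, FAULT_RATE):.2e}")
```

Under these assumptions almost no router of that size would be entirely fault-free, which is consistent with the abstract's emphasis on diagnosing and working around faults inside each router rather than discarding any router that contains one.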