Reliability Research - Fortune or Fallacy
Overall theme (Antonio Gonzalez):
The purpose of this panel is
to debate the relevance of reliability research for computer architects.
In the past, reliability was addressed through fabrication techniques
(e.g., burn-in and testing) and circuit techniques, while
microarchitectural techniques were focused primarily on mission-critical
systems.
However, over the past 5-10 years, reliability has moved more into the
mainstream of computer architecture research.
On the one hand, transient and permanent faults are a looming problem,
driven by continued CMOS scaling, that must be solved. In a recent
keynote, Shekhar Borkar
summed up the emerging design space as follows: "Future designs will
consist of 100B transistors, 20B of which are unusable due to
manufacturing defects, 10B will fail over time due to wearout, and regular
intermittent errors will be observed." This vision clearly suggests that
fault tolerance must become a first-class design feature.
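Taken at face value, Borkar's numbers imply that nearly a third of a
future chip's transistors cannot be relied upon. A minimal
back-of-the-envelope sketch in Python (the percentages are derived from
the quote itself, not additional data from the keynote):

    # Back-of-the-envelope reading of Borkar's numbers (illustrative only).
    total = 100e9    # transistors on a future chip
    defects = 20e9   # unusable due to manufacturing defects
    wearout = 10e9   # expected to fail over time due to wearout

    print(f"Dead on arrival:       {defects / total:.0%}")   # 20%
    print(f"Failing over lifetime: {wearout / total:.0%}")   # 10%
    usable = total - defects - wearout
    print(f"Dependable at end of life, at best: {usable / total:.0%}")  # 70%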
On the other hand, some people believe that reliability provides little
added value for the bulk of computer systems that are sold today. They
claim that researchers have artificially inflated the magnitude of the
problem to increase the perceived value of their work. In reality, they
argue, consumers have come to accept unreliable operation as commonplace,
and significant increases in hardware failure rates will have little
effect on the end-user experience. Reliability is simply a tax that the
doomsayers want to levy on your computer system.
This panel will confront these two points of view through two world-class
researchers in computer architecture: Scott Mahlke and Shubu Mukherjee.
Fortune viewpoint (Shubu Mukherjee):
Captain Jean-Luc Picard of the starship USS
Enterprise once said that there are three versions of the truth: your truth,
his truth, and the truth. An end-user's truth is that occasional failures
are a nuisance, but not a showstopper. That is, of course, unless one
happens at an inconvenient moment, such as when Windows 98 crashed during
a Bill Gates demo. The truth, however, is very different from the
perspective of an IT manager who has to deal with thousands of end-users:
the greater the number of end-user complaints per day, the greater her
company's total cost of ownership for those machines. And the God-given
truth
is that silicon reliability is getting worse with every generation,
revealing the dark side of Moore's Law.
I will argue that fault tolerance has become a mainstream architectural
consideration for most silicon chips the industry will produce. The goal
of hardware vendors is to keep the hardware error rate low enough that
hardware failures remain hidden behind software crashes and bugs. This is
becoming increasingly difficult: on average, software is getting more
reliable, while silicon reliability is rapidly getting worse.
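A toy model makes that balancing act concrete. The failure rates below
are invented placeholders for illustration, not figures from the panel:

    # Toy model of the "hiding behind software" argument.
    # Both rates are made-up placeholders, not measured data.
    sw_failures_per_month = 3.0    # crashes the user already blames on software
    hw_failures_per_month = 0.03   # user-visible hardware errors

    total = sw_failures_per_month + hw_failures_per_month
    print(f"Hardware share of visible failures: {hw_failures_per_month / total:.1%}")  # ~1%

    # If hardware error rates grow 10x while software holds steady,
    # the hardware share approaches 10% -- much harder to hide.
    hw10 = 10 * hw_failures_per_month
    print(f"After a 10x increase: {hw10 / (sw_failures_per_month + hw10):.1%}")  # ~9%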
Radiation-induced soft errors, process-related instability, wearout, and
variability are introducing challenges and risks to chip design that the
industry has never had to confront before. To make matters worse,
architecture research in reliability is fraught with misconceptions and
fallacies. It will take considerable discipline, education, and research
to overcome these hurdles.
Fallacy viewpoint (Scott Mahlke):
How often does your laptop crash? Bill Gates has stated that 5 percent of
Windows machines crash, on average, twice daily. Put another way, any
given machine will crash about three times a month.
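That conversion is simple expected-value arithmetic; a minimal sketch,
taking the quoted numbers at face value:

    # Sanity check of the crash statistic, taking the quote at face value.
    fraction_crashing_per_day = 0.05  # 5% of machines crash on a given day
    crashes_on_such_a_day = 2         # those machines crash twice that day
    days_per_month = 30

    expected = fraction_crashing_per_day * crashes_on_such_a_day * days_per_month
    print(f"Expected crashes per machine per month: {expected:.1f}")  # 3.0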
How often do you lose a call on your cell phone? How often is a word
garbled so badly that you have to ask the person on the other end to
repeat it? Would you care if a pixel in one frame of your video were the
wrong color? The majority of
consumers care little about the reliable operation of electronic devices,
and their concern is dropping as these devices become more disposable. In
2006, the average lifetime of a business cell phone was 9 months. Building
devices whose hardware functions flawlessly for 20 years simply does not make
economic sense. Further, the processor is one of the least likely sources
of faults. Third-party software, operating systems, disks, memory, and
LCD screens all have much lower reliability. Why, then, are computer
architects spending so much effort making processors highly reliable?
Even if silicon fault rates increase by an order of magnitude in the coming
years, the end user is unlikely to see any difference because the reliability
of the overall system is dominated by other factors. Further, most
devices will be replaced before wearout defects can
manifest. In this panel, I will present the top five reasons why computer
architecture research in reliability is a fallacy.