Single-threaded vs. Multi-threaded position statements
-
Joel Emer -
In the early 1970s, when Intel introduced the 4004 microprocessor,
researchers immediately proposed that immense performance improvements
could be delivered by hooking together 100s or 1000s of 4004s.
Probably, you've never seen one of these wonders, or the fruits of any of
the subsequent proposals to build massively parallel machines out of each
successive microprocessor generation. What, if anything, has changed now?
Clearly it is not just that we cannot achieve IPC gains in direct
proportion to the number of transistors used, because that's never been
true. If it had been true, we'd have IPCs several orders of magnitude
larger than we have today. Our recent rule of thumb has been that processor
performance improves as the square root of the number of transistors, and
that cache miss rates likewise improve as the square root of the cache size.
Yet, despite these sub-linear architectural improvements, single-processor
machines have been the preferred trajectory. Why is this insufficient for today?
Are there no ideas that will bring even that sub-linear architectural
performance gain? Why aren't the order-of-magnitude gains being promised
by SIMD, vector and/or streaming processors of interest?
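
To make that square-root rule of thumb concrete, here is a small
back-of-the-envelope sketch (an illustration added here, not part of the
original statement; the baseline miss rate and the scaling factors are
assumed values):

    /* Sketch of the sub-linear rules of thumb quoted above:
     * performance ~ sqrt(transistors), miss rate ~ 1/sqrt(cache size).
     * All constants are illustrative assumptions, not measured data. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double base_miss_rate = 0.10;   /* assumed miss rate at the baseline cache size */

        for (double scale = 1.0; scale <= 64.0; scale *= 4.0) {
            double perf      = sqrt(scale);                  /* relative single-thread performance */
            double miss_rate = base_miss_rate / sqrt(scale); /* relative cache miss rate           */
            printf("budget x%-4.0f -> ~%4.1fx performance, ~%4.1f%% miss rate\n",
                   scale, perf, miss_rate * 100.0);
        }
        return 0;
    }

Under these assumptions, a 64x larger transistor budget buys only about an
8x faster core, which is the sub-linear trajectory the questions above take
as given.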
Is our problem one of a lack of innovation rabbits, now that we've lost MIPS
and Alpha and marginalized SPARC and Power? Is the complexity of the x86
architecture a factor in the inability to push across the next
architectural performance step?
Or is there a feeling that, irrespective of architecture, we've crossed
a complexity threshold beyond which we can't build better processors in a
timely fashion? Is using full (but simpler) processors as the building
blocks the right granularity? Are multiprocessors really such a panacea of
simplicity? How much complexity is going to be introduced in the
inter-processor interconnect, in the shared cache hierarchy and in
processor support for mechanisms like transactional memory?
And much of the challenge of past generations has been coping with the
increasing disparity between processor speed and memory speed, and with the
limits of die-to-memory bandwidth. Does having a multiprocessor with its
multiple contexts just make this problem worse, not better?
And even if multiprocessors really are a simpler alternative, what is the
application domain over which they are going to provide a benefit? And
will enough people be able to program them?
-
Yale Patt -
I start with the premise that the purpose of a chip is to optimally
execute someone's desired single application. That is, the reason for
designing an expensive chip, rather than a network of simple chips, is to
speed up the execution of individual programs. For servers, multiple
cheaper chips could be much more effective. Second, I do not believe that
everything we do must be transparent to the buffoon-programmer who cannot
be expected to understand anything beyond the highest-level, template-type
programming language.
This does not suggest that research should not be undertaken to figure out
how to get people to think via a "parallel programming model," or that
multiprocessing is not important. Cache coherency, memory
consistency, contention issues, etc. are all relevant avenues for useful
research. The fact is that current programmers do not naturally think in a
"parallel programming model," which means a lot of applications are
single-threaded. If their performance is important, then we need to provide
single-thread performance. It is true that more and more applications
naturally support lots of threads. Thus, the ability to handle lots of
parallel threads is also important. However, we need to consider Amdahl's
Law -- getting the whole job done fast also requires the ability to handle
the serial part. Ergo, Pentium_X/Niagara_Y. (A back-of-the-envelope
illustration of this trade-off appears after the list below.) To do this:
1. We treat what I have called the Levels of Transformation (from natural
language problem statement to logic circuits) as one integrated whole.
a. That is, we add large functional units to the microarchitecture that
remain powered off when not in use, but are powered up via compiled code
when necessary to carry out a needed piece of work, specified by the
algorithm. (My "Refrigerator" analogy.) ...and we let the algorithm writer
and the compiler writer know that it is available.
b. We add many very light-weight processor engines for processing the
embarrassingly parallel part of an algorithm. (Niagara X.)
c. We add some (very few, perhaps only one) very heavy-weight processors
with serious hybrid branch prediction, out-of-order execution, etc. to
handle the serial part of an algorithm. (Pentium Y.)
2. We provide appropriate interconnect to allow the Pentium Y core to talk
to the Niagara processors, without throwing away a lot of performance
waiting for information to go from one part of the chip to another.
3. We deal with the off-chip memory bandwidth demands by asking how to
reduce this bandwidth. Re code: denser encoding of the I-stream. Re
data: representing values with the minimum number of bits required. Re
on-chip storage: we store what we need. In all three cases, can we take
advantage of the enormous increase in logic capability to save off-chip
bandwidth?
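
As a back-of-the-envelope illustration of the Amdahl's Law trade-off behind
this Pentium_Y/Niagara_X split (the serial fraction, the core counts, and the
2x big-core speed below are assumptions chosen for illustration, not figures
from the statement above):

    /* Amdahl's Law sketch for a hypothetical hybrid chip: one heavy-weight core
     * for the serial part plus many light-weight cores for the parallel part.
     * All parameters are assumed values chosen only to illustrate the argument. */
    #include <stdio.h>

    /* Runtime normalized to 1.0 on a single baseline (light-weight) core. */
    static double runtime(double serial_frac, double big_core_speed, int small_cores) {
        return serial_frac / big_core_speed          /* serial part on the fast core    */
             + (1.0 - serial_frac) / small_cores;    /* parallel part spread over cores */
    }

    int main(void) {
        double serial_frac = 0.10;  /* assume 10% of the work is inherently serial */

        double t_sea_of_simple = runtime(serial_frac, 1.0, 32); /* 32 simple cores only   */
        double t_hybrid        = runtime(serial_frac, 2.0, 16); /* 1 big (2x) + 16 simple */

        printf("32 simple cores:         %.1fx speedup\n", 1.0 / t_sea_of_simple);
        printf("1 big core + 16 simple:  %.1fx speedup\n", 1.0 / t_hybrid);
        return 0;
    }

With these made-up numbers the hybrid wins (about 9.4x versus 7.8x) even
though it has half as many light-weight cores, because the serial 10%
dominates once the parallel part has been spread out; that is the argument
for pairing a few heavy-weight cores with many light-weight ones.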
-
Mark Hill -
For decades technologists have provided architects with more transistors
that we have used to make faster processors via bit-level parallelism,
instruction-level parallelism, and memory hierarchies. It now appears
that not even Yale Patt can figure out how to use even more transistors
to speed up processors in cost- and power-effective ways. A critical
barrier is that it appears hard to do useful work on behalf of a single
thread for the hundreds of instruction opportunities it now takes to
access main memory. Thus, it is time to turn to the easier task of teaching
Yale multi-threaded programming.
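
The "hundreds of instruction opportunities" figure follows from simple
arithmetic; the clock rate, issue width, and DRAM latency below are assumed,
roughly representative values rather than numbers from the statement above:

    /* Rough arithmetic behind the "hundreds of instruction opportunities" remark.
     * All three inputs are assumptions chosen only for illustration. */
    #include <stdio.h>

    int main(void) {
        double clock_ghz   = 3.0;    /* assumed core clock, GHz             */
        double issue_width = 4.0;    /* assumed peak instructions per cycle */
        double dram_ns     = 100.0;  /* assumed main-memory latency, ns     */

        double stall_cycles = dram_ns * clock_ghz;        /* cycles spent waiting on one miss */
        double lost_slots   = stall_cycles * issue_width; /* issue slots that go unused       */

        printf("One access to memory: ~%.0f cycles, ~%.0f instruction slots\n",
               stall_cycles, lost_slots);
        return 0;
    }

Under these assumptions a single trip to main memory covers roughly 300
cycles, on the order of a thousand issue slots, which is why keeping one
thread busy across a miss is so hard and why independent threads look
attractive.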