FPGAs versus GPUs for Datacenters

Moderator: Babak Falsafi (EPFL)
Panelists: Bill Dally (Stanford/NVIDIA), Desh Singh (Altera)

Moderator's Introduction

The slowdown in Dennard scaling is giving
rise to the use of accelerators in datacenters. The most promising
accelerator platforms are GPUs and FPGAs. GPUs rely on the data-parallel
execution model, offer higher-level programming abstractions and a rich set
of libraries, and are best suited for dense floating-point arithmetic.
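To make the data-parallel model concrete, here is a minimal OpenCL C sketch
(the kernel name and arguments are illustrative, not from the panel): every
work-item executes the same body on a different element, and the hardware
supplies the parallelism by running many work-items at once.

    /* Data-parallel SAXPY: one work-item computes one element. */
    __kernel void saxpy(const float a,
                        __global const float* x,
                        __global float* y)
    {
        size_t i = get_global_id(0);   /* this work-item's index */
        y[i] = a * x[i] + y[i];
    }

The programmer writes scalar-looking code, and the runtime maps it across
thousands of parallel lanes.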
FPGAs enable arbitrary forms of parallelism and generalized acceleration, including
fixed-point arithmetic, but lack programmability, computational density, and
well-defined memory abstractions.
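As a hint of what fixed-point support buys, the C fragment below multiplies
two Q16.16 fixed-point values (the format and helper name are assumptions
chosen for illustration); such arithmetic maps directly onto an FPGA's LUTs
and DSP blocks, avoiding the cost of a full floating-point unit.

    #include <stdint.h>

    /* Q16.16 fixed point: 16 integer bits, 16 fractional bits. */
    typedef int32_t q16_16;

    static inline q16_16 q_mul(q16_16 a, q16_16 b)
    {
        /* Widen, multiply, then drop the extra 16 fractional bits. */
        return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
    }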
The debate between GPUs and FPGAs thus concerns, on the one hand, how
broadly the data-parallel execution model and dense floating-point
arithmetic apply to emerging datacenter workloads and, on the other, the
programmability, computational density/efficiency, and basic utility of
FPGAs. Most of the debate will center on these issues. I also have a few
questions
about:

- What should we do about data services/data management operators that are
neither data parallel nor a fit for the spatial computing model, given that
current CPUs and servers are primarily designed for desktop workloads?

- What are the best ways to interface GPUs/FPGAs as accelerators with the
rest of the system, both from a programming abstraction perspective and at
the hardware/system level (e.g., cache-coherent shared memory, message
passing; both styles are sketched below)?
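To ground the second question, here is a minimal OpenCL host-side sketch of
the two interfacing styles, assuming an OpenCL platform is installed; names
are illustrative, and error checking is omitted for brevity.

    #include <CL/cl.h>

    int main(void)
    {
        /* Boilerplate: first platform, first device. */
        cl_platform_id plat; cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        static float data[1024];
        size_t bytes = sizeof data;

        /* Style 1, message passing: a device-resident buffer filled by
           an explicit, blocking copy from host memory. */
        cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes,
                                        NULL, NULL);
        clEnqueueWriteBuffer(q, dev_buf, CL_TRUE, 0, bytes, data,
                             0, NULL, NULL);

        /* Style 2, shared memory: the device works directly out of the
           host allocation; no copy appears in the program. */
        cl_mem shared_buf = clCreateBuffer(ctx, CL_MEM_USE_HOST_PTR,
                                           bytes, data, NULL);

        clReleaseMemObject(dev_buf);
        clReleaseMemObject(shared_buf);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }

Whether style 2 actually avoids the copy depends on the device; a discrete
accelerator may still stage the data, which is exactly the hardware/system-
level question posed above.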
Bill Dally's position statement

For processing tasks in the data center,
one should use the right tool for the job.
If the job involves intensive integer or floating-point arithmetic,
high bandwidth to bulk memory, and close interaction with a CPU, the right
tool is a GPU. If the job involves
non-arithmetic logic (e.g., bioinformatics, coding, etc.), moderate memory
bandwidth, and loose CPU interaction, the right tool is an ASIC. If the right tool is an ASIC but you don't
have the volume to justify it, use an FPGA, but realize that a LUT costs
20-100x the power and area of the equivalent function in an ASIC. GPUs are the tool of choice for
arithmetic-intensive applications because they achieve industry-leading
perf/W on numeric benchmarks and are easy to program. Upcoming GPUs with HBM have memory
bandwidth that cannot be approached by FPGAs with discrete memory. With NVLINK, GPUs can transparently share
memory with CPUs, further simplifying the programming task.
Desh Singh's position statement

I believe that the most efficient way to
implement an algorithm is to design a custom hardware circuit that is absolutely
dedicated to implementing only that algorithm. While this is impractical
given the generality and rapid evolution of workloads, the FPGA offers us the
next best alternative. Using FPGA reconfigurable hardware, we can implement
any circuit that we like by taking advantage of millions of programmable
resources arranged as a blank slate. In the past, FPGA design has posed a
high barrier to entry due to hardware-centric tool flows. However, recent
advances in compiler technology are now opening the door to software
programmers using FPGAs with high-level languages (C, C++, OpenCL, Java,
etc.).
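As a sketch of what such high-level FPGA programming can look like, below
is a single-work-item OpenCL kernel in the style that FPGA compilers
pipeline into custom hardware; the 8-tap FIR filter, its names, and the
unroll pragmas are illustrative assumptions, not an example given by the
panelist.

    /* Single-work-item kernel: the compiler pipelines the outer loop,
       turning the tap array into a hardware shift register. */
    __kernel void fir8(__global const float* restrict in,
                       __global float* restrict out,
                       __constant float* restrict coeff,
                       const int n)
    {
        float taps[8] = {0.0f};
        for (int i = 0; i < n; i++) {
            #pragma unroll
            for (int t = 7; t > 0; t--)    /* shift in the newest sample */
                taps[t] = taps[t - 1];
            taps[0] = in[i];

            float acc = 0.0f;
            #pragma unroll
            for (int t = 0; t < 8; t++)    /* parallel multiply-adds */
                acc += taps[t] * coeff[t];
            out[i] = acc;
        }
    }

A data-parallel GPU would spread the elements across work-items; here the
compiler instead extracts parallelism spatially, unrolling the inner loops
into parallel multipliers.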
The convergence of tools and new specializations in FPGA architecture will
make the FPGA a formidable application accelerator in the data center.