Oliver Pell (Imperial College London), Lee W. Howes (Imperial College London), Kubilay Atasu (Imperial College London), Olav Beckmann (Imperial College London), Oskar Mencer (Imperial College London)
Keywords: FPGA
Abstract:
Field Programmable Gate Arrays (FPGAs) are semiconductor devices that contain a grid of programmable cells, which the user configures to implement any digital circuit of up to a few million gates. Modern FPGAs allow the user to reconfigure these circuits many times each second, making FPGAs fully programmable and general purpose. Recent FPGA technology provides sufficient resources to tackle scientific applications on large-scale parallel systems.
As a case study, we implement the Fast Fourier Transform [1] with flexible floating-point arithmetic. We use A Stream Compiler (ASC) [2], which combines C++ syntax with flexible floating-point support by providing a HWfloat data type. The resulting FFT can be targeted to a variety of FPGA platforms in the style of FFTW, though not yet completely automatically, and the circuit can be adapted to the particular resources available on the system. The optimal implementation of an FFT accelerator depends on the length and dimensionality of the FFT, the available FPGA area, the available hard DSP blocks, the FPGA board architecture, and the precision and range required by the application [3]. Software-style object-oriented abstractions allow us to accelerate development by maximizing reuse of design patterns. ASC allows a few core hardware descriptions to generate hundreds of different circuit variants to meet particular speed, area and precision goals.
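The precision side of this trade-off can be illustrated in software. The following is a minimal sketch, not ASC code and not the HWfloat API: `quantize` is a hypothetical helper that rounds a value to a chosen number of significand bits, which mimics the effect of picking a narrower mantissa. It does not model a restricted exponent range.

```python
import math

def quantize(x: float, mantissa_bits: int) -> float:
    """Round x to `mantissa_bits` bits of significand (hypothetical helper).

    Only the mantissa width is modelled; the exponent range is left unbounded.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    scale = 1 << mantissa_bits
    return math.ldexp(round(m * scale) / scale, e)

# Example: the same value stored with two different mantissa widths.
narrow = quantize(0.1, 8)
wide = quantize(0.1, 24)
print(abs(narrow - 0.1), abs(wide - 0.1))
```

The relative rounding error of this helper is bounded by 2^-mantissa_bits, so each bit removed from the mantissa roughly doubles the worst-case error while reducing the hardware needed per operator.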
The key to achieving maximum acceleration of FFT computation is to match memory and compute bandwidths so that maximum use is made of computational resources. Modern FPGAs contain up to hundreds of independent SRAM banks to store intermediate results, providing ample scope for optimizing memory parallelism.
At 175 MHz, one of Maxeler's radix-4 FFT cores computes 4x as many 1024-point FFTs per second as a dual Pentium-IV Xeon machine running FFTW. Eight such cores operating in parallel fit onto the largest FPGA in the Xilinx Virtex-4 family, providing a 32x speed-up over performing the calculation in software.
Our work at Imperial combines the Maxeler cores with a high-performance FPGA computing platform from HP to demonstrate the potential of FPGAs for scientific computing applications. Performance depends on the communication bandwidth to the FPGAs, and Figure 1 shows that PCI Express x4 is well matched with current FPGAs. Further work focuses on investigating larger radices and optimizations for longer transform lengths. FPGAs with a single core are competitive for FFTs of more than 16 points because they exploit greater parallelism and incur less memory hierarchy overhead than CPUs.
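For readers unfamiliar with the radix-4 decomposition, the following software sketch shows the arithmetic such a core performs. It is a generic recursive decimation-in-time formulation written for clarity, not the Maxeler hardware design, and `fft_radix4` is an illustrative name. A 1024-point transform is five radix-4 stages (1024 = 4^5) of 256 butterflies each.

```python
import cmath

def fft_radix4(x):
    """Recursive radix-4 decimation-in-time FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 4 != 0:
        raise ValueError("length must be a power of 4")
    q = n // 4
    # Four interleaved sub-transforms of length n/4.
    subs = [fft_radix4(x[r::4]) for r in range(4)]
    out = [0j] * n
    for k in range(q):
        # Multiply the r-th sub-transform by the twiddle factor W_n^(r*k).
        t = [subs[r][k] * cmath.exp(-2j * cmath.pi * r * k / n) for r in range(4)]
        # Radix-4 butterfly: a 4-point DFT whose only factors are 1, -1, j and -j.
        out[k] = t[0] + t[1] + t[2] + t[3]
        out[k + q] = t[0] - 1j * t[1] - t[2] + 1j * t[3]
        out[k + 2 * q] = t[0] - t[1] + t[2] - t[3]
        out[k + 3 * q] = t[0] + 1j * t[1] - t[2] - 1j * t[3]
    return out

# Example: transform a 16-point ramp.
spectrum = fft_radix4([complex(i, 0) for i in range(16)])
print(abs(spectrum[0]))
```

The butterfly itself needs no general multipliers, only additions and swaps of real and imaginary parts, so the multiplier cost is concentrated in the twiddle-factor products.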
References:
[1] M. Frigo and S. G. Johnson. The Design and Implementation of FFTW3. Proc. IEEE, 93(2), 2005.
[2] O. Mencer. ASC, A Stream Compiler for Computing with FPGAs. IEEE Trans. CAD, 2006.
[3] K. S. Hemmert and K. Underwood. An Analysis of the Double-Precision Floating-Point FFT on FPGAs. Proc. FCCM05, IEEE Computer Society Press, 2005.
Date of Conference: September 10-14, 2006
Track: Poster