Open Source FPGA Accelerator & Hardware Software Codesign Toolset for CUDA Kernels
- Funded by: European Commission
- Project Acronym: FASTCUDA
- Funded under: SEVENTH FRAMEWORK PROGRAMME (FP7-SME)
- Budget: Overall 1.603.596,00 €
- Start Date: 1st November 2011
- Duration: 24 months
- Website(s): www.fastcuda.eu – CORDIS
Scientific applications such as graphics, biological modeling, molecular dynamics and others are usually highly parallel and can benefit from specialized hardware that accelerates their execution. For this reason, highly parallel Graphics Processing Units (GPUs) have traditionally been favored over general-purpose processors for running such applications. FPGAs can potentially provide even higher speedups than GPUs, at lower power consumption. However, their use is still limited, since porting an application onto an FPGA’s custom hardware is often prohibitively cumbersome. FASTCUDA eases this path by providing a novel methodology, architecture and toolset to automatically port and run already-parallelized algorithms on reconfigurable hardware. To this end, the FASTCUDA methodology uses CUDA, a GPU programming language that exposes parallelism at the source-code level.
The FASTCUDA toolset splits an application’s code, with minimal user intervention, into two parts: one that is compiled and executed as parallel software on an embedded multi-core processor, and another consisting of multiple special-purpose accelerators that are synthesized and implemented in hardware. A latest-generation, low-power FPGA provides the processing power and the logic capacity to implement and execute both parts.
In particular, FASTCUDA is a design methodology and accompanying toolset that allows CUDA programs to be executed efficiently on a shared-memory, multi-core CPU communicating with an FPGA-based accelerator. A multi-core processor, consisting of multiple embedded cores (small configurable processors), runs the host program serially and the SW CUDA kernels in parallel. Threads belonging to the same CUDA thread-block are executed by the same core. The HW CUDA kernels are partitioned into thread-blocks, which are synthesized and implemented inside an “Accelerator” block. Following the philosophy of the CUDA model, each thread-block has its own local private memory, while the global shared memory can be accessed by any thread.
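The software side of this execution model can be illustrated with a minimal Python sketch. It is not FASTCUDA code: the names `vec_add_kernel`, `run_block` and `launch` are hypothetical, worker threads stand in for the embedded cores, and handing blocks to whichever worker is free is an assumption, since the text does not say how thread-blocks are assigned to cores. What it does mirror is the rule that all threads of one thread-block run on the same core, with a private local memory per block and a global memory visible to every thread. Running the threads of a block one after another is only valid for kernels without intra-block barriers, which is the case for the vector addition used here.

```python
from concurrent.futures import ThreadPoolExecutor


def vec_add_kernel(block_idx, thread_idx, block_dim, local_mem, a, b, out):
    """Body of one CUDA-style thread: out[i] = a[i] + b[i]."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # guard threads past the end
        out[i] = a[i] + b[i]


def run_block(kernel, block_idx, block_dim, args):
    """Run every thread of one thread-block serially, as a single core would."""
    local_mem = {}  # private to this thread-block (unused by vec_add_kernel)
    for thread_idx in range(block_dim):
        kernel(block_idx, thread_idx, block_dim, local_mem, *args)


def launch(kernel, grid_dim, block_dim, num_cores, *args):
    """Distribute grid_dim thread-blocks over num_cores workers ("cores")."""
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        futures = [
            pool.submit(run_block, kernel, block_idx, block_dim, args)
            for block_idx in range(grid_dim)
        ]
        for future in futures:
            future.result()  # wait for the block; re-raise any kernel error


# The lists a, b and out play the role of global shared memory.
n = 10
a = list(range(n))
b = [10 * x for x in a]
out = [0] * n
launch(vec_add_kernel, 3, 4, 2, a, b, out)  # 3 blocks of 4 threads, 2 "cores"
```

Three blocks of four threads cover the ten elements, and the bounds check in the kernel discards the two surplus threads, just as a CUDA kernel would when the grid is larger than the data.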
For our prototype, we used a Xilinx Virtex-6 FPGA with 500 MB of external DDR memory on a Xilinx ML605 evaluation board, with a multi-core processor consisting of an array of Xilinx MicroBlaze CPUs. Real products designed with FASTCUDA may, however, use faster embedded processors such as the ARM Cortex-A9 MPCore.