FASTCUDA

Open Source FPGA Accelerator & Hardware Software Codesign Toolset for CUDA Kernels

Funding: European Commission
Project Code: FASTCUDA
Programme: SEVENTH FRAMEWORK PROGRAMME (FP7-SME)
Budget: €1,603,596.00 (overall)
Start Date: 1 November 2011
Duration: 24 months
Website(s): www.fastcuda.eu – CORDIS

Information

Short Description

Scientific applications in areas such as graphics, biological modeling and molecular dynamics are usually highly parallel and can benefit from specialized hardware that accelerates their execution. For this reason, highly parallel Graphics Processing Units (GPUs) have traditionally been favored over general-purpose processors for running such applications. FPGAs can potentially provide even higher speedups than GPUs at lower power consumption. However, their use is still limited because porting an application to an FPGA's custom hardware is often prohibitively cumbersome. FASTCUDA eases this path by providing a novel methodology, architecture and toolset that automatically port and run already-parallelized algorithms on reconfigurable hardware. To this end, the FASTCUDA methodology uses CUDA, a GPU programming model that exposes parallelism at the source-code level.

With minimal user intervention, the FASTCUDA toolset splits an application's code into two parts: one that is compiled and executed as parallel software on an embedded multi-core processor, and another consisting of multiple special-purpose accelerators that are synthesized and implemented in hardware. A latest-generation low-power FPGA provides the processing power and the logic capacity to implement and execute both parts.

In particular, FASTCUDA is a design methodology and accompanying toolset that allows CUDA programs to be executed efficiently on a shared-memory, multi-core CPU communicating with an FPGA-based accelerator. A multi-core processor, consisting of multiple embedded cores (small configurable processors), runs the host program serially and the SW CUDA kernels in parallel. Threads belonging to the same CUDA thread-block are executed by the same core. The HW CUDA kernels are partitioned into thread-blocks, then synthesized and implemented inside an “Accelerator” block. Each thread-block has a local private memory, while the global shared memory can be accessed by any thread, following the philosophy of the CUDA model.
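The software side of this execution model can be illustrated with a minimal sketch. It is not FASTCUDA code: the function names (`vec_add_kernel`, `run_block`, `launch`), the dictionary used as global memory and the thread pool standing in for the embedded cores are all assumptions made for illustration. Each thread-block is handed to one worker, which runs the block's threads one after another with a block-private local memory, while every thread can reach the shared global memory.

```python
from concurrent.futures import ThreadPoolExecutor


def vec_add_kernel(block_idx, thread_idx, block_dim, global_mem, local_mem):
    """Body of one CUDA-style thread: c[i] = a[i] + b[i] (hypothetical kernel)."""
    i = block_idx * block_dim + thread_idx
    if i < len(global_mem["c"]):  # guard threads past the end of the data
        local_mem[thread_idx] = global_mem["a"][i] + global_mem["b"][i]
        global_mem["c"][i] = local_mem[thread_idx]


def run_block(kernel, block_idx, block_dim, global_mem):
    """Run all threads of one thread-block serially, as a single core would."""
    local_mem = [0] * block_dim  # private to this thread-block
    for thread_idx in range(block_dim):
        kernel(block_idx, thread_idx, block_dim, global_mem, local_mem)


def launch(kernel, grid_dim, block_dim, global_mem, num_cores=4):
    """Distribute thread-blocks over workers that stand in for the cores."""
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        futures = [
            pool.submit(run_block, kernel, b, block_dim, global_mem)
            for b in range(grid_dim)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker


n = 10
mem = {"a": list(range(n)), "b": [2 * x for x in range(n)], "c": [0] * n}
launch(vec_add_kernel, grid_dim=3, block_dim=4, global_mem=mem)
print(mem["c"])
```

Running the threads of a block in a plain loop is only valid for kernels without barrier synchronization inside the block; a kernel that synchronizes its threads would need the loop to be split at each barrier.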

For our prototype, we used a Xilinx Virtex-6 FPGA with 500 MB of external DDR memory on a Xilinx ML605 evaluation board, with the multi-core processor built from an array of Xilinx MicroBlaze CPUs. Real products designed with FASTCUDA may, however, use faster embedded processors such as the ARM Cortex-A9 MPCore.

Project Objectives

In recent years, an observable trend in High Performance Computing (HPC) architectures has been the inclusion of accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to improve the performance of scientific applications. Applications in graphics, biological modeling, molecular dynamics, physics and other fields have been successfully ported to GPUs, using highly parallel hardware to accelerate their execution. Porting to GPUs, hard as it may be, requires only software skills to code the specific algorithm as parallel multi-threaded software. The path to FPGA development, on the other hand, is notoriously more difficult: porting an algorithm to custom hardware is less straightforward, and the simulation-verification-debugging cycle can be orders of magnitude longer. For this reason, even though FPGAs' custom hardware can potentially provide higher speedups at lower power consumption than GPUs, GPU-based solutions dominate the scientific world.

FASTCUDA aims to bridge this gap by reusing the software parallelization effort that has gone into porting scientific applications to GPUs in order to implement FPGA-based systems. FASTCUDA focuses on CUDA, a GPU architecture and programming model initially developed by Nvidia for its line of GPUs, and provides a novel methodology, architecture and toolset to automatically port and run CUDA programs on FPGA hardware.

Execution starts with the CUDA host program running single-threaded on the host CPU. Whenever a CUDA kernel is invoked, the host CPU dispatches its execution to an accelerator (a separate device) that supports parallel execution of multiple threads. Traditionally, such accelerators are Nvidia GPUs or other multi-core platforms. However, we show that even higher acceleration, as well as lower power and energy consumption, can be obtained if a computationally intensive CUDA kernel is synthesized into hardware and mapped onto an FPGA for execution. FASTCUDA therefore employs a hybrid approach: it uses an FPGA-based accelerator to execute the time-critical CUDA kernels and a multi-core processor to execute the CUDA kernels that could not fit in the FPGA fabric.
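The hybrid dispatch described above can be pictured with a small sketch. Everything in it is hypothetical: `make_dispatcher`, the kernel names and the two lookup tables are illustrative stand-ins, and both targets are simulated by ordinary Python functions rather than by real hardware or a real FASTCUDA API.

```python
def make_dispatcher(hw_kernels, sw_kernels):
    """Build a host-side dispatch function from two kernel tables.

    hw_kernels: kernels assumed to be implemented in the FPGA accelerator.
    sw_kernels: kernels assumed to run on the multi-core processor.
    """
    def dispatch(name, *args):
        if name in hw_kernels:
            return "fpga", hw_kernels[name](*args)
        if name in sw_kernels:
            return "multicore", sw_kernels[name](*args)
        raise KeyError(f"unknown kernel: {name}")

    return dispatch


def saxpy(alpha, x, y):
    """Stand-in for a time-critical kernel mapped to hardware."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def reduce_sum(x):
    """Stand-in for a kernel left in software."""
    return sum(x)


dispatch = make_dispatcher(
    hw_kernels={"saxpy": saxpy},
    sw_kernels={"reduce_sum": reduce_sum},
)

target, result = dispatch("saxpy", 2, [1, 2, 3], [10, 20, 30])
print(target, result)
print(dispatch("reduce_sum", [1, 2, 3]))
```

The host program only names the kernel it wants to run; which table a kernel lands in is decided beforehand by the hardware/software partitioning step.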

FASTCUDA is a design methodology and accompanying toolset that allows CUDA programs to be executed efficiently on a shared-memory, multi-core CPU communicating with an FPGA-based accelerator. A modern FPGA provides all the required resources: multiple embedded micro-CPUs for the CUDA host program and for the CUDA kernels that will be executed on the multi-core processor, as well as large logic capacity for the CUDA kernels that will be accelerated in hardware. Rather than developing everything from scratch, FASTCUDA has joined numerous ongoing efforts in industry and academia to create a unified, efficient open-source framework.

The objectives of FASTCUDA were twofold:

To create an innovative embedded system design flow by designing highly efficient components and by taking advantage of numerous ongoing open-source efforts in the codesign of embedded systems, at both the academic and the industrial level.
To enable an easier transition from research results to industrial exploitation, i.e. standardization of codesign usage.

FASTCUDA has successfully defined the new design flow and has made the related toolset available to the open-source community. The objectives have been achieved by defining, implementing and disseminating a publicly available platform that takes as input a description of the system in the CUDA programming model and produces an efficient FPGA-based embedded design. This design executes certain CUDA kernels in software and implements the rest in hardware, according to a hardware/software partitioning algorithm developed during the project.

To fulfill these objectives, we have built the FASTCUDA platform, which comprises the following sub-systems:

A novel reconfigurable computing (RC) architecture composed of a multi-processor system, shared memory and reconfigurable fabric, which runs the multi-threaded CUDA applications.
An advanced high-level synthesis tool that efficiently maps the coarse-grained and fine-grained parallelism exposed in CUDA kernels onto the reconfigurable fabric.
A compiler framework that ports the CUDA programming model to the FASTCUDA multi-processor environment.
A design space exploration strategy based on profiling, user-driven block partitioning, and analysis, by simulation, compilation and high-level synthesis, of the quality of each point in the design space.
A central on-chip processor that coordinates the execution of the CUDA kernels and executes the main code (referred to as host code in the CUDA programming model) of the CUDA application.

The FASTCUDA platform is operated through a graphical user interface (GUI), which keeps it relatively easy to use. Because the tool targets designers who program at a high level and need to shorten their design time, a user-friendly environment is of major importance and can play an important role in the tool's wide adoption by the embedded design community.

Results

FASTCUDA's main target was to derive a high-level synthesis toolset for efficiently running a CUDA application on an FPGA-based hybrid platform consisting of a multi-core processor and an FPGA accelerator. Several tools were developed during the project. The main results (foreground) are briefly described below:

High Level Synthesis tool: A complete software package that takes as input a CUDA kernel, which describes a part of the application, and provides as output synthesizable multi-threaded SystemC code and RTL code implementing exactly the same functionality as the input.
CUDA to multi-threaded C Compiler: A complete software package that (a) takes as input a CUDA kernel, which describes a part of the application, and (b) provides as output CPU-based code performing exactly the same function.
Multi-core processor: A hardware package that provides a multi-core CPU platform customized for the execution of CUDA kernels.
Estimation tools: Software packages that, given a CUDA description of an application, provide performance estimates.
Exploration tool: A complete software package that takes as input a description of an application in CUDA (including the parts that will be implemented in hardware and those that will be implemented in software) as well as the characteristics of the FPGA-based platform, and gives the designer the necessary performance and power estimates for various hardware-software partitioning alternatives, so that he or she can choose the optimal underlying architecture.
SW-HW bridge and system API: A hardware package that provides the SW-HW bridge between the multi-core processor and the FPGA accelerator, together with a software package that includes the SW-HW communication API library.
Since no available Xilinx IP core could provide cache coherency for the FASTCUDA multi-core processor, FASTCUDA built its own HW blocks for this purpose.
Numerous CUDA applications have been developed, addressing application domains ranging from security to bioinformatics.
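The kind of trade-off the exploration tool presents to the designer can be sketched as follows. This is not the FASTCUDA partitioning algorithm, which is not described here; it is a brute-force illustration under stated assumptions. The kernel names, the per-kernel estimates, the additive time and power models, `BASE_POWER` and `AREA_BUDGET` are all invented for the example. The sketch enumerates every hardware/software partitioning that fits an assumed FPGA area budget and keeps only the alternatives that are not beaten on both execution time and power.

```python
from itertools import product

# Hypothetical per-kernel estimates (arbitrary units), not measured data.
KERNELS = {
    "k1": {"sw_time": 10.0, "hw_time": 2.0, "area": 40, "hw_power": 1.5},
    "k2": {"sw_time": 6.0, "hw_time": 3.0, "area": 50, "hw_power": 1.0},
    "k3": {"sw_time": 4.0, "hw_time": 1.0, "area": 30, "hw_power": 0.5},
}
BASE_POWER = 2.0   # assumed power of the multi-core processor alone
AREA_BUDGET = 80   # assumed logic capacity available for accelerators


def evaluate(hw_set):
    """Estimate (time, power, area) when the kernels in hw_set go to hardware."""
    time = sum(
        k["hw_time"] if name in hw_set else k["sw_time"]
        for name, k in KERNELS.items()
    )
    power = BASE_POWER + sum(KERNELS[name]["hw_power"] for name in hw_set)
    area = sum(KERNELS[name]["area"] for name in hw_set)
    return time, power, area


def explore():
    """Enumerate every partitioning that fits within the area budget."""
    names = sorted(KERNELS)
    points = []
    for mask in product([False, True], repeat=len(names)):
        hw_set = frozenset(n for n, in_hw in zip(names, mask) if in_hw)
        time, power, area = evaluate(hw_set)
        if area <= AREA_BUDGET:
            points.append((hw_set, time, power))
    return points


def pareto_front(points):
    """Keep the points that no other point beats on both time and power."""
    front = []
    for p in points:
        dominated = any(
            q[1] <= p[1] and q[2] <= p[2] and (q[1] < p[1] or q[2] < p[2])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p[1])


front = pareto_front(explore())
for hw_set, time, power in front:
    print(sorted(hw_set), time, power)
```

Exhaustive enumeration grows as 2^n in the number of kernels, so it is only practical for a handful of kernels; it is used here because it makes the shape of the design space easy to see.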

