
Getting Started with Software Development for Xilinx Alveo FPGA Accelerator Cards (by 小强不吃菜)


Background

As heterogeneous computing gains traction, FPGA accelerator cards are being used in more and more domains.

FPGA accelerator cards and GPU accelerator cards speed up workloads on completely different principles.

A GPU essentially relies on a massive number of parallel execution units to raise overall throughput and saturate memory bandwidth.

An FPGA is a general-purpose gate array: dedicated compute units are built around the data structures and the characteristics of the computation, delivering high throughput at lower power and lower latency.

The previous article covered environment setup; this one focuses on the project structure and how the pieces work together.

Overall Architecture

When accelerating with a GPU, the CPU only has to send data and instructions to the GPU and does not need to worry about how the execution cores are designed; with an FPGA, the compute kernel on the chip has to be developed. The host connects to the FPGA accelerator card over PCIe, so there is corresponding code on both the Host side and the Device side.

Traditional FPGA development uses an HDL, and the functionality inside the FPGA cannot be modified dynamically. The whole point of FPGA acceleration is to put the compute kernel close to the data, so whenever the data structure or the processing flow changes, the kernel structure has to change with it. Powering down a server every time the kernel is replaced is clearly impractical, so the common industry practice is to partition the chip into a static region that cannot be modified and a dynamic region that can. The static region holds the basic infrastructure blocks such as the DMA, PCIe, and DDR controllers, while the user's compute kernel is deployed into the dynamic region and connected over AXI interfaces.

Xilinx provides the Host-to-Device data movement; what we need to write is the Host program and the Device Kernel.

Project construction breaks down into three parts: the Host program, the FPGA Kernel, and the Link that ties the Kernel into the FPGA. Because the Link exists, part of the differences between FPGA chip variants disappears, which lowers the difficulty of Kernel development and decouples it from the hardware to some degree.

Creating the First Project

The first demo project recommended officially is vector addition, which helps beginners quickly grasp the project structure and how it runs.

Creating a project first requires a platform file; I am using an Alveo U50 here. The Alveo U200 is another commonly used card.

Give the project any name you like.

Import the example projects provided by Xilinx; beginners are encouraged to start here, since the examples cover HLS, OpenCL usage, and task-level parallelism.

Once the project is created, you can see that it contains several sub-projects; the most important are the Host and Kernel projects.

Code Overview

The Host program mainly does the following:

Load the xclbin file: get the Device and, once the xclbin is loaded, obtain the required Kernel.

Allocate buffers for the Kernel, launch it asynchronously, and wait for the results to come back.

Finally, do the same computation on the CPU and compare the accelerator's results with the CPU's.

/**
* Copyright (C) 2019-2021 Xilinx, Inc
*
* Licensed under the Apache License, Version 2.0 (the "License"). You may
* not use this file except in compliance with the License. A copy of the
* License is located at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/

#include "xcl2.hpp"
#include <algorithm>
#include <vector>
#define DATA_SIZE 4096

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
        return EXIT_FAILURE;
    }

    std::string binaryFile = argv[1];
    size_t vector_size_bytes = sizeof(int) * DATA_SIZE;
    cl_int err;
    cl::Context context;
    cl::Kernel krnl_vector_add;
    cl::CommandQueue q;
    // Allocate Memory in Host Memory
    // When creating a buffer with a user pointer (CL_MEM_USE_HOST_PTR), under
    // the hood the user pointer is used if it is properly aligned. When it is
    // not aligned, the runtime has no choice but to create its own host-side
    // buffer. So it is recommended to use this allocator if the user wishes to
    // create buffers with CL_MEM_USE_HOST_PTR: it aligns the user buffer to a
    // page boundary and ensures the user buffer is actually used when the
    // Buffer/Mem object is created with CL_MEM_USE_HOST_PTR.
    std::vector<int, aligned_allocator<int> > source_in1(DATA_SIZE);
    std::vector<int, aligned_allocator<int> > source_in2(DATA_SIZE);
    std::vector<int, aligned_allocator<int> > source_hw_results(DATA_SIZE);
    std::vector<int, aligned_allocator<int> > source_sw_results(DATA_SIZE);

    // Create the test data
    std::generate(source_in1.begin(), source_in1.end(), std::rand);
    std::generate(source_in2.begin(), source_in2.end(), std::rand);
    for (int i = 0; i < DATA_SIZE; i++) {
        source_sw_results[i] = source_in1[i] + source_in2[i];
        source_hw_results[i] = 0;
    }

    // OPENCL HOST CODE AREA START
    // get_xil_devices() is a utility API which will find the Xilinx
    // platforms and return the list of devices connected to the Xilinx platform.
    auto devices = xcl::get_xil_devices();
    // read_binary_file() is a utility API which will load the binaryFile
    // and return a pointer to the file buffer.
    auto fileBuf = xcl::read_binary_file(binaryFile);
    cl::Program::Binaries bins{{fileBuf.data(), fileBuf.size()}};
    bool valid_device = false;
    for (unsigned int i = 0; i < devices.size(); i++) {
        auto device = devices[i];
        // Creating Context and Command Queue for selected Device
        OCL_CHECK(err, context = cl::Context(device, nullptr, nullptr, nullptr, &err));
        OCL_CHECK(err, q = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err));
        std::cout << "Trying to program device[" << i << "]: " << device.getInfo<CL_DEVICE_NAME>() << std::endl;
        cl::Program program(context, {device}, bins, nullptr, &err);
        if (err != CL_SUCCESS) {
            std::cout << "Failed to program device[" << i << "] with xclbin file!\n";
        } else {
            std::cout << "Device[" << i << "]: program successful!\n";
            OCL_CHECK(err, krnl_vector_add = cl::Kernel(program, "vadd", &err));
            valid_device = true;
            break; // we break because we found a valid device
        }
    }
    if (!valid_device) {
        std::cout << "Failed to program any device found, exit!\n";
        exit(EXIT_FAILURE);
    }

    // Allocate Buffer in Global Memory
    // Buffers are allocated using CL_MEM_USE_HOST_PTR for efficient memory and
    // Device-to-host communication
    OCL_CHECK(err, cl::Buffer buffer_in1(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_size_bytes,
                                         source_in1.data(), &err));
    OCL_CHECK(err, cl::Buffer buffer_in2(context, CL_MEM_USE_HOST_PTR | CL_MEM_READ_ONLY, vector_size_bytes,
                                         source_in2.data(), &err));
    OCL_CHECK(err, cl::Buffer buffer_output(context, CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY, vector_size_bytes,
                                            source_hw_results.data(), &err));

    int size = DATA_SIZE;
    OCL_CHECK(err, err = krnl_vector_add.setArg(0, buffer_in1));
    OCL_CHECK(err, err = krnl_vector_add.setArg(1, buffer_in2));
    OCL_CHECK(err, err = krnl_vector_add.setArg(2, buffer_output));
    OCL_CHECK(err, err = krnl_vector_add.setArg(3, size));

    // Copy input data to device global memory
    OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_in1, buffer_in2}, 0 /* 0 means from host*/));

    // Launch the Kernel
    // For HLS kernels the global and local size is always (1,1,1), so it is
    // recommended to always use enqueueTask() for invoking an HLS kernel.
    OCL_CHECK(err, err = q.enqueueTask(krnl_vector_add));

    // Copy Result from Device Global Memory to Host Local Memory
    OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_output}, CL_MIGRATE_MEM_OBJECT_HOST));
    q.finish();
    // OPENCL HOST CODE AREA END

    // Compare the results of the Device to the simulation
    bool match = true;
    for (int i = 0; i < DATA_SIZE; i++) {
        if (source_hw_results[i] != source_sw_results[i]) {
            std::cout << "Error: Result mismatch" << std::endl;
            std::cout << "i = " << i << " CPU result = " << source_sw_results[i]
                      << " Device result = " << source_hw_results[i] << std::endl;
            match = false;
            break;
        }
    }

    std::cout << "TEST " << (match ? "PASSED" : "FAILED") << std::endl;
    return (match ? EXIT_SUCCESS : EXIT_FAILURE);
}

The Kernel code is shown below. The computation itself is just addition; what mainly takes getting used to is the HLS coding style:

/**
* Copyright (C) 2019-2021 Xilinx, Inc
*
* Licensed under the Apache License, Version 2.0 (the "License"). You may
* not use this file except in compliance with the License. A copy of the
* License is located at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/

/*******************************************************************************
Description:

    This example uses the load/compute/store coding style which is generally
    the most efficient for implementing kernels using HLS. The load and store
    functions are responsible for moving data in and out of the kernel as
    efficiently as possible. The core functionality is decomposed across one
    or more compute functions. Whenever possible, the compute function should
    pass data through HLS streams and should contain a single set of nested loops.

    HLS stream objects are used to pass data between producer and consumer
    functions. Stream read and write operations have a blocking behavior which
    allows consumers and producers to synchronize with each other automatically.

    The dataflow pragma instructs the compiler to enable task-level pipelining.
    This is required for the load/compute/store functions to execute in a parallel
    and pipelined manner.

    The kernel operates on vectors of NUM_WORDS integers modeled using the hls::vector
    data type. This datatype provides intuitive support for parallelism and
    fits well the vector-add computation. The vector length is set to NUM_WORDS
    since NUM_WORDS integers amount to a total of 64 bytes, which is the maximum size of
    a kernel port. It is a good practice to match the compute bandwidth to the I/O
    bandwidth. Here the kernel loads, computes and stores NUM_WORDS integer values per
    clock cycle and is implemented as below:
                                       _____________
                                      |             |<----- Input Vector 1 from Global Memory
                                      |  load_input |       __
                                      |_____________|----->|  |
                                       _____________       |  | in1_stream
Input Vector 2 from Global Memory --->|             |      |__|
                               __     |  load_input |        |
                              |  |<---|_____________|        |
                   in2_stream |  |     _____________         |
                              |__|--->|             |<--------
                                      | compute_add |      __
                                      |_____________|---->|  |
                                       ______________     |  | out_stream
                                      |              |<---|__|
                                      | store_result |
                                      |______________|-----> Output result to Global Memory

*******************************************************************************/

// Includes
#include <hls_vector.h>
#include <hls_stream.h>
#include "assert.h"

#define MEMORY_DWIDTH 512
#define SIZEOF_WORD 4
#define NUM_WORDS ((MEMORY_DWIDTH) / (8 * SIZEOF_WORD))

#define DATA_SIZE 4096

// TRIPCOUNT identifier
const int c_size = DATA_SIZE;

static void load_input(hls::vector<uint32_t, NUM_WORDS>* in,
                       hls::stream<hls::vector<uint32_t, NUM_WORDS> >& inStream,
                       int vSize) {
mem_rd:
    for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        inStream << in[i];
    }
}

static void compute_add(hls::stream<hls::vector<uint32_t, NUM_WORDS> >& in1_stream,
                        hls::stream<hls::vector<uint32_t, NUM_WORDS> >& in2_stream,
                        hls::stream<hls::vector<uint32_t, NUM_WORDS> >& out_stream,
                        int vSize) {
// The kernel is operating with vector of NUM_WORDS integers. The + operator performs
// an element-wise add, resulting in NUM_WORDS parallel additions.
execute:
    for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        out_stream << (in1_stream.read() + in2_stream.read());
    }
}

static void store_result(hls::vector<uint32_t, NUM_WORDS>* out,
                         hls::stream<hls::vector<uint32_t, NUM_WORDS> >& out_stream,
                         int vSize) {
mem_wr:
    for (int i = 0; i < vSize; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        out[i] = out_stream.read();
    }
}

extern "C" {

/*
    Vector Addition Kernel

    Arguments:
        in1  (input)  --> Input vector 1
        in2  (input)  --> Input vector 2
        out  (output) --> Output vector
        size (input)  --> Number of elements in vector
*/

void vadd(hls::vector<uint32_t, NUM_WORDS>* in1,
          hls::vector<uint32_t, NUM_WORDS>* in2,
          hls::vector<uint32_t, NUM_WORDS>* out,
          int size) {
#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
#pragma HLS INTERFACE m_axi port = out bundle = gmem0

    static hls::stream<hls::vector<uint32_t, NUM_WORDS> > in1_stream("input_stream_1");
    static hls::stream<hls::vector<uint32_t, NUM_WORDS> > in2_stream("input_stream_2");
    static hls::stream<hls::vector<uint32_t, NUM_WORDS> > out_stream("output_stream");

    // Since NUM_WORDS values are processed
    // in parallel per loop iteration, the for loop only needs to iterate 'size / NUM_WORDS' times.
    assert(size % NUM_WORDS == 0);
    int vSize = size / NUM_WORDS;
#pragma HLS dataflow

    load_input(in1, in1_stream, vSize);
    load_input(in2, in2_stream, vSize);
    compute_add(in1_stream, in2_stream, out_stream, vSize);
    store_result(out, out_stream, vSize);
}
}

The test was run with the HW build, and the results match the CPU reference.

A packaged copy of the demo project is available for download.

Performance Tuning

Unlike ordinary CPU code, the whole reason for using an accelerator card is to speed up an application, so performance is what matters most. The Vitis IDE integrates a powerful profiling tool, Vitis Analyzer, which connects every module of the underlying Kernel all the way up to every API call in the top-level C++ code. It lets you analyze the performance bottlenecks and time-consuming operations of the entire program from a global view and pinpoint problems quickly.

After the run completes, an XRT run summary file is generated in the corresponding directory; it can be opened with Vitis Analyzer.

For the runtime performance of each module inside the Kernel, the analysis files are located under the HW build project; there you can see how each module performs, and you can open Vitis HLS from there for deeper analysis.
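As a complement to Vitis Analyzer, the kernel execution time can also be read directly on the host side through OpenCL profiling events, since the command queue in this demo is already created with CL_QUEUE_PROFILING_ENABLE. The snippet below is only an illustrative sketch, not part of the original example; it assumes the q, krnl_vector_add, and err objects from the host code above, and the variable names are my own.

    // Illustrative sketch: time the vadd kernel with an OpenCL profiling event.
    // Assumes q, krnl_vector_add, and err from the host program above; the queue
    // must have been created with CL_QUEUE_PROFILING_ENABLE (it is in this demo).
    cl::Event run_event;
    OCL_CHECK(err, err = q.enqueueTask(krnl_vector_add, nullptr, &run_event));
    OCL_CHECK(err, err = run_event.wait());

    // Device-side timestamps are reported in nanoseconds.
    cl_ulong t_start = run_event.getProfilingInfo<CL_PROFILING_COMMAND_START>(&err);
    cl_ulong t_end = run_event.getProfilingInfo<CL_PROFILING_COMMAND_END>(&err);
    std::cout << "Kernel execution time: " << (t_end - t_start) / 1000.0 << " us" << std::endl;

Comparing this number with the wall-clock time spent around the enqueueMigrateMemObjects calls shows how the total runtime splits between data movement and computation.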

A Quick Analysis

Taking this vector-add project as an example, here is a quick analysis that tries to clarify where accelerator cards are actually useful.

Before applying heterogeneous computing, or an accelerator card in particular, first confirm what kind of computation you are dealing with. The operation performed in this demo, for example, is a simple A + B.

The CPU implementation used in the demo is shown below; looping over all 4096 elements takes only about 21 µs.

    // Compute A + B on the CPU
    for (int i = 0; i < DATA_SIZE; i++) {
        source_sw_results[i] = source_in1[i] + source_in2[i];
    }
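For reference, a figure in the ~21 µs range can be measured by wrapping the CPU loop with std::chrono. This is a sketch of one way to take that measurement, not necessarily how the original number was obtained; the exact result will vary with the CPU and compiler flags.

    // Illustrative only: time the CPU reference loop with std::chrono
    // (requires #include <chrono> in the host source).
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < DATA_SIZE; i++) {
        source_sw_results[i] = source_in1[i] + source_in2[i];
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    auto cpu_us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::cout << "CPU loop time: " << cpu_us << " us" << std::endl;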

Vitis Analyzer shows that the same computation on the FPGA takes as long as 88 ms, and once the PCIe transfer time between Host and Device is included, the accelerator card's performance looks downright terrible.

So what exactly is being accelerated?

CPUs clock at up to several GHz, GPUs at only around 1 GHz, and FPGAs often at just a few hundred MHz.

For an operation like A + B, even a ten-year-old CPU can easily crush today's most advanced accelerator cards.

Heterogeneous computing only pays off for data structures or computations that are a poor fit for the CPU; those are the workloads worth offloading to an accelerator card.

A GPU raises overall data throughput with a massive number of parallel compute units that exploit high-bandwidth memory; in other words, it improves performance through huge memory bandwidth and raw on-chip compute. That is why workloads such as DL/ML training and database queries can be accelerated with GPUs.

An FPGA implements the computation as a hardware circuit built from its general-purpose gate array, so its strength is adapting to the data structure and the characteristics of the computation. Because the computation runs in hardware, the processing time for a single piece of data is shorter and more deterministic; FPGA accelerator cards also provide more data interfaces and sit closer to the data source, which makes them better suited to all kinds of low-level data processing, for example trading systems in finance or storage-compute separation in data centers.