
Parallel Programming in Practice - Applications of the TBB Framework, Part 4: Supra's Support for CUDA

I. CUDA and OpenCL

From the older versions of TBB to the new oneAPI framework, parallel computing has been supported as broadly as possible, and that support includes CUDA and OpenCL. Most developers have probably heard of these two frameworks but have rarely used them seriously; it is in AI applications, where image processing is mostly done on the GPU, that they see the widest use.
This article will not explain either of them in detail; interested readers can consult the relevant documentation. In short, both are frameworks for parallel computing on heterogeneous platforms: CUDA is tied to a specific vendor (NVIDIA), while OpenCL exists more like an open standard and can, in theory, target all kinds of platforms.

II. Application in TBB

oneAPI, of course, supports both of them as well; after all, it would be a real pity if a parallel framework could not work with other parallel frameworks. It is all about performance: if stacking them can produce a 1+1>2 effect, so much the better. Supra shows how this looks in practice; let's look at the relevant code:
1. Using CUDA

#include "ImageProcessingCuda.h"

#include <thrust/transform.h>
#include <thrust/execution_policy.h>

using namespace std;

namespace supra
{
    namespace ImageProcessingCudaInternal
    {
        typedef ImageProcessingCuda::WorkType WorkType;

        // here the actual processing happens!

        template <typename InputType, typename OutputType>
        __global__ void processKernel(const InputType* inputImage, vec3s size, WorkType factor, OutputType* outputImage)
        {
            size_t x = blockDim.x*blockIdx.x + threadIdx.x;
            size_t y = blockDim.y*blockIdx.y + threadIdx.y;
            size_t z = blockDim.z*blockIdx.z + threadIdx.z;

            size_t width = size.x;
            size_t height = size.y;
            size_t depth = size.z;

            if (x < width && y < height && z < depth)
            {
                // Perform a pixel-wise operation on the image

                // Get the input pixel value and cast it to our working type.
                // As this should in general be a type with wider range / precision, this cast does not lose anything.
                WorkType inPixel = inputImage[x + y*width + z*width*height];

                // Perform the operation, in this case a multiplication
                WorkType value = inPixel * factor;

                // Store the output pixel value.
                // Because this is templated, we need to cast from "WorkType" to "OutputType".
                // This should happen in a sane way, that is with clamping. There is a helper for that!
                outputImage[x + y*width + z*width*height] = clampCast<OutputType>(value);
            }
        }
    }

    template <typename InputType, typename OutputType>
    shared_ptr<Container<OutputType> > ImageProcessingCuda::process(const shared_ptr<const Container<InputType>>& imageData, vec3s size, WorkType factor)
    {
        // here we prepare the buffers and call the cuda kernel

        size_t width = size.x;
        size_t height = size.y;
        size_t depth = size.z;

        // make sure the data is in gpu memory
        auto inImageData = imageData;
        if (!inImageData->isGPU() && !inImageData->isBoth())
        {
            inImageData = make_shared<Container<InputType> >(LocationGpu, *inImageData);
        }

        // prepare the output memory
        auto outImageData = make_shared<Container<OutputType> >(LocationGpu, inImageData->getStream(), width*height*depth);

        // call the kernel for the heavy-lifting
        dim3 blockSize(32, 4, 1);
        dim3 gridSize(
            static_cast<unsigned int>((size.x + blockSize.x - 1) / blockSize.x),
            static_cast<unsigned int>((size.y + blockSize.y - 1) / blockSize.y),
            static_cast<unsigned int>((size.z + blockSize.z - 1) / blockSize.z));
        ImageProcessingCudaInternal::processKernel<<<gridSize, blockSize, 0, inImageData->getStream()>>>(
            inImageData->get(),
            size,
            factor,
            outImageData->get());
        // check for cuda launch errors
        cudaSafeCall(cudaPeekAtLastError());
        // You should NOT synchronize the device or the stream we are working on!!

        // return the result!
        return outImageData;
    }

    ...
}

The main purpose of this code is to process the image on the GPU: the input is moved to GPU memory if necessary, an output buffer is allocated there, and the kernel does the per-pixel work asynchronously on the container's stream.
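To make this launch recipe easier to experiment with outside of Supra, here is a minimal, self-contained sketch of the same pattern: ceiling-division grid sizing, an asynchronous launch on a stream, and no synchronization inside the processing routine itself. It is only an illustration, not Supra code; it drops the Supra-specific pieces (Container, vec3s, clampCast, cudaSafeCall) and hard-codes float as both the input and the working type.

#include <cuda_runtime.h>
#include <cstdio>

// Minimal element-wise kernel: scale every voxel of a width x height x depth volume.
__global__ void scaleKernel(const float* in, float factor, float* out,
                            size_t width, size_t height, size_t depth)
{
    size_t x = blockDim.x * blockIdx.x + threadIdx.x;
    size_t y = blockDim.y * blockIdx.y + threadIdx.y;
    size_t z = blockDim.z * blockIdx.z + threadIdx.z;
    if (x < width && y < height && z < depth)
    {
        size_t idx = x + y * width + z * width * height;
        out[idx] = in[idx] * factor;
    }
}

// Launch helper following the same recipe as ImageProcessingCuda::process:
// pick a block size, round the grid size up, launch asynchronously on the given stream.
void scaleOnStream(const float* dIn, float* dOut,
                   size_t width, size_t height, size_t depth,
                   float factor, cudaStream_t stream)
{
    dim3 blockSize(32, 4, 1);
    dim3 gridSize(
        static_cast<unsigned int>((width  + blockSize.x - 1) / blockSize.x),
        static_cast<unsigned int>((height + blockSize.y - 1) / blockSize.y),
        static_cast<unsigned int>((depth  + blockSize.z - 1) / blockSize.z));
    scaleKernel<<<gridSize, blockSize, 0, stream>>>(dIn, factor, dOut, width, height, depth);

    // Only check that the launch itself succeeded; do not synchronize here,
    // later work submitted to the same stream stays correctly ordered anyway.
    cudaError_t err = cudaPeekAtLastError();
    if (err != cudaSuccess)
        std::printf("kernel launch failed: %s\n", cudaGetErrorString(err));
}

int main()
{
    const size_t w = 128, h = 128, d = 8, n = w * h * d;
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    scaleOnStream(dIn, dOut, w, h, d, 2.0f, stream);

    // The caller decides when to synchronize, e.g. before copying results back to the host.
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}

In Supra the stream comes from the input Container and the result is returned without synchronizing, so downstream processing nodes can keep queueing work on the same stream; the sketch above leaves that decision to the caller in main.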
2. Using OpenCL
The code is as follows:


template <typename InputType, typename OutputType>
shared_ptr<Container<OutputType>> ScanConverter::convert(const shared_ptr<USImage> &inImage) {
...
  // allocate the output image in GPU memory, on the same stream/queue as the scanline data
  auto p = make_shared<Container<OutputType>>(LocationGpu, pScanlineData->getStream(),
                                                  m_imageSize.x * m_imageSize.y * m_imageSize.z);

  if (m_is2D) {
    // 2D case: fixed block size, grid rounded up so every output pixel is covered
    sycl::range<3> blockSize(1, 256, 1);
    sycl::range<3> gridSize(1, static_cast<unsigned int>((m_imageSize.y + blockSize[1] - 1) / blockSize[1]),
                            static_cast<unsigned int>((m_imageSize.x + blockSize[2] - 1) / blockSize[2]));

    static long scan_call_count = 0;

    // submit the kernel to the SYCL queue associated with the scanline data
    sycl::event scan_event = pScanlineData->getStream()->submit([&](sycl::handler &c) {
      auto m_imageSize_x_2 = (uint32_t)m_imageSize.x;
      auto m_imageSize_y_3 = (uint32_t)m_imageSize.y;
...

      // nd_range: global size = gridSize * blockSize, work-group (local) size = blockSize
      c.parallel_for(sycl::nd_range<3>(gridSize * blockSize, blockSize), [=](sycl::nd_item<3> item_1) {
        scanConvert2D(numScanlines, numSamples, m_imageSize_x_2, m_imageSize_y_3, m_mask_get_4,
                      m_sampleIdx_get_5, m_weightX_get_6, m_weightY_get_7, pScanlineData_get_8, p_get_9,
                      item_1);
      });
    });

...

  } else {
    // 3D case: same pattern, with the grid also covering the z dimension of the volume
    sycl::range<3> blockSize(1, 256, 1);
    sycl::range<3> gridSize(static_cast<unsigned int>((m_imageSize.z + blockSize[0] - 1) / blockSize[0]),
                            static_cast<unsigned int>((m_imageSize.y + blockSize[1] - 1) / blockSize[1]),
                            static_cast<unsigned int>((m_imageSize.x + blockSize[2] - 1) / blockSize[2]));

    pScanlineData->getStream()->submit([&](sycl::handler &c) {
      auto m_imageSize_x_3 = (uint32_t)m_imageSize.x;
      auto m_imageSize_y_4 = (uint32_t)m_imageSize.y;
      auto m_imageSize_z_5 = (uint32_t)m_imageSize.z;
...

      c.parallel_for(sycl::nd_range<3>(gridSize * blockSize, blockSize), [=](sycl::nd_item<3> item_1) {
        scanConvert3D((uint32_t)scanlineLayout.x, (uint32_t)scanlineLayout.y, numSamples, m_imageSize_x_3,
                      m_imageSize_y_4, m_imageSize_z_5, m_mask_get_6, m_sampleIdx_get_7, m_weightX_get_8,
                      m_weightY_get_9, m_weightZ_get_10, pScanlineData_get_11, p_get_12, item_1);
      });
    });
  }
  return p;
}

Note, however, that this OpenCL path actually goes through Intel's own library (Data Parallel C++, or DPC++, the SYCL implementation in oneAPI), so take a careful look at it before relying on it.
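For comparison, here is a minimal, self-contained sketch of the same launch pattern written directly against SYCL/DPC++, again only an illustration rather than Supra code: the global range is gridSize * blockSize, the local range is blockSize, and the kernel does its own bounds check just like the CUDA version. Plain USM device allocations stand in for Supra's Container<T>.

#include <sycl/sycl.hpp>
#include <cstdint>

int main()
{
    const uint32_t width = 512, height = 512;
    sycl::queue q{ sycl::default_selector_v };

    // USM device allocations stand in for Supra's Container<T> in GPU memory.
    float* in  = sycl::malloc_device<float>(width * height, q);
    float* out = sycl::malloc_device<float>(width * height, q);
    q.fill(in, 1.0f, width * height).wait();

    // Same launch geometry style as in ScanConverter::convert (2D case):
    // a fixed block size and a grid rounded up to cover the whole image.
    sycl::range<3> blockSize(1, 256, 1);
    sycl::range<3> gridSize(1,
                            (height + blockSize[1] - 1) / blockSize[1],
                            (width  + blockSize[2] - 1) / blockSize[2]);

    q.submit([&](sycl::handler& c) {
        c.parallel_for(sycl::nd_range<3>(gridSize * blockSize, blockSize),
                       [=](sycl::nd_item<3> item) {
            // Global ids play the role of blockIdx * blockDim + threadIdx in CUDA.
            uint32_t y = (uint32_t)item.get_global_id(1);
            uint32_t x = (uint32_t)item.get_global_id(2);
            if (x < width && y < height)
                out[x + y * width] = in[x + y * width] * 2.0f;
        });
    }).wait();

    sycl::free(in, q);
    sycl::free(out, q);
    return 0;
}

With DPC++ this typically builds with icpx -fsycl and runs on whichever backends are installed (Intel GPU, CPU, or other vendors' GPUs via plugins), which is exactly the portability argument made for OpenCL-style standards above.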

III. Summary

Looking at foreign frameworks and libraries over the years, one rough impression is that they constantly reference and build on one another rather than reinventing the wheel. This is especially obvious in Java, where pulling in one dependency brings along a whole family of libraries and frameworks, or a framework bundles a set of other libraries that in turn depend on still more; the same holds in C++. Domestic projects are comparatively closed: most work is done in-house, referencing only internal libraries or a handful of very well-known foreign ones.
The point is that the domestic software development environment still lags noticeably behind and needs further improvement. Of course, that situation was not created overnight, nor by any particular person or group; changing it will take everyone's effort.


Original article: https://blog.csdn.net/fpcc/article/details/143658246
