[C#] Bgr24彩色位图转为Gray8灰度位图的跨平台SIMD硬件加速向量算法

🕗 发布于 2024-11-20 08:48 算法 c# .net 图像处理 SIMD

将彩色位图转为灰度位图，是图像处理的常用算法。本文将介绍 Bgr24彩色位图转为Gray8灰度位图的算法，除了会给出标量算法外，还会给出向量算法。且这些算法是跨平台的，同一份源代码，能在 X86及Arm架构上运行，且均享有SIMD硬件加速。

一、标量算法

1.1 算法实现

对于彩色转灰度，由于人眼对红绿蓝三种颜色的敏感程度不同，在灰度转换时，每个颜色分配的权重也是不同的。有一个很著名的心理学公式：

Gray = R*0.299 + G*0.587 + B*0.114

该公式含有浮点数，而浮点数运算一般比较慢。
于是在具体实现时，需要做一定优化。可以将小数转为定点整数，这样便能将除法转为移位。整数计算比浮点型快，移位运算和加减法比乘除法快，于是取得了比较好的效果。
但是这种方法也会带来一定的精度损失，我们可以根据实际情况选择定点整数的精度位数。
这里我们使用16位精度，源代码如下。

public static unsafe void ScalarDo(BitmapData src, BitmapData dst) {
    const int cbPixel = 3; // Bgr24
    const int shiftPoint = 16;
    const int mulPoint = 1 << shiftPoint; // 0x10000
    const int mulRed = (int)(0.299 * mulPoint + 0.5); // 19595
    const int mulGreen = (int)(0.587 * mulPoint + 0.5); // 38470
    const int mulBlue = mulPoint - mulRed - mulGreen; // 7471
    int width = src.Width;
    int height = src.Height;
    int strideSrc = src.Stride;
    int strideDst = dst.Stride;
    byte* pRow = (byte*)src.Scan0.ToPointer();
    byte* qRow = (byte*)dst.Scan0.ToPointer();
    for (int i = 0; i < height; i++) {
        byte* p = pRow;
        byte* q = qRow;
        for (int j = 0; j < width; j++) {
            *q = (byte)((p[2] * mulRed + p[1] * mulGreen + p[0] * mulBlue) >> shiftPoint);
            p += cbPixel; // Bgr24
            q += 1; // Gray8
        }
        pRow += strideSrc;
        qRow += strideDst;
    }
}

1.2 基准测试代码

使用 BenchmarkDotNet 进行基准测试。
可以实现分配好数据。源代码如下。

private static readonly Random _random = new Random(1);
private BitmapData _sourceBitmapData = null;
private BitmapData _destinationBitmapData = null;
private BitmapData _expectedBitmapData = null;

[Params(1024, 2048, 4096)]
public int Width { get; set; }
public int Height { get; set; }

~Bgr24ToGrayBgr24Benchmark() {
    Dispose(false);
}

public void Dispose() {
    Dispose(true);
    GC.SuppressFinalize(this);
}

private void Dispose(bool disposing) {
    if (_disposed) return;
    _disposed = true;
    if (disposing) {
        Cleanup();
    }
}

private BitmapData AllocBitmapData(int width, int height, PixelFormat format) {
    const int strideAlign = 4;
    if (width <= 0) throw new ArgumentOutOfRangeException($"The width({width}) need > 0!");
    if (height <= 0) throw new ArgumentOutOfRangeException($"The width({height}) need > 0!");
    int stride = 0;
    switch (format) {
        case PixelFormat.Format8bppIndexed:
            stride = width * 1;
            break;
        case PixelFormat.Format24bppRgb:
            stride = width * 3;
            break;
    }
    if (stride <= 0) throw new ArgumentOutOfRangeException($"Invalid pixel format({format})!");
    if (0 != (stride % strideAlign)) {
        stride = stride - (stride % strideAlign) + strideAlign;
    }
    BitmapData bitmapData = new BitmapData();
    bitmapData.Width = width;
    bitmapData.Height = height;
    bitmapData.PixelFormat = format;
    bitmapData.Stride = stride;
    bitmapData.Scan0 = Marshal.AllocHGlobal(stride * height);
    return bitmapData;
}

private void FreeBitmapData(BitmapData bitmapData) {
    if (null == bitmapData) return;
    if (IntPtr.Zero == bitmapData.Scan0) return;
    Marshal.FreeHGlobal(bitmapData.Scan0);
    bitmapData.Scan0 = IntPtr.Zero;
}

[GlobalCleanup]
public void Cleanup() {
    FreeBitmapData(_sourceBitmapData); _sourceBitmapData = null;
    FreeBitmapData(_destinationBitmapData); _destinationBitmapData = null;
    FreeBitmapData(_expectedBitmapData); _expectedBitmapData = null;
}

[GlobalSetup]
public void Setup() {
    Height = Width;
    // Create.
    Cleanup();
    _sourceBitmapData = AllocBitmapData(Width, Height, PixelFormat.Format24bppRgb);
    _destinationBitmapData = AllocBitmapData(Width, Height, PixelFormat.Format8bppIndexed);
    _expectedBitmapData = AllocBitmapData(Width, Height, PixelFormat.Format8bppIndexed);
    RandomFillBitmapData(_sourceBitmapData, _random);
}

使用这些已分配好的数据，能很容易写出ScalarDo 的基准测试代码。

[Benchmark(Baseline = true)]
public void Scalar() {
    ScalarDo(_sourceBitmapData, _destinationBitmapData);
}

二、向量算法

2.1 算法思路

对于24位转8位灰度，可以使用这种办法: 每次从源位图读取3个向量，进行3-元素组的解交织运算，得到 R,G,B 平面数据。随后使用向量化的乘法与加法，来计算灰度值。结果是存储了灰度值的1个向量，可以将它存储到目标位图。
例如 Sse指令集使用的是128位向量，此时1个向量为16字节。每次从源位图读取3个向量，就是读取了48字节，即16个RGB像素。最后将1个向量存储到目标位图，就是写入了16字节，即16个灰度像素。

对于3-元素组的解交织，可以使用 shuffle 类别的指令来实现。例如对于X86架构的 128位向量，可以使用 SSSE3 的 _mm_shuffle_epi8 指令，它对应 NET 中的 Ssse3.Shuffle 方法。源代码如下。

static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part0 = Vector128.Create((sbyte)0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_X_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 4, 7, 10, 13).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part0 = Vector128.Create((sbyte)1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Y_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 2, 5, 8, 11, 14).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part0 = Vector128.Create((sbyte)2, 5, 8, 11, 14, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part1 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, 1, 4, 7, 10, 13, -1, -1, -1, -1, -1, -1).AsByte();
static readonly Vector128<byte> YGroup3Unzip_Shuffle_Byte_Z_Part2 = Vector128.Create((sbyte)-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 3, 6, 9, 12, 15).AsByte();

public static Vector128<byte> YGroup3Unzip(Vector128<byte> data0, Vector128<byte> data1, Vector128<byte> data2, out Vector128<byte> y, out Vector128<byte> z) {
    var f0A = YGroup3Unzip_Shuffle_Byte_X_Part0;
    var f0B = YGroup3Unzip_Shuffle_Byte_X_Part1;
    var f0C = YGroup3Unzip_Shuffle_Byte_X_Part2;
    var f1A = YGroup3Unzip_Shuffle_Byte_Y_Part0;
    var f1B = YGroup3Unzip_Shuffle_Byte_Y_Part1;
    var f1C = YGroup3Unzip_Shuffle_Byte_Y_Part2;
    var f2A = YGroup3Unzip_Shuffle_Byte_Z_Part0;
    var f2B = YGroup3Unzip_Shuffle_Byte_Z_Part1;
    var f2C = YGroup3Unzip_Shuffle_Byte_Z_Part2;
    var rt0 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f0A), Ssse3.Shuffle(data1, f0B)), Ssse3.Shuffle(data2, f0C));
    var rt1 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f1A), Ssse3.Shuffle(data1, f1B)), Ssse3.Shuffle(data2, f1C));
    var rt2 = Sse2.Or(Sse2.Or(Ssse3.Shuffle(data0, f2A), Ssse3.Shuffle(data1, f2B)), Ssse3.Shuffle(data2, f2C));
    y = rt1;
    z = rt2;
    return rt0;
}

为了使跨平台编写向量算法更方便，我开发了 VectorTraits 库，已经集成了上述算法。该库提供了“Vectors.YGroup3Unzip”方法。该方法能够跨平台，它会使用各个平台的shuffle指令。

X86 256-bit: Use _mm256_shuffle_epi8 and other instructions.
Arm: Use vqvtbl1q_u8 instructions.
Wasm: Use i8x16.swizzle instructions.

2.2 算法实现

有了“Vectors.YGroup3Unzip”方法后，便能方便的编写24位转8位灰度的算法了。灰度系数有8位精度，于是需要将 8位数据变宽为16位后，再来计算乘法与加法。最后再将 16位数据，变窄为8位。源代码如下。

public static unsafe void UseVectorsDoBatch(byte* pSrc, int strideSrc, int width, int height, byte* pDst, int strideDst) {
    const int cbPixel = 3; // Bgr24
    const int shiftPoint = 8;
    const int mulPoint = 1 << shiftPoint; // 0x100
    const ushort mulRed = (ushort)(0.299 * mulPoint + 0.5); // 77
    const ushort mulGreen = (ushort)(0.587 * mulPoint + 0.5); // 150
    const ushort mulBlue = mulPoint - mulRed - mulGreen; // 29
    Vector<ushort> vmulRed = new Vector<ushort>(mulRed);
    Vector<ushort> vmulGreen = new Vector<ushort>(mulGreen);
    Vector<ushort> vmulBlue = new Vector<ushort>(mulBlue);
    int vectorWidth = Vector<byte>.Count;
    int maxX = width - vectorWidth;
    byte* pRow = pSrc;
    byte* qRow = pDst;
    for (int i = 0; i < height; i++) {
        Vector<byte>* pLast = (Vector<byte>*)(pRow + maxX * cbPixel);
        Vector<byte>* qLast = (Vector<byte>*)(qRow + maxX * 1);
        Vector<byte>* p = (Vector<byte>*)pRow;
        Vector<byte>* q = (Vector<byte>*)qRow;
        for (; ; ) {
            Vector<byte> r, g, b, gray;
            Vector<ushort> wr0, wr1, wg0, wg1, wb0, wb1;
            // Load.
            b = Vectors.YGroup3Unzip(p[0], p[1], p[2], out g, out r);
            // widen(r) * mulRed + widen(g) * mulGreen + widen(b) * mulBlue
            Vector.Widen(r, out wr0, out wr1);
            Vector.Widen(g, out wg0, out wg1);
            Vector.Widen(b, out wb0, out wb1);
            wr0 = Vectors.Multiply(wr0, vmulRed);
            wr1 = Vectors.Multiply(wr1, vmulRed);
            wg0 = Vectors.Multiply(wg0, vmulGreen);
            wg1 = Vectors.Multiply(wg1, vmulGreen);
            wb0 = Vectors.Multiply(wb0, vmulBlue);
            wb1 = Vectors.Multiply(wb1, vmulBlue);
            wr0 = Vector.Add(wr0, wg0);
            wr1 = Vector.Add(wr1, wg1);
            wr0 = Vector.Add(wr0, wb0);
            wr1 = Vector.Add(wr1, wb1);
            // Shift right and narrow.
            wr0 = Vectors.ShiftRightLogical_Const(wr0, shiftPoint);
            wr1 = Vectors.ShiftRightLogical_Const(wr1, shiftPoint);
            gray = Vector.Narrow(wr0, wr1);
            // Store.
            *q = gray;
            // Next.
            if (p >= pLast) break;
            p += cbPixel;
            ++q;
            if (p > pLast) p = pLast; // The last block is also use vector.
            if (q > qLast) q = qLast;
        }
        pRow += strideSrc;
        qRow += strideDst;
    }
}

上面源代码中的“Vectors.ShiftRightLogical_Const”，是VectorTraits 库提供的方法。它能替代 NET 7.0 新增的 Vector.ShiftRightLogical 方法，使早期版本的 NET 也能使用逻辑右移位。
“Vectors.Multiply”也是VectorTraits 库提供的方法。它能避免无符号类型有时没有硬件加速的问题。

2.3 基准测试代码

[Benchmark]
public void UseVectors() {
    UseVectorsDo(_sourceBitmapData, _destinationBitmapData, false);
}

[Benchmark]
public void UseVectorsParallel() {
    UseVectorsDo(_sourceBitmapData, _destinationBitmapData, true);
}

public static unsafe void UseVectorsDo(BitmapData src, BitmapData dst, bool useParallel = false) {
    int vectorWidth = Vector<byte>.Count;
    int width = src.Width;
    int height = src.Height;
    if (width <= vectorWidth) {
        ScalarDo(src, dst);
        return;
    }
    int strideSrc = src.Stride;
    int strideDst = dst.Stride;
    byte* pSrc = (byte*)src.Scan0.ToPointer();
    byte* pDst = (byte*)dst.Scan0.ToPointer();
    int processorCount = Environment.ProcessorCount;
    int batchSize = height / (processorCount * 2);
    bool allowParallel = useParallel && (batchSize > 0) && (processorCount > 1);
    if (allowParallel) {
        int batchCount = (height + batchSize - 1) / batchSize; // ceil((double)length / batchSize)
        Parallel.For(0, batchCount, i => {
            int start = batchSize * i;
            int len = batchSize;
            if (start + len > height) len = height - start;
            byte* pSrc2 = pSrc + start * strideSrc;
            byte* pDst2 = pDst + start * strideDst;
            UseVectorsDoBatch(pSrc2, strideSrc, width, len, pDst2, strideDst);
        });
    } else {
        UseVectorsDoBatch(pSrc, strideSrc, width, height, pDst, strideDst);
    }
}

完整源码在 Bgr24ToGray8Benchmark.cs

随后为该算法编写基准测试代码。

三、基准测试结果

3.1 X86 架构

X86架构下的基准测试结果如下。

BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4460/23H2/2023Update/SunValley3)
AMD Ryzen 7 7840H w/ Radeon 780M Graphics, 1 CPU, 16 logical and 8 physical cores
.NET SDK 8.0.403
  [Host]     : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  DefaultJob : .NET 8.0.10 (8.0.1024.46610), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI


| Method               | Width | Mean         | Error      | StdDev     | Ratio | RatioSD | Code Size |
|--------------------- |------ |-------------:|-----------:|-----------:|------:|--------:|----------:|
| Scalar               | 1024  |  1,028.55 us |  12.545 us |  11.735 us |  1.00 |    0.02 |     152 B |
| UseVectors           | 1024  |     94.06 us |   0.606 us |   0.537 us |  0.09 |    0.00 |        NA |
| UseVectorsParallel   | 1024  |     24.98 us |   0.390 us |   0.365 us |  0.02 |    0.00 |        NA |
| PeterParallelScalar  | 1024  |    216.47 us |   1.719 us |   1.524 us |  0.21 |    0.00 |        NA |
|                      |       |              |            |            |       |         |           |
| Scalar               | 2048  |  4,092.26 us |  21.098 us |  18.703 us |  1.00 |    0.01 |     152 B |
| UseVectors           | 2048  |    507.70 us |   9.626 us |  11.459 us |  0.12 |    0.00 |        NA |
| UseVectorsParallel   | 2048  |    118.98 us |   1.025 us |   0.959 us |  0.03 |    0.00 |        NA |
| PeterParallelScalar  | 2048  |    803.30 us |   9.226 us |   8.630 us |  0.20 |    0.00 |        NA |
|                      |       |              |            |            |       |         |           |
| Scalar               | 4096  | 16,391.12 us | 121.643 us | 113.785 us |  1.00 |    0.01 |     152 B |
| UseVectors           | 4096  |  2,472.16 us |  32.452 us |  30.356 us |  0.15 |    0.00 |        NA |
| UseVectorsParallel   | 4096  |  2,034.85 us |  33.074 us |  30.937 us |  0.12 |    0.00 |        NA |
| PeterParallelScalar  | 4096  |  3,139.85 us |  32.657 us |  27.270 us |  0.19 |    0.00 |        NA |

Scalar: 标量算法。
UseVectors: 矢量算法。
UseVectorsParallel: 并发的矢量算法。

3.2 Arm 架构

同样的源代码可以在 Arm 架构上运行。基准测试结果如下。

BenchmarkDotNet v0.14.0, macOS Sequoia 15.0.1 (24A348) [Darwin 24.0.0]
Apple M2, 1 CPU, 8 logical and 8 physical cores
.NET SDK 8.0.204
  [Host]     : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD [AttachedDebugger]
  DefaultJob : .NET 8.0.4 (8.0.424.16909), Arm64 RyuJIT AdvSIMD


| Method               | Width | Mean         | Error     | StdDev    | Ratio | RatioSD |
|--------------------- |------ |-------------:|----------:|----------:|------:|--------:|
| Scalar               | 1024  |    635.31 us |  0.537 us |  0.448 us |  1.00 |    0.00 |
| UseVectors           | 1024  |    127.04 us |  0.567 us |  0.474 us |  0.20 |    0.00 |
| UseVectorsParallel   | 1024  |     46.37 us |  0.336 us |  0.314 us |  0.07 |    0.00 |
| PeterParallelScalar  | 1024  |    202.19 us |  1.025 us |  0.959 us |  0.32 |    0.00 |
|                      |       |              |           |           |       |         |
| Scalar               | 2048  |  2,625.64 us |  1.795 us |  1.402 us |  1.00 |    0.00 |
| UseVectors           | 2048  |    521.40 us |  0.301 us |  0.282 us |  0.20 |    0.00 |
| UseVectorsParallel   | 2048  |    152.11 us |  3.548 us | 10.064 us |  0.06 |    0.00 |
| PeterParallelScalar  | 2048  |    711.00 us |  1.806 us |  1.601 us |  0.27 |    0.00 |
|                      |       |              |           |           |       |         |
| Scalar               | 4096  | 10,457.09 us |  5.697 us |  5.051 us |  1.00 |    0.00 |
| UseVectors           | 4096  |  2,058.16 us |  4.110 us |  3.643 us |  0.20 |    0.00 |
| UseVectorsParallel   | 4096  |  1,152.15 us | 21.134 us | 21.703 us |  0.11 |    0.00 |
| PeterParallelScalar  | 4096  |  2,897.94 us | 56.893 us | 91.871 us |  0.28 |    0.01 |

附录

完整源代码: https://github.com/zyl910/VectorTraits.Sample.Benchmarks/blob/main/VectorTraits.Sample.Benchmarks.Inc/Image/Bgr24ToGray8Benchmark.cs
YGroup3Unzip 的文档: https://zyl910.github.io/VectorTraits_doc/api/Zyl.VectorTraits.Vectors.YGroup3Unzip.html
VectorTraits 的NuGet包: https://www.nuget.org/packages/VectorTraits
VectorTraits 的在线文档: https://zyl910.github.io/VectorTraits_doc/
VectorTraits 源代码: https://github.com/zyl910/VectorTraits

原文地址：https://blog.csdn.net/zyl910/article/details/143897971

免责声明：本站文章内容转载自网络资源，如本站内容侵犯了原著者的合法权益，可联系本站删除。更多内容请关注自学内容网（zxcms.com）！

上一篇：数据结构-8.Java. 七大排序算法(上篇)
下一篇：C语言内存结构

深入理解 AWS Route 53 加权路由策略：实现多区域负载均衡
在当今的云计算环境中，有效管理和分配流量对于确保应用程序的高可用性和性能至关重要。AWS Route 53 的加权路由策略提供了一个强大的工具来实现这一目标。在本文中，我们将深入探讨加权路由策略的概念
阅读更多2024-11-21
Elasticsearch：如何部署文本嵌入模型并将其用于语义搜索
如何部署文本嵌入模型并将其用于语义搜索。它使你能够生成文本的向量表示并对生成的向量执行向量相似性搜索。示例中使用的模型在上公开可用。该示例使用来自的公共数据集。它由来自 Microsoft Bing
阅读更多2024-11-21
深入理解CRC：通信可靠性的关键
循环冗余校验（CRC, Cyclic Redundancy Check）是现代数据传输和存储中不可或缺的技术，用于提高系统的可靠性和数据完整性。CRC的强大之处在于其能够高效检测并定位传输错误，确保数
阅读更多2024-11-21
低速接口项目之串口Uart开发(一)——串口UART
在 RS-232 标准中，最常用的配置是八个数据位+无奇偶校验+一个停止位，按照一个完整的字节包括一位起始位、8 位数据位、一位停止位即总共十位数据来算，要想完整的实现这十位数据的发送，就需要 11个
阅读更多2024-11-21
网络安全基础
相对性、时效性、相关性、不确定性、复杂性以及重要性等。指网络系统的硬件、软件及其系统中的数据受到保护，不因偶然的或者恶意的原因而遭受到破坏、更改、泄露，系统连续可靠正常地运行，网络服务不中断。1、机密
阅读更多2024-11-21
Docker1：认识docker、在Linux中安装docker
Docker：快速构建、运行、管理应用的工具。简单来说，帮助我们部署项目、以及项目依赖的各种组件。Docker是一个典型的运维工具。以上就是在Linux中，安装docker的全部详细过程，感兴趣就关注
阅读更多2024-11-21
基于Springboot+Vue医院急诊系统(源码+PPT+LW+调试部署)
基于Spring Boot + Vue的医院急诊系统，旨在为患者提供便捷的医院急诊服务，同时提升医院的信息化管理水平。系统分为前台和后台两部分，前台功能包括医生信息、医院病房、医院论坛、医院资讯、个人
阅读更多2024-11-21
Gin 框架中的表单处理与数据绑定
Gin框架允许注册自定义验证规则。import ("regexp"// 注册自定义验证规则})首先，根据表单的需要定义一个或多个结构体。这些结构体应该清晰地反映表单数据的结构和类型
阅读更多2024-11-21
CAAS 和 IAAS
和是两种云服务模式，分别代表不同层次的云计算服务。
阅读更多2024-11-21
Spark 分布式计算中网络传输和序列化的关系（二）
Spark 中序列化和网络传输的优化直接关系到分布式计算的整体性能。结合高效的序列化工具（如 Kryo）和合理的网络传输策略（如压缩、分区优化），可以显著提高数据处理效率。中，网络传输和序列化是数据处
阅读更多2024-11-21