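3.7.1 OpenCL Image Preprocessing Acceleration

Latest version: 2025/09/23

Introduction

In image-processing networks such as YOLO, input images are usually stored in HWC format (height × width × channels), whereas the network expects its input in BCHW format (batch × channels × height × width, where the batch size B is usually 1). Preprocessing therefore has to perform the following operations: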

  1. Image remapping (Remap)
  2. Data layout conversion
  3. Normalization
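
The layout change from HWC to BCHW can be illustrated with a minimal pure-Python sketch. The helper name `hwc_to_bchw` is hypothetical and is not part of the library; nested lists stand in for image buffers.

```python
def hwc_to_bchw(image):
    """Convert a nested-list HWC image into BCHW with batch size 1."""
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    # Gather every channel into its own H x W plane.
    chw = [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]
    return [chw]  # add the batch dimension (B = 1)


# A 2x2 image in which every pixel holds (B, G, R).
hwc = [
    [(1, 2, 3), (4, 5, 6)],
    [(7, 8, 9), (10, 11, 12)],
]
bchw = hwc_to_bchw(hwc)
print(bchw[0][0])  # first channel plane
```

In the HWC input the three channel values of one pixel are adjacent; in the BCHW output each channel forms a contiguous plane.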

Remap Class Initialization

Declare a Remap object:

Remap remapper(image.cols, image.rows, map_x, map_y, dst_width, dst_height, kernel_in, kernel_out, /*bgr_mean=*/std::tuple(0.0, 0.0, 0.0), /*bgr_std=*/std::tuple(1.0, 1.0, 1.0));

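Parameters:

| Name | Type | Description |
| --- | --- | --- |
| image.cols | int | Width of the source image |
| image.rows | int | Height of the source image |
| map_x | cv::Mat& | Mapping table X: the source x-coordinate for each destination pixel. Must have the same size as map_y |
| map_y | cv::Mat& | Mapping table Y: the source y-coordinate for each destination pixel. Must have the same size as map_x |
| dst_width | int | Width of the output image; should equal the width of map_x |
| dst_height | int | Height of the output image; should equal the height of map_x |
| kernel_in | cv::Mat& | Kernel input image; must be an empty image when declared |
| kernel_out | cv::Mat& | Kernel output image; must be an empty image when declared |
| bgr_mean | std::tuple | Per-channel mean in BGR order; subtracted from each channel during processing |
| bgr_std | std::tuple | Per-channel standard deviation in BGR order; each channel is divided by this value during processing |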
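
As an illustration of what map_x and map_y contain, the sketch below builds the two tables for an aspect-preserving resize. It mirrors the GetMapXY helper from the C++ test in this section, but uses nested lists instead of cv::Mat; the function name `build_maps` is hypothetical.

```python
def build_maps(src_width, src_height, dst_width, dst_height):
    """Build mapping tables that scale the source to fit inside the destination."""
    # Use the smaller scale factor so the whole source fits.
    ratio = min(dst_width / src_width, dst_height / src_height)
    scaled_width = int(src_width * ratio)
    scaled_height = int(src_height * ratio)
    # Entry (h, w) holds the source coordinate sampled by destination pixel (h, w).
    map_x = [[w / ratio for w in range(scaled_width)] for _ in range(scaled_height)]
    map_y = [[h / ratio for _ in range(scaled_width)] for h in range(scaled_height)]
    return map_x, map_y, ratio


map_x, map_y, ratio = build_maps(192, 320, 640, 640)
print(ratio, len(map_x), len(map_x[0]))
```

Both tables share the same shape, and each entry is a fractional source coordinate, which is why the remap step needs interpolation.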

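Processing Flow

  1. Remap the source image according to map_x and map_y, using linear (bilinear) interpolation.
  2. Divide the pixel values by 255.0 to scale them to the range 0 to 1.
  3. For each channel, subtract the mean and then divide by the standard deviation.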
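
The three steps above can be sketched for a single output value as follows. This is a pure-Python illustration, not the OpenCL kernel itself: the function names are hypothetical, clamping at the image border is an assumption, and the mean and standard deviation are applied on the 0 to 1 scale because that is the order in which the steps are listed.

```python
import math


def bilinear_sample(plane, x, y):
    """Sample a single-channel plane at fractional (x, y) with linear interpolation."""
    height, width = len(plane), len(plane[0])
    # Clamp the coordinate into the image (border handling is an assumption).
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = plane[y0][x0] * (1.0 - fx) + plane[y0][x1] * fx
    bottom = plane[y1][x0] * (1.0 - fx) + plane[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


def preprocess_value(plane, x, y, mean, std):
    """Apply remap, scaling, and normalization to one output value."""
    value = bilinear_sample(plane, x, y)  # step 1: remap with interpolation
    value /= 255.0                        # step 2: scale to the range 0 to 1
    return (value - mean) / std           # step 3: normalize


plane = [[0, 255], [0, 255]]
result = preprocess_value(plane, 0.5, 0.0, mean=0.5, std=0.5)
print(result)
```

Halfway between a 0 pixel and a 255 pixel the interpolated value is 127.5, which scales to 0.5 and, with a mean of 0.5, normalizes to zero.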

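Example steps:

  • Before running the kernel, copy the source image into the kernel input image:

    image.copyTo(kernel_in);
  • Call the remap() function of the Remap class to run the preprocessing:

    remapper.remap();
  • The preprocessed image is stored in the output image dst (kernel_out in the declaration above).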

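Performance comparison:

  • Previous method: preprocessing with OpenCV's cv::dnn::blobFromImage function. The cv::split call inside this function significantly hurts performance, especially on large images, so preprocessing takes a long time.
  • Current method: an OpenCL kernel writes pixels to memory directly in the target order and performs data type conversion, bilinear interpolation, and normalization inside the kernel. It also writes only the region of interest (ROI) and skips the padding area, which reduces memory bandwidth and copy overhead.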
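
To give a feel for how much writing the ROI-only approach can skip, the sketch below computes the ROI size for an aspect-preserving resize of the 500×375 test image into a 640×640 target. It is an illustrative calculation only: the function name `roi_size` is hypothetical, and an aspect-preserving fit with padding is assumed.

```python
def roi_size(src_width, src_height, dst_width, dst_height):
    """Return the (width, height) of the scaled image inside the destination.

    Integer arithmetic is used so the result does not depend on float rounding.
    """
    if dst_width * src_height <= dst_height * src_width:
        # Width is the limiting side.
        return dst_width, src_height * dst_width // src_width
    # Height is the limiting side.
    return src_width * dst_height // src_height, dst_height


roi_w, roi_h = roi_size(500, 375, 640, 640)
roi_pixels = roi_w * roi_h
padded_pixels = 640 * 640
print(roi_w, roi_h, roi_pixels / padded_pixels)
```

Under these assumptions the ROI covers 640×480 pixels, so a quarter of the 640×640 output is padding that never has to be written.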

The results are shown in the table below; the test image size is 500×375.

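| Target size | OpenCV | OpenCL (padding written) | OpenCL (ROI only) |
| --- | --- | --- | --- |
| 192×320 | 3.93 ms | 1.15 ms | 1.10 ms |
| 320×320 | 5.45 ms | 1.29 ms | 1.22 ms |
| 640×640 | 19.86 ms | 3.27 ms | 2.78 ms |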
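
The relative speedup follows directly from the timings in the table above. The short sketch below does the arithmetic; the dictionary keys are just labels chosen for this example.

```python
# Timings in milliseconds, copied from the table above.
timings_ms = {
    "192x320": {"opencv": 3.93, "opencl_padded": 1.15, "opencl_roi": 1.10},
    "320x320": {"opencv": 5.45, "opencl_padded": 1.29, "opencl_roi": 1.22},
    "640x640": {"opencv": 19.86, "opencl_padded": 3.27, "opencl_roi": 2.78},
}

# Speedup of the ROI-only OpenCL path relative to OpenCV.
speedups = {
    size: row["opencv"] / row["opencl_roi"] for size, row in timings_ms.items()
}

for size, factor in speedups.items():
    print(f"{size}: {factor:.1f}x faster than OpenCV")
```

The gain grows with the target size, from roughly 3.6x at 192×320 to roughly 7.1x at 640×640.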

Testing

Directory Structure

The test files are organized as follows:

opencl_image_preprocess
├── cpp
│   ├── CMakeLists.txt
│   └── main.cpp
└── py
    └── py_test.py

C++ Test

Environment Setup

sudo apt install libopencv-dev pocl-opencl-icd
wget https://archive.spacemit.com/ros2/prebuilt/brdk_libs/opencl_image_preprocess.tar.gz
tar -zxvf opencl_image_preprocess.tar.gz

Test Steps

cd cpp
mkdir build && cd build
cmake ..
make -j4
./test_opencl_image_preprocess

The test code is as follows:

CMakeLists.txt:

cmake_minimum_required(VERSION 3.10)
project(test_opencl_image_preprocess)

# Set the C++ standard
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -w -fdiagnostics-color=always -pthread")

# OpenCV
find_package(OpenCV 4 REQUIRED)
include_directories(${OpenCV_INCLUDE_DIRS})

# OpenCL
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS} )

# Adjust to the actual extraction path
set(OIP_INCLUDE "${CMAKE_CURRENT_SOURCE_DIR}/opencl_image_preprocess/include")
set(OIP_LIB "${CMAKE_CURRENT_SOURCE_DIR}/opencl_image_preprocess/lib")
include("${OIP_LIB}/cmake/opencl_image_preprocess/opencl_image_preprocessConfig.cmake")
include_directories(${OIP_INCLUDE})
link_directories(${OIP_LIB})

# Build the executable and link the libraries
add_executable(${PROJECT_NAME} main.cpp)

target_link_libraries(${PROJECT_NAME} ${LIBS} ${OpenCV_LIBS} OpenCL::OpenCL gbm ${GST_LIBRARIES})
target_link_libraries(${PROJECT_NAME} opencl_image_preprocess)

add_definitions(-D__fp16=_Float16)

main.cpp:

#include "opencl_image_preprocess.h"
#include <chrono>
#include <cmath>
#include <iostream>
#include <tuple>
#include <opencv2/opencv.hpp>

// Generate the mapping tables for an aspect-preserving resize.
void GetMapXY(const cv::Mat& src, cv::Mat& map_x, cv::Mat& map_y, int dst_width, int dst_height) {
    if (!map_x.empty() || !map_y.empty()) {
        std::cerr << "map_x and map_y should both be empty" << std::endl;
    }
    int src_width = src.cols;
    int src_height = src.rows;
    // ROI: scale the source so that it fits inside the destination
    float ratio = std::fmin(static_cast<float>(dst_width) / static_cast<float>(src_width),
                            static_cast<float>(dst_height) / static_cast<float>(src_height));
    int scaled_width = static_cast<int>(src_width * ratio);
    int scaled_height = static_cast<int>(src_height * ratio);
    cv::Mat map_x_copy(scaled_height, scaled_width, CV_32FC1, cv::Scalar(-1));
    cv::Mat map_y_copy(scaled_height, scaled_width, CV_32FC1, cv::Scalar(-1));

    // Each destination pixel (h, w) samples the source at (h / ratio, w / ratio).
    for (int h = 0; h < scaled_height; h++) {
        for (int w = 0; w < scaled_width; w++) {
            map_x_copy.at<float>(h, w) = w / ratio;
            map_y_copy.at<float>(h, w) = h / ratio;
        }
    }

    map_x_copy.copyTo(map_x);
    map_y_copy.copyTo(map_y);
}


int main() {
    cv::Mat image(320, 192, CV_8UC3, cv::Scalar(50, 150, 200));  // Generate a test image (320 rows, 192 columns)
    cv::Mat map_x, map_y;
    int dst_width = 640, dst_height = 640;
    GetMapXY(image, map_x, map_y, dst_width, dst_height);
    cv::Mat kernel_in;
    cv::Mat kernel_out;

    // Declare the Remap object
    Remap remapper(image.cols, image.rows, map_x, map_y, dst_width, dst_height, kernel_in, kernel_out,
                   std::tuple(0.0, 0.0, 0.0), std::tuple(1.0, 1.0, 1.0));
    image.copyTo(kernel_in);
    // Measure the frame rate over 1000 iterations
    auto begin_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; i++) remapper.remap();
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - begin_time).count();
    std::cout << "fps: " << 1000 * 1000000 / duration << std::endl;

    // Save the result, scaled back to 0..255
    cv::Mat kernel_valid;
    kernel_out.convertTo(kernel_valid, CV_8U, 255.0);
    cv::imwrite("./kernel_valid.jpg", kernel_valid);
    std::cout << "./kernel_valid.jpg is saved" << std::endl;

    return 0;
}

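After the program runs, kernel_valid.jpg is generated in the build directory. For easier inspection, the pixel values are multiplied by 255 and the channel planes are laid out one after another in RGB order.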

Python Test

Environment Setup

python -m venv .venv # Create a virtual environment
source .venv/bin/activate # Activate the virtual environment
pip install opencl-image-preprocess numpy tqdm opencv-python --index-url https://git.spacemit.com/api/v4/projects/33/packages/pypi/simple

Test Steps

cd py
python py_test.py

The test code is as follows:

py_test.py:

import numpy as np
import cv2
import time
import tqdm
from opencl_image_preprocess import OIP


if __name__ == "__main__":
    src_width = 192
    src_height = 320
    dst_width = 640
    dst_height = 640
    bgr_mean = (0, 0, 0)
    bgr_std = (1.0, 1.0, 1.0)
    np.random.seed(1)

    # Random mapping tables: each destination pixel samples a random source pixel
    map_x = np.random.randint(0, src_width, (dst_height, dst_width)).astype(np.float32)
    map_y = np.random.randint(0, src_height, (dst_height, dst_width)).astype(np.float32)

    # Constant-color test image in BGR order
    image_array = np.zeros((src_height, src_width, 3))
    image_array[:, :, 0] = 250
    image_array[:, :, 1] = 150
    image_array[:, :, 2] = 50
    image_array = image_array.astype(np.uint8)  # astype returns a new array, so keep the result

    # Measure the frame rate over num iterations
    begin_time = time.time()
    num = 1000
    for i in tqdm.trange(num):
        opencl_out = OIP(image_array, map_x, map_y, bgr_mean, bgr_std)
    print("fps : {:.4f}".format(num / (time.time() - begin_time)))

    # Save the result, scaled back to 0..255
    print("opencl_out shape is {}".format(opencl_out.shape))
    img_save = (opencl_out.reshape(-1, opencl_out.shape[-1]) * 255).astype(np.uint8)
    cv2.imwrite("kernel_valid.jpg", img_save)

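After the script runs, kernel_valid.jpg is generated in the py directory. For easier inspection, the pixel values are multiplied by 255 and the channel planes are laid out one after another in RGB order.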
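
The layout of the saved image can be understood with a small pure-Python sketch of what reshape(-1, W) does to a channel-first array: the channel planes end up stacked vertically in one tall single-channel image. The helper names are hypothetical, a channel-first output is assumed, and the clamping in `to_uint8` is a simplification of the actual type conversion.

```python
def stack_planes(chw):
    """Concatenate the channel planes row by row, like reshape(-1, W) on a CHW array."""
    return [list(row) for plane in chw for row in plane]


def to_uint8(value):
    """Multiply a normalized value by 255 and truncate it into 0..255."""
    return max(0, min(255, int(value * 255)))


# Three 2x2 channel planes holding normalized values.
chw = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.0, 0.0], [0.0, 0.0]],
]
image_rows = [[to_uint8(v) for v in row] for row in stack_planes(chw)]
print(len(image_rows), len(image_rows[0]))
```

For a 3×H×W output the saved image therefore has 3·H rows and W columns, with one grayscale band per channel.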