3.7.1 OpenCL Image Preprocessing Acceleration
Last Version: 2025/09/23
Overview
In vision networks like YOLO, the input image is usually in HWC format (Height × Width × Channels), while the network expects BCHW format (Batch × Channels × Height × Width, where Batch is usually 1).
Preprocessing typically involves:
- Image remapping (Remap)
- Data format conversion
- Normalization
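The layout change from HWC to BCHW can be sketched in plain Python as below. This is an illustrative sketch only: `hwc_to_bchw` is a hypothetical helper name, nested lists stand in for image buffers, and it is not part of the opencl_image_preprocess library.

```python
def hwc_to_bchw(image_hwc):
    """Reorder a nested-list HWC image into BCHW with a batch size of 1."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    # One full plane per channel: planes[c][h][w] == image_hwc[h][w][c]
    planes = [
        [[image_hwc[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]
    return [planes]  # leading batch dimension of size 1


# 2x2 test image; every pixel is a (B, G, R) triple
image = [
    [(50, 150, 200), (51, 151, 201)],
    [(52, 152, 202), (53, 153, 203)],
]
blob = hwc_to_bchw(image)
print(len(blob), len(blob[0]), len(blob[0][0]), len(blob[0][0][0]))
```

In HWC the three channel values of a pixel are adjacent in memory; in BCHW each channel becomes its own contiguous H × W plane.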
Remap Class Initialization
To start, declare a Remap class object:
Remap remapper(image.cols, image.rows, map_x, map_y, dst_width, dst_height, kernel_in, kernel_out, bgr_mean=std::tuple(0.0, 0.0, 0.0), bgr_std=std::tuple(1.0, 1.0, 1.0));
Parameter Description
Name | Type | Description |
---|---|---|
image.cols | int | Width of the original image |
image.rows | int | Height of the original image |
map_x | cv::Mat& | X mapping table; defines the horizontal mapping between source and target image; must have the same size as map_y |
map_y | cv::Mat& | Y mapping table; defines the vertical mapping between source and target image; must have the same size as map_x |
dst_width | int | Width of the output image; should match the width of map_x |
dst_height | int | Height of the output image; should match the height of map_x |
kernel_in | cv::Mat& | Input image buffer; should be empty when declared |
kernel_out | cv::Mat& | Output image buffer; should be empty when declared |
bgr_mean | std::tuple | Mean values of the BGR channels of the input image; the mean is subtracted from each channel during processing |
bgr_std | std::tuple | Standard deviations of the BGR channels of the input image; each channel is divided by its value during processing |
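As a rough illustration of what map_x and map_y contain, the sketch below builds aspect-ratio-preserving mapping tables in plain Python, following the same arithmetic as the GetMapXY helper in the C++ test program on this page. `build_letterbox_maps` is a hypothetical name, and nested lists stand in for the cv::Mat tables.

```python
def build_letterbox_maps(src_width, src_height, dst_width, dst_height):
    """Return (map_x, map_y) covering only the scaled region of interest."""
    # Uniform scale factor that fits the source inside the target
    ratio = min(dst_width / src_width, dst_height / src_height)
    scaled_width = int(src_width * ratio)
    scaled_height = int(src_height * ratio)
    # Entry [h][w] holds the source coordinate sampled for target pixel (w, h)
    map_x = [[w / ratio for w in range(scaled_width)]
             for _ in range(scaled_height)]
    map_y = [[h / ratio for _ in range(scaled_width)]
             for h in range(scaled_height)]
    return map_x, map_y


map_x, map_y = build_letterbox_maps(192, 320, 640, 640)
print(len(map_x), len(map_x[0]))  # table height, table width
```

Both tables come out with identical dimensions, which satisfies the same-size requirement listed for map_x and map_y.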
Processing Flow
- Use map_x and map_y to remap the original image (linear interpolation is applied).
- Convert pixel values from [0, 255] to [0, 1] by dividing by 255.0.
- Normalize each channel: subtract its mean, then divide by its standard deviation.
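The three steps above can be sketched for a single output pixel in plain Python. This is a reference sketch under stated assumptions, not the OpenCL kernel itself: `bilinear_sample` and `preprocess_pixel` are hypothetical names, coordinates are clamped to the image border (the library's border handling is not documented here), and the mean is assumed to be expressed on the [0, 1] scale because it is subtracted after the division by 255.0.

```python
import math


def bilinear_sample(image, x, y):
    """Sample an HWC nested-list image at float coordinates (x, y)."""
    height, width = len(image), len(image[0])
    # Clamp to the image so border samples stay valid (an assumption)
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    sampled = []
    for c in range(len(image[0][0])):
        # Blend horizontally on the two rows, then vertically between them
        top = image[y0][x0][c] * (1 - fx) + image[y0][x1][c] * fx
        bottom = image[y1][x0][c] * (1 - fx) + image[y1][x1][c] * fx
        sampled.append(top * (1 - fy) + bottom * fy)
    return sampled


def preprocess_pixel(image, x, y, mean, std):
    """Remap, scale to [0, 1], then normalize each channel."""
    sampled = bilinear_sample(image, x, y)
    return [(v / 255.0 - m) / s for v, m, s in zip(sampled, mean, std)]


# 1x2 image: a black pixel next to a white pixel
image = [[(0, 0, 0), (255, 255, 255)]]
halfway = preprocess_pixel(image, 0.5, 0.0, (0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
print(halfway)
```

Sampling halfway between the two pixels yields 127.5 per channel, which becomes 0.5 after scaling and 0.0 after normalization with a mean and standard deviation of 0.5.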
Example Usage
- Copy the original image into the kernel input buffer:
image.copyTo(kernel_in);
- Run the preprocessing step:
remapper.remap();
- The processed image is stored in the output buffer passed to the constructor (kernel_out in the declaration above).
Performance Comparison
- Traditional method (OpenCV): Uses cv::dnn::blobFromImage, but its internal cv::split is a bottleneck on large images, causing slow preprocessing.
- Optimized method (OpenCL): Writes pixels directly in the target order, handling type conversion, bilinear interpolation, and normalization inside the kernel. Only the Region of Interest (ROI) is written to memory, avoiding extra padding and reducing bandwidth usage.
Benchmark results (test input image size: 500×375):
Target Size | OpenCV | OpenCL (with padding) | OpenCL (ROI only) |
---|---|---|---|
192×320 | 3.93 ms | 1.15 ms | 1.10 ms |
320×320 | 5.45 ms | 1.29 ms | 1.22 ms |
640×640 | 19.86 ms | 3.27 ms | 2.78 ms |
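To read the table as speedup factors, the timings can be divided directly. The short script below uses only the figures from the table above; the dictionary keys are illustrative labels.

```python
# Timings in milliseconds, copied from the benchmark table
benchmarks_ms = {
    "192x320": {"opencv": 3.93, "opencl_padding": 1.15, "opencl_roi": 1.10},
    "320x320": {"opencv": 5.45, "opencl_padding": 1.29, "opencl_roi": 1.22},
    "640x640": {"opencv": 19.86, "opencl_padding": 3.27, "opencl_roi": 2.78},
}

speedups = {}
for size, row in benchmarks_ms.items():
    # Speedup = OpenCV time divided by the OpenCL time for the same target size
    speedups[size] = {
        name: row["opencv"] / row[name]
        for name in ("opencl_padding", "opencl_roi")
    }
    print(size,
          "padding: {:.2f}x".format(speedups[size]["opencl_padding"]),
          "ROI only: {:.2f}x".format(speedups[size]["opencl_roi"]))
```

The ROI-only variant is roughly 3.6× faster than OpenCV at 192×320 and roughly 7.1× faster at 640×640, so the advantage grows with the target size.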
Testing Guide
Directory Structure
opencl_image_preprocess
├── cpp
│ ├── CMakeLists.txt
│ └── main.cpp
└── py
└── py_test.py
CPP Test
Environment Setup
sudo apt install libopencv-dev pocl-opencl-icd
wget https://archive.spacemit.com/ros2/prebuilt/brdk_libs/opencl_image_preprocess.tar.gz
tar -zxvf opencl_image_preprocess.tar.gz
Test Steps
cd cpp
mkdir build && cd build
cmake ..
make -j4
./test_opencl_image_preprocess
Test Code
CMakeLists.txt → build configuration
cmake_minimum_required(VERSION 3.10)
project(test_opencl_image_preprocess)
# Set C++ standard
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -w -fdiagnostics-color=always -pthread")
# OpenCV
find_package(OpenCV 4 REQUIRED)
include_directories(${OpenCV_INCLUDE_DIRS})
# OpenCL
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS} )
# Adjust paths based on extraction location
set(OIP_INCLUDE "${CMAKE_CURRENT_SOURCE_DIR}/opencl_image_preprocess/include")
set(OIP_LIB "${CMAKE_CURRENT_SOURCE_DIR}/opencl_image_preprocess/lib")
include("${OIP_LIB}/cmake/opencl_image_preprocess/opencl_image_preprocessConfig.cmake")
include_directories(${OIP_INCLUDE})
link_directories(${OIP_LIB})
# Link libraries
add_executable(${PROJECT_NAME} main.cpp)
target_link_libraries(${PROJECT_NAME} ${LIBS} ${OpenCV_LIBS} OpenCL::OpenCL gbm ${GST_LIBRARIES})
target_link_libraries(${PROJECT_NAME} opencl_image_preprocess)
add_definitions(-D__fp16=_Float16)
main.cpp → test program
#include "opencl_image_preprocess.h"
#include <chrono>
#include <cmath>
#include <iostream>
#include <tuple>
#include <opencv2/opencv.hpp>

// Generate the mapping tables for an aspect-ratio-preserving resize (ROI only)
void GetMapXY(const cv::Mat& src, cv::Mat& map_x, cv::Mat& map_y, int dst_width, int dst_height) {
    if (!map_x.empty() || !map_y.empty()) {
        std::cerr << "map_x and map_y should both be empty" << std::endl;
    }
    int src_width = src.cols;
    int src_height = src.rows;
    // Scale factor that fits the source inside the target; the ROI is the scaled image
    float ratio = std::fmin(static_cast<float>(dst_width) / static_cast<float>(src_width),
                            static_cast<float>(dst_height) / static_cast<float>(src_height));
    int scaled_width = static_cast<int>(src_width * ratio);
    int scaled_height = static_cast<int>(src_height * ratio);
    cv::Mat map_x_copy(scaled_height, scaled_width, CV_32FC1, cv::Scalar(-1));
    cv::Mat map_y_copy(scaled_height, scaled_width, CV_32FC1, cv::Scalar(-1));
    // Each target pixel (w, h) samples the source at (w / ratio, h / ratio)
    for (int h = 0; h < scaled_height; h++) {
        for (int w = 0; w < scaled_width; w++) {
            map_x_copy.at<float>(h, w) = w / ratio;
            map_y_copy.at<float>(h, w) = h / ratio;
        }
    }
    map_x_copy.copyTo(map_x);
    map_y_copy.copyTo(map_y);
}
int main() {
    // Create a test image: 320 rows × 192 columns, constant BGR color
    cv::Mat image(320, 192, CV_8UC3, cv::Scalar(50, 150, 200));
    cv::Mat map_x, map_y;
    int dst_width = 640, dst_height = 640;
    GetMapXY(image, map_x, map_y, dst_width, dst_height);
    cv::Mat kernel_in;
    cv::Mat kernel_out;
    // Declare the Remap object
    Remap remapper(image.cols, image.rows, map_x, map_y, dst_width, dst_height, kernel_in, kernel_out,
                   std::tuple(0.0, 0.0, 0.0), std::tuple(1.0, 1.0, 1.0));
    image.copyTo(kernel_in);
    // Measure performance over 1000 runs
    auto begin_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; i++) remapper.remap();
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - begin_time).count();
    std::cout << "fps: " << 1000 * 1000000 / duration << std::endl;
    // Save the output image (values scaled back from [0, 1] to [0, 255])
    cv::Mat kernel_valid;
    kernel_out.convertTo(kernel_valid, CV_8U, 255.0);
    cv::imwrite("./kernel_valid.jpg", kernel_valid);
    std::cout << "./kernel_valid.jpg is saved" << std::endl;
    return 0;
}
Output: The build directory will contain kernel_valid.jpg. Pixel values are multiplied by 255 and saved in RGB order for better visibility.
Python Test
Environment Setup
python -m venv .venv # Create a virtual environment
source .venv/bin/activate # Activate the virtual environment
pip install opencl-image-preprocess numpy tqdm opencv-python --index-url https://git.spacemit.com/api/v4/projects/33/packages/pypi/simple
Test Steps
cd py
python py_test.py
Test Code
py_test.py:
import numpy as np
import cv2
import time
import tqdm
from opencl_image_preprocess import OIP
if __name__ == "__main__":
    src_width = 192
    src_height = 320
    dst_width = 640
    dst_height = 640
    bgr_mean = (0, 0, 0)
    bgr_std = (1.0, 1.0, 1.0)
    # Random mapping tables with a fixed seed for repeatable runs
    np.random.seed(1)
    map_x = np.random.randint(0, src_width, (dst_height, dst_width)).astype(np.float32)
    map_y = np.random.randint(0, src_height, (dst_height, dst_width)).astype(np.float32)
    # Constant-color test image in HWC layout
    image_array = np.zeros((src_height, src_width, 3))
    image_array[:, :, 0] = 250
    image_array[:, :, 1] = 150
    image_array[:, :, 2] = 50
    # astype returns a new array, so the result must be assigned back
    image_array = image_array.astype(np.uint8)
    # Measure performance over 1000 runs
    begin_time = time.time()
    num = 1000
    for i in tqdm.trange(num):
        opencl_out = OIP(image_array, map_x, map_y, bgr_mean, bgr_std)
    print("fps : {:.4f}".format(num / (time.time() - begin_time)))
    print("opencl_out shape is {}".format(opencl_out.shape))
    # Save the output image (values scaled back from [0, 1] to [0, 255])
    img_save = (opencl_out.reshape(-1, opencl_out.shape[-1]) * 255).astype(np.uint8)
    cv2.imwrite("kernel_valid.jpg", img_save)
Output: The py directory will contain kernel_valid.jpg. Pixel values are multiplied by 255 and saved in RGB order for better visibility.