3.7.1 OpenCL Image Preprocessing Acceleration

Last Version: 2025/09/23

Overview

In vision networks like YOLO, the input image is usually in HWC format (Height × Width × Channels), while the network expects BCHW format (Batch × Channels × Height × Width, where Batch is usually 1).
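As a small illustration of the layout change described above, the NumPy sketch below converts an HWC array to BCHW with a batch size of 1. It is only a CPU-side illustration of the two layouts, not the library's implementation; the array contents are made up for the example.

```python
import numpy as np

# A tiny 2x3 "image" with 3 channels, stored HWC as OpenCV would return it.
hwc = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

# Move channels first (CHW), then add a leading batch axis of size 1 (BCHW).
chw = np.ascontiguousarray(hwc.transpose(2, 0, 1))
bchw = chw[np.newaxis, ...]

print(hwc.shape)   # (2, 3, 3)
print(bchw.shape)  # (1, 3, 2, 3)
```

The pixel at row 0, column 2, channel 1 in the HWC array ends up at index `[0, 1, 0, 2]` in the BCHW array.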

Preprocessing typically involves:

  1. Image remapping (Remap)
  2. Data format conversion
  3. Normalization

Remap Class Initialization

To start, declare a Remap class object:

Remap remapper(image.cols, image.rows, map_x, map_y, dst_width, dst_height, kernel_in, kernel_out, bgr_mean=std::tuple(0.0, 0.0, 0.0), bgr_std=std::tuple(1.0, 1.0, 1.0));

Parameter Description

| Name | Type | Description |
|---|---|---|
| image.cols | int | Width of the original image |
| image.rows | int | Height of the original image |
| map_x | cv::Mat& | X mapping table. Defines the horizontal mapping between the source and target image. Must have the same size as map_y. |
| map_y | cv::Mat& | Y mapping table. Defines the vertical mapping between the source and target image. Must have the same size as map_x. |
| dst_width | int | Width of the output image. Should match the width of map_x. |
| dst_height | int | Height of the output image. Should match the height of map_x. |
| kernel_in | cv::Mat& | Input image buffer. Should be empty when declared. |
| kernel_out | cv::Mat& | Output image buffer. Should be empty when declared. |
| bgr_mean | std::tuple | Mean values of the BGR channels of the input image. This mean is subtracted from each channel during processing. |
| bgr_std | std::tuple | Standard deviations of the BGR channels of the input image. Each channel is divided by this value during processing. |
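To make the role of map_x and map_y more concrete, here is a minimal NumPy sketch that builds the two tables for an aspect-preserving resize, following the same arithmetic as the GetMapXY function in the C++ test program. The helper name `build_resize_maps` is hypothetical and not part of the library.

```python
import numpy as np

def build_resize_maps(src_w, src_h, dst_w, dst_h):
    """Return (map_x, map_y) for an aspect-preserving resize (hypothetical helper)."""
    # Use the smaller scale factor so the scaled image fits inside the destination.
    ratio = min(dst_w / src_w, dst_h / src_h)
    scaled_w = int(src_w * ratio)
    scaled_h = int(src_h * ratio)
    # Destination pixel (h, w) samples the source at (h / ratio, w / ratio).
    xs = np.arange(scaled_w, dtype=np.float32) / ratio
    ys = np.arange(scaled_h, dtype=np.float32) / ratio
    map_x, map_y = np.meshgrid(xs, ys)
    return map_x.astype(np.float32), map_y.astype(np.float32)

map_x, map_y = build_resize_maps(192, 320, 640, 640)
print(map_x.shape, map_y.shape)  # (640, 384) (640, 384)
```

Both tables have the same shape, and each entry holds the source coordinate that the corresponding output pixel reads from.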

Processing Flow

  1. Use map_x and map_y to remap the original image (bilinear interpolation is applied).
  2. Convert pixel values from [0, 255] to [0, 1] by dividing by 255.0.
  3. Normalize each channel: subtract its mean, then divide by its standard deviation.
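The three steps above can be sketched on the CPU with NumPy as a reference for checking results. This is not the OpenCL kernel itself: the function name `preprocess_reference`, the clamping of out-of-range coordinates, and the final channel-first output layout are assumptions made for illustration.

```python
import numpy as np

def preprocess_reference(image, map_x, map_y, bgr_mean, bgr_std):
    """CPU sketch: bilinear remap -> scale to [0, 1] -> normalize -> CHW."""
    src_h, src_w = image.shape[:2]
    img = image.astype(np.float32)

    # Clamp sample coordinates to the image (a border-handling assumption).
    x = np.clip(map_x, 0, src_w - 1)
    y = np.clip(map_y, 0, src_h - 1)
    x0 = np.floor(x).astype(np.int64)
    y0 = np.floor(y).astype(np.int64)
    x1 = np.minimum(x0 + 1, src_w - 1)
    y1 = np.minimum(y0 + 1, src_h - 1)
    wx = (x - x0)[..., None]  # horizontal blend weight per output pixel
    wy = (y - y0)[..., None]  # vertical blend weight per output pixel

    # Step 1: bilinear interpolation between the four neighbouring pixels.
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    remapped = top * (1 - wy) + bottom * wy

    # Step 2: convert [0, 255] to [0, 1].
    scaled = remapped / 255.0

    # Step 3: per-channel normalization (mean and std are on the [0, 1] scale).
    mean = np.asarray(bgr_mean, dtype=np.float32)
    std = np.asarray(bgr_std, dtype=np.float32)
    normalized = (scaled - mean) / std

    # HWC -> CHW (assumed output layout).
    return np.ascontiguousarray(normalized.transpose(2, 0, 1))

# Upscale a solid-color 4x6 BGR image by a factor of 2.
image = np.full((4, 6, 3), (50, 150, 200), dtype=np.uint8)
map_x, map_y = np.meshgrid(np.arange(12, dtype=np.float32) / 2,
                           np.arange(8, dtype=np.float32) / 2)
out = preprocess_reference(image, map_x, map_y, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
print(out.shape)  # (3, 8, 12)
```

Because the normalization happens after the division by 255, the mean and standard deviation in this sketch are expressed on the [0, 1] scale, matching the order of the steps listed above.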

Example Usage

  • Copy the original image into the kernel input buffer:

    image.copyTo(kernel_in);
  • Run the preprocessing step:

    remapper.remap();
  • The processed image is stored in kernel_out.

Performance Comparison

  • Traditional method (OpenCV): Uses cv::dnn::blobFromImage, but its internal cv::split is a bottleneck on large images, causing slow preprocessing.

  • Optimized method (OpenCL): Writes pixels directly in the target order, handling type conversion, bilinear interpolation, and normalization inside the kernel. Only the Region of Interest (ROI) is written to memory, avoiding extra padding and reducing bandwidth usage.

Benchmark results (test input image size: 500×375):

| Target Size | OpenCV | OpenCL (with padding) | OpenCL (ROI only) |
|---|---|---|---|
| 192×320 | 3.93 ms | 1.15 ms | 1.10 ms |
| 320×320 | 5.45 ms | 1.29 ms | 1.22 ms |
| 640×640 | 19.86 ms | 3.27 ms | 2.78 ms |
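To read the benchmark figures as speedups, the short snippet below divides the OpenCV latency by each OpenCL latency. The numbers are copied from the benchmark results; the dictionary keys are only illustrative labels.

```python
# Latencies in milliseconds, copied from the benchmark results.
results = {
    "192×320": {"opencv": 3.93, "opencl_padding": 1.15, "opencl_roi": 1.10},
    "320×320": {"opencv": 5.45, "opencl_padding": 1.29, "opencl_roi": 1.22},
    "640×640": {"opencv": 19.86, "opencl_padding": 3.27, "opencl_roi": 2.78},
}

speedups = {}
for size, ms in results.items():
    with_padding = ms["opencv"] / ms["opencl_padding"]
    roi_only = ms["opencv"] / ms["opencl_roi"]
    speedups[size] = (with_padding, roi_only)
    print(f"{size}: {with_padding:.2f}x with padding, {roi_only:.2f}x ROI only")

# 192×320: 3.42x with padding, 3.57x ROI only
# 320×320: 4.22x with padding, 4.47x ROI only
# 640×640: 6.07x with padding, 7.14x ROI only
```

The gap widens with the target size, which is consistent with the cv::split bottleneck on large images mentioned for the OpenCV path.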

Testing Guide

Directory Structure

opencl_image_preprocess
├── cpp
│   ├── CMakeLists.txt
│   └── main.cpp
└── py
    └── py_test.py

CPP Test

Environment Setup

sudo apt install libopencv-dev pocl-opencl-icd
wget https://archive.spacemit.com/ros2/prebuilt/brdk_libs/opencl_image_preprocess.tar.gz
tar -zxvf opencl_image_preprocess.tar.gz

Test Steps

cd cpp
mkdir build && cd build
cmake ..
make -j4
./test_opencl_image_preprocess

Test Code

  • CMakeLists.txt → build configuration
cmake_minimum_required(VERSION 3.10)
project(test_opencl_image_preprocess)

# Set C++ standard
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -w -fdiagnostics-color=always -pthread")

# OpenCV
find_package(OpenCV 4 REQUIRED)
include_directories(${OpenCV_INCLUDE_DIRS})

# OpenCL
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS} )

# Adjust paths based on extraction location
set(OIP_INCLUDE "${CMAKE_CURRENT_SOURCE_DIR}/opencl_image_preprocess/include")
set(OIP_LIB "${CMAKE_CURRENT_SOURCE_DIR}/opencl_image_preprocess/lib")
include("${OIP_LIB}/cmake/opencl_image_preprocess/opencl_image_preprocessConfig.cmake")
include_directories(${OIP_INCLUDE})
link_directories(${OIP_LIB})

# Link libraries
add_executable(${PROJECT_NAME} main.cpp)

target_link_libraries(${PROJECT_NAME} ${LIBS} ${OpenCV_LIBS} OpenCL::OpenCL gbm ${GST_LIBRARIES})
target_link_libraries(${PROJECT_NAME} opencl_image_preprocess)

add_definitions(-D__fp16=_Float16)
  • main.cpp → test program
#include "opencl_image_preprocess.h"
#include <chrono>
#include <cmath>
#include <iostream>
#include <tuple>
#include <opencv2/opencv.hpp>

// Generate the mapping tables for an aspect-preserving resize.
void GetMapXY(const cv::Mat& src, cv::Mat& map_x, cv::Mat& map_y, int dst_width, int dst_height) {
    if (!map_x.empty() || !map_y.empty()) {
        std::cerr << "map_x and map_y should both be empty" << std::endl;
    }
    int src_width = src.cols;
    int src_height = src.rows;
    // ROI: use the smaller scale factor so the scaled image fits inside the destination.
    float ratio = std::fmin(static_cast<float>(dst_width) / static_cast<float>(src_width),
                            static_cast<float>(dst_height) / static_cast<float>(src_height));
    int scaled_width = static_cast<int>(src_width * ratio);
    int scaled_height = static_cast<int>(src_height * ratio);
    cv::Mat map_x_copy(scaled_height, scaled_width, CV_32FC1, cv::Scalar(-1));
    cv::Mat map_y_copy(scaled_height, scaled_width, CV_32FC1, cv::Scalar(-1));

    // Each output pixel (h, w) samples the source image at (h / ratio, w / ratio).
    for (int h = 0; h < scaled_height; h++) {
        for (int w = 0; w < scaled_width; w++) {
            map_x_copy.at<float>(h, w) = w / ratio;
            map_y_copy.at<float>(h, w) = h / ratio;
        }
    }

    map_x_copy.copyTo(map_x);
    map_y_copy.copyTo(map_y);
}


int main() {
    // Create a solid-color test image (320 rows × 192 columns, BGR).
    cv::Mat image(320, 192, CV_8UC3, cv::Scalar(50, 150, 200));
    cv::Mat map_x, map_y;
    int dst_width = 640, dst_height = 640;
    GetMapXY(image, map_x, map_y, dst_width, dst_height);
    cv::Mat kernel_in;
    cv::Mat kernel_out;

    // Declare the Remap object.
    Remap remapper(image.cols, image.rows, map_x, map_y, dst_width, dst_height, kernel_in, kernel_out,
                   std::tuple(0.0, 0.0, 0.0), std::tuple(1.0, 1.0, 1.0));
    image.copyTo(kernel_in);

    // Measure performance over 1000 runs.
    auto begin_time = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000; i++) remapper.remap();
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - begin_time).count();
    std::cout << "fps: " << 1000 * 1000000 / duration << std::endl;

    // Save the output image (values scaled back from [0, 1] to [0, 255]).
    cv::Mat kernel_valid;
    kernel_out.convertTo(kernel_valid, CV_8U, 255.0);
    cv::imwrite("./kernel_valid.jpg", kernel_valid);
    std::cout << "./kernel_valid.jpg is saved" << std::endl;

    return 0;
}

Output: The build directory will contain kernel_valid.jpg. Pixels are scaled by 255× and saved in RGB order for better visibility.

Python Test

Environment Setup

python -m venv .venv # Create a virtual environment
source .venv/bin/activate # Activate the virtual environment
pip install opencl-image-preprocess numpy tqdm opencv-python --index-url https://git.spacemit.com/api/v4/projects/33/packages/pypi/simple

Test Steps

cd py
python py_test.py

Test Code

  • py_test.py:
import numpy as np
import cv2
import time
import tqdm
from opencl_image_preprocess import OIP


if __name__ == "__main__":
    src_width = 192
    src_height = 320
    dst_width = 640
    dst_height = 640
    bgr_mean = (0, 0, 0)
    bgr_std = (1.0, 1.0, 1.0)
    np.random.seed(1)

    # Random mapping tables with the size of the output image
    map_x = np.random.randint(0, src_width, (dst_height, dst_width)).astype(np.float32)
    map_y = np.random.randint(0, src_height, (dst_height, dst_width)).astype(np.float32)

    # Solid-color test image
    image_array = np.zeros((src_height, src_width, 3))
    image_array[:, :, 0] = 250
    image_array[:, :, 1] = 150
    image_array[:, :, 2] = 50
    # astype returns a new array, so the result must be assigned back
    image_array = image_array.astype(np.uint8)

    # Measure performance over 1000 runs
    begin_time = time.time()
    num = 1000
    for i in tqdm.trange(num):
        opencl_out = OIP(image_array, map_x, map_y, bgr_mean, bgr_std)
    print("fps : {:.4f}".format(num / (time.time() - begin_time)))

    # Save the output image (values scaled back from [0, 1] to [0, 255])
    print("opencl_out shape is {}".format(opencl_out.shape))
    img_save = (opencl_out.reshape(-1, opencl_out.shape[-1]) * 255).astype(np.uint8)
    cv2.imwrite("kernel_valid.jpg", img_save)

Output: The py directory will contain kernel_valid.jpg. Pixels are scaled by 255× and saved in RGB order for better visibility.