3.7.1 OpenCL Image Preprocessing Acceleration

Latest Version: 2025/09/23

Overview

In vision networks like YOLO, the input image is usually in HWC format (Height × Width × Channels), while the network expects BCHW format (Batch × Channels × Height × Width, where Batch is usually 1).
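
The layout change described above can be sketched in plain Python. This is only an illustration of the index reordering, not the library's implementation; `hwc_to_bchw` is a hypothetical helper name, and nested lists stand in for image buffers so the indexing stays explicit.

```python
def hwc_to_bchw(image):
    """Convert a nested-list HWC image to BCHW with a batch size of 1.

    Hypothetical helper for illustration only; not part of opencl_image_preprocess.
    """
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    # Build one full H x W plane per channel instead of interleaving channels per pixel.
    planes = [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]
    return [planes]  # leading batch dimension of size 1


# A 2x2 BGR image: each pixel is [B, G, R].
hwc = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
bchw = hwc_to_bchw(hwc)
print(bchw[0][0])  # blue plane: [[1, 4], [7, 10]]
```

In HWC the three channel values of a pixel sit next to each other; in BCHW each channel becomes its own contiguous plane.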

Preprocessing typically involves:

Image remapping (Remap)
Data format conversion
Normalization

Remap Class Initialization

To start, declare a Remap class object:

Remap remapper(image.cols, image.rows, map_x, map_y, dst_width, dst_height, kernel_in, kernel_out, bgr_mean=std::tuple(0.0, 0.0, 0.0), bgr_std=std::tuple(1.0, 1.0, 1.0));

Parameter Description

Name	Type	Description
`image.cols`	int	Width of the original image
`image.rows`	int	Height of the original image
`map_x`	cv::Mat&	X mapping table. Defines the horizontal mapping between the source and target images. Must be the same size as `map_y`.
`map_y`	cv::Mat&	Y mapping table. Defines the vertical mapping between the source and target images. Must be the same size as `map_x`.
`dst_width`	int	Width of the output image. Should match the width of `map_x`.
`dst_height`	int	Height of the output image. Should match the height of `map_x`.
`kernel_in`	cv::Mat&	Input image buffer. Should be empty when declared.
`kernel_out`	cv::Mat&	Output image buffer. Should be empty when declared.
`bgr_mean`	std::tuple	Per-channel mean values of the input image, in BGR order. The mean is subtracted from each channel during processing.
`bgr_std`	std::tuple	Per-channel standard deviations of the input image, in BGR order. Each channel is divided by this value during processing.
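
To make the role of `map_x` and `map_y` more concrete, here is a minimal sketch of how two tables for a plain stretch-resize might be filled. It assumes that each entry holds the source coordinate to sample for that output pixel, which is how the C++ test program later in this document fills its tables. `build_resize_maps` is a hypothetical name, and nested lists stand in for the `CV_32FC1` matrices the class expects.

```python
def build_resize_maps(src_width, src_height, dst_width, dst_height):
    """Build per-output-pixel source coordinates for a simple stretch-resize.

    Hypothetical helper for illustration only; not part of opencl_image_preprocess.
    """
    scale_x = src_width / dst_width
    scale_y = src_height / dst_height
    # map_x[y][x] is the source column for output pixel (x, y);
    # map_y[y][x] is the source row for the same output pixel.
    map_x = [[x * scale_x for x in range(dst_width)] for _ in range(dst_height)]
    map_y = [[y * scale_y for _ in range(dst_width)] for y in range(dst_height)]
    return map_x, map_y


map_x, map_y = build_resize_maps(src_width=8, src_height=4, dst_width=4, dst_height=2)
print(map_x[0])                   # [0.0, 2.0, 4.0, 6.0]
print([row[0] for row in map_y])  # [0.0, 2.0]
```

Both tables have the output image's dimensions (here 4×2), consistent with the size requirements listed in the table.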

Processing Flow

Use map_x and map_y to remap the original image (linear interpolation is applied).
Convert pixel values from [0, 255] to [0, 1] by dividing by 255.0.
Normalize each channel: subtract its mean, then divide by its standard deviation.
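
The three steps above can be sketched for a single output pixel as a CPU reference in plain Python. This is not the OpenCL kernel itself: `bilinear_sample` and `preprocess_pixel` are hypothetical names, and clamping coordinates at the image border is an assumption, since the border behaviour is not specified here.

```python
import math


def bilinear_sample(image, x, y):
    """Sample an HWC nested-list image at fractional (x, y) with bilinear weights.

    Hypothetical helper; border coordinates are clamped (an assumption).
    """
    height, width = len(image), len(image[0])
    x0 = min(max(int(math.floor(x)), 0), width - 1)
    y0 = min(max(int(math.floor(y)), 0), height - 1)
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    channels = len(image[0][0])
    return [
        (image[y0][x0][c] * (1 - fx) + image[y0][x1][c] * fx) * (1 - fy)
        + (image[y1][x0][c] * (1 - fx) + image[y1][x1][c] * fx) * fy
        for c in range(channels)
    ]


def preprocess_pixel(image, x, y, bgr_mean, bgr_std):
    """Apply the three steps: remap, scale to [0, 1], normalize per channel."""
    sampled = bilinear_sample(image, x, y)     # step 1: remap
    scaled = [v / 255.0 for v in sampled]      # step 2: [0, 255] -> [0, 1]
    return [(v - m) / s for v, m, s in zip(scaled, bgr_mean, bgr_std)]  # step 3


# Left column black, right column white; sample halfway between them.
image = [
    [[0, 0, 0], [255, 255, 255]],
    [[0, 0, 0], [255, 255, 255]],
]
out = preprocess_pixel(image, 0.5, 0.0, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
print(out)  # [0.5, 0.5, 0.5]
```

Because the division by 255 happens before normalization, the mean and standard deviation are applied to values in the [0, 1] range, not [0, 255].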

Example Usage

Copy the original image into the kernel input buffer:
```
image.copyTo(kernel_in);
```
Run the preprocessing step:
```
remapper.remap();
```
The processed image is stored in kernel_out.

Performance Comparison

Traditional method (OpenCV): Uses cv::dnn::blobFromImage, but its internal cv::split is a bottleneck on large images, causing slow preprocessing.
Optimized method (OpenCL): Writes pixels directly in the target order, handling type conversion, bilinear interpolation, and normalization inside the kernel. Only the Region of Interest (ROI) is written to memory, avoiding extra padding and reducing bandwidth usage.

Benchmark results (input image size: 500×375):

Target Size	OpenCV	OpenCL (with padding)	OpenCL (ROI only)
192×320	3.93 ms	1.15 ms	1.10 ms
320×320	5.45 ms	1.29 ms	1.22 ms
640×640	19.86 ms	3.27 ms	2.78 ms
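
As a rough reading of the table, the sketch below computes the speedup of the ROI-only path over OpenCV and estimates how much of each output the ROI covers. It rests on two assumptions that the table does not state: sizes are given as width × height, and the benchmark used an aspect-preserving scale like the one in the C++ test program. `roi_size` is a hypothetical helper.

```python
from fractions import Fraction


def roi_size(src_w, src_h, dst_w, dst_h):
    """Aspect-preserving scaled size (hypothetical helper, exact arithmetic)."""
    ratio = min(Fraction(dst_w, src_w), Fraction(dst_h, src_h))
    return int(src_w * ratio), int(src_h * ratio)


# (target width, target height) -> (OpenCV ms, OpenCL ROI-only ms), from the table above.
timings = {
    (192, 320): (3.93, 1.10),
    (320, 320): (5.45, 1.22),
    (640, 640): (19.86, 2.78),
}

for (dst_w, dst_h), (opencv_ms, roi_ms) in timings.items():
    roi_w, roi_h = roi_size(500, 375, dst_w, dst_h)
    coverage = (roi_w * roi_h) / (dst_w * dst_h)
    speedup = opencv_ms / roi_ms
    print(f"{dst_w}x{dst_h}: ROI {roi_w}x{roi_h}, "
          f"{coverage:.0%} of output written, {speedup:.2f}x faster than OpenCV")
```

The measured speedups are about 3.6×, 4.5×, and 7.1× for the three target sizes. Under the stated assumptions the ROI covers 45%, 75%, and 75% of the output, so the ROI-only path skips writing the remainder.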

Testing Guide

Directory Structure

opencl_image_preprocess
├── cpp
│   ├── CMakeLists.txt
│   └── main.cpp
└── py
    └── py_test.py 

CPP Test

Environment Setup

sudo apt install libopencv-dev pocl-opencl-icd
wget https://archive.spacemit.com/ros2/prebuilt/brdk_libs/opencl_image_preprocess.tar.gz
tar -zxvf opencl_image_preprocess.tar.gz

Test Steps

cd cpp
mkdir build && cd build
cmake ..
make -j4
./test_opencl_image_preprocess

Test Code

CMakeLists.txt → build configuration

cmake_minimum_required(VERSION 3.10)
project(test_opencl_image_preprocess)

# Set C++ standard
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -w -fdiagnostics-color=always -pthread")

# OpenCV
find_package(OpenCV 4 REQUIRED)
include_directories(${OpenCV_INCLUDE_DIRS})

# OpenCL
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS} )

# Adjust paths based on extraction location
set(OIP_INCLUDE "${CMAKE_CURRENT_SOURCE_DIR}/opencl_image_preprocess/include")
set(OIP_LIB "${CMAKE_CURRENT_SOURCE_DIR}/opencl_image_preprocess/lib")
include("${OIP_LIB}/cmake/opencl_image_preprocess/opencl_image_preprocessConfig.cmake")
include_directories(${OIP_INCLUDE})
link_directories(${OIP_LIB})

# Link libraries
add_executable(${PROJECT_NAME} main.cpp)

target_link_libraries(${PROJECT_NAME} ${LIBS} ${OpenCV_LIBS} OpenCL::OpenCL gbm ${GST_LIBRARIES})
target_link_libraries(${PROJECT_NAME} opencl_image_preprocess)

add_definitions(-D__fp16=_Float16)

main.cpp → test program

#include "opencl_image_preprocess.h"
#include <chrono>
#include <cmath>
#include <iostream>
#include <tuple>
#include <opencv2/opencv.hpp>

// Generate the mapping tables for an aspect-preserving resize (scaled region only)
void GetMapXY(const cv::Mat& src, cv::Mat& map_x, cv::Mat& map_y, int dst_width, int dst_height) {
    if (!map_x.empty() or !map_y.empty()) {
        std::cerr << "map_x and map_y should both be empty" << std::endl;
    }
    int src_width = src.cols;
    int src_height = src.rows;
    // roi
    float ratio = fmin(static_cast<float>(dst_width) / static_cast<float>(src_width), static_cast<float>(dst_height) / static_cast<float>(src_height));
    int scaled_width = static_cast<int>(src_width * ratio);
    int scaled_height = static_cast<int>(src_height * ratio);
    cv::Mat map_x_copy(scaled_height, scaled_width, CV_32FC1, cv::Scalar(-1));
    cv::Mat map_y_copy(scaled_height, scaled_width, CV_32FC1, cv::Scalar(-1));

    for (int h = 0; h < scaled_height; h++) {
        for (int w = 0; w < scaled_width; w++) {
            map_x_copy.at<float>(h, w) = w / ratio;
            map_y_copy.at<float>(h, w) = h / ratio;
        }
    }

    map_x_copy.copyTo(map_x);
    map_y_copy.copyTo(map_y);
}


int main() {
    cv::Mat image(320, 192, CV_8UC3, cv::Scalar(50, 150, 200)); // Create a test image (320 rows, 192 columns)
    cv::Mat map_x, map_y;
    int dst_width = 640, dst_height = 640;
    GetMapXY(image, map_x, map_y, dst_width, dst_height);
    cv::Mat kernel_in;
    cv::Mat kernel_out;

    // Declare the Remap object
    Remap remapper(image.cols, image.rows, map_x, map_y, dst_width, dst_height, kernel_in, kernel_out, std::tuple(0.0, 0.0, 0.0), std::tuple(1.0, 1.0, 1.0));
    image.copyTo(kernel_in);
    // Measure performance over 1000 iterations
    auto begin_time = std::chrono::high_resolution_clock::now();
    for(int i=0; i<1000; i++) remapper.remap();
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end_time - begin_time).count();
    std::cout << "fps: " << 1000 * 1000000 / duration << std::endl;

    // Save the output image
    cv::Mat kernel_valid;
    kernel_out.convertTo(kernel_valid, CV_8U, 255.0);
    cv::imwrite("./kernel_valid.jpg", kernel_valid);
    std::cout << "./kernel_valid.jpg is saved" << std::endl;

    return 0;
}

Output: The build directory will contain kernel_valid.jpg. Pixels are scaled by 255× and saved in RGB order for better visibility.

Python Test

Environment Setup

python -m venv .venv # Create a virtual environment
source .venv/bin/activate # Activate the virtual environment
pip install opencl-image-preprocess numpy tqdm opencv-python --index-url https://git.spacemit.com/api/v4/projects/33/packages/pypi/simple

Test Steps

cd py
python py_test.py

Test Code

py_test.py:

import numpy as np
import cv2
import time
import tqdm
from opencl_image_preprocess import OIP


if __name__ == "__main__":
    src_width = 192
    src_height = 320
    dst_width = 640
    dst_height = 640
    bgr_mean = (0,0,0)
    bgr_std = (1.0, 1.0, 1.0)
    np.random.seed(1)

    map_x = np.random.randint(0, src_width, (dst_height, dst_width)).astype(np.float32)
    map_y = np.random.randint(0, src_height, (dst_height, dst_width)).astype(np.float32)

    image_array = np.zeros((src_height, src_width, 3))
    image_array[:, :, 0] = 250
    image_array[:, :, 1] = 150
    image_array[:, :, 2] = 50
    image_array = image_array.astype(np.uint8)  # astype returns a new array, so assign it back
    
    begin_time = time.time()
    num = 1000
    for i in tqdm.trange(num):
        opencl_out = OIP(image_array, map_x, map_y, bgr_mean, bgr_std)
    print("fps : {:.4f}".format(num / (time.time() - begin_time)))
    
    print("opencl_out shape is {}".format(opencl_out.shape))
    img_save = (opencl_out.reshape(-1, opencl_out.shape[-1]) * 255).astype(np.uint8)
    cv2.imwrite("kernel_valid.jpg", img_save)

Output: The py directory will contain kernel_valid.jpg. Pixels are scaled by 255× and saved in RGB order for better visibility.
