
5.2.7 Vision Language Model

Feature Introduction

This chapter describes how to use a Vision-Language Model (VLM) for image understanding and text generation. Taking SmolVLM as an example, the model accepts an image as input, produces natural-language output, and supports local offline inference.

Clone Repository

git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP

Install Dependencies

Install Model and Ollama Tools

sudo apt install spacemit-ollama-toolkit

Verify installation:

ollama list

If the command prints the column headers NAME, ID, SIZE, and MODIFIED, the installation was successful.

Verify the toolkit version:

sudo apt show spacemit-ollama-toolkit

The version must be 0.0.8 or above to support the vision-language model SmolVLM.

Download and prepare SmolVLM model files:

wget https://archive.spacemit.com/spacemit-ai/gguf/mmproj-SmolVLM-256M-Instruct-Q8_0.gguf
wget https://archive.spacemit.com/spacemit-ai/gguf/SmolVLM-256M-Instruct-f16.gguf
wget https://archive.spacemit.com/spacemit-ai/modelfile/smolvlm.modelfile
ollama create smolvlm:256m -f smolvlm.modelfile

⚠️ If you need to use a different model, modify the contents of the modelfile accordingly.
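The downloaded smolvlm.modelfile is the authoritative definition; its exact contents are not reproduced here. As a rough, assumed illustration only (following common Ollama Modelfile conventions for importing GGUF vision models, where both the language model and the mmproj projector file are referenced), it typically looks something like this:

# Assumed sketch of a SmolVLM modelfile, not the file shipped by SpacemiT
FROM ./SmolVLM-256M-Instruct-f16.gguf
FROM ./mmproj-SmolVLM-256M-Instruct-Q8_0.gguf

To switch models, you would point these FROM lines at the replacement GGUF files and run ollama create again with a new model name.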

Install Python Environment Dependencies

sudo apt install python3-venv python3-pip

python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

Execute Inference Task

Run the following command to perform vision-language model inference on a local image:

python 08_vision_demo.py --image=bus.jpg --stream=True --prompt="describe this image"

The model generates a natural-language description of the input image bus.jpg according to the text prompt "describe this image".
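The source of 08_vision_demo.py is not reproduced here. As a minimal sketch of the kind of request such a script can make, the example below queries the local Ollama server directly over its standard REST API (/api/generate on port 11434, with images passed as base64 strings); the function name and the use of the requests package are assumptions for illustration, not necessarily what the demo script does.

# Assumed minimal example: ask the locally served smolvlm:256m model to describe an image
import base64
import requests  # install with `pip install requests` if it is not already available

OLLAMA_URL = "http://localhost:11434/api/generate"

def describe_image(image_path: str, prompt: str, model: str = "smolvlm:256m") -> str:
    # Ollama expects images as base64-encoded strings in the "images" field.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    payload = {
        "model": model,
        "prompt": prompt,
        "images": [image_b64],
        "stream": False,  # set to True to receive the answer as incremental JSON lines
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=300)
    response.raise_for_status()
    return response.json()["response"]

if __name__ == "__main__":
    print(describe_image("bus.jpg", "describe this image"))

With stream set to True, the server returns one JSON object per generated chunk instead of a single response, which is how the --stream=True option of the demo command behaves conceptually.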