5.2.7 Vision Language Model
Feature Introduction
This section describes how to use a Vision-Language Model (VLM) for image-understanding and text-generation tasks. SmolVLM is used as the example model: it accepts an image as input, produces natural-language output, and supports local offline inference.
Clone Repository
git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP
Install Dependencies
Install Model and Ollama Tools
sudo apt install spacemit-ollama-toolkit
Verify installation:
ollama list
If the output shows the column headers NAME, ID, SIZE, and MODIFIED, the installation was successful.
Verify the version:
sudo apt show spacemit-ollama-toolkit
The version must be 0.0.8 or above to support the vision-language model SmolVLM.
Download and prepare SmolVLM model files:
wget https://archive.spacemit.com/spacemit-ai/gguf/mmproj-SmolVLM-256M-Instruct-Q8_0.gguf
wget https://archive.spacemit.com/spacemit-ai/gguf/SmolVLM-256M-Instruct-f16.gguf
wget https://archive.spacemit.com/spacemit-ai/modelfile/smolvlm.modelfile
ollama create smolvlm:256m -f smolvlm.modelfile
⚠️ If you need to use a different model, modify the contents of the modelfile accordingly.
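To confirm that the newly created model responds, you can query it through Ollama's local REST API. The snippet below is a minimal sketch, assuming the toolkit exposes the standard Ollama API on localhost:11434 (start the service with ollama serve if it is not already running):

# Minimal sanity check: ask the newly created model a text-only question
# via Ollama's local REST API (assumed to listen on the default port 11434).
import json
import urllib.request

payload = {
    "model": "smolvlm:256m",
    "prompt": "Reply with one short sentence confirming you are ready.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])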
Install Python Environment Dependencies
sudo apt install python3-venv python3-pip
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Execute Inference Task
Run the following command to execute vision-language model inference on local images:
python 08_vision_demo.py --image=bus.jpg --stream=True --prompt="describe this image"
The model outputs a natural-language description based on the input image bus.jpg and the text prompt "describe this image".
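For reference, the following is a minimal sketch of what a script like 08_vision_demo.py might do: encode the local image, send it together with the prompt to the smolvlm:256m model through Ollama's local REST API, and stream the generated text. This is an illustration under those assumptions, not the repository's actual implementation:

# Sketch of a vision-language inference call against a locally served model.
import argparse
import base64
import json
import urllib.request

parser = argparse.ArgumentParser(description="Describe a local image with a VLM served by Ollama")
parser.add_argument("--image", required=True, help="path to the input image")
parser.add_argument("--prompt", default="describe this image", help="text prompt")
parser.add_argument("--stream", default="True", help="True to print tokens as they are generated")
args = parser.parse_args()
stream = args.stream.lower() == "true"

# Ollama's API expects images as base64-encoded strings.
with open(args.image, "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "smolvlm:256m",
    "prompt": args.prompt,
    "images": [image_b64],
    "stream": stream,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    if stream:
        # Streamed responses arrive as one JSON object per line.
        for line in resp:
            print(json.loads(line).get("response", ""), end="", flush=True)
        print()
    else:
        print(json.loads(resp.read())["response"])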