5.1.7 Vision Language Model (VLM)
Last Updated: 11/09/2025
Overview
This section explains how to use a Vision-Language Model (VLM) to understand images and generate text.
Using SmolVLM as an example, the model takes an image as input and produces a natural-language description as output. Once the model has been downloaded, inference runs entirely on the local device, so no internet connection is required.
Clone Repository
Clone the demo repository and change into the NLP examples directory:
git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP
Install Dependencies
Install Model and Ollama Tools
- Install the toolkit required to run the SmolVLM model:
  sudo apt install spacemit-ollama-toolkit
- Verify the installation by listing the available models:
  ollama list
  If the output shows the header NAME ID SIZE MODIFIED, the installation was successful.
- Check the installed toolkit version:
  sudo apt show spacemit-ollama-toolkit
  Confirm that the version is 0.0.8 or higher; earlier versions do not support vision-language models such as SmolVLM.
Download and Create SmolVLM Model
Download the SmolVLM model weights, the multimodal projector (mmproj), and the modelfile, then register the model with Ollama:
wget https://archive.spacemit.com/spacemit-ai/gguf/mmproj-SmolVLM-256M-Instruct-Q8_0.gguf
wget https://archive.spacemit.com/spacemit-ai/gguf/SmolVLM-256M-Instruct-f16.gguf
wget https://archive.spacemit.com/spacemit-ai/modelfile/smolvlm.modelfile
ollama create smolvlm:256m -f smolvlm.modelfile
⚠️ Note: To use a different model in the future, edit the modelfile so that it points to the new model files.
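For reference, an Ollama modelfile is a short plain-text recipe. The sketch below only illustrates the general format; the actual contents of smolvlm.modelfile, in particular how the mmproj projector file is attached, are defined by the spacemit-ollama-toolkit and may differ:
# Illustrative modelfile sketch (assumed structure, not the shipped file)
FROM ./SmolVLM-256M-Instruct-f16.gguf
# Optional generation parameters use the standard Ollama PARAMETER syntax
PARAMETER temperature 0.7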
Install Python Environment Dependencies
Set up a Python virtual environment and install the Python packages required to run the demos.
sudo apt install python3-venv python3-pip
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Run Inference
Run the following command to execute VLM inference on local images:
python 08_vision_demo.py --image=bus.jpg --stream=True --prompt="describe this image"
After the command runs, the model outputs a natural-language description of the input image bus.jpg, guided by the text prompt "describe this image".
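Under the hood, the demo script talks to the locally running Ollama server. If you want to call the model from your own code, the following minimal sketch posts an image and a prompt to Ollama's documented REST API (the endpoint, default port 11434, and JSON fields are standard Ollama; the actual internals of 08_vision_demo.py may differ):
import base64
import requests  # assumed to be installed; pip install requests if needed

def describe_image(image_path, prompt, model="smolvlm:256m"):
    # Ollama expects attached images as base64-encoded strings.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Send a single (non-streaming) generation request to the local Ollama server.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "images": [image_b64],
            "stream": False,  # set to True to receive incremental JSON chunks instead
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(describe_image("bus.jpg", "describe this image"))
Setting "stream": True corresponds to the demo's --stream=True flag: the server then returns a sequence of JSON chunks that can be printed as they arrive instead of one final response.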