5.1.7 Vision Language Model (VLM)

Last Version: 11/09/2025

Overview

This section explains how to use Vision-Language Models (VLMs) to understand images and generate text.

This guide uses SmolVLM as the example model: it takes an image as input and produces natural language output. It can also run locally, with no internet connection required.

Clone Repository

Get the project files.

git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP

Install Dependencies

Install Model and Ollama Tools

  1. Install the toolkit required to run the SmolVLM model.

    sudo apt install spacemit-ollama-toolkit
  2. Verify installation by checking the list of available models:

    ollama list

    If the output shows the header row NAME ID SIZE MODIFIED, the installation was successful.

  3. Check the installed version:

    sudo apt show spacemit-ollama-toolkit

    The version must be 0.0.8 or higher, which is required to support the SmolVLM vision-language model.
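The version requirement above can also be checked from a script. The following is a minimal sketch, not part of the demo repository: it assumes a Debian-style system where `dpkg-query` is available, and the helper names (`parse_version`, `version_ok`, `installed_version`) are hypothetical.

```python
import re
import subprocess

MIN_VERSION = (0, 0, 8)  # minimum version stated in this guide


def parse_version(text):
    """Turn a version string such as '0.0.8' or '1:0.0.10-1' into a tuple of ints."""
    upstream = text.split(":", 1)[-1]   # drop a Debian epoch prefix, if any
    upstream = upstream.split("-")[0]   # drop a Debian revision suffix, if any
    return tuple(int(n) for n in re.findall(r"\d+", upstream)[:3])


def version_ok(text, minimum=MIN_VERSION):
    """True when the version string is at least the required minimum."""
    return parse_version(text) >= minimum


def installed_version(package="spacemit-ollama-toolkit"):
    """Ask dpkg for the installed version; returns '' if it cannot be determined."""
    try:
        result = subprocess.run(
            ["dpkg-query", "-W", "-f=${Version}", package],
            capture_output=True, text=True,
        )
    except FileNotFoundError:  # dpkg-query is not available on this system
        return ""
    return result.stdout.strip() if result.returncode == 0 else ""


if __name__ == "__main__":
    version = installed_version()
    if not version:
        print("spacemit-ollama-toolkit does not appear to be installed")
    elif version_ok(version):
        print(f"Version {version} is new enough for SmolVLM")
    else:
        print(f"Version {version} is older than 0.0.8; upgrade the toolkit")
```

The comparison is done on integer tuples rather than strings, so that a version such as 0.0.10 is correctly treated as newer than 0.0.8.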

Download and Create SmolVLM Model

Download the files needed for the SmolVLM model, then create the model in Ollama:

wget https://archive.spacemit.com/spacemit-ai/gguf/mmproj-SmolVLM-256M-Instruct-Q8_0.gguf
wget https://archive.spacemit.com/spacemit-ai/gguf/SmolVLM-256M-Instruct-f16.gguf
wget https://archive.spacemit.com/spacemit-ai/modelfile/smolvlm.modelfile
ollama create smolvlm:256m -f smolvlm.modelfile

⚠️ Note: To use a different model in the future, edit the modelfile so that it points to the new model.
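After `ollama create` finishes, it can be useful to confirm that `smolvlm:256m` now appears in the `ollama list` output. The sketch below is one possible way to do that; it assumes the listing keeps the NAME column first, as in the header described earlier, and the helper names are hypothetical.

```python
import subprocess


def model_names(listing):
    """Return the NAME column from `ollama list` output, skipping the header row."""
    names = []
    for line in listing.splitlines()[1:]:
        fields = line.split()
        if fields:
            names.append(fields[0])
    return names


def is_registered(model, listing):
    """True when the given model tag appears in the listing text."""
    return model in model_names(listing)


def current_listing():
    """Run `ollama list`; returns '' if the CLI is missing, fails, or times out."""
    try:
        result = subprocess.run(
            ["ollama", "list"], capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return ""
    return result.stdout if result.returncode == 0 else ""


if __name__ == "__main__":
    if is_registered("smolvlm:256m", current_listing()):
        print("smolvlm:256m is registered with Ollama")
    else:
        print("smolvlm:256m was not found; re-run the ollama create step")
```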

Install Python Environment Dependencies

Set up a Python virtual environment and install the Python packages required to run the demos.

sudo apt install python3-venv python3-pip

python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt
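If packages seem to be missing later, a common cause is that the virtual environment was not activated in the current shell. A small check like the one below, run with `python`, reports whether the interpreter belongs to a virtual environment; the function name `in_virtualenv` is only illustrative.

```python
import sys


def in_virtualenv():
    """True when the running interpreter belongs to a venv rather than the system Python."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Using virtual environment at {sys.prefix}")
    else:
        print("No virtual environment active; run: source .venv/bin/activate")
```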

Run Inference

Run the following command to execute VLM inference on local images:

python 08_vision_demo.py --image=bus.jpg --stream=True --prompt="describe this image"

The model then outputs a natural language description of the input image bus.jpg, generated in response to the text prompt "describe this image".
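For readers who want to call the model from their own code, the following sketch sends an image and a prompt to Ollama's `/api/generate` REST endpoint. It is not the implementation of 08_vision_demo.py, whose internals are not shown here. It assumes the Ollama server from the toolkit is running and listening on Ollama's default address `http://localhost:11434`; `build_request` and `describe_image` are hypothetical helper names.

```python
import base64
import json
import urllib.request

# Ollama's default local endpoint; assumed unchanged by the SpacemiT toolkit.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(image_bytes, prompt, model="smolvlm:256m"):
    """Build the JSON body for /api/generate; images are sent base64-encoded."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete JSON reply instead of chunks
    }


def describe_image(path, prompt="describe this image"):
    """Send the image at `path` to the local model and return the generated text."""
    with open(path, "rb") as f:
        body = build_request(f.read(), prompt)
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With the server running and bus.jpg in the current directory, `print(describe_image("bus.jpg"))` would print the model's description. Streaming is disabled here to keep the sketch short; the demo command above enables it with `--stream=True`.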