5.1.4 Speech Input with LLM Output
Last updated: 11/09/2025
Overview
This section introduces how to integrate Automatic Speech Recognition (ASR) with a Large Language Model (LLM) to build a complete inference pipeline:
speech input → text transcription → text processing → text output
By combining a local ASR engine with LLMs deployed via Ollama, you can build an intelligent voice interaction system that runs entirely offline.
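Conceptually, the pipeline is a chain of three stages. The Python sketch below only names the data flow; all three helpers are hypothetical placeholders, and the rest of this section builds the real pieces.

# Data flow only; all three helper names are hypothetical placeholders.
def transcribe(audio: bytes) -> str: ...   # speech input -> text transcription
def infer(prompt: str) -> str: ...         # text processing by the local LLM
def display(answer: str) -> None: ...      # text output
# A single turn is then: display(infer(transcribe(recorded_audio)))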
One-Click Deployment (Optional)
We provide an installation package for fast setup.
- Firmware requirement: Version ≥ 2.2
- Download firmware: https://archive.spacemit.com/image/k1/version/bianbu/
Install Package
sudo apt update
sudo apt install asr-llm
Start
# Enter in terminal:
voice
On first run, the ASR model is downloaded automatically and cached at:
~/.cache/sensevoice
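To confirm the model is cached (for example, before going fully offline), you can inspect that directory. This short check assumes only the default cache path above:

from pathlib import Path

cache_dir = Path.home() / ".cache" / "sensevoice"
if cache_dir.is_dir():
    # Print every cached file with its size in MB.
    for f in sorted(cache_dir.rglob("*")):
        if f.is_file():
            print(f"{f.relative_to(cache_dir)} ({f.stat().st_size / 1e6:.1f} MB)")
else:
    print("No cached model yet; it will be downloaded on first run.")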
Manual Setup
If you prefer to run from source, follow these steps:
Clone Code
git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP
Install Environment Dependencies
sudo apt install python3-venv
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Model Creation
Download the model files (.gguf) and their corresponding Modelfiles (.modelfile):
sudo apt install wget
wget https://modelscope.cn/models/second-state/Qwen2.5-0.5B-Instruct-GGUF/resolve/master/Qwen2.5-0.5B-Instruct-Q4_0.gguf -P ./
wget https://archive.spacemit.com/spacemit-ai/modelfile/qwen2.5:0.5b.modelfile -P ./
wget http://archive.spacemit.com/spacemit-ai/gguf/qwen2.5-0.5b-fc-q4_0.gguf -P ./
wget http://archive.spacemit.com/spacemit-ai/modelfile/qwen2.5-0.5b-fc.modelfile -P ./
Create models using Ollama:
ollama create qwen2.5:0.5b -f qwen2.5:0.5b.modelfile
ollama create qwen2.5-0.5b-fc -f qwen2.5-0.5b-fc.modelfile
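Once both models are created, a quick smoke test confirms that Ollama can serve them. This sketch assumes the ollama Python package (pip install ollama) and a running Ollama service:

# Smoke test: ask each newly created model for a one-word reply.
# Assumes the `ollama` Python package and a running Ollama service.
import ollama

for model in ("qwen2.5:0.5b", "qwen2.5-0.5b-fc"):
    reply = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    )
    print(f"{model}: {reply['message']['content'].strip()}")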
Detect Recording Device
Follow the instructions in the Detect System Recording Devices section to list the recording devices available on your system.
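If that section is not at hand, you can also enumerate input devices directly. The sketch below uses the sounddevice package, which is an assumption made for illustration (the demo itself may enumerate devices differently). Note the index of your microphone for the next step.

# List audio input devices and their indices (assumes `sounddevice` is installed).
import sounddevice as sd

for idx, dev in enumerate(sd.query_devices()):
    if dev["max_input_channels"] > 0:  # keep only devices that can record
        print(f"index {idx}: {dev['name']} ({dev['max_input_channels']} input channels)")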
Run the Pipeline
First, change the device index in the code to match the recording device you detected (the default is 3). Then execute the following command to run the complete speech-to-text → LLM inference pipeline:
python 06_asr_llm_demo.py
After speaking into the microphone, the system will:
- Automatically record and transcribe your speech (with integrated voice activity detection, VAD).
- Send the recognized text to the local LLM (e.g., Qwen).
- Display the LLM's response as the inference result.
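For orientation, here is a condensed sketch of that loop. transcribe_once() is a hypothetical stand-in for the demo's VAD-gated recording and SenseVoice transcription (typed input substitutes for the microphone so the sketch runs anywhere); the LLM stage uses the ollama client with the qwen2.5:0.5b model created earlier.

# Condensed sketch of the demo loop: capture text, query the local LLM, print.
# `transcribe_once()` is hypothetical; typed input stands in for the microphone.
import ollama

DEVICE_INDEX = 3  # set to the index found in the device-detection step

def transcribe_once(device_index: int) -> str:
    # Real demo: record from `device_index` with VAD, then run SenseVoice ASR.
    return input("You (type to simulate speech, empty line to quit): ")

while True:
    text = transcribe_once(DEVICE_INDEX).strip()
    if not text:
        break
    reply = ollama.chat(
        model="qwen2.5:0.5b",
        messages=[{"role": "user", "content": text}],
    )
    print("LLM:", reply["message"]["content"])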