5.2.4 Speech Input → LLM Output
Feature Introduction
This section describes how to combine Automatic Speech Recognition (ASR) with a Large Language Model (LLM) to build a complete inference pipeline: speech input → text transcription → text understanding → text output. By pairing a local speech recognition engine with a local LLM served by Ollama, you can build an intelligent voice interaction system that runs entirely offline, with no internet connection required.
One-Click Deployment (Optional)
We provide a one-click deployment package that lets you install and run the feature quickly.
Please make sure the device firmware version is ≥ 2.2. Firmware download address: https://archive.spacemit.com/image/k1/version/bianbu/
Installation
sudo apt update
sudo apt install asr-llm
Startup
# Enter in terminal:
voice
The first run automatically downloads the Automatic Speech Recognition (ASR) model; the cache directory is:
~/.cache/sensevoice
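If you want to confirm whether the ASR model has already been cached (for example, before taking the device offline), a quick check like the following can help. This is a minimal sketch; the cache path is the one given above.

```python
# Minimal sketch: check whether the SenseVoice ASR model cache already exists.
import os

cache_dir = os.path.expanduser("~/.cache/sensevoice")

if os.path.isdir(cache_dir) and os.listdir(cache_dir):
    print(f"ASR model cache found: {cache_dir}")
else:
    print("ASR model cache is empty; it will be downloaded on the first run of `voice`.")
```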
Preparation
If you need to run the demo manually from source code, follow the steps below.
Clone Code
git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP
Install Environment Dependencies
sudo apt install python3-venv
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Model Creation
sudo apt install wget
wget https://modelscope.cn/models/second-state/Qwen2.5-0.5B-Instruct-GGUF/resolve/master/Qwen2.5-0.5B-Instruct-Q4_0.gguf -P ./
wget https://archive.spacemit.com/spacemit-ai/modelfile/qwen2.5:0.5b.modelfile -P ./
wget http://archive.spacemit.com/spacemit-ai/gguf/qwen2.5-0.5b-fc-q4_0.gguf -P ./
wget http://archive.spacemit.com/spacemit-ai/modelfile/qwen2.5-0.5b-fc.modelfile -P ./
ollama create qwen2.5:0.5b -f qwen2.5:0.5b.modelfile
ollama create qwen2.5-0.5b-fc -f qwen2.5-0.5b-fc.modelfile
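To confirm that both models were registered with the local Ollama service, you can query its REST API. The sketch below assumes the default endpoint http://localhost:11434 and is not part of the demo itself.

```python
# Sketch: ask the local Ollama service which models it knows about and check
# that the two models created above ("qwen2.5:0.5b" and "qwen2.5-0.5b-fc") are present.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()

installed = {m["name"] for m in resp.json().get("models", [])}
for name in ("qwen2.5:0.5b", "qwen2.5-0.5b-fc"):
    status = "OK" if any(n.startswith(name) for n in installed) else "MISSING"
    print(f"{name}: {status}")
```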
Detect Recording Device
Refer to the Recording Device Detection section to check the available recording devices in the system.
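If that section is not at hand, one quick way to list capture devices from Python is with the sounddevice package. This is only an assumption for illustration: the demo script may use a different audio library, so cross-check the index it reports against the Recording Device Detection section.

```python
# Hedged sketch: list audio input devices and their indices so you can pick the
# one to set in the demo script. Requires `pip install sounddevice`; the index
# shown here may differ from the one used by the demo's own audio backend.
import sounddevice as sd

for index, dev in enumerate(sd.query_devices()):
    if dev["max_input_channels"] > 0:  # keep only devices that can record
        print(f"index {index}: {dev['name']} ({dev['max_input_channels']} input channels)")
```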
Run Code
After detecting the recording device, change the device index in the script to the actual index on your system (the default in the script is 3). Then execute the following command to run the complete speech-to-text → large language model inference pipeline:
python 06_asr_llm_demo.py
After the user speaks, the system will:
- Automatically record and perform speech recognition (integrated VAD)
- Pass the recognized text to the locally deployed large language model (such as Qwen)
- Display the language model's inference result as the output (a simplified sketch of this text → LLM step is shown below)
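The following is a minimal, hedged sketch of the "recognized text → local LLM → answer" step. It assumes the Ollama service is running locally on its default port and that the qwen2.5:0.5b model was created as described above; the actual demo script (06_asr_llm_demo.py) may structure this differently.

```python
# Sketch of the text -> LLM step: send ASR output to the local Ollama server
# via its REST API and print the model's reply.
import requests

def ask_llm(recognized_text: str, model: str = "qwen2.5:0.5b") -> str:
    """Send recognized text to the local Ollama server and return the reply."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": recognized_text, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    # In the real pipeline this string comes from the speech recognizer.
    print(ask_llm("What is the weather usually like in spring?"))
```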