
5.1.4 Speech Input with LLM Output

Last Updated: 11/09/2025

Overview

This section introduces how to integrate Automatic Speech Recognition (ASR) with Large Language Models (LLMs) to build a complete inference pipeline:

speech input → text transcription → text processing → text output

By combining a local ASR engine with LLMs deployed via Ollama, you can build an intelligent voice interaction system that runs entirely offline.

One-Click Deployment (Optional)

We provide an installation package for fast setup.

Install Package

sudo apt update
sudo apt install asr-llm

Start

# Enter in terminal:
voice

On first run, the ASR model is downloaded automatically and cached at:

~/.cache/sensevoice

Manual Setup

If you prefer to run from source, follow these steps:

Clone Code

git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP

Install Environment Dependencies

sudo apt install python3-venv

python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

Model Creation

Download model files (.gguf) and their corresponding Modelfiles (.modelfile):

sudo apt install wget
wget https://modelscope.cn/models/second-state/Qwen2.5-0.5B-Instruct-GGUF/resolve/master/Qwen2.5-0.5B-Instruct-Q4_0.gguf -P ./
wget https://archive.spacemit.com/spacemit-ai/modelfile/qwen2.5:0.5b.modelfile -P ./

wget http://archive.spacemit.com/spacemit-ai/gguf/qwen2.5-0.5b-fc-q4_0.gguf -P ./
wget http://archive.spacemit.com/spacemit-ai/modelfile/qwen2.5-0.5b-fc.modelfile -P ./

Create models using Ollama:

ollama create qwen2.5:0.5b -f qwen2.5:0.5b.modelfile
ollama create qwen2.5-0.5b-fc -f qwen2.5-0.5b-fc.modelfile
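
To confirm that the models were created, you can query the local Ollama server. The snippet below is a minimal sketch, assuming Ollama is running on its default port (11434) and that the requests package is installed; the test prompt is illustrative.

import requests

OLLAMA = "http://localhost:11434"

# List the models registered with the local Ollama server
tags = requests.get(f"{OLLAMA}/api/tags").json()
print("Available models:", [m["name"] for m in tags.get("models", [])])

# Send a short test prompt to one of the newly created models
resp = requests.post(
    f"{OLLAMA}/api/generate",
    json={"model": "qwen2.5:0.5b", "prompt": "Hello!", "stream": False},
)
print(resp.json()["response"])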

Detect Recording Device

Follow the instructions in the Detect System Recording Devices section to identify the recording devices available on your system.
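
If you prefer to check from Python, the sketch below lists the audio input devices and their indices; it assumes the sounddevice package is available in your environment (it is not necessarily part of requirements.txt). The index printed here is the value used in the next step.

import sounddevice as sd

# Print every device that exposes at least one input channel, together with its index
for idx, dev in enumerate(sd.query_devices()):
    if dev["max_input_channels"] > 0:
        print(f"index {idx}: {dev['name']} ({dev['max_input_channels']} input channels)")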

Run the Pipeline

After detecting the recording device, update the device index in the code to match your system (the default is 3). Then run the complete speech recognition → LLM inference pipeline:

python 06_asr_llm_demo.py

After speaking into the microphone, the system will:

  1. Automatically record and transcribe your speech (with integrated VAD).
  2. Send the recognized text to the local LLM (e.g., Qwen).
  3. Display the LLM's response as the inference result.
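
For reference, the following minimal sketch outlines this flow; it is not the contents of 06_asr_llm_demo.py. It assumes the sounddevice and soundfile packages are available, uses a placeholder transcribe() function standing in for the demo's ASR engine (SenseVoice), and sends the recognized text to the local Ollama server; the device index, recording length, and model name are assumptions you should adjust.

import requests
import sounddevice as sd
import soundfile as sf

DEVICE_INDEX = 3      # replace with the index found in the previous step
SAMPLE_RATE = 16000
SECONDS = 5

def transcribe(wav_path):
    # Placeholder: call your ASR engine (e.g., SenseVoice) here and return the text
    raise NotImplementedError

# 1. Record a short utterance from the microphone
audio = sd.rec(int(SECONDS * SAMPLE_RATE), samplerate=SAMPLE_RATE,
               channels=1, device=DEVICE_INDEX)
sd.wait()
sf.write("utterance.wav", audio, SAMPLE_RATE)

# 2. Transcribe the recording to text
text = transcribe("utterance.wav")
print("You said:", text)

# 3. Send the recognized text to the local LLM and print its reply
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen2.5:0.5b", "prompt": text, "stream": False},
)
print("LLM:", resp.json()["response"])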