5.2.4 Speech Input, LLM Output

Feature Introduction

This section describes how to integrate Automatic Speech Recognition (ASR) with a Large Language Model (LLM) to build a complete inference pipeline: speech input → text transcription → text understanding → text output. By combining a local speech recognition engine with a local LLM served by Ollama, you can build an intelligent speech interaction system that runs entirely offline, with no internet connection required.

One-Click Deployment (Optional)

We provide a one-click deployment package for rapid installation and execution.

Please ensure the device firmware version is ≥ 2.2. Firmware download address: https://archive.spacemit.com/image/k1/version/bianbu/

Installation

sudo apt update
sudo apt install asr-llm

Startup

# Enter in terminal:
voice

On the first run, the Automatic Speech Recognition (ASR) model is downloaded automatically; the cache directory is located at:

~/.cache/sensevoice

Preparation

If you prefer to run the demo manually from source code, follow these steps:

Clone Code

git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP

Install Environment Dependencies

sudo apt install python3-venv

python3 -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt

Model Creation

sudo apt install wget
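# Download the base chat model (Qwen2.5-0.5B-Instruct GGUF) and its modelfile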
wget https://modelscope.cn/models/second-state/Qwen2.5-0.5B-Instruct-GGUF/resolve/master/Qwen2.5-0.5B-Instruct-Q4_0.gguf -P ./
wget https://archive.spacemit.com/spacemit-ai/modelfile/qwen2.5:0.5b.modelfile -P ./

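# Download the qwen2.5-0.5b-fc variant and its modelfile, then register both models with Ollama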
wget http://archive.spacemit.com/spacemit-ai/gguf/qwen2.5-0.5b-fc-q4_0.gguf -P ./
wget http://archive.spacemit.com/spacemit-ai/modelfile/qwen2.5-0.5b-fc.modelfile -P ./
ollama create qwen2.5:0.5b -f qwen2.5:0.5b.modelfile
ollama create qwen2.5-0.5b-fc -f qwen2.5-0.5b-fc.modelfile
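
Optionally, verify that both models were registered and respond as expected using the standard Ollama CLI:

ollama list
ollama run qwen2.5:0.5b "Hello"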

Detect Recording Device

Refer to the Recording Device Detection section to check the available recording devices in the system.
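
As a quick check, the snippet below prints the input devices visible to PyAudio together with their indices. This is only a sketch and assumes the demo environment uses PyAudio for audio capture; if your setup uses a different audio backend, rely on the Recording Device Detection section instead.

# List recording (input) devices and their indices (assumes PyAudio is installed)
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(f"index {i}: {info['name']}")
pa.terminate()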

Run Code

After detecting the recording device, modify the device index in the script to match the actual index on your system (the default index in the file is 3). Then execute the following command to run the complete speech-to-text → large model inference pipeline:

python 06_asr_llm_demo.py

After the user speaks, the system will:

  1. Automatically record and perform speech recognition (integrated VAD)
  2. Pass the recognized text to the locally deployed large language model (such as Qwen)
  3. Return the language model's inference results and display the output
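
For reference, the same pipeline can be expressed in a few lines of Python. The sketch below is illustrative only and makes several assumptions: the SenseVoice model is loaded through the funasr package, the LLM is queried through the ollama Python client, and a recording has already been saved to recording.wav (a hypothetical file name). The actual 06_asr_llm_demo.py handles microphone capture and VAD itself and may differ in structure and parameters.

# Minimal sketch of the ASR -> LLM pipeline (for illustration only).
# Assumptions: the funasr and ollama Python packages are installed, and a
# recording has already been captured to recording.wav; microphone capture
# and VAD from the actual demo are omitted here.
from funasr import AutoModel
import ollama

# 1. Speech recognition with SenseVoice (the model is downloaded on first use)
asr_model = AutoModel(model="iic/SenseVoiceSmall")
asr_result = asr_model.generate(input="recording.wav", language="auto")
recognized_text = asr_result[0]["text"]
print("Recognized:", recognized_text)

# 2. Pass the recognized text to the locally deployed Ollama model
response = ollama.chat(
    model="qwen2.5:0.5b",
    messages=[{"role": "user", "content": recognized_text}],
)

# 3. Display the language model's reply
print("LLM:", response["message"]["content"])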