
5.1.2 Speech to Text (ASR)

Last updated: 11/09/2025

Overview

This guide explains how to use Automatic Speech Recognition (ASR) to convert spoken words into text. The process involves capturing audio from a microphone and using a model to transcribe it automatically.

Project repository: Bianbu AI Demo Zoo | NLP

Preparation

Clone Code

Clone the repository and change to the NLP examples directory:

git clone https://gitee.com/bianbu/spacemit-demo.git
cd spacemit-demo/examples/NLP

Install Environment Dependencies

It is recommended to use a virtual environment for dependency isolation:

# Install the virtual environment package
sudo apt install python3-venv

# Create and activate the virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install project dependencies
pip install -r requirements.txt
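Before running the demo, it can help to confirm that the virtual environment is active and that every package from requirements.txt was installed. The sketch below uses only the standard library (`importlib.metadata`); the helper names are illustrative and not part of the repository, and the simple parser skips pip options such as `-r` and does not handle VCS or URL requirements.

```python
import re
import sys
from importlib import metadata
from pathlib import Path


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base interpreter.
    return sys.prefix != sys.base_prefix


def requirement_names(text: str) -> list[str]:
    """Extract distribution names from requirements.txt content.

    Strips comments, blank lines, version specifiers, extras and markers;
    skips pip options such as "-r other.txt" or "-e .".
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_requirements(names: list[str]) -> list[str]:
    """Return the names that are not installed for the current interpreter."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Virtual environment active:", in_virtualenv())
    req = Path("requirements.txt")
    if req.exists():
        missing = missing_requirements(requirement_names(req.read_text(encoding="utf-8")))
        print("Missing packages:", ", ".join(missing) if missing else "none")
    else:
        print("requirements.txt not found; run this from examples/NLP")
```

Run it from `spacemit-demo/examples/NLP` after `source .venv/bin/activate`; a non-empty "Missing packages" line suggests `pip install -r requirements.txt` did not complete.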

Detect System Recording Devices

Follow the instructions in the Detect System Recording Devices section to list the recording devices available on the system.
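On ALSA systems, `arecord -l` (from the alsa-utils package) lists capture hardware as card/device pairs. The sketch below runs it and parses each entry. It is not the repository's search_device.py, and this guide does not state whether the demo's `device_index` is the ALSA card number or an index assigned by an audio library such as PortAudio, so treat the two as possibly different.

```python
import re
import shutil
import subprocess

# Matches arecord -l entries such as:
# card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
CARD_LINE = re.compile(
    r"^card (?P<card>\d+): (?P<card_id>\S+) \[(?P<card_name>[^\]]*)\], "
    r"device (?P<device>\d+): (?P<dev_name>.*?) \[(?P<dev_desc>[^\]]*)\]"
)


def parse_arecord_list(output: str) -> list[dict]:
    """Turn `arecord -l` output into a list of capture-device dicts."""
    devices = []
    for line in output.splitlines():
        m = CARD_LINE.match(line.strip())
        if m:
            devices.append({
                "card": int(m["card"]),
                "device": int(m["device"]),
                "name": m["card_name"],
                # ALSA hardware name usable with e.g. `arecord -D hw:1,0`
                "alsa_name": f"hw:{m['card']},{m['device']}",
            })
    return devices


if __name__ == "__main__":
    if shutil.which("arecord") is None:
        print("arecord not found; install the alsa-utils package")
    else:
        result = subprocess.run(["arecord", "-l"], capture_output=True, text=True, check=False)
        for dev in parse_arecord_list(result.stdout):
            print(dev)
```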

Execute Example Code

Run the ASR example:

python 03_asr_demo.py

After the program starts, it waits for your command:

  • Press Enter to begin recording.
  • The built-in Voice Activity Detection (VAD) automatically detects human speech and stops recording once silence is detected (see sld in the parameter table below).
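This guide does not describe how the demo's VAD works internally. As a stand-in, the sketch below uses a simple RMS energy threshold to illustrate the stopping rule implied by the sld and max_time parameters: recording ends when silence after speech lasts at least sld seconds, or when max_time is reached. The function names and the threshold value are hypothetical; a model-based VAD would replace the `frame_rms(frame) >= threshold` test.

```python
import math
from collections.abc import Iterable, Sequence


def frame_rms(frame: Sequence[int]) -> float:
    """Root-mean-square level of one frame of int16 samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def record_until_silence(
    frames: Iterable[Sequence[int]],
    rate: int,
    sld: float,
    max_time: float,
    threshold: float = 500.0,  # hypothetical speech level; tune per microphone
) -> list[Sequence[int]]:
    """Collect mono frames until silence lasts >= sld seconds or max_time is hit.

    sld = 0 disables the silence check, so only max_time stops recording.
    """
    # Convert time limits to sample counts once, so the comparisons
    # below use integers and avoid floating-point drift.
    silence_limit = round(sld * rate)
    max_samples = round(max_time * rate)

    recorded: list[Sequence[int]] = []
    elapsed = 0          # samples recorded so far
    silence = 0          # consecutive silent samples since speech began
    heard_speech = False

    for frame in frames:
        recorded.append(frame)
        elapsed += len(frame)
        if frame_rms(frame) >= threshold:
            heard_speech = True
            silence = 0
        elif heard_speech:
            silence += len(frame)

        if elapsed >= max_samples:
            break  # max_time reached
        if silence_limit > 0 and silence >= silence_limit:
            break  # speech ended: silence lasted at least sld seconds
    return recorded
```

Silence before the first speech frame does not count toward sld, which gives the speaker time to start talking; only max_time bounds that wait. In a real capture loop, `frames` would be successive mono int16 chunks read from the microphone.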

Parameter Description

| Parameter Name | Description | Usage |
|---|---|---|
| sld | Silence duration threshold (seconds) | Speech is considered finished when silence lasts ≥ sld seconds. Set to 0 to disable. |
| max_time | Maximum recording time (seconds) | Recording stops automatically after this duration, preventing overly long recordings. |
| channels | Number of audio channels | Usually set to 1 (mono); mono input is recommended for speech recognition. |
| rate | Sample rate (Hz) | Number of samples per second, e.g., 16000 or 48000. Must match the model's input sample rate. |
| device_index | Input device index | Selects the recording device. Find the index with arecord or search_device.py. |
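The parameters above can be grouped and sanity-checked before recording starts. This is a hypothetical helper, not part of the demo: the class name, the default values, and treating `device_index=None` as "use the default input device" are assumptions, and `model_rate` must be set to whatever sample rate the chosen ASR model actually expects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordingParams:
    sld: float = 1.0                     # silence threshold in seconds (0 disables)
    max_time: float = 10.0               # maximum recording time in seconds
    channels: int = 1                    # 1 = mono, recommended for ASR
    rate: int = 16000                    # sample rate in Hz
    device_index: Optional[int] = None   # assumption: None means default device

    def check(self, model_rate: int) -> list[str]:
        """Return a list of problems; an empty list means the settings look usable."""
        problems = []
        if self.sld < 0:
            problems.append("sld must be >= 0 (0 disables silence detection)")
        if self.max_time <= 0:
            problems.append("max_time must be > 0")
        if self.channels != 1:
            problems.append("channels should be 1 (mono) for speech recognition")
        if self.rate != model_rate:
            problems.append(f"rate {self.rate} Hz does not match the model's {model_rate} Hz")
        if self.device_index is not None and self.device_index < 0:
            problems.append("device_index must be a non-negative integer")
        return problems

    def max_bytes(self, sample_width: int = 2) -> int:
        """Upper bound on raw PCM size of one recording (16-bit samples by default)."""
        return round(self.max_time * self.rate) * self.channels * sample_width
```

For example, `RecordingParams(rate=48000).check(model_rate=16000)` reports a sample-rate mismatch, which would need resampling or a different `rate` before the audio reaches the model.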