mutranscriber

Native Rust audio transcription using Qwen3-ASR, powered by Candle.

Features

  • Pure Rust implementation with no Python dependencies
  • Automatic model download from HuggingFace Hub
  • GPU acceleration via CUDA or Metal
  • Audio extraction from video files via GStreamer
  • Both CLI tool and library API

Installation

From source

# Standard build (includes GStreamer support)
cargo install --path .

# With GPU support
cargo install --path . --features cuda    # NVIDIA
cargo install --path . --features metal   # macOS

# Without GStreamer (library-only, no file loading)
cargo install --path . --no-default-features

With Nix

# CPU-only build
nix build github:cfcosta/mutranscriber#mutranscriber-cpu

# CUDA-enabled build (NixOS with NVIDIA drivers)
nix build github:cfcosta/mutranscriber#mutranscriber-cuda

# Run directly without installing
nix run github:cfcosta/mutranscriber#mutranscriber-cpu -- audio.wav

The Nix packages include all dependencies (GStreamer plugins, CUDA libraries) and work out of the box.

Requirements

  • Rust 1.70+
  • GStreamer development libraries (enabled by default)
  • For the cuda feature: the CUDA toolkit
  • For the metal feature: macOS with Metal support

Usage

CLI

# Transcribe an audio file
mutranscriber recording.wav

# Transcribe a video file
mutranscriber video.mp4

# Use the larger model
mutranscriber audio.wav --model large

# Force CPU mode
mutranscriber audio.wav --cpu

# Print to stdout only
mutranscriber audio.wav --stdout-only

# Custom output path
mutranscriber audio.wav --output transcript.txt

Library

use mutranscriber::{Transcriber, TranscriberConfig, ModelVariant};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create transcriber with default settings
    let transcriber = Transcriber::from_env();

    // Or with custom configuration
    let config = TranscriberConfig {
        variant: ModelVariant::Small,
        use_gpu: true,
        sample_rate: 16000,
        output_dir: None,
    };
    let transcriber = Transcriber::with_config(config);

    // Preload the model
    transcriber.preload().await?;

    // Transcribe raw audio samples (16 kHz mono f32); see the loader sketch below
    let audio_samples: Vec<f32> = load_audio_somehow();
    let text = transcriber.transcribe_audio(&audio_samples).await?;

    println!("{}", text);
    Ok(())
}
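
The load_audio_somehow call above is a placeholder. A minimal sketch of one way to fill it in, using the hound WAV crate (hound is an illustration here, not a mutranscriber dependency), assuming a 16 kHz mono 16-bit WAV:

// Hypothetical loader (not part of mutranscriber): reads a 16 kHz mono
// 16-bit PCM WAV and converts samples to f32 in [-1.0, 1.0].
fn load_audio_somehow() -> Vec<f32> {
    let mut reader = hound::WavReader::open("recording.wav").expect("open WAV");
    let spec = reader.spec();
    assert_eq!(spec.sample_rate, 16_000, "expected 16 kHz audio");
    assert_eq!(spec.channels, 1, "expected mono audio");
    reader
        .samples::<i16>()
        .map(|s| s.expect("read sample") as f32 / i16::MAX as f32)
        .collect()
}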

Model Variants

Variant   Parameters   VRAM   HuggingFace ID
Small     0.6B         ~2GB   Qwen/Qwen3-ASR-0.6B
Large     1.7B         ~4GB   Qwen/Qwen3-ASR-1.7B

Models are automatically downloaded from HuggingFace Hub on first use and cached locally.
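
Candle-based projects typically fetch weights through the hf-hub crate; the sketch below shows that common pattern (an assumption about this crate's internals, and the file name is illustrative):

use hf_hub::api::sync::Api;

// Sketch of the standard hf-hub pattern: get() downloads on first use
// and returns the locally cached path on subsequent calls.
fn fetch_weights() -> Result<std::path::PathBuf, Box<dyn std::error::Error>> {
    let api = Api::new()?;
    let repo = api.model("Qwen/Qwen3-ASR-0.6B".to_string());
    Ok(repo.get("model.safetensors")?)
}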

Audio Requirements

  • Sample rate: 16 kHz
  • Format: f32 mono samples
  • The library pads input to 30 seconds internally (matching the WhisperFeatureExtractor convention)

When using the gstreamer feature, audio is automatically extracted and resampled to this format from any container or codec GStreamer supports.
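
If you supply samples yourself (for example with the gstreamer feature disabled), the conversion is small. A minimal sketch for interleaved stereo 16-bit PCM, assuming the input is already at 16 kHz:

// Downmix interleaved stereo i16 PCM (already at 16 kHz) to mono f32.
fn stereo_i16_to_mono_f32(interleaved: &[i16]) -> Vec<f32> {
    interleaved
        .chunks_exact(2)
        .map(|lr| {
            let left = lr[0] as f32 / i16::MAX as f32;
            let right = lr[1] as f32 / i16::MAX as f32;
            0.5 * (left + right)
        })
        .collect()
}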

Build Features

Feature     Description                               Default
gstreamer   Audio extraction from video/audio files   Yes
cuda        NVIDIA GPU acceleration                   No
metal       Apple Metal GPU acceleration              No
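
Features combine with the usual Cargo flags; for example, a CUDA build without GStreamer file loading (assuming the two features are independent, as the table suggests):

# CUDA acceleration, no GStreamer
cargo install --path . --no-default-features --features cuda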

Development

# Enter dev environment (requires Nix)
nix develop

# Run tests
cargo test --lib

# Run integration tests (downloads model)
cargo test --test integration_test -- --ignored

# Format and lint
cargo fmt
cargo clippy

Architecture

Audio Input (16kHz f32)
    │
    ▼
Mel Spectrogram (128 bins, 30s padded)
    │
    ▼
Audio Encoder (Qwen3-AuT Transformer)
    │
    ▼
Audio Features (projected to LLM dim)
    │
    ▼
Qwen3 LLM Decoder (with audio embeddings)
    │
    ▼
Text Output
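
The frame counts in the first stage follow from the Whisper-style front end. A quick check, assuming Whisper's standard hop length of 160 samples (a Whisper convention, not stated in this README):

// Mel front-end arithmetic, assuming Whisper's hop length of 160 samples.
fn main() {
    let sample_rate = 16_000; // Hz
    let hop_length = 160;     // samples per frame => 100 frames/s
    let padded_seconds = 30;  // input is padded to 30 s
    let n_frames = padded_seconds * sample_rate / hop_length;
    assert_eq!(n_frames, 3_000); // 3000 frames x 128 mel bins feed the encoder
}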

License

MIT OR Apache-2.0
