A production-ready FastAPI wrapper for Microsoft's VibeVoice model, enabling high-quality multi-speaker text-to-speech generation through a REST API with status tracking and queue management.
- **Multi-Speaker TTS**: Generate conversations with up to 4 distinct speakers
- **Asynchronous Processing**: Queue-based system handles multiple requests efficiently
- **Status Tracking**: Real-time job status and queue position monitoring
- **Rate Limiting**: Built-in protection against API abuse (10 requests/minute)
- **Voice Presets**: Pre-configured voice samples for immediate use
- **File Management**: Automatic output file handling and cleanup
- **Docker Ready**: Easy deployment with containerization support
- Python 3.8+
- NVIDIA GPU with CUDA support (recommended)
- At least 8GB GPU memory for optimal performance
# Install uv if you haven't already
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone this repository
git clone https://github.com/dontriskit/VibeVoice-FastAPI
cd VibeVoice-FastAPI
# Create and activate virtual environment with uv
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
uv pip install -r requirements.txt
uv pip install -e .

# Clone this repository
git clone https://github.com/dontriskit/VibeVoice-FastAPI
cd VibeVoice-FastAPI
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
pip install -e .

# Using NVIDIA PyTorch Container
sudo docker run --privileged --net=host --ipc=host \
--ulimit memlock=-1:-1 --ulimit stack=-1:-1 --gpus all \
--rm -it nvcr.io/nvidia/pytorch:24.07-py3
# Inside container
git clone <your-repo-url>
cd VibeVoice-FastAPI
pip install -r requirements.txt
pip install -e .

# Basic usage (defaults to port 8000)
python main.py
# Custom host and port
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

The API will be available at http://localhost:8000, with interactive documentation at http://localhost:8000/docs.
| Model | Context Length | Generation Length | Hugging Face |
|---|---|---|---|
| VibeVoice-1.5B | 64K | ~90 min | microsoft/VibeVoice-1.5B |
| VibeVoice-7B-Preview | 32K | ~45 min | WestZhang/VibeVoice-Large-pt |
POST /generate
Submit a text-to-speech generation request with multiple speakers.
Request Body:
{
"script": "Speaker 1: Hello, how are you today?\nSpeaker 2: I'm doing great, thanks for asking!",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3
}

Response:
{
"task_id": "123e4567-e89b-12d3-a456-426614174000",
"status": "queued",
"message": "Job accepted and placed in queue.",
"queue_position": 1
}

GET /status/{task_id}
Monitor the progress of your generation job.
Response:
{
"task_id": "123e4567-e89b-12d3-a456-426614174000",
"status": "running",
"queue_position": 0,
"generation_time": null
}

Status Values:
- queued: Job is waiting in queue
- running: Job is currently being processed
- completed: Job finished successfully
- failed: Job encountered an error
GET /result/{task_id}
Download the generated audio file (WAV format).
Response: Audio file download or error message.
GET /voices
Get all available voice presets.
Response:
[
"en-Alice_woman",
"en-Carter_man",
"en-Frank_man",
"en-Mary_woman_bgm",
"zh-Bowen_man",
"zh-Xinran_woman"
]

Your script should follow this format:
Speaker 1: First person's dialogue here.
Speaker 2: Second person's response.
Speaker 1: More dialogue from first person.
Important Notes:
- Each speaker line must start with "Speaker" followed by a number
- Speaker numbers should be consistent throughout the script
- Provide voice names in the speaker_names array, matching the speaker order
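The script format and rules above can be assembled programmatically. A minimal helper (hypothetical, not part of this repository) that builds a script and matching request payload from a list of turns:

```python
# Build a VibeVoice script and request payload from (speaker_number, text) turns.
# Hypothetical helper, not part of this repository.

def build_request(turns, voices, cfg_scale=1.3):
    """turns: list of (speaker_number, text); voices: one preset per speaker, in order."""
    if len(voices) > 4:
        raise ValueError("VibeVoice supports at most 4 distinct speakers")
    script = "\n".join(f"Speaker {n}: {text}" for n, text in turns)
    return {
        "script": script,
        "speaker_names": voices,  # voices[0] maps to Speaker 1, voices[1] to Speaker 2, ...
        "cfg_scale": cfg_scale,
    }

payload = build_request(
    [(1, "Welcome to our podcast!"), (2, "Thanks for having me!")],
    ["en-Alice_woman", "en-Carter_man"],
)
print(payload["script"])
# Speaker 1: Welcome to our podcast!
# Speaker 2: Thanks for having me!
```

The returned dictionary can be passed directly as the JSON body of a POST /generate request.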
import requests
import time

# Submit generation request
response = requests.post("http://localhost:8000/generate", json={
    "script": "Speaker 1: Welcome to our podcast!\nSpeaker 2: Thanks for having me!",
    "speaker_names": ["en-Alice_woman", "en-Carter_man"],
    "cfg_scale": 1.3
})
task_id = response.json()["task_id"]
print(f"Task submitted: {task_id}")

# Poll for completion
while True:
    status_response = requests.get(f"http://localhost:8000/status/{task_id}")
    status = status_response.json()["status"]
    if status == "completed":
        # Download the result
        audio_response = requests.get(f"http://localhost:8000/result/{task_id}")
        with open("output.wav", "wb") as f:
            f.write(audio_response.content)
        print("Audio saved as output.wav")
        break
    elif status == "failed":
        print("Generation failed")
        break
    else:
        print(f"Status: {status}")
        time.sleep(5)

# Submit generation request
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{
"script": "Speaker 1: Hello world!\nSpeaker 2: How are you?",
"speaker_names": ["en-Alice_woman", "en-Carter_man"],
"cfg_scale": 1.3
}'
# Check status
curl "http://localhost:8000/status/YOUR_TASK_ID"
# Download result
curl -o output.wav "http://localhost:8000/result/YOUR_TASK_ID"
# List voices
curl "http://localhost:8000/voices"

By default, the API uses microsoft/VibeVoice-1.5B. To use the 7B model, modify line 165 in main.py:
model_path = "WestZhang/VibeVoice-Large-pt"

Place your custom voice samples in the demo/voices/ directory. Supported format: WAV files.
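As an alternative to editing the source, the model path could be read from an environment variable. A sketch (VIBEVOICE_MODEL is a hypothetical variable name; main.py as written does not support this):

```python
import os

# Fall back to the default 1.5B model when the variable is unset.
# VIBEVOICE_MODEL is a hypothetical variable, not read by main.py as shipped.
model_path = os.environ.get("VIBEVOICE_MODEL", "microsoft/VibeVoice-1.5B")
```

With that change, the larger model could be selected at launch time, e.g. `VIBEVOICE_MODEL=WestZhang/VibeVoice-Large-pt python main.py`.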
Current limit: 10 requests per minute per IP. Modify in main.py:
@limiter.limit("10/minute")  # Change this value

For optimal Chinese speech generation:
- Use English punctuation (commas and periods only)
- Consider using the 7B model for better stability
- Avoid special Chinese quotation marks
The model may spontaneously generate background music:
- Voice samples with BGM increase the likelihood
- Introductory phrases ("Welcome to", "Hello") may trigger BGM
- Using "Alice" voice preset has higher BGM probability
- This is an intentional feature, not a bug
- 1.5B model: ~8GB GPU memory
- 7B model: ~16GB GPU memory
- CPU inference is supported but significantly slower
┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│     FastAPI     │────▶│  Task Queue  │────▶│  Worker Thread  │
│   Web Server    │     │ (in-memory)  │     │ (GPU Inference) │
└─────────────────┘     └──────────────┘     └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Rate Limiter   │     │ Job Tracking │     │  Audio Output   │
└─────────────────┘     └──────────────┘     └─────────────────┘
Key Features:
- Single-process architecture for simplicity
- In-memory queue ensures FIFO processing
- Background worker prevents blocking the web server
- Status tracking provides real-time updates
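The queue-and-worker pattern above can be sketched in a few lines. A simplified stand-in (the GPU inference call is replaced by a placeholder; this is not the actual main.py implementation):

```python
import queue
import threading
import uuid

jobs = {}                   # task_id -> status dict (shared job tracking)
task_queue = queue.Queue()  # in-memory FIFO queue

def worker():
    """Background thread: pull jobs off the queue and process them one at a time."""
    while True:
        task_id, script = task_queue.get()
        jobs[task_id]["status"] = "running"
        # Placeholder for GPU inference; the real service runs VibeVoice here.
        jobs[task_id]["result"] = f"audio for: {script}"
        jobs[task_id]["status"] = "completed"
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(script):
    """What the POST /generate handler does, in miniature."""
    task_id = str(uuid.uuid4())
    jobs[task_id] = {"status": "queued", "result": None}
    task_queue.put((task_id, script))
    return task_id

tid = submit("Speaker 1: Hello!")
task_queue.join()           # wait for the worker (clients poll /status instead)
print(jobs[tid]["status"])  # completed
```

Because a single worker drains the queue, requests are processed strictly in arrival order, and the web server thread never blocks on inference.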
This project is a FastAPI wrapper around Microsoft's VibeVoice model. Please refer to the original VibeVoice repository for licensing terms and model details.
Responsible AI Usage:
- Disclose AI-generated content when sharing
- Ensure compliance with local laws and regulations
- Verify content accuracy and avoid misleading applications
- Do not use for deepfakes or disinformation
Technical Limitations:
- English and Chinese only
- No overlapping speech generation
- Speech synthesis only (no background noise/music control)
- Not recommended for commercial use without additional testing
Model is intended for research and development purposes. Use responsibly.