Converse with large language models using speech.
- Open: Powered by state-of-the-art open-source speech processing models.
- Efficient: Light enough to run on consumer hardware, with low latency.
- Self-hosted: Entire pipeline runs offline, limited only by compute power.
- Modular: Switching LLM providers is as simple as changing an environment variable.
![Sage architecture](https://github.com/farshed/sage/raw/main/assets/architecture-dark.png?raw=true)
- For text generation, you can either self-host an LLM using Ollama or opt for a third-party provider. This can be configured using a `.env` file in the project root.
- If you're using Ollama, add the `OLLAMA_MODEL` variable to the `.env` file to specify the model you'd like to use. (Example: `OLLAMA_MODEL=deepseek-r1:7b`)
- Among the third-party providers, Sage supports the following out of the box:
  - Deepseek
  - OpenAI
  - Anthropic
  - Together.ai
- To use a provider, add a `<PROVIDER>_API_KEY` variable to the `.env` file. (Example: `OPENAI_API_KEY=xxxxxxxxxxxxxxxxxxxxxxx`)
- To choose which model should be used for a given provider, use the `<PROVIDER>_MODEL` variable. (Example: `DEEPSEEK_MODEL=deepseek-chat`)
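For reference, a minimal `.env` could look like the sketch below. The key and model values are illustrative placeholders taken from the examples above; use whichever provider variables apply to your setup.

```sh
# .env (project root), illustrative values only

# Option A: self-hosted text generation via Ollama
OLLAMA_MODEL=deepseek-r1:7b

# Option B: a third-party provider instead (API key, plus <PROVIDER>_MODEL to pick the model)
# DEEPSEEK_API_KEY=xxxxxxxxxxxxxxxxxxxxxxx
# DEEPSEEK_MODEL=deepseek-chat
```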
Next, you have two choices: run Sage as a Docker container (the easy way) or natively (the hard way). Note that running it with Docker may carry a performance penalty; Whisper inference is 4-5x slower than it is natively. Both paths are summarized in the command sketch after the list below.
- With Docker: Install Docker and start the daemon. Download the following files and place them inside a `models` directory at the project root. Run `bun docker-build` to build the image, then `bun docker-run` to spin up a container. The UI is exposed at `http://localhost:3000`.
- Without Docker: Install Bun, Rust, OpenSSL, LLVM, Clang, and CMake. Make sure all of these are accessible via `$PATH`. Then run `setup-unix.sh` or `setup-win.bat` depending on your platform. This will download the required model weights and compile the binaries Sage needs. Once finished, start the project with `bun start`. The first run on macOS is slow (~20 minutes on an M1 Pro), since the ANE service compiles the Whisper CoreML model to a device-specific format. Subsequent runs are faster.
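For quick reference, the two paths condense to the commands below, using only the commands named above. This assumes the prerequisites are installed and, for the Docker path, that the model files are already in `models`.

```sh
# Docker path: build the image, then start a container (UI at http://localhost:3000)
bun docker-build
bun docker-run

# Native path: download model weights and compile binaries, then start Sage
./setup-unix.sh      # or setup-win.bat on Windows
bun start
```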
Roadmap:
- Make it easier to run (Dockerize?)
- CUDA support
- Allow custom Ollama endpoint
- Multilingual support
- Allow Whisper configuration
- Allow customization of system prompt
- Optimize the pipeline
- Release as a library?