OllamaFlow is a lightweight, intelligent orchestration layer that transforms multiple Ollama instances into a unified, high-availability AI inference cluster. Whether you're scaling AI workloads across multiple GPUs or ensuring zero-downtime model serving, OllamaFlow has you covered.
- Multiple Virtual Endpoints: Create multiple frontend endpoints, each mapping to its own set of Ollama backends
- Smart Load Balancing: Distribute requests intelligently across healthy backends
- Automatic Model Sync: Ensure all backends have the required models - automatically
- Health Monitoring: Real-time health checks with configurable thresholds
- Zero Downtime: Seamlessly handle backend failures without dropping requests
- RESTful Admin API: Full control through a comprehensive management API
- Round-robin and random distribution strategies
- Request routing based on backend health and capacity
- Automatic failover for unhealthy backends
- Configurable rate limiting per backend
- Automatic model discovery across all backends
- Intelligent synchronization - pulls missing models automatically
- Dynamic model requirements - update required models on the fly
- Parallel downloads with configurable concurrency
- Real-time health monitoring with customizable check intervals
- Automatic failover for unhealthy backends
- Request queuing during high load
- Connection pooling for optimal performance
- Bearer token authentication for admin APIs
- Comprehensive logging with syslog support
- Docker and Docker Compose ready
- SQLite database for configuration persistence
With Docker:

```bash
# Pull the image
docker pull jchristn/ollamaflow

# Run with default configuration
docker run -d \
  -p 43411:43411 \
  -v $(pwd)/ollamaflow.json:/app/ollamaflow.json \
  jchristn/ollamaflow
```

Or build and run from source:

```bash
# Clone the repository
git clone https://github.com/jchristn/ollamaflow.git
cd ollamaflow/src

# Build and run
dotnet build
cd OllamaFlow.Server/bin/Debug/net8.0
dotnet OllamaFlow.Server.dll
```

OllamaFlow uses a simple JSON configuration file. Here's a minimal example:
```json
{
  "Webserver": {
    "Hostname": "localhost",
    "Port": 43411
  },
  "Logging": {
    "MinimumSeverity": "Info",
    "ConsoleLogging": true
  }
}
```

Frontends define your virtual Ollama endpoints:
```json
{
  "Identifier": "main-frontend",
  "Name": "Production Ollama Frontend",
  "Hostname": "*",
  "LoadBalancing": "RoundRobin",
  "Backends": ["gpu-1", "gpu-2", "gpu-3"],
  "RequiredModels": ["llama3", "mistral", "codellama"]
}
```

Backends represent your actual Ollama instances:
```json
{
  "Identifier": "gpu-1",
  "Name": "GPU Server 1",
  "Hostname": "192.168.1.100",
  "Port": 11434,
  "MaxParallelRequests": 4,
  "HealthCheckUrl": "/",
  "UnhealthyThreshold": 2
}
```

OllamaFlow is fully compatible with the Ollama API, supporting:
- /api/generate - Text generation
- /api/chat - Chat completions
- /api/pull - Model pulling
- /api/push - Model pushing
- /api/show - Model information
- /api/tags - List models
- /api/ps - Running models
- /api/embed - Embeddings
- /api/delete - Model deletion
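Because the frontend exposes the same API, existing Ollama clients and tools can point at OllamaFlow without changes. A minimal sketch, assuming the default listener on localhost:43411 and a frontend whose required models include llama3 (as in the frontend example above):

```bash
# Send a standard Ollama generate request to the OllamaFlow frontend;
# OllamaFlow forwards it to a healthy backend serving the model.
curl http://localhost:43411/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
```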
Test with multiple Ollama instances using Docker Compose:
```bash
cd Docker
docker compose -f compose-ollama.yaml up -d
```

This spins up 4 Ollama instances on ports 11435-11438 for testing.
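Before pointing a frontend at these instances, you can confirm each one is responding with a plain Ollama request; a quick sketch against the first instance (the other three are on ports 11436-11438):

```bash
# List the models currently present on the first test instance
curl http://localhost:11435/api/tags
```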
Manage your cluster programmatically:
```bash
# List all backends
curl -H "Authorization: Bearer your-token" \
  http://localhost:43411/v1.0/backends

# Add a new backend
curl -X PUT \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{"Identifier": "gpu-4", "Hostname": "192.168.1.104", "Port": 11434}' \
  http://localhost:43411/v1.0/backends
```

A complete Postman collection (OllamaFlow.postman_collection.json) is included in the repository root, with examples for both the Ollama-compatible and administrative APIs.
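For instance, the same PUT request can register one of the local test instances started by the Docker Compose file above; the identifier, name, and bearer token here are illustrative placeholders:

```bash
# Register the first local test Ollama instance (port 11435) as a new backend
curl -X PUT \
  -H "Authorization: Bearer your-token" \
  -H "Content-Type: application/json" \
  -d '{"Identifier": "local-1", "Name": "Local Test Instance 1", "Hostname": "localhost", "Port": 11435}' \
  http://localhost:43411/v1.0/backends
```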
We welcome contributions! Whether it's:
- Bug fixes
- New features
- Documentation improvements
- Feature requests
Please check out our Contributing Guidelines and feel free to:
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
OllamaFlow adds minimal overhead to your Ollama requests:
- < 1ms routing decision time
- Negligible memory footprint (~50MB)
- High throughput - handles thousands of requests per second
- Efficient streaming support for real-time responses
- Bearer token authentication for administrative APIs
- Request source IP forwarding for audit trails
- Configurable request size limits
- No external dependencies for core functionality
- GPU Cluster Management: Distribute AI workloads across multiple GPU servers
- CPU Infrastructure: Perfect for dense CPU systems like Ampere processors
- High Availability: Ensure your AI services stay online 24/7
- Development & Testing: Easily switch between different model configurations
- Cost Optimization: Maximize hardware utilization across your infrastructure
- Multi-Tenant Scenarios: Isolate workloads while sharing infrastructure
This project is licensed under the MIT License - see the LICENSE file for details.
- The Ollama team for creating an amazing local AI runtime
- All our contributors and users who make this project possible
Get started with OllamaFlow today!