A powerful framework for deploying machine learning models to NVIDIA Triton Inference Server with built-in optimizations, quantization, and easy-to-use APIs.
- 🚀 Easy Deployment: Deploy any HuggingFace model with a single command
- 🔧 Automatic Optimization: Built-in quantization and optimization for 4x model compression
- 🎯 Task-Specific Models: Pre-built support for text classification, image classification, and more
- 📦 Model Conversion: Automatic conversion to ONNX, TorchScript, or TensorRT
- 🔌 Simple API: Intuitive Python API and CLI tools
- 🐳 Docker Ready: Generate Docker deployment packages automatically
- 📊 Benchmarking: Built-in performance benchmarking tools
```bash
pip install tritonml
```

Or install from source:

```bash
git clone https://github.com/aaanshshah/tritonml
cd tritonml
pip install -e .
```

```python
from tritonml import deploy
# Deploy any HuggingFace model
client = deploy("cardiffnlp/twitter-roberta-base-emotion")
# Make predictions
result = client.predict("I love this framework!")
print(result)  # Output: "joy"
```

```bash
# Deploy a model
tritonml deploy cardiffnlp/twitter-roberta-base-emotion --server localhost:8000
# Make predictions
tritonml predict emotion-classifier "I'm so happy!" --server localhost:8000
# Benchmark performance
tritonml benchmark emotion-classifier --batch-sizes 1,8,16,32
```

The base class for all deployable models:

```python
from tritonml import TritonModel
# Load a model from HuggingFace
model = TritonModel.from_huggingface(
    "bert-base-uncased",
    task="text-classification"  # Auto-detected if not specified
)
# Convert and optimize
model.convert() # Convert to ONNX
model.quantize() # Apply INT8 quantization
model.optimize() # Apply graph optimizations
# Deploy
client = model.deploy(server_url="localhost:8000")
# Use the model
result = model.predict("Hello world!")
```

Pre-configured models for common tasks:

```python
from tritonml.tasks import TextClassificationModel, EmotionClassifier
# Generic text classification
model = TextClassificationModel.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    labels=["negative", "positive"]
)
# Specialized emotion classifier
emotion_model = EmotionClassifier.from_pretrained()
emotions = emotion_model.predict([
    "I'm furious!",
    "Best day ever!",
    "Things will improve",
    "Feeling down..."
])
```

Convert models to optimized formats:

```python
from tritonml.core.converter import get_converter
# Get appropriate converter
converter = get_converter("onnx", model, config)
# Convert with options
converter.convert(
    output_path="./models/my-model",
    opset_version=14,
    optimize_for_gpu=True
)
# Quantize for better performance
converter.quantize(
    method="dynamic",  # or "static" with calibration data
    per_channel=True
)
```

TritonML now supports benchmarking models with Hugging Face datasets:

```python
from tritonml import TextClassificationModel, BenchmarkRunner, HuggingFaceDatasetLoader
# Load and deploy your model
model = TextClassificationModel.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model.deploy()
# Create benchmark runner
runner = BenchmarkRunner(model)
# Load a dataset
dataset_loader = HuggingFaceDatasetLoader("imdb", split="test")
# Run benchmark
results = runner.benchmark_dataset(
    dataset_loader,
    batch_sizes=[1, 8, 16, 32],
    num_samples=1000
)
# Print summary
runner.print_summary()
# Save results
runner.save_results("benchmark_results.json")
```

Use the CLI to benchmark deployed models with Hugging Face datasets:

```bash
# Benchmark with IMDB dataset
tritonml benchmark my-model --dataset imdb --num-samples 1000 --output results.json
# Custom batch sizes
tritonml benchmark my-model --dataset emotion --batch-sizes "1,4,8,16" --output results.csv
```

Benchmark across multiple datasets:

```python
# Define dataset configurations
datasets = [
    {"dataset_name": "imdb", "split": "test"},
    {"dataset_name": "rotten_tomatoes", "split": "test"},
    {"dataset_name": "emotion", "split": "test"}
]
# Run benchmarks
results = runner.benchmark_multiple_datasets(
    datasets,
    batch_sizes=[1, 8, 16],
    num_samples=500
)
```

Popular datasets for benchmarking:
Text Classification:

- `imdb` - Movie review sentiment
- `rotten_tomatoes` - Movie reviews
- `emotion` - Emotion classification
- `ag_news` - News categorization
- `tweet_eval` - Tweet sentiment

Other Tasks:

- See `HuggingFaceDatasetLoader.list_popular_datasets()` for more
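For example, you can query the loader's helper before picking a dataset (a minimal sketch; the exact return type of `list_popular_datasets()` is assumed here to be printable):

```python
from tritonml import HuggingFaceDatasetLoader

# Inspect the curated dataset suggestions (a printable return value is assumed)
print(HuggingFaceDatasetLoader.list_popular_datasets())

# Load one of the suggestions for benchmarking
loader = HuggingFaceDatasetLoader("ag_news", split="test")
```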
Create custom model implementations:

```python
from tritonml.core.model import TritonModel
from tritonml.core.config import TritonConfig
class MyCustomModel(TritonModel):
    @classmethod
    def from_pretrained(cls, model_path, **kwargs):
        # Load your model and describe its I/O shapes for Triton
        config = TritonConfig(
            model_name="my-model",
            input_shapes={"input": [512]},
            output_shapes={"output": [10]}
        )
        return cls(config)

    def preprocess(self, inputs):
        # Custom preprocessing; process_inputs stands in for your own encoding logic
        return {"input": process_inputs(inputs)}

    def postprocess(self, outputs):
        # Custom postprocessing
        return outputs["output"].argmax()
```

Fine-tune deployment settings:

```python
from tritonml.core.config import TritonConfig
config = TritonConfig(
    model_name="my-model",
    max_batch_size=64,
    instance_group={"kind": "KIND_GPU", "count": 2},
    dynamic_batching={
        "preferred_batch_size": [8, 16, 32],
        "max_queue_delay_microseconds": 100
    }
)

model = MyCustomModel(config)
```

Generate complete Docker deployment packages:

```python
from tritonml.deploy.docker import create_deployment_package
create_deployment_package(
    model_name="emotion-classifier",
    output_path="./deploy",
    include_client=True
)
```

This creates:
- `Dockerfile` - Custom Triton server image
- `docker-compose.yml` - Complete deployment configuration
- `client_example.py` - Example client code
- `README.md` - Deployment instructions
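A typical way to exercise the generated package (a sketch; assumes Docker Compose is installed and the package was written to `./deploy` as above):

```bash
cd deploy
docker compose up -d        # start the Triton server from the generated compose file
python client_example.py    # query the deployed model with the generated example client
```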
Benchmark model performance:

```python
# Built-in benchmarking
results = model.benchmark(
    test_inputs=["sample text"] * 100,
    batch_sizes=[1, 8, 16, 32, 64]
)

for batch_size, metrics in results.items():
    print(f"{batch_size}: {metrics['avg_latency_ms']:.2f}ms, "
          f"{metrics['throughput']:.2f} samples/sec")
```

TritonML follows a modular architecture:

```
tritonml/
├── core/                 # Core framework components
│   ├── model.py          # Base TritonModel class
│   ├── client.py         # Enhanced Triton client
│   ├── converter.py      # Model conversion utilities
│   └── config.py         # Configuration management
├── tasks/                # Task-specific implementations
│   ├── text_classification.py
│   ├── image_classification.py
│   └── converters/       # Task-specific converters
├── utils/                # Utility functions
├── deploy/               # Deployment utilities
└── cli/                  # Command-line interface
```
Text models:

- BERT, RoBERTa, DistilBERT, ALBERT
- GPT-2, GPT-Neo, T5 (coming soon)
- Any HuggingFace `AutoModelForSequenceClassification`
Image models (see the deployment sketch below):

- Vision Transformer (ViT)
- ResNet, EfficientNet
- Any torchvision model
Model formats:

- ONNX models
- TorchScript models
- TensorFlow SavedModel (coming soon)
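Any of these can be deployed through the same high-level `deploy()` entry point used in the quickstart. A minimal sketch for an image model (the ViT checkpoint id is illustrative, and image preprocessing specifics are an assumption that may differ from the text examples):

```python
from tritonml import deploy

# Deploy a vision model through the same entry point as text models.
# The checkpoint id is illustrative; ViT support follows the list above.
client = deploy("google/vit-base-patch16-224")
```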
TritonML automatically applies optimizations:
- Quantization: 4x model size reduction with INT8
- Graph Optimization: ONNX runtime optimizations
- Batching: Dynamic batching for better throughput
- Multi-Instance: GPU/CPU instance scaling
Example results for emotion classification:
- Original model: 476MB
- Quantized model: 120MB (4x compression)
- Latency: 2-4x faster inference
- Accuracy: maintained at 93.8%
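Figures like these can be gathered with the built-in tooling; a sketch combining the pieces shown earlier (assumes a Triton server reachable at localhost:8000):

```python
from tritonml.tasks import EmotionClassifier

# Quantize the emotion model, deploy it, and measure latency/throughput
model = EmotionClassifier.from_pretrained()
model.convert()
model.quantize()
model.deploy(server_url="localhost:8000")

results = model.benchmark(
    test_inputs=["sample text"] * 100,
    batch_sizes=[1, 8, 16, 32]
)
for batch_size, metrics in results.items():
    print(batch_size, metrics["avg_latency_ms"], metrics["throughput"])
```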
The repository uses GitHub Actions for continuous integration. The workflow runs:
- Linting with flake8
- Code formatting checks with black and isort
- Type checking with mypy
- Unit tests with pytest
- Code coverage reporting with Codecov
To enable Codecov integration for your fork:
- Sign up at codecov.io using your GitHub account
- Add your repository to Codecov
- Copy your repository's upload token
- Add the token as a GitHub secret:
  - Go to Settings → Secrets → Actions
  - Add a new secret named `CODECOV_TOKEN`
  - Paste your token as the value
Note: CI still passes even if the Codecov upload fails; this accommodates rate limiting on public repositories.
Contributions are welcome! See CONTRIBUTING.md for guidelines.
MIT License - see LICENSE for details.
Built on top of:
- NVIDIA Triton Inference Server
- HuggingFace Transformers
- ONNX Runtime
- Microsoft Optimum