Why OpenJudge?

OpenJudge is a unified framework designed to drive LLM and Agent application excellence through Holistic Evaluation and Quality Rewards.

Evaluation and reward signals are the cornerstones of application excellence. Holistic evaluation enables the systematic analysis of shortcomings to drive rapid iteration, while high-quality rewards provide the essential foundation for advanced optimization and fine-tuning.

OpenJudge unifies evaluation metrics and reward signals into a single, standardized Grader interface, offering pre-built graders, flexible customization, and seamless framework integration.
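
To make the idea concrete, the snippet below is a minimal sketch of what a single grader interface serving both evaluation and reward roles could look like. The names used here (`Grader`, `GradeResult`, `grade`, `as_reward`) are illustrative assumptions, not OpenJudge's published API; see the documentation for the actual interfaces.

```python
# Conceptual sketch only: Grader, GradeResult, grade, and as_reward are
# illustrative assumptions, not OpenJudge's published API.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GradeResult:
    score: float   # normalized score, e.g. in [0.0, 1.0]
    reason: str    # human-readable explanation of the judgment


class Grader(Protocol):
    """One interface serving both evaluation metrics and reward signals."""

    def grade(self, query: str, response: str, **context) -> GradeResult:
        ...


def as_reward(result: GradeResult) -> float:
    """The same grader output can be consumed as a scalar reward for training."""
    return result.score
```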

Key Features

  • Systematic & Quality-Assured Grader Library: Access 50+ production-ready graders featuring a comprehensive taxonomy, rigorously validated for reliable performance.

    • Multi-Scenario Coverage: Extensive support for diverse domains including Agent, text, code, math, and multimodal tasks via specialized graders. Explore Supported Scenarios
    • Holistic Agent Evaluation: Beyond final outcomes, we assess the entire lifecycle—including trajectories and specific components (Memory, Reflection, Tool Use). Agent Lifecycle Evaluation
    • Quality Assurance: Built for reliability. Every grader comes with benchmark datasets and pytest integration for immediate quality validation. View Benchmark Datasets
  • Flexible Grader Building: Choose the build method that fits your requirements:

    • Customization: Clear requirements, but no existing grader? If you have explicit rules or logic, use our Python interfaces or prompt templates to quickly define your own grader; a minimal rule-based sketch follows this list. Custom Grader Development Guide
    • Zero-shot Rubrics Generation: Not sure what criteria to use, and no labeled data yet? Just provide a task description and optional sample queries, and the LLM will automatically generate evaluation rubrics for you. Ideal for rapid prototyping; a rough sketch of this workflow also appears after this list. Zero-shot Rubrics Generation Guide
    • Data-driven Rubrics Generation: Ambiguous requirements, but have a few labeled examples? Use the GraderGenerator to automatically summarize evaluation rubrics from your annotated data and generate an LLM-based grader. Data-driven Rubrics Generation Guide
    • Training Judge Models: Massive data and need peak performance? Use our training pipeline to train a dedicated Judge Model. This is ideal for complex scenarios where prompt-based grading falls short. Train Judge Models
  • Easy Integration: Using mainstream observability platforms like LangSmith or Langfuse? We offer seamless integration to enhance their evaluators and automated evaluation capabilities. We're also building integrations with training frameworks like verl.
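
As a concrete illustration of the rule-based customization path above, here is what a small custom grader and its pytest check might look like, reusing the hypothetical `Grader`/`GradeResult` shapes sketched earlier. The class names and test are assumptions for illustration; the Custom Grader Development Guide describes the real interfaces and registration mechanism.

```python
# Illustrative only: reuses the hypothetical GradeResult shape sketched above;
# OpenJudge's real base classes and registration mechanism may differ.
from dataclasses import dataclass


@dataclass
class GradeResult:
    score: float
    reason: str


class ContainsCitationGrader:
    """Rule-based grader: rewards responses that cite at least one source."""

    def __init__(self, marker: str = "[source]"):
        self.marker = marker

    def grade(self, query: str, response: str, **context) -> GradeResult:
        cited = self.marker in response
        return GradeResult(
            score=1.0 if cited else 0.0,
            reason="citation marker found" if cited else "no citation marker",
        )


# A pytest-style check, mirroring the benchmark-plus-pytest validation
# workflow described above.
def test_contains_citation_grader():
    grader = ContainsCitationGrader()
    assert grader.grade("q", "Paris is the capital [source]").score == 1.0
    assert grader.grade("q", "Paris is the capital").score == 0.0
```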
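
The zero-shot rubric generation path can be pictured roughly as follows. `call_llm` is a placeholder for whatever model client you use, and the prompt text and parsing are illustrative assumptions rather than OpenJudge's implementation.

```python
# Rough sketch of zero-shot rubric generation; call_llm is a placeholder for
# your own model client, and the prompt/parsing are illustrative assumptions.
from typing import Callable, List, Optional

RUBRIC_PROMPT = (
    "You are designing evaluation rubrics.\n"
    "Task description: {task}\n"
    "Sample queries (may be empty): {samples}\n"
    "List 3-5 concrete, checkable rubrics, one per line."
)


def generate_rubrics(
    task: str,
    call_llm: Callable[[str], str],
    sample_queries: Optional[List[str]] = None,
) -> List[str]:
    """Ask an LLM to propose evaluation rubrics from a task description alone."""
    prompt = RUBRIC_PROMPT.format(task=task, samples=sample_queries or [])
    completion = call_llm(prompt)
    # Treat each non-empty line of the model's answer as one rubric.
    return [line.lstrip("- ").strip() for line in completion.splitlines() if line.strip()]
```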

Quick Tutorials

More Tutorials

  • Built-in Graders
  • Build Graders
  • Integrations
  • Applications
  • Running Graders
  • Validating Graders