Skip to content

[PERFORMANCE]: Metrics aggregation queries cause full table scans under load #1906

@crivetimihai

Description

@crivetimihai

Summary

Metrics aggregation queries perform full table scans instead of using indexes, causing significant database load under high concurrency. During load testing with 3000 users, tool_metrics alone accumulated 2 billion tuples scanned in 40 minutes.

Problem

The aggregate_metrics_combined() function in mcpgateway/services/metrics_query_service.py uses a 7-day retention cutoff filter:

cutoff = now - timedelta(days=retention_days)  # Default: 7 days
raw_filters = [raw_model.timestamp >= cutoff]

When most/all data is recent (common during load testing or fresh deployments), this filter matches 100% of rows, causing PostgreSQL to choose sequential scan over index scan.

Evidence

Table Scan Statistics (40-minute load test)

Table Rows Seq Scans Tuples Read Tuples/Sec
tool_metrics 541K 17,553 2,027,507,024 806,572
resource_metrics 72K 5,829 210,189,716 83,617
prompt_metrics 59K 5,827 177,700,450 70,692

EXPLAIN ANALYZE

Plan: Parallel Seq Scan on tool_metrics
      Filter: (timestamp >= (now() - '7 days'::interval))
      Rows Removed by Filter: 0  <-- ALL rows match
      Execution Time: 63.734 ms

With a more selective filter (1 hour), the index IS used:

Plan: Parallel Index Only Scan using ix_tool_metrics_timestamp

Contributing Factor

The metrics cache TTL is only 10 seconds (metrics_cache_ttl_seconds). Under load, cache expires constantly, triggering repeated full table scans.

Impact

  • /admin/ endpoint latency: 5.6s average (target: <500ms)
  • Database CPU: 130-158%
  • I/O pressure from continuous full table scans

Proposed Solution

Phase 1: Configuration (Quick Win)

  1. Increase default metrics_cache_ttl_seconds from 10 to 60-120 seconds
  2. Add Redis-backed cache option for distributed deployments

Phase 2: Database Optimization

  1. Add covering index for aggregation queries:

    CREATE INDEX idx_tool_metrics_agg_covering
    ON tool_metrics (timestamp) INCLUDE (is_success, response_time);
  2. Consider materialized view for pre-computed aggregations

Phase 3: Architecture (Optional)

  1. Partition metrics tables by time for efficient pruning
  2. Background job to refresh aggregation cache

Files to Modify

  • mcpgateway/config.py - Increase default cache TTL
  • mcpgateway/cache/metrics_cache.py - Add Redis backend option
  • mcpgateway/alembic/versions/ - Add covering index migration
  • docker-compose.yml / .env.example - Document settings

Related Issues

Metadata

Metadata

Assignees

Labels

SHOULDP2: Important but not vital; high-value items that are not crucial for the immediate releasedatabaseperformancePerformance related items

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions