## Summary
Metrics aggregation queries perform full table scans instead of using indexes, causing significant database load under high concurrency. During load testing with 3000 users, tool_metrics alone accumulated 2 billion tuples scanned in 40 minutes.
## Problem
The `aggregate_metrics_combined()` function in `mcpgateway/services/metrics_query_service.py` applies a 7-day retention cutoff filter:

```python
cutoff = now - timedelta(days=retention_days)  # Default: 7 days
raw_filters = [raw_model.timestamp >= cutoff]
```

When most or all data is recent (common during load testing or on fresh deployments), this filter matches 100% of rows, so PostgreSQL chooses a sequential scan over an index scan.
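As a toy illustration of why the planner prefers a sequential scan here, the fraction of rows the cutoff matches can be computed directly (the function and sample data below are hypothetical, not project code):

```python
from datetime import datetime, timedelta, timezone

def cutoff_selectivity(row_timestamps, retention_days=7):
    """Fraction of rows that pass the retention cutoff filter."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    matched = sum(1 for ts in row_timestamps if ts >= cutoff)
    return matched / len(row_timestamps) if row_timestamps else 0.0

# Fresh deployment: every row was written within the last hour,
# so the 7-day cutoff matches 100% of the table.
now = datetime.now(timezone.utc)
recent = [now - timedelta(minutes=m) for m in range(60)]
print(cutoff_selectivity(recent))  # -> 1.0
```

When selectivity is at or near 1.0, reading the whole table is cheaper for the planner than walking an index, which matches the EXPLAIN output below.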
## Evidence
### Table Scan Statistics (40-minute load test)
| Table | Rows | Seq Scans | Tuples Read | Tuples/Sec |
|---|---|---|---|---|
| `tool_metrics` | 541K | 17,553 | 2,027,507,024 | 806,572 |
| `resource_metrics` | 72K | 5,829 | 210,189,716 | 83,617 |
| `prompt_metrics` | 59K | 5,827 | 177,700,450 | 70,692 |
### EXPLAIN ANALYZE

```
Plan: Parallel Seq Scan on tool_metrics
Filter: (timestamp >= (now() - '7 days'::interval))
Rows Removed by Filter: 0   <-- ALL rows match
Execution Time: 63.734 ms
```

With a more selective filter (1 hour), the index IS used:

```
Plan: Parallel Index Only Scan using ix_tool_metrics_timestamp
```
## Contributing Factor
The metrics cache TTL is only 10 seconds (`metrics_cache_ttl_seconds`). Under load, the cache expires constantly, triggering repeated full table scans.
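A minimal sketch of this TTL behavior (the class below is an assumption for illustration, not the actual `metrics_cache` implementation): every expiry forces the expensive aggregation to run again, so a short TTL under sustained traffic means near-continuous full-table scans.

```python
import time

class TTLCache:
    """Single-value TTL cache; each miss re-runs the expensive aggregation."""

    def __init__(self, ttl_seconds=10.0):
        self.ttl = ttl_seconds
        self._value = None
        self._expires_at = 0.0  # monotonic deadline; 0.0 forces an initial miss
        self.misses = 0

    def get_or_compute(self, compute):
        now = time.monotonic()
        if now >= self._expires_at:
            self.misses += 1
            self._value = compute()  # e.g. the full-table aggregation query
            self._expires_at = now + self.ttl
        return self._value

cache = TTLCache(ttl_seconds=0.0)  # TTL of zero: every call is a miss
for _ in range(3):
    cache.get_or_compute(lambda: "aggregates")
print(cache.misses)  # -> 3
```

With a 60-120 second TTL (Phase 1 below), the same request pattern would hit the database at most once per minute or two instead of every 10 seconds per worker.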
## Impact
- `/admin/` endpoint latency: 5.6s average (target: <500ms)
- Database CPU: 130-158%
- I/O pressure from continuous full table scans
## Proposed Solution
### Phase 1: Configuration (Quick Win)
- Increase default `metrics_cache_ttl_seconds` from 10 to 60-120 seconds
- Add a Redis-backed cache option for distributed deployments
### Phase 2: Database Optimization
- Add a covering index for aggregation queries:

  ```sql
  CREATE INDEX idx_tool_metrics_agg_covering
  ON tool_metrics (timestamp)
  INCLUDE (is_success, response_time);
  ```

- Consider a materialized view for pre-computed aggregations
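The same covering index would likely help the other two metrics tables, which show the identical scan pattern. A small helper (hypothetical, not existing project code) could generate the DDL for the Alembic migration to apply; note that `INCLUDE` requires PostgreSQL 11+.

```python
def covering_index_ddl(table: str) -> str:
    """DDL for a covering index serving the aggregation query.

    The index name and INCLUDE column list are assumptions based on the
    columns the aggregation reads (is_success, response_time).
    """
    return (
        f"CREATE INDEX IF NOT EXISTS idx_{table}_agg_covering "
        f"ON {table} (timestamp) INCLUDE (is_success, response_time);"
    )

for table in ("tool_metrics", "resource_metrics", "prompt_metrics"):
    print(covering_index_ddl(table))
```

Including `is_success` and `response_time` in the index lets the planner satisfy the aggregation with an index-only scan instead of touching the heap.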
### Phase 3: Architecture (Optional)
- Partition metrics tables by time for efficient pruning
- Background job to refresh aggregation cache
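The background refresh job could be as simple as a daemon thread that recomputes aggregates on an interval, so request handlers only ever read a warm cache. A sketch under that assumption (names are hypothetical):

```python
import threading

def start_refresher(refresh, interval_seconds=60.0):
    """Run `refresh` on a fixed interval in a daemon thread.

    Returns the stop Event and the thread so the job can be shut down cleanly.
    """
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            refresh()  # recompute aggregates, repopulate the cache
            stop.wait(interval_seconds)  # sleep, but wake immediately on stop

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop, thread
```

With this in place, admin endpoints never trigger a synchronous full-table scan; worst case they serve an aggregate that is one interval stale.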
## Files to Modify
- `mcpgateway/config.py` - Increase default cache TTL
- `mcpgateway/cache/metrics_cache.py` - Add Redis backend option
- `mcpgateway/alembic/versions/` - Add covering index migration
- `docker-compose.yml` / `.env.example` - Document settings
## Related Issues
- [PERFORMANCE]: N+1 query pattern in `EmailTeam.get_member_count()` #1892 - N+1 in team counts (fixed)
- [PERFORMANCE]: Admin UI endpoints have high tail latency (5-10s p95) #1894 - Admin UI high latency (partially addressed)