## Summary
Metrics aggregation queries perform full table scans instead of using indexes, causing significant database load under high concurrency. During load testing with 3000 users, tool_metrics alone accumulated 2 billion tuples scanned in 40 minutes.
## Problem
The `aggregate_metrics_combined()` function in `mcpgateway/services/metrics_query_service.py` applies a 7-day retention cutoff filter:

```python
cutoff = now - timedelta(days=retention_days)  # Default: 7 days
raw_filters = [raw_model.timestamp >= cutoff]
```

When most or all data is recent (common during load testing or on fresh deployments), this filter matches 100% of rows, so PostgreSQL chooses a sequential scan over an index scan.
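As a toy illustration of why the planner prefers a sequential scan here, the fraction of rows the cutoff matches can be computed directly (the function and sample data below are hypothetical, not project code):

```python
from datetime import datetime, timedelta, timezone

def cutoff_selectivity(row_timestamps, retention_days=7):
    """Fraction of rows that pass the retention cutoff filter."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    matched = sum(1 for ts in row_timestamps if ts >= cutoff)
    return matched / len(row_timestamps) if row_timestamps else 0.0

# Fresh deployment: every row was written within the last hour,
# so the 7-day cutoff matches 100% of the table.
now = datetime.now(timezone.utc)
recent = [now - timedelta(minutes=m) for m in range(60)]
print(cutoff_selectivity(recent))  # -> 1.0
```

When selectivity is at or near 1.0, reading the whole table is cheaper for the planner than walking an index, which matches the EXPLAIN output below.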
## Evidence
### Table Scan Statistics (40-minute load test)
| Table | Rows | Seq Scans | Tuples Read | Tuples/Sec |
|---|---|---|---|---|
| `tool_metrics` | 541K | 17,553 | 2,027,507,024 | 806,572 |
| `resource_metrics` | 72K | 5,829 | 210,189,716 | 83,617 |
| `prompt_metrics` | 59K | 5,827 | 177,700,450 | 70,692 |
### EXPLAIN ANALYZE

```
Plan: Parallel Seq Scan on tool_metrics
Filter: (timestamp >= (now() - '7 days'::interval))
Rows Removed by Filter: 0   <-- ALL rows match
Execution Time: 63.734 ms
```

With a more selective filter (1 hour), the index IS used:

```
Plan: Parallel Index Only Scan using ix_tool_metrics_timestamp
```
## Contributing Factor
The metrics cache TTL is only 10 seconds (`metrics_cache_ttl_seconds`). Under load, the cache expires constantly, triggering repeated full table scans.
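A minimal sketch of this TTL behavior (the class below is an assumption for illustration, not the actual `metrics_cache` implementation): every expiry forces the expensive aggregation to run again, so a short TTL under sustained traffic means near-continuous full-table scans.

```python
import time

class TTLCache:
    """Single-value TTL cache; each miss re-runs the expensive aggregation."""

    def __init__(self, ttl_seconds=10.0):
        self.ttl = ttl_seconds
        self._value = None
        self._expires_at = 0.0  # monotonic deadline; 0.0 forces an initial miss
        self.misses = 0

    def get_or_compute(self, compute):
        now = time.monotonic()
        if now >= self._expires_at:
            self.misses += 1
            self._value = compute()  # e.g. the full-table aggregation query
            self._expires_at = now + self.ttl
        return self._value

cache = TTLCache(ttl_seconds=0.0)  # TTL of zero: every call is a miss
for _ in range(3):
    cache.get_or_compute(lambda: "aggregates")
print(cache.misses)  # -> 3
```

With a 60-120 second TTL (Phase 1 below), the same request pattern would hit the database at most once per minute or two instead of every 10 seconds per worker.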
## Impact
- `/admin/` endpoint latency: 5.6s average (target: <500ms)
- Database CPU: 130-158%
- I/O pressure from continuous full table scans
## Proposed Solution
### Phase 1: Configuration (Quick Win)
- Increase default `metrics_cache_ttl_seconds` from 10 to 60-120 seconds
- Add a Redis-backed cache option for distributed deployments
### Phase 2: Database Optimization
- Add a covering index for aggregation queries:

  ```sql
  CREATE INDEX idx_tool_metrics_agg_covering
  ON tool_metrics (timestamp)
  INCLUDE (is_success, response_time);
  ```

- Consider a materialized view for pre-computed aggregations
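The same covering index would likely help the other two metrics tables, which show the identical scan pattern. A small helper (hypothetical, not existing project code) could generate the DDL for the Alembic migration to apply; note that `INCLUDE` requires PostgreSQL 11+.

```python
def covering_index_ddl(table: str) -> str:
    """DDL for a covering index serving the aggregation query.

    The index name and INCLUDE column list are assumptions based on the
    columns the aggregation reads (is_success, response_time).
    """
    return (
        f"CREATE INDEX IF NOT EXISTS idx_{table}_agg_covering "
        f"ON {table} (timestamp) INCLUDE (is_success, response_time);"
    )

for table in ("tool_metrics", "resource_metrics", "prompt_metrics"):
    print(covering_index_ddl(table))
```

Including `is_success` and `response_time` in the index lets the planner satisfy the aggregation with an index-only scan instead of touching the heap.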
### Phase 3: Architecture (Optional)
- Partition metrics tables by time for efficient pruning
- Background job to refresh aggregation cache
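The background refresh job could be as simple as a daemon thread that recomputes aggregates on an interval, so request handlers only ever read a warm cache. A sketch under that assumption (names are hypothetical):

```python
import threading

def start_refresher(refresh, interval_seconds=60.0):
    """Run `refresh` on a fixed interval in a daemon thread.

    Returns the stop Event and the thread so the job can be shut down cleanly.
    """
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            refresh()  # recompute aggregates, repopulate the cache
            stop.wait(interval_seconds)  # sleep, but wake immediately on stop

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return stop, thread
```

With this in place, admin endpoints never trigger a synchronous full-table scan; worst case they serve an aggregate that is one interval stale.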
## Files to Modify
- `mcpgateway/config.py` - Increase default cache TTL
- `mcpgateway/cache/metrics_cache.py` - Add Redis backend option
- `mcpgateway/alembic/versions/` - Add covering index migration
- `docker-compose.yml` / `.env.example` - Document settings
## Related Issues
- [PERFORMANCE]: N+1 query pattern in `EmailTeam.get_member_count()` #1892 - N+1 in team counts (fixed)
- [PERFORMANCE]: Admin UI endpoints have high tail latency (5-10s p95) #1894 - Admin UI high latency (partially addressed)