Skip to content

Add circuit breaker to redis version cache#7536

Merged
Linker44 merged 6 commits intomainfrom
redis_version_cache_circuit_breaker
Mar 2, 2026
Merged

Add circuit breaker to redis version cache#7536
Linker44 merged 6 commits intomainfrom
redis_version_cache_circuit_breaker

Conversation

@Linker44
Copy link
Copy Markdown
Contributor

@Linker44 Linker44 commented Mar 2, 2026

Ticket

[TODO: fill in ticket number]

Description Of Changes

When Redis is unavailable, the redis_version_cached decorator's fallback logic works correctly (returns stale cache or calls the function directly), but each invocation still attempts a Redis connection before falling back. During startup, update_saas_configs called ConnectorRegistry.connector_types() and then ConnectorRegistry.get_connector_template() per connector type -- each of which called _get_combined_templates(), triggering a Redis version check via CustomConnectorTemplateLoader. This meant ~50+ redundant Redis calls in a loop, even though the Redis cache only serves custom templates (not the file-based ones). With the standalone Redis client having no socket timeouts (defaulting to None / block indefinitely), this could cause startup to hang for an extended period when Redis is down.

Code Changes

  • src/fides/api/util/cache.py: Added socket_connect_timeout=10.0 and socket_timeout=10.0 to the standalone Redis client constructor, matching the cluster client which already had these. Prevents indefinite blocking on a single Redis call.
  • src/fides/api/util/redis_version_cache.py: Added a circuit breaker to _get_redis_version and _bump_redis_version. After the first Redis failure, subsequent calls within a 30-second cooldown window raise immediately without contacting Redis. The circuit resets automatically after the cooldown expires or upon a successful Redis call.
  • src/fides/api/service/connectors/saas/connector_registry_service.py: Added get_all_templates() public method to ConnectorRegistry to allow callers to load all templates once.
  • src/fides/api/util/saas_config_updater.py: Changed update_saas_configs to call ConnectorRegistry.get_all_templates() once before the loop instead of calling connector_types() + get_connector_template() per iteration. This reduces Redis version checks from ~50+ to exactly 1.
  • tests/ops/util/test_redis_version_cache.py: Added TestCircuitBreaker class with 4 tests (circuit opens after failure, resets after success, closes after cooldown, prevents repeated calls in a startup loop). Updated the autouse fixture to reset circuit breaker state between tests.

Steps to Confirm

  1. Run the unit tests: pytest --no-cov tests/ops/util/test_redis_version_cache.py -v -- all 19 tests should pass (15 existing + 4 new)
  2. Run the connector registry tests to verify no regressions: pytest --no-cov tests/ops/service/connectors/test_connector_registry_service.py -v
  3. Run the connector template endpoint tests: pytest --no-cov tests/ops/api/v1/endpoints/test_connector_template_endpoints.py -v
  4. To verify the startup resilience manually: stop Redis (docker stop fides-redis), restart the fides server, and confirm it starts up without hanging. The redis_version_cache decorator should log a debug message about Redis being unavailable and fall back gracefully. Restart Redis afterward (docker start fides-redis) and confirm normal caching resumes.
  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration This indicates that a change includes a database migration label to the entry if your change includes a DB migration
    • Add a high-risk This issue suggests changes that have a high-probability of breaking existing code label to the entry if your change includes a high-risk change (i.e. potential for performance impact or unexpected regression) that should be flagged
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • All UX related changes have been reviewed by a designer
    • No UX review needed
  • Followup issues:
    • Followup issues created
    • No followup issues
  • Database migrations:
    • Ensure that your downrev is up to date with the latest revision on main
    • Ensure that your downgrade() migration is correct and works
      • If a downgrade migration is not possible for this change, please call this out in the PR description!
    • No migrations
  • Documentation:
    • Documentation complete, PR opened in fidesdocs
    • Documentation issue created in fidesdocs
    • If there are any new client scopes created as part of the pull request, remember to update public-facing documentation that references our scope registry
    • No documentation updates required

@Linker44 Linker44 requested a review from a team as a code owner March 2, 2026 14:29
@Linker44 Linker44 requested review from thabofletcher and removed request for a team March 2, 2026 14:29
@vercel
Copy link
Copy Markdown
Contributor

vercel bot commented Mar 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
fides-plus-nightly Ready Ready Preview, Comment Mar 2, 2026 5:48pm
1 Skipped Deployment
Project Deployment Actions Updated (UTC)
fides-privacy-center Ignored Ignored Mar 2, 2026 5:48pm

Request Review

@Linker44 Linker44 self-assigned this Mar 2, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR implements a circuit breaker pattern to prevent repeated slow timeouts when Redis is down, improving application resilience during Redis outages.

  • Added 10-second socket timeouts (socket_connect_timeout and socket_timeout) to Redis client configuration for both single-instance and cluster modes
  • Implemented a circuit breaker in redis_version_cache.py that opens after the first Redis failure and stays open for 30 seconds, preventing subsequent slow timeout attempts
  • The circuit breaker uses thread-safe global state (_redis_last_failure with _redis_failure_lock) and automatically resets after successful Redis calls
  • Added comprehensive test coverage in TestCircuitBreaker class covering all circuit breaker edge cases including startup loops, cooldown expiration, and reset behavior

The implementation is clean, well-documented, and solves a real production problem where Redis outages could cause cascading timeouts.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - it adds defensive resilience without changing core business logic
  • The implementation is solid with proper thread safety, comprehensive test coverage, and no breaking changes. The circuit breaker pattern is a well-established reliability pattern, and the code is clean and well-documented. Socket timeouts are applied consistently to both Redis configurations.
  • No files require special attention

Important Files Changed

Filename Overview
src/fides/api/util/cache.py Added 10-second socket timeouts to both Redis cluster and single-instance configurations to prevent indefinite hangs when Redis is unreachable
src/fides/api/util/redis_version_cache.py Implemented circuit breaker pattern with 30-second cooldown to prevent repeated slow timeouts during Redis outages, with proper thread-safe global state management
tests/ops/util/test_redis_version_cache.py Added comprehensive test coverage for circuit breaker functionality including open/close/reset behavior and startup loop scenario

Last reviewed commit: a316875

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@Linker44 Linker44 removed the request for review from thabofletcher March 2, 2026 14:51
@Linker44 Linker44 requested review from adamsachs and erosselli March 2, 2026 15:04
Copy link
Copy Markdown
Contributor

@adamsachs adamsachs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, i think this should hopefully help the problem we're seeing. but i'd at least recommend we fix a major gap in our logging that is preventing us from learning more about what's actually happening on our nightly environment! i've called that out in a comment

@Linker44 Linker44 added this pull request to the merge queue Mar 2, 2026
Merged via the queue into main with commit c625f38 Mar 2, 2026
57 checks passed
@Linker44 Linker44 deleted the redis_version_cache_circuit_breaker branch March 2, 2026 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants