
chore(ci): Improve reliability of retries in TracingE2ET #2018

Merged
phipag merged 8 commits into main from phipag/fix-tracing-e2e-timeouts on Aug 7, 2025

Conversation

@phipag (Contributor) commented Aug 6, 2025

Summary

This PR addresses two reliability issues in the TracingE2E tests, which were failing occasionally:

  1. The search horizon was too narrow: when a trace started at the end of minute x and some subsegments were populated in minute y, they were not found. I added a one-minute padding to the search horizon.
  2. After finding trace ids for the search, we immediately attempted to fetch the (sub-)segments of those traces without accounting for the fact that they are populated asynchronously, with a delay. I added retries for the second query for the sub-segments as well (see the sketch after this list). See #1846 (comment).
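
A minimal sketch of both fixes, assuming the AWS SDK for Java v2 X-Ray client plus the RetryUtils and DataNotReadyException test helpers shown later in this thread (the actual diff in this PR may differ):

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.List;
    import java.util.stream.Collectors;

    import software.amazon.awssdk.services.xray.XRayClient;
    import software.amazon.awssdk.services.xray.model.BatchGetTracesRequest;
    import software.amazon.awssdk.services.xray.model.GetTraceSummariesRequest;
    import software.amazon.awssdk.services.xray.model.Trace;
    import software.amazon.awssdk.services.xray.model.TraceSummary;

    public class TraceFetcherSketch {
        private static final XRayClient XRAY = XRayClient.create();

        public static List<Trace> fetchTraces(Instant start, Instant end, String filterExpression) {
            // Fix 1: pad the search horizon by one minute on each side so that
            // subsegments recorded just outside the raw invocation window are found.
            Instant paddedStart = start.minus(1, ChronoUnit.MINUTES);
            Instant paddedEnd = end.plus(1, ChronoUnit.MINUTES);

            List<String> traceIds = XRAY.getTraceSummaries(GetTraceSummariesRequest.builder()
                            .startTime(paddedStart)
                            .endTime(paddedEnd)
                            .filterExpression(filterExpression)
                            .build())
                    .traceSummaries().stream()
                    .map(TraceSummary::id)
                    .collect(Collectors.toList());

            // Fix 2: the (sub-)segments of a trace are populated asynchronously,
            // so retry the second query instead of fetching them only once.
            return RetryUtils.withRetry(() -> {
                List<Trace> traces = XRAY.batchGetTraces(BatchGetTracesRequest.builder()
                                .traceIds(traceIds)
                                .build())
                        .traces();
                if (traces.isEmpty()) {
                    throw new DataNotReadyException("Segments not yet available for traces " + traceIds);
                }
                return traces;
            }, "traceSegmentsRetry", DataNotReadyException.class).get();
        }
    }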

Also adds more debug logs and more useful logging statements to make debugging easier in the future.

Changes

Issue number: #1846


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.

@phipag self-assigned this Aug 6, 2025
@phipag (Contributor, Author) commented Aug 6, 2025

Ran too many tests in parallel. Will run them one by one.

@phipag (Contributor, Author) commented Aug 7, 2025

The tests are now failing because the AppConfig deployment strategy limit was exceeded after running too many tests in parallel. Will clean up and retry.

This AWS::AppConfig::DeploymentStrategy resource is in a CREATE_FAILED state.

Resource handler returned message: "You have exceeded your deployment strategy limit of 20. Your current count is 20. Go to AWS Service Quotas to request a limit increase."

@sonarqubecloud (bot) commented Aug 7, 2025

@phipag (Contributor, Author) commented Aug 7, 2025

I added another retry loop to the flaky MetricsE2ET, which now seems to pass reliably. I will run some more E2E tests sequentially now.

The issue was that the metricsFetcher.fetchMetrics call returned as soon as the metric was found, but it did not wait long enough for all datapoints to be populated in CloudWatch. In this case, we expect 2 datapoints, and sometimes the call returned after finding only the first one.

    // Retry until the expected value is present: fetchMetrics can return as soon
    // as the metric is found, before all datapoints are populated in CloudWatch.
    List<Double> orderMetrics = RetryUtils.withRetry(() -> {
        List<Double> metrics = metricsFetcher.fetchMetrics(invocationResult.getStart(), invocationResult.getEnd(),
                60, NAMESPACE, "orders", Collections.singletonMap("Environment", "test"));
        if (metrics.get(0) != 2.0) {
            // Not all datapoints are in CloudWatch yet; signal RetryUtils to try again.
            throw new DataNotReadyException("Expected 2.0 orders but got " + metrics.get(0));
        }
        return metrics;
    }, "orderMetricsRetry", DataNotReadyException.class).get();

@phipag (Contributor, Author) commented Aug 7, 2025

The last 6 consecutive runs of E2E tests succeeded. It looks like we have resolved all flaky tests with appropriate retry logic for now.

@dreamorosi (Contributor) left a comment

Great work with these tests!

@phipag (Contributor, Author) commented Aug 7, 2025

> Great work with these tests!

Thanks. This is the third time now that I think they are fixed. Let's see if the tests prove me wrong in the next couple of weeks 😁



Development

Successfully merging this pull request may close these issues.

Maintenance: Fix TracingE2E test to avoid occasional timeouts (#1846)
