
chore(ci): Improve reliability of retries in TracingE2ET #2018

Merged
phipag merged 8 commits into main from phipag/fix-tracing-e2e-timeouts on Aug 7, 2025

Conversation

@phipag (Contributor) commented Aug 6, 2025

Summary

This PR addresses two reliability issues in the TracingE2E tests, which were failing occasionally:

  1. The search horizon was too narrow: when a trace started at the end of minute x and some subsegments were populated in minute y, they were not found. I added a one-minute padding to the search horizon.
  2. After finding trace ids for the search, we immediately attempted to fetch the (sub-)segments of those traces without accounting for the fact that they are populated asynchronously, with a delay. I added retries for the second query for the sub-segments as well (see the sketch after this list). See #1846 (comment).
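
A minimal sketch of both fixes, assuming the AWS SDK for Java v2 X-Ray client plus the RetryUtils and DataNotReadyException test helpers shown later in this thread (the actual diff in this PR may differ):

    import java.time.Instant;
    import java.time.temporal.ChronoUnit;
    import java.util.List;
    import java.util.stream.Collectors;

    import software.amazon.awssdk.services.xray.XRayClient;
    import software.amazon.awssdk.services.xray.model.BatchGetTracesRequest;
    import software.amazon.awssdk.services.xray.model.GetTraceSummariesRequest;
    import software.amazon.awssdk.services.xray.model.Trace;
    import software.amazon.awssdk.services.xray.model.TraceSummary;

    public class TraceFetcherSketch {
        private static final XRayClient XRAY = XRayClient.create();

        public static List<Trace> fetchTraces(Instant start, Instant end, String filterExpression) {
            // Fix 1: pad the search horizon by one minute on each side so that
            // subsegments recorded just outside the raw invocation window are found.
            Instant paddedStart = start.minus(1, ChronoUnit.MINUTES);
            Instant paddedEnd = end.plus(1, ChronoUnit.MINUTES);

            List<String> traceIds = XRAY.getTraceSummaries(GetTraceSummariesRequest.builder()
                            .startTime(paddedStart)
                            .endTime(paddedEnd)
                            .filterExpression(filterExpression)
                            .build())
                    .traceSummaries().stream()
                    .map(TraceSummary::id)
                    .collect(Collectors.toList());

            // Fix 2: the (sub-)segments of a trace are populated asynchronously,
            // so retry the second query instead of fetching them only once.
            return RetryUtils.withRetry(() -> {
                List<Trace> traces = XRAY.batchGetTraces(BatchGetTracesRequest.builder()
                                .traceIds(traceIds)
                                .build())
                        .traces();
                if (traces.isEmpty()) {
                    throw new DataNotReadyException("Segments not yet available for traces " + traceIds);
                }
                return traces;
            }, "traceSegmentsRetry", DataNotReadyException.class).get();
        }
    }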

Also adds more debug logs and more useful logging statements to make debugging easier in the future.

Changes

Issue number: #1846


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.

@phipag self-assigned this Aug 6, 2025
@phipag (Contributor, Author) commented Aug 6, 2025

Ran too many tests in parallel. Will run them one by one.

@phipag (Contributor, Author) commented Aug 7, 2025

The tests are now failing because the AppConfig deployment strategy limit was exceeded after running too many tests in parallel. Will clean up and retry.

This AWS::AppConfig::DeploymentStrategy resource is in a CREATE_FAILED state.

Resource handler returned message: "You have exceeded your deployment strategy limit of 20. Your current count is 20. Go to AWS Service Quotas to request a limit increase."

@sonarqubecloud (bot) commented Aug 7, 2025

@phipag (Contributor, Author) commented Aug 7, 2025

I added another retry loop to the flaky MetricsE2ET, which now seems to pass reliably. I will run some more E2E tests sequentially now.

The issue was that the metricsFetcher.fetchMetrics call returned as soon as the metric was found, but it did not wait long enough for all datapoints to be populated in CloudWatch. In this case, we expect 2 datapoints, and sometimes the call returned after finding only the first one.

    // Retry until the expected value is present: fetchMetrics can return as soon
    // as the metric is found, before all datapoints are populated in CloudWatch.
    List<Double> orderMetrics = RetryUtils.withRetry(() -> {
        List<Double> metrics = metricsFetcher.fetchMetrics(invocationResult.getStart(), invocationResult.getEnd(),
                60, NAMESPACE, "orders", Collections.singletonMap("Environment", "test"));
        if (metrics.get(0) != 2.0) {
            // Not all datapoints are in CloudWatch yet; signal RetryUtils to try again.
            throw new DataNotReadyException("Expected 2.0 orders but got " + metrics.get(0));
        }
        return metrics;
    }, "orderMetricsRetry", DataNotReadyException.class).get();

@phipag (Contributor, Author) commented Aug 7, 2025

The last 6 consecutive runs of E2E tests succeeded. It looks like we have resolved all flaky tests with appropriate retry logic for now.

@dreamorosi (Contributor) left a comment

Great work with these tests!

@phipag (Contributor, Author) commented Aug 7, 2025

> Great work with these tests!

Thanks. This is the third time now that I think they are fixed. Let's see if the tests prove me wrong in the next couple of weeks 😁



Development

Successfully merging this pull request may close these issues.

Maintenance: Fix TracingE2E test to avoid occasional timeouts (#1846)
