Skip to content

KAFKA-17433 Add a deflake Github action #17019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 31, 2024

Conversation

mumrah
Copy link
Member

@mumrah mumrah commented Aug 27, 2024

This patch adds a "deflake" github action which can be used to run a single JUnit test or suites. It works by parameterizing the --tests Gradle option. If the test extends ClusterTest, the "deflake" workflow can repeat number of times by setting the kafka.cluster.test.repeat system property.

This can be done locally as well:

./gradlew -Pkafka.cluster.test.repeat=3 :core:test --tests "*ZkMigrationIntegrationTest*"

For local testing, IDEA also has options for repeating a test until failure.

@mumrah
Copy link
Member Author

mumrah commented Aug 27, 2024

An example of the new workflow https://github.com/mumrah/kafka/actions/runs/10582749491/job/29325023681

Screenshot of the action dispatch dialog
image

@mumrah mumrah requested a review from chia7712 August 27, 2024 21:42
@chia7712
Copy link
Member

@mumrah not sure whether we should encourage developers loop flaky on Github CI. The quota is limited and so it could impact the other flow (normal PR and CI). Also, README offers a simple command to loop tests (https://github.com/apache/kafka?tab=readme-ov-file#repeatedly-running-a-particular-unitintegration-test) on local.

@mumrah
Copy link
Member Author

mumrah commented Aug 28, 2024

@chia7712 thanks for the feedback.

The quota is limited and so it could impact the other flow (normal PR and CI)

Since this workflow is run manually, I think the impact would be limited. Also, as long as the caller isn't running a whole module's tests, it should only run of a few minutes. I've set a timeout of 1hr to the job to prevent using up too much run time.

Also, README offers a simple command

I didn't realize that :) I think this method of repeating a test is not actually very useful since it's just running the Gradle command over and over. Often times, flaky tests only appear when the system is under load. Invoking Gradle in a loop gives too much time for the system to "settle" in between runs.

This is why I normally use (and recommend) the IntelliJ "Run Until Failed" option while running tests in IntelliJ (not Gradle). It runs in a tight loop and puts some load on the system.

Even still, I've run into plenty of cases where a test is only failing in CI. Not having the ability to run a single test in CI makes it really hard to debug such cases. It essentially means each trial of your bugfix requires waiting for a full CI run. A workaround to this I've used in the past is to alter the Jenkinsfile to just run the tests I want.

I think a dedicated job for running a single test is a better option for developers.

Another benefit of having this new workflow is it gives us easily shareable evidence when submitting test fixes. Instead of a reviewer looking at a single CI run for pass/fail, the author can give a link to a 10x deflake run which gives stronger evidence of the fix.

Let me know what you think

@chia7712
Copy link
Member

Even still, I've run into plenty of cases where a test is only failing in CI. Not having the ability to run a single test in CI makes it really hard to debug such cases. It essentially means each trial of your bugfix requires waiting for a full CI run. A workaround to this I've used in the past is to alter the Jenkinsfile to just run the tests I want.

that is true

Another benefit of having this new workflow is it gives us easily shareable evidence when submitting test fixes. Instead of a reviewer looking at a single CI run for pass/fail, the author can give a link to a 10x deflake run which gives stronger evidence of the fix.

I love this :)

Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumrah thanks for this useful action!

required: true
type: string
test-repeat:
description: 'Number of times to invoke the test'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remind users that it works only for ClusterTest/ClusterTemplate/ClusterTests?

}
List<TestTemplateInvocationContext> repeatedContexts = new ArrayList<>(contexts.size() * count);
for (int i = 0; i < count; i++) {
repeatedContexts.addAll(contexts);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't create copy of contexts, so those contexts must a kind of immutable objects to avoid corruption caused by other run. Hence, could you add comments to ZkClusterInvocationContext and RaftClusterInvocationContext to highlight the requisite. Also, could you please move the clusterReference of ZkClusterInvocationContext to be a local variable to make ZkClusterInvocationContext be immutable.

https://github.com/apache/kafka/blob/trunk/core/src/test/java/kafka/test/junit/ZkClusterInvocationContext.java#L78

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I'll modify the annotation processing to create unique instances of the invocation contexts.

@mumrah mumrah added the build Gradle build or GitHub Actions label Aug 28, 2024
Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumrah thanks for this patch

@mumrah mumrah requested a review from chia7712 August 29, 2024 13:47
@mumrah
Copy link
Member Author

mumrah commented Aug 29, 2024

@chia7712 thanks for the reviews! I've incorporated your feedback and tested it locally. Seems to be working 👍

Copy link
Member

@chia7712 chia7712 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mumrah thanks for this patch!

@mumrah mumrah requested a review from chia7712 August 30, 2024 18:38
@chia7712 chia7712 merged commit 7005a1f into apache:trunk Aug 31, 2024
4 of 7 checks passed
@mumrah
Copy link
Member Author

mumrah commented Sep 2, 2024

I wrote a short guide on flaky tests on the Kafka wiki https://cwiki.apache.org/confluence/display/KAFKA/Flaky+Tests

bboyleonp666 pushed a commit to bboyleonp666/kafka that referenced this pull request Sep 4, 2024
This patch adds a "deflake" github action which can be used to run a single JUnit test or suites. It works by parameterizing the --tests Gradle option. If the test extends ClusterTest, the "deflake" workflow can repeat number of times by setting the kafka.cluster.test.repeat system property.

Reviewers: Chia-Ping Tsai <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Gradle build or GitHub Actions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants