KAFKA-17433 Add a deflake Github action #17019
Conversation
An example of the new workflow: https://github.com/mumrah/kafka/actions/runs/10582749491/job/29325023681
@mumrah I'm not sure whether we should encourage developers to loop flaky tests on GitHub CI. The quota is limited, so it could impact the other flows (normal PR and CI). Also,
@chia7712 thanks for the feedback.
Since this workflow is run manually, I think the impact would be limited. Also, as long as the caller isn't running a whole module's tests, it should only run for a few minutes. I've set a 1-hour timeout on the job to prevent it from using up too much run time.
I didn't realize that :) I think this method of repeating a test is not actually very useful, since it just runs the Gradle command over and over. Flaky tests often appear only when the system is under load, and invoking Gradle in a loop gives the system too much time to "settle" between runs. This is why I normally use (and recommend) the IntelliJ "Run Until Failed" option while running tests in IntelliJ (not Gradle). It runs in a tight loop and puts some load on the system.
Even so, I've run into plenty of cases where a test fails only in CI. Not being able to run a single test in CI makes such cases really hard to debug, because each trial of a bugfix requires waiting for a full CI run. A workaround I've used in the past is to alter the Jenkinsfile to run just the tests I want. I think a dedicated job for running a single test is a better option for developers.
Another benefit of this new workflow is that it gives us easily shareable evidence when submitting test fixes. Instead of a reviewer looking at a single CI run for pass/fail, the author can link to a 10x deflake run, which is stronger evidence that the fix works. Let me know what you think.
that is true
I love this :)
@mumrah thanks for this useful action!
  required: true
  type: string
test-repeat:
  description: 'Number of times to invoke the test'
Should we remind users that it works only for ClusterTest/ClusterTemplate/ClusterTests?
}
List<TestTemplateInvocationContext> repeatedContexts = new ArrayList<>(contexts.size() * count);
for (int i = 0; i < count; i++) {
    repeatedContexts.addAll(contexts);
We don't create copies of the contexts, so they must be effectively immutable to avoid corruption caused by other runs. Hence, could you add comments to ZkClusterInvocationContext and RaftClusterInvocationContext to highlight this requirement? Also, could you please change the clusterReference of ZkClusterInvocationContext into a local variable so that ZkClusterInvocationContext is immutable?
This is a good point. I'll modify the annotation processing to create unique instances of the invocation contexts.
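To make the concern in this thread concrete, here is a minimal sketch of the difference between repeating the same context instances and creating unique ones per repetition. It does not use the actual Kafka or JUnit classes: DemoContext, RepeatContexts, repeatShared, and repeatFresh are hypothetical names invented for illustration, and the mutable clusterReference field only loosely mirrors the field mentioned in the review. How the patch actually creates unique instances is not shown in this conversation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for an invocation context that holds per-run state.
final class DemoContext {
    private String clusterReference; // mutable state, set when a run starts

    void start(String clusterId) { this.clusterReference = clusterId; }
    String clusterReference() { return clusterReference; }
}

final class RepeatContexts {
    // Repeats by re-adding the same instances: every repetition aliases one object.
    static List<DemoContext> repeatShared(List<DemoContext> contexts, int count) {
        List<DemoContext> repeated = new ArrayList<>(contexts.size() * count);
        for (int i = 0; i < count; i++) {
            repeated.addAll(contexts);
        }
        return repeated;
    }

    // Repeats by asking a factory for new instances each time: no shared state.
    static List<DemoContext> repeatFresh(Supplier<List<DemoContext>> factory, int count) {
        List<DemoContext> repeated = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            repeated.addAll(factory.get());
        }
        return repeated;
    }

    public static void main(String[] args) {
        List<DemoContext> shared = repeatShared(List.of(new DemoContext()), 2);
        shared.get(0).start("run-1");
        // The second repetition already sees state left by the first one.
        System.out.println(shared.get(1).clusterReference()); // prints "run-1"

        List<DemoContext> fresh = repeatFresh(() -> List.of(new DemoContext()), 2);
        fresh.get(0).start("run-1");
        // Each repetition has its own instance, so this one is untouched.
        System.out.println(fresh.get(1).clusterReference()); // prints "null"
    }
}
```

With shared instances, the contexts are only safe to repeat if they are immutable, which is the requirement the review asks to document. With a factory, each repetition starts from a clean object regardless.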
@mumrah thanks for this patch
@chia7712 thanks for the reviews! I've incorporated your feedback and tested it locally. Seems to be working 👍
@mumrah thanks for this patch!
I wrote a short guide on flaky tests on the Kafka wiki: https://cwiki.apache.org/confluence/display/KAFKA/Flaky+Tests
This patch adds a "deflake" GitHub action which can be used to run a single JUnit test or suite. It works by parameterizing the --tests Gradle option. If the test extends ClusterTest, the "deflake" workflow can repeat it a number of times by setting the kafka.cluster.test.repeat system property.
Reviewers: Chia-Ping Tsai
This patch adds a "deflake" GitHub action which can be used to run a single JUnit test or suite. It works by parameterizing the --tests Gradle option. If the test extends ClusterTest, the "deflake" workflow can repeat it a number of times by setting the kafka.cluster.test.repeat system property.
This can be done locally as well:
For local testing, IDEA also has options for repeating a test until failure.
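As a rough illustration of the mechanism described above, the sketch below shows one way test code could read a repeat count from the kafka.cluster.test.repeat system property. Only the property name comes from this PR. The RepeatCount class, the fromSystemProperty method, the default of 1, and the clamping of values below 1 are all assumptions made for the example, not the patch's actual behavior.

```java
// Hypothetical helper: reads a repeat count from the system property named in the PR.
final class RepeatCount {
    static final String PROPERTY = "kafka.cluster.test.repeat";

    // Returns the configured count, or 1 when the property is unset,
    // not a valid integer, or less than 1 (assumed fallback behavior).
    static int fromSystemProperty() {
        int count = Integer.getInteger(PROPERTY, 1);
        return Math.max(count, 1);
    }

    public static void main(String[] args) {
        System.setProperty(PROPERTY, "3");
        System.out.println(fromSystemProperty()); // prints 3
    }
}
```

Falling back to a single run keeps ordinary test invocations unchanged when the property is absent.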