Add --skip-known-failures flag for smart rerun filtering #2094

deepssin · 2025-10-16T11:59:38Z

This commit implements a new --skip-known-failures flag that allows teuthology-suite to skip jobs with known failure patterns during reruns, only scheduling jobs with unknown failures.

Changes:

Add --skip-known-failures flag to scripts/suite.py with documentation
Add --known-failure-patterns option to specify custom patterns file (defaults to suite/patterns/known-failures.json)
Implement filter_out_known_failures() function in teuthology/suite/__init__.py
Modify get_rerun_conf_overrides() to use unknown failure filtering
Add early return when no unknown failures are found
Add teuthology/suite/patterns/known-failures.json with example failure patterns
Add known-failures.json to setup.cfg package_data for distribution
Fix seed type conversion issue (string to int)
Create docs/known_failure_patterns.rst with comprehensive documentation

The feature works by:

Loading known failure patterns from a JSON or YAML file (default: teuthology/suite/patterns/known-failures.json)
Checking each failed job's failure_reason against known patterns
Only scheduling jobs with failure reasons that don't match any known pattern
Early exit if no unknown failures are found

This provides a smart rerun mechanism that avoids rerunning jobs with known infrastructure issues while still catching new failures.
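
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in this PR: the function name `descriptions_to_rerun` and the sample jobs are made up for the example, and the patterns are passed in as a list instead of being loaded from a file.

```python
import re


def descriptions_to_rerun(failed_jobs, known_patterns):
    """Return descriptions of failed jobs whose failure_reason matches no known pattern.

    Hypothetical helper for illustration; not the function added by this PR.
    """
    compiled = [re.compile(p) for p in known_patterns]
    selected = []
    for job in failed_jobs:
        reason = job.get('failure_reason') or ''
        # A job counts as "known" as soon as one pattern is found in the reason.
        if any(rx.search(reason) for rx in compiled):
            continue
        selected.append(job['description'])
    return selected


# Invented sample data.
known_patterns = ["clocks not synchronized"]
failed_jobs = [
    {'description': 'rados/a', 'failure_reason': 'clocks not synchronized on mon.a'},
    {'description': 'rados/b', 'failure_reason': 'Segmentation fault in ceph-osd'},
]

to_rerun = descriptions_to_rerun(failed_jobs, known_patterns)
if not to_rerun:
    # Early exit: every failure is already known, so nothing is scheduled.
    print("nothing to rerun")
else:
    print(to_rerun)  # ['rados/b']
```

Passing the failed jobs in directly keeps the sketch independent of how the run data is fetched.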

kshtsk · 2025-10-29T21:37:26Z

scripts/suite.py

                              but can be overide if passed again.
                              This is important for tests involving random facet
                              (path ends with '$' operator).
+ --rerun-skip-known <bool>    Skip jobs with known failure patterns during rerun.


To my taste the option name is better without the rerun prefix and in command form, i.e. --skip-known-failures. Please also add a comment similar to the one for -R, saying it is to be used with --rerun only. The reference to the known patterns JSON does not make clear what it is or where it lives. I would suggest adding another option that takes the path to the known-patterns JSON or YAML file and defaults to a file bundled with the teuthology package.

kshtsk · 2025-10-29T21:41:21Z

teuthology/suite/patterns/known-failures.json

@@ -0,0 +1,7 @@
+{


This file should not be located in the root of teuthology; it is better to put it in a related directory. Wouldn't it be better to put it at teuthology/suite/patterns/known-failures.json and add it to the package_data section of setup.cfg (or later of pyproject)?

Also, the format of the patterns should be documented somewhere in docs.

kshtsk · 2025-10-29T21:50:20Z

teuthology/suite/__init__.py

+        log.info("Using backward-compatible job filtering for rerun with %d job descriptions", len(original_descriptions))
+
+
+def filter_unknown_failures(run, statuses, known_patterns_file='known_patterns.json'):


Since you're matching against known (not unknown) patterns, I would suggest reflecting that in the function names, so the signature is more readable:
def filter_out_known_failures(run, status, patterns_file=...)
It would also be great to describe each function argument.

kshtsk · 2025-10-29T21:51:14Z

teuthology/suite/__init__.py

+        with open(known_patterns_file, 'r') as f:
+            patterns_data = json.load(f)
+        known_patterns = patterns_data.get('patterns', [])


Just an idea: support both YAML and JSON.
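
One way to accept both formats is to try JSON first and fall back to YAML. The sketch below assumes PyYAML is available for the fallback path. The helper name `load_patterns` is hypothetical, and the error handling is deliberately minimal.

```python
import json
import os
import tempfile


def load_patterns(path):
    """Read the 'patterns' list from a JSON or YAML file (illustrative helper)."""
    with open(path) as f:
        text = f.read()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Assumes PyYAML is installed; imported lazily so that
        # JSON-only use does not require it.
        import yaml
        data = yaml.safe_load(text)
    if not isinstance(data, dict):
        return []
    patterns = data.get('patterns', [])
    return patterns if isinstance(patterns, list) else []


# Usage with a temporary JSON file.
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    json.dump({'patterns': ['clocks not synchronized']}, tmp)
loaded = load_patterns(tmp.name)
os.unlink(tmp.name)
print(loaded)
```

Reading the file once and parsing the text twice avoids having to rewind the file handle between the two attempts.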

kshtsk · 2025-10-29T21:52:44Z

teuthology/suite/__init__.py

+                unknown_jobs.append(job['description'])
+                log.debug(f"Job {job['description']} has unknown failure: {failure_reason}")
+
+    return unknown_jobs


May we have a unittest for this function?

kshtsk · 2025-10-29T21:55:24Z

teuthology/suite/__init__.py


-    conf.filter_in.extend(rerun_filters['descriptions'])
+    if getattr(conf, 'rerun_skip_known', False):
+        unknown_descriptions = filter_unknown_failures(run, conf.rerun_statuses)


This is confusing: what are "unknown descriptions"? All the descriptions are known.

kshtsk · 2025-10-29T22:01:56Z

teuthology/suite/__init__.py

+        log.warning(f"Could not load known patterns from {known_patterns_file}, treating all as unknown")
+        known_patterns = []
+
+    unknown_jobs = []


This list is not unknown_jobs; in your code it is job_descriptions.

kshtsk · 2025-10-29T22:22:11Z

teuthology/suite/__init__.py

+    for job in run['jobs']:
+        if job['status'] in statuses and job.get('description'):
+            failure_reason = job.get('failure_reason', '')


I think this function can be simplified for usability overall, so that it accepts a list of jobs instead of statuses. For example:

    # using a generator, select the jobs that match the rerun statuses
    rerun_statuses_jobs = (job for job in run['jobs']
                           if job['status'] in conf.rerun_statuses)
    # filter out known failures
    rerun_jobs = filter_out_known_failure(rerun_statuses_jobs)
    if rerun_jobs:
        conf.filter_in.extend(job['description'] for job in rerun_jobs)

By the way, does your code respect jobs that have actually passed?
Anyway, this function should have unit tests.
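
A unit test along these lines could cover the generator-based variant suggested above. This is a sketch under assumptions: `filter_out_known_failures` here is a stand-in that takes the patterns as a list rather than a file path, and the job dictionaries are invented for the test. It also shows that passed jobs never reach the filter when the statuses are applied first.

```python
import re
import unittest


def filter_out_known_failures(jobs, patterns):
    """Stand-in for illustration: keep jobs whose failure_reason matches no pattern."""
    compiled = [re.compile(p) for p in patterns]
    kept = []
    for job in jobs:
        reason = job.get('failure_reason')
        # Jobs without a reason cannot match a known pattern, so they are kept.
        if reason and any(rx.search(reason) for rx in compiled):
            continue
        kept.append(job)
    return kept


class TestFilterOutKnownFailures(unittest.TestCase):
    jobs = [
        {'description': 'a', 'status': 'fail', 'failure_reason': 'clocks not synchronized'},
        {'description': 'b', 'status': 'fail', 'failure_reason': 'unexpected crash'},
        {'description': 'c', 'status': 'pass'},
        {'description': 'd', 'status': 'dead'},
    ]
    patterns = ['clocks not synchronized']

    def rerun_descriptions(self, statuses):
        # Apply the status filter first, then drop the known failures.
        by_status = (job for job in self.jobs if job['status'] in statuses)
        kept = filter_out_known_failures(by_status, self.patterns)
        return [job['description'] for job in kept]

    def test_known_failure_is_skipped(self):
        self.assertEqual(self.rerun_descriptions(['fail']), ['b'])

    def test_passed_jobs_are_not_rerun(self):
        self.assertNotIn('c', self.rerun_descriptions(['fail', 'dead']))

    def test_job_without_reason_is_kept(self):
        self.assertEqual(self.rerun_descriptions(['dead']), ['d'])
```

The stand-in accepts any iterable, so the same tests work whether the caller passes a list or a generator.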

kshtsk

See comments above.

deepssin · 2025-11-10T14:13:16Z

I have addressed the comments and made changes, PTAL @kshtsk

kshtsk · 2025-11-12T20:29:51Z

scripts/suite.py

+                              file [default: false].
+ --known-patterns-file <path> Path to known failure patterns file (JSON or YAML).
+                              Defaults to bundled known-failures.json in teuthology
+                              package. To be used with --skip-known-failures.


Don't you want to show the default here, to give the user a clue about which file will be used?

kshtsk · 2025-11-12T20:31:31Z

scripts/suite.py

+                              To be used with --rerun only. Only rerun jobs with
+                              unknown failures by checking against known patterns
+                              file [default: false].
+ --known-patterns-file <path> Path to known failure patterns file (JSON or YAML).


Hm, maybe --known-failure-patterns is better, because it is easy to guess which options it belongs to.

kshtsk · 2025-11-12T20:40:16Z

docs/siteconfig.rst

+
+Known Failure Patterns File Format
+==================================
+
+When using the ``--skip-known-failures`` option with ``teuthology-suite --rerun``,
+teuthology can filter out jobs with known failure patterns. The known patterns are
+loaded from a JSON or YAML file specified with ``--known-patterns-file``, or from
+the default bundled file ``teuthology/suite/patterns/known-failures.json``.
+
+The file format supports both JSON and YAML. The file must contain a ``patterns``
+key with a list of regex patterns (strings) that will be matched against job
+failure reasons.
+
+JSON format example::
+
+    {
+        "patterns": [
+            "Command failed on .* with status 1:",
+            "clocks not synchronized",
+            "cluster \\[WRN\\] Health check failed:.*OBJECT_UNFOUND"
+        ]
+    }
+
+YAML format example::
+
+    patterns:
+      - "Command failed on .* with status 1:"
+      - "clocks not synchronized"
+      - "cluster \\[WRN\\] Health check failed:.*OBJECT_UNFOUND"
+
+Patterns are matched using Python's ``re.search()`` function, so they support
+full regular expression syntax. If a job's ``failure_reason`` matches any of
+the patterns, it is considered a known failure and will be skipped during
+rerun when ``--skip-known-failures`` is enabled.
+
+Only jobs with failure reasons that don't match any known pattern will be
+scheduled for rerun.


Thanks for adding this, but siteconfig is only for describing the format of the teuthology.yaml config file. We need to find a better place or dedicate a separate file to it.
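
The matching rule in the quoted documentation can be checked directly with `re.search`. The following is a small illustration with made-up failure reasons. Because `search` looks for the pattern anywhere in the string, patterns need no leading `.*`, and literal brackets such as `[WRN]` must be escaped. The raw strings below are what the JSON and YAML examples decode to.

```python
import re

patterns = [
    r"Command failed on .* with status 1:",
    r"clocks not synchronized",
    r"cluster \[WRN\] Health check failed:.*OBJECT_UNFOUND",
]


def is_known(failure_reason):
    """True if any pattern is found anywhere in the failure reason."""
    return any(re.search(p, failure_reason) for p in patterns)


# Invented failure reasons, for illustration only.
print(is_known("Command failed on host01 with status 1: 'sudo true'"))   # True
print(is_known('"cluster [WRN] Health check failed: 1 OBJECT_UNFOUND" in cluster log'))  # True
print(is_known("Command failed on host01 with status 124: 'timeout'"))   # False
```

The last case does not match because the pattern requires the colon directly after `status 1`.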

kshtsk · 2025-11-12T20:46:30Z

teuthology/suite/__init__.py

+    """Load known failure patterns from a JSON or YAML file.
+    
+    Args:
+        known_patterns_file: Path to the patterns file. If None, uses the default
+                           bundled file.
+    
+    Returns:
+        List of regex patterns (strings) to match against failure reasons.
+    
+    The file format should be:
+        JSON: {"patterns": ["pattern1", "pattern2", ...]}
+        YAML: patterns: ["pattern1", "pattern2", ...]


Thanks for adding a docstring for the function, but could you please stick to the conventions used in teuthology: use :param param_name: for parameter descriptions, :returns: for what the function returns, and so on. For examples, refer to other functions in the teuthology code base.
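
For reference, the field-list style mentioned here looks roughly like this. The function body is a placeholder and the name `load_known_patterns` is only illustrative; the point is the `:param:` / `:returns:` layout.

```python
def load_known_patterns(path=None):
    """
    Load known failure patterns from a JSON or YAML file.

    :param path: path to the patterns file; if None, a bundled default is used
    :returns:    list of regex pattern strings to match against failure reasons
    """
    # Placeholder body: the real loading logic is out of scope for this sketch.
    return []
```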

kshtsk · 2025-11-12T20:49:28Z

teuthology/suite/__init__.py

        return

-    conf.filter_in.extend(rerun_filters['descriptions'])
+    if getattr(conf, 'skip_known_failures', False):


Why is getattr used here instead of simply conf.skip_known_failures?

kshtsk · 2025-11-12T20:49:34Z

teuthology/suite/__init__.py

+    return os.path.join(os.path.dirname(__file__), 'patterns', 'known-failures.json')
+
+
+def _load_known_patterns(known_patterns_file):


Could we have shorter parameter names? I guess just path would be enough. It would also be great to use typing in signatures (cf. https://docs.python.org/3/library/typing.html)

I can add type hints, but that would be inconsistent with the rest of the file. IMO we should have a separate PR for adding annotations.

kshtsk · 2025-11-12T20:51:53Z

teuthology/suite/__init__.py

+    if known_patterns_file is None:
+        known_patterns_file = _get_default_known_patterns_file()


Wouldn't it be better to set the default value in the docopt of the teuthology-suite script?

kshtsk · 2025-11-12T20:52:27Z

teuthology/suite/__init__.py

+            try:
+                patterns_data = json.load(f)
+            except json.JSONDecodeError:
+                # if JSON fails, try YAML


It is obvious, no need to explain.

kshtsk · 2025-11-12T20:52:50Z

teuthology/suite/__init__.py

+
+    try:
+        with open(known_patterns_file, 'r') as f:
+            # try JSON first


It is obvious, no need to explain.

kshtsk · 2025-11-12T20:56:43Z

teuthology/suite/__init__.py

+        if not isinstance(known_patterns, list):
+            log.warning(f"Patterns in {known_patterns_file} should be a list, treating as empty")
+            return []
+        return known_patterns


It is always best to use straightforward logic:

    if isinstance(known_patterns, list):
        return known_patterns
    else:
        log.warning(f"Patterns...")

I used an early return here to keep the happy path flat and avoid extra nesting. It’s functionally the same, just a guard-clause style for readability. Happy to switch to an explicit if/else if you prefer.

kshtsk · 2025-11-12T21:00:45Z

teuthology/suite/__init__.py

+        if not job.get('description'):
+            continue
+
+        failure_reason = job.get('failure_reason', '')


Why should we try to search in an empty string below? Wouldn't it be better to return None here and run re.search only when there is text?

kshtsk · 2025-11-12T21:04:22Z

teuthology/suite/__init__.py

+    """Filter out jobs with known failure patterns.
+    
+    Args:
+        jobs: Iterable of job dictionaries with 'failure_reason' and 'description' keys.
+              Can be a list, generator, or any iterable.
+        known_patterns_file: Optional path to known patterns file. If None, uses
+                           the default bundled file.
+    
+    Returns:
+        List of job dictionaries for jobs with unknown failures (i.e., failures
+        that don't match any known pattern). Only includes jobs that were passed
+        in the input and have descriptions.
+    
+    The known patterns file format is documented in docs/siteconfig.rst.
+    """


Same as for the function above, would be great to follow teuthology practice of docstring formatting.

kshtsk · 2025-11-12T21:10:28Z

Have addressed comments -- made changes , PTAL @kshtsk

Added new comments, and... the PR title and description need to be updated as well.

This commit implements a new --skip-known-failures flag that allows teuthology-suite to skip jobs with known failure patterns during reruns, only scheduling jobs with unknown failures.

Changes:
- Add --skip-known-failures flag to scripts/suite.py with documentation
- Add --known-failure-patterns option for custom patterns file (defaults to suite/patterns/known-failures.json)
- Implement filter_out_known_failures() function in teuthology/suite/__init__.py
- Support both JSON and YAML pattern file formats
- Modify get_rerun_conf_overrides() to use unknown failure filtering
- Add early return when no unknown failures are found
- Add teuthology/suite/patterns/known-failures.json with example failure patterns
- Add known-failures.json to setup.cfg package_data section
- Create docs/known_failure_patterns.rst with comprehensive documentation
- Fix seed type conversion issue (string to int)

The feature works by:
1. Loading known failure patterns from teuthology/suite/patterns/known-failures.json (or custom file)
2. Checking each failed job's failure_reason against known patterns
3. Only scheduling jobs with failure reasons that don't match any known pattern
4. Early exit if no unknown failures are found

This provides a smart rerun mechanism that avoids rerunning jobs with known infrastructure issues while still catching new failures.

Signed-off-by: deepssin <[email protected]>

deepssin · 2025-11-20T09:46:43Z

Have addressed comments -- made changes , PTAL @kshtsk

Added new comments, and... the PR title and description need to be updated as well.

Updated with changes.

kshtsk · 2025-11-22T00:54:50Z

scripts/suite.py

                            config.get_ceph_qa_suite_git_url()),
    default_ceph_branch=defaults('--ceph-branch', 'main'),
    default_job_threshold=config.job_threshold,
+    default_known_failure_patterns=_get_default_known_failure_patterns_file(),


This looks great, but do you really want to always provide a custom pattern file as an argument? Maybe it would be better to set it in the teuthology config file, or possibly via an environment variable. If so, we do not need to import _get_default_known_failure_patterns_file directly; we can get the value using defaults or from the config.
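
One possible resolution order for the patterns path, sketched under assumptions: an explicit command-line value wins, then an environment variable, then a config value, then the bundled file. The variable name `TEUTHOLOGY_KNOWN_FAILURE_PATTERNS`, the config key `known_failure_patterns`, and the helper itself are hypothetical; none of them are confirmed by this PR.

```python
import os

# Bundled default relative to the repository root, as described in the PR.
BUNDLED_DEFAULT = os.path.join('teuthology', 'suite', 'patterns', 'known-failures.json')


def resolve_patterns_path(cli_value=None, config=None, environ=None):
    """Pick the patterns file: CLI > environment > config > bundled default.

    Hypothetical helper; the env var and config key names are assumptions.
    """
    environ = os.environ if environ is None else environ
    config = config or {}
    return (
        cli_value
        or environ.get('TEUTHOLOGY_KNOWN_FAILURE_PATTERNS')
        or config.get('known_failure_patterns')
        or BUNDLED_DEFAULT
    )


print(resolve_patterns_path(cli_value='/tmp/mine.yaml'))
print(resolve_patterns_path(config={'known_failure_patterns': '/etc/patterns.json'}, environ={}))
print(resolve_patterns_path(environ={}))
```

Passing `environ` and `config` as arguments keeps the helper easy to test without touching the real process environment.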

deepssin requested a review from a team as a code owner October 16, 2025 11:59

deepssin requested review from VallariAg, kamoltat, kshtsk and zmc and removed request for a team October 16, 2025 11:59

kshtsk reviewed Oct 29, 2025

kshtsk reviewed Oct 29, 2025

kshtsk reviewed Oct 29, 2025

kshtsk requested changes Oct 29, 2025

deepssin force-pushed the rerun-unknown branch from 8604a4c to 365f741 Compare November 10, 2025 14:12

kshtsk reviewed Nov 12, 2025

deepssin changed the title ~~Add --rerun-skip-known flag for smart rerun filtering~~ Add --skip-known-failures flag for smart rerun filtering Nov 19, 2025

deepssin force-pushed the rerun-unknown branch from 365f741 to 85bae47 Compare November 19, 2025 12:49

kshtsk reviewed Nov 22, 2025

kshtsk approved these changes Nov 22, 2025
