Skip to content

Conversation

@deepssin
Copy link
Contributor

@deepssin deepssin commented Oct 16, 2025

This commit implements a new --skip-known-failures flag that allows teuthology-suite to skip jobs with known failure patterns during reruns, only scheduling jobs with unknown failures.

Changes:

  • Add --skip-known-failures flag to scripts/suite.py with documentation
  • Add --known-failure-patterns option to specify custom patterns file (defaults to suite/patterns/known-failures.json)
  • Implement filter_out_known_failures() function in teuthology/suite/init.py
  • Modify get_rerun_conf_overrides() to use unknown failure filtering
  • Add early return when no unknown failures are found
  • Add teuthology/suite/patterns/known-failures.json with example failure patterns
  • Add known-failures.json to setup.cfg package_data for distribution
  • Fix seed type conversion issue (string to int)
  • Create docs/known_failure_patterns.rst with comprehensive documentation

The feature works by:

  1. Loading known failure patterns from a JSON or YAML file (default: teuthology/suite/patterns/known-failures.json)
  2. Checking each failed job's failure_reason against known patterns
  3. Only scheduling jobs with failure reasons that don't match any known pattern
  4. Early exit if no unknown failures are found

This provides a smart rerun mechanism that avoids rerunning jobs with known infrastructure issues while still catching new failures.

@deepssin deepssin requested a review from a team as a code owner October 16, 2025 11:59
@deepssin deepssin requested review from VallariAg, kamoltat, kshtsk and zmc and removed request for a team October 16, 2025 11:59
scripts/suite.py Outdated
but can be overide if passed again.
This is important for tests involving random facet
(path ends with '$' operator).
--rerun-skip-known <bool> Skip jobs with known failure patterns during rerun.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my test the option name is better without rerun prefix and in command form, i.e. --skip-known-failures, and add similar comment as for -R like to be used with --rerun only. Referring to known patterns json is not clear what is it and where is it. I would probably suggest to add another option with which there can be provided the path to the the know-patterns json or yaml, defaults to some file bundled with teuthology package.

@@ -0,0 +1,7 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should not be located in the root of teuthology, but better to put it in the related directory. Isn't it better to put it to teuthology/suite/patterns/known-failures.json and add it to setup.cfg (or later to pyproject) to package_data section?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the format of the patterns should be documented somewhere in docs.

log.info("Using backward-compatible job filtering for rerun with %d job descriptions", len(original_descriptions))


def filter_unknown_failures(run, statuses, known_patterns_file='known_patterns.json'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're using known (not unknown patterns) I would suggest to use it naming functions, so the signature is more readable:
def filter_out_known_failures(run, status, patterns_file=...)
Also would be great to give description for each function argument.

Comment on lines 227 to 264
with open(known_patterns_file, 'r') as f:
patterns_data = json.load(f)
known_patterns = patterns_data.get('patterns', [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea, to support both, yaml and json.

unknown_jobs.append(job['description'])
log.debug(f"Job {job['description']} has unknown failure: {failure_reason}")

return unknown_jobs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May we have a unittest for this function?


conf.filter_in.extend(rerun_filters['descriptions'])
if getattr(conf, 'rerun_skip_known', False):
unknown_descriptions = filter_unknown_failures(run, conf.rerun_statuses)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes misunderstandings, what are the unknowns descriptions? All the descriptions are known.

log.warning(f"Could not load known patterns from {known_patterns_file}, treating all as unknown")
known_patterns = []

unknown_jobs = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list is not an unknown_jobs, it is a job_descrtions in your code.

Comment on lines 235 to 237
for job in run['jobs']:
if job['status'] in statuses and job.get('description'):
failure_reason = job.get('failure_reason', '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function overall can be simplified for usability, so you it can accept job list instead of statuses, for example, check it out:

# using generator filter by rerun statuses jobs
rerun_statuses_jobs = (job for job in run['jobs'] if job['status'] in conf.rerun_statuses)
# filter out known failures
rerun_jobs = filter_out_known_failure(rerun_statuses_jobs)
if rerun_jobs:
  conf.filter_in.extend(job['description'] for job in rerun_jobs)

Note, btw, does your code respect jobs which are actually passed?
Anyways, this function should have unit tests.

Copy link
Contributor

@kshtsk kshtsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments above.

@deepssin
Copy link
Contributor Author

Have addressed comments -- made changes , PTAL @kshtsk

scripts/suite.py Outdated
file [default: false].
--known-patterns-file <path> Path to known failure patterns file (JSON or YAML).
Defaults to bundled known-failures.json in teuthology
package. To be used with --skip-known-failures.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to provide defaults here to give a user clue where to see what will be used?

scripts/suite.py Outdated
To be used with --rerun only. Only rerun jobs with
unknown failures by checking against known patterns
file [default: false].
--known-patterns-file <path> Path to known failure patterns file (JSON or YAML).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, maybe --known-failure-patterns is better, because it is easy to guess which options it belongs to.

Comment on lines 261 to 297

Known Failure Patterns File Format
==================================

When using the ``--skip-known-failures`` option with ``teuthology-suite --rerun``,
teuthology can filter out jobs with known failure patterns. The known patterns are
loaded from a JSON or YAML file specified with ``--known-patterns-file``, or from
the default bundled file ``teuthology/suite/patterns/known-failures.json``.

The file format supports both JSON and YAML. The file must contain a ``patterns``
key with a list of regex patterns (strings) that will be matched against job
failure reasons.

JSON format example::

{
"patterns": [
"Command failed on .* with status 1:",
"clocks not synchronized",
"cluster \\[WRN\\] Health check failed:.*OBJECT_UNFOUND"
]
}

YAML format example::

patterns:
- "Command failed on .* with status 1:"
- "clocks not synchronized"
- "cluster \\[WRN\\] Health check failed:.*OBJECT_UNFOUND"

Patterns are matched using Python's ``re.search()`` function, so they support
full regular expression syntax. If a job's ``failure_reason`` matches any of
the patterns, it is considered a known failure and will be skipped during
rerun when ``--skip-known-failures`` is enabled.

Only jobs with failure reasons that don't match any known pattern will be
scheduled for rerun.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this, but siteconfig is only for description of teuthology.yaml config file format hints. We need to find better place or dedicate a separate file.

Comment on lines 238 to 249
"""Load known failure patterns from a JSON or YAML file.
Args:
known_patterns_file: Path to the patterns file. If None, uses the default
bundled file.
Returns:
List of regex patterns (strings) to match against failure reasons.
The file format should be:
JSON: {"patterns": ["pattern1", "pattern2", ...]}
YAML: patterns: ["pattern1", "pattern2", ...]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding docstring for the function, but could you please stick to the conventions being used with teuthology, like use :param param_name: param description and :returns: descriptions what returns etc. For example, refer other functions in teuthology code base.

return

conf.filter_in.extend(rerun_filters['descriptions'])
if getattr(conf, 'skip_known_failures', False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it used getattr here instead simply conf.skip_known_failures?

return os.path.join(os.path.dirname(__file__), 'patterns', 'known-failures.json')


def _load_known_patterns(known_patterns_file):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May we have short parameters' naming. I guess just path would be enough. It will also be great to using typing in signatures (c.f. https://docs.python.org/3/library/typing.html)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add type hints but that would be inconsistent to the file , IMO we should have a separate PR for adding annotations.

Comment on lines 251 to 252
if known_patterns_file is None:
known_patterns_file = _get_default_known_patterns_file()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't it be better to set the default value in docopt of the teuthology-suite script.

try:
patterns_data = json.load(f)
except json.JSONDecodeError:
# if JSON fails, try YAML
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is obvious, no need to explain.


try:
with open(known_patterns_file, 'r') as f:
# try JSON first
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is obvious, no need to explain.

Comment on lines 265 to 268
if not isinstance(known_patterns, list):
log.warning(f"Patterns in {known_patterns_file} should be a list, treating as empty")
return []
return known_patterns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is always the best to use straight logic:

if isinstance(known_patterns, list):
  return known_patterns
else:
  log.warning(f"Patterns...")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used early-return here to keep the happy path flat and avoid extra nesting. It’s functionally the same, just a guard-clause style for readability. Happy to switch to an explicit if/else if you prefer

if not job.get('description'):
continue

failure_reason = job.get('failure_reason', '')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we try to search in empty string below? Wouldn't it better to return None here and do re.search in the text?

Comment on lines 278 to 284
"""Filter out jobs with known failure patterns.
Args:
jobs: Iterable of job dictionaries with 'failure_reason' and 'description' keys.
Can be a list, generator, or any iterable.
known_patterns_file: Optional path to known patterns file. If None, uses
the default bundled file.
Returns:
List of job dictionaries for jobs with unknown failures (i.e., failures
that don't match any known pattern). Only includes jobs that were passed
in the input and have descriptions.
The known patterns file format is documented in docs/siteconfig.rst.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as for the function above, would be great to follow teuthology practice of docstring formatting.

@kshtsk
Copy link
Contributor

kshtsk commented Nov 12, 2025

Have addressed comments -- made changes , PTAL @kshtsk

Added new comments, and... the PR title and description need to be updated as well.

@deepssin deepssin changed the title Add --rerun-skip-known flag for smart rerun filtering Add --skip-known-failures flag for smart rerun filtering Nov 19, 2025
This commit implements a new --skip-known-failures flag that allows
teuthology-suite to skip jobs with known failure patterns during
reruns, only scheduling jobs with unknown failures.

Changes:
- Add --skip-known-failures flag to scripts/suite.py with documentation
- Add --known-failure-patterns option for custom patterns file (defaults to suite/patterns/known-failures.json)
- Implement filter_out_known_failures() function in teuthology/suite/__init__.py
- Support both JSON and YAML pattern file formats
- Modify get_rerun_conf_overrides() to use unknown failure filtering
- Add early return when no unknown failures are found
- Add teuthology/suite/patterns/known-failures.json with example failure patterns
- Add known-failures.json to setup.cfg package_data section
- Create docs/known_failure_patterns.rst with comprehensive documentation
- Fix seed type conversion issue (string to int)

The feature works by:
1. Loading known failure patterns from teuthology/suite/patterns/known-failures.json (or custom file)
2. Checking each failed job's failure_reason against known patterns
3. Only scheduling jobs with failure reasons that don't match any known pattern
4. Early exit if no unknown failures are found

This provides a smart rerun mechanism that avoids rerunning jobs
with known infrastructure issues while still catching new failures.

Signed-off-by: deepssin <[email protected]>
@deepssin
Copy link
Contributor Author

Have addressed comments -- made changes , PTAL @kshtsk

Added new comments, and... the PR title and description need to be updated as well.

Updated with changes.

config.get_ceph_qa_suite_git_url()),
default_ceph_branch=defaults('--ceph-branch', 'main'),
default_job_threshold=config.job_threshold,
default_known_failure_patterns=_get_default_known_failure_patterns_file(),
Copy link
Contributor

@kshtsk kshtsk Nov 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, but look, do you really want to always provide custom pattern file as an argument, or maybe it would be great to fix it in teuthology config file or possible via environment variable. If so, then we do not need to import _get_default_known_failure_patterns_file directly, and get using defaults or get it from the config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants