-
Notifications
You must be signed in to change notification settings - Fork 306
Add --skip-known-failures flag for smart rerun filtering #2094
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
scripts/suite.py
Outdated
| but can be overide if passed again. | ||
| This is important for tests involving random facet | ||
| (path ends with '$' operator). | ||
| --rerun-skip-known <bool> Skip jobs with known failure patterns during rerun. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my test the option name is better without rerun prefix and in command form, i.e. --skip-known-failures, and add similar comment as for -R like to be used with --rerun only. Referring to known patterns json is not clear what is it and where is it. I would probably suggest to add another option with which there can be provided the path to the the know-patterns json or yaml, defaults to some file bundled with teuthology package.
| @@ -0,0 +1,7 @@ | |||
| { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should not be located in the root of teuthology, but better to put it in the related directory. Isn't it better to put it to teuthology/suite/patterns/known-failures.json and add it to setup.cfg (or later to pyproject) to package_data section?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the format of the patterns should be documented somewhere in docs.
teuthology/suite/__init__.py
Outdated
| log.info("Using backward-compatible job filtering for rerun with %d job descriptions", len(original_descriptions)) | ||
|
|
||
|
|
||
| def filter_unknown_failures(run, statuses, known_patterns_file='known_patterns.json'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're using known (not unknown patterns) I would suggest to use it naming functions, so the signature is more readable:
def filter_out_known_failures(run, status, patterns_file=...)
Also would be great to give description for each function argument.
teuthology/suite/__init__.py
Outdated
| with open(known_patterns_file, 'r') as f: | ||
| patterns_data = json.load(f) | ||
| known_patterns = patterns_data.get('patterns', []) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just an idea, to support both, yaml and json.
teuthology/suite/__init__.py
Outdated
| unknown_jobs.append(job['description']) | ||
| log.debug(f"Job {job['description']} has unknown failure: {failure_reason}") | ||
|
|
||
| return unknown_jobs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May we have a unittest for this function?
teuthology/suite/__init__.py
Outdated
|
|
||
| conf.filter_in.extend(rerun_filters['descriptions']) | ||
| if getattr(conf, 'rerun_skip_known', False): | ||
| unknown_descriptions = filter_unknown_failures(run, conf.rerun_statuses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes misunderstandings, what are the unknowns descriptions? All the descriptions are known.
teuthology/suite/__init__.py
Outdated
| log.warning(f"Could not load known patterns from {known_patterns_file}, treating all as unknown") | ||
| known_patterns = [] | ||
|
|
||
| unknown_jobs = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This list is not an unknown_jobs, it is a job_descrtions in your code.
teuthology/suite/__init__.py
Outdated
| for job in run['jobs']: | ||
| if job['status'] in statuses and job.get('description'): | ||
| failure_reason = job.get('failure_reason', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function overall can be simplified for usability, so you it can accept job list instead of statuses, for example, check it out:
# using generator filter by rerun statuses jobs
rerun_statuses_jobs = (job for job in run['jobs'] if job['status'] in conf.rerun_statuses)
# filter out known failures
rerun_jobs = filter_out_known_failure(rerun_statuses_jobs)
if rerun_jobs:
conf.filter_in.extend(job['description'] for job in rerun_jobs)
Note, btw, does your code respect jobs which are actually passed?
Anyways, this function should have unit tests.
kshtsk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments above.
8604a4c to
365f741
Compare
|
Have addressed comments -- made changes , PTAL @kshtsk |
scripts/suite.py
Outdated
| file [default: false]. | ||
| --known-patterns-file <path> Path to known failure patterns file (JSON or YAML). | ||
| Defaults to bundled known-failures.json in teuthology | ||
| package. To be used with --skip-known-failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you want to provide defaults here to give a user clue where to see what will be used?
scripts/suite.py
Outdated
| To be used with --rerun only. Only rerun jobs with | ||
| unknown failures by checking against known patterns | ||
| file [default: false]. | ||
| --known-patterns-file <path> Path to known failure patterns file (JSON or YAML). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, maybe --known-failure-patterns is better, because it is easy to guess which options it belongs to.
docs/siteconfig.rst
Outdated
|
|
||
| Known Failure Patterns File Format | ||
| ================================== | ||
|
|
||
| When using the ``--skip-known-failures`` option with ``teuthology-suite --rerun``, | ||
| teuthology can filter out jobs with known failure patterns. The known patterns are | ||
| loaded from a JSON or YAML file specified with ``--known-patterns-file``, or from | ||
| the default bundled file ``teuthology/suite/patterns/known-failures.json``. | ||
|
|
||
| The file format supports both JSON and YAML. The file must contain a ``patterns`` | ||
| key with a list of regex patterns (strings) that will be matched against job | ||
| failure reasons. | ||
|
|
||
| JSON format example:: | ||
|
|
||
| { | ||
| "patterns": [ | ||
| "Command failed on .* with status 1:", | ||
| "clocks not synchronized", | ||
| "cluster \\[WRN\\] Health check failed:.*OBJECT_UNFOUND" | ||
| ] | ||
| } | ||
|
|
||
| YAML format example:: | ||
|
|
||
| patterns: | ||
| - "Command failed on .* with status 1:" | ||
| - "clocks not synchronized" | ||
| - "cluster \\[WRN\\] Health check failed:.*OBJECT_UNFOUND" | ||
|
|
||
| Patterns are matched using Python's ``re.search()`` function, so they support | ||
| full regular expression syntax. If a job's ``failure_reason`` matches any of | ||
| the patterns, it is considered a known failure and will be skipped during | ||
| rerun when ``--skip-known-failures`` is enabled. | ||
|
|
||
| Only jobs with failure reasons that don't match any known pattern will be | ||
| scheduled for rerun. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this, but siteconfig is only for description of teuthology.yaml config file format hints. We need to find better place or dedicate a separate file.
teuthology/suite/__init__.py
Outdated
| """Load known failure patterns from a JSON or YAML file. | ||
| Args: | ||
| known_patterns_file: Path to the patterns file. If None, uses the default | ||
| bundled file. | ||
| Returns: | ||
| List of regex patterns (strings) to match against failure reasons. | ||
| The file format should be: | ||
| JSON: {"patterns": ["pattern1", "pattern2", ...]} | ||
| YAML: patterns: ["pattern1", "pattern2", ...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding docstring for the function, but could you please stick to the conventions being used with teuthology, like use :param param_name: param description and :returns: descriptions what returns etc. For example, refer other functions in teuthology code base.
teuthology/suite/__init__.py
Outdated
| return | ||
|
|
||
| conf.filter_in.extend(rerun_filters['descriptions']) | ||
| if getattr(conf, 'skip_known_failures', False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it used getattr here instead simply conf.skip_known_failures?
teuthology/suite/__init__.py
Outdated
| return os.path.join(os.path.dirname(__file__), 'patterns', 'known-failures.json') | ||
|
|
||
|
|
||
| def _load_known_patterns(known_patterns_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May we have short parameters' naming. I guess just path would be enough. It will also be great to using typing in signatures (c.f. https://docs.python.org/3/library/typing.html)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add type hints but that would be inconsistent to the file , IMO we should have a separate PR for adding annotations.
teuthology/suite/__init__.py
Outdated
| if known_patterns_file is None: | ||
| known_patterns_file = _get_default_known_patterns_file() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't it be better to set the default value in docopt of the teuthology-suite script.
teuthology/suite/__init__.py
Outdated
| try: | ||
| patterns_data = json.load(f) | ||
| except json.JSONDecodeError: | ||
| # if JSON fails, try YAML |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is obvious, no need to explain.
teuthology/suite/__init__.py
Outdated
|
|
||
| try: | ||
| with open(known_patterns_file, 'r') as f: | ||
| # try JSON first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is obvious, no need to explain.
teuthology/suite/__init__.py
Outdated
| if not isinstance(known_patterns, list): | ||
| log.warning(f"Patterns in {known_patterns_file} should be a list, treating as empty") | ||
| return [] | ||
| return known_patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is always the best to use straight logic:
if isinstance(known_patterns, list):
return known_patterns
else:
log.warning(f"Patterns...")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used early-return here to keep the happy path flat and avoid extra nesting. It’s functionally the same, just a guard-clause style for readability. Happy to switch to an explicit if/else if you prefer
teuthology/suite/__init__.py
Outdated
| if not job.get('description'): | ||
| continue | ||
|
|
||
| failure_reason = job.get('failure_reason', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we try to search in empty string below? Wouldn't it better to return None here and do re.search in the text?
| """Filter out jobs with known failure patterns. | ||
| Args: | ||
| jobs: Iterable of job dictionaries with 'failure_reason' and 'description' keys. | ||
| Can be a list, generator, or any iterable. | ||
| known_patterns_file: Optional path to known patterns file. If None, uses | ||
| the default bundled file. | ||
| Returns: | ||
| List of job dictionaries for jobs with unknown failures (i.e., failures | ||
| that don't match any known pattern). Only includes jobs that were passed | ||
| in the input and have descriptions. | ||
| The known patterns file format is documented in docs/siteconfig.rst. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as for the function above, would be great to follow teuthology practice of docstring formatting.
Added new comments, and... the PR title and description need to be updated as well. |
This commit implements a new --skip-known-failures flag that allows teuthology-suite to skip jobs with known failure patterns during reruns, only scheduling jobs with unknown failures. Changes: - Add --skip-known-failures flag to scripts/suite.py with documentation - Add --known-failure-patterns option for custom patterns file (defaults to suite/patterns/known-failures.json) - Implement filter_out_known_failures() function in teuthology/suite/__init__.py - Support both JSON and YAML pattern file formats - Modify get_rerun_conf_overrides() to use unknown failure filtering - Add early return when no unknown failures are found - Add teuthology/suite/patterns/known-failures.json with example failure patterns - Add known-failures.json to setup.cfg package_data section - Create docs/known_failure_patterns.rst with comprehensive documentation - Fix seed type conversion issue (string to int) The feature works by: 1. Loading known failure patterns from teuthology/suite/patterns/known-failures.json (or custom file) 2. Checking each failed job's failure_reason against known patterns 3. Only scheduling jobs with failure reasons that don't match any known pattern 4. Early exit if no unknown failures are found This provides a smart rerun mechanism that avoids rerunning jobs with known infrastructure issues while still catching new failures. Signed-off-by: deepssin <[email protected]>
365f741 to
85bae47
Compare
Updated with changes. |
| config.get_ceph_qa_suite_git_url()), | ||
| default_ceph_branch=defaults('--ceph-branch', 'main'), | ||
| default_job_threshold=config.job_threshold, | ||
| default_known_failure_patterns=_get_default_known_failure_patterns_file(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, but look, do you really want to always provide custom pattern file as an argument, or maybe it would be great to fix it in teuthology config file or possible via environment variable. If so, then we do not need to import _get_default_known_failure_patterns_file directly, and get using defaults or get it from the config.
This commit implements a new --skip-known-failures flag that allows teuthology-suite to skip jobs with known failure patterns during reruns, only scheduling jobs with unknown failures.
Changes:
The feature works by:
This provides a smart rerun mechanism that avoids rerunning jobs with known infrastructure issues while still catching new failures.