update parsers and classifiers for v0.6.7a2 demo data #93

Merged
gitronald merged 6 commits into dev from parser-updates on Feb 6, 2026

@gitronald
Owner

Summary

Fix 5 classification issues and 3 parser issues found in data/demo-ws-v0.6.7a2/ demo data. All were producing unknown types, parse errors, or null fields — every change is a net improvement with no working behavior regressed.

Classifier fixes

  • Tag-agnostic heading search — headings changed from <h2>/<div> to <span> with aria-level + role="heading". Replaced two div-specific searches with one tag-agnostic find_all(attrs=...). Fixes 3 misclassifications: people_also_ask, perspectives, searches_related.
  • discussions_and_forums text check — classifier matched any div.IFnjPb[role=heading] regardless of text. Now requires heading text to start with "Discussions and forums".
  • "What people are saying" mapping — moved from knowledge to perspectives.
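
A minimal sketch of the two heading-based fixes above, assuming BeautifulSoup with the built-in html.parser. The sample HTML, the aria-level value "2", the HEADING_TYPES table, and both function names are illustrative assumptions, not the repository's actual code; only the attrs-based search, the IFnjPb class, and the "Discussions and forums" prefix come from this PR.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the same kind of heading rendered with three different tags.
HTML = """
<div id="a"><span role="heading" aria-level="2">People also ask</span></div>
<div id="b"><h2 role="heading" aria-level="2">Related searches</h2></div>
<div id="c"><div class="IFnjPb" role="heading" aria-level="2">Discussions and forums</div></div>
<div id="d"><div class="IFnjPb" role="heading" aria-level="2">Latest posts</div></div>
"""

# Illustrative heading-text prefixes mapped to component types (assumed values).
HEADING_TYPES = {
    "People also ask": "people_also_ask",
    "Related searches": "searches_related",
    "What people are saying": "perspectives",
}


def classify_by_heading(cmpt):
    """Classify a component from its heading text, whatever tag the heading uses."""
    # No tag name is passed, so <h2>, <div>, and <span> headings all match.
    headings = cmpt.find_all(attrs={"role": "heading", "aria-level": "2"})
    for heading in headings:
        text = heading.get_text(strip=True)
        for prefix, label in HEADING_TYPES.items():
            if text.startswith(prefix):
                return label
    return "unknown"


def classify_discussions(cmpt):
    """Require the heading text, not just the heading's class and role."""
    heading = cmpt.find("div", attrs={"class": "IFnjPb", "role": "heading"})
    if heading is not None and heading.get_text(strip=True).startswith("Discussions and forums"):
        return "discussions_and_forums"
    return "unknown"


soup = BeautifulSoup(HTML, "html.parser")
# Index the four top-level sample components by their id attribute.
cmpts = {div["id"]: div for div in soup.find_all("div", id=True)}
```

Component "d" shares the IFnjPb class and heading role with "c" but has different text, so the text check is what keeps it from being labeled discussions_and_forums.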

Parser fixes

  • general.py — extracted find_subcomponents() to separate discovery from parsing. Added PmEWq video format so video results no longer fall through to the wrapper.
  • knowledge.py — deduplicated URLs in AI overview (24 links with repeats down to ~13 unique).
  • top_stories.py — added get_title() helper with multi-selector fallback for perspectives titles (eAaXgc in addition to n0jPhd).
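
For illustration, the URL deduplication and the multi-selector title fallback might look roughly like the sketch below. It assumes BeautifulSoup; the function bodies, the selector order, and the sample markup are assumptions, and only the names get_title, n0jPhd, and eAaXgc come from this PR.

```python
from bs4 import BeautifulSoup


def dedupe_urls(urls):
    """Drop repeated URLs while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(urls))


# Class names mentioned in this PR; trying the older class first is an assumption.
TITLE_SELECTORS = [{"class": "n0jPhd"}, {"class": "eAaXgc"}]


def get_title(item, selectors=TITLE_SELECTORS):
    """Return the first non-empty title found by any selector, else None."""
    for attrs in selectors:
        node = item.find(attrs=attrs)
        if node is not None:
            text = node.get_text(strip=True)
            if text:
                return text
    return None


# Hypothetical sample items: one with the older class, one with the newer, one with neither.
old_item = BeautifulSoup('<div><div class="n0jPhd">Old layout title</div></div>', "html.parser")
new_item = BeautifulSoup('<div><span class="eAaXgc">New layout title</span></div>', "html.parser")
no_title = BeautifulSoup('<div><span class="other">Not a title</span></div>', "html.parser")
```

Appending a selector to TITLE_SELECTORS keeps older page layouts parseable, which is why adding to the list is preferable to replacing an entry.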

Tests

  • Rewrote test_parse_serp.py with syrupy snapshot test (parametrized by serp_id) + 8 structural validation tests. All 9 pass.

Tooling

  • Added scripts/demo_screenshot.py for visual SERP inspection with component type highlights injected via BeautifulSoup.

Notes for future parser updates

  • Use scripts/demo_screenshot.py to visually inspect component boundaries and classification before diving into code.
  • Heading tags change frequently — the tag-agnostic attrs={"aria-level": ..., "role": "heading"} pattern should keep working, but watch for attribute changes too.
  • Classifier order matters: ClassifyMain.classify() runs an ordered chain, and the first non-"unknown" result wins. Check the full chain when adding or modifying classifiers.
  • Multi-selector helpers (like get_title(), get_cite()) are the preferred pattern for handling CSS class changes — add new selectors to the list rather than replacing old ones.
  • find_subcomponents() is the single place to add new general result formats.
  • Run .claude/commands/parser-update.md for the full 7-phase diagnostic workflow.
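
To make the ordering note concrete, here is a toy first-match chain. It is a sketch under stated assumptions, not the real ClassifyMain.classify(): the classifier functions, their rules, and the use of plain heading strings as input are all hypothetical.

```python
from typing import Callable, Sequence

# A classifier takes heading text and returns a type label or "unknown".
Classifier = Callable[[str], str]


def classify_perspectives(heading: str) -> str:
    # Specific rule (hypothetical).
    return "perspectives" if heading.startswith("What people are saying") else "unknown"


def classify_knowledge(heading: str) -> str:
    # Deliberately broad rule (hypothetical) so it overlaps with the one above.
    return "knowledge" if "people" in heading.lower() else "unknown"


def classify(heading: str, chain: Sequence[Classifier]) -> str:
    """Run classifiers in order and return the first non-"unknown" label."""
    for classifier in chain:
        label = classifier(heading)
        if label != "unknown":
            return label
    return "unknown"


heading = "What people are saying"
specific_first = classify(heading, [classify_perspectives, classify_knowledge])
broad_first = classify(heading, [classify_knowledge, classify_perspectives])
```

Both classifiers match the same heading, so the chain order alone decides the label: specific_first is "perspectives" and broad_first is "knowledge".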

Test plan

  • poetry run pytest tests/test_parse_serp.py -v — all 9 tests pass
  • Visual inspection via demo_screenshot.py confirms correct component boundaries
  • No regressions — all previously correct classifications unchanged

@gitronald gitronald merged commit 076dd8e into dev Feb 6, 2026
@gitronald gitronald deleted the parser-updates branch February 19, 2026 20:41