update parsers and classifiers for v0.6.7a2 demo data #93
Merged
…and_forums - replace div-specific aria-level searches with tag-agnostic attrs= search - move "What people are saying" from knowledge to perspectives mapping - require heading text match for discussions_and_forums classifier
Summary
Fix 5 classification issues and 3 parser issues found in `data/demo-ws-v0.6.7a2/` demo data. All were producing `unknown` types, parse errors, or null fields; every change is a net improvement with no working behavior regressed.

Classifier fixes
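The heading-related fixes in this section hinge on matching headings by attributes instead of by tag name, and on checking the heading text itself. A minimal sketch of that idea, assuming BeautifulSoup is installed; the markup and the helper names `find_headings` and `is_discussions_heading` are hypothetical and not taken from the repository:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same kind of heading expressed with three tags.
HTML = """
<div>
  <h2 role="heading" aria-level="2">People also ask</h2>
  <div role="heading" aria-level="2">Perspectives</div>
  <span role="heading" aria-level="2">Discussions and forums</span>
  <span aria-level="2">Not a heading (no role)</span>
</div>
"""


def find_headings(soup, level="2"):
    """Return heading texts for any tag with aria-level and role="heading"."""
    # No tag name is passed, so <h2>, <div>, and <span> all match.
    matches = soup.find_all(attrs={"aria-level": level, "role": "heading"})
    return [tag.get_text(strip=True) for tag in matches]


def is_discussions_heading(text):
    """Require the heading text to match, not only the markup."""
    return text.startswith("Discussions and forums")


soup = BeautifulSoup(HTML, "html.parser")
headings = find_headings(soup)
discussion_headings = [h for h in headings if is_discussions_heading(h)]
```

Here `headings` contains the three elements that carry both attributes, whatever their tag, and `discussion_headings` keeps only the one whose text starts with "Discussions and forums".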
- Headings changed from `<h2>`/`<div>` to `<span>` with `aria-level` + `role="heading"`. Replaced two `div`-specific searches with one tag-agnostic `find_all(attrs=...)`. Fixes 3 misclassifications: `people_also_ask`, `perspectives`, `searches_related`.
- The `discussions_and_forums` classifier matched `div.IFnjPb[role=heading]` regardless of text. Now requires heading text to start with "Discussions and forums".
- Moved "What people are saying" from `knowledge` to `perspectives`.

Parser fixes
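The title fix in this section relies on trying several selectors in order and returning the first match. A rough sketch of such a fallback helper, assuming BeautifulSoup is installed; the class names come from the PR text, while the selector order, the signature, and the sample markup are assumptions and not the repository's actual `get_title()`:

```python
from bs4 import BeautifulSoup

# Selectors tried in order. The class names appear in the PR description;
# the ordering here is an assumption.
TITLE_SELECTORS = [".n0jPhd", ".eAaXgc"]


def get_title(element, selectors=TITLE_SELECTORS):
    """Return the text of the first selector that matches, or None."""
    for selector in selectors:
        match = element.select_one(selector)
        if match is not None:
            text = match.get_text(strip=True)
            if text:
                return text
    return None


# Hypothetical components, one per title layout, plus one with no title.
first_layout = BeautifulSoup(
    '<div><div class="n0jPhd">First layout title</div></div>', "html.parser"
)
second_layout = BeautifulSoup(
    '<div><span class="eAaXgc">Second layout title</span></div>', "html.parser"
)
no_title = BeautifulSoup('<div><span class="other">x</span></div>', "html.parser")
```

Supporting a new layout in this style means appending a selector to the list, so markup that still uses an older class keeps parsing.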
- `find_subcomponents()` now separates discovery from parsing. Added `PmEWq` video format so video results no longer fall through to the wrapper.
- `get_title()` helper with multi-selector fallback for perspectives titles (`eAaXgc` in addition to `n0jPhd`).

Tests
- `test_parse_serp.py` with syrupy snapshot test (parametrized by serp_id) + 8 structural validation tests. All 9 pass.

Tooling
- `scripts/demo_screenshot.py` for visual SERP inspection with component type highlights injected via BeautifulSoup.

Notes for future parser updates
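One note in this section concerns the ordered classifier chain, where the first result other than "unknown" wins. The sketch below illustrates why order matters; the classifier functions, the dictionary-based component, and the `general` label are hypothetical stand-ins and not the actual `ClassifyMain` implementation:

```python
from typing import Callable, Dict, List

Classifier = Callable[[Dict[str, str]], str]


def classify_discussions(cmpt: Dict[str, str]) -> str:
    # Text match on the heading, as in the discussions_and_forums fix.
    if cmpt.get("heading", "").startswith("Discussions and forums"):
        return "discussions_and_forums"
    return "unknown"


def classify_perspectives(cmpt: Dict[str, str]) -> str:
    if cmpt.get("heading") == "What people are saying":
        return "perspectives"
    return "unknown"


def classify_general(cmpt: Dict[str, str]) -> str:
    # Deliberately broad fallback (hypothetical rule).
    return "general" if cmpt.get("has_link") == "yes" else "unknown"


CHAIN: List[Classifier] = [
    classify_discussions,
    classify_perspectives,
    classify_general,
]


def classify(cmpt: Dict[str, str], chain: List[Classifier] = CHAIN) -> str:
    """Run classifiers in order and return the first non-"unknown" label."""
    for classifier in chain:
        label = classifier(cmpt)
        if label != "unknown":
            return label
    return "unknown"


component = {"heading": "What people are saying", "has_link": "yes"}
in_order = classify(component)
reordered = classify(component, chain=[classify_general, classify_perspectives])
```

With the chain as written, `in_order` is `"perspectives"`. Moving the broad classifier first makes `reordered` come out as `"general"`, which is the kind of shadowing to look for when adding or modifying a classifier.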
- Use `scripts/demo_screenshot.py` to visually inspect component boundaries and classification before diving into code.
- The `attrs={"aria-level": ..., "role": "heading"}` pattern should keep working across tag changes, but watch for attribute changes too.
- `ClassifyMain.classify()` runs an ordered chain; first non-"unknown" wins. Check the full chain when adding or modifying classifiers.
- Helpers with selector fallbacks (`get_title()`, `get_cite()`) are the preferred pattern for handling CSS class changes: add new selectors to the list rather than replacing old ones.
- `find_subcomponents()` is the single place to add new general result formats.
- See `.claude/commands/parser-update.md` for the full 7-phase diagnostic workflow.

Test plan
- `poetry run pytest tests/test_parse_serp.py -v`: all 9 tests pass
- `demo_screenshot.py` confirms correct component boundaries