Improved Error Handling and Performance 🚧
- 🚫 Refactored `crawler_strategy.py` to handle exceptions and provide better error messages, making it more robust and reliable.
- 💻 Optimized the `get_content_of_website_optimized` function in `utils.py` for improved performance, reducing potential bottlenecks.
- 💻 Updated `utils.py` with the latest changes, ensuring consistency and accuracy.
- 🚫 Migrated to `ChromeDriverManager` to resolve Chrome driver download issues, providing a smoother user experience.
These changes focus on refining the existing codebase, resulting in a more stable, efficient, and user-friendly experience. With these improvements, you can expect fewer errors and better performance in the crawler strategy and utility functions.
- Made the extraction function twice as fast.
- Fix issue #19: Update Dockerfile to ensure compatibility across multiple platforms.
- Added five important hooks to the crawler:
  - `on_driver_created`: Called when the driver is ready for initialization.
  - `before_get_url`: Called right before Selenium fetches the URL.
  - `after_get_url`: Called after Selenium fetches the URL.
  - `before_return_html`: Called when the data is parsed and ready.
  - `on_user_agent_updated`: Called when the user changes the user_agent, causing the driver to reinitialize.
- Added an example in `quickstart.py` in the example folder under the docs.
- Enhancement issue #24: Replaced inline HTML tags (e.g., DEL, INS, SUB, ABBR) with a textual format for better context handling by LLMs.
- Maintained the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
- Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
- Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs
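
The MD5-based caching from issue #22 can be illustrated with a minimal sketch. The function name `cache_path_for` and the `.cache` directory are assumptions made for this example, not the project's actual code. The idea is that hashing the URL always yields a 32-character file name, so very long URLs no longer exceed file-name length limits.

```python
import hashlib
from pathlib import Path


def cache_path_for(url: str, cache_dir: str = ".cache") -> Path:
    """Map a URL of any length to a fixed-length cache file path.

    Hypothetical helper: the name and the cache directory are
    illustrative only.
    """
    # An MD5 hex digest is always 32 characters, whatever the input length.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.html"


# A URL far longer than typical file-name limits (often 255 bytes).
long_url = "https://example.com/search?q=" + "a" * 5000
path = cache_path_for(long_url)
print(len(path.name))  # 37: 32 hex characters plus ".html"
```

MD5 is used here only as a compact, deterministic key, not for security. The same URL always maps to the same file, which is what a cache lookup needs.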
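
To show how the five hooks listed above fit into a crawl, here is a self-contained sketch that does not use Selenium. `HookRegistry`, `set_hook`, `run_hook`, and `simulated_crawl` are hypothetical names chosen for illustration, and the hook signatures are simplified to a single argument. The crawler's real interface may differ.

```python
from typing import Any, Callable, Dict, Optional

HOOK_NAMES = (
    "on_driver_created",
    "before_get_url",
    "after_get_url",
    "before_return_html",
    "on_user_agent_updated",
)


class HookRegistry:
    """Hypothetical stand-in for the crawler's hook mechanism."""

    def __init__(self) -> None:
        self._hooks: Dict[str, Optional[Callable[[Any], Any]]] = {
            name: None for name in HOOK_NAMES
        }

    def set_hook(self, name: str, func: Callable[[Any], Any]) -> None:
        if name not in self._hooks:
            raise ValueError(f"Unknown hook: {name}")
        self._hooks[name] = func

    def run_hook(self, name: str, value: Any) -> Any:
        # With no hook registered, the value passes through unchanged.
        hook = self._hooks[name]
        return hook(value) if hook else value


def simulated_crawl(registry: HookRegistry, url: str) -> str:
    """Walk through the hook points in order, with a fake driver."""
    driver = registry.run_hook("on_driver_created", {"name": "fake-driver"})
    registry.run_hook("before_get_url", driver)
    html = f"<html><body>{url}</body></html>"  # stands in for the fetched page
    registry.run_hook("after_get_url", driver)
    return registry.run_hook("before_return_html", html)


events = []
registry = HookRegistry()
registry.set_hook("before_get_url", lambda driver: events.append("before_get_url"))
registry.set_hook("after_get_url", lambda driver: events.append("after_get_url"))
registry.set_hook("before_return_html", lambda html: html.upper())

result = simulated_crawl(registry, "https://example.com")
print(events)  # ['before_get_url', 'after_get_url']
print(result)  # <HTML><BODY>HTTPS://EXAMPLE.COM</BODY></HTML>
```

In this sketch `on_user_agent_updated` is registered as a valid hook name but never fired, because the simulated crawl never changes the user agent.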