Changelog

[0.2.71] - 2024-06-26

Improved Error Handling and Performance 🚧

  • 🚫 Refactored crawler_strategy.py to handle exceptions and report clearer error messages.
  • 💻 Optimized the get_content_of_website_optimized function in utils.py for better performance.
  • 💻 Updated utils.py with the latest changes for consistency.
  • 🚫 Migrated to ChromeDriverManager to resolve Chrome driver download issues.

These changes refine the existing codebase rather than add features. Expect fewer errors and better performance from the crawler strategy and the utility functions.

[0.2.71] - 2024-06-25

Fixed

  • Made the extraction function twice as fast.

[0.2.6] - 2024-06-22

Fixed

  • Fix issue #19: Update Dockerfile to ensure compatibility across multiple platforms.

[0.2.5] - 2024-06-18

Added

  • Added five important hooks to the crawler:
    • on_driver_created: Called when the driver is ready for initializations.
    • before_get_url: Called right before Selenium fetches the URL.
    • after_get_url: Called after Selenium fetches the URL.
    • before_return_html: Called when the data is parsed and ready.
    • on_user_agent_updated: Called when the user changes the user_agent, causing the driver to reinitialize.
  • Added an example in quickstart.py in the example folder under the docs.
  • Enhancement (issue #24): Replaced inline HTML tags (e.g., DEL, INS, SUB, ABBR) with a textual format so that LLMs handle the context better.
  • Preserved the semantic context of inline tags (e.g., ABBR, DEL, INS) to make the output more LLM-friendly.
  • Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
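
To illustrate how the five hooks listed above might be wired up, here is a minimal, self-contained sketch. The `HookRegistry` class and its `set_hook` and `run_hook` methods are hypothetical stand-ins written for this example, not the crawler's actual API. Only the five hook names come from this changelog; see quickstart.py for the real usage.

```python
from typing import Any, Callable, Dict

# The five hook names introduced in 0.2.5 (taken from the changelog entry).
HOOK_NAMES = (
    "on_driver_created",
    "before_get_url",
    "after_get_url",
    "before_return_html",
    "on_user_agent_updated",
)


class HookRegistry:
    """Hypothetical stand-in for however the crawler stores its hooks."""

    def __init__(self) -> None:
        self._hooks: Dict[str, Callable[..., Any]] = {}

    def set_hook(self, name: str, hook: Callable[..., Any]) -> None:
        # Reject names that are not one of the five documented hooks.
        if name not in HOOK_NAMES:
            raise ValueError(f"Unknown hook: {name}")
        self._hooks[name] = hook

    def run_hook(self, name: str, *args: Any) -> Any:
        # With no hook registered, pass the first argument through unchanged.
        hook = self._hooks.get(name)
        if hook is None:
            return args[0] if args else None
        return hook(*args)


calls = []


def before_get_url(driver: Any) -> Any:
    # A real hook could set cookies or headers on the driver here.
    calls.append("before_get_url")
    return driver


registry = HookRegistry()
registry.set_hook("before_get_url", before_get_url)

fake_driver = object()  # placeholder for a Selenium driver instance
result = registry.run_hook("before_get_url", fake_driver)
```

In this sketch each hook receives the driver and returns it, so a crawler could call the hooks in sequence around its fetch logic. Whether the real hooks follow that same calling convention is an assumption.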

[0.2.4] - 2024-06-17

Fixed

  • Fix issue #22: Use an MD5 hash for caching HTML files to handle long URLs.
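
A minimal sketch of the idea behind this fix, assuming the cache file name is derived directly from the URL. The function name `cache_filename` and the `.html` suffix are assumptions made for illustration; the project's actual caching code may differ.

```python
import hashlib


def cache_filename(url: str) -> str:
    """Map a URL of any length to a short, fixed-length file name."""
    # MD5 always yields 32 hex characters, so the file name stays well
    # under file-system length limits however long the URL is.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return f"{digest}.html"


# A URL far longer than most file systems allow for a single file name.
long_url = "https://example.com/search?q=" + "a" * 500
name = cache_filename(long_url)
```

MD5 serves here only as a compact, deterministic key, not for security. The same URL always maps to the same file name, which is what a cache lookup needs.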