Crawl4AI v0.2.74 🕷️🤖

Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

Try it Now!

Use as REST API:
Use as Python library: This Colab notebook is slightly outdated. I'm updating it to the newest version and expect to have it ready in a few days; until then, please refer to the website for the latest documentation. Thank you so much.

✨ Visit our Documentation Website

Features ✨

🆓 Completely free and open-source
🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
🌍 Supports crawling multiple URLs simultaneously
🎨 Extracts and returns all media tags (Images, Audio, and Video)
🔗 Extracts all external and internal links
📚 Extracts metadata from the page
🔄 Custom hooks for authentication, headers, and page modifications before crawling
🕵️ User-agent customization
🖼️ Takes screenshots of the page
📜 Executes multiple custom JavaScripts before crawling
📚 Various chunking strategies: topic-based, regex, sentence, and more
🧠 Advanced extraction strategies: cosine clustering, LLM, and more
🎯 CSS selector support
📝 Passes instructions/keywords to refine extraction
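
To make the chunking entry in the list above more concrete, here is a minimal standard-library sketch of regex-based and sentence-based chunking. It only illustrates the idea and is not Crawl4AI's implementation; the function names `regex_chunks` and `sentence_chunks` are hypothetical.

```python
import re


def regex_chunks(text: str, pattern: str = r"\n\s*\n") -> list[str]:
    """Split text wherever the pattern matches (default: blank lines)."""
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def sentence_chunks(text: str, max_sentences: int = 2) -> list[str]:
    """Group consecutive sentences into chunks of at most max_sentences."""
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


if __name__ == "__main__":
    sample = "First paragraph. It has two sentences.\n\nSecond paragraph!"
    print(regex_chunks(sample))
    print(sentence_chunks(sample, max_sentences=2))
```

Smaller chunks generally give an extraction step more focused input, at the cost of losing context that spans chunk boundaries.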

Cool Examples 🚀

Quick Start

from crawl4ai import WebCrawler

# Create an instance of WebCrawler
crawler = WebCrawler()

# Warm up the crawler (load necessary models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://www.nbcnews.com/business")

# Print the extracted content
print(result.markdown)
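
Once you have a result, a quick sanity check on the markdown can be handy before passing it to an LLM. The sketch below relies only on the `.markdown` attribute used in the Quick Start; `summarize_result` is a hypothetical helper, and a `SimpleNamespace` stands in for the real result so the snippet runs without network access.

```python
from types import SimpleNamespace


def summarize_result(result, preview_chars: int = 80) -> dict:
    """Return a small summary of the markdown held by a crawl result."""
    # Treat a missing or None markdown attribute as empty text.
    markdown = getattr(result, "markdown", None) or ""
    return {
        "has_content": bool(markdown.strip()),
        "word_count": len(markdown.split()),
        "preview": markdown[:preview_chars],
    }


if __name__ == "__main__":
    # Stand-in for the object returned by crawler.run(); illustration only.
    fake_result = SimpleNamespace(markdown="# Business\n\nMarkets rallied today.")
    print(summarize_result(fake_result))
```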

How to install 🛠

```bash
virtualenv venv
source venv/bin/activate
pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
```
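
After installing, you may want to confirm that the package is visible from the active virtual environment. This is a small standard-library sketch; `is_installed` is a hypothetical helper and only checks that the top-level `crawl4ai` module can be located, not that every dependency works.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    name = "crawl4ai"
    if is_installed(name):
        print(f"{name} is installed")
    else:
        print(f"{name} was not found; activate the virtualenv and re-run pip install")
```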

### Speed-First Design 🚀

Perhaps the most important design principle for this library is speed: it needs to handle many links and resources in parallel as quickly as possible. Combined with fast LLM providers such as Groq, this keeps the whole crawl-and-extract pipeline fast.

```python
import time
from crawl4ai.web_crawler import WebCrawler

crawler = WebCrawler()
crawler.warmup()

# Time a single crawl, bypassing the cache
start = time.time()
url = "https://www.nbcnews.com/business"
result = crawler.run(url, word_count_threshold=10, bypass_cache=True)
end = time.time()
print(f"Time taken: {end - start}")
```

Let's take a look at the time measured for the code snippet above:

[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
Time taken: 1.439958095550537

Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
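
To put those numbers in proportion, the sketch below parses the timings from the log lines above and computes each phase's share of the total. The helper names `parse_seconds` and `time_shares` are hypothetical, and the regular expression assumes the log wording shown here.

```python
import re

# Lines copied from the log output above.
LOG_LINES = [
    "[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds",
    "[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.",
    "Time taken: 1.439958095550537",
]


def parse_seconds(line: str) -> float:
    """Pull the number that follows 'time taken:' out of a log line."""
    match = re.search(r"time taken: (\d+(?:\.\d+)?)", line, re.IGNORECASE)
    if match is None:
        raise ValueError(f"no timing found in {line!r}")
    return float(match.group(1))


def time_shares(fetch: float, extract: float, total: float) -> dict[str, float]:
    """Return each phase's percentage of the total wall-clock time."""
    other = total - fetch - extract
    return {
        "fetch": 100 * fetch / total,
        "extract": 100 * extract / total,
        "other": 100 * other / total,
    }


if __name__ == "__main__":
    fetch, extract, total = (parse_seconds(line) for line in LOG_LINES)
    for phase, pct in time_shares(fetch, extract, total).items():
        print(f"{phase:>8}: {pct:5.1f}%")
```

In this run, fetching accounts for roughly 95% of the total time, so network latency rather than extraction is the dominant cost.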

Extract Structured Data from Web Pages 📊

Crawl all OpenAI models and their fees from the official page.

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
        Do not miss any models in the entire content. One extracted model JSON format should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
    ),
    bypass_cache=True,
)

print(result.extracted_content)
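
The extracted content is easier to work with once it is parsed into typed records. The sketch below assumes `result.extracted_content` is a JSON string holding a list of objects shaped like the example in the instruction; that shape is an assumption, and `ModelFee`, `parse_fee`, and `load_fees` are hypothetical names. A hard-coded sample replaces the live result so the snippet runs offline.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class ModelFee:
    model_name: str
    input_usd: float   # dollar amount only; the "/ 1M tokens" unit is not parsed
    output_usd: float


def parse_fee(fee: str) -> float:
    """Extract the dollar amount from text such as 'US$10.00 / 1M tokens'."""
    match = re.search(r"\$\s*([0-9]+(?:\.[0-9]+)?)", fee)
    if match is None:
        raise ValueError(f"no dollar amount in {fee!r}")
    return float(match.group(1))


def load_fees(extracted_content: str) -> list[ModelFee]:
    """Parse a JSON array of fee records into typed objects."""
    return [
        ModelFee(
            model_name=item["model_name"],
            input_usd=parse_fee(item["input_fee"]),
            output_usd=parse_fee(item["output_fee"]),
        )
        for item in json.loads(extracted_content)
    ]


if __name__ == "__main__":
    # Sample payload copied from the instruction's example record;
    # real output depends on the page and on the model.
    sample = (
        '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
        '"output_fee": "US$30.00 / 1M tokens"}]'
    )
    for fee in load_fees(sample):
        print(fee)
```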

Execute JS, Filter Data with CSS Selector, and Clustering

from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]

crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url="https://www.nbcnews.com/business",
    js=js_code,
    css_selector="p",
    extraction_strategy=CosineStrategy(semantic_filter="technology")
)

print(result.extracted_content)
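
For intuition about what a cosine-based semantic filter does, here is a deliberately simplified sketch that scores paragraphs against a keyword using bag-of-words cosine similarity. It is not how `CosineStrategy` is implemented and it performs no clustering; the function names and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def filter_by_keyword(paragraphs: list[str], query: str, threshold: float = 0.1) -> list[str]:
    """Keep paragraphs whose similarity to the query meets the threshold."""
    query_vec = bag_of_words(query)
    return [
        p for p in paragraphs
        if cosine_similarity(bag_of_words(p), query_vec) >= threshold
    ]


if __name__ == "__main__":
    paragraphs = ["New technology stocks rallied.", "Oil prices fell sharply."]
    print(filter_by_keyword(paragraphs, "technology"))
```

Exact token overlap is a crude proxy: a paragraph about "software" would score zero against "technology" here, which is the kind of gap that embedding-based similarity is meant to close.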

Documentation 📚

For detailed documentation, including installation instructions, advanced features, and API reference, visit our Documentation Website.

Contributing 🤝

We welcome contributions from the open-source community. Check out our contribution guidelines for more information.

License 📄

Crawl4AI is released under the Apache 2.0 License.

Contact 📧

For questions, suggestions, or feedback, feel free to reach out:

GitHub: unclecode
Twitter: @unclecode
Website: crawl4ai.com

Happy Crawling! 🕸️🚀
