# Crawl4AI

Welcome to the official documentation for Crawl4AI! Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.
## Introduction

Crawl4AI has one clear task: to make crawling and data extraction from web pages easy and efficient, especially for large language models (LLMs) and AI applications. Whether you use it as a REST API or a Python library, Crawl4AI offers a robust and flexible solution with full asynchronous support.
## Quick Start

Here's a quick example to show you how easy it is to use Crawl4AI with its asynchronous capabilities:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Create an instance of AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler on a URL
        result = await crawler.arun(url="https://www.nbcnews.com/business")

        # Print the extracted content
        print(result.markdown)

# Run the async main function
asyncio.run(main())
```
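The same asynchronous design extends naturally to crawling several pages at once. The sketch below shows the underlying `asyncio.gather` pattern with a stub coroutine standing in for the real crawler call, so it runs without the library or network access; the stub `fetch` function and the example URLs are illustrative only:

```python
import asyncio

# Stub standing in for crawler.arun(); a real run would await the crawler here.
async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"markdown for {url}"

async def crawl_all(urls: list[str]) -> list[str]:
    # asyncio.gather runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*(fetch(u) for u in urls))

results = asyncio.run(crawl_all([
    "https://example.com/a",
    "https://example.com/b",
]))
print(results[0])  # markdown for https://example.com/a
```

Because `gather` preserves order, each result can be matched back to its URL by index, which is handy when aggregating many pages into one dataset.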
## Key Features

- Completely free and open-source
- Blazing-fast performance, outperforming many paid services
- LLM-friendly output formats (JSON, cleaned HTML, markdown)
- Fit markdown generation for extracting main article content
- Multi-browser support (Chromium, Firefox, WebKit)
- Crawls multiple URLs simultaneously
- Extracts and returns all media tags (images, audio, and video)
- Extracts all external and internal links
- Extracts metadata from the page
- Custom hooks for authentication, headers, and page modifications
- User-agent customization
- Takes screenshots of pages, with enhanced error handling
- Executes multiple custom JavaScript snippets before crawling
- Generates structured output without an LLM using JsonCssExtractionStrategy
- Various chunking strategies: topic-based, regex, sentence, and more
- Advanced extraction strategies: cosine clustering, LLM, and more
- CSS selector support for precise data extraction
- Passes instructions/keywords to refine extraction
- Proxy support with authentication for enhanced access
- Session management for complex multi-page crawling
- Asynchronous architecture for improved performance
- Improved image processing with lazy-loading detection
- Enhanced handling of delayed content loading
- Custom headers support for LLM interactions
- iframe content extraction for comprehensive analysis
- Flexible timeout and delayed-content retrieval options
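To illustrate the schema-driven, LLM-free extraction mentioned above, here is a minimal sketch of the kind of schema JsonCssExtractionStrategy consumes: a repeated base selector plus a list of per-field CSS selectors. The field names and selectors below are hypothetical examples, not part of any real site, and the commented usage at the end assumes an installed crawl4ai:

```python
# A minimal, illustrative schema for CSS-based structured extraction.
# Selectors and field names are hypothetical examples.
article_schema = {
    "name": "News Articles",
    "baseSelector": "article.post",  # repeated element to extract from
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "summary", "selector": "p.summary", "type": "text"},
    ],
}

# With crawl4ai installed, the schema would be used roughly like this:
#
#   from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
#   strategy = JsonCssExtractionStrategy(article_schema)
#   result = await crawler.arun(url=..., extraction_strategy=strategy)
#   data = json.loads(result.extracted_content)
```

Each match of `baseSelector` yields one JSON object whose keys are the field names, so no model call is needed when the page structure is known in advance.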
## Documentation Structure

Our documentation is organized into several sections:

### Basic Usage

- [Installation](basic/installation.md)
- [Quick Start](basic/quickstart.md)
- [Simple Crawling](basic/simple-crawling.md)
- [Browser Configuration](basic/browser-config.md)
- [Content Selection](basic/content-selection.md)
- [Output Formats](basic/output-formats.md)
- [Page Interaction](basic/page-interaction.md)

### Advanced Features

- [Magic Mode](advanced/magic-mode.md)
- [Session Management](advanced/session-management.md)
- [Hooks & Authentication](advanced/hooks-auth.md)
- [Proxy & Security](advanced/proxy-security.md)
- [Content Processing](advanced/content-processing.md)

### Extraction & Processing

- [Extraction Strategies Overview](extraction/overview.md)
- [LLM Integration](extraction/llm.md)
- [CSS-Based Extraction](extraction/css.md)
- [Cosine Strategy](extraction/cosine.md)
- [Chunking Strategies](extraction/chunking.md)

### API Reference

- [AsyncWebCrawler](api/async-webcrawler.md)
- [CrawlResult](api/crawl-result.md)
- [Extraction Strategies](api/strategies.md)
- [arun() Method Parameters](api/arun.md)

### Examples

- Coming soon!
## Getting Started

1. Install Crawl4AI:

   ```bash
   pip install crawl4ai
   ```

2. Check out our [Quick Start Guide](basic/quickstart.md) to begin crawling web pages.
3. Explore our [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) to see Crawl4AI in action.
## Support

For questions, suggestions, or issues:

- GitHub Issues: [Report a Bug](https://github.com/unclecode/crawl4ai/issues)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)

Happy Crawling!