# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution

Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate it with LLMs, open-source models, or your own retrieval-augmented generation (RAG) workflows.

**What Crawl4AI is not:**

Crawl4AI is not a replacement for traditional web scraping libraries, Selenium, or Playwright, and it is not designed as a general-purpose web automation tool. Instead, Crawl4AI has a specific, focused goal:

- To generate perfect, AI-friendly data (particularly for LLMs) from web content
- To maximize speed and efficiency in data extraction and processing
- To operate at scale, from a Raspberry Pi to cloud infrastructure

Crawl4AI is engineered with a "scale-first" mindset, aiming to handle millions of links while maintaining exceptional performance. It is optimized to:

1. Transform raw web content into structured, LLM-ready formats (Markdown/JSON)
2. Implement intelligent extraction strategies to reduce reliance on costly API calls
3. Provide a streamlined pipeline for AI data preparation and ingestion

In essence, Crawl4AI bridges the gap between web content and AI systems, focusing on delivering high-quality, processed data rather than broad web automation capabilities.

**Key Links:**

- **Website:** [https://crawl4ai.com](https://crawl4ai.com)
- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)

---

## Table of Contents

- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
  - [Table of Contents](#table-of-contents)
  - [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
  - [2. Installation \& Environment Setup](#2-installation--environment-setup)
    - [Test Your Installation](#test-your-installation)
  - [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
  - [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
  - [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
  - [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
  - [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
  - [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
  - [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
  - [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
    - [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
    - [Using `storage_state`](#using-storage_state)
  - [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
  - [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
  - [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
  - [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
  - [15. Dockerization \& Scaling](#15-dockerization--scaling)
  - [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
  - [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
  - [18. Further Resources \& Community](#18-further-resources--community)

---

## 1. Introduction & Key Concepts

Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.

**Quick Test:**

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
```

If you see Markdown output, everything is working!

**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)

---

## 2. Installation & Environment Setup

```bash
# Install the package
pip install crawl4ai
crawl4ai-setup

# Install Playwright browsers with system dependencies (recommended)
playwright install --with-deps            # Installs all browsers

# Or install specific browsers:
playwright install --with-deps chrome     # Recommended for Colab/Linux
playwright install --with-deps firefox
playwright install --with-deps webkit
playwright install --with-deps chromium

# Keep Playwright browsers updated periodically
playwright install
```

> **Note**: For Google Colab and some Linux environments, use `chrome` instead of `chromium` - it tends to work more reliably.

### Test Your Installation

Try these one-liners:

```bash
# Visible browser test
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"

# Headless test (for servers/CI)
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=True); page = browser.new_page(); page.goto('https://example.com'); print(f'Title: {page.title()}'); browser.close()"
```

You should see a browser window (in the visible test) loading example.com. If you get errors, try Firefox instead: `playwright install --with-deps firefox`.

**Try in Colab:**
[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)

---

## 3. Core Concepts & Configuration

Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.

**Example config:**

```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    verbose=True,
    viewport_width=1080,
    viewport_height=600,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav', 'footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
    (async () => {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 2000));
        document.querySelector('.load-more')?.click();
    })();
    """
)

# Cache behavior (from crawl4ai import CacheMode)
# Use: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```

**Prefixes:**

- `http://` or `https://` for live pages
- `file://` for local HTML files (e.g., `file://path/to/page.html`)
- `raw:<html>` for raw HTML strings passed directly to the crawler
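
A minimal sketch of all three prefix forms (the local path is a placeholder; point it at a real file on disk):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def demo_prefixes():
    async with AsyncWebCrawler() as crawler:
        live = await crawler.arun("https://example.com")                  # live page
        raw = await crawler.arun("raw:<h1>Hello</h1><p>Inline HTML</p>")  # raw HTML string
        local = await crawler.arun("file:///tmp/saved_page.html")         # placeholder local file
        print(live.markdown[:100])
        print(raw.markdown)
        print(local.markdown[:100])

asyncio.run(demo_prefixes())
```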

**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)

---

## 4. Basic Crawling & Simple Extraction

```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content
```

**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)

---

## 5. Markdown Generation & AI-Optimized Output

After crawling, `result.markdown_v2` provides:

- `raw_markdown`: Unfiltered markdown
- `markdown_with_citations`: Links as references at the bottom
- `references_markdown`: A separate list of reference links
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25 filtering)
- `fit_html`: The HTML used to produce `fit_markdown`

**Example:**

```python
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
```

For AI training, `fit_markdown` focuses on the most relevant content.
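
To populate `fit_markdown`, attach a content filter to the markdown generator when building the run config. A minimal sketch, assuming the `DefaultMarkdownGenerator` and `BM25ContentFilter` classes from the library's markdown-generation and content-filter modules (verify the import paths against your installed version):

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

md_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="budget travel destinations")
)
fit_config = CrawlerRunConfig(markdown_generator=md_generator)

# After crawling with fit_config, result.markdown_v2.fit_markdown
# contains only the content most relevant to the query.
```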

**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)

---

## 6. Structured Data Extraction (CSS, XPath, LLM)

Extract JSON data without LLMs:

**CSS:**

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```

**XPath:**

```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name": "headline", "selector": ".//h1", "type": "text"},
        {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
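
Whichever strategy you attach, the crawl writes its output to `result.extracted_content` as a JSON string. A minimal, self-contained sketch using the CSS schema above (the URL is a placeholder):

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}

async def extract_products():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        # Placeholder URL -- point this at a real product listing page
        result = await crawler.arun("https://shop.example.com/products", config=config)
        if result.success and result.extracted_content:
            items = json.loads(result.extracted_content)
            print(f"Extracted {len(items)} items")
            print(items[:2])  # Peek at the first couple of records

asyncio.run(extract_products())
```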

**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---

## 7. Advanced Extraction: LLM & Open-Source Models

Use `LLMExtractionStrategy` for complex tasks. It works with OpenAI-compatible APIs as well as open-source models (e.g., via Ollama).

```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```
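
For a hosted provider such as OpenAI, you typically also pass an API token. A hedged sketch (the provider string follows LiteLLM-style naming, and reading the key from an environment variable is an assumption, not a requirement):

```python
import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",             # any LiteLLM-style provider string
    api_token=os.getenv("OPENAI_API_KEY"),     # assumption: key supplied via environment
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```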

**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)

---

## 8. Page Interactions, JS Execution, & Dynamic Content

Insert `js_code` and use `wait_for` to ensure content loads. Example:

```python
run_config.js_code = """
(async () => {
    document.querySelector('.load-more')?.click();
    await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
```
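
For multi-step interactions (load a page, click, then crawl the updated content), the crawler can reuse the same page across calls. A minimal sketch assuming the `session_id` and `js_only` options of `CrawlerRunConfig` (the URL is a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def load_more_items():
    async with AsyncWebCrawler() as crawler:
        # First request: load the page and keep the session (page) alive
        first = await crawler.arun(
            "https://news.example.com",
            config=CrawlerRunConfig(session_id="news_session"),
        )
        print("Initial markdown length:", len(first.markdown))

        # Second request: reuse the same page, only run JS and wait for new content
        more = await crawler.arun(
            "https://news.example.com",
            config=CrawlerRunConfig(
                session_id="news_session",
                js_only=True,  # don't re-navigate; just execute js_code on the open page
                js_code="document.querySelector('.load-more')?.click();",
                wait_for="css:.item-loaded",
            ),
        )
        print("After load-more:", len(more.markdown))

asyncio.run(load_more_items())
```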

**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)

---

## 9. Media, Links, & Metadata Handling

- `result.media["images"]`: List of images with `src`, `score`, and `alt`. The score indicates relevance.
- `result.media["videos"]` and `result.media["audios"]`: Video and audio info in the same format.
- `result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, and `type`.
- `result.metadata`: Title, description, keywords, author.

**Example:**

```python
# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt", "N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
```

**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)

---

## 10. Authentication & Identity Preservation

### Manual Setup via User Data Directory

1. **Open Chrome with a custom user data dir:**

   ```bash
   "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
   ```

   On macOS:

   ```bash
   "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
   ```

2. **Log in to sites, solve CAPTCHAs, adjust settings manually.**
   The browser saves cookies/localStorage in that directory.

3. **Use `user_data_dir` in `BrowserConfig`:**

   ```python
   browser_config = BrowserConfig(
       headless=True,
       user_data_dir="/Users/username/ChromeProfiles/MyProfile"
   )
   ```

   Now the crawler starts with those cookies, sessions, etc.

### Using `storage_state`

Alternatively, export and reuse storage states:

```python
browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)
```

No repeated logins needed.
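
To produce `mystate.json` in the first place, one option is to log in once through plain Playwright and export the context's storage state. A minimal sketch, assuming a manual login in a visible window (the login URL is a placeholder):

```python
import asyncio
from playwright.async_api import async_playwright

async def save_state():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://example.com/login")       # placeholder login page
        input("Log in in the browser window, then press Enter here...")
        await context.storage_state(path="mystate.json")   # saves cookies + localStorage
        await browser.close()

asyncio.run(save_state())
```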

**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)

---

## 11. Proxy & Security Enhancements

Use `proxy_config` for authenticated proxies:

```python
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}
```

Combine with `headers` or `ignore_https_errors` as needed.
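
For example, a hedged sketch that combines these options in one `BrowserConfig` (the `headers` parameter and its value here are assumptions; the proxy host and credentials are placeholders):

```python
from crawl4ai.async_configs import BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    proxy_config={
        "server": "http://proxy.example.com:8080",
        "username": "proxyuser",
        "password": "proxypass",
    },
    headers={"Accept-Language": "en-US,en;q=0.9"},  # assumed: extra default headers
    ignore_https_errors=True,
)
```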

**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)

---

## 12. Screenshots, PDFs & File Downloads

Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:

```python
run_config.screenshot = True
run_config.pdf = True
```

After crawling:

```python
from base64 import b64decode

if result.screenshot:  # screenshot is returned as a base64-encoded PNG
    with open("page.png", "wb") as f:
        f.write(b64decode(result.screenshot))
if result.pdf:  # pdf is returned as raw bytes
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
```

**File Downloads:**

```python
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""

# After the crawl:
print("Downloaded files:", result.downloaded_files)
```

**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md)
Also see [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)

---

## 13. Caching & Performance Optimization

Set `cache_mode` to reuse fetched results:

```python
from crawl4ai import CacheMode
run_config.cache_mode = CacheMode.ENABLED
```

Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
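
For concurrency, `arun_many` crawls a list of URLs in parallel with the same config. A minimal sketch (URLs are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_batch():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ]
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for r in results:
            print(r.url, "->", "ok" if r.success else r.error_message)

asyncio.run(crawl_batch())
```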

**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)

---

## 14. Hooks for Custom Logic

Hooks let you run custom code at specific points in the crawler lifecycle. Avoid creating pages manually inside `on_browser_created`; instead, use `on_page_context_created` to set up routing or modify the page/context just before the URL is crawled.

**Example Hook:**

```python
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up loading
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
```

This hook is clean and doesn't create a separate page itself; it just modifies the current context/page setup.

**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)

---

## 15. Dockerization & Scaling

Use the prebuilt Docker images:

- AMD64 basic:

  ```bash
  docker pull unclecode/crawl4ai:basic-amd64
  docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
  ```

- ARM64 for M1/M2:

  ```bash
  docker pull unclecode/crawl4ai:basic-arm64
  docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
  ```

- GPU support:

  ```bash
  docker pull unclecode/crawl4ai:gpu-amd64
  docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
  ```

Scale with load balancers or Kubernetes.
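
The container exposes an HTTP API on port 11235. A heavily hedged sketch of submitting a crawl job to it; the `/crawl` and `/task/{id}` endpoints, payload fields, and response keys are assumptions based on the project's Docker README, so verify them against the image version you run:

```python
import time
import requests

BASE = "http://localhost:11235"

# Submit a crawl job (payload fields are assumptions -- check your image's API docs)
task = requests.post(f"{BASE}/crawl", json={"urls": "https://example.com", "priority": 10}).json()
task_id = task["task_id"]

# Poll until the job completes
while True:
    status = requests.get(f"{BASE}/task/{task_id}").json()
    if status.get("status") == "completed":
        print(status["result"]["markdown"][:200])
        break
    time.sleep(1)
```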

**More info:** [See the Docker instructions in the project README](#)

---

## 16. Troubleshooting & Common Pitfalls

- Empty results? Relax filters and double-check your selectors.
- Timeouts? Increase `page_timeout` or refine `wait_for`.
- CAPTCHAs? Use `user_data_dir` or `storage_state` after solving them manually.
- JS errors? Run in headful mode (`headless=False`) for debugging.

Check the [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) and [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.

---

## 17. Comprehensive End-to-End Example

Combine hooks, JS execution, PDF saving, and LLM extraction; see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example.

---

## 18. Further Resources & Community

- **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)

Follow [@unclecode](https://x.com/unclecode) for news & community updates.

**Happy Crawling!**

Leverage Crawl4AI to feed your AI models with clean, structured web data today.