# Getting Started with Crawl4AI

Welcome to **Crawl4AI**, an open-source, LLM-friendly web crawler and scraper. In this tutorial, you’ll:

1. **Install** Crawl4AI via pip (with pointers to Docker for container-based setups).
2. Run your **first crawl** using minimal configuration.
3. Generate **Markdown** output (and learn how it’s influenced by content filters).
4. Experiment with a simple **CSS-based extraction** strategy.
5. Get a glimpse of **LLM-based extraction** (including open-source and closed-source model options).

---

## 1. Introduction

Crawl4AI provides:

- An asynchronous crawler, **`AsyncWebCrawler`**.
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (with support for content filters).
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).

By the end of this guide, you’ll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.
---

## 2. Installation

### 2.1 Python + Playwright

#### Basic Pip Installation

```bash
pip install crawl4ai
crawl4ai-setup
# Verify your installation
crawl4ai-doctor
```

If you encounter any browser-related issues, you can install the browser manually:

```bash
python -m playwright install --with-deps chromium
```

- **`crawl4ai-setup`** installs and configures Playwright (Chromium by default).

We cover advanced installation and Docker in the [Installation](#installation) section.
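If you prefer containers, the gist of the Docker route looks like the sketch below. The image name and port match the project’s published defaults at the time of writing, but treat them as assumptions and verify against the Installation section:

```bash
# Pull and run the Crawl4AI server image
# (image name/tag and port are assumptions; check the Installation docs)
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 unclecode/crawl4ai:latest
```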
---

## 3. Your First Crawl

Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```

**What’s happening?**

- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
- It fetches `https://example.com`.
- Crawl4AI automatically converts the HTML into Markdown.

You now have a simple, working crawl!
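Before moving on, it’s worth checking whether a crawl actually succeeded. Here’s a minimal sketch using the `success`, `status_code`, and `error_message` fields of the result object returned by `arun()`:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        if result.success:
            print(f"Crawled {result.url} (HTTP {result.status_code})")
            print(result.markdown[:300])
        else:
            # error_message explains what went wrong (timeout, DNS failure, etc.)
            print(f"Crawl failed: {result.error_message}")

if __name__ == "__main__":
    asyncio.run(main())
```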
---

## 4. Basic Configuration (Light Introduction)

Crawl4AI’s crawler can be heavily customized using two main classes:

1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).

Below is an example with minimal usage:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

We’ll explore more advanced configuration in later tutorials (enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
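As a taste of what those two objects can carry, here is a hedged sketch of a slightly richer configuration. The parameters shown (`user_agent`, `verbose`, `word_count_threshold`, `excluded_tags`, `exclude_external_links`) exist in the library, but treat the specific values as illustrative:

```python
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode

# Browser-level settings: applied once, when the crawler starts
browser_conf = BrowserConfig(
    headless=True,
    verbose=True,  # log browser activity
    user_agent="Mozilla/5.0 (MyCrawler/1.0)",  # illustrative UA string
)

# Run-level settings: applied per call to arun()
run_conf = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,  # always fetch fresh content
    word_count_threshold=10,  # skip text blocks shorter than 10 words
    excluded_tags=["form", "header", "footer"],  # drop boilerplate elements
    exclude_external_links=True,  # keep only same-domain links
)
```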
---

## 5. Generating Markdown Output

By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**:

- **`result.markdown`**:
  The direct HTML-to-Markdown conversion (the same text is also exposed as `result.markdown.raw_markdown`).
- **`result.markdown.fit_markdown`**:
  The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
### Example: Using a Filter with `DefaultMarkdownGenerator`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    md_generator = DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
    )
    config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```

**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
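If you want the filtered Markdown to focus on a specific topic rather than just pruning low-value blocks, the library also ships a BM25-based filter. A hedged sketch, swapping `PruningContentFilter` for `BM25ContentFilter` (the `user_query` and `bm25_threshold` parameters exist, but tune the values for your pages):

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Keep only content that scores well against a query, BM25-style
md_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(
        user_query="machine learning pricing",  # illustrative query
        bm25_threshold=1.2,  # higher = stricter filtering
    )
)
config = CrawlerRunConfig(markdown_generator=md_generator)
# Pass `config` to crawler.arun(...) exactly as in the example above;
# the filtered text then appears in result.markdown.fit_markdown.
```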
---

## 6. Simple Data Extraction (CSS-based)

Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/items",
            config=CrawlerRunConfig(
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```

**Why is this helpful?**

- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
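Schemas aren’t limited to flat fields, either; nested structures are supported as well. A hedged sketch (the `nested` field type is part of the schema format, though the selectors here are hypothetical):

```python
# Hypothetical schema for a product card that contains a nested "seller" block
schema = {
    "name": "Products",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {
            "name": "seller",
            "selector": "div.seller-info",
            "type": "nested",  # extract an object, not a scalar
            "fields": [
                {"name": "name", "selector": "span.name", "type": "text"},
                {"name": "rating", "selector": "span.rating", "type": "text"}
            ]
        }
    ]
}
# Use it exactly as above: JsonCssExtractionStrategy(schema)
```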
---

## 7. Simple Data Extraction (LLM-based)

For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports **open-source** or **closed-source** providers:

- **Open-Source Models** (e.g., `ollama/llama3.3`, run locally; pass `api_token="no_token"`)
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
- Or any provider supported by the underlying library

Below is an example showing both the **open-source** style (no token) and the closed-source style:
```python
import os
import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class PricingInfo(BaseModel):
    model_name: str = Field(..., description="Name of the AI model")
    input_fee: str = Field(..., description="Fee for input tokens")
    output_fee: str = Field(..., description="Fee for output tokens")

async def main():
    # 1) Open-source usage: no API key required for local models
    llm_strategy_open_source = LLMExtractionStrategy(
        provider="ollama/llama3.3",   # or any other local model
        api_token="no_token",         # local models typically need no API key
        schema=PricingInfo.schema(),  # on Pydantic v2, use PricingInfo.model_json_schema()
        extraction_type="schema",
        instruction="""
        From this page, extract all AI model pricing details in JSON format.
        Each entry should have 'model_name', 'input_fee', and 'output_fee'.
        """,
        extra_args={"temperature": 0}
    )

    # 2) Closed-source usage: API key for OpenAI, for example
    openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
    llm_strategy_openai = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=openai_token,
        schema=PricingInfo.schema(),
        extraction_type="schema",
        instruction="""
        From this page, extract all AI model pricing details in JSON format.
        Each entry should have 'model_name', 'input_fee', and 'output_fee'.
        """,
        extra_args={"temperature": 0}
    )

    # We'll demo the open-source approach here
    config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/pricing",
            config=config
        )
        print("LLM-based extraction JSON:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
**What’s happening?**

- We define a Pydantic schema (`PricingInfo`) describing the fields we want.
- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
- Depending on the **provider** and **api_token**, you can use local models or a remote API.
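Because `extracted_content` is a JSON string, you can load it straight back into your Pydantic model for validated, typed access. A small sketch, continuing from the script above (it assumes the extraction returned a JSON array matching the schema):

```python
import json

# result.extracted_content is a JSON string; for schema extraction it is
# typically a JSON array of objects matching the schema
entries = json.loads(result.extracted_content)

# Re-validate each entry with the same Pydantic model used for the schema
pricing = [PricingInfo(**entry) for entry in entries]
for p in pricing:
    print(f"{p.model_name}: in={p.input_fee}, out={p.output_fee}")
```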
---

## 8. Next Steps

Congratulations! You have:

1. Installed Crawl4AI (via pip, with Docker as an option).
2. Performed a simple crawl and printed Markdown.
3. Seen how adding a **markdown generator** + **content filter** can produce “fit” Markdown.
4. Experimented with **CSS-based** extraction for repetitive data.
5. Learned the basics of **LLM-based** extraction (open-source and closed-source).

If you’re ready for more, check out:

- **Installation**: Learn more about installing Crawl4AI and setting up Playwright.
- **Focus on Configuration**: Learn to customize browser settings, caching modes, advanced timeouts, etc.
- **Markdown Generation Basics**: Dive deeper into content filtering and “fit markdown” usage.
- **Dynamic Pages & Hooks**: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
- **Deployment**: Run Crawl4AI in Docker containers and scale across multiple nodes.
- **Explanations & How-To Guides**: Explore browser contexts, identity-based crawling, hooking, performance, and more.

Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!