# AsyncWebCrawler

The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.

## Constructor

```python
AsyncWebCrawler(
    # Browser Settings
    browser_type: str = "chromium",  # Options: "chromium", "firefox", "webkit"
    headless: bool = True,           # Run browser in headless mode
    verbose: bool = False,           # Enable verbose logging

    # Cache Settings
    always_by_pass_cache: bool = False,  # Always bypass cache
    base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),  # Base directory for cache

    # Network Settings
    proxy: str = None,          # Simple proxy URL
    proxy_config: Dict = None,  # Advanced proxy configuration

    # Browser Behavior
    sleep_on_close: bool = False,  # Wait before closing browser

    # Custom Settings
    user_agent: str = None,                 # Custom user agent
    headers: Dict[str, str] = {},           # Custom HTTP headers
    js_code: Union[str, List[str]] = None,  # Default JavaScript to execute
)
```
### Parameters in Detail

#### Browser Settings

- **browser_type** (str, optional)
  - Default: `"chromium"`
  - Options: `"chromium"`, `"firefox"`, `"webkit"`
  - Controls which browser engine to use
  ```python
  # Example: Using Firefox
  crawler = AsyncWebCrawler(browser_type="firefox")
  ```
- **headless** (bool, optional)
  - Default: `True`
  - When `True`, browser runs without GUI
  - Set to `False` for debugging
  ```python
  # Visible browser for debugging
  crawler = AsyncWebCrawler(headless=False)
  ```
- **verbose** (bool, optional)
  - Default: `False`
  - Enables detailed logging
  ```python
  # Enable detailed logging
  crawler = AsyncWebCrawler(verbose=True)
  ```
#### Cache Settings

- **always_by_pass_cache** (bool, optional)
  - Default: `False`
  - When `True`, always fetches fresh content
  ```python
  # Always fetch fresh content
  crawler = AsyncWebCrawler(always_by_pass_cache=True)
  ```
- **base_directory** (str, optional)
  - Default: User's home directory
  - Base path for cache storage
  ```python
  # Custom cache directory
  crawler = AsyncWebCrawler(base_directory="/path/to/cache")
  ```
#### Network Settings

- **proxy** (str, optional)
  - Simple proxy URL
  ```python
  # Using a simple proxy
  crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
  ```
- **proxy_config** (Dict, optional)
  - Advanced proxy configuration with authentication
  ```python
  # Advanced proxy with auth
  crawler = AsyncWebCrawler(proxy_config={
      "server": "http://proxy.example.com:8080",
      "username": "user",
      "password": "pass"
  })
  ```
#### Browser Behavior

- **sleep_on_close** (bool, optional)
  - Default: `False`
  - Adds a delay before closing the browser
  ```python
  # Wait before closing
  crawler = AsyncWebCrawler(sleep_on_close=True)
  ```
#### Custom Settings

- **user_agent** (str, optional)
  - Custom user agent string
  ```python
  # Custom user agent
  crawler = AsyncWebCrawler(
      user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
  )
  ```
- **headers** (Dict[str, str], optional)
  - Custom HTTP headers
  ```python
  # Custom headers
  crawler = AsyncWebCrawler(
      headers={
          "Accept-Language": "en-US",
          "Custom-Header": "Value"
      }
  )
  ```
- **js_code** (Union[str, List[str]], optional)
  - Default JavaScript to execute on each page
  ```python
  # Default JavaScript
  crawler = AsyncWebCrawler(
      js_code=[
          "window.scrollTo(0, document.body.scrollHeight);",
          "document.querySelector('.load-more').click();"
      ]
  )
  ```
## Methods

### arun()

The primary method for crawling web pages.

```python
async def arun(
    # Required
    url: str,  # URL to crawl

    # Content Selection
    css_selector: str = None,        # CSS selector for content
    word_count_threshold: int = 10,  # Minimum words per block

    # Cache Control
    bypass_cache: bool = False,  # Bypass cache for this request

    # Session Management
    session_id: str = None,  # Session identifier

    # Screenshot Options
    screenshot: bool = False,           # Take screenshot
    screenshot_wait_for: float = None,  # Wait before screenshot

    # Content Processing
    process_iframes: bool = False,          # Process iframe content
    remove_overlay_elements: bool = False,  # Remove popups/modals

    # Anti-Bot Settings
    simulate_user: bool = False,       # Simulate human behavior
    override_navigator: bool = False,  # Override navigator properties
    magic: bool = False,               # Enable all anti-detection features

    # Content Filtering
    excluded_tags: List[str] = None,           # HTML tags to exclude
    exclude_external_links: bool = False,      # Remove external links
    exclude_social_media_links: bool = False,  # Remove social media links

    # JavaScript Handling
    js_code: Union[str, List[str]] = None,  # JavaScript to execute
    wait_for: str = None,                   # Wait condition

    # Page Loading
    page_timeout: int = 60000,               # Page load timeout (ms)
    delay_before_return_html: float = None,  # Wait before returning HTML

    # Extraction
    extraction_strategy: ExtractionStrategy = None  # Extraction strategy
) -> CrawlResult:
```
### Usage Examples

#### Basic Crawling

```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
```
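Once `arun()` returns, the crawl output lives on the `CrawlResult` object. A minimal sketch of inspecting it; only `success` and `error_message` appear elsewhere in this document, so treat `markdown` and `html` as assumptions based on typical Crawl4AI results:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        if result.success:
            # Assumed fields: markdown (converted content) and html (raw page)
            print(result.markdown[:500])  # First 500 characters of markdown
            print(len(result.html))       # Size of the raw HTML
        else:
            print(f"Crawl failed: {result.error_message}")

asyncio.run(main())
```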
#### Advanced Crawling

```python
async with AsyncWebCrawler(
    browser_type="firefox",
    verbose=True,
    headers={"Custom-Header": "Value"}
) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        css_selector=".main-content",
        word_count_threshold=20,
        process_iframes=True,
        magic=True,
        wait_for="css:.dynamic-content",
        screenshot=True
    )
```
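The advanced example requests a screenshot but never saves it. A short sketch, assuming the screenshot comes back as a base64-encoded image string on `result.screenshot` (that attribute is an assumption; it is not documented in this section):

```python
import asyncio
import base64

from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", screenshot=True)
        if result.success and result.screenshot:
            # Assumed: result.screenshot holds a base64-encoded image payload
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

asyncio.run(main())
```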
#### Session Management

```python
async with AsyncWebCrawler() as crawler:
    # First request
    result1 = await crawler.arun(
        url="https://example.com/login",
        session_id="my_session"
    )

    # Subsequent request using the same session
    result2 = await crawler.arun(
        url="https://example.com/protected",
        session_id="my_session"
    )
```
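#### Structured Extraction

The `arun()` signature accepts an `extraction_strategy`, but none of the examples above use it. A minimal sketch, assuming Crawl4AI ships a `JsonCssExtractionStrategy` in `crawl4ai.extraction_strategy` that maps CSS selectors to fields and writes JSON to `result.extracted_content`; verify the class name, schema keys, and result attribute against your installed version:

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Assumed schema shape: a name, a base selector matching repeated elements,
# and a list of per-element fields.
schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            extraction_strategy=JsonCssExtractionStrategy(schema),
        )
        if result.success:
            articles = json.loads(result.extracted_content)  # Assumed JSON string
            print(articles[:3])

asyncio.run(main())
```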
## Context Manager

AsyncWebCrawler implements the async context manager protocol:

```python
async def __aenter__(self) -> 'AsyncWebCrawler':
    # Initialize browser and resources
    return self

async def __aexit__(self, *args):
    # Clean up resources
    pass
```

Always use AsyncWebCrawler with the async context manager:

```python
async with AsyncWebCrawler() as crawler:
    # Your crawling code here
    pass
```
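Because `arun()` is a coroutine, several pages can be crawled under a single context-managed crawler. A sketch using `asyncio.gather`; whether one crawler instance can safely serve concurrent tasks depends on your version, so treat this pattern as an assumption to test rather than a documented guarantee:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def main():
    urls = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ]
    async with AsyncWebCrawler() as crawler:
        # One browser serves all requests; gather runs the crawls concurrently.
        results = await asyncio.gather(*(crawler.arun(url=u) for u in urls))
    for url, result in zip(urls, results):
        print(url, "ok" if result.success else result.error_message)

asyncio.run(main())
```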
## Best Practices

1. **Resource Management**
   ```python
   # Always use the context manager
   async with AsyncWebCrawler() as crawler:
       # Crawler will be properly cleaned up
       pass
   ```
2. **Error Handling**
   ```python
   try:
       async with AsyncWebCrawler() as crawler:
           result = await crawler.arun(url="https://example.com")
           if not result.success:
               print(f"Crawl failed: {result.error_message}")
   except Exception as e:
       print(f"Error: {str(e)}")
   ```
3. **Performance Optimization**
   ```python
   # Enable caching for better performance
   crawler = AsyncWebCrawler(
       always_by_pass_cache=False,
       verbose=True
   )
   ```
4. **Anti-Detection**
   ```python
   # Maximum stealth
   crawler = AsyncWebCrawler(
       headless=True,
       user_agent="Mozilla/5.0...",
       headers={"Accept-Language": "en-US"}
   )
   result = await crawler.arun(
       url="https://example.com",
       magic=True,
       simulate_user=True
   )
   ```
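Building on the error-handling pattern in item 2, transient failures (timeouts, flaky networks) are often worth retrying. A minimal sketch with exponential backoff that uses only the documented `arun()` parameters; the retry policy itself is an assumption for illustration, not a library feature:

```python
import asyncio

from crawl4ai import AsyncWebCrawler

async def crawl_with_retries(url: str, max_attempts: int = 3):
    async with AsyncWebCrawler() as crawler:
        for attempt in range(1, max_attempts + 1):
            # bypass_cache so each attempt fetches fresh content
            result = await crawler.arun(url=url, bypass_cache=True)
            if result.success:
                return result
            if attempt < max_attempts:
                # Back off exponentially: 1s, 2s, 4s, ...
                await asyncio.sleep(2 ** (attempt - 1))
        return result  # Last failed result, with error_message set

result = asyncio.run(crawl_with_retries("https://example.com"))
print("ok" if result.success else result.error_message)
```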
## Note on Browser Types

Each browser type has its own characteristics:

- **chromium**: Best overall compatibility
- **firefox**: Useful when sites render or behave differently under a non-Chromium engine
- **webkit**: Lighter weight, good for basic crawling

Choose based on your specific needs:

```python
# High compatibility
crawler = AsyncWebCrawler(browser_type="chromium")

# Memory efficient
crawler = AsyncWebCrawler(browser_type="webkit")
```