# Download Handling in Crawl4AI

This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.

## Enabling Downloads

To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
    async with AsyncWebCrawler(config=config) as crawler:
        # ... your crawling logic ...
        pass

asyncio.run(main())
```
Or, enable it for a specific crawl by using `CrawlerRunConfig`:
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(accept_downloads=True)
        result = await crawler.arun(url="https://example.com", config=config)
        # ...

asyncio.run(main())
```
## Specifying Download Location

Specify the download directory using the `downloads_path` attribute of the `BrowserConfig` object. If it is not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
```python
import asyncio
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)

async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ...

asyncio.run(main())
```
## Triggering Downloads

Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions, and `wait_for` to allow sufficient time for downloads to start.
```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    js_code="""
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)

# Run inside an `async with AsyncWebCrawler(...) as crawler:` block:
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
```
## Accessing Downloaded Files

The `downloaded_files` attribute of the `CrawlResult` object contains the paths of downloaded files.
```python
import os

if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        file_size = os.path.getsize(file_path)
        print(f"  File size: {file_size} bytes")
else:
    print("No files downloaded.")
```
## Example: Downloading Multiple Files
```python
import asyncio
import os
from pathlib import Path

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                const downloadLinks = document.querySelectorAll('a[download]');
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000));  // Delay between clicks
                }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)
        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```
## Important Considerations

- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
- **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully (see the sketch below).
- **Security:** Scan downloaded files for potential security threats before use.
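To make the error-handling and security points concrete, here is a minimal sketch. It assumes the `CrawlResult` object exposes `success` and `error_message` fields, as shown elsewhere in the Crawl4AI docs; the `safe_download` helper and the SHA-256 check are illustrative, not part of the library.

```python
import asyncio
import hashlib
import os

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def safe_download(url: str, downloads_path: str) -> list[str]:
    config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)
    try:
        async with AsyncWebCrawler(config=config) as crawler:
            result = await crawler.arun(url=url, config=CrawlerRunConfig(accept_downloads=True))
    except Exception as exc:
        print(f"Crawl failed: {exc}")  # browser launch, navigation, etc.
        return []

    if not result.success:
        print(f"Crawl error: {result.error_message}")
        return []

    verified = []
    for path in result.downloaded_files or []:
        if not os.path.exists(path):  # guard against incorrect or stale paths
            print(f"Reported but missing: {path}")
            continue
        digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
        print(f"{path} sha256={digest}")
        verified.append(path)
    return verified

# Usage: asyncio.run(safe_download("https://example.com", "./my_downloads"))
```

The hash is only a hook: compare it against a published checksum, or hand the file to a dedicated scanner, before anything downstream consumes it.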