Crawl4AI Tutorial
- What is Crawl4AI
- Why Crawl4AI?
- Basic Crawl4AI Example - Single Page Crawl
- Ethics of Web Scraping
- Crawling Multiple Pages
- FAST Parallel Page Crawling
What is Crawl4AI
Crawl4AI is an open-source web crawling framework specifically designed to scrape websites and format the output in the best way for LLMs to understand.
In this tutorial, we will use Crawl4AI to scrape a website and prepare its content for an LLM.
Why Crawl4AI?
- Built for LLMs: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.
- Lightning Fast: Delivers results 6x faster with real-time, cost-efficient performance.
- Flexible Browser Control: Offers session management, proxies, and custom hooks for seamless data access.
- Heuristic Intelligence: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.
- Open Source & Deployable: Fully open-source with no API keys—ready for Docker and cloud integration.
- Thriving Community: Actively maintained by a vibrant community and the #1 trending GitHub repository.
When you visit a website and extract its raw HTML (view the page source), it looks like a mess, and as a human it is hard to pull useful information out of it. One of the most useful things Crawl4AI does is take that HTML source code and turn it into Markdown, a human-readable format.
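To make this concrete, here is a rough hand-written illustration of the kind of transformation involved (the HTML fragment and its Markdown are made up for illustration, not actual Crawl4AI output):

# Hypothetical illustration only: hand-written HTML and the Markdown a
# converter would typically produce for it, not actual Crawl4AI output.
raw_html = '<h1>Docs</h1><p>Read the <a href="/intro">intro</a> first.</p>'
as_markdown = "# Docs\n\nRead the [intro](/intro) first."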
Basic Crawl4AI Example - Single Page Crawl
- Let’s install Crawl4AI:
pip install -U crawl4ai
crawl4ai-setup
- In this example, we will scrape the homepage of Pydantic AI:
import asyncio
from crawl4ai import *

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://ai.pydantic.dev/",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
- Execute the example and you will see how fast it scrapes the page.
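Beyond printing, you will usually want to keep the output. As a minimal sketch building on the example above (the output filename is just an illustrative choice), you can write the generated Markdown to a local file:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://ai.pydantic.dev/")
        # str() works whether result.markdown is a plain string or a richer
        # markdown result object; the filename below is just an example.
        markdown_text = str(result.markdown)
        with open("pydantic_ai_home.md", "w", encoding="utf-8") as f:
            f.write(markdown_text)
        print(f"Saved {len(markdown_text)} characters of Markdown")

if __name__ == "__main__":
    asyncio.run(main())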
Ethics of Web Scraping
Web scraping comes with ethical considerations. Before scraping a site, check its robots.txt file, which states the site's rules for crawlers, and make sure to follow them (e.g. www.youtube.com/robots.txt, www.github.com/robots.txt).
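If you prefer to check this in code rather than reading robots.txt by hand, Python's standard library ships urllib.robotparser. The sketch below is generic (the site URL and the "*" user agent are placeholder choices, unrelated to Crawl4AI itself):

from urllib.robotparser import RobotFileParser

# Placeholder site and user agent for illustration; swap in your own.
robots = RobotFileParser("https://ai.pydantic.dev/robots.txt")
robots.read()  # fetch and parse the robots.txt file

url_to_crawl = "https://ai.pydantic.dev/agents/"
if robots.can_fetch("*", url_to_crawl):
    print(f"Allowed to crawl: {url_to_crawl}")
else:
    print(f"robots.txt disallows crawling: {url_to_crawl}")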
Crawling Multiple Pages
We can scrape multiple pages by using a site's sitemap. Most websites publish a sitemap that lists their internal links; for example, https://ai.pydantic.dev/sitemap.xml covers all of the Pydantic AI documentation pages.
In this example, we pull the sitemap in code, extract every URL from it, and feed all of those URLs into Crawl4AI:
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
import requests
from xml.etree import ElementTree

async def crawl_sequential(urls: List[str]):
    print("\n=== Sequential Crawling with Session Reuse ===")

    browser_config = BrowserConfig(
        headless=True,
        # For better performance in Docker or low-memory environments:
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )

    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )

    # Create the crawler (opens the browser)
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        session_id = "session1"  # Reuse the same session across all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id
            )
            if result.success:
                print(f"Successfully crawled: {url}")
                # E.g. check markdown length
                print(f"Markdown length: {len(result.markdown.raw_markdown)}")
            else:
                print(f"Failed: {url} - Error: {result.error_message}")
    finally:
        # After all URLs are done, close the crawler (and the browser)
        await crawler.close()

def get_pydantic_ai_docs_urls():
    """
    Fetches all URLs from the Pydantic AI documentation.
    Uses the sitemap (https://ai.pydantic.dev/sitemap.xml) to get these URLs.

    Returns:
        List[str]: List of URLs
    """
    sitemap_url = "https://ai.pydantic.dev/sitemap.xml"
    try:
        response = requests.get(sitemap_url)
        response.raise_for_status()

        # Parse the XML
        root = ElementTree.fromstring(response.content)

        # Extract all URLs from the sitemap
        # The namespace is usually defined in the root element
        namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
        urls = [loc.text for loc in root.findall('.//ns:loc', namespace)]

        return urls
    except Exception as e:
        print(f"Error fetching sitemap: {e}")
        return []

async def main():
    urls = get_pydantic_ai_docs_urls()
    if urls:
        print(f"Found {len(urls)} URLs to crawl")
        await crawl_sequential(urls)
    else:
        print("No URLs found to crawl")

if __name__ == "__main__":
    asyncio.run(main())
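A note on the design: the crawler is started once and the same session_id is reused for every URL, so the browser is opened a single time instead of once per page. The trade-off is that pages are still fetched one after another, which is exactly what the next section speeds up.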
FAST Parallel Page Crawling
In the last example, we processed each sitemap URL sequentially, with no parallelism. We can make the processing much faster by using Crawl4AI's parallel processing.
For this example, make sure to also install the psutil package (pip install psutil).
import os
import sys
import psutil
import asyncio
import requests
from xml.etree import ElementTree

__location__ = os.path.dirname(os.path.abspath(__file__))
__output__ = os.path.join(__location__, "output")

# Append parent directory to system path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
    print("\n=== Parallel Crawling with Browser Reuse + Memory Check ===")

    # We'll keep track of peak memory usage across all tasks
    peak_memory = 0
    process = psutil.Process(os.getpid())

    def log_memory(prefix: str = ""):
        nonlocal peak_memory
        current_mem = process.memory_info().rss  # in bytes
        if current_mem > peak_memory:
            peak_memory = current_mem
        print(f"{prefix} Current Memory: {current_mem // (1024 * 1024)} MB, Peak: {peak_memory // (1024 * 1024)} MB")

    # Minimal browser config
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        # We'll chunk the URLs in batches of 'max_concurrent'
        success_count = 0
        fail_count = 0
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i : i + max_concurrent]
            tasks = []

            for j, url in enumerate(batch):
                # Unique session_id per concurrent sub-task
                session_id = f"parallel_session_{i + j}"
                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
                tasks.append(task)

            # Check memory usage prior to launching tasks
            log_memory(prefix=f"Before batch {i//max_concurrent + 1}: ")

            # Gather results
            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Check memory usage after tasks complete
            log_memory(prefix=f"After batch {i//max_concurrent + 1}: ")

            # Evaluate results
            for url, result in zip(batch, results):
                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {result}")
                    fail_count += 1
                elif result.success:
                    success_count += 1
                else:
                    fail_count += 1

        print(f"\nSummary:")
        print(f" - Successfully crawled: {success_count}")
        print(f" - Failed: {fail_count}")

    finally:
        print("\nClosing crawler...")
        await crawler.close()

        # Final memory log
        log_memory(prefix="Final: ")
        print(f"\nPeak memory usage (MB): {peak_memory // (1024 * 1024)}")

def get_pydantic_ai_docs_urls():
    """
    Fetches all URLs from the Pydantic AI documentation.
    Uses the sitemap (https://ai.pydantic.dev/sitemap.xml) to get these URLs.

    Returns:
        List[str]: List of URLs
    """
    sitemap_url = "https://ai.pydantic.dev/sitemap.xml"
    try:
        response = requests.get(sitemap_url)
        response.raise_for_status()

        # Parse the XML
        root = ElementTree.fromstring(response.content)

        # Extract all URLs from the sitemap
        # The namespace is usually defined in the root element
        namespace = {'ns': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
        urls = [loc.text for loc in root.findall('.//ns:loc', namespace)]

        return urls
    except Exception as e:
        print(f"Error fetching sitemap: {e}")
        return []

async def main():
    urls = get_pydantic_ai_docs_urls()
    if urls:
        print(f"Found {len(urls)} URLs to crawl")
        await crawl_parallel(urls, max_concurrent=10)
    else:
        print("No URLs found to crawl")

if __name__ == "__main__":
    asyncio.run(main())
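The max_concurrent parameter is the main knob here: each batch launches that many crawls at once via asyncio.gather, and the psutil-based log_memory calls before and after each batch show what that level of concurrency costs in memory. If the peak climbs too high for your machine, lower max_concurrent (the function defaults to 3, while main() passes 10); if you have headroom, raise it for faster crawls.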
By Wahid Hamdi