
Web scraping with Python: the complete guide (2026)

Learn how to build a Python web scraper using BeautifulSoup, Selenium, Playwright, and Scrapy. Step-by-step tutorial with code examples and a no-code alternative.

Mel Shires
December 18, 2023
· 5 min read

Python is the most popular language for web scraping - and for good reason. Its readable syntax, massive library ecosystem, and active community make it the go-to choice for extracting data from websites. Whether you're building a price tracker, collecting research data, or feeding web content into an AI pipeline, Python has a library that fits.

This guide covers everything you need to build a Python web scraper in 2026: the best libraries (including newer alternatives to the classics), working code examples you can run immediately, and strategies for handling the challenges you'll inevitably hit - CAPTCHAs, JavaScript rendering, IP blocks, and more.

If you'd rather skip the coding entirely, we also cover the no-code alternative that lets you extract data from any website in 2 minutes.

What is web scraping?

Web scraping is the automated process of extracting data from websites. Instead of manually copying information, a scraper makes HTTP requests to web pages, parses the HTML content, and pulls out the specific data you need - product prices, job listings, contact information, search results, or any other structured data.

The extracted data is typically saved to a spreadsheet, database, or API for analysis, monitoring, or integration into business workflows.

Prerequisites: what you need before you start

Before writing your first scraper, you'll need three things:

  • Python 3.8+ - Download from python.org. Most macOS and Linux systems ship with Python pre-installed; run python3 --version to check.
  • A code editor - VS Code, PyCharm, or even Jupyter Notebook. Any editor with Python support works.
  • pip - Python's package manager (included with Python 3.4+). You'll use it to install scraping libraries.

Setting up your Python environment

Always use a virtual environment to keep your scraping dependencies isolated from other projects:

# Create a virtual environment
python3 -m venv scraping-env

# Activate it
source scraping-env/bin/activate    # macOS/Linux
scraping-env\Scripts\activate       # Windows

# Install the core libraries
pip install requests beautifulsoup4 lxml

Virtual environments prevent dependency conflicts and make your project reproducible. When you share your project or deploy it to a server, others can recreate your exact setup with pip install -r requirements.txt.

Choosing a Python web scraping library

Python offers several libraries for web scraping, each designed for different scenarios. Here's how the major ones compare:

| Library | Best for | JavaScript support | Learning curve | Speed |
| --- | --- | --- | --- | --- |
| Requests + BeautifulSoup | Static HTML pages | ❌ No | Easy | Fast |
| Selenium | Dynamic/JS-heavy pages, form interactions | ✅ Full browser | Medium | Slow |
| Playwright | Modern JS apps, headless automation | ✅ Full browser | Medium | Medium |
| Scrapy | Large-scale crawling, spider frameworks | ❌ No (needs plugins) | Steep | Very fast |
| httpx + selectolax | High-performance async scraping | ❌ No | Medium | Very fast |

Quick recommendation: Start with Requests + BeautifulSoup for static pages. If the page requires JavaScript, use Playwright (the modern replacement for Selenium). For large-scale crawling, use Scrapy.

Making HTTP requests with Requests

The requests library handles fetching web pages. It's the foundation of most Python scrapers.

Basic GET request

import requests

response = requests.get('https://example.com')

# Always check the status code
if response.status_code == 200:
    html = response.text
    print(f"Got {len(html)} characters")
else:
    print(f"Failed with status: {response.status_code}")

Setting headers to avoid blocks

Many websites block requests that don't include a realistic User-Agent header. Always set one:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

response = requests.get('https://example.com', headers=headers)

Handling timeouts and retries

from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get('https://example.com', timeout=10)

Parsing HTML with BeautifulSoup

Once you have the HTML, BeautifulSoup makes it easy to extract specific data points:

Basic extraction

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # 'lxml' is faster than 'html.parser'

# Get the page title
title = soup.title.string

# Find a specific element by class (find() returns None if nothing matches,
# so calling .text on a missing element raises AttributeError)
price = soup.find('span', class_='price').text

# Find all links on the page
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'], link.text)

Extracting a table into structured data

import csv

table = soup.find('table', class_='data-table')
rows = table.find_all('tr')

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        cells = row.find_all(['td', 'th'])
        writer.writerow([cell.text.strip() for cell in cells])

CSS selectors (often cleaner than find/find_all)

# Select product cards
products = soup.select('div.product-card')

for product in products:
    name = product.select_one('h3.product-title').text.strip()
    price = product.select_one('span.price').text.strip()
    print(f"{name}: {price}")

Scraping JavaScript-rendered pages with Playwright

Many modern websites load content dynamically with JavaScript. Requests + BeautifulSoup only see the initial HTML - they can't execute JavaScript. For these sites, you need a browser automation tool.

Playwright is the modern successor to Selenium. It's faster, more reliable, and has better async support. Here's how to use it:

# Install
pip install playwright
playwright install chromium

Basic Playwright scraping

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto('https://example.com')

    # Wait for a specific element to load
    page.wait_for_selector('.product-list')

    # Extract data
    products = page.query_selector_all('.product-card')
    for product in products:
        name = product.query_selector('h3').inner_text()
        price = product.query_selector('.price').inner_text()
        print(f"{name}: {price}")

    browser.close()

Handling infinite scroll

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/listings')

    # Scroll to load more content
    previous_height = 0
    while True:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # Wait for new content

        current_height = page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break  # No more content to load
        previous_height = current_height

    # Now extract all loaded content
    items = page.query_selector_all('.list-item')
    print(f"Found {len(items)} items after scrolling")

    browser.close()

When to use Playwright vs. Selenium

Selenium is still widely used, but Playwright is the better choice for new projects. It's faster, handles waits more reliably, and supports Chromium, Firefox, and WebKit out of the box. Selenium's primary advantage is its longer history and broader community documentation - if you're following an older tutorial, it likely uses Selenium.

Large-scale scraping with Scrapy

For crawling thousands of pages, Scrapy is the most efficient Python framework. It handles concurrency, request queuing, and data pipelines out of the box.

# Install
pip install scrapy

# Create a new project
scrapy startproject myproject

A basic Scrapy spider

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product-card'):
            yield {
                'name': product.css('h3::text').get(),
                'price': product.css('span.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination links
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it with: scrapy crawl products -o products.json

Scrapy excels at scale but has a steeper learning curve. For one-off scraping tasks, Requests + BeautifulSoup is simpler and faster to set up.

High-performance async scraping with httpx

For speed-critical scraping of static pages, httpx with asyncio lets you make many concurrent requests:

import httpx
import asyncio
from bs4 import BeautifulSoup

async def scrape_url(client, url):
    response = await client.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.title.string if soup.title else 'No title'
    return {'url': url, 'title': title}

async def main():
    urls = [
        'https://example.com/page/1',
        'https://example.com/page/2',
        'https://example.com/page/3',
        # ... hundreds more
    ]

    async with httpx.AsyncClient(timeout=15) as client:
        tasks = [scrape_url(client, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    for result in results:
        if isinstance(result, dict):
            print(result)

asyncio.run(main())

This approach can scrape hundreds of pages in seconds. Pair it with selectolax instead of BeautifulSoup for even faster HTML parsing.

Common web scraping challenges (and how to handle them)

Anti-bot detection and IP blocking

Websites use services like Cloudflare, DataDome, and PerimeterX to detect and block scrapers. Strategies to handle this:

  • Rotate User-Agent headers - Use a list of realistic browser user agents and rotate them per request.
  • Add delays between requests - Use time.sleep() or random delays to avoid triggering rate limits.
  • Use proxy rotation - Route requests through rotating residential proxies to avoid IP blocks.
  • Set realistic headers - Include Accept, Accept-Language, and Referer headers that match a real browser.
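
A minimal sketch combining the first two strategies - rotated user agents plus randomized delays. The user-agent strings here are just examples; use any pool of current, realistic ones:

```python
import random
import time

# Example pool of realistic desktop user agents - extend as needed
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def random_headers():
    """Build browser-like headers with a rotated User-Agent."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }

def polite_delay(min_s=1.0, max_s=3.0):
    """Sleep a random interval so request timing looks less robotic."""
    time.sleep(random.uniform(min_s, max_s))
```

Pass `headers=random_headers()` to each `requests.get()` call and invoke `polite_delay()` between requests; proxy rotation plugs in separately via the `proxies=` argument.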

CAPTCHAs

When you hit a CAPTCHA, your options are limited: use a CAPTCHA solving service (like 2Captcha), switch to a tool with built-in CAPTCHA handling, or reconsider whether scraping that site is worth the complexity.

Dynamic content and AJAX

If data loads after the initial page load (via JavaScript or AJAX calls), use Playwright or Selenium. Alternatively, inspect the browser's Network tab - often the data comes from an API endpoint you can call directly, which is faster and more reliable than rendering the full page.
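
When you find such an endpoint, you can usually call it with plain requests. A sketch, assuming a hypothetical JSON endpoint at /api/products that returns an items array (inspect the Network tab for your site's real URL and response shape):

```python
import requests

# Hypothetical endpoint discovered in the browser's Network tab
API_URL = 'https://example.com/api/products?page=1'

def fetch_products(url):
    """Call the site's JSON API directly instead of rendering HTML."""
    response = requests.get(url, headers={'Accept': 'application/json'}, timeout=10)
    response.raise_for_status()
    return response.json()

def extract_fields(payload):
    """Keep only the fields we care about from the JSON payload."""
    return [
        {'name': item.get('name'), 'price': item.get('price')}
        for item in payload.get('items', [])
    ]
```

Because the API returns structured JSON, there are no selectors to break when the site's HTML changes.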

Pagination and infinite scroll

For paginated sites, look for "next page" links or page number patterns in the URL. For infinite scroll, use Playwright's scroll + wait pattern (shown in the Playwright section above).
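
For the link-following case, a sketch with Requests + BeautifulSoup. The `a.next-page` selector is hypothetical - inspect your target site for the real one:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_next_href(soup):
    """Return the next-page href, or None on the last page."""
    link = soup.select_one('a.next-page')  # hypothetical selector - inspect your site
    return link['href'] if link else None

def scrape_all_pages(start_url, max_pages=50):
    """Yield a parsed soup for each page, following next-page links."""
    url = start_url
    for _ in range(max_pages):  # safety cap against pagination loops
        soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
        yield soup
        next_href = find_next_href(soup)
        if not next_href:
            break
        url = urljoin(url, next_href)  # resolve relative links against the current page
```

The `urljoin` call matters: pagination links are often relative paths like `/products?page=2`.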

Storing scraped data

Choose your storage format based on your needs:

  • CSV - Best for simple, flat data. Easy to open in Excel or Google Sheets.
  • JSON - Best for nested or hierarchical data. Standard format for APIs.
  • SQLite - Best for larger datasets that need querying. No server required.
  • Google Sheets - Best for collaborative access. Use the gspread library to write directly. (See our guide: How to scrape data from a website into Google Sheets)
  • Excel - Best for business users who need formatted spreadsheets. Use openpyxl. (See: How to scrape data from a website into Excel)
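
As an example of the SQLite option, a minimal sketch using only the standard library (the table and column names are illustrative):

```python
import sqlite3

def save_products(rows, db_path='products.db'):
    """Append scraped (name, price) tuples to a local SQLite database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS products ('
        '  name TEXT,'
        '  price TEXT,'
        '  scraped_at TEXT DEFAULT CURRENT_TIMESTAMP'
        ')'
    )
    conn.executemany('INSERT INTO products (name, price) VALUES (?, ?)', rows)
    conn.commit()
    conn.close()

save_products([('Widget', '$19.99'), ('Gadget', '$4.50')])
```

The timestamp column is handy for monitoring use cases - you can query how a price changed across scraping runs.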

Best practices and ethics

  • Check robots.txt - Always review the site's /robots.txt file (e.g., example.com/robots.txt) before scraping. It specifies which pages are off-limits to crawlers.
  • Respect rate limits - Don't hammer servers with rapid requests. Add delays between requests (1–3 seconds is a good baseline).
  • Read terms of service - Some websites explicitly prohibit scraping. Respect these terms.
  • Don't scrape personal data - Collecting personal information without consent can violate GDPR and CCPA.
  • Cache aggressively - Store pages locally during development so you don't hit the same URLs repeatedly.
  • Scrape during off-peak hours - Minimize your impact on the target server.
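
The caching tip takes only a few lines of standard-library code. A sketch that stores each fetched page on disk, keyed by a hash of its URL:

```python
import hashlib
import pathlib
import requests

CACHE_DIR = pathlib.Path('.scrape-cache')
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url):
    """Fetch a URL, reusing a local copy if we already downloaded it."""
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / f'{key}.html'
    if path.exists():
        return path.read_text()           # cache hit: no network request
    html = requests.get(url, timeout=10).text
    path.write_text(html)                 # cache miss: save for next time
    return html
```

During development you can re-run your parsing code dozens of times against the cached copies without sending a single extra request to the target site.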

Automating your Python scraper

Once your scraper works, you'll want it to run automatically:

  • Cron jobs (macOS/Linux) - Schedule your script with crontab -e. Example: 0 12 * * * /usr/bin/python3 /path/to/scraper.py runs daily at noon.
  • Task Scheduler (Windows) - Point it to your Python script and set the frequency.
  • Cloud functions - AWS Lambda, Google Cloud Functions, or similar services let you run scrapers without managing a server.
  • GitHub Actions - Free for public repos. Schedule a workflow that runs your scraper on a cron schedule.

Whichever method you choose, add error handling and notifications (email, Slack) so you know immediately when something breaks.
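
For the notification part, one common pattern is wrapping the scraper in a try/except that posts the traceback to a chat webhook. A sketch, where SLACK_WEBHOOK is a placeholder for your own Slack incoming-webhook URL:

```python
import traceback
import requests

SLACK_WEBHOOK = 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'  # placeholder

def notify(message):
    """Post a message to Slack via an incoming webhook."""
    requests.post(SLACK_WEBHOOK, json={'text': message}, timeout=10)

def run_safely(scrape_fn):
    """Run the scraper; on any failure, send the traceback and re-raise."""
    try:
        scrape_fn()
    except Exception:
        notify(f'Scraper failed:\n{traceback.format_exc()}')
        raise
```

Re-raising after notifying keeps the failure visible to your scheduler (cron, GitHub Actions, etc.) so the run is marked as failed, not silently swallowed.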

The easier alternative: web scraping without Python

Building and maintaining a Python web scraper takes real engineering time - setting up the environment, handling anti-bot detection, managing proxies, debugging broken selectors, and keeping everything running. If you're a developer who enjoys this, Python is a great choice.

But if you just need the data, Browse AI lets you extract data from any website without writing a single line of code. Here's how it compares:

| | Python (DIY) | Browse AI |
| --- | --- | --- |
| Setup time | Hours to days | 2 minutes |
| Coding required | Yes - Python expertise needed | No - point-and-click interface |
| JavaScript rendering | Requires Playwright/Selenium | Built in |
| Anti-bot handling | DIY proxies + rotation | Built in (residential proxies included) |
| Maintenance | Ongoing - fix selectors when sites change | AI auto-adapts to changes |
| Scheduling | Cron jobs / cloud functions | Built-in scheduler |
| Integrations | Build your own | 7,000+ (Zapier, Google Sheets, APIs) |
| Cost | Free (your time + infrastructure) | Free tier available; paid from $19/mo |

Browse AI also offers 230+ prebuilt robots for popular websites - Amazon, LinkedIn, Google, Indeed, Glassdoor, and more. Instead of building a scraper from scratch, you can use a prebuilt robot and start extracting data immediately.

→ Try Browse AI free - no coding required

Frequently asked questions

Is web scraping with Python legal?

Scraping publicly available data is generally legal in the US. In its 2022 hiQ Labs v. LinkedIn decision, the Ninth Circuit held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act. However, always respect a website's terms of service, avoid scraping personal data without consent, and follow robots.txt guidelines.

What is the best Python library for web scraping?

For static pages, Requests + BeautifulSoup is the best starting point. For JavaScript-heavy sites, use Playwright. For large-scale crawling (thousands of pages), use Scrapy. For high-performance async scraping, use httpx.

How do I scrape a website without getting blocked?

Rotate User-Agent headers, add random delays between requests (1–5 seconds), use rotating proxies, set realistic HTTP headers, and respect robots.txt. For the most reliable approach without managing any of this yourself, use an AI-powered tool like Browse AI that handles anti-bot evasion automatically.

Can Python scrape JavaScript-rendered websites?

Yes, but not with Requests + BeautifulSoup alone. You need a browser automation library - Playwright or Selenium - that executes JavaScript and renders the page before extracting content.

How long does it take to learn web scraping with Python?

If you know basic Python, you can build a simple scraper in an afternoon. Handling dynamic content, anti-bot detection, and building robust production scrapers takes weeks to months of practice. Tools like Browse AI eliminate this learning curve entirely for non-developers.

What's the difference between web scraping and web crawling?

Web scraping extracts specific data from web pages (prices, names, emails). Web crawling navigates through websites by following links, discovering pages. Crawling is about finding pages; scraping is about extracting data from them. A typical project uses both - crawl to find pages, then scrape each one. (Read more: Web scraping vs. web crawling: what's the difference?)

Start extracting web data in minutes

Extract, monitor, and scrape data from any website with Browse AI - the most powerful and reliable AI web scraper.

Try Browse AI for free