Python is the most popular language for web scraping - and for good reason. Its readable syntax, massive library ecosystem, and active community make it the go-to choice for extracting data from websites. Whether you're building a price tracker, collecting research data, or feeding web content into an AI pipeline, Python has a library that fits.
This guide covers everything you need to build a Python web scraper in 2026: the best libraries (including newer alternatives to the classics), working code examples you can run immediately, and strategies for handling the challenges you'll inevitably hit - CAPTCHAs, JavaScript rendering, IP blocks, and more.
If you'd rather skip the coding entirely, we also cover the no-code alternative that lets you extract data from any website in 2 minutes.
What is web scraping?
Web scraping is the automated process of extracting data from websites. Instead of manually copying information, a scraper makes HTTP requests to web pages, parses the HTML content, and pulls out the specific data you need - product prices, job listings, contact information, search results, or any other structured data.
The extracted data is typically saved to a spreadsheet, database, or API for analysis, monitoring, or integration into business workflows.
Prerequisites: what you need before you start
Before writing your first scraper, you'll need three things:
- Python 3.8+ - Download from python.org. Most systems come with Python pre-installed; run `python3 --version` to check.
- A code editor - VS Code, PyCharm, or even Jupyter Notebook. Any editor with Python support works.
- pip - Python's package manager (included with Python 3.4+). You'll use it to install scraping libraries.
Setting up your Python environment
Always use a virtual environment to keep your scraping dependencies isolated from other projects:
```bash
# Create a virtual environment
python3 -m venv scraping-env

# Activate it
source scraping-env/bin/activate   # macOS/Linux
scraping-env\Scripts\activate      # Windows

# Install the core libraries
pip install requests beautifulsoup4 lxml
```
Virtual environments prevent dependency conflicts and make your project reproducible. When you share your project or deploy it to a server, others can recreate your exact setup with `pip install -r requirements.txt`.
Choosing a Python web scraping library
Python offers several libraries for web scraping, each designed for different scenarios. Here's how the major ones compare:
| Library | Best For | JavaScript Support | Learning Curve | Speed |
|---|---|---|---|---|
| Requests + BeautifulSoup | Static HTML pages | ❌ No | Easy | Fast |
| Selenium | Dynamic/JS-heavy pages, form interactions | ✅ Full browser | Medium | Slow |
| Playwright | Modern JS apps, headless automation | ✅ Full browser | Medium | Medium |
| Scrapy | Large-scale crawling, spider frameworks | ❌ No (needs plugins) | Steep | Very fast |
| httpx + selectolax | High-performance async scraping | ❌ No | Medium | Very fast |
Quick recommendation: Start with Requests + BeautifulSoup for static pages. If the page requires JavaScript, use Playwright (the modern replacement for Selenium). For large-scale crawling, use Scrapy.
Making HTTP requests with Requests
The requests library handles fetching web pages. It's the foundation of most Python scrapers.
Basic GET request
```python
import requests

response = requests.get('https://example.com')

# Always check the status code
if response.status_code == 200:
    html = response.text
    print(f"Got {len(html)} characters")
else:
    print(f"Failed with status: {response.status_code}")
```
Setting headers to avoid blocks
Many websites block requests that don't include a realistic User-Agent header. Always set one:
```python
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get('https://example.com', headers=headers)
```
Handling timeouts and retries
```python
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount('https://', HTTPAdapter(max_retries=retries))

response = session.get('https://example.com', timeout=10)
```
Parsing HTML with BeautifulSoup
Once you have the HTML, BeautifulSoup makes it easy to extract specific data points:
Basic extraction
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # 'lxml' is faster than 'html.parser'

# Get the page title
title = soup.title.string

# Find a specific element by class
price = soup.find('span', class_='price').text

# Find all links on the page
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'], link.text)
```
Extracting a table into structured data
```python
import csv

table = soup.find('table', class_='data-table')
rows = table.find_all('tr')

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        cells = row.find_all(['td', 'th'])
        writer.writerow([cell.text.strip() for cell in cells])
```
CSS selectors (often cleaner than find/find_all)
```python
# Select product cards
products = soup.select('div.product-card')
for product in products:
    name = product.select_one('h3.product-title').text.strip()
    price = product.select_one('span.price').text.strip()
    print(f"{name}: {price}")
```
Scraping JavaScript-rendered pages with Playwright
Many modern websites load content dynamically with JavaScript. Requests + BeautifulSoup only see the initial HTML - they can't execute JavaScript. For these sites, you need a browser automation tool.
Playwright is the modern successor to Selenium. It's faster, more reliable, and has better async support. Here's how to use it:
```bash
# Install
pip install playwright
playwright install chromium
```
Basic Playwright scraping
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')

    # Wait for a specific element to load
    page.wait_for_selector('.product-list')

    # Extract data
    products = page.query_selector_all('.product-card')
    for product in products:
        name = product.query_selector('h3').inner_text()
        price = product.query_selector('.price').inner_text()
        print(f"{name}: {price}")

    browser.close()
```
Handling infinite scroll
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/listings')

    # Scroll to load more content
    previous_height = 0
    while True:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # Wait for new content
        current_height = page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break  # No more content to load
        previous_height = current_height

    # Now extract all loaded content
    items = page.query_selector_all('.list-item')
    print(f"Found {len(items)} items after scrolling")

    browser.close()
```
When to use Playwright vs. Selenium
Selenium is still widely used, but Playwright is the better choice for new projects. It's faster, handles waits more reliably, and supports Chromium, Firefox, and WebKit out of the box. Selenium's primary advantage is its longer history and broader community documentation - if you're following an older tutorial, it likely uses Selenium.
Large-scale scraping with Scrapy
For crawling thousands of pages, Scrapy is the most efficient Python framework. It handles concurrency, request queuing, and data pipelines out of the box.
```bash
# Install
pip install scrapy

# Create a new project
scrapy startproject myproject
```
A basic Scrapy spider
```python
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('div.product-card'):
            yield {
                'name': product.css('h3::text').get(),
                'price': product.css('span.price::text').get(),
                'url': product.css('a::attr(href)').get(),
            }

        # Follow pagination links
        next_page = response.css('a.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
```
Run it with: `scrapy crawl products -o products.json`
Scrapy excels at scale but has a steeper learning curve. For one-off scraping tasks, Requests + BeautifulSoup is simpler and faster to set up.
High-performance async scraping with httpx
For speed-critical scraping of static pages, httpx with asyncio lets you make many concurrent requests:
```python
import httpx
import asyncio
from bs4 import BeautifulSoup

async def scrape_url(client, url):
    response = await client.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.title.string if soup.title else 'No title'
    return {'url': url, 'title': title}

async def main():
    urls = [
        'https://example.com/page/1',
        'https://example.com/page/2',
        'https://example.com/page/3',
        # ... hundreds more
    ]
    async with httpx.AsyncClient(timeout=15) as client:
        tasks = [scrape_url(client, url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for result in results:
            if isinstance(result, dict):
                print(result)

asyncio.run(main())
```
This approach can scrape hundreds of pages in seconds. Pair it with selectolax instead of BeautifulSoup for even faster HTML parsing.
Common web scraping challenges (and how to handle them)
Anti-bot detection and IP blocking
Websites use services like Cloudflare, DataDome, and PerimeterX to detect and block scrapers. Strategies to handle this:
- Rotate User-Agent headers - Use a list of realistic browser user agents and rotate them per request.
- Add delays between requests - Use `time.sleep()` or random delays to avoid triggering rate limits.
- Use proxy rotation - Route requests through rotating residential proxies to avoid IP blocks.
- Set realistic headers - Include Accept, Accept-Language, and Referer headers that match a real browser.
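The header-rotation and delay tactics above can be combined into a small helper. The user-agent strings are illustrative examples, and `build_headers`/`polite_get` are hypothetical names, not a standard API:

```python
import random
import time

import requests

# A small pool of realistic desktop user agents (illustrative examples)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def build_headers():
    """Rotate the User-Agent and include realistic companion headers."""
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }

def polite_get(url, min_delay=1.0, max_delay=3.0):
    """Fetch a URL with rotated headers after a random delay."""
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, headers=build_headers(), timeout=10)
```

Rotating headers per request (rather than per session) makes the traffic pattern less uniform, which is usually what rate limiters key on.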
CAPTCHAs
When you hit a CAPTCHA, your options are limited: use a CAPTCHA solving service (like 2Captcha), switch to a tool with built-in CAPTCHA handling, or reconsider whether scraping that site is worth the complexity.
Dynamic content and AJAX
If data loads after the initial page load (via JavaScript or AJAX calls), use Playwright or Selenium. Alternatively, inspect the browser's Network tab - often the data comes from an API endpoint you can call directly, which is faster and more reliable than rendering the full page.
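If you find such an endpoint, you can often skip HTML parsing entirely. The endpoint URL and the `items`/`name`/`price` keys below are assumptions for illustration - match them to the actual JSON you see in DevTools:

```python
def extract_products(payload):
    """Pull name/price pairs out of a JSON payload.

    The 'items', 'name', and 'price' keys are assumptions about the
    endpoint's shape - adjust them to what DevTools actually shows.
    """
    return [
        {'name': item['name'], 'price': item['price']}
        for item in payload.get('items', [])
    ]

# With requests, against a hypothetical endpoint spotted in the Network tab:
# response = requests.get('https://example.com/api/products?page=1', timeout=10)
# products = extract_products(response.json())

# The same function works on any dict with that shape:
sample = {'items': [{'name': 'Widget', 'price': '$9.99', 'sku': 'W1'}]}
print(extract_products(sample))  # [{'name': 'Widget', 'price': '$9.99'}]
```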
Pagination and infinite scroll
For paginated sites, look for "next page" links or page number patterns in the URL. For infinite scroll, use Playwright's scroll + wait pattern (shown in the Playwright section above).
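For the URL-pattern case, a simple page-URL generator is often all you need. The `?page=N` query pattern here is an assumption - check how the site you're scraping actually numbers its pages (it might use `/page/N` or an offset parameter):

```python
def page_urls(base_url, last_page):
    """Build the URL for each page of a paginated listing.

    Assumes a '?page=N' query pattern - adjust to the site's
    actual URL structure.
    """
    return [f"{base_url}?page={n}" for n in range(1, last_page + 1)]

for url in page_urls('https://example.com/products', 3):
    print(url)
# https://example.com/products?page=1
# https://example.com/products?page=2
# https://example.com/products?page=3
```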
Storing scraped data
Choose your storage format based on your needs:
- CSV - Best for simple, flat data. Easy to open in Excel or Google Sheets.
- JSON - Best for nested or hierarchical data. Standard format for APIs.
- SQLite - Best for larger datasets that need querying. No server required.
- Google Sheets - Best for collaborative access. Use the `gspread` library to write directly. (See our guide: How to scrape data from a website into Google Sheets)
- Excel - Best for business users who need formatted spreadsheets. Use `openpyxl`. (See: How to scrape data from a website into Excel)
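As an example of the SQLite option, the standard library's `sqlite3` module needs no extra installs. The table and column names here are illustrative:

```python
import sqlite3

# Connect (creates scraped.db on first run); ':memory:' also works for testing
conn = sqlite3.connect('scraped.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS products (
        name TEXT,
        price TEXT,
        scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Insert scraped rows in one batch
rows = [('Widget A', '$9.99'), ('Widget B', '$14.99')]
conn.executemany("INSERT INTO products (name, price) VALUES (?, ?)", rows)
conn.commit()

# Query it back - no server required
for name, price in conn.execute("SELECT name, price FROM products"):
    print(name, price)
conn.close()
```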
Best practices and ethics
- Check robots.txt - Always review `example.com/robots.txt` before scraping. It specifies which pages are off-limits.
- Respect rate limits - Don't hammer servers with rapid requests. Add delays between requests (1–3 seconds is a good baseline).
- Read terms of service - Some websites explicitly prohibit scraping. Respect these terms.
- Don't scrape personal data - Collecting personal information without consent can violate GDPR and CCPA.
- Cache aggressively - Store pages locally during development so you don't hit the same URLs repeatedly.
- Scrape during off-peak hours - Minimize your impact on the target server.
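The robots.txt check from the list above can be automated with the standard library's `urllib.robotparser`. This sketch parses an illustrative rule set from text; against a real site you'd call `set_url()` and `read()` instead:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In practice: rp.set_url('https://example.com/robots.txt'); rp.read()
# Here we parse an illustrative rule set directly:
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
    'Allow: /',
])

print(rp.can_fetch('*', 'https://example.com/products'))   # allowed
print(rp.can_fetch('*', 'https://example.com/private/x'))  # disallowed
```

Calling `can_fetch()` before each request costs almost nothing and keeps your scraper out of pages the site has explicitly marked off-limits.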
Automating your Python scraper
Once your scraper works, you'll want it to run automatically:
- Cron jobs (macOS/Linux) - Schedule your script with `crontab -e`. Example: `0 12 * * * /usr/bin/python3 /path/to/scraper.py` runs daily at noon.
- Task Scheduler (Windows) - Point it to your Python script and set the frequency.
- Cloud functions - AWS Lambda, Google Cloud Functions, or similar services let you run scrapers without managing a server.
- GitHub Actions - Free for public repos. Schedule a workflow that runs your scraper on a cron schedule.
Whichever method you choose, add error handling and notifications (email, Slack) so you know immediately when something breaks.
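A minimal error-alerting wrapper might look like this - `run_with_alert` and `notify` are hypothetical names, and the real notifier (Slack webhook, SMTP email) is left as a stub:

```python
import traceback

def run_with_alert(task, notify):
    """Run a scraper function; on failure, send the traceback via notify().

    'notify' is any callable - e.g. a function that posts to a Slack
    webhook or sends an email (both omitted here for brevity).
    """
    try:
        task()
        return True
    except Exception:
        notify(f"Scraper failed:\n{traceback.format_exc()}")
        return False

# Usage: swap print for a real Slack/email notifier
run_with_alert(lambda: 1 / 0, notify=print)
```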
The easier alternative: web scraping without Python
Building and maintaining a Python web scraper takes real engineering time - setting up the environment, handling anti-bot detection, managing proxies, debugging broken selectors, and keeping everything running. If you're a developer who enjoys this, Python is a great choice.
But if you just need the data, Browse AI lets you extract data from any website without writing a single line of code. Here's how it compares:
| | Python (DIY) | Browse AI |
|---|---|---|
| Setup time | Hours to days | 2 minutes |
| Coding required | Yes - Python expertise needed | No - point-and-click interface |
| JavaScript rendering | Requires Playwright/Selenium | Built in |
| Anti-bot handling | DIY proxies + rotation | Built in (residential proxies included) |
| Maintenance | Ongoing - fix selectors when sites change | AI auto-adapts to changes |
| Scheduling | Cron jobs / cloud functions | Built-in scheduler |
| Integrations | Build your own | 7,000+ (Zapier, Google Sheets, APIs) |
| Cost | Free (your time + infrastructure) | Free tier available; paid from $19/mo |
Browse AI also offers 230+ prebuilt robots for popular websites - Amazon, LinkedIn, Google, Indeed, Glassdoor, and more. Instead of building a scraper from scratch, you can use a prebuilt robot and start extracting data immediately.
→ Try Browse AI free - no coding required
Frequently asked questions
Is web scraping with Python legal?
Scraping publicly available data is generally legal. The US Ninth Circuit's 2022 ruling in hiQ Labs v. LinkedIn reaffirmed that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act. However, always respect a website's terms of service, avoid scraping personal data without consent, and follow robots.txt guidelines.
What is the best Python library for web scraping?
For static pages, Requests + BeautifulSoup is the best starting point. For JavaScript-heavy sites, use Playwright. For large-scale crawling (thousands of pages), use Scrapy. For high-performance async scraping, use httpx.
How do I scrape a website without getting blocked?
Rotate User-Agent headers, add random delays between requests (1–5 seconds), use rotating proxies, set realistic HTTP headers, and respect robots.txt. For the most reliable approach without managing any of this yourself, use an AI-powered tool like Browse AI that handles anti-bot evasion automatically.
Can Python scrape JavaScript-rendered websites?
Yes, but not with Requests + BeautifulSoup alone. You need a browser automation library - Playwright or Selenium - that executes JavaScript and renders the page before extracting content.
How long does it take to learn web scraping with Python?
If you know basic Python, you can build a simple scraper in an afternoon. Handling dynamic content, anti-bot detection, and building robust production scrapers takes weeks to months of practice. Tools like Browse AI eliminate this learning curve entirely for non-developers.
What's the difference between web scraping and web crawling?
Web scraping extracts specific data from web pages (prices, names, emails). Web crawling navigates through websites by following links, discovering pages. Crawling is about finding pages; scraping is about extracting data from them. A typical project uses both - crawl to find pages, then scrape each one. (Read more: Web scraping vs. web crawling: what's the difference?)