How to Build a Web Crawler That Won’t Get Blocked: A Practical Guide
The First Time I Got Blocked
I still remember the moment. I’d written a beautiful little Python script to pull restaurant data from a popular Chinese review site. Ten requests in — BAM. 403 Forbidden. Then 429. Then my IP was in the digital doghouse.
I learned the hard way that websites have gotten very good at spotting crawlers. These days, it’s not just about writing code that works — it’s about writing code that *doesn’t get caught*.
Let me walk you through the strategies I’ve picked up after months of trial and error.
Why Websites Block Crawlers
Before we talk about how to avoid getting blocked, it helps to understand why websites block crawlers in the first place: aggressive bots drive up server load and bandwidth costs, scraped content gets republished without permission, competitors harvest pricing data, and poorly behaved crawlers can degrade the site for real users.
The good news? Most websites don’t mind reasonable, respectful crawling. The key word is *respectful*.
Strategy 1: Rotate Your User-Agent
This is the absolute bare minimum. Your User-Agent string tells the server what browser and OS you’re using. If you send the Python `requests` library’s default UA on every request, you might as well wear a t-shirt that says “I’M A BOT.”
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 14; Pixel 8 Pro) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.6367.83 Mobile Safari/537.36",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }
```
**Pro tip:** Maintain a pool of at least 20 different UAs covering Windows, Mac, Linux, iOS, and Android. Rotate randomly on each request.
Strategy 2: Respect Robots.txt
The `robots.txt` file at the root of any website tells crawlers which paths are off-limits. While it’s not legally binding in most jurisdictions, ignoring it is bad form and can get your IP permanently banned.
```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "/some/path"):
    # Safe to crawl
    pass
else:
    # Skip this path
    pass
```
Strategy 3: Randomize Your Request Timing
This is where most beginners get caught. They write loops like this:
```python
# Bad: predictable pattern
time.sleep(5)  # Exactly 5 seconds every time
```
Human browsing isn’t metronomically regular. We pause, scroll, get distracted, take a sip of coffee. Your crawler should mimic this:
```python
import time
import random

# Good: random delay between requests
def human_delay():
    delay = random.uniform(2.0, 8.0)
    print(f"Waiting {delay:.1f} seconds...")
    time.sleep(delay)
```
Even better: vary the delay based on page complexity. A page with lots of images might take longer to “read” than a text-heavy page.
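If you want to experiment with that, here's a rough sketch. The one-second-per-20-KB "reading rate" and the 10-second cap are arbitrary assumptions, not measured values:

```python
import time
import random

def content_aware_delay(html: str):
    """Scale the pause with how much there is to 'read' on the page."""
    base = random.uniform(2.0, 5.0)
    # Assumption: roughly one extra second per 20 KB of HTML, capped at 10 s
    reading_time = min(len(html) / 20_000, 10)
    time.sleep(base + reading_time)
```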
Strategy 4: Use a Proxy Pool
When you’re crawling at scale, even the best timing won’t save a single IP from rate limiting. You need multiple IP addresses.
Free Options
```python
import requests

url = "https://example.com/some/path"  # placeholder target

# Free proxy list (unreliable but zero cost)
proxies = {
    "http": "http://123.45.67.89:8080",
    "https": "http://123.45.67.89:8080",
}

try:
    r = requests.get(url, proxies=proxies, timeout=10)
except requests.exceptions.RequestException:
    # Try another proxy
    pass
```
Paid Options (Recommended)
For serious crawling, invest in a paid proxy service. Plans run anywhere from $10 to $100 per month, depending on the provider and the size of the IP pool:
| Service | Price | IP Pool | Best For |
|---|---|---|---|
| BrightData | From $10/mo | 72M+ | Enterprise scale |
| Smartproxy | From $50/mo | 40M+ | Social media scraping |
| Oxylabs | From $100/mo | 100M+ | E-commerce pricing |
| Proxy-Seller | From $20/mo | 200K+ | Budget-friendly |
**My advice:** Start with free proxies for testing, then move to a paid residential proxy service when you go into production.
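Whichever provider you choose, wrap the rotation in a small helper so one dead proxy doesn't kill the crawl. A minimal sketch that reuses `get_headers()` from Strategy 1 (the three-attempt limit is just a habit of mine, not a magic number):

```python
import random
import requests

def fetch_with_rotation(url, proxy_urls, max_attempts=3):
    """Try a few proxies from the pool before giving up entirely."""
    for proxy_url in random.sample(proxy_urls, min(max_attempts, len(proxy_urls))):
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            r = requests.get(url, headers=get_headers(), proxies=proxies, timeout=10)
            if r.status_code == 200:
                return r
        except requests.exceptions.RequestException:
            continue  # dead or slow proxy, move on to the next one
    return None  # every attempt failed
```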
Strategy 5: Manage Cookies Like a Browser
Websites issue cookies during your first visit and expect to see them on subsequent requests. Using a `requests.Session()` handles this automatically:
```python
session = requests.Session()

# First request — get initial cookies
session.get("https://example.com", headers=get_headers())

# Subsequent requests — cookies are sent automatically
response = session.get("https://example.com/data", headers=get_headers())
```
Strategy 6: Add Realistic Referrers
Browsers send a `Referer` header that tells the server where you came from. A crawler that goes straight to a deep page without a referrer looks suspicious:
```python
# Suspicious: no referrer
headers = get_headers()

# Better: pretend you came from Google
headers["Referer"] = "https://www.google.com/search?q=china+travel+guide"

# Even better: crawl the homepage first, then use it as a referrer
homepage = session.get("https://example.com", headers=get_headers())
headers["Referer"] = "https://example.com/"
session.get("https://example.com/deep-page", headers=headers)
```
Strategy 7: Crawl Like a Human (Schedule Matters)
Here’s a tip most tutorials don’t mention: **when** you crawl matters.
```python
import time
from datetime import datetime

# Don't crawl at 3 AM — no human browses at 3 AM
# Best times: 9 AM - 11 AM, 2 PM - 5 PM, 8 PM - 10 PM
# Worst times: Midnight - 6 AM, lunch hour (12-1 PM)

crawl_hours = range(8, 23)  # hours 8 through 22 (8 AM to ~11 PM)
current_hour = datetime.now().hour

if current_hour not in crawl_hours:
    print("Sleeping until morning...")
    hours_to_wait = (crawl_hours[0] - current_hour) % 24  # wrap past midnight
    time.sleep(3600 * hours_to_wait)
```
Building Your Crawler: A Complete Template
Here’s a starter template that combines everything we’ve discussed:
```python
import requests
import time
import random
from datetime import datetime


class RespectfulCrawler:
    def __init__(self, base_url, proxy_list=None):
        self.base_url = base_url
        self.session = requests.Session()
        self.proxies = proxy_list or []
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
            # Add 15-20 more
        ]
        self.request_count = 0
        self.max_per_minute = 15
        self.start_time = time.time()

    def _headers(self):
        headers = {
            "User-Agent": random.choice(self.user_agents),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Referer": random.choice([
                "https://www.google.com/",
                self.base_url,
                "https://www.baidu.com/s?wd=china",
            ]),
            "DNT": "1",
            "Connection": "keep-alive",
        }
        return headers

    def _delay(self):
        # Adaptive delay — increases if we've made many requests
        elapsed = time.time() - self.start_time
        rate = self.request_count / (elapsed / 60)
        if rate > self.max_per_minute:
            delay = random.uniform(5, 10)
        else:
            delay = random.uniform(2, 5)
        time.sleep(delay)

    def fetch(self, url):
        self._delay()
        proxy = None
        if self.proxies:
            proxy = {"http": random.choice(self.proxies), "https": random.choice(self.proxies)}
        try:
            r = self.session.get(url, headers=self._headers(), proxies=proxy, timeout=15)
            self.request_count += 1
            if r.status_code == 200:
                return r.text
            elif r.status_code in [403, 429]:
                print(f"⚠️ Blocked! Status: {r.status_code}. Cooling down...")
                time.sleep(60)  # Cool down for a minute
                return None
            else:
                print(f"⚠️ Unexpected status: {r.status_code}")
                return None
        except requests.exceptions.RequestException as e:
            print(f"❌ Request failed: {e}")
            return None

    def is_polite_time(self):
        hour = datetime.now().hour
        # Only crawl during reasonable hours
        return 8 <= hour <= 22
```
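And a quick usage sketch (the URLs and the proxy address are placeholders, not real endpoints):

```python
# Usage example for the template above
crawler = RespectfulCrawler(
    "https://example.com",
    proxy_list=["http://123.45.67.89:8080"],  # placeholder proxy
)

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    if not crawler.is_polite_time():
        print("Outside polite hours, stopping.")
        break
    html = crawler.fetch(url)
    if html:
        print(f"Fetched {url} ({len(html)} bytes)")
```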
How to Detect If You’ve Been Blocked
Watch for these warning signs:
| Signal | What It Means | Action |
|---|---|---|
| HTTP 403 / 429 | Temporary block | Stop, use different IP, slow down |
| CAPTCHA challenge | Bot detected | Switch to residential proxy |
| Empty data returned | You're in a honeypot | Check your session cookies |
| Delayed responses > 5s | Rate limited | Reduce frequency |
| Consistent 200 but no real data | Blacklisted silently | Change IP and UA |
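You can wire a few of these signals into a quick check after every request. This is a rough heuristic rather than a detector: the keyword list, the 5-second threshold, and the 500-byte floor are assumptions you'll want to tune per site.

```python
def looks_blocked(response) -> bool:
    """Heuristic check of a requests.Response against the signals above."""
    if response is None or response.status_code in (403, 429):
        return True
    if response.elapsed.total_seconds() > 5:  # rate limiting often shows up as lag
        return True
    text = response.text.lower()
    if "captcha" in text or "access denied" in text:
        return True
    # A 200 with a suspiciously tiny body often means a silent blacklist
    return len(text) < 500
```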
Storing Your Data: A Simple Database Schema
If you’re crawling at any scale, you’ll want a database. Here’s a minimal schema that works for most use cases:
```sql
CREATE TABLE crawled_data (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    source_name VARCHAR(100),                       -- e.g., "Trip.com Hotels"
    title VARCHAR(255),                             -- Title of the page/item
    content TEXT,                                   -- Extracted content
    url VARCHAR(500),                               -- Original URL
    price DECIMAL(10,2),                            -- Price if applicable
    rating DECIMAL(2,1),                            -- Rating if applicable
    tags VARCHAR(255),                              -- Comma-separated tags
    crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_source (source_name),
    INDEX idx_crawled_at (crawled_at)
);
```
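And a minimal insert sketch, assuming a MySQL server and the `pymysql` driver (the connection details and the sample row are placeholders):

```python
import pymysql  # assumption: MySQL backend with the pymysql driver installed

conn = pymysql.connect(host="localhost", user="crawler",
                       password="secret", database="scraping")  # placeholder credentials
with conn.cursor() as cur:
    cur.execute(
        """INSERT INTO crawled_data
               (source_name, title, content, url, price, rating, tags)
           VALUES (%s, %s, %s, %s, %s, %s, %s)""",
        ("Trip.com Hotels", "Sample Hotel", "Extracted text goes here",
         "https://example.com/hotel/1", 120.00, 4.5, "hotel,sample"),
    )
conn.commit()
conn.close()
```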
Final Thoughts
Building a crawler that doesn’t get blocked isn’t about being sneaky — it’s about being respectful. Think of it this way: you’re a guest on someone’s website. Act like one.
The best crawlers are the ones the website never notices.
Got questions about specific crawling challenges? Drop them in the comments below.
—
*This guide is based on real-world experience crawling Chinese travel and food sites. Your mileage may vary depending on the website’s specific anti-bot measures.*