How to Build a Web Crawler That Won’t Get Blocked: A Practical Guide
The First Time I Got Blocked
I still remember the moment. I’d written a beautiful little Python script to pull restaurant data from a popular Chinese review site. Ten requests in — BAM. 403 Forbidden. Then 429. Then my IP was in the digital doghouse.
I learned the hard way that websites have gotten very good at spotting crawlers. These days, it’s not just about writing code that works — it’s about writing code that *doesn’t get caught*.
Let me walk you through the strategies I’ve picked up after months of trial and error.
Why Websites Block Crawlers
Before we talk about how to avoid getting blocked, it helps to understand why websites block crawlers in the first place: aggressive bots drive up server load and bandwidth costs, scraped content gets republished without permission, competitors harvest pricing data, and poorly behaved crawlers can degrade the site for real users.
The good news? Most websites don’t mind reasonable, respectful crawling. The key word is *respectful*.
Strategy 1: Rotate Your User-Agent
This is the absolute bare minimum. Your User-Agent string tells the server what browser and OS you’re using. If you send the Python `requests` library’s default UA on every request, you might as well wear a t-shirt that says “I’M A BOT.”
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 14; Pixel 8 Pro) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.6367.83 Mobile Safari/537.36",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
    }
```
**Pro tip:** Maintain a pool of at least 20 different UAs covering Windows, Mac, Linux, iOS, and Android. Rotate randomly on each request.
Strategy 2: Respect Robots.txt
The `robots.txt` file at the root of any website tells crawlers which paths are off-limits. While it’s not legally binding in most jurisdictions, ignoring it is bad form and can get your IP permanently banned.
```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "/some/path"):
    # Safe to crawl
    pass
else:
    # Skip this path
    pass
```
Strategy 3: Randomize Your Request Timing
This is where most beginners get caught. They write loops like this:
```python
# Bad: predictable pattern
time.sleep(5)  # Exactly 5 seconds every time
```
Human browsing isn’t metronomically regular. We pause, scroll, get distracted, take a sip of coffee. Your crawler should mimic this:
```python
import time
import random

# Good: random delay between requests
def human_delay():
    delay = random.uniform(2.0, 8.0)
    print(f"Waiting {delay:.1f} seconds...")
    time.sleep(delay)
```
Even better: vary the delay based on page complexity. A page with lots of images might take longer to “read” than a text-heavy page.
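If you want to experiment with that, here's a rough sketch. The one-second-per-20-KB "reading rate" and the 10-second cap are arbitrary assumptions, not measured values:

```python
import time
import random

def content_aware_delay(html: str):
    """Scale the pause with how much there is to 'read' on the page."""
    base = random.uniform(2.0, 5.0)
    # Assumption: roughly one extra second per 20 KB of HTML, capped at 10 s
    reading_time = min(len(html) / 20_000, 10)
    time.sleep(base + reading_time)
```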
Strategy 4: Use a Proxy Pool
When you’re crawling at scale, even the best timing won’t save a single IP from rate limiting. You need multiple IP addresses.
Free Options
```python
import requests

url = "https://example.com/some/path"  # placeholder target

# Free proxy list (unreliable but zero cost)
proxies = {
    "http": "http://123.45.67.89:8080",
    "https": "http://123.45.67.89:8080",
}

try:
    r = requests.get(url, proxies=proxies, timeout=10)
except requests.exceptions.RequestException:
    # Try another proxy
    pass
```
Paid Options (Recommended)
For serious crawling, invest in a paid proxy service. Plans run anywhere from $10 to $100 per month, depending on the provider and the size of the IP pool:
| Service | Price | IP Pool | Best For |
|---|---|---|---|
| BrightData | From $10/mo | 72M+ | Enterprise scale |
| Smartproxy | From $50/mo | 40M+ | Social media scraping |
| Oxylabs | From $100/mo | 100M+ | E-commerce pricing |
| Proxy-Seller | From $20/mo | 200K+ | Budget-friendly |
**My advice:** Start with free proxies for testing, then move to a paid residential proxy service when you go into production.
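Whichever provider you choose, wrap the rotation in a small helper so one dead proxy doesn't kill the crawl. A minimal sketch that reuses `get_headers()` from Strategy 1 (the three-attempt limit is just a habit of mine, not a magic number):

```python
import random
import requests

def fetch_with_rotation(url, proxy_urls, max_attempts=3):
    """Try a few proxies from the pool before giving up entirely."""
    for proxy_url in random.sample(proxy_urls, min(max_attempts, len(proxy_urls))):
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            r = requests.get(url, headers=get_headers(), proxies=proxies, timeout=10)
            if r.status_code == 200:
                return r
        except requests.exceptions.RequestException:
            continue  # dead or slow proxy, move on to the next one
    return None  # every attempt failed
```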
Strategy 5: Manage Cookies Like a Browser
Websites issue cookies during your first visit and expect to see them on subsequent requests. Using a `requests.Session()` handles this automatically:
```python
session = requests.Session()

# First request — get initial cookies
session.get("https://example.com", headers=get_headers())

# Subsequent requests — cookies are sent automatically
response = session.get("https://example.com/data", headers=get_headers())
```
Strategy 6: Add Realistic Referrers
Browsers send a `Referer` header that tells the server where you came from. A crawler that goes straight to a deep page without a referrer looks suspicious:
```python
# Suspicious: no referrer
headers = get_headers()

# Better: pretend you came from Google
headers["Referer"] = "https://www.google.com/search?q=china+travel+guide"

# Even better: crawl the homepage first, then use it as a referrer
homepage = session.get("https://example.com", headers=get_headers())
headers["Referer"] = "https://example.com/"
session.get("https://example.com/deep-page", headers=headers)
```
Strategy 7: Crawl Like a Human (Schedule Matters)
Here’s a tip most tutorials don’t mention: **when** you crawl matters.
```python
import time
from datetime import datetime

# Don't crawl at 3 AM — no human browses at 3 AM
# Best times: 9 AM - 11 AM, 2 PM - 5 PM, 8 PM - 10 PM
# Worst times: Midnight - 6 AM, lunch hour (12-1 PM)

crawl_hours = range(8, 23)  # hours 8 through 22 (8 AM to ~11 PM)
current_hour = datetime.now().hour

if current_hour not in crawl_hours:
    print("Sleeping until morning...")
    hours_to_wait = (crawl_hours[0] - current_hour) % 24  # wrap past midnight
    time.sleep(3600 * hours_to_wait)
```
Building Your Crawler: A Complete Template
Here’s a starter template that combines everything we’ve discussed:
```python
import requests
import time
import random
from datetime import datetime


class RespectfulCrawler:
    def __init__(self, base_url, proxy_list=None):
        self.base_url = base_url
        self.session = requests.Session()
        self.proxies = proxy_list or []
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
            # Add 15-20 more
        ]
        self.request_count = 0
        self.max_per_minute = 15
        self.start_time = time.time()

    def _headers(self):
        headers = {
            "User-Agent": random.choice(self.user_agents),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Referer": random.choice([
                "https://www.google.com/",
                self.base_url,
                "https://www.baidu.com/s?wd=china",
            ]),
            "DNT": "1",
            "Connection": "keep-alive",
        }
        return headers

    def _delay(self):
        # Adaptive delay — increases if we've made many requests
        elapsed = time.time() - self.start_time
        rate = self.request_count / (elapsed / 60)
        if rate > self.max_per_minute:
            delay = random.uniform(5, 10)
        else:
            delay = random.uniform(2, 5)
        time.sleep(delay)

    def fetch(self, url):
        self._delay()
        proxy = None
        if self.proxies:
            proxy = {"http": random.choice(self.proxies), "https": random.choice(self.proxies)}
        try:
            r = self.session.get(url, headers=self._headers(), proxies=proxy, timeout=15)
            self.request_count += 1
            if r.status_code == 200:
                return r.text
            elif r.status_code in [403, 429]:
                print(f"⚠️ Blocked! Status: {r.status_code}. Cooling down...")
                time.sleep(60)  # Cool down for a minute
                return None
            else:
                print(f"⚠️ Unexpected status: {r.status_code}")
                return None
        except requests.exceptions.RequestException as e:
            print(f"❌ Request failed: {e}")
            return None

    def is_polite_time(self):
        hour = datetime.now().hour
        # Only crawl during reasonable hours
        return 8 <= hour <= 22
```
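And a quick usage sketch (the URLs and the proxy address are placeholders, not real endpoints):

```python
# Usage example for the template above
crawler = RespectfulCrawler(
    "https://example.com",
    proxy_list=["http://123.45.67.89:8080"],  # placeholder proxy
)

for url in ["https://example.com/page/1", "https://example.com/page/2"]:
    if not crawler.is_polite_time():
        print("Outside polite hours, stopping.")
        break
    html = crawler.fetch(url)
    if html:
        print(f"Fetched {url} ({len(html)} bytes)")
```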
How to Detect If You’ve Been Blocked
Watch for these warning signs:
| Signal | What It Means | Action |
|---|---|---|
| HTTP 403 / 429 | Temporary block | Stop, use different IP, slow down |
| CAPTCHA challenge | Bot detected | Switch to residential proxy |
| Empty data returned | You're in a honeypot | Check your session cookies |
| Delayed responses > 5s | Rate limited | Reduce frequency |
| Consistent 200 but no real data | Blacklisted silently | Change IP and UA |
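You can wire a few of these signals into a quick check after every request. This is a rough heuristic rather than a detector: the keyword list, the 5-second threshold, and the 500-byte floor are assumptions you'll want to tune per site.

```python
def looks_blocked(response) -> bool:
    """Heuristic check of a requests.Response against the signals above."""
    if response is None or response.status_code in (403, 429):
        return True
    if response.elapsed.total_seconds() > 5:  # rate limiting often shows up as lag
        return True
    text = response.text.lower()
    if "captcha" in text or "access denied" in text:
        return True
    # A 200 with a suspiciously tiny body often means a silent blacklist
    return len(text) < 500
```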
Storing Your Data: A Simple Database Schema
If you’re crawling at any scale, you’ll want a database. Here’s a minimal schema that works for most use cases:
```sql
CREATE TABLE crawled_data (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    source_name VARCHAR(100),                       -- e.g., "Trip.com Hotels"
    title VARCHAR(255),                             -- Title of the page/item
    content TEXT,                                   -- Extracted content
    url VARCHAR(500),                               -- Original URL
    price DECIMAL(10,2),                            -- Price if applicable
    rating DECIMAL(2,1),                            -- Rating if applicable
    tags VARCHAR(255),                              -- Comma-separated tags
    crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_source (source_name),
    INDEX idx_crawled_at (crawled_at)
);
```
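And a minimal insert sketch, assuming a MySQL server and the `pymysql` driver (the connection details and the sample row are placeholders):

```python
import pymysql  # assumption: MySQL backend with the pymysql driver installed

conn = pymysql.connect(host="localhost", user="crawler",
                       password="secret", database="scraping")  # placeholder credentials
with conn.cursor() as cur:
    cur.execute(
        """INSERT INTO crawled_data
               (source_name, title, content, url, price, rating, tags)
           VALUES (%s, %s, %s, %s, %s, %s, %s)""",
        ("Trip.com Hotels", "Sample Hotel", "Extracted text goes here",
         "https://example.com/hotel/1", 120.00, 4.5, "hotel,sample"),
    )
conn.commit()
conn.close()
```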
Final Thoughts
Building a crawler that doesn’t get blocked isn’t about being sneaky — it’s about being respectful. Think of it this way: you’re a guest on someone’s website. Act like one.
The best crawlers are the ones the website never notices.
Got questions about specific crawling challenges? Drop them in the comments below.
—
*This guide is based on real-world experience crawling Chinese travel and food sites. Your mileage may vary depending on the website’s specific anti-bot measures.*