How to Build a Web Crawler That Won’t Get Blocked: A Practical Guide

Ever tried to scrape a website only to get blocked after 10 requests? Learn battle-tested strategies for building crawlers that respect websites and actually work — proxy rotation, UA switching, timing tactics, and more.

The First Time I Got Blocked

I still remember the moment. I’d written a beautiful little Python script to pull restaurant data from a popular Chinese review site. Ten requests in — BAM. 403 Forbidden. Then 429. Then my IP was in the digital doghouse.

I learned the hard way that websites have gotten very good at spotting crawlers. These days, it’s not just about writing code that works — it’s about writing code that *doesn’t get caught*.

Let me walk you through the strategies I’ve picked up after months of trial and error.

Why Websites Block Crawlers

Before we talk about how to avoid getting blocked, it helps to understand why websites block crawlers in the first place:

  • **Server load**: A single aggressive crawler can generate as much traffic as hundreds or even thousands of human visitors
  • **Data theft**: Competitors scraping pricing data
  • **Bandwidth costs**: Every request costs the website money
  • **Bots vs humans**: Some content is meant for human consumption, not automated extraction

The good news? Most websites don’t mind reasonable, respectful crawling. The key word is *respectful*.

    Strategy 1: Rotate Your User-Agent

    This is the absolute bare minimum. Your User-Agent string tells the server what browser and OS you’re using. If you send the Python `requests` library’s default UA on every request, you might as well wear a t-shirt that says “I’M A BOT.”

    import random
    
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Mobile/15E148 Safari/604.1",
        "Mozilla/5.0 (Linux; Android 14; Pixel 8 Pro) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.6367.83 Mobile Safari/537.36",
    ]
    
    def get_headers():
        return {
            "User-Agent": random.choice(USER_AGENTS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
        }
    

    **Pro tip:** Maintain a pool of at least 20 different UAs covering Windows, Mac, Linux, iOS, and Android. Rotate randomly on each request.
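
    If maintaining that list by hand gets tedious, the third-party fake-useragent package can supply realistic strings for you. A quick sketch, assuming you’ve installed it with pip:

    # Alternative: let the fake-useragent package supply the strings
    # (assumption: pip install fake-useragent)
    from fake_useragent import UserAgent

    ua = UserAgent()

    headers = {
        "User-Agent": ua.random,  # a different realistic UA string on each access of .random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }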

    Strategy 2: Respect Robots.txt

    The `robots.txt` file at the root of any website tells crawlers which paths are off-limits. While it’s not legally binding in most jurisdictions, ignoring it is bad form and can get your IP permanently banned.

    import urllib.robotparser
    
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    
    if rp.can_fetch("*", "/some/path"):
        # Safe to crawl
        pass
    else:
        # Skip this path
        pass
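
    robots.txt can also tell you how fast to go: some sites declare a Crawl-delay directive, and it’s worth honoring when present. A small sketch that builds on the `rp` parser above:

    import time
    import random

    # Honor the site's Crawl-delay directive if it declares one
    delay = rp.crawl_delay("*")
    if delay is not None:
        time.sleep(delay)  # wait exactly as long as the site asks
    else:
        time.sleep(random.uniform(2.0, 8.0))  # otherwise fall back to a polite random pause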
    

    Strategy 3: Randomize Your Request Timing

    This is where most beginners get caught. They write loops like this:

    # Bad: predictable pattern
    time.sleep(5)  # Exactly 5 seconds every time
    

    Human browsing isn’t metronomically regular. We pause, scroll, get distracted, take a sip of coffee. Your crawler should mimic this:

    import time
    import random
    
    # Good: random delay between requests
    def human_delay():
        delay = random.uniform(2.0, 8.0)
        print(f"Waiting {delay:.1f} seconds...")
        time.sleep(delay)
    

    Even better: vary the delay based on page complexity. A page with lots of images might take longer to “read” than a text-heavy page.
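
    One rough way to do that is to scale the pause with the size of the response. The constants below are arbitrary assumptions, so tune them to your target site:

    import time
    import random

    def content_aware_delay(html: str):
        """Pause longer on 'heavier' pages, as a crude proxy for reading time."""
        base = random.uniform(2.0, 5.0)
        # Roughly one extra second per 50 KB of HTML, capped at 10 extra seconds
        extra = min(len(html) / 50_000, 10)
        time.sleep(base + extra)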

    Strategy 4: Use a Proxy Pool

    When you’re crawling at scale, even the best timing won’t save a single IP from rate limiting. You need multiple IP addresses.

    Free Options

    import requests

    url = "https://example.com"  # the page you want to fetch

    # Free proxy list (unreliable but zero cost)
    proxies = {
        "http": "http://123.45.67.89:8080",
        "https": "http://123.45.67.89:8080",
    }

    try:
        r = requests.get(url, proxies=proxies, timeout=10)
    except requests.exceptions.RequestException:
        # This proxy is dead or too slow; move on to another one
        pass
    

    Paid Options (Recommended)

    For serious crawling, invest in a paid proxy service. They cost anywhere from $10 to $100 per month and offer:

  • Rotating residential IPs
  • Static datacenter IPs
  • Country-specific IPs
  • 99.9% uptime

    | Service | Price | IP Pool | Best For |
    | --- | --- | --- | --- |
    | BrightData | From $10/mo | 72M+ | Enterprise scale |
    | Smartproxy | From $50/mo | 40M+ | Social media scraping |
    | Oxylabs | From $100/mo | 100M+ | E-commerce pricing |
    | Proxy-Seller | From $20/mo | 200K+ | Budget-friendly |

    **My advice:** Start with free proxies for testing, move to a paid residential proxy service when you go into production.
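
    Whichever tier you choose, the usual pattern is to retry a failed request through a few different proxies before giving up. A minimal sketch (the proxy addresses are placeholders, and `get_headers()` is the helper from Strategy 1):

    import random
    import requests

    PROXY_POOL = [
        "http://user:pass@proxy1.example.com:8000",  # placeholder addresses
        "http://user:pass@proxy2.example.com:8000",
    ]

    def fetch_via_pool(url, attempts=3):
        """Try a request through up to `attempts` randomly chosen proxies."""
        for _ in range(attempts):
            proxy = random.choice(PROXY_POOL)
            try:
                return requests.get(
                    url,
                    proxies={"http": proxy, "https": proxy},
                    headers=get_headers(),
                    timeout=10,
                )
            except requests.exceptions.RequestException:
                continue  # dead or slow proxy; rotate to another one
        return None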

    Strategy 5: Manage Cookies Like a Browser

    Websites issue cookies during your first visit and expect to see them on subsequent requests. Using a `requests.Session()` handles this automatically:

    session = requests.Session()
    
    # First request — get initial cookies
    session.get("https://example.com", headers=get_headers())
    
    # Subsequent requests — cookies are sent automatically
    response = session.get("https://example.com/data", headers=get_headers())
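
    It’s worth confirming the site actually handed you cookies before hitting deeper pages:

    # Quick sanity check: what did the site set?
    print(session.cookies.get_dict())
    # Prints something like {'session_id': '...'}; names vary from site to site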
    

    Strategy 6: Add Realistic Referrers

    Browsers send a `Referer` header that tells the server where you came from. A crawler that goes straight to a deep page without a referrer looks suspicious:

    # Suspicious: no referrer
    headers = get_headers()
    
    # Better: pretend you came from Google
    headers["Referer"] = "https://www.google.com/search?q=china+travel+guide"
    
    # Even better: crawl the homepage first, then use it as a referrer
    homepage = session.get("https://example.com", headers=get_headers())
    headers["Referer"] = "https://example.com/"
    session.get("https://example.com/deep-page", headers=headers)
    

    Strategy 7: Crawl Like a Human (Schedule Matters)

    Here’s a tip most tutorials don’t mention: **when** you crawl matters.

    # Don't crawl at 3 AM — no human browses at 3 AM
    # Best times: 9 AM - 11 AM, 2 PM - 5 PM, 8 PM - 10 PM
    # Worst times: Midnight - 6 AM, lunch hour (12-1 PM)
    
    import time
    from datetime import datetime

    crawl_hours = range(8, 23)  # 8 AM to 11 PM
    current_hour = datetime.now().hour

    if current_hour not in crawl_hours:
        print("Sleeping until morning...")
        # Wrap around midnight so the sleep is never negative (e.g. at 11 PM or 2 AM)
        hours_until_start = (crawl_hours[0] - current_hour) % 24
        time.sleep(3600 * hours_until_start)
    

    Building Your Crawler: A Complete Template

    Here’s a starter template that combines everything we’ve discussed:

    import requests
    import time
    import random
    from datetime import datetime
    
    class RespectfulCrawler:
        def __init__(self, base_url, proxy_list=None):
            self.base_url = base_url
            self.session = requests.Session()
            self.proxies = proxy_list or []
            self.user_agents = [
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
                # Add 15-20 more
            ]
            self.request_count = 0
            self.max_per_minute = 15
            self.start_time = time.time()
    
        def _headers(self):
            headers = {
                "User-Agent": random.choice(self.user_agents),
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.5",
                "Accept-Encoding": "gzip, deflate",
                "Referer": random.choice([
                    "https://www.google.com/",
                    self.base_url,
                    "https://www.baidu.com/s?wd=china",
                ]),
                "DNT": "1",
                "Connection": "keep-alive",
            }
            return headers
    
        def _delay(self):
            # Adaptive delay — increases if we've made many requests
            elapsed = max(time.time() - self.start_time, 1e-6)  # avoid division by zero
            rate = self.request_count / (elapsed / 60)  # requests per minute so far
    
            if rate > self.max_per_minute:
                delay = random.uniform(5, 10)
            else:
                delay = random.uniform(2, 5)
    
            time.sleep(delay)
    
        def fetch(self, url):
            self._delay()
    
            proxy = None
            if self.proxies:
                chosen = random.choice(self.proxies)
                proxy = {"http": chosen, "https": chosen}  # use the same proxy for both schemes
    
            try:
                r = self.session.get(url, headers=self._headers(), proxies=proxy, timeout=15)
                self.request_count += 1
    
                if r.status_code == 200:
                    return r.text
                elif r.status_code in [403, 429]:
                    print(f"⚠️ Blocked! Status: {r.status_code}. Cooling down...")
                    time.sleep(60)  # Cool down for a minute
                    return None
                else:
                    print(f"⚠️ Unexpected status: {r.status_code}")
                    return None
    
            except requests.exceptions.RequestException as e:
                print(f"❌ Request failed: {e}")
                return None
    
        def is_polite_time(self):
            hour = datetime.now().hour
            # Only crawl during reasonable hours
            return 8 <= hour <= 22
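
    A quick usage sketch (the URLs are placeholders):

    if __name__ == "__main__":
        crawler = RespectfulCrawler("https://example.com")

        if crawler.is_polite_time():
            html = crawler.fetch("https://example.com/some-page")
            if html:
                print(f"Got {len(html)} characters of HTML")
        else:
            print("Outside polite hours, try again later")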
    

    How to Detect If You’ve Been Blocked

    Watch for these warning signs:

    | Signal | What It Means | Action |
    | --- | --- | --- |
    | HTTP 403 / 429 | Temporary block | Stop, use a different IP, slow down |
    | CAPTCHA challenge | Bot detected | Switch to a residential proxy |
    | Empty data returned | You're in a honeypot | Check your session cookies |
    | Delayed responses > 5s | Rate limited | Reduce frequency |
    | Consistent 200 but no real data | Blacklisted silently | Change IP and UA |
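
    You can automate the most obvious of these checks. A rough sketch; the CAPTCHA markers and the 500-character threshold are assumptions you should tune per site:

    CAPTCHA_MARKERS = ["captcha", "verify you are human", "unusual traffic"]  # assumed markers

    def looks_blocked(response) -> bool:
        """Heuristic check for the warning signs in the table above."""
        if response.status_code in (403, 429):
            return True
        body = response.text.lower()
        if any(marker in body for marker in CAPTCHA_MARKERS):
            return True
        # A 200 with an implausibly tiny body is often a silent block
        if response.status_code == 200 and len(body) < 500:
            return True
        return False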

    Storing Your Data: A Simple Database Schema

    If you’re crawling at any scale, you’ll want a database. Here’s a minimal schema that works for most use cases:

    CREATE TABLE crawled_data (
        id BIGINT AUTO_INCREMENT PRIMARY KEY,
        source_name VARCHAR(100),           -- e.g., "Trip.com Hotels"
        title VARCHAR(255),                 -- Title of the page/item
        content TEXT,                        -- Extracted content
        url VARCHAR(500),                   -- Original URL
        price DECIMAL(10,2),                -- Price if applicable
        rating DECIMAL(2,1),                -- Rating if applicable
        tags VARCHAR(255),                  -- Comma-separated tags
        crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        INDEX idx_source (source_name),
        INDEX idx_crawled_at (crawled_at)
    );
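
    Writing rows into that table from Python could look something like this; a sketch assuming the PyMySQL driver and placeholder credentials:

    import pymysql  # assumption: pip install pymysql

    conn = pymysql.connect(
        host="localhost", user="crawler", password="secret",  # placeholder credentials
        database="crawl_db", charset="utf8mb4",
    )

    def save_item(item: dict):
        """Insert one crawled record into the crawled_data table."""
        sql = """
            INSERT INTO crawled_data (source_name, title, content, url, price, rating, tags)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
        """
        with conn.cursor() as cursor:
            cursor.execute(sql, (
                item.get("source_name"), item.get("title"), item.get("content"),
                item.get("url"), item.get("price"), item.get("rating"), item.get("tags"),
            ))
        conn.commit()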
    

    Final Thoughts

    Building a crawler that doesn’t get blocked isn’t about being sneaky — it’s about being respectful. Think of it this way: you’re a guest on someone’s website. Act like one.

  • Don’t take more than you need
  • Don’t come back too often
  • Leave when asked
  • Say thank you (by adding value, not just taking data)

    The best crawlers are the ones the website never notices.

    Got questions about specific crawling challenges? Drop them in the comments below.

    *This guide is based on real-world experience crawling Chinese travel and food sites. Your mileage may vary depending on the website’s specific anti-bot measures.*
