Web scraping guide for beginners using the Scrapy framework

Sometimes we need to scrape websites to collect information. For Python programmers, Scrapy is one of the best tools for the job: it is a Python framework for crawling websites and extracting data. We can think of it as a bot that visits web pages, reads their content, and saves the data we tell it to save. This guide walks through a basic Scrapy project.

Install Scrapy and create a project

As a first step, we will create a virtual environment, install Scrapy in it, and create a Scrapy project.

mkdir scrapy_project # create project dir
cd scrapy_project # go to that dir
python3 -m venv .venv # create environment
source .venv/bin/activate # activate environment (on Windows: .venv\Scripts\activate)
pip install scrapy

Create the project:

scrapy startproject myproject # creates a myproject/ directory; add a dot (.) at the end to create the project in the current directory instead
cd myproject

This creates a folder structure like this:

myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

The spiders/ folder is where we add our Python scripts for scraping. Each of these Python files defines a spider.

Write the spider

This example is taken from the official Scrapy documentation; visit the official site to learn more. Create a file called quotes_spider.py inside spiders/.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Code explanation:

name: unique name for the spider. We use this name to run the spider.
start_urls: list of URLs that Scrapy visits first.
parse: the default callback, called with the response for each start URL and, in this spider, for every followed page. This is the method that extracts data.
response.css(...): selects elements using CSS selectors.
yield {...}: hands one scraped item (one row of data) to Scrapy.
response.follow(next_page, self.parse): follows the "Next" link to scrape the next page.
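The href that the "Next" link returns is relative (for example /page/2/), and it is resolved against the URL of the current page before the request is made. The sketch below shows that resolution step using only the standard library. The helper name resolve_next_page is hypothetical, and the page URLs are illustrative.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(page_url, href)


# A relative link found on the first page
print(resolve_next_page("https://quotes.toscrape.com/", "/page/2/"))
# A relative link found on the second page
print(resolve_next_page("https://quotes.toscrape.com/page/2/", "/page/3/"))
```

This is why the spider can pass next_page straight to response.follow without building the full URL by hand.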

Run the spider

scrapy crawl quotes

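You will see the extracted data printed in the terminal.

To save the data to a file, pass -o with a file name. Note that -o appends to an existing file, which produces invalid JSON on a second run; use -O instead to overwrite the file.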

scrapy crawl quotes -o quotes.json # json file
# OR
scrapy crawl quotes -o quotes.csv # csv file
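Once the data is exported, we can load it back with plain Python. The sketch below counts quotes per author from a JSON export. It assumes the file is a single JSON array of objects with "text" and "author" keys, as the spider above yields. The function name count_quotes_by_author is hypothetical, and the sample rows are made-up placeholders written to a temporary file so the example runs without a real crawl.

```python
import json
import os
import tempfile
from collections import Counter


def count_quotes_by_author(path):
    """Load an exported JSON array of items and count quotes per author."""
    with open(path, encoding="utf-8") as f:
        rows = json.load(f)  # the .json export is one JSON array
    return Counter(row["author"] for row in rows)


# Placeholder rows standing in for a real quotes.json (not real scraped data)
sample = [
    {"text": "first placeholder quote", "author": "Author A"},
    {"text": "second placeholder quote", "author": "Author A"},
    {"text": "third placeholder quote", "author": "Author B"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quotes.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample, f)
    counts = count_quotes_by_author(path)

print(counts.most_common(1))
```

For a real run, we would call count_quotes_by_author("quotes.json") on the exported file instead of the temporary one.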

Handle more complex pages

Sometimes we need to go deeper, for example by following a link from each item and scraping its detail page.

For example, we may want to fetch author details from the site:

First, we go to the quote list page and collect the author links.
Second, we visit these links and collect the author information.

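# Both methods belong inside a spider class, such as QuotesSpider above.
def parse(self, response):
    for quote in response.css("div.quote"):
        # The "(about)" link is the <a> right after the author name.
        author_url = quote.css("small.author + a::attr(href)").get()
        if author_url:  # .get() returns None when nothing matches
            yield response.follow(author_url, self.parse_author)

def parse_author(self, response):
    # Called with each author page; yields one item per author.
    yield {
        "name": response.css("h3.author-title::text").get(),
        "born": response.css("span.author-born-date::text").get(),
    }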

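This visits each author's page and extracts the author's name and birth date. Scrapy filters duplicate requests by default, so an author who appears on several quotes is fetched only once.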
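A selector's .get() returns None when nothing matches, and extracted text often carries stray whitespace. One possible way to normalise values before yielding them is a small helper like the one below. The name clean is hypothetical and not part of Scrapy, and the input strings are illustrative.

```python
def clean(value, default=""):
    """Return whitespace-normalised text, or `default` when value is None."""
    if value is None:
        return default
    # split() with no arguments drops leading/trailing whitespace
    # and collapses internal runs of spaces and newlines
    return " ".join(value.split())


print(clean("  March 14, 1879\n"))
print(clean(None, default="unknown"))
```

Inside parse_author, this could wrap each extracted value, for example clean(response.css("h3.author-title::text").get()).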

Useful Settings

Open settings.py; this is where we can make changes such as these:

# wait between requests
DOWNLOAD_DELAY = 1

# Respect robots.txt (set to False only if you know what you're doing)
ROBOTSTXT_OBEY = True

# Max simultaneous requests across all sites
CONCURRENT_REQUESTS = 16 # increase number for faster scraping

# Pretend to be a real browser
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
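To get a feel for what DOWNLOAD_DELAY costs, here is a rough back-of-the-envelope sketch. It assumes all requests go to one site and that Scrapy's default randomisation is on, which makes each wait fall between 0.5 and 1.5 times DOWNLOAD_DELAY. It ignores network and processing time, so real crawls take longer. The function name estimate_crawl_seconds is hypothetical.

```python
def estimate_crawl_seconds(pages, download_delay):
    """Rough (low, high) bounds on total waiting time for `pages` requests to one site."""
    gaps = max(pages - 1, 0)  # there is no wait before the first request
    low = 0.5 * download_delay * gaps
    high = 1.5 * download_delay * gaps
    return (low, high)


# Ten pages with DOWNLOAD_DELAY = 1
print(estimate_crawl_seconds(10, 1))
```

The estimate only covers the delay itself; it is meant for comparing settings, not for predicting an exact run time.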

Extra Tips

Some useful tips that I haven't mentioned above:

Use scrapy shell "https://example.com" to test CSS selectors before writing your spider.
Use XPath if CSS selectors aren't enough; it is more powerful for complex HTML.
If a site uses JavaScript to load content, Scrapy alone won't work; we need an additional package such as scrapy-playwright or Selenium.

Almost every website has a Terms of Service. Most of them include a line like:

"You may not scrape, crawl, or use automated tools to access this site."

Violating Terms of Service is not automatically a crime, but it can get your account banned, your IP blocked, or even result in a civil lawsuit. So be careful when scraping content from other people's sites.


Updated: Aug. 2, 2026
