Web scraping guide for beginners using the Scrapy framework
Sometimes we need to scrape websites to collect information. For Python programmers, Scrapy is a popular tool for the job: it is a Python framework for crawling websites and extracting data. We can think of it as a bot that visits web pages, reads the content, and saves the data we tell it to save. This guide walks through a first Scrapy project.
Install Scrapy and create a project
As a first step, we will create a virtual environment, install Scrapy in it, and create a Scrapy project.
mkdir scrapy_project # create project dir
cd scrapy_project # go to that dir
python3 -m venv .venv # create environment
source .venv/bin/activate # activate the environment (on Windows: .venv\Scripts\activate)
pip install scrapy
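Create the project:
scrapy startproject myproject # creates a myproject/ folder inside the current directory
# to create the project files directly in the current directory instead, add a trailing dot: scrapy startproject myproject .
cd myproject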
This creates a folder structure like this:
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
The spiders/ folder is where we add our Python files for scraping. Each file there defines a spider: a class that tells Scrapy which pages to visit and what to extract from them.
Write the spider
This example is taken from the official Scrapy documentation; visit the official site to learn more. Create a file called quotes_spider.py inside spiders/.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
Code explanation:
- name: a unique name for the spider. We use this name to run the spider.
- start_urls: the list of URLs that Scrapy visits first.
- parse: the method called for every downloaded page. This is where the data is extracted.
- response.css(...): selects elements using CSS selectors.
- yield {...}: sends one row of data to Scrapy.
- response.follow(next_page, self.parse): follows the "Next" link to scrape the next page.
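The href taken from the "Next" link is a relative path rather than a full URL, and response.follow resolves it against the URL of the page being parsed before requesting it. As a rough standard-library illustration of that resolution step (this is not Scrapy's own code, and both URLs are assumed example values):

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (assumed example value).
current_page = "https://quotes.toscrape.com/page/1/"

# Relative href as it might appear in the "Next" link (assumed example value).
next_href = "/page/2/"

# Resolve the relative link against the current page URL.
absolute_url = urljoin(current_page, next_href)
print(absolute_url)
```

This is why the spider can pass next_page straight to response.follow without building the full URL by hand.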
Run the spider
scrapy crawl quotes
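You will see the extracted data printed in the terminal, mixed in with Scrapy's log output.
To save the data to a file, pass -o with a file name; Scrapy picks the format from the extension. Note that -o appends to an existing file, which makes a JSON file invalid on a second run; use -O instead to overwrite it.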
scrapy crawl quotes -o quotes.json # json file
# OR
scrapy crawl quotes -o quotes.csv # csv file
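Once the export exists, it can be loaded with the standard library. The sketch below assumes the JSON export is a list of objects with the text and author keys yielded by the spider. The three records are placeholders, not real scraped output, and they are held in a string so the snippet runs without the file; quotes_per_author is a hypothetical helper.

```python
import json
from collections import Counter

# Placeholder data in the shape of quotes.json (not real scraped output).
sample_export = """
[
  {"text": "First placeholder quote", "author": "Author A"},
  {"text": "Second placeholder quote", "author": "Author B"},
  {"text": "Third placeholder quote", "author": "Author A"}
]
"""


def quotes_per_author(records):
    """Count how many quotes each author has in the exported records."""
    return Counter(record["author"] for record in records)


records = json.loads(sample_export)
counts = quotes_per_author(records)

# Print authors from most to least quoted.
for author, count in counts.most_common():
    print(f"{author}: {count}")
```

To work on the real export, open quotes.json with encoding="utf-8" and pass the file object to json.load instead of calling json.loads on the sample string.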
Handle more complex pages
Sometimes we need to go deeper, like clicking into each item and scraping a detail page.
For example, we may want to fetch author details from a website:
- First, we go to the quote list page and collect the author links.
- Second, we visit those links and collect the author information.
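import scrapy


class AuthorSpider(scrapy.Spider):
    name = "authors"  # run with: scrapy crawl authors
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # The "(about)" link sits right after the author name in each quote.
        for quote in response.css("div.quote"):
            author_url = quote.css("small.author + a::attr(href)").get()
            yield response.follow(author_url, self.parse_author)

    def parse_author(self, response):
        # Called once for each author detail page.
        yield {
            "name": response.css("h3.author-title::text").get(),
            "born": response.css("span.author-born-date::text").get(),
        }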
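This visits each author's page and pulls their name and birth date. Scrapy filters out duplicate requests by default, so an author with several quotes is still fetched only once.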
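The scraped values are plain strings, so a date usually needs parsing before it can be sorted or compared. A small sketch, assuming the born field looks like "March 14, 1879" (an assumed format, so check the real output first) and that month names are in English; parse_born_date is a hypothetical helper, not part of Scrapy:

```python
from datetime import date, datetime
from typing import Optional


def parse_born_date(raw: Optional[str]) -> Optional[date]:
    """Turn a string like 'March 14, 1879' into a date, or None if it does not match."""
    if not raw:
        # The selector found nothing, so there is nothing to parse.
        return None
    try:
        return datetime.strptime(raw.strip(), "%B %d, %Y").date()
    except ValueError:
        # Unexpected format: return None and let the caller decide what to do.
        return None


print(parse_born_date("March 14, 1879"))
print(parse_born_date(None))
```

Returning None rather than raising keeps one oddly formatted page from stopping the whole crawl.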
Useful Settings
Open settings.py; this is where we can change settings such as these:
# Seconds to wait between requests to the same site
DOWNLOAD_DELAY = 1
# Respect robots.txt (set to False only if you know what you're doing)
ROBOTSTXT_OBEY = True
# Max simultaneous requests across all sites
CONCURRENT_REQUESTS = 16 # increase for faster scraping (and more load on the target site)
# Pretend to be a real browser
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
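To get a feel for what ROBOTSTXT_OBEY = True implies, the standard library's urllib.robotparser can evaluate robots.txt rules against a URL. This is only an illustration, not how Scrapy implements the check; the rules, the URLs, and the crawler name below are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical crawler name and URLs.
allowed = parser.can_fetch("my-crawler", "https://example.com/quotes/")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
print(allowed, blocked)
```

With the setting enabled, Scrapy performs a comparable check on its own and skips requests that the site's robots.txt disallows.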
Extra Tips
Some useful tips that I haven't mentioned above:
- Use scrapy shell "https://example.com" to test CSS selectors before writing your spider.
- Use XPath if CSS selectors aren't working; it is more powerful for complex HTML.
- If a site uses JavaScript to load content, Scrapy alone won't work; we need another package such as scrapy-playwright or selenium.
Almost every website has Terms of Service, and many of them include a line like:
"You may not scrape, crawl, or use automated tools to access this site."
Violating Terms of Service is not automatically a crime, but it can get your account banned, your IP blocked, or even result in a civil lawsuit. So be careful when scraping content from other people's sites.