Scrapy: Setting up our first spider

We are going to set up and launch our very first spider (a bot) that will crawl quotes.toscrape.com and grab quotes from famous people, so grab yourself your favourite drink (hot or cold) and let’s have fun with our first spider.

Now that we know what Scrapy is (right?), the first step is to create an isolated environment so all the packages are installed independently. If you use Conda, run:

conda install -c conda-forge scrapy

If you are using a pip environment, install it with:

pip install Scrapy

It will take a few seconds to install all the dependencies that Scrapy uses, but after that, you’re good to go. Remember that if you run into a problem, check the docs for additional information.
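By the way, if you haven’t created the isolated environment yet, the usual commands look something like this (scrapy_env is just a placeholder name; adjust it to taste):

# If you use Conda, create and activate an environment first, then run the install above
conda create -n scrapy_env python=3
conda activate scrapy_env

# With plain pip, the standard library’s venv module does the same job
python -m venv scrapy_env
source scrapy_env/bin/activate    # on Windows: scrapy_env\Scripts\activate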

Now, go into the directory where you want to keep the project and run:

scrapy startproject quotes_toscrape

This will create a Scrapy project named ‘quotes_toscrape’. Let’s check its contents.


quotes_toscrape/              # Main project directory
    scrapy.cfg                # Configuration file for deploying Scrapy
    quotes_toscrape/
        __init__.py
        items.py              # We define the model of the items we scrape here
        middlewares.py        # We define the spider and downloader middlewares here
        pipelines.py          # We define what to do after we scrape an item here
        settings.py           # Project settings
        spiders/              # Where our spiders will be stored
            __init__.py
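Just as a reference, items.py is where you would declare a model for the data you scrape, something like this hypothetical QuoteItem (in this lesson we will yield plain dictionaries instead, so we won’t actually need it):

import scrapy


class QuoteItem(scrapy.Item):
    # Each Field declares one attribute of the scraped item
    quote = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()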

It seems a little overwhelming at first, but we don’t need to touch those files right now. Let’s create the spider:

scrapy genspider main_spider quotes.toscrape.com

This action creates a spider called “main_spider” that points to quotes.toscrape.com. Check spiders/main_spider.py now:

# -*- coding: utf-8 -*-
import scrapy


class MainSpiderSpider(scrapy.Spider):
    name = 'main_spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass

Let’s dissect what’s there:

  • name: Identifies the spider. It has to be unique within the project.
  • allowed_domains: Requests to URLs outside allowed_domains won’t be followed.
  • start_urls: Where our spider will start scraping.
  • def parse: The default callback. Scrapy calls this method to process the response of each request, and it has to return (or yield) Requests, items or dictionaries, as shown in the sketch below.
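
For example, a parse method can yield a dictionary with scraped data directly, or a new Request to follow a link; this is just a tiny sketch, not part of the generated file:

def parse(self, response):
    # Yield scraped data as a plain dictionary...
    yield {'title': response.xpath('//title/text()').extract_first()}
    # ...or yield another Request, whose response will be handled by the given callback
    yield scrapy.Request('http://quotes.toscrape.com/page/2/', callback=self.parse)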

Right now, our spider will go to “quotes.toscrape.com”, download all the HTML and do nothing else. We want to get every quote and its author from quotes.toscrape.com, but first, we need to ‘study’ the website:

Every square (a ‘div’ in HTML terms) contains a quote, its author, and a series of tags. We want that. We want to tell our spider “For every div, fetch the quote, its author and a list of every tag”. But how can we tell the spider “These are the divs where I want you to look for the information”?

We need to know what makes the divs we want different from the others. Let’s right-click the website and open the inspector. You’ll see something like this:

You can see that there are some divs (red) we are not interested in, but the ones we want (green) have a class called ‘quote’. Inside them, we will find our quote (class="text"), the author (class="author") and the tags (class="tags"). Let’s write the code that makes it possible:


def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')

    for quote in quotes:
        text = quote.xpath('./span[@class="text"]/text()').extract_first()
        author = quote.xpath('.//small[@class="author"]/text()').extract_first()
        tags = quote.xpath('.//div[@class="tags"]/a/text()').extract()

        yield {
            'Quote': text,
            'Author': author,
            'Tags': tags
        }

Scrapy loads the start URL and stores in quotes every div element it finds with the class quote. Then, for every element (quote) in quotes, it looks for a span with the class text, takes its text and extracts the first match. It does the same for the author, looking for a small element with the class author, and for the tags, looking at every a element inside the div with the tags class. Since there can be many tags, we extract all of them, not just the first.

(Scrapy uses .xpath to navigate through HTML elements; alternatively, you can use .css. The syntax is logical and easy to read, but to create your own spiders you’ll need to learn it on the side. Don’t worry, it’s easy.)
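
To give you an idea, here is roughly what the same parse method could look like using CSS selectors and the newer .get()/.getall() shortcuts; it’s just a sketch, and we’ll stick with the XPath version for the rest of this lesson:

def parse(self, response):
    # Same logic as before, but selecting elements by CSS class instead of XPath
    for quote in response.css('div.quote'):
        yield {
            'Quote': quote.css('span.text::text').get(),
            'Author': quote.css('small.author::text').get(),
            'Tags': quote.css('div.tags a::text').getall(),
        }

A handy way to practise either syntax is scrapy shell 'http://quotes.toscrape.com', which drops you into an interactive session where you can run response.xpath(...) or response.css(...) and see what comes back before touching the spider.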

After getting everything we want from one div, we yield (return) what we found and move on to the next one. Run:

scrapy crawl main_spider

Whoa. That’s a lot of information. The important thing is that there is no ‘log_count/ERROR’ in the output, which means everything went fine, and that we are getting the information we want. We don’t want to copy-paste every line there or screenshot the terminal (unless you’re a bit weird), so let’s store it in a .json file:

scrapy crawl main_spider -o my_quotes.json

You can also change the .json extension to .xml or .csv, depending on what you need.
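
If you’d rather not type the -o flag every time, recent Scrapy versions also let you configure the export once in settings.py through the FEEDS setting (a sketch, assuming a reasonably up-to-date Scrapy):

# In settings.py: write the scraped items to my_quotes.json on every crawl
FEEDS = {
    'my_quotes.json': {'format': 'json'},
}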

As you can see, to build a scraper you just need to create a project, create a spider with a starting URL, and then tell the spider what information you want from the page. Seems easy, right? You can get all the information you want from a website! Maybe some of you noticed that there is a “Next” button on the bottom right of the website.

Can we create a spider that, after fetching everything, goes to the next page and repeats the process over and over until it reaches the end?

Of course, and I’ll explain how in the next lesson.

GitHub code: https://github.com/david1707/our-first-spider/tree/basic_spider