We are going to set up and launch our very first spider (a bot) that will crawl quotes.toscrape.com and grab quotes from famous people, so grab your favourite drink (hot or cold) and let’s have fun with our first spider.
As we know what Scrapy is (right?), the first step is to create an isolated environment so we can install all the packages independently.
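For example (just a quick sketch; the environment name scrapy_env is a placeholder), you could create and activate one with Conda or with the standard library’s venv:

# With Conda
conda create -n scrapy_env python=3
conda activate scrapy_env

# Or with venv
python -m venv venv
source venv/bin/activate

With the environment active, it’s time to install Scrapy. If you use Conda, run: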
conda install -c conda-forge scrapy
If you are using a pip environment, install it with:
pip install Scrapy
It will take a few seconds to install all the dependencies that Scrapy uses, but after that, you’re good to go. Remember that if you run into a problem, check the docs for additional information.
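Once it finishes, you can check that everything is wired up by asking Scrapy for its version (the exact number you see will depend on when you install it):

scrapy version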
Now, go to the directory where you want to keep the project, then run:
scrapy startproject quotes_toscrape
This will create a Scrapy project named ‘quotes_toscrape’. Let’s check its contents.
quotes_toscrape/              # Main project directory
    scrapy.cfg                # Configuration file for deploying Scrapy
    quotes_toscrape/          # The project’s Python module
        __init__.py
        items.py              # We define the model of the items we scrape here
        middlewares.py        # We define our spider and downloader middlewares here
        pipelines.py          # We define what to do after we scrape an item here
        settings.py           # Project settings
        spiders/              # Where our spiders will be stored
            __init__.py
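Just to give you an idea, the generated settings.py ships with sensible defaults along these lines (the exact contents vary between Scrapy versions):

BOT_NAME = 'quotes_toscrape'

SPIDER_MODULES = ['quotes_toscrape.spiders']
NEWSPIDER_MODULE = 'quotes_toscrape.spiders'

# Be a good citizen and obey robots.txt rules
ROBOTSTXT_OBEY = True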
Seems a little overwhelming initially, but we don’t need to touch those files right now. Let’s create the spider:
scrapy genspider main_spider quotes.toscrape.com
This action creates a spider called “main_spider” that points to quotes.toscrape.com. Check spiders/main_spider.py now:
# -*- coding: utf-8 -*-
import scrapy


class MainSpiderSpider(scrapy.Spider):
    name = 'main_spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
Let’s dissect what’s there:
- name: Identifies the spider. It has to be unique within the project.
- allowed_domains: Requests outside allowed_domains won’t be followed.
- start_urls: Where our spider will start scraping.
- def parse: The default callback. Scrapy calls this method to process the downloaded response and return the scraped data. It has to return (or yield) Requests, Items or dictionaries; see the quick example below.
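For instance (just a throwaway illustration, not part of the spider we are building), you could replace pass with a parse that yields the page title as a dictionary:

    def parse(self, response):
        # Grab the <title> text of the page and hand it back to Scrapy as an item
        yield {'title': response.xpath('//title/text()').extract_first()}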
As generated, with parse left as pass, our spider will just go to quotes.toscrape.com, download the HTML and do nothing else. We want to get every quote and its author from quotes.toscrape.com, but first, we need to ‘study’ the website:
Every square (a ‘div’ in HTML terms) contains a quote, its author, and a series of tags. We want that. We want to tell our spider: “For every div, fetch the quote, its author and the list of its tags”. But how can we tell the spider which divs it should look into for that information?
We need to know what makes the divs we want different from the others. Right-click the website and open the inspector. You’ll see something like this:
You can see that there are some divs (red) we are not interested in, but the ones we want (green) have a class called ‘quote’. Inside them, we will find our quote (class="text"), the author (class="author") and the tags (class="tags"). Let’s write the code that extracts them:
    def parse(self, response):
        # Every quote on the page lives in a div with the class "quote"
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            text = quote.xpath('./span[@class="text"]/text()').extract_first()
            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            tags = quote.xpath('.//div[@class="tags"]/a/text()').extract()
            yield {
                'Quote': text,
                'Author': author,
                'Tags': tags
            }
Scrapy loads the start URL and stores in quotes every div element it finds with the class quote. Then, for every element (quote) in quotes, it looks for a span with the class text and extracts the first text it finds. It does the same for the author, looking for a small element with the class author, and for the tags, looking at every link inside the div with the tags class. Note that, since a quote can have many tags, we extract all of them with extract(), not just the first one.
(Scrapy uses .xpath to navigate through HTML elements; alternatively, you can use .css. The syntax is logical and easy to read, but to create your own spiders you’ll need to learn it on the side. Don’t worry, it’s easy.)
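A handy way to experiment with selectors before putting them into the spider is the interactive Scrapy shell. Something like this (the exact output will depend on your Scrapy version) lets you try both flavours against the live page:

scrapy shell 'http://quotes.toscrape.com'
>>> response.xpath('//div[@class="quote"]/span[@class="text"]/text()').extract_first()
>>> response.css('div.quote span.text::text').extract_first()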
After getting everything we want from one div, we yield (return) what we found, then move on to the next one. Run:
scrapy crawl main_spider
Whoa. That’s a lot of information. The important thing is that we don’t get any ‘log_count/ERROR’ entry, so everything went fine, and that we are getting the information we want. We don’t want to copy-paste every line there or screenshot the terminal (unless you’re a bit weird), so let’s store it in a .json file:
scrapy crawl main_spider -o my_quotes.json
You can also change the .json extension to .xml or .csv, depending on what you need.
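To give you an idea of the result, the file will contain one object per quote, using the keys we chose in the yield. The values below are just placeholders; the real file holds the actual quotes from the site:

[
    {"Quote": "“First quote text…”", "Author": "First Author", "Tags": ["tag-one", "tag-two"]},
    {"Quote": "“Second quote text…”", "Author": "Second Author", "Tags": ["tag-three"]}
]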
As you can see, to create a spider you just need to create a project, generate a spider with a starting URL, then tell the spider what information you want from it. Seems easy, right? You can get all the information you want from a website! Maybe some of you noticed that there is a “Next” button on the bottom right of the website.
Can we create a spider that, after fetching everything, goes to the next page and repeats the process over and over until it reaches the end?
Of course, and I’ll explain how in the next lesson.
Github code: https://github.com/david1707/our-first-spider/tree/basic_spider