In the last post we used the website's pagination, visiting all the pages it had to offer. But most of the time not all the information is displayed on the listing page; to get the rest, you have to visit each item's details section.
So far we have learned how to get information from elements and how to visit new pages. How can we apply that to scrape the details of every item on a website?
If you have been following this tutorial, you should have the baseline spider we need. If you haven't (you should!), clone the ‘next_page_spider’ branch hosted on my GitHub. Once you have it, check the code. What is it doing right now?
The parse function gets every div with the ‘quote’ class, and for each one we extract the text, the author, and the tags. After that, it checks for a next page and keeps going until it can't find any more.
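As a refresher, the baseline parse looks roughly like this (a sketch; the exact code lives in the ‘next_page_spider’ branch, and the class name and start URL here are assumptions based on the previous post):

import scrapy


class QuotesSpider(scrapy.Spider):
    # The spider name matches the crawl command used later in this post;
    # the class name and start URL are assumptions based on the previous post.
    name = 'main_spider'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Every quote container on the page
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('.//span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract(),
            }

        # Follow the 'Next' button until there are no more pages
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)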
What we want to do is pretty similar: visit each quote, somehow get into its details section, and then move on to the next page. But how do we get into the details?
Each quote has a link (an ‘a’ tag). If we click the link, it opens the details page of that quote's author:
Let's recap: we get all the quotes, for each one we get the author's link, we fetch the information, and we proceed to the next one.
As we don't want the quotes but the authors, remove the quote extraction from inside the parse function. Everything else stays:
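After the clean-up, the spider is reduced to something like this (same class and start URL as before; only the pagination logic is left inside parse):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'main_spider'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # The quote extraction is gone; only the 'next page' handling remains
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)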
Right now the spider just iterates over the ‘next’ pages, so let's get every quote again. You already know this:
quotes = response.xpath('//div[@class="quote"]')
Now, for each quote, we want the link to the author's page. As we saw in the first image of this post, the link is stored in an ‘a’ tag, in its href attribute. It is a relative link, so we need to join it with the base URL of the website; instead of concatenating strings by hand, we use Scrapy's urljoin method, which handles that for us:
for quote in quotes:
    author_partial_link = quote.xpath('.//span/a/@href').extract_first()
    author_link = response.urljoin(author_partial_link)
We can't use the parse method itself to process the author page, so we need to send the request to a new method:
    yield scrapy.Request(author_link, callback=self.parse_author)
This says: “for the author link I've passed you, apply the instructions in the parse_author function”. Now we just need to create a parse_author method with the instructions that will fetch the information. Let's study a random author page:
It looks like we can easily get the author's name, birth date and location, and the description. As we did before, let's use .xpath to select each piece of data by its class name, extract it, and then yield everything:
def parse_author(self, response):
    name = response.xpath('//h3[@class="author-title"]/text()').extract_first().strip()
    born_date = response.xpath('//span[@class="author-born-date"]/text()').extract_first().strip()
    # The location text starts with 'in ', so slice off the first three characters
    born_location = response.xpath('//span[@class="author-born-location"]/text()').extract_first().strip()[3:]
    description = response.xpath('//div[@class="author-description"]/text()').extract_first().strip()

    yield {
        'name': name,
        'born_date': born_date,
        'born_location': born_location,
        'description': description,
    }
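Putting the pieces together, the whole parse function now looks something like this (a sketch; the next-page handling is assumed to be the same as in the previous post):

def parse(self, response):
    # Every quote block on the current page
    quotes = response.xpath('//div[@class="quote"]')
    for quote in quotes:
        # Relative link to the author's detail page
        author_partial_link = quote.xpath('.//span/a/@href').extract_first()
        author_link = response.urljoin(author_partial_link)
        # Hand the author page over to parse_author
        yield scrapy.Request(author_link, callback=self.parse_author)

    # Keep following the 'Next' button until there are no more pages
    next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)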
This should be working now. In the parse function we get each author's link and call the parse_author function, which yields the author's name, birth date and location, and the description. After that, we go to the next page and repeat the process. Let's test it by running the spider and saving the data to a .json file:
scrapy crawl main_spider -o details.json
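As an aside, Scrapy's feed exporter picks the output format from the file extension, so the same command with a different extension works too:

scrapy crawl main_spider -o details.csv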
You should now have a details.json file with 52 lines, containing 50 authors! (If by any chance it isn't working for you, compare your code against the repository.)
Some authors appear more than once (10 pages with 10 quotes each should yield 100 author links), but Scrapy takes care of that by skipping the duplicated requests, so each author is only scraped once. Cool, right?
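Under the hood this is Scrapy's duplicate request filter: a URL that has already been requested is silently skipped. If you ever need to re-visit a URL on purpose, you can opt out per request with the dont_filter argument:

# dont_filter=True tells Scrapy to follow the URL even if it was requested before
yield scrapy.Request(author_link, callback=self.parse_author, dont_filter=True)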
In the next lesson, we will learn about the Scrapy shell tool and how it can help us in our scraping process.
GitHub code: https://github.com/david1707/our-first-spider/tree/details_spider