Scrapy: Visiting ‘next’ pages

Last time we created our spider and scraped everything from the first page. But what happens when a website has more than one page? Let’s learn how to send the bot to the next page until it reaches the end.

Our parse method (the first method Scrapy runs) looked like this:


def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')

    for quote in quotes:
        text = quote.xpath('./span[@class="text"]/text()').extract_first()
        author = quote.xpath('.//small[@class="author"]/text()').extract_first()
        tags = quote.xpath('.//div[@class="tags"]/a/text()').extract()

        yield {
            'Quote': text,
            'Author': author,
            'Tags': tags
        }

We selected every div with the ‘quote’ class, and in a for loop we iterated over each one, yielding back the quote, author and tags.

Now we have to tell the bot: “If you run out of quotes, go to the next page”. We set that functionality right after the loop ends. We check if there is a ‘next’ element, then get its ‘href’ (link) attribute. The one on this website is a bit tricky, as it holds a relative route (not the full route) instead of an absolute one (from the ‘http…’ to the end), so we have to work around that. Let’s see the code:

next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
if next_page_url:
    next_page_absolute_url = response.urljoin(next_page_url)
    yield scrapy.Request(next_page_absolute_url, self.parse)

That’s all we need! Let me dissect the code:

In line 1, we reach for a ‘li’ HTML tag with the class ‘next’, we get its ‘a’ tag (the link), and we get the ‘href’ where the route is stored. Notice the @ before ‘href’: normally we go down the HTML structure with a slash, but when we want an attribute of a tag, we type @ plus the attribute name. We only want the first (and only) one of the elements Scrapy can find, so we write ‘.extract_first()’ to get it as a string. If we wanted more than one (like when we got the tags), we would type ‘.extract()’. Remember: .extract() returns a list, .extract_first() a string.
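To make the difference concrete, here is a quick sketch you could try in a Scrapy shell session (scrapy shell http://quotes.toscrape.com); the outputs shown are illustrative:

# .extract() returns a list with every match, one string per element
response.xpath('//div[@class="tags"]/a/text()').extract()
# e.g. ['change', 'deep-thoughts', 'thinking', 'world', ...]

# .extract_first() returns only the first match, as a plain string
# (or None when nothing matches, which is what our 'if' relies on)
response.xpath('//li[@class="next"]/a/@href').extract_first()
# e.g. '/page/2/'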

Line 2 checks that next_page_url has a value. If there is a next page, the indented statements run.

Line 3 is very important to understand. When we run Scrapy, it requests a URL and the server responds with the HTML code. response.urljoin(next_page_url) joins the URL of the current response with next_page_url. It is equivalent to ‘http://quotes.toscrape.com’ + ‘/page/2/’.
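Under the hood this follows the standard URL-joining rules, so a quick sketch with Python’s urllib.parse.urljoin (which behaves the same way for our purposes) shows what happens:

from urllib.parse import urljoin

# A relative route is resolved against the current page's URL
urljoin('http://quotes.toscrape.com/page/1/', '/page/2/')
# -> 'http://quotes.toscrape.com/page/2/'

# An absolute URL is passed through untouched, so the call is safe
# even on sites whose 'next' link already contains the full route
urljoin('http://quotes.toscrape.com/page/1/', 'http://example.com/page/3/')
# -> 'http://example.com/page/3/'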

Line 4 yields a new Request for the next page URL; Scrapy will fetch it and pass the new response to the parse method (the second argument, our callback).

This closes the circle: getting a URL, getting the desired data, getting a new URL, and so on until no next page is found.
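For reference, here is how the whole spider fits together now. This is a minimal sketch: the spider name matches the crawl command below, but the class name and start_urls are my assumptions based on the previous lesson.

import scrapy


class QuotesSpider(scrapy.Spider):
    # Name used with 'scrapy crawl main_spider'
    name = 'main_spider'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Scrape every quote on the current page
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'Quote': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'Author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'Tags': quote.xpath('.//div[@class="tags"]/a/text()').extract(),
            }

        # Then follow the 'next' link, if any, and parse that page too
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), self.parse)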

(Screenshot: the ‘next’ button contains a link element where the next page URL is stored.)

Let’s run the spider again to see how we improved the functionality:

scrapy crawl main_spider -o next_page_quotes.json

Now instead of 10 quotes, we have 100 of them! As you can see, once the base spider is in place, it’s pretty easy to add functionality: just 4 lines were enough to multiply its power. Now we can fetch all the information we can see. Or can we?

I want you to do a small exercise. Think about an online shop, such as Amazon, eBay, etc. Not all the information is displayed in the search list, only a summary of every item, so we are missing information we need.

Ideally, we would enter each item’s link, scrape all the information, then move to the next item, and once we are done with the page, follow through to the next page, repeating the process.
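As a rough preview, that pattern tends to look like the sketch below; every selector here, and the ‘item’/‘price’ structure, is a hypothetical placeholder rather than a real shop’s markup:

def parse(self, response):
    # Follow every item's link to its detail page (selectors hypothetical)
    for item_url in response.xpath('//div[@class="item"]/a/@href').extract():
        yield scrapy.Request(response.urljoin(item_url), self.parse_item)

    # Once every item on this page is queued, move to the next page
    next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page_url:
        yield scrapy.Request(response.urljoin(next_page_url), self.parse)

def parse_item(self, response):
    # Scrape the full details that only the item page contains
    yield {
        'Name': response.xpath('//h1/text()').extract_first(),
        'Price': response.xpath('//span[@class="price"]/text()').extract_first(),
    }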

That is exactly what we will do in the next lesson, so ideally you’ll check it out right now.

GitHub code: https://github.com/david1707/our-first-spider/tree/next_page_spider