In our last lesson, How to go to the next page, we scraped the whole website up to the last book. But today, we are going to learn a tool that is going to make our Web Scraping tasks even easier.
We are talking about the CrawlSpider.
In this post you will learn:
- How to use the new spider: CrawlSpider
- What Rules and LinkExtractor are
- How to scrape the whole website without effort
Are you ready?
Table of contents
- Our game plan
- The new spider: CrawlSpider
- Rules and LinkExtractor
- Filtering the URLs
- Conclusion
- Exercise
Our game plan
Every task that we have done until now has helped us with one of two things: getting the needed URLs or extracting the information.
We have extracted the partial URLs, manipulated them, and added them to the base URL to create the absolute URL. It worked, but it was too much. Well, maybe not too much, as it was only a few lines of code, but we can make it simpler.
Way simpler.
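As a side note, the manual prefixing described above is essentially relative URL resolution. Here is a minimal sketch of that idea using only the standard library's urllib.parse.urljoin. It is not part of the lesson's spider, and absolute_url is a hypothetical helper name chosen for illustration:

```python
from urllib.parse import urljoin


def absolute_url(page_url, partial_url):
    """Resolve a partial (relative) href against the page it was found on."""
    return urljoin(page_url, partial_url)


# A "next" link found on the home page already contains 'catalogue/'.
from_home = absolute_url('http://books.toscrape.com/index.html',
                         'catalogue/page-2.html')

# The same kind of link found inside the catalogue does not.
from_catalogue = absolute_url('http://books.toscrape.com/catalogue/page-2.html',
                              'page-3.html')

print(from_home)
print(from_catalogue)
```

Both calls end up inside catalogue/ without any IF, because the resolution depends on the page where the link was found.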
Here, again, we are going to use two parts of the code: one to get the URLs, and one to extract the information.
As the extraction part keeps the same structure, we don’t need to modify it. We are going to improve the way we get the URLs.
We are going to make it so much simpler you won’t believe it.
I’m talking about the new spider: CrawlSpider.
The new spider: CrawlSpider
We pick it up from the last lesson. This is our current spider:
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    base_url = 'http://books.toscrape.com/'

    def parse(self, response):
        # Every book on a listing page lives in an <article class="product_pod">
        all_books = response.xpath('//article[@class="product_pod"]')

        for book in all_books:
            book_url = book.xpath('.//h3/a/@href').extract_first()

            if 'catalogue/' not in book_url:
                book_url = 'catalogue/' + book_url

            book_url = self.base_url + book_url

            yield scrapy.Request(book_url, callback=self.parse_book)

        # Follow the "next" button, if there is one
        next_page_partial_url = response.xpath(
            '//li[@class="next"]/a/@href').extract_first()

        if next_page_partial_url:
            if 'catalogue/' not in next_page_partial_url:
                next_page_partial_url = "catalogue/" + next_page_partial_url

            next_page_url = self.base_url + next_page_partial_url
            yield scrapy.Request(next_page_url, callback=self.parse)

    def parse_book(self, response):
        title = response.xpath('//div/h1/text()').extract_first()

        relative_image = response.xpath(
            '//div[@class="item active"]/img/@src').extract_first().replace('../..', '')
        final_image = self.base_url + relative_image

        price = response.xpath(
            '//div[contains(@class, "product_main")]/p[@class="price_color"]/text()').extract_first()
        stock = response.xpath(
            '//div[contains(@class, "product_main")]/p[contains(@class, "instock")]/text()').extract()[1].strip()
        stars = response.xpath(
            '//div/p[contains(@class, "star-rating")]/@class').extract_first().replace('star-rating ', '')
        description = response.xpath(
            '//div[@id="product_description"]/following-sibling::p/text()').extract_first()
        upc = response.xpath(
            '//table[@class="table table-striped"]/tr[1]/td/text()').extract_first()
        price_excl_tax = response.xpath(
            '//table[@class="table table-striped"]/tr[3]/td/text()').extract_first()
        price_inc_tax = response.xpath(
            '//table[@class="table table-striped"]/tr[4]/td/text()').extract_first()
        tax = response.xpath(
            '//table[@class="table table-striped"]/tr[5]/td/text()').extract_first()

        yield {
            'Title': title,
            'Image': final_image,
            'Price': price,
            'Stock': stock,
            'Stars': stars,
            'Description': description,
            'Upc': upc,
            'Price excl tax': price_excl_tax,
            'Price incl tax': price_inc_tax,
            'Tax': tax,
        }
Wow… the parse method is too messy… I am sorry! Delete it, please.
No, I’m not kidding. Remove the whole function.
Remember that we are going to simplify the extraction of the URLs? Remove that goddamn big parse function now.
Check the main SpiderSpider class. It inherits from scrapy.Spider. We don’t want that spider, it is too stupid! So, we should use CrawlSpider instead:
from scrapy.spiders import CrawlSpider


class SpiderSpider(CrawlSpider):
Way better!
But… remember that the regular Spider sends every response to the parse method by default? Well, not this one.
Here, instead of writing a parse method, we instruct the spider to do what we want. (CrawlSpider uses parse internally to apply its rules, so we must not define our own.) But to do so, we need to set ground rules, right?
Rules and LinkExtractor
Besides having the same attributes as the regular Spider, the CrawlSpider has a new attribute: rules.
rules is a list of one or more Rule objects, where each Rule defines one type of behaviour for crawling the site.
Also, we are going to use LinkExtractor: An object which defines how links will be extracted from each crawled page.
Rules set the behaviour of how it is going to crawl the site and LinkExtractor how links are going to be extracted. But it is best if we see how it works, right? Let’s import the Rule and LinkExtractor, and then define the rules:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    base_url = 'http://books.toscrape.com/'
    rules = [Rule(LinkExtractor(allow='catalogue/'),
                  callback='parse_filter_book', follow=True)]
We import the resources and create one Rule. In this rule, we set how links are going to be extracted, from where, and what to do with them.
First, we set allow='catalogue/'. The allow value is a regular expression that is matched against each absolute URL, so if a URL does not have 'catalogue/' in it, we won’t even process it. Way better than the IFs we used before, right?
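To get a feeling for what the allow filter does, here is a rough stand-in that uses only Python's re module rather than Scrapy itself. It assumes the pattern is searched anywhere in the absolute URL; is_allowed is a hypothetical helper name, not a Scrapy function:

```python
import re

# Same pattern as the Rule's allow argument.
ALLOW = re.compile('catalogue/')


def is_allowed(url):
    """Return True when the pattern is found anywhere in the absolute URL."""
    return ALLOW.search(url) is not None


urls = [
    'http://books.toscrape.com/index.html',
    'http://books.toscrape.com/catalogue/page-2.html',
    'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
]

for url in urls:
    print(is_allowed(url), url)
```

Notice that both the pagination URL and the book URL pass the filter, so the pattern alone does not tell books apart from other catalogue pages.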
We also have a callback. A callback in programming is a function that gets called once the current process is done. In this case, it means “after downloading a page from a valid URL, call the parse_filter_book method”.
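In the Rule, the callback is given as a string: the name of a method on the spider. A toy sketch of that idea in plain Python follows. ToySpider and run_callback are hypothetical names made up for this example and are not Scrapy classes or functions:

```python
class ToySpider:
    """Hypothetical stand-in for a spider; not a Scrapy class."""

    def parse_filter_book(self, url):
        return 'parsed ' + url


def run_callback(spider, callback_name, url):
    # Look the method up by its name, then call it with the finished result.
    method = getattr(spider, callback_name)
    return method(url)


result = run_callback(
    ToySpider(),
    'parse_filter_book',
    'http://books.toscrape.com/catalogue/sharp-objects_997/index.html',
)
print(result)
```

This is also why a typo in the callback name only shows up when the spider runs: the string is just a name until something looks it up.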
And follow just specifies whether links should be followed from each response. As we set it to True, the spider keeps extracting links from every page it reaches, so we get any nested URLs. The whole website.
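The effect of follow can be pictured with a toy model. The sketch below is not how Scrapy is implemented; it walks a made-up, three-page site stored in a dictionary (SITE and crawl are hypothetical names) and records which pages would reach the callback:

```python
from collections import deque

# Hypothetical miniature site: each page maps to the links found on it.
SITE = {
    '/': ['/catalogue/page-1.html'],
    '/catalogue/page-1.html': ['/catalogue/book-a/index.html',
                               '/catalogue/page-2.html'],
    '/catalogue/page-2.html': ['/catalogue/book-b/index.html',
                               '/catalogue/page-1.html'],
}


def crawl(site, start, follow):
    """Return the pages handed to the callback, in visiting order."""
    seen = {start}
    queue = deque([(start, True)])  # (page, extract links from it?)
    handled = []
    while queue:
        page, extract_links = queue.popleft()
        if page != start:
            handled.append(page)  # this is where the callback would run
        if not extract_links:
            continue
        for link in site.get(page, []):
            if link not in seen:  # skip pages already scheduled
                seen.add(link)
                queue.append((link, follow))
    return handled


with_follow = crawl(SITE, '/', follow=True)
without_follow = crawl(SITE, '/', follow=False)
print(with_follow)
print(without_follow)
```

With follow=True the toy crawler reaches both books; with follow=False it stops after the links found on the start page.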
Now, change parse_book to parse_filter_book and run the code!
Oh… we got an error:
AttributeError: 'NoneType' object has no attribute 'replace'
Of course: we are following every URL on the site! Not only the books but also the pagination (page-1.html, page-2.html, etc.) and every other URL the spider finds. Those pages have no book image, so extract_first() returns None, and calling replace on None fails.
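The error can be reproduced without Scrapy. In this small sketch, first_or_none is a hypothetical helper that mimics extract_first() (first match, or None when nothing matched), and the image path is made up:

```python
def first_or_none(matches):
    """Mimic extract_first(): first match, or None when nothing matched."""
    return matches[0] if matches else None


book_page_matches = ['../../media/cache/example.jpg']  # made-up path
listing_page_matches = []                               # no book image here

# On a book page there is a string, so replace works.
print(first_or_none(book_page_matches).replace('../..', ''))

# On any other page there is nothing to call replace on.
message = None
try:
    first_or_none(listing_page_matches).replace('../..', '')
except AttributeError as error:
    message = str(error)
    print(message)
```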
We should use the parse_filter_book method only if the page is a valid book URL!
Filtering the URLs
Inside the parse_filter_book we are going to perform a small check: If the URL is a book URL, extract the data. If not, do nothing.
But how do we know whether a URL belongs to a book or to something else?
Well, let’s check one: Open http://books.toscrape.com/catalogue/sharp-objects_997/index.html and a non-book URL, for example http://books.toscrape.com/index.html
Now we need to look for an element on book pages that isn’t on the non-book pages. For example, I have noticed that book pages have a div with the product_gallery id.
We can use this to separate the book URLs from the non-book URLs!
Modify your code like this:
    def parse_filter_book(self, response):
        # Only book pages contain a div with the product_gallery id
        exists = response.xpath('//div[@id="product_gallery"]').extract_first()

        if exists:
            title = response.xpath('//div/h1/text()').extract_first()

            relative_image = response.xpath(
                '//div[@class="item active"]/img/@src').extract_first()
            final_image = self.base_url + relative_image.replace('../..', '')

            price = response.xpath(
                '//div[contains(@class, "product_main")]/p[@class="price_color"]/text()').extract_first()
            stock = response.xpath(
                '//div[contains(@class, "product_main")]/p[contains(@class, "instock")]/text()').extract()[1].strip()
            stars = response.xpath(
                '//div/p[contains(@class, "star-rating")]/@class').extract_first().replace('star-rating ', '')
            description = response.xpath(
                '//div[@id="product_description"]/following-sibling::p/text()').extract_first()
            upc = response.xpath(
                '//table[@class="table table-striped"]/tr[1]/td/text()').extract_first()
            price_excl_tax = response.xpath(
                '//table[@class="table table-striped"]/tr[3]/td/text()').extract_first()
            price_inc_tax = response.xpath(
                '//table[@class="table table-striped"]/tr[4]/td/text()').extract_first()
            tax = response.xpath(
                '//table[@class="table table-striped"]/tr[5]/td/text()').extract_first()

            yield {
                'Title': title,
                'Image': final_image,
                'Price': price,
                'Stock': stock,
                'Stars': stars,
                'Description': description,
                'Upc': upc,
                'Price excl tax': price_excl_tax,
                'Price incl tax': price_inc_tax,
                'Tax': tax,
            }
        else:
            # Not a book page: nothing to extract, just log the URL
            print(response.url)
The key is the first two lines: we try to get the div with the ‘product_gallery’ id. If it exists, we parse the page. If not, we just print its URL.
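The same kind of check can be tried outside Scrapy. The sketch below uses the standard library's html.parser as a stand-in for response.xpath; GalleryFinder and is_book_page are hypothetical names, and the two HTML snippets are simplified, made-up pages rather than the real site's markup:

```python
from html.parser import HTMLParser


class GalleryFinder(HTMLParser):
    """Sets found=True if any <div id="product_gallery"> start tag is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and dict(attrs).get('id') == 'product_gallery':
            self.found = True


def is_book_page(html):
    finder = GalleryFinder()
    finder.feed(html)
    return finder.found


# Simplified, made-up pages for illustration only.
book_html = '<html><body><div id="product_gallery" class="carousel"></div></body></html>'
listing_html = '<html><body><ol class="row"></ol></body></html>'

print(is_book_page(book_html))
print(is_book_page(listing_html))
```

Checking for an element that only exists on book pages keeps the decision tied to the page content instead of the shape of the URL.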
Let’s run the code now…
And our thousand books are there! Great!
Conclusion
Today we have learnt:
- How a crawler works
- How to set Rules and LinkExtractor
- How to extract every URL in the website
- That we have to filter the URLs received to extract the data from the book URLs and not from every URL
This was not another step in your Web Scraping learning, this was a great leap.
Using CrawlSpiders helps you to simplify your code a lot, as you saw in this lesson.
This was an easy example, but what if instead of books, we have books, musical instruments, food, etc as in Amazon and e
You now have a basic understanding of Scrapy. Now we need to go deeper. In the next lesson, we will learn about items.
But before that…
Exercise
Right now, you know how to get the URLs needed using the Spider and the CrawlSpider, how to extract data using XPath and how to yield the information to a file.
Now it’s time to work on your own! Look for an easy website to scrape and try to scrape it by yourself.
You can use help such as looking at past lessons, searching Google, looking into the Scrapy documentation, etc. But you need to do it by yourself.
After that, leave a comment here with the website and your code, so everybody can see how you managed to do it on your own and how proud you are!