In our last lesson, we created our first Scrapy spider and scraped a few fields from each book. But we also learnt that every item has a URL with more detailed data. Let’s see how to extract all that data in different ways.
In this post you will learn how to:
- Scrape items on their own pages
- Build full routes from relative URLs
- Select elements by tag, class, partial class and sibling elements
- Extract information from tables
- Use callbacks to other Scrapy class methods
Table of contents
- Our current spider
- Using Scrapy to get to the detailed book URL
- Extracting time – Different ways to pull data
- Conclusion
Our current spider
In our last lesson, our spider was able to extract the title, price, image URL and book URL. Here’s the code as a reminder:
```python
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')
        for book in all_books:
            title = book.xpath('.//h3/a/@title').extract_first()
            price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
            image_url = self.start_urls[0] + book.xpath('.//img[@class="thumbnail"]/@src').extract_first()
            book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()
            yield {
                'title': title,
                'price': price,
                'Image URL': image_url,
                'Book URL': book_url,
            }
```
If you don’t know how to create a Scrapy project and spider, please go to the first lesson: Creating your first spider
This spider is going to be our starting point, but instead of extracting the title, price, image and book URL, we are going to extract only the book URL, and then parse each book from that URL instead of the one in start_urls.
Using Scrapy to get to the detailed book URL
Take the whole spider and remove everything related to the title, image and price. Remove the yield too. This should be your spider now:
```python
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')
        for book in all_books:
            book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()
```
Right now we are getting all the books and extracting their URLs. Now, for each book, we are going to use a new method. The parse method is called automatically when the spider starts, but we can also create our own methods.
As we have the book URL, we can create another request, that is, another petition to the server. But instead of the base URL, books.toscrape.com, we are going to use the book’s URL. Add this to your script:
```python
        # Old code
        for book in all_books:
            book_url = self.start_urls[0] + \
                book.xpath('.//h3/a/@href').extract_first()

            # New code
            yield scrapy.Request(book_url, callback=self.parse_book)

    def parse_book(self, response):
        print(response.status)
```
We use the Scrapy Request method to ask the server for a new page: the HTML stored at book_url. The callback, the method that runs once we get the response, is a new one: parse_book.
Run the code and you will get a bunch of 200s, the status code of success:
Extracting time – Different ways to pull data
As we did in the parse method, we are going to extract the data, but this time from each book’s own URL. Open a random book, for example, Sharp Objects.
We are going to use this one as a model and every book will be scraped the same way.
We have a lot to choose from! Why don’t we start with the title?
Extracting data – The easy ones
Right-click on the title, select Inspect and look at where it is located. It’s just the only h1 tag after a div. Pretty easy. Let’s find an h1 after a div and extract the text. Then we store it in a variable:
```python
    def parse_book(self, response):
        title = response.xpath('//div/h1/text()').extract_first()
        print(title)
```
Let’s run the code and print the title:
Easy, right?
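A quick aside: in recent Scrapy versions, .get() is a shorter alias for .extract_first() (and .getall() for .extract()), so either style works. For example:

```python
# Equivalent to the .extract_first() call above.
title = response.xpath('//div/h1/text()').get()
```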
Before, we just had the main URL and looped over the articles to extract the data.
Now we have the main URL, we loop over the articles to extract each book’s URL, then we request that URL and extract the data there. One additional step in another method. This is all it takes.
Let’s keep going. Locate the image, right-click it and inspect it. Seems like we have a partial URL again!
Luckily, you learnt a lot in our first lesson and you know how to create the final URL by taking the partial URL and adding the base URL. Why don’t you give it a try?
It doesn’t matter if you don’t succeed on the first try. Get the URL, add the base URL and print the result until you get it right.
This is how I did it:
```python
class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    # New 'base_url' variable
    base_url = 'http://books.toscrape.com'

    # ...

    def parse_book(self, response):
        title = response.xpath('//div/h1/text()').extract_first()
        relative_image = response.xpath(
            '//div[@class="item active"]/img/@src').extract_first()
        final_image = self.base_url + relative_image.replace('../..', '')
```
As always, print final_image to see that you have a proper URL. You know the drill.
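By the way, if stripping ‘../..’ by hand feels fragile, Scrapy has a built-in that resolves relative URLs against the current page. This is an alternative of mine, not part of the original lesson:

```python
# response.urljoin resolves relative paths ('../../media/...') against
# the URL of the page being parsed, so no manual .replace() is needed.
final_image = response.urljoin(relative_image)
```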
Let’s get the price.
The ‘contains’ selector
Right-click the price, inspect it and you can see that it is inside a p tag with a price_color class.
The problem is that every item in the ‘Products you recently viewed’ section at the bottom has that too!
It is not enough to search for a p tag with the price_color class inside a div; that div also needs to have the product_main class!
But that is just one part of the class:
We can use a selector that matches an element whose class contains a given string. Instead of using the whole class, “col-sm-6 product_main”, we only search for product_main.
Here’s the code:
```python
price = response.xpath(
    '//div[contains(@class, "product_main")]/p[@class="price_color"]/text()').extract_first()
```
We look for a div whose class contains product_main, then we get the text inside the p with the price_color class.
Print the price and run the code again to check it is working.
Now, your turn: scrape the stock (the text that says ‘In stock (X available)’). Use the technique you have just seen and do it yourself.
Here’s my solution:
```python
stock = response.xpath(
    '//div[contains(@class, "product_main")]/p[contains(@class, "instock")]/text()').extract()[1].strip()
```
This time the XPath matches two text nodes, so I extract the one I want by its index and remove the surrounding whitespace with Python’s .strip().
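If you are wondering why we take index 1: the availability p tag also wraps an icon, so text() matches more than one text node and the first one is just whitespace. Here is a standalone sketch with made-up values to illustrate:

```python
# Illustrative only: .extract() returns every matching text node.
# The first is whitespace around the icon; the second holds the message.
texts = ['\n    \n    ', '\n    In stock (20 available)\n']  # example values
stock = texts[1].strip()
print(stock)  # In stock (20 available)
```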
Let’s extract the ratings. Right-click on the stars and we have this:
Every star has an icon-star class, but if you look at the enclosing element, you can see that all the stars are wrapped in a p tag with the class star-rating Four. Four is the rating.
Try to extract it: just get the p whose class contains star-rating and grab that class attribute. Then remove the extra text we don’t need.
Here’s my code:
```python
stars = response.xpath(
    '//div/p[contains(@class, "star-rating")]/@class').extract_first().replace('star-rating ', '')
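```

If you later want the rating as a number instead of a word, a small mapping does the job. This dict is my own addition, not something the page provides:

```python
# Hypothetical helper: convert the extracted word ('Four') to an int.
rating_words = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
stars_number = rating_words.get(stars)  # e.g. 'Four' -> 4
```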
Family matters – Siblings
The description is a tricky one:
The p tag has no class! How can we select it?
Well, we can’t… But we can select the previous element, the div with id="product_description", and then select the next HTML node, its sibling:
```python
description = response.xpath(
    '//div[@id="product_description"]/following-sibling::p/text()').extract_first()
```
We select the div with the id product_description, then we move to its following p sibling, and we select and extract its text. Phew!
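If the following-sibling axis is new to you, here is a tiny standalone sketch you can run outside the spider, using a made-up HTML fragment (not the real page):

```python
from scrapy import Selector

# A made-up fragment mimicking the page: a div followed by a p sibling.
html = '''
<div id="product_description"><h2>Product Description</h2></div>
<p>A gripping story...</p>
'''
sel = Selector(text=html)
print(sel.xpath(
    '//div[@id="product_description"]/following-sibling::p/text()'
).extract_first())
# Prints: A gripping story...
```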
Tables
As if you didn’t have enough with contains and siblings, now we have tables!
Don’t you worry, I have you covered.
We need to select the table, then the row (tr) at a given position, and then the cell (td) holding the value. After the selection, we get the text as usual. Let me do the first one, the UPC:
```python
upc = response.xpath(
    '//table[@class="table table-striped"]/tr[1]/td/text()').extract_first()
```
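A note in passing: selecting rows by position works, but you can also match a row by its header text, which keeps working if rows are reordered. This variation is mine, not from the lesson:

```python
# Alternative: pick the row whose <th> reads "UPC" instead of tr[1].
upc = response.xpath(
    '//table[@class="table table-striped"]/tr[th="UPC"]/td/text()'
).extract_first()
```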
Print it and run the spider. This is how we extract data from tables. Now it’s your turn:
Extract the price excluding tax, the price including tax and the tax itself, then yield the result as we did in the first spider.
Do it yourself and don’t look here unless needed.
```python
price_excl_tax = response.xpath(
    '//table[@class="table table-striped"]/tr[3]/td/text()').extract_first()
price_inc_tax = response.xpath(
    '//table[@class="table table-striped"]/tr[4]/td/text()').extract_first()
tax = response.xpath(
    '//table[@class="table table-striped"]/tr[5]/td/text()').extract_first()

yield {
    'Title': title,
    'Image': final_image,
    'Price': price,
    'Stock': stock,
    'Stars': stars,
    'Description': description,
    'Upc': upc,
    'Price excl tax': price_excl_tax,
    'Price incl tax': price_inc_tax,
    'Tax': tax,
}
```
And that’s it! Run the spider, but this time store the output in a file:
```
scrapy crawl spider -o books_detailed.json
```
Open the new file and make sure everything is in order.
Conclusion
Congratulations! You managed to improve your spider!
Now you know how to get elements the normal way, by attributes such as class or id, by partial attributes, by sibling elements, from tables, etc., and you can extract all the details from all the books!
Well, at least, from all the books on the first page…
Don’t you worry: you will learn how to get to the rest in the third lesson of this tutorial: How to get to the next page