Scrapy: How to use the shell

After learning how to get the details of an item in the last post, we are done with the basics. Now it's time for you to learn how to do it on your own, with a website of your choosing.

Today we will learn how to extract information from any website, and how XPath, the selection tool, works.

Let's say that you're interested in audiobooks and you want to scrape all the information from Audiobooks on eBay. So you open your code editor, create a new project and a spider, and then… you don't know where to go from there. How do you select every item? How do you select the title? What about the image?

Fortunately, Scrapy comes with a tool in the form of a shell that helps you figure out how to select the information. Let's jump into it by running:

scrapy shell 'https://www.ebay.com/b/Audiobooks/29792/bn_317579'
You should get something like this:
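After a burst of log lines, the shell lists the objects it has created for you. The exact output depends on your Scrapy version, but it should be along these lines:

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   request    <GET https://www.ebay.com/b/Audiobooks/29792/bn_317579>
[s]   response   <200 https://www.ebay.com/b/Audiobooks/29792/bn_317579>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects
[s]   view(response)  View response in a browser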

What happened? We sent a request to get the website. The response is stored in a 'response' object, and now we can use it and its methods. Let's check that the URL we requested is correct:

response.url
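The shell is just a Python console, so evaluating the attribute prints its value. You should get back the exact URL we passed in:

>>> response.url
'https://www.ebay.com/b/Audiobooks/29792/bn_317579'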

Did we get it, or did we run into any problem?

response.status
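If everything went well, this returns 200, the HTTP status code for a successful request:

>>> response.status
200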

Everything seems to be in order. Now open the URL in your browser. What do you see?

Each audiobook has a block with its information, and we want the link of each one. The counter at the top left of the page says there are 48 audiobooks, so if we manage to select 48 'blocks' of something, we will be good. Now open the inspector by right-clicking inside one of the audiobooks and choosing the Inspect option. Paying a little attention while reading the code, it seems each audiobook sits inside an unordered list ('ul'), and each one is a list item ('li'). By clicking on a list item, you'll see it highlighted on the page.

Each 'li' tag contains an audiobook, so the link to the item's detail page should be inside. Click on the title to see if it is there:

The link is there, in an 'a' tag with the class 's-item__link'. To get all the elements with a certain tag in a document, we write response.xpath('//TAG'). Remember that we saw before that each page holds 48 items. Let's check it in the shell:
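Here is what that looks like, selecting every 'a' tag in the document and counting them (the exact number will vary as eBay updates the page):

>>> links = response.xpath('//a')
>>> len(links)
548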

We saved all the 'a' tags in 'links' and checked its length. As we grabbed every 'a' tag without discriminating, we got 548 links, far too many. Now we want only the 'a' tags with the 's-item__link' class, and to specify the class of a tag we write response.xpath('//TAG[@class="CLASS_NAME"]'). Let's go to the shell again:
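Filtering by the class name brings the count down to exactly the number of audiobooks on the page:

>>> links = response.xpath('//a[@class="s-item__link"]')
>>> len(links)
48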

Nice! We have 48 items, so we are on the right path. We know that //a selects every 'a' tag in the document, and that we can filter by any attribute of the tag with //a[@class="CLASS_NAME"] or //div[@id="description"]. But how do we extract attributes? In this case we want the 'href', so let's get it:
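In the shell, that returns a list of selectors whose data is the link itself, roughly like this (URLs truncated here):

>>> response.xpath('//a[@class="s-item__link"]/@href')
[<Selector xpath='//a[@class="s-item__link"]/@href' data='https://www.ebay.com/itm/...'>,
 ...]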

By adding /@href we are saying that we want the link address itself, instead of filtering by any specific @href value. Now we have a list of Scrapy selector objects, but we want the plain @href string of each one. We do that by… using .extract():
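A sketch of the result, again with the URLs truncated:

>>> links = response.xpath('//a[@class="s-item__link"]/@href').extract()
>>> len(links)
48
>>> links[0]
'https://www.ebay.com/itm/...'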

Now we have the list with all the links, and just that: links. This is going to be your bread and butter: you filter with XPath by tag and/or by attributes of that tag, then use extract().

As we have all the item links, let's check what we want to extract from each item's detail page. We still have the main page loaded, so let's fetch the detail link of the first item:
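The shell's fetch() shortcut downloads a new URL and replaces the 'response' object with the new page:

>>> fetch(links[0])
>>> response.url
'https://www.ebay.com/itm/...'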

Now that we have the first item's detail page loaded, let's play around with it. Open it in the browser (VS Code lets me open it with Ctrl + click; your editor may do the same. If not, just print(links[0]) and copy the URL):

We want the title, the price, and the condition. Right-click on the title and look at the code. You just need to select an 'h1' tag with a certain id and get its text:
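A minimal sketch, assuming the title's id is 'itemTitle', which is what eBay used when this was written; check the inspector for the current value:

>>> response.xpath('//h1[@id="itemTitle"]/text()').extract_first()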

We use the XPath text() function to get the text. We could use extract() here too, but that would return a list. Instead, as there is only one title, we use extract_first().

Now we want the price and the condition. And you know what? You're ready to get them! Give it a try now; you already have the knowledge to do it. Spend 30 seconds, ten minutes, whatever it takes, to get the price and the condition. Just check the structure with the inspector in your browser and try to reach them from the terminal.

If you tried but didn't manage to do it, that's OK. Failing and learning from it is part of the process. This is how I did it:
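This is a sketch under an assumption about eBay's markup at the time: the price lived in a span with the id 'prcIsum' and the condition in a div with the id 'vi-itm-cond'. If the ids have changed, the inspector will show you the current ones:

>>> response.xpath('//span[@id="prcIsum"]/text()').extract_first()
>>> response.xpath('//div[@id="vi-itm-cond"]/text()').extract_first()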

And now we are done!

See how easy it is: we right-click or search for the item we want in the inspector, check its tag and any attributes it might have (id, class, etc.), and then get what we want from it (an href inside the tag, its text, etc.). Rinse and repeat.

Now, let's step back a bit. Imagine you tried to write a spider without checking anything in the shell. You would run the spider for seconds or even minutes, see that you missed some piece of the code, fix it, run it again, and so on. I know because that's what I did when I was starting out. I ran into multiple errors that took minutes to fix that way; with the shell, I can fix them in seconds.

XPath is the most basic, important thing in scraping. It is how we select the nodes in the website's code to get the information. I've guided you through the eBay website, but now you should try it on your own. Open your favorite website, fetch it with the Scrapy shell and try to get the title, some text, an image, whatever (I'll even throw you this link to help you). Play around and have fun; that way you'll learn a lot. And if you reach a point you cannot get past, google the error or ask for help (or ask ME for help!).

And get ready, because in the next session we are going to upload our code to ScrapingHub!