Beautiful Soup – 01 – Your first Web Scraping script with Python

Today we will learn how to scrape a music web store using a Python library called Beautiful Soup. With simple, easy-to-read code, we are going to extract the data of all the albums from our favourite music bands and store it in a .csv file.

It is simple, it is easy and, even better, it is efficient. And it is a lot of fun!

If you prefer it, here’s the video version.
Table of contents
Introduction
Getting ready
Importing libraries
Fetching the URL
Selecting elements from the URL
Getting our first album
Getting all the albums
Storing the albums in a file
Extra points!
Conclusion

Introduction

If you know what Python, Beautiful Soup and web scraping is, skip to the next lesson: How to get the next page with Beautiful Soup

If you don’t, let me give you a brief jump-start with a short, easy explanation:

  • Python: An easy-to-learn programming language. It is one of the most used programming languages because it is so easy to pick up; it reads almost like English.
  • Beautiful Soup: A library (a set of pre-written code) that gives us methods to extract data from websites via web scraping.
  • Web Scraping: A technique to extract data from websites.

With that in mind, we are going to install Beautiful Soup and scrape a website, Best CD Price, to fetch the data and store it in a .csv file. Let’s go!


Getting ready

If you have used Python before, open your favourite IDE and create a new environment in the project’s folder.

If you have never used Python before and what I said sounds strange, don’t panic. You don’t need to install anything if you don’t want to. Just open repl.it, click ‘+ new repl’, select Python as the project’s language and you are ready to go:

In this image, you have a white column in the middle where you’ll write the code; on your right, a black terminal where the output will be displayed; and on your left, a column listing all the Python files. This script has only one file.


Importing libraries


If we had to code everything with just Python, it would take us days instead of less than 30 minutes. We need to import some libraries to help us. Write the following code to import them:

from bs4 import BeautifulSoup
import requests
import lxml

import pandas as pd

We are importing:

  • Requests to fetch the HTML files
  • BeautifulSoup to pull the data out of the HTML files
  • lxml to parse (or translate) the HTML so Python can work with it
  • Pandas to manipulate our data, print it and save it into a file

If we click “Run”, repl.it will download and install all the libraries. Don’t worry, it only installs them the first time.

Beautiful Soup loading libraries

Fetching the URL

The first step to scrape data from a URL? Fetching that URL.

Let’s make it simple: Go to Best CD Price and search for one band, then copy the resulting URL. This is mine: http://www.best-cd-price.co.uk/search-Keywords/1-/229816/sex+pistols.html

After the importing code, type this:

search_url = 'http://www.best-cd-price.co.uk/search-Keywords/1-/229816/sex+pistols.html' # Replace this URL with yours

# HTTP GET request
page = requests.get(search_url)

# Checking if we successfully fetched the URL
if page.status_code == requests.codes.ok:
  print('Everything is cool!')

Run the code and you’ll see the “Everything is cool!” message.

We have stored our URL in ‘search_url’. Using requests’ ‘get’ method we fetched the URL; if everything works properly, the request succeeds with a 200 status code (Success) and we print ‘Everything is cool!’ in our terminal.
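As a side note, ‘requests.codes.ok’ is just a friendly name for the number 200, so comparing against it is the same as comparing against 200 directly:

```python
import requests

# requests.codes.ok is simply the integer 200, the HTTP "success" status code,
# so both comparisons below are equivalent
print(requests.codes.ok)         # 200
print(requests.codes.ok == 200)  # True
```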

Python needs to understand the code. To do so, we have to translate it, or parse it. Replace the last print with the following code:

if page.status_code == requests.codes.ok:
  bs = BeautifulSoup(page.text, 'lxml')
  print(bs)

We parse the page’s text with the ‘lxml’ parser and print the result.

Beautiful Soup printing a parsed URL

Does it sound familiar?

We have the whole page’s HTML stored in the ‘bs’ variable. Now, let’s take the parts we need.


Selecting elements from the URL

Now the fun part begins!

For me, web scraping is fun especially because of this part of the process. We are like detectives at a crime scene, looking for clues we can follow up on.

Copy the search URL and paste it into a browser. While Chrome is recommended, it is not mandatory. Right-click anywhere on the website and select “Inspect”. A side drawer will open. Make sure to click the ‘Elements’ tab.

You are looking at the skeleton of the website: The HTML code.

You can move your cursor to an HTML tag and that part of the website will be selected.

This is the equivalent of a detective’s magnifying glass. By hovering over HTML tags we can tell which part we need to select.

And I found our first clue. Every CD is wrapped inside a li tag (List Item) inside a ul tag (Unordered List). Let’s grab all of them, then!

  bs = BeautifulSoup(page.text, 'lxml')
  list_all_cd = bs.findAll('li', class_='ResultItem')
  print(list_all_cd)
  print(len(list_all_cd))

We have our website stored in the ‘bs’ variable. We use the ‘findAll’ method to find every ‘li’ tag. But as there are 192 li elements, we need to narrow our scope.

So we fetch every li tag that also has the class ‘ResultItem’, then print all of them and the length of the list.

We get the whole list and ’10’, as there are 10 items in our page. It is looking good!
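A quick note: ‘findAll’ is the older name; modern versions of Beautiful Soup also offer ‘find_all’, and the two are the same method under different names. A tiny demonstration on a hand-written snippet of HTML (using Python’s built-in ‘html.parser’ so it runs without lxml):

```python
from bs4 import BeautifulSoup

# Two fake result items, mimicking the structure of the real page
soup = BeautifulSoup(
  '<ul><li class="ResultItem">A</li><li class="ResultItem">B</li></ul>',
  'html.parser')

# findAll and find_all return the same list of matching tags
print(len(soup.findAll('li', class_='ResultItem')))   # 2
print(len(soup.find_all('li', class_='ResultItem')))  # 2
```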


Getting our first album

We have a list of 10 (or fewer, depending on the band) albums. Let’s get one and see how we can extract the desired data. Then, following the same process, we will get the rest of them.

Remove the previous prints and type this:

  list_all_cd = bs.findAll('li', class_='ResultItem')
  cd = list_all_cd[0]
  print(cd)

After selecting all CDs, we store the first one into ‘cd’ and we print it:

You can view the same structure on the website too:

Let’s grab the information of this CD!

  image = cd.find('img', class_='ProductImage')
  print(image)

Output> <img class="ProductImage" src="https://images-eu.ssl-images-amazon.com/images/I/51HONDwUBZL._SS100_.jpg" width="100"/>

Following the same technique as before, we search for an ‘img‘ tag with the class ‘ProductImage’. This time the method is ‘find’, as we want only the first instance, not every one of them.

Hm, we have the element, indeed. But we only need the image’s URL, which is stored in the ‘src’ attribute. In Beautiful Soup, we can extract an attribute like a dictionary key, using [‘src’]:

  image = cd.find('img', class_='ProductImage')['src']
  print(image)

Output> https://images-eu.ssl-images-amazon.com/images/I/51HONDwUBZL._SS100_.jpg

Nice, we extracted the image. Let’s keep going:

  image = cd.find('img', class_='ProductImage')['src']

  name = cd.find('h2').find('a').text

  url = cd.find('h2').find('a')['href']
  url = 'http://www.best-cd-price.co.uk' + url

  artist = cd.find('li', class_="Artist")
  artist = artist.find('a').text if artist else ''

  binding = cd.find('li', class_="Binding")
  binding = binding.text.replace('Binding: ', '') if binding else ''

  format_album = cd.find('li', class_="Format")
  format_album = format_album.text.replace('Format: ', '') if format_album else ''

  release_date = cd.find('li', class_="ReleaseDate")
  release_date = release_date.text.replace('Released: ', '') if release_date else ''

  label = cd.find('li', class_="Label")
  label = label.find('a').text if label else ''

  print(name)
  print(artist)
  print(binding)
  print(format_album)
  print(release_date)

Output> 1) Never Mind The Bollocks - 40th Anniversary Deluxe Edition
Output> Sex Pistols
Output> Audio CD
Output> CD+DVD, Box set
Output> 2017-12-01

Cool! With a few lines we have everything!

We keep extracting the values we need. If the value is an attribute of the tag, such as ‘src’ or ‘href’, we use [‘href’] to extract it. If it is the text between the opening and closing tags, we use ‘.text’.

As not every album has the same properties, we fetch the element first and then, if it exists, we extract its value. If not, we just return an empty string:

    artist = cd.find('li', class_="Artist")
    artist = artist.find('a').text if artist else ''

Some values have extra text we don’t need, such as format_album or release_date. We remove that extra text with the ‘replace’ method, replacing it with an empty string.

This was the most complicated part of the code, but I’m sure you crushed it.
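The “fetch the element, then extract if it exists” pattern repeats for every field, so if you like, you can wrap it in a small helper. This is just an optional sketch: the helper name ‘safe_text’ is made up, and it uses Python’s built-in ‘html.parser’ so the demonstration runs without lxml or a network request:

```python
from bs4 import BeautifulSoup

def safe_text(tag, prefix=''):
  # Return the tag's text with an optional prefix removed, or '' when the tag is missing
  return tag.text.replace(prefix, '') if tag else ''

# Quick demonstration on a hand-written snippet of HTML
cd = BeautifulSoup('<li class="Binding">Binding: Audio CD</li>', 'html.parser')
print(safe_text(cd.find('li', class_='Binding'), 'Binding: '))  # Audio CD
print(safe_text(cd.find('li', class_='Missing')))               # empty string
```

With a helper like this, each field becomes a one-liner, e.g. `binding = safe_text(cd.find('li', class_="Binding"), 'Binding: ')`.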


Getting all the albums

We have everything to fetch the information from one CD, now let’s do the same with every CD and store it into an object.

Replace what is inside the ‘if page.status_code…’ statement with this:

if page.status_code == requests.codes.ok:
  bs = BeautifulSoup(page.text, 'lxml')
  # Fetching all items
  list_all_cd = bs.findAll('li', class_='ResultItem')
  data = {
    'Image': [],
    'Name': [],
    'URL': [],
    'Artist': [],
    'Binding': [],
    'Format': [],
    'Release Date': [],
    'Label': [],
  }
  
  for cd in list_all_cd:

    # Getting the CD attributes
    image = cd.find('img', class_='ProductImage')['src']

    name = cd.find('h2').find('a').text

    url = cd.find('h2').find('a')['href']
    url = 'http://www.best-cd-price.co.uk' + url

    artist = cd.find('li', class_="Artist")
    artist = artist.find('a').text if artist else ''

    binding = cd.find('li', class_="Binding")
    binding = binding.text.replace('Binding: ', '') if binding else ''

    format_album = cd.find('li', class_="Format")
    format_album = format_album.text.replace('Format: ', '') if format_album else ''

    release_date = cd.find('li', class_="ReleaseDate")
    release_date = release_date.text.replace('Released: ', '') if release_date else ''

    label = cd.find('li', class_="Label")
    label = label.find('a').text if label else ''

    # Store the values into the 'data' object
    data['Image'].append(image if image else '')
    data['Name'].append(name if name else '')
    data['URL'].append(url if url else '')
    data['Artist'].append(artist if artist else '')
    data['Binding'].append(binding if binding else '')
    data['Format'].append(format_album if format_album else '')
    data['Release Date'].append(release_date if release_date else '')
    data['Label'].append(label if label else '')

  print(data)

‘data’ is our dictionary structure. We are going to add each value to its key: the name of the album to data[‘Name’], its cover to data[‘Image’], etc. Now we just need to loop over each item and store the fetched values into the data dictionary.

For each element in list_all_cd, we assign it the name ‘cd’ and run the code inside the for loop.

After getting each value, we append it (or ‘add it’) to the data value:

data['Image'].append(image)

And now, the print. The information is there, nice!

Beautiful Soup data object

But it is ugly and hard to read…

Remember we installed the ‘pandas’ library? It will help us display the data, and do something else too.


Storing the albums in a file

Let’s use that pandas library! Copy this at the end of the file, outside the for loop:

  table = pd.DataFrame(data, columns=['Image', 'Name', 'URL', 'Artist', 'Binding', 'Format', 'Release Date', 'Label'])
  table.index = table.index + 1
  table.to_csv(f'my_albums.csv', sep=',', encoding='utf-8', index=False)
  print(table)

Pandas (or pd) gives us a ‘DataFrame’ function where we pass the data and the columns list. That’s it. That’s enough to create a beautiful table.

‘table.index = table.index + 1’ sets the first index to ‘1’ instead of ‘0’.

The next line creates a .csv file with a comma as the separator and sets the encoding to ‘UTF-8’. We don’t want to store the index, so we set index to False.
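If you want to double-check what to_csv produces, here is a minimal, self-contained round-trip with made-up data (the file name ‘demo_albums.csv’ is just for this example):

```python
import pandas as pd

# Build a tiny DataFrame with made-up data, save it, then read it back
data = {'Name': ['Never Mind The Bollocks'], 'Artist': ['Sex Pistols']}
table = pd.DataFrame(data, columns=['Name', 'Artist'])
table.index = table.index + 1  # start counting at 1 instead of 0
table.to_csv('demo_albums.csv', sep=',', encoding='utf-8', index=False)

# Reading the file back confirms the columns and rows were written correctly
check = pd.read_csv('demo_albums.csv')
print(check)
```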

Beautiful Soup printing data via pandas

It looks better!

But now check your left column. You have a ‘my_albums.csv’ file. Everything is stored there!

Congratulations, you have written your first scraping script in Python!


Extra points!

You succeeded in creating your own first scraping bot. You can scrape any band by switching the URL…

But we can do better, right?

Why not ask the user for the name of a band and search for it? Can it be done?

Of course.

Replace the old code with the new one:

# Old code
search_url = f'http://www.best-cd-price.co.uk/search-Keywords/1-/229816/sex+pistols.html'

# New code
band_name = input('Please, enter a band name:\n')
formatted_band_name = band_name.replace(' ', '+')
print(f'Searching {band_name}. Wait, please...')

base_url = 'http://www.best-cd-price.co.uk'
search_url = f'http://www.best-cd-price.co.uk/search-Keywords/1-/229816/{formatted_band_name}.html'

Here we ask the user to enter a band name and format it by replacing spaces with ‘+’ signs. The website does this when searching, so we have to do it too. Example: http://www.best-cd-price.co.uk/search-Keywords/1-/229816/sex+pistols.html
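By the way, replacing spaces with ‘+’ is exactly what Python’s standard urllib.parse.quote_plus does, and it also escapes other special characters a band name might contain. An alternative sketch:

```python
from urllib.parse import quote_plus

# quote_plus turns spaces into '+' and percent-escapes other special characters
band_name = 'sex pistols'
formatted_band_name = quote_plus(band_name)
print(formatted_band_name)  # sex+pistols
```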

Now, our search URL uses the formatted band name:

# Old code
url = 'http://www.best-cd-price.co.uk' + url

# New code
url = base_url + url

It is not mandatory, but it is better practice to define the URLs at the start of the code so they are easy to replace.

# Old code
table.to_csv(f'my_albums.csv', sep=',', encoding='utf-8', index=False)

# New code
table.to_csv(f'{band_name}_albums.csv', sep=',', encoding='utf-8', index=False)

Now the file is named after each band, so we can create as many files as we want without overwriting them! Let’s run the code.


Conclusion

In just a few minutes we have learned how to:

  • Fetch a website
  • Locate the element(s) we want to scrape
  • Analyze which HTML tags and classes we need to hit to retrieve the values
  • Store the retrieved values into an object
  • Create a file from that object
  • Make the code dynamic by letting users type the name of a band and storing the results in a file with the band’s name as the file name

I’m really proud of you for reaching the end of this tutorial.

Right now we are scraping just one page. Wouldn’t it be great to learn how to scrape all the pages?

Now you can! How to get to the next page on Beautiful Soup


My Youtube tutorial videos

Final code on Repl.it

Reach me on Twitter

My Github

Contact me: DavidMM1707@gmail.com

Keep reading more tutorials