Beautiful Soup – 02 – How to get the next page

It is easy to scrape a single page, but how do we get to the next page with Beautiful Soup? How can we crawl all the pages until we reach the end?

Today, we are going to learn how to fetch all the items while web scraping by following the next pages.

Video version of this post
Table of contents
Getting started
Refactoring – Getting rid of the clutter
Recursive function – The trick to get the next page
Conclusion

Getting Started


As the topic of this post is how to crawl the next pages, instead of coding a Beautiful Soup script from scratch, we are going to reuse the one we wrote previously.

If you are a beginner, please do the ‘Your first Web Scraping script with Python and Beautiful Soup’ tutorial first.

If you know how to use Beautiful Soup, use this starting code in repl.it.

This code fetches the albums of the band the user asks for. All of them? No, just the first 10 that are displayed on the first page. For now.

Open a new repl.it file or copy-paste the code into your code editor. Now it’s time to code!


Refactoring – Getting rid of the clutter


Before adding features, we need to clean the clutter by refactoring.

We are going to take blocks of code, place them in their own functions, and then call those functions where the code used to be.

Go to the end of the code and take the lines where we create the table:

  table = pd.DataFrame(data, columns=['Image', 'Name', 'URL', 'Artist', 'Binding', 'Format', 'Release Date', 'Label'])
  table.index = table.index + 1
  table.to_csv(f'{band_name}_albums.csv', sep=',', encoding='utf-8', index=False)
  print(table)

Cut them and create a function, for example, export_table_and_print, and put it after base_url and search_url:

base_url = 'http://www.best-cd-price.co.uk'
search_url = f'http://www.best-cd-price.co.uk/search-Keywords/1-/229816/{formated_band_name}.html'

# New code after this line
def export_table_and_print(data):
    table = pd.DataFrame(data, columns=[
                         'Image', 'Name', 'URL', 'Artist', 'Binding', 'Format', 'Release Date', 'Label'])
    table.index = table.index + 1
    clean_band_name = band_name.lower().replace(' ', '_')
    table.to_csv(f'{clean_band_name}_albums.csv',
                 sep=',', encoding='utf-8', index=False)
    print('Scraping done. Here are the results:')
    print(table)

We also added a ‘clean_band_name’ so the filename where we store the data has no spaces and is all lowercase: a “ThE BeAtLES” search stores a ‘the_beatles_albums.csv’ file.
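If you want to see that normalization on its own, here is a quick standalone check (the band name here is just an example value, not part of the script):

# Example: lowercase the band name and replace spaces with underscores
clean_band_name = 'ThE BeAtLES'.lower().replace(' ', '_')
print(f'{clean_band_name}_albums.csv')  # the_beatles_albums.csv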

Now, call the function where the old code was, at the very end of the file:

    data['Release Date'].append(release_date)
    data['Label'].append(label)

# New code
export_table_and_print(data)

The first part is done. Run the code and check it is still working.

Go to the ‘for’ loop at around line 45. Take everything involved in extracting the values and adding them to ‘data’ (so, the whole body of the loop) and replace it with ‘get_cd_attributes(cd)’.

After the last function, create that function and paste the code:

def get_cd_attributes(cd):
    # Getting the CD attributes
    image = cd.find('img', class_='ProductImage')['src']

    name = cd.find('h2').find('a').text

    url = cd.find('h2').find('a')['href']
    url = base_url + url

    ....

    data['Release Date'].append(release_date)
    data['Label'].append(label)
...

  for cd in list_all_cd:
    get_cd_attributes(cd)   

export_table_and_print(data)
    

Again, run the code and check it is still working. If it is not, compare your code with mine:

from bs4 import BeautifulSoup
import requests
import lxml

import pandas as pd


band_name = input('Please, enter a band name:\n')
formated_band_name = band_name.replace(' ', '+')
print(f'Searching {band_name}. Wait, please...')


base_url = 'http://www.best-cd-price.co.uk'
search_url = f'http://www.best-cd-price.co.uk/search-Keywords/1-/229816/{formated_band_name}.html'

def export_table_and_print(data):
    table = pd.DataFrame(data, columns=[
                         'Image', 'Name', 'URL', 'Artist', 'Binding', 'Format', 'Release Date', 'Label'])
    table.index = table.index + 1
    clean_band_name = band_name.lower().replace(' ', '_')
    table.to_csv(f'{clean_band_name}_albums.csv',
                 sep=',', encoding='utf-8', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_cd_attributes(cd):
    # Getting the CD attributes
    image = cd.find('img', class_='ProductImage')['src']

    name = cd.find('h2').find('a').text

    url = cd.find('h2').find('a')['href']
    url = base_url + url

    artist = cd.find('li', class_="Artist")
    artist = artist.find('a').text if artist else ''

    binding = cd.find('li', class_="Binding")
    binding = binding.text.replace('Binding: ', '') if binding else ''

    format_album = cd.find('li', class_="Format")
    format_album = format_album.text.replace('Format: ', '') if format_album else ''

    release_date = cd.find('li', class_="ReleaseDate")
    release_date = release_date.text.replace('Released: ', '') if release_date else ''

    label = cd.find('li', class_="Label")
    label = label.find('a').text if label else ''

    # Store the values into the 'data' object
    data['Image'].append(image)
    data['Name'].append(name)
    data['URL'].append(url)
    data['Artist'].append(artist)
    data['Binding'].append(binding)
    data['Format'].append(format_album)
    data['Release Date'].append(release_date)
    data['Label'].append(label)

# HTTP GET requests
page = requests.get(search_url)

# Checking if we successfully fetched the URL
if page.status_code == requests.codes.ok:
  bs = BeautifulSoup(page.text, 'lxml')
  # Fetching all items
  list_all_cd = bs.findAll('li', class_='ResultItem')
  data = {
    'Image': [],
    'Name': [],
    'URL': [],
    'Artist': [],
    'Binding': [],
    'Format': [],
    'Release Date': [],
    'Label': [],
  }
  
  for cd in list_all_cd:
    get_cd_attributes(cd)

export_table_and_print(data)

Is it working? Cool. Time to get ALL the albums!


Recursive function – The trick to get the next page

Ok, here’s the trick to get the job done: Recursiveness.

We are going to create a ‘parse_page’ function. That function will fetch the 10 albums each page displays.

After it is done, the function will call itself again with the next page and parse it, over and over, until we have everything.

Let me simplify it for you:
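Here is a minimal, self-contained sketch of the idea. The data and names below are made up just to illustrate the flow; they are not part of our script:

# Two fake "pages": each one holds some albums and maybe a link to the next page
fake_pages = {
    'page-1': {'albums': ['Album A', 'Album B'], 'next': 'page-2'},
    'page-2': {'albums': ['Album C'], 'next': None},
}
collected = []

def parse_page(page_id):
    page = fake_pages[page_id]
    collected.extend(page['albums'])  # "scrape" the items on this page
    if page['next']:
        parse_page(page['next'])      # there is a next page: parse it too
    else:
        print(collected)              # no more pages: "export" the results

parse_page('page-1')  # prints ['Album A', 'Album B', 'Album C']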

I hope it is clear: as long as there is a ‘next page’ to parse, we call the same function again and again to fetch all the data. When there is no next page, we stop. As simple as that.

Step 1: Create the function

Grab this code, put it inside a new function called ‘parse_page(url)’, and call that function on the last line:

# HTTP GET requests
page = requests.get(search_url)

# Checking if we successfully fetched the URL
if page.status_code == requests.codes.ok:
  bs = BeautifulSoup(page.text, 'lxml')
  # Fetching all items
  list_all_cd = bs.findAll('li', class_='ResultItem')
  data = {
    'Image': [],
    'Name': [],
    'URL': [],
    'Artist': [],
    'Binding': [],
    'Format': [],
    'Release Date': [],
    'Label': [],
  }
  
  for cd in list_all_cd:
    get_cd_attributes(cd)

export_table_and_print(data)

The ‘data’ object is going to be used in different places, so take it out of that block and put it right after ‘search_url’.

We took the main code and created a ‘parse_page’ function, called it using ‘search_url’ as the parameter, and took the ‘data’ object out so we can use it globally.
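A quick aside on why this works without the ‘global’ keyword: we never reassign ‘data’ inside the functions, we only append to the lists it already contains, and mutating a module-level object is allowed from anywhere. Here is a tiny standalone example of the idea (the names are just for illustration):

# A module-level ("global") dict, like our 'data' object
data = {'Name': []}

def add_name(name):
    # We mutate the existing dict, we don't reassign it, so no 'global' is needed
    data['Name'].append(name)

add_name('Example Album')
print(data)  # {'Name': ['Example Album']}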

In case you are dizzy, here’s what your code should look like now:

from bs4 import BeautifulSoup
import requests
import lxml

import pandas as pd


band_name = input('Please, enter a band name:\n')
formated_band_name = band_name.replace(' ', '+')
print(f'Searching {band_name}. Wait, please...')


base_url = 'http://www.best-cd-price.co.uk'
search_url = f'http://www.best-cd-price.co.uk/search-Keywords/1-/229816/{formated_band_name}.html'

data = {
  'Image': [],
  'Name': [],
  'URL': [],
  'Artist': [],
  'Binding': [],
  'Format': [],
  'Release Date': [],
  'Label': [],
}

def export_table_and_print(data):
    table = pd.DataFrame(data, columns=[
                         'Image', 'Name', 'URL', 'Artist', 'Binding', 'Format', 'Release Date', 'Label'])
    table.index = table.index + 1
    clean_band_name = band_name.lower().replace(' ', '_')
    table.to_csv(f'{clean_band_name}_albums.csv',
                 sep=',', encoding='utf-8', index=False)
    print('Scraping done. Here are the results:')
    print(table)

def get_cd_attributes(cd):
    # Getting the CD attributes
    image = cd.find('img', class_='ProductImage')['src']

    name = cd.find('h2').find('a').text

    url = cd.find('h2').find('a')['href']
    url = base_url + url

    artist = cd.find('li', class_="Artist")
    artist = artist.find('a').text if artist else ''

    binding = cd.find('li', class_="Binding")
    binding = binding.text.replace('Binding: ', '') if binding else ''

    format_album = cd.find('li', class_="Format")
    format_album = format_album.text.replace('Format: ', '') if format_album else ''

    release_date = cd.find('li', class_="ReleaseDate")
    release_date = release_date.text.replace('Released: ', '') if release_date else ''

    label = cd.find('li', class_="Label")
    label = label.find('a').text if label else ''

    # Store the values into the 'data' object
    data['Image'].append(image)
    data['Name'].append(name)
    data['URL'].append(url)
    data['Artist'].append(artist)
    data['Binding'].append(binding)
    data['Format'].append(format_album)
    data['Release Date'].append(release_date)
    data['Label'].append(label)

def parse_page(next_url):
  # HTTP GET requests
  page = requests.get(next_url)

  # Checking if we successfully fetched the URL
  if page.status_code == requests.codes.ok:
    bs = BeautifulSoup(page.text, 'lxml')
    # Fetching all items
    list_all_cd = bs.findAll('li', class_='ResultItem')

    
    for cd in list_all_cd:
      get_cd_attributes(cd)

  export_table_and_print(data)

parse_page(search_url)

Please check this line:

page = requests.get(next_url)

Now we are not fetching the ‘search_url’ (the first one) but the URL that we pass as an argument. This is very important.

Step 2: Add recursion

Run the code again. It should fetch the first 10 albums, as always.

That’s because we haven’t added the recursion yet. Let’s write the code that will:

  • Get all the pagination links
  • From all the links, grab the last one
  • Check if the last one has a ‘Next’ text
  • If it has it, get the relative (partial) url
  • Build the next page url by adding base_url and the relative_url
  • Call parse_page again with the next page url
  • If it doesn’t have the ‘Next’ text, just export the table and print it

Once we have fetched all the CD attributes (that is, after the ‘for cd in list_all_cd’ loop), add this line:

    next_page_text = bs.find('ul', class_="SearchBreadcrumbs").findAll('li')[-1].text

We are getting all the ‘list item’ (or ‘li’) elements inside the ‘unordered list’ with the ‘SearchBreadcrumbs’ class. That’s the pagination list.
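Here is a tiny standalone example of that selection, using a simplified, made-up version of the pagination markup (the real page has more attributes, but the idea is the same):

from bs4 import BeautifulSoup

# Simplified, invented pagination list just to illustrate the selection
sample = '''
<ul class="SearchBreadcrumbs">
  <li><a href="/search-1.html">1</a></li>
  <li><a href="/search-2.html">2</a></li>
  <li><a href="/search-2.html">Next</a></li>
</ul>
'''
bs = BeautifulSoup(sample, 'lxml')
last_item = bs.find('ul', class_="SearchBreadcrumbs").findAll('li')[-1]
print(last_item.text)               # Next
print(last_item.find('a')['href'])  # /search-2.html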

From that list we take the last item and read its text. Now add the following right after the previous line:

    if next_page_text == 'Next':
        next_page_partial = bs.find('ul', class_="SearchBreadcrumbs").findAll(
            'li')[-1].find('a')['href']
        next_page_url = base_url + next_page_partial
        print(next_page_url)
        parse_page(next_page_url)
    # No more 'Next' pages, finish the script
    else:
        export_table_and_print(data)

Now we check whether ‘next_page_text’ is ‘Next’. If it is, we take the partial URL and add it to the base URL to build ‘next_page_url’, then call ‘parse_page’ again with it. If it is not, there are no more pages, so we can create the file and print the table.
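If you want to double-check where everything goes, here is roughly how ‘parse_page’ should look after these two additions. Note that the ‘export_table_and_print(data)’ call we had at the end of the function in Step 1 now lives only in the ‘else’ branch:

def parse_page(next_url):
  # HTTP GET request
  page = requests.get(next_url)

  # Checking if we successfully fetched the URL
  if page.status_code == requests.codes.ok:
    bs = BeautifulSoup(page.text, 'lxml')
    # Fetching all items on this page
    list_all_cd = bs.findAll('li', class_='ResultItem')

    for cd in list_all_cd:
      get_cd_attributes(cd)

    # Last element of the pagination list
    next_page_text = bs.find('ul', class_="SearchBreadcrumbs").findAll('li')[-1].text

    if next_page_text == 'Next':
        next_page_partial = bs.find('ul', class_="SearchBreadcrumbs").findAll(
            'li')[-1].find('a')['href']
        next_page_url = base_url + next_page_partial
        print(next_page_url)
        parse_page(next_page_url)
    # No more 'Next' pages, finish the script
    else:
        export_table_and_print(data)

parse_page(search_url)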

That’s all we need. Run the code, and now you are getting dozens, if not hundreds of items!

Step 3: Fixing a small bug

But we can still improve the code. Add these lines right after parsing the page with Beautiful Soup:

  if page.status_code == requests.codes.ok:
    bs = BeautifulSoup(page.text, 'lxml')

    # New Code
    check_no_results = bs.find('ul', class_="SearchResults").find('p')
    if check_no_results and check_no_results.text:
        print('Search returned no results.')
        # Export whatever we have collected so far, so the file is still created
        export_table_and_print(data)
        return None

Sometimes there is a ‘Next’ page when the number of albums is a multiple of 10 (10, 20, 30, 40 and so on) but there are no albums on it. That makes the code end without creating the file.

With this code, it is fixed.

Your coding is done! Congratulations!


Conclusion

Let me summarize what we have done:

  • We moved blocks of code with the same functionality to functions
  • We put the scraping code inside a function and we call it passing the initial search_url
  • Inside the function, we scrape the page
  • After it is done, we check for the next URL
  • If there is a ‘next URL’, we call the function again with the next page URL
  • If not, we end the scraping and create the .csv file

Now it seems simpler, right?

I want to keep doing tutorials like this one, but I want to ask you what you want to see:

  • Do you want more Web Scraping with Beautiful Soup or Scrapy?
  • Do you want me to teach how to make a Flask web app or a Django one?
  • Or do you want to learn more Front-End things like Vue.js?

Please leave me a comment with what you want to see in future posts.

And if this tutorial has been useful to you, share it with your friends on Twitter, Facebook, or wherever it can help others.


Final code on Repl.it

My Youtube tutorial videos

Reach to me on Twitter

My Github

Contact me: DavidMM1707@gmail.com

Keep reading more tutorials