Reading is the fuel of so many skills you learn in life. You gain the most out of it by reading the best books.
The best-seller list is one of the best ways to get a sense of what is worth reading and buying on a certain topic.
You may have found yourself copying book data like the title, author, reader rating, and ranking from the web and pasting it into a spreadsheet.
Copying and pasting is cumbersome, error-prone, and time-consuming.
Or you might need that data to decide which books to buy. That's actually my case.
I like to read the best books on my favorite topics. So I wrote this script to quickly and easily pull the best books into a CSV file, and I wanted to share it with you.
This quick tutorial will help you get started. You’ll learn how to scrape data from an Amazon bestseller page and export it to a spreadsheet. You can apply the same strategy to any bestseller page.
Use the code in your own interest (Licensed as Creative Commons Zero).
Let’s get started.
Requirements
All you need to install is two libraries: BeautifulSoup and pandas. I assume you have Python 3 and pip installed.
If you haven’t installed the libraries already, do so by running the following command in your terminal:
$ pip install beautifulsoup4 pandas
Then create a new Python script and import both, along with urllib from the standard library:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd
Here you use:
- urlopen to open the page.
- Request to create a request object to communicate with the server.
- BeautifulSoup to parse the HTML.
- pandas to export the data to a CSV file.
Requesting the Page
You need to request the page. First, create a Request object with the URL, then pass it to the urlopen function:
# Page for best sellers in writing (Authorship subcategory)
url = 'https://www.amazon.com/gp/bestsellers/books/11892'
request = Request(url, headers={'User-agent': 'Mozilla/5.0'})
html = urlopen(request)
Note: The User-agent header is required to prevent Amazon from blocking your request. If you don’t pass it, you will get this error: urllib.error.HTTPError: HTTP Error 503: Service Unavailable.
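If you want to verify that the header is attached before sending anything, you can inspect the Request object locally; this sketch involves no network call:

```python
from urllib.request import Request

url = 'https://www.amazon.com/gp/bestsellers/books/11892'
request = Request(url, headers={'User-agent': 'Mozilla/5.0'})

# The header is stored on the Request object itself,
# so you can check it before ever calling urlopen.
print(request.get_header('User-agent'))  # Mozilla/5.0
print(request.full_url)
```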
Parsing the HTML
Now the html variable contains the HTML of the page. urlopen doesn’t understand HTML; it just returns it.
So you need to parse it with BeautifulSoup. Doing so lets you scrape the data based on the structure of the parsed HTML.
soup = BeautifulSoup(html, 'html.parser')
Next, open the page in your browser and look at the structure of the HTML. Inspect it by pressing CTRL+SHIFT+C (CMD+SHIFT+C on macOS). When you click the item whose HTML structure you want to know, you’ll see its HTML tag in the Elements tab.
Scraping Amazon Best Seller Page
You want to grab the list of best-seller items that you can loop over.
After you inspect the HTML of an item in the browser, you will find that each item is a div tag with the id attribute gridItemRoot.
Use the find_all() method to get all the div tags with that id. This is what we want to loop over to get information about each book.
books = soup.find_all('div', id="gridItemRoot")
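To see what find_all() returns, here is a self-contained toy example; the HTML fragment is a made-up stand-in for Amazon’s real markup:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the bestseller page's structure.
sample = """
<div id="gridItemRoot"><span class="zg-bdg-text">#1</span></div>
<div id="gridItemRoot"><span class="zg-bdg-text">#2</span></div>
"""

demo_soup = BeautifulSoup(sample, 'html.parser')
demo_books = demo_soup.find_all('div', id="gridItemRoot")
print(len(demo_books))  # 2 -- one entry per matching div
```

Even though an id is supposed to be unique, find_all() happily returns every element that matches it.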
Side note: Amazon should structure their HTML better than this. An id attribute is supposed to be unique, but here it isn’t; it should really be a class attribute.
Anyway, let’s continue.
Now you loop over each book. Focus on one item at a time and extract the information you want from the HTML structure.
for book in books:
    rank = book.find('span', class_="zg-bdg-text").get_text().replace('#', '')
    print(rank)
    title = book.find(
        'div',
        class_="_p13n-zg-list-grid-desktop_truncationStyles_p13n-sc-css-line-clamp-1__1Fn1y"
    ).get_text(strip=True)
    print(f"Title: {title}")
In this example, you get the rank from the span tag with the class zg-bdg-text. The raw result looks like #1, so you make a small change and remove the # character.
Similar to getting the rank, you get the title from the div tag with its associated long class.
Then you get the author and rating info:
for book in books:
    ...
    author = book.find('div', class_="a-row a-size-small").get_text(strip=True)
    print(f"Author: {author}")
    # Guard both lookups: some items have no rating row at all.
    rating_row = book.find('div', class_="a-icon-row")
    r = rating_row.find('a', class_="a-link-normal") if rating_row else None
    rating = r.get_text(strip=True).replace(' out of 5 stars', '') if r else None
You want just the rating number, so you make a small change and strip the " out of 5 stars" suffix from the result string.
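Both cleanups are plain replace() calls on strings; here they are in isolation, with made-up values standing in for the scraped text:

```python
# Made-up examples of the raw strings scraped from the page.
raw_rank = '#1'
raw_rating = '4.7 out of 5 stars'

rank = raw_rank.replace('#', '')                      # drop the leading '#'
rating = raw_rating.replace(' out of 5 stars', '')    # keep only the number

print(rank)    # 1
print(rating)  # 4.7
```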
Export to CSV
To export the data to a CSV file, create empty lists before the for loop and append each value to its list inside the loop.
Once the loop has filled the lists, create a pandas DataFrame and export it to a CSV file:
pd.DataFrame({
    'Rank': ranks,
    'Title': titles,
    'Author': authors,
    'Rating': ratings
}).to_csv('best_seller.csv', index=False)
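Putting the pieces together, here is a minimal end-to-end sketch of the list-and-append pattern. It runs against a tiny made-up HTML fragment instead of the live page (the book data below is invented), but the tags and classes mirror the ones used above:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up fragment standing in for the live bestseller page.
sample = """
<div id="gridItemRoot">
  <span class="zg-bdg-text">#1</span>
  <div class="_p13n-zg-list-grid-desktop_truncationStyles_p13n-sc-css-line-clamp-1__1Fn1y">Example Book</div>
  <div class="a-row a-size-small">Example Author</div>
  <div class="a-icon-row"><a class="a-link-normal">4.8 out of 5 stars</a></div>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')

# Empty lists before the loop, one append per field inside it.
ranks, titles, authors, ratings = [], [], [], []
for book in soup.find_all('div', id="gridItemRoot"):
    ranks.append(book.find('span', class_="zg-bdg-text").get_text().replace('#', ''))
    titles.append(book.find(
        'div',
        class_="_p13n-zg-list-grid-desktop_truncationStyles_p13n-sc-css-line-clamp-1__1Fn1y"
    ).get_text(strip=True))
    authors.append(book.find('div', class_="a-row a-size-small").get_text(strip=True))
    rating_row = book.find('div', class_="a-icon-row")
    r = rating_row.find('a', class_="a-link-normal") if rating_row else None
    ratings.append(r.get_text(strip=True).replace(' out of 5 stars', '') if r else None)

df = pd.DataFrame({'Rank': ranks, 'Title': titles, 'Author': authors, 'Rating': ratings})
df.to_csv('best_seller.csv', index=False)
print(df.shape)  # (1, 4)
```

On the real page, you would build soup from the urlopen response as shown earlier; everything else stays the same.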
Final Thoughts
This is a quick tutorial to get you started with scraping data in general. We’ve seen how to get the data from the Amazon bestseller page, parse the HTML, and export the result to a CSV file.
Note: There is a limitation to this code because we’ve used BeautifulSoup, which is just an HTML parser. You may notice that the page lists 50 books in the bestseller list while we get only 30.
That’s because the page uses JavaScript to load the remaining books only when you scroll down, and urlopen fetches just the initial HTML.
If you want to get the whole 50-book list, you need to use another library. Consider using Selenium or Scrapy.
But this is a quick tutorial to get you started. 30 books is a good number.