Last Updated on March 27, 2023 by mishou

I. What do I want to do?

I’m reading The Brothers Karamazov by Fyodor Dostoevsky again, and I have decided to delve into In Search of Lost Time by Proust. I may also read Ulysses by James Joyce. Those alone may suffice, but I have been searching for books to read for the rest of my life, so I am comparing the 100-book lists published on a few websites. To do this, I have scraped the book lists from each of the sites.

II. Time Magazine’s All-Time 100 Novels


I want to retrieve the authors and titles from the following page:

Time Magazine's all-time 100 novels

I typed:

Show the Python code for scraping titles and authors on a web page linked to

ChatGPT showed the following code:

import requests
from bs4 import BeautifulSoup

url = ""

response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")

books = soup.find_all("tr", {"itemtype": ""})

for book in books:
    title = book.find("a", {"class": "bookTitle"}).text.strip()
    author = book.find("a", {"class": "authorName"}).text.strip()
    print("Title: ", title)
    print("Author: ", author)

I have just run the code on Google Colaboratory:

code on google colaboratory

ChatGPT also explained the code:

In this code, we first import the requests and BeautifulSoup libraries. We then specify the URL of the web page we want to scrape and use the requests.get() method to retrieve the HTML content of the page.

Next, we create a BeautifulSoup object from the HTML content using the “html.parser” parser.

We then use the soup.find_all() method to find all the tr elements whose itemtype attribute is set to the value given in the code, which correspond to the books listed on the web page.

For each book, we use the book.find() method to find the title and author elements within the book tr element. We use the text.strip() method to remove any extra whitespace from the extracted text.

Finally, we print out the title and author information for each book using the print() function.
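One caveat: book.find() returns None when an element is missing, so book.find(...).text would raise an AttributeError on any row that lacks a title or author link. Below is a minimal defensive sketch, using a small made-up HTML snippet in the same shape the code expects (the itemtype value and the HTML here are assumptions for illustration, not the real page):

```python
from bs4 import BeautifulSoup

# made-up HTML mimicking the structure the scraping code expects
html = """
<table>
  <tr itemtype="http://schema.org/Book">
    <td><a class="bookTitle">The Brothers Karamazov</a></td>
    <td><a class="authorName">Fyodor Dostoevsky</a></td>
  </tr>
  <tr itemtype="http://schema.org/Book">
    <td><a class="bookTitle">Ulysses</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
books = soup.find_all("tr", {"itemtype": "http://schema.org/Book"})

rows = []
for book in books:
    title_el = book.find("a", {"class": "bookTitle"})
    author_el = book.find("a", {"class": "authorName"})
    # skip rows where either element is missing instead of crashing on None
    if title_el and author_el:
        rows.append((title_el.text.strip(), author_el.text.strip()))

print(rows)
```

With this guard, the second (incomplete) row is skipped quietly rather than stopping the whole scrape.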

III. 100 Books to Read in a Lifetime on Amazon


I’m on Garuda Linux, and I used Brave Browser for scraping. I opened it by running the following Python code:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from chromedriver_py import binary_path  # for the path variable
import pandas as pd
from bs4 import BeautifulSoup

# setting up the webdriver
service_object = Service(binary_path)
options = Options()
options.binary_location = "/usr/bin/brave"
# Selenium 4 style: pass the Service and Options objects
driver = webdriver.Chrome(service=service_object, options=options)
# access the page with the browser

Display all 100 books in Brave Browser, then run the code shown below the image:

100 book titles shown on the screen
# after showing all the 100 books,
# retrieve the page source
page_sources = driver.page_source
# parse the HTML
soup = BeautifulSoup(page_sources, 'html.parser')
# find the span elements with the class `a-truncate-full`
titles = soup.find_all('span', class_='a-truncate-full')
# clean the texts
titles_ls = [item.text.strip().replace("\n", "") for item in titles]
# count the items
print(len(titles_ls))
# show the list
print(titles_ls)

You can create a data frame of titles and authors by running the following code:

# create a data frame: titles and authors alternate in the list
title = titles_ls[0::2]
author = titles_ls[1::2]
df = pd.DataFrame({'title': title,
                   'author': author})
# export a CSV file
df.to_csv('amazon100.csv', index=False)
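The slicing trick works because the scraped list alternates title, author, title, author: the even indices ([0::2]) are the titles and the odd indices ([1::2]) are the authors. Here is a quick sketch with hypothetical data (the four strings are placeholders, not actual scraped output):

```python
import pandas as pd

# hypothetical alternating list: title, author, title, author
titles_ls = ["Ulysses", "James Joyce", "Swann's Way", "Marcel Proust"]

title = titles_ls[0::2]   # even indices -> titles
author = titles_ls[1::2]  # odd indices -> authors
df = pd.DataFrame({'title': title, 'author': author})
print(df)
```

This pairing only holds if every title on the page is immediately followed by its author, so it is worth checking that len(titles_ls) is even before splitting.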
a data frame on Jupyter Lab

To be continued.

