Last Updated on April 30, 2023 by mishou

I. Scrapy and BeautifulSoup

Scrapy is a powerful library for extracting data from web pages and XML files. I usually use Pandas’ read_html() to get data from HTML tables, and I have used BeautifulSoup for text and other types of data. But Scrapy is a full-fledged web scraping framework, and it is sometimes said that you should learn Scrapy if you want to do serious web scraping.

Scrapy has a steep learning curve and a reputation for not being beginner-friendly, but I believe you can master it along a much gentler learning curve if you run Scrapy on Google Colaboratory.

You can learn the advantages and disadvantages of each library at Difference between BeautifulSoup and Scrapy crawler. It reads:

BeautifulSoup

Advantages:

  • Easy for beginners to learn and master in web scraping.
  • It has good community support to figure out the issue.
  • It has good comprehensive documentation.

Disadvantages:

  • It has an external python dependency.

Scrapy crawler

Advantages:

  • It is easily extensible.
  • It has built-in support for extracting data.
  • It has very fast speed compared to other libraries.
  • It is both memory and CPU efficient.
  • You can also build robust and extensive applications.
  • Has strong community support.

Disadvantages:

  • It has light documentation for beginners.

II. Create a spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website. Let’s create our first spider. You can find the scripts here:

Scrapy Tutorial

1. Creating the files and directories for our project

# install scrapy
!pip install Scrapy
# create files for learning
!scrapy startproject firstproject

2. Creating quotes_spider.py and saving it

Change the current working directory to the spiders directory with os.chdir().

# change the working directory to the spiders directory
import os
os.chdir('/content/firstproject/firstproject/spiders')

Create quotes_spider.py and save it under the spiders directory using the IPython magic command %%writefile.
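
As a rough sketch of this step, the snippet below writes a spider file to the current directory with pathlib instead of %%writefile, so it runs as plain Python. The spider source is modeled on the official Scrapy tutorial and is an assumption, not necessarily the exact cell used here; the helper name write_spider is hypothetical.

```python
from pathlib import Path

# Spider source modeled on the official Scrapy tutorial (an assumption;
# the exact cell contents may differ).
SPIDER_SOURCE = '''\
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # Yield one dictionary per quote block found on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
'''


def write_spider(directory="."):
    """Write quotes_spider.py into the given directory and return its path."""
    path = Path(directory) / "quotes_spider.py"
    path.write_text(SPIDER_SOURCE, encoding="utf-8")
    return path


# After os.chdir() the current directory is the spiders directory.
spider_path = write_spider()
print(spider_path.resolve())
```

In a Colab cell, putting %%writefile quotes_spider.py on the first line and the spider source below it should have the same effect.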


III. Extracting data using Scrapy shell


Now, run the following code:

!scrapy shell 'https://quotes.toscrape.com'

Then enter the following commands one at a time:

quote = response.css("div.quote")[0]
text = quote.css("span.text::text").get()
text

You will get the text of the first quote:

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

You can quit the Scrapy shell with Ctrl + c. You can see the scripts here:

https://colab.research.google.com/drive/1kSSEUQJ86VyjccT004rHx0P922YKOcIy?usp=sharing

To be continued.

