Last Updated on April 30, 2023 by mishou
I. Scrapy and BeautifulSoup
Scrapy is a powerful library for extracting data from web pages and XML files. I usually use Pandas’ read_html() to get data from HTML tables, and BeautifulSoup for text and other kinds of data. But Scrapy is a full-fledged web scraping framework, and it is sometimes said that you should learn Scrapy if you want to do serious web scraping.
Scrapy has a steep learning curve and a reputation for not being beginner-friendly, but I believe the curve becomes much gentler if you run Scrapy on Google Colaboratory.
The article “Difference between BeautifulSoup and Scrapy crawler” lists the advantages and disadvantages of each library. It reads:
BeautifulSoup, advantages:
- Easy for beginners to learn and master in web scrapping.
- It has good community support to figure out the issue.
- It has good comprehensive documentation.
BeautifulSoup, disadvantages:
- It has an external python dependency.
Scrapy, advantages:
- It is easily extensible.
- It has built-in support for extracting data.
- It has very fast speed compared to other libraries.
- It is both memory and CPU efficient.
- You can also build robust and extensive applications.
- Has strong community support.
Scrapy, disadvantages:
- It has light documentation for beginners.
II. Create a spider
Spiders are classes that you define and that Scrapy uses to scrape information from a website. Let’s create our first spider. You can find the scripts here:
1. Creating the files and directories for our project
# install scrapy
!pip install Scrapy
# create files for learning
!scrapy startproject firstproject
2. Creating quotes_spider.py and saving it
Change the current working directory to the spiders directory with os.chdir().
import os
# change the working directory to the spiders directory
os.chdir('/content/firstproject/firstproject/spiders')
Create quotes_spider.py and save it under the spiders directory using the IPython cell magic %%writefile.
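The content of quotes_spider.py is not reproduced in this post, so the following is only a sketch of what such a file could contain, modeled on the spider in the official Scrapy tutorial. It uses plain Python instead of %%writefile so that it also runs outside a notebook. The names write_spider and SPIDER_SOURCE are made up for this example, and the fields collected in parse() are an assumption.

```python
from pathlib import Path

# Spider source, modeled on the official Scrapy tutorial.
# Assumption: the original quotes_spider.py may differ from this sketch.
SPIDER_SOURCE = '''import scrapy


class QuotesSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl quotes
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Each quote on the page sits in a <div class="quote"> element
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
'''


def write_spider(directory="."):
    """Write quotes_spider.py into `directory` and return its path."""
    path = Path(directory) / "quotes_spider.py"
    path.write_text(SPIDER_SOURCE, encoding="utf-8")
    return path
```

After changing into the spiders directory, calling write_spider() with no argument saves the file there, which should have the same effect as a %%writefile quotes_spider.py cell containing the spider source.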
III. Extracting data using Scrapy shell
Now, run the following code:
!scrapy shell 'https://quotes.toscrape.com'
Then enter the following commands one at a time:
quote = response.css("div.quote")
text = quote.css("span.text::text").get()
text
Because get() returns only the first match, you will get the text of the first quote:
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
You can quit the Scrapy shell with Ctrl + c. You can see the scripts here:
To be continued.