Lecture: Scrape a table from a PDF file, tabula-py ver. 3

Last Updated on July 6, 2022 by shibatau

I. Scrape a table from a PDF file

II. Data

Global Gender Gap Report 2020

III. Scripts

Here is one line from my sample code on Google Colaboratory:

import tabula
tabula.read_pdf("/content/WEF_GGGR_2021.pdf", pages=10, stream=True, lattice=False)

Notes:

  • If your tables have lines separating the cells, you can use the lattice option. If they have no separating lines, try the stream option. By default, tabula-py sets guess=True, so it guesses which portion of each page to analyze.
  • read_pdf() reads only page 1 by default. Use the pages option to read other pages.
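The notes above can be sketched as a small helper that picks lattice or stream depending on whether the table has ruling lines. This is only a minimal sketch: the names read_pdf_options, extract_tables and has_ruling_lines are hypothetical and not part of tabula-py, and it assumes tabula-py and Java are installed.

```python
def read_pdf_options(has_ruling_lines, pages=1):
    """Build keyword arguments for tabula.read_pdf() (hypothetical helper).

    has_ruling_lines: True if the cells are separated by drawn lines.
    pages: page number(s) to read; read_pdf() reads only page 1 by default.
    """
    return {
        "pages": pages,
        "lattice": has_ruling_lines,     # use cell borders when they exist
        "stream": not has_ruling_lines,  # otherwise rely on whitespace
    }


def extract_tables(pdf_path, has_ruling_lines, pages=1):
    """Read tables from pdf_path with tabula-py (hypothetical wrapper)."""
    import tabula  # imported here so the helper above works without tabula-py

    # With the default multiple_tables=True, read_pdf() returns a list of
    # pandas DataFrames, one per table it finds.
    return tabula.read_pdf(pdf_path, **read_pdf_options(has_ruling_lines, pages))
```

For the report used in this lecture, the call would look like extract_tables("/content/WEF_GGGR_2021.pdf", has_ruling_lines=False, pages=10), which passes the same options as the sample line in III. Scripts.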

You can learn more here:

tabula-py example notebook

You can use Google Colaboratory to run the scripts. Please download the PDF file linked in II. Data and upload it to Google Colaboratory. You can see the scripts here:

https://colab.research.google.com/drive/11I5UytIn2XX18Tw31cfGQQGzmyxCzUO4?usp=sharing
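If you prefer to upload the PDF from a code cell instead of the file panel, something like the following may work. It is a rough sketch that only runs inside Google Colaboratory; upload_pdf_to_colab is a hypothetical name, and it assumes the uploaded file is saved in the current working directory (normally /content).

```python
from pathlib import Path


def upload_pdf_to_colab():
    """Open Colab's upload dialog and return the paths of the uploaded files."""
    from google.colab import files  # available only inside Google Colaboratory

    uploaded = files.upload()  # dict: file name -> file content as bytes
    # Colab saves each uploaded file in the current working directory.
    return [str(Path.cwd() / name) for name in uploaded]
```

In a fresh Colab session, tabula-py usually has to be installed first with !pip install tabula-py. The returned path can then be passed to tabula.read_pdf().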
