Python: Understanding scraping with Beautiful Soup ver. 2

Last Updated on May 15, 2022 by shibatau

I. Why not just use Pandas?

Web scraping can seem difficult, and Beautiful Soup may look confusing and hard to use at first. So why not start with Pandas and the DataFrame, which you are already accustomed to?

With Pandas, you would load and parse the HTML and then convert it to a table, following this procedure:

As you can see, the resulting table is complex, with very long column names that encode the structure of the nested values. This is why Beautiful Soup (dictionaries + objects + …) is a better fit than data frames for data with deeply nested structures.
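A minimal sketch of the flattening problem, using made-up nested data (the actual page data differs): when Pandas normalizes a nested structure into a DataFrame, each level of nesting is joined into one long dotted column name.

```python
import pandas as pd

# Hypothetical deeply nested data, similar to what parsed HTML/JSON often yields
data = {
    "page": {
        "header": {"title": "Sample", "nav": {"items": ["Home", "About"]}},
        "body": {"section": {"article": {"text": "Hello"}}},
    }
}

# json_normalize flattens the nesting into dotted column names
df = pd.json_normalize(data)
print(df.columns.tolist())
# e.g. ['page.header.title', 'page.header.nav.items', 'page.body.section.article.text']
```

The deeper the nesting, the longer the column names grow, which is exactly the complexity the table above shows.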

You can see the sample code here, though it is not yet complete.

https://colab.research.google.com/drive/1hG60qYH6czgO9abC-XwjmG6wc-0U0ZLI?usp=sharing

II. Using Beautiful Soup

You can learn how to use Beautiful Soup here:

Tutorial: Web Scraping with Python Using Beautiful Soup

I have added some comments and functions to aid learners' understanding. You can see the scripts here again:

https://colab.research.google.com/drive/1hG60qYH6czgO9abC-XwjmG6wc-0U0ZLI?usp=sharing

Some notes:

soup is an object created with Beautiful Soup, and it has the methods (functions) and attributes (data) shown below.
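As a minimal sketch of how such a soup object is created (parsing a small HTML string here rather than fetching a live page):

```python
from bs4 import BeautifulSoup

# A small HTML document, parsed with Python's built-in html.parser
html = "<html><body><p>Here is some simple content for this page.</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# soup is a BeautifulSoup object; dir(soup) lists its methods and attributes
print(type(soup).__name__)  # BeautifulSoup
```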

>>> dir(soup)
['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'HTML_FORMATTERS',
 'NO_PARSER_SPECIFIED_WARNING',
 'ROOT_TAG_NAME',
 'XML_FORMATTERS',
 '__bool__',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_attr_value_as_string',
 '_attribute_checker',
 '_check_markup_is_url',
 '_feed',
 '_find_all',
 '_find_one',
 '_formatter_for_name',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_most_recent_element',
 '_popToTag',
 '_select_debug',
 '_selector_combinators',
 '_should_pretty_print',
 '_tag_name_matches_and',
 'append',
 'attribselect_re',
 'attrs',
 'builder',
 'can_be_empty_element',
 'childGenerator',
 'children',
 'clear',
 'contains_replacement_characters',
 'contents',
 'currentTag',
 'current_data',
 'declared_html_encoding',
 'decode',
 'decode_contents',
 'decompose',
 'descendants',
 'encode',
 'encode_contents',
 'endData',
 'extract',
 'fetchNextSiblings',
 'fetchParents',
 'fetchPrevious',
 'fetchPreviousSiblings',
 'find',
 'findAll',
 'findAllNext',
 'findAllPrevious',
 'findChild',
 'findChildren',
 'findNext',
 'findNextSibling',
 'findNextSiblings',
 'findParent',
 'findParents',
 'findPrevious',
 'findPreviousSibling',
 'findPreviousSiblings',
 'find_all',
 'find_all_next',
 'find_all_previous',
 'find_next',
 'find_next_sibling',
 'find_next_siblings',
 'find_parent',
 'find_parents',
 'find_previous',
 'find_previous_sibling',
 'find_previous_siblings',
 'format_string',
 'get',
 'getText',
 'get_attribute_list',
 'get_text',
 'handle_data',
 'handle_endtag',
 'handle_starttag',
 'has_attr',
 'has_key',
 'hidden',
 'index',
 'insert',
 'insert_after',
 'insert_before',
 'isSelfClosing',
 'is_empty_element',
 'is_xml',
 'known_xml',
 'markup',
 'name',
 'namespace',
 'new_string',
 'new_tag',
 'next',
 'nextGenerator',
 'nextSibling',
 'nextSiblingGenerator',
 'next_element',
 'next_elements',
 'next_sibling',
 'next_siblings',
 'object_was_parsed',
 'original_encoding',
 'parent',
 'parentGenerator',
 'parents',
 'parse_only',
 'parserClass',
 'parser_class',
 'popTag',
 'prefix',
 'preserve_whitespace_tag_stack',
 'preserve_whitespace_tags',
 'prettify',
 'previous',
 'previousGenerator',
 'previousSibling',
 'previousSiblingGenerator',
 'previous_element',
 'previous_elements',
 'previous_sibling',
 'previous_siblings',
 'pushTag',
 'quoted_colon',
 'recursiveChildGenerator',
 'renderContents',
 'replaceWith',
 'replaceWithChildren',
 'replace_with',
 'replace_with_children',
 'reset',
 'select',
 'select_one',
 'setup',
 'string',
 'strings',
 'stripped_strings',
 'tagStack',
 'tag_name_re',
 'text',
 'unwrap',
 'wrap']

You can use the find_all function to get elements by tag. For example, you can get all occurrences of <p> … </p> with this code:

>>> soup.find_all('p')

You can access the first element with the following code:

>>> soup.find_all('p')[0]
<p>Here is some simple content for this page.</p>

You can get just the text between the tags with the get_text function:

>>> soup.find_all('p')[0].get_text()
'Here is some simple content for this page.'
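Beyond tag names, find_all can also filter by attributes such as class or id, and select() accepts CSS selectors. A short sketch with made-up HTML (the class and id names here are illustrative, not from the tutorial page):

```python
from bs4 import BeautifulSoup

html = """
<div>
  <p class="outer-text first-item" id="first">First outer paragraph.</p>
  <p class="outer-text">Second outer paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore, since class is a Python keyword)
first = soup.find_all("p", class_="first-item")[0]
print(first.get_text())            # First outer paragraph.

# select() takes a CSS selector: here, a <p> with id="first"
selected = soup.select("p#first")
print(selected[0].get_text())      # First outer paragraph.
```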

If you are interested in classes and objects, the following post of mine may help you get a general understanding of Python objects.

Python: Learning how to handle CSV data with DataFrame, Class and Dictionary, ver. 8

