vsupalov

Beautifulsoup4 Essentials

I love the utility made possible by beautifulsoup4, aka bs4. I also struggle with the docs every single time I pick it up. Somehow, I find it hard to locate the parts I need, and end up searching for them for a long time. Which is annoying the fifth time around.

So here’s a quick overview of the most essential snippets. You’re welcome, future me.

Installation & Docs

The package is called beautifulsoup4 (install it with pip install beautifulsoup4). You can find the docs at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Import & Reading HTML Data

from bs4 import BeautifulSoup
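# htmlString is a str holding the HTML to parse
# 'html.parser' is the parser from Python's standard library
soup = BeautifulSoup(htmlString, 'html.parser')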

The soup variable is the one we’re going to work with.
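To make the snippets easy to try out, here is a minimal sketch that builds a soup from an inline string. The markup and the name html_string are made up for illustration, and it assumes nothing beyond the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example
html_string = """
<html><body>
  <h1 id="the-id">Title</h1>
  <a class="something" href="value">first</a>
  <a class="other" href="/elsewhere">second</a>
</body></html>
"""

soup = BeautifulSoup(html_string, "html.parser")

# Tags can be reached as attributes of the soup (first match wins)
title_text = soup.h1.get_text()

# find_all returns every matching tag
link_count = len(soup.find_all("a"))
```

Any string of HTML works the same way, for example the body of an HTTP response or the contents of a file you read yourself.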

Finding Elements

From the docs: find_all(name, attrs, recursive, string, limit, **kwargs)

# each element in the list is a Tag object,
# which supports the same search methods as the soup itself
result_list = soup.find_all("a")

soup.find_all(id="the-id")

# <a> elements with the CSS class "something"
# (class_ has a trailing underscore because class is a Python keyword)
soup.find_all("a", class_="something")

# you can also pass a function, it receives the attribute value
def function_evaluating_value(href):
    return href == "value"

soup.find_all(href=function_evaluating_value)
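The remaining arguments from the signature (attrs, recursive, string, limit) are easy to forget, so here is a rough sketch of each one in action. The markup and the is_external helper are made up for this example; find is the single-result sibling of find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example
html_string = """
<div id="menu">
  <a class="nav primary" href="/home">Home</a>
  <a class="nav" href="/about">About</a>
  <p><a href="https://example.com">External</a></p>
</div>
"""
soup = BeautifulSoup(html_string, "html.parser")

# limit: stop after the first N matches (still returns a list)
first_link = soup.find_all("a", limit=1)

# recursive=False: only look at direct children of the element
menu = soup.find(id="menu")
direct_links = menu.find_all("a", recursive=False)

# string: match tags by their text content
by_text = soup.find_all("a", string="About")

# attrs: pass attribute filters as a dict instead of keyword arguments
by_attrs = soup.find_all("a", attrs={"class": "nav"})

# function filter, guarded against None in case a tag lacks the attribute
def is_external(href):
    return href is not None and href.startswith("https://")

external = soup.find_all("a", href=is_external)

# find returns the first match, or None if nothing matches
no_table = soup.find("table")
```

Note that matching on class checks each individual class, so "nav" matches both the "nav primary" link and the plain "nav" link.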

Getting Content

# text content of an element (also works on the whole soup)
element.get_text()

# get an attribute
# returns None if the attribute is missing
# element['href'] would raise a KeyError instead
element.get('href')

# some attributes, such as class, come back as lists
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#multi-valued-attributes
element.get('class')

# all attributes, as a dict
element.attrs

# tag name
element.name
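Putting the accessors side by side, as a small sketch on a made-up element (the markup and variable names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example
html_string = '<p>Read <a class="btn big" href="/docs">the <b>docs</b></a></p>'
soup = BeautifulSoup(html_string, "html.parser")
element = soup.find("a")

# Text of the element, including the text of nested tags
text = element.get_text()

# Attribute access: .get() is the forgiving variant
href = element.get("href")
missing = element.get("title")  # attribute not present, so None

# Square brackets raise KeyError for a missing attribute
try:
    element["title"]
    raised = False
except KeyError:
    raised = True

# class is multi-valued, so it comes back as a list
classes = element.get("class")

# All attributes at once, and the tag name
attrs = element.attrs
name = element.name
```

Reaching for .get() by default keeps scraping loops from crashing on the one element that lacks the attribute you expected.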

That’s it! Happy soup-ing :)