Beautifulsoup4 Essentials
I love the utility made possible by beautifulsoup4, aka bs4. I also struggle with the docs every single time I pick it up. Somehow, I find it hard to locate the parts I need, and end up searching for them for a long time. Which is annoying the 5th time around.
So here’s a quick overview of the most essential snippets. You’re welcome, future me.
Installation & Docs
The package is called beautifulsoup4 (install it with pip install beautifulsoup4). You can find the docs at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Import & Reading HTML Data
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmlString, 'html.parser')
The soup variable is the one we’re going to work with.
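To make this concrete, here is a minimal sketch of parsing a small document. The markup in html_string is made up for illustration and is not from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only here so the snippet is self-contained.
html_string = """
<html><head><title>Demo page</title></head>
<body><p>Hello <b>soup</b></p></body></html>
"""

# 'html.parser' is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(html_string, "html.parser")

title_text = soup.title.string   # text inside the <title> tag
first_paragraph = soup.p         # first <p> tag in the document

print(title_text)
print(first_paragraph.get_text())
```

Accessing a tag name as an attribute (soup.title, soup.p) returns the first matching tag, which is handy for quick checks that parsing worked.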
Finding Elements
From the docs: find_all(name, attrs, recursive, string, limit, **kwargs)
# returns a list of Tag objects; each one can be searched just like the soup
result_list = soup.find_all("a")
soup.find_all(id="the-id")
# <a> elements with the CSS class "something"
soup.find_all("a", class_="something")
# you can also pass a function
def function_evaluating_value(href):
    return href == "value"
soup.find_all(href=function_evaluating_value)
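Here is a small runnable sketch of the same lookups against a made-up document, so you can see what each call matches. The markup and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with three links.
html_string = """
<a id="the-id" class="something" href="value">first</a>
<a class="something else" href="/other">second</a>
<a href="value">third</a>
"""
soup = BeautifulSoup(html_string, "html.parser")

# All <a> tags.
all_links = soup.find_all("a")

# By id attribute.
by_id = soup.find_all(id="the-id")

# By CSS class; a tag matches if any one of its classes is "something".
by_class = soup.find_all("a", class_="something")

def href_is_value(href):
    # Receives the href attribute value of each candidate tag.
    return href == "value"

# By a function applied to the href attribute.
by_function = soup.find_all(href=href_is_value)

# limit stops the search after the given number of matches.
first_only = soup.find_all("a", limit=1)

print([a.get_text() for a in by_class])
print([a.get_text() for a in by_function])
```

Note that class_ has a trailing underscore because class is a reserved word in Python.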
Getting Content
# text content of an element, including the text of its descendants
element.get_text()
# get an attribute
# returns None if the attribute is missing
# (the element['href'] notation would raise a KeyError instead)
element.get('href')
# multi-valued attributes such as class come back as lists
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#multi-valued-attributes
element.get('class')
# all attributes, as a dict
element.attrs
# tag name
element.name
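A short sketch that puts the accessors above side by side. The markup is invented for the example, and element is simply the first link found in it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one link with two classes and nested markup.
html_string = '<p><a class="btn primary" href="/home">Go <b>home</b></a></p>'
soup = BeautifulSoup(html_string, "html.parser")
element = soup.find_all("a")[0]

text = element.get_text()        # text of the element and its descendants
href = element.get("href")       # attribute value as a string
missing = element.get("title")   # None, because there is no title attribute
classes = element.get("class")   # multi-valued attribute, returned as a list
attrs = element.attrs            # dict of all attributes
name = element.name              # tag name

# Square-bracket access raises KeyError for a missing attribute.
try:
    element["title"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(text, href, missing, classes, name, raised_key_error)
```

Using get() is the safer default when scraping pages where an attribute may or may not be present.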
That’s it! Happy soup-ing :)
A word from the author
Hi, I'm Vladislav. I work with small teams and bootstrapped founders who need to get their infrastructure right — reliable deployments, less operational risk, and systems that don't fall apart the moment the founder looks away. If that sounds like your situation, here's how we can work together.