Beautifulsoup4 Essentials

I love the utility made possible by beautifulsoup4, aka bs4. I also struggle with the docs every single time I pick it up. Somehow, I find it hard to locate the parts I need, and end up searching for them for a long time. Which is annoying the fifth time around.

So here’s a quick overview of the most essential snippets. You’re welcome, future me.

Installation & Docs

The package is called beautifulsoup4, so pip install beautifulsoup4 gets you going. You can find the docs at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Import & Reading HTML Data

from bs4 import BeautifulSoup
# htmlString holds the HTML document as a string
soup = BeautifulSoup(htmlString, 'html.parser')

The soup variable is the one we’re going to work with.
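For future me, here is a minimal sketch of the whole round trip from string to parsed data. The markup in html_string is made up purely for illustration, and the variable names are my own choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_string = """
<html>
  <head><title>Example page</title></head>
  <body>
    <p id="intro">Hello, <b>soup</b>!</p>
  </body>
</html>
"""

# 'html.parser' is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html_string, "html.parser")

# Tags can be reached as attributes of the soup object.
page_title = soup.title.string

# find() returns the first matching element (or None if nothing matches).
intro_text = soup.find(id="intro").get_text()

print(page_title)
print(intro_text)
```

The same soup object is the starting point for everything in the sections below.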

Finding Elements

From the docs: find_all(name, attrs, recursive, string, limit, **kwargs)
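The less obvious parameters in that signature are limit, recursive and string. Here is a small sketch of how I understand them, run against a made-up nested list. Treat it as an illustration and not as the full story.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list with one nested list inside the second item.
html_string = (
    "<ul>"
    "<li>one</li>"
    "<li>two<ul><li>nested</li></ul></li>"
    "<li>three</li>"
    "</ul>"
)
soup = BeautifulSoup(html_string, "html.parser")
outer_list = soup.find("ul")

# limit stops the search after the given number of matches.
first_two = outer_list.find_all("li", limit=2)

# recursive=False only looks at direct children, so the nested <li> is skipped.
direct_children = outer_list.find_all("li", recursive=False)

# string matches on the text of a tag instead of its name or attributes.
exact_text = soup.find_all("li", string="three")

print(len(first_two), len(direct_children), len(exact_text))
```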

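# find_all returns a list-like result; each element is a Tag
# that you can search again, just like the soup itself
result_list = soup.find_all("a")

# elements with a given id
soup.find_all(id="the-id")

# <a> elements with the CSS class "something"
soup.find_all("a", class_="something")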

# you can also pass a function; it receives the attribute value
def function_evaluating_value(href):
    return href == "value"

soup.find_all(href=function_evaluating_value)
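To see those filters side by side, here is a runnable sketch against a tiny, made-up document. The helper is_absolute is a hypothetical name of mine. It guards against None because, as far as I can tell, the function is also called for tags that lack the attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_string = """
<div>
  <a id="the-id" class="something" href="value">first</a>
  <a class="something other" href="/about">second</a>
  <a href="https://example.com">third</a>
  <span class="something">not a link</span>
</div>
"""
soup = BeautifulSoup(html_string, "html.parser")

# All <a> elements, in document order.
all_links = soup.find_all("a")

# Elements with a specific id.
by_id = soup.find_all(id="the-id")

# <a> elements carrying the CSS class "something";
# a tag with several classes matches if any one of them fits.
by_class = soup.find_all("a", class_="something")


def is_absolute(href):
    # href can be None for tags without the attribute, so guard against it.
    return href is not None and href.startswith("http")


# Filter on an attribute with a function.
absolute_links = soup.find_all(href=is_absolute)

for link in absolute_links:
    print(link["href"])
```

Note that the span is left out of by_class because the tag name "a" is part of the filter.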

Getting Content

# get the text of an element (works on the whole soup too)
element.get_text()

# get an attribute
# returns None if the attribute is missing
# the element['href'] notation would raise a KeyError instead
element.get('href')

# multi-valued attributes such as class come back as lists
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#multi-valued-attributes
element.get('class')

# all attributes, as a dict
element.attrs

# tag name
element.name
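And here are the same accessors in one runnable sketch. The markup is invented for the example, and I picked an attribute (title) that is deliberately missing to show the difference between get and the bracket notation.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_string = '<p><a class="btn primary" href="/start">Get <b>started</b></a></p>'
soup = BeautifulSoup(html_string, "html.parser")
element = soup.find("a")

# Text of the element, including the text of nested tags.
text = element.get_text()

# Attribute access that tolerates missing attributes.
href = element.get("href")
missing = element.get("title")
fallback = element.get("title", "no title")

# class is multi-valued, so it comes back as a list.
classes = element.get("class")

# All attributes as a dict, and the tag name.
attributes = element.attrs
tag_name = element.name

# Bracket notation raises a KeyError for a missing attribute.
try:
    element["title"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(text, href, missing, fallback, classes, tag_name, raised_key_error)
```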

That’s it! Happy soup-ing :)

© 2024 vsupalov.com. All rights reserved.