I love the utility made possible by beautifulsoup4, aka bs4. I also struggle with the docs every single time I pick it up. Somehow I find it hard to locate the parts I need, and I end up searching for them for a long time, which is annoying the fifth time around.
So here’s a quick overview of the most essential snippets. You’re welcome, future me.
Installation & Docs
The package is called beautifulsoup4. You can find the docs at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Import & Reading HTML Data
from bs4 import BeautifulSoup
soup = BeautifulSoup(htmlString, 'html.parser')
The soup variable is the one we’re going to work with.
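To make that concrete, here is a minimal sketch of parsing a string and poking at the result. The HTML content and the names html_string and page_title are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration
html_string = """
<html>
  <head><title>Soup demo</title></head>
  <body>
    <p id="intro">Hello, <b>soup</b>!</p>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra parser package is needed
soup = BeautifulSoup(html_string, "html.parser")

# The first tag of a given name can be reached as an attribute of the soup
page_title = soup.title.string
print(page_title)
```

If the HTML comes from a file or a web request instead, the same call should work once you have the markup as a string.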
Finding Elements
From the docs: find_all(name, attrs, recursive, string, limit, **kwargs)
# each element in the returned list is a Tag object, which you can search just like the soup
result_list = soup.find_all("a")
soup.find_all(id="the-id")
# <a> elements with the CSS class "something"
soup.find_all("a", class_="something")
# you can also pass a function; it receives the attribute value
def function_evaluating_value(href):
    return href == "value"
soup.find_all(href=function_evaluating_value)
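Here is a runnable sketch that pulls the variants above together. The sample HTML and the variable names are my own assumptions, picked so that each query matches something different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links and one paragraph sharing a CSS class
html_string = """
<a id="the-id" class="something" href="value">first</a>
<a class="other" href="https://example.com">second</a>
<p class="something">not a link</p>
"""
soup = BeautifulSoup(html_string, "html.parser")

# By tag name: both <a> tags
all_links = soup.find_all("a")

# By id: only the first <a>
by_id = soup.find_all(id="the-id")

# By tag name and CSS class: the <p> has the class but is not an <a>
by_class = soup.find_all("a", class_="something")

def href_is_value(href):
    # href is None for tags that have no href attribute
    return href == "value"

# By function: called once per tag with that tag's href value
by_function = soup.find_all(href=href_is_value)

# limit stops the search after the given number of matches
first_only = soup.find_all("a", limit=1)

print([a.get_text() for a in all_links])
```

Note that class_ has a trailing underscore because class is a reserved word in Python.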
Getting Content
# get the text of an element (works on the soup itself, too)
element.get_text()
# get an attribute
# returns None if the attribute is missing,
# whereas the element['href'] notation raises a KeyError
element.get('href')
# some attributes, such as class, come back as lists
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#multi-valued-attributes
element.get('class')
# all attributes, as a dict
element.attrs
# tag name
element.name
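And the same calls end to end, as a small sketch on a made-up link. The HTML and the variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a multi-valued class attribute
html_string = '<p>Read <a class="btn primary" href="/docs">the docs</a> first.</p>'
soup = BeautifulSoup(html_string, "html.parser")
element = soup.find("a")

text = element.get_text()       # text inside the <a> tag
full_text = soup.get_text()     # text of the whole document
href = element.get("href")      # attribute value as a string
missing = element.get("title")  # None, because there is no title attribute
classes = element.get("class")  # list, because class is multi-valued

# Bracket access raises instead of returning None
try:
    element["title"]
    raised = False
except KeyError:
    raised = True

print(text, href, classes, element.name)
```

Use .get() when an attribute may be absent, and the bracket notation when a missing attribute should be treated as an error.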
That’s it! Happy soup-ing :)