I love the utility made possible by beautifulsoup4, aka bs4. I also struggle with the docs every single time I pick it up. Somehow, I find it hard to locate the parts I need and end up searching for a long time, which is annoying the fifth time around.
So here’s a quick overview of the most essential snippets. You’re welcome, future me.
Installation & Docs
The package is called beautifulsoup4. You can find the docs at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Import & Reading HTML Data
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmlString, 'html.parser')
The soup variable is the one we’re going to work with.
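For future-me who wants something to paste and run: here is a minimal sketch of the parsing step. The markup in `html_string` is made up for this example, and I'm assuming the built-in `html.parser` is good enough for it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example.
html_string = """
<html>
  <head><title>Soup test</title></head>
  <body>
    <p>Hello <b>world</b></p>
  </body>
</html>
"""

# 'html.parser' is the parser that ships with Python itself.
soup = BeautifulSoup(html_string, "html.parser")

# Tags can be reached as attributes of the soup (first match wins).
title_text = soup.title.string
first_paragraph_text = soup.p.get_text()

print(title_text)
print(first_paragraph_text)
```

Attribute access like `soup.p` only gives you the first matching tag, so it is handy for a quick check but not a replacement for a real search.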
From the docs:
find_all(name, attrs, recursive, string, limit, **kwargs)
# each element in the list is a Tag, which can be searched just like the soup
result_list = soup.find_all("a")

# elements with a given id
soup.find_all(id="the-id")

# <a> elements with the CSS class "something"
soup.find_all("a", class_="something")

# you can also pass a function; it receives the attribute value
# (None for tags that lack the attribute)
def function_evaluating_value(href):
    return href == "value"

soup.find_all(href=function_evaluating_value)
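To see those searches against something concrete, here is a small sketch. The link list in `html_string` and the helper `is_external` are hypothetical, invented just for this example; the `find_all` arguments are the ones from the signature quoted from the docs.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example.
html_string = """
<ul>
  <li><a id="first" class="nav" href="/home">Home</a></li>
  <li><a class="nav external" href="https://example.com">Example</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""
soup = BeautifulSoup(html_string, "html.parser")

# By tag name: every <a> in the document.
all_links = soup.find_all("a")

# By CSS class: matches tags that have "nav" among their classes.
nav_links = soup.find_all("a", class_="nav")

# By id.
by_id = soup.find_all(id="first")

# By function: called with the attribute value of every tag,
# which is None for tags that have no href at all.
def is_external(href):
    return href is not None and href.startswith("https://")

external_links = soup.find_all(href=is_external)

# limit stops the search after that many matches.
first_two = soup.find_all("a", limit=2)

print([a.get_text() for a in nav_links])
```

Note the `None` check in the function: the filter runs on every tag, not just the ones that carry the attribute, so calling a string method without it would blow up on the `<ul>` and `<li>` tags.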
# text content; works on the soup or on an element
element.get_text()

# get an attribute; returns None if the attribute is missing,
# whereas the element['href'] notation raises a KeyError
element.get('href')

# some attributes are lists
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#multi-valued-attributes
element.get('class')

# all attributes, as a dict
element.attrs

# tag name
element.name
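And the same for pulling data out of a single element. Again a sketch: the `<a>` tag in `html_string` is made up for this example, and the variable names are mine.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this example.
html_string = '<a class="btn primary" href="/signup">Sign <b>up</b></a>'
soup = BeautifulSoup(html_string, "html.parser")
element = soup.find_all("a")[0]

# Text of the element, including the text of nested tags.
text = element.get_text()

# .get() returns None for a missing attribute.
href = element.get("href")
missing = element.get("title")

# class is a multi-valued attribute, so it comes back as a list.
classes = element.get("class")

# All attributes as a dict, and the tag name.
attrs = element.attrs
name = element.name

# The square-bracket notation raises instead of returning None.
try:
    element["title"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(text, href, missing, classes, name, raised_key_error)
```

I reach for `.get()` when an attribute is optional and for the bracket notation when a missing attribute should count as a bug.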
That’s it! Happy soup-ing :)