BeautifulSoup

BeautifulSoup is a Python library for parsing HTML and XML documents. It builds a parse tree from page source that can be navigated and searched. It is designed for quickly extracting data, even from malformed markup, through a simple interface.
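As a rough illustration of navigating and searching such a parse tree, the sketch below feeds a deliberately sloppy snippet to the built-in parser. The markup, the id "links", and the variable names are made up for this example; only the bs4 calls themselves come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately sloppy markup: the <li> and <p> tags are never closed.
html = """
<ul id="links">
  <li><a href="/docs">Docs</a>
  <li><a href="/blog">Blog</a>
  <li><a name="top">No href here</a>
</ul>
<p>Trailing paragraph
"""

soup = BeautifulSoup(html, "html.parser")

# Search: collect every <a> tag that carries an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Search with a CSS selector: first <a> inside the list.
first_link = soup.select_one("ul#links a")

# Navigate: walk upward from that tag to its enclosing <ul>.
parent_list = first_link.find_parent("ul")

print(hrefs)                  # ['/docs', '/blog']
print(first_link.get_text())  # Docs
print(parent_list["id"])      # links
```

The exact nesting of the unclosed tags can differ between parsers, so the example only relies on results that do not depend on it.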

Developed by Leonard Richardson, BeautifulSoup is commonly used via the bs4 package. The library is open-source and MIT-licensed. It supports multiple parsing backends, including Python's built-in html.parser, lxml, and html5lib. When no parser is specified, it picks the best one installed, falling back to the built-in parser if neither lxml nor html5lib is available; explicitly requesting a parser that isn't installed raises an error.

Core features include a navigable parse tree, methods such as find, find_all, find_parent, and select for CSS-style queries, and straightforward traversal of parent, siblings, and children. The library is designed to handle imperfect markup and to produce reliable data structures that can be inspected and transformed.

Usage example:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')  # html: a string of page source
    for link in soup.find_all('a', href=True):
        print(link['href'])

BeautifulSoup emphasizes simplicity and readability. For very large HTML documents, choosing a faster parser such as lxml can improve performance, while the built-in parser offers broad compatibility. It is widely used for web scraping, data extraction, and quick prototyping of data pipelines.
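One way to act on the parser trade-off is to make the choice explicit in code. The helper below is a minimal sketch, not part of the library: make_soup and its preferred parameter are hypothetical names, and the sketch assumes you want lxml when it is available and the built-in parser otherwise.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred backend, else use the built-in html.parser.

    make_soup is a hypothetical helper written for this example.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the named backend is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b>")
print(soup.b.get_text())  # world
```

Naming the parser explicitly also keeps results consistent across machines, since different backends can build slightly different trees from the same invalid markup.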