BeautifulSoup4 Module - Python
Last Updated : 18 Feb, 2025
BeautifulSoup4 is a Python library for parsing HTML and XML documents. It simplifies web scraping by letting developers navigate, search and modify the parse tree of a webpage. With BeautifulSoup4, we can extract specific elements, attributes and text from complex web pages using a small set of methods. The library hides much of the complexity of HTML and XML structures, so we can focus on retrieving and processing the data we need. BeautifulSoup4 supports multiple parsers (like Python’s built-in html.parser, lxml, and html5lib), giving us the flexibility to choose the best tool for our task. This makes it useful whether we’re gathering data for research, automating data extraction or building web applications.
For example:
Python
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<title>Test Page</title>
</head>
<body>
<p class="content">Hello, BeautifulSoup!</p>
</body>
</html>
"""
# Parsing the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.title)
Output:
<title>Test Page</title>
Explanation:
- The BeautifulSoup() constructor parses the provided HTML content and returns a BeautifulSoup object.
- Accessing soup.title retrieves the <title> tag from the HTML.
Importing BeautifulSoup4
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Parameters :
- html_doc is a string containing the HTML or XML content to be parsed.
- 'html.parser' is the parser to use. (Alternatives include 'lxml' or 'html5lib'.)
Return Type : Returns a BeautifulSoup object that represents the parsed document.
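Since lxml and html5lib are separate packages that may not be installed, it can help to fall back to the built-in parser. The following is a minimal sketch, not part of the library: make_soup is a hypothetical helper name, and it assumes that requesting an unavailable parser raises bs4.FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, BeautifulSoup!</p>")
print(soup.p.get_text())
```

Different parsers can build slightly different trees for the same markup (for example, lxml wraps a fragment in <html> and <body>), so it is usually safer to pick one parser and use it consistently.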
Parsing HTML with BeautifulSoup4
BeautifulSoup4 converts raw HTML content into a navigable parse tree.
Python
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>Welcome to BeautifulSoup4</h1>
<p>This is a sample page.</p>
</body>
</html>
"""
# Parsing the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')
# Finding the first <h1> tag
header = soup.find('h1')
print(header.text)
Output:
Welcome to BeautifulSoup4
Explanation:
- find() method searches for the first <h1> tag in the document.
- Printing header.text outputs the text content of the <h1> tag.
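When no element matches, find() returns None, so accessing .text on the result would raise an AttributeError. A small sketch of guarding against that is shown here; text_or_default is a hypothetical helper introduced only for illustration.

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><h1>Welcome to BeautifulSoup4</h1></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

header = soup.find("h1")
missing = soup.find("h2")  # the document has no <h2>, so this is None


def text_or_default(tag, default=""):
    """Hypothetical helper: return the tag's text, or a default if tag is None."""
    return tag.get_text(strip=True) if tag is not None else default


print(text_or_default(header))
print(text_or_default(missing, default="(no subtitle)"))
```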
BeautifulSoup4 offers methods like find_all() to extract multiple elements from an HTML document.
Python
from bs4 import BeautifulSoup
html_doc = """
<html>
<head>
<title>List Example</title>
</head>
<body>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</body>
</html>
"""
# Parsing the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')
# Finding all <li> tags
items = soup.find_all('li')
for item in items:
    print(item.text)
Output:
Item 1
Item 2
Item 3
Explanation:
- find_all() method retrieves all <li> elements.
- Iterating through the returned list prints the text of each list item.
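find_all() also accepts filters. The sketch below uses the class_ and limit arguments to narrow the results; the sample markup and class names are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul>
  <li class="fruit">Apple</li>
  <li class="veg">Carrot</li>
  <li class="fruit">Banana</li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Only <li> tags whose class attribute contains "fruit"
fruits = [li.get_text() for li in soup.find_all("li", class_="fruit")]

# At most the first two <li> tags, in document order
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

print(fruits)     # ['Apple', 'Banana']
print(first_two)  # ['Apple', 'Carrot']
```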
Navigating the Parse Tree with BeautifulSoup4
Beyond simple extraction, BeautifulSoup4 allows you to traverse the document structure using attributes like .parent, .children, .next_sibling and .previous_sibling.
Python
from bs4 import BeautifulSoup
html_doc = """
<html>
<body>
<div class="container">
<h1>Title</h1>
<p>Paragraph content</p>
</div>
</body>
</html>
"""
# Parsing the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')
# Accessing the container and navigating to its parent
container = soup.find('div', class_='container')
print("Parent tag:", container.parent.name)
Output:
Parent tag: body
Explanation: .parent attribute returns the immediate parent of the found tag (here the <body> tag that encloses the <div>), allowing you to traverse upwards in the parse tree.
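Moving down and sideways works in a similar way. The sketch below lists the child tags of the container and steps from the <h1> to its sibling <p>. Because .children also yields the whitespace strings between tags, the example keeps only Tag objects, and it uses find_next_sibling() rather than .next_sibling to skip that whitespace.

```python
from bs4 import BeautifulSoup, Tag

html_doc = """
<div class="container">
<h1>Title</h1>
<p>Paragraph content</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
container = soup.find("div", class_="container")

# .children includes text nodes (newlines here), so filter for tags only
child_names = [child.name for child in container.children if isinstance(child, Tag)]
print(child_names)

# Step sideways from <h1> to the next <p> at the same level
h1 = container.find("h1")
sibling = h1.find_next_sibling("p")
print(sibling.get_text())
```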
Using CSS Selectors with BeautifulSoup4
select() method lets you search for elements using CSS selector syntax.
Python
from bs4 import BeautifulSoup
html_doc = """
<html>
<head><title>CSS Selector Example</title></head>
<body>
<div id="main">
<p class="info">Info Paragraph 1</p>
<p class="info">Info Paragraph 2</p>
</div>
</body>
</html>
"""
# Parsing the HTML content
soup = BeautifulSoup(html_doc, 'html.parser')
# Using a CSS selector to find all <p> tags with class "info" inside the div with id "main"
elements = soup.select('div#main p.info')
for element in elements:
    print(element.get_text())
Output:
Info Paragraph 1
Info Paragraph 2
Explanation:
- CSS selector 'div#main p.info' locates all <p> tags with class "info" that are descendants of the <div> with id "main".
- select() method returns a list of matching elements.
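When only the first match is needed, select_one() can be used instead; it returns a single tag, or None if nothing matches. The sketch below also reads attributes from the matched tags, using the same sample markup as above. Note that class is a multi-valued attribute, so it comes back as a list.

```python
from bs4 import BeautifulSoup

html_doc = """
<div id="main">
<p class="info">Info Paragraph 1</p>
<p class="info">Info Paragraph 2</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_info = soup.select_one("div#main p.info")
print(first_info.get_text())      # Info Paragraph 1
print(first_info.get("class"))    # ['info']

main_div = soup.select_one("div#main")
print(main_div["id"])             # main

# No element has class "missing", so the result is None
print(soup.select_one("p.missing"))
```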