Navigation with BeautifulSoup
Last Updated :
24 Dec, 2021
BeautifulSoup is a Python package for parsing HTML and XML documents. It builds a parse tree from the parsed page, which can be used for web scraping, and it works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying that tree.
Installation
This module does not come built-in with Python. To install it, type the command below in the terminal.
pip install bs4
Navigation With BeautifulSoup
The code snippet below is the HTML document that we shall use as a reference while navigating with BeautifulSoup.
Python3
ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p class="title"><b>most viewed courses in GFG,its all free</b></p>
<p class ="prog">Top 5 Popular Programming Languages</p>
<a href="https://www.geeksforgeeks.org/java-programming-examples/" \
class="prog" id="link1">Java</a>
<a href="https://www.geeksforgeeks.org/cc-programs/" class="prog" \
id="link2">c/c++</a>
<a href="https://www.geeksforgeeks.org/python-programming-examples/"\
class="prog" id="link3">Python</a>
<a href="https://https://www.geeksforgeeks.org/introduction-to-javascript/"\
class="prog" id="link4">Javascript</a>
<a href="https://www.geeksforgeeks.org/ruby-programming-language/" \
class="prog" id="link5">Ruby</a>
<p>according to an online survey. </p>
<p class="prog"> Programming Languages</p>
</body></html>
"""
Now let us navigate the above code snippet with BeautifulSoup in Python. The most important components of an HTML document are tags, which may contain other tags and strings (the tag's children). BeautifulSoup provides several ways to iterate over these children; let us look at each case.
Navigating Downwards
Navigating Using Tag Names :
Example 1: To get Head Tag.
Use .head on the BeautifulSoup object to get the head tag of the HTML document.
Syntax : (BeautifulSoup Variable).head
Example 2: To get Title Tag
Use .title to retrieve the title tag of the HTML document held in the BeautifulSoup variable.
Syntax : (BeautifulSoup Variable).title
Code:
Python3
from bs4 import BeautifulSoup

soup = BeautifulSoup(ht_doc, 'html.parser')
# printing the head tag, then the title tag
print(soup.head)
print(soup.title)
Output:
<head><title>Geeks For Geeks</title></head>
<title>Geeks For Geeks</title>
Example 3: To get a specific tag.
We can retrieve a specific tag, such as the first <b> tag inside the body tag.
Syntax : (BeautifulSoup Variable).body.b
Using a tag name as an attribute gets you only the first tag of that name.
Syntax: (BeautifulSoup Variable).(tag name)
By using find_all, we can get all the tags with that name.
Syntax: (BeautifulSoup Variable).find_all(tag name)
Code:
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
# retrieving b tag element
print(soup.body.b)
# retrieving a tag element from BeautifulSoup assigned variable
print(soup.a)
# retrieving all elements tagged with a in ht_doc
print(soup.find_all("a"))
Output:
<b>most viewed courses in GFG,its all free</b>
<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
[<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>,
<a class="prog" href="https://www.geeksforgeeks.org/cc-programs/" id="link2">c/c++</a>,
<a class="prog" href="https://www.geeksforgeeks.org/python-programming-examples/" id="link3">Python</a>,
<a class="prog" href="https://https://www.geeksforgeeks.org/introduction-to-javascript/" id="link4">Javascript</a>,
<a class="prog" href="https://www.geeksforgeeks.org/ruby-programming-language/" id="link5">Ruby</a>]
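As a minimal sketch of the difference between attribute access and find_all, the snippet below uses a small stand-in document (hypothetical, not the ht_doc above) so that it runs on its own. It also shows that asking for a tag name that is not in the document gives None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Small stand-in document, assumed only for this illustration
html = '<body><a id="link1">Java</a><a id="link2">Python</a></body>'
soup = BeautifulSoup(html, 'html.parser')

# attribute access returns only the first <a> tag
first_by_attr = soup.a
# find() returns that same first tag
first_by_find = soup.find("a")
# find_all() returns every <a> tag, in document order
all_links = soup.find_all("a")

print(first_by_attr is first_by_find)
print([a["id"] for a in all_links])
# a tag name that does not occur in the document gives None
print(soup.table)
```

Because a missing tag gives None, it is usually worth checking the result before chaining further attributes onto it.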
Example 4: .contents and .children
We can get a tag's children as a list by using .contents. The .children attribute gives the same children as an iterator instead of a list.
Syntax: (BeautifulSoup Variable).contents
Code:
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
# assigning head tag of BeautifulSoup variable
hTag = soup.head
print(hTag)
# retrieving the children of the head tag as a list
print(hTag.contents)
Output:
<head><title>Geeks For Geeks</title></head>
[<title>Geeks For Geeks</title>]
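The heading of this example also mentions .children. As a rough sketch, the snippet below iterates over .children for a small stand-in document (hypothetical, chosen so that the tag has both tag and string children) and compares the result with .contents.

```python
from bs4 import BeautifulSoup

# Small stand-in document, assumed only for this illustration
html = "<p><b>one</b><i>two</i>three</p>"
soup = BeautifulSoup(html, 'html.parser')
p_tag = soup.p

# .contents is a list of the direct children
print(p_tag.contents)

# .children yields the same direct children one at a time
child_types = []
for child in p_tag.children:
    child_types.append(type(child).__name__)
    print(type(child).__name__, repr(child))
```

Both give the direct children only; a string child shows up as a NavigableString next to the Tag children.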
Example 5: .descendants
The .descendants attribute allows you to iterate over all of a tag’s children recursively: its direct children, the children of its direct children, and so on.
Syntax: (Variable assigned from BeautifulSoup Variable).descendants
Code:
Python3
# embedding html document into BeautifulSoup variable
soup = BeautifulSoup(ht_doc, 'html.parser')
# assigning head element of BeautifulSoup-assigned Variable
htag = soup.head
# iterating through child in descendants of htag variable
for child in htag.descendants:
    print(child)
Output :
<title>Geeks For Geeks</title>
Geeks For Geeks
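To make the difference between direct children and descendants concrete, here is a small sketch that counts both for a stand-in document (hypothetical, not the ht_doc above).

```python
from bs4 import BeautifulSoup

# Small stand-in document, assumed only for this illustration
html = "<div><p>Top <b>five</b> languages</p></div>"
soup = BeautifulSoup(html, 'html.parser')
div = soup.div

# direct children of <div>: only the <p> tag
direct = list(div.children)
# all descendants: <p>, 'Top ', <b>, 'five', ' languages'
nested = list(div.descendants)

print(len(direct), len(nested))
```

The div tag has a single direct child, but walking .descendants also visits every tag and string nested inside it.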
Example 6: .string
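If a tag has only one child, and that child is a NavigableString, the child is made available as .string. If a tag's only child is another tag, and that tag has a .string, the parent tag is considered to have the same .string as its child; this is why the head tag in the code below reports the string of its title tag.
However, if a tag contains more than one thing, it’s not clear what .string should refer to, so .string is defined to be None.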
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
htag = soup.head
print(htag.string)
Output:
Geeks For Geeks
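The None case can be sketched with a small stand-in document (hypothetical, not the ht_doc above) in which one tag has a single string child and another has several children.

```python
from bs4 import BeautifulSoup

# Small stand-in document, assumed only for this illustration
html = "<div><p>only text</p></div><ul><li>a</li><li>b</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

# <p> has exactly one child, a string
print(soup.p.string)
# <div> has exactly one child, the <p> tag, so it reports the same string
print(soup.div.string)
# <ul> has two children, so .string is None
print(soup.ul.string)
# get_text() still joins all the strings inside <ul>
print(soup.ul.get_text())
```

When .string is None, get_text() or the .strings generator can be used to reach the text.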
Example 7: .strings and stripped_strings
If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator.
Python3
soup = BeautifulSoup(ht_doc, 'html.parser')
for string in soup.strings:
    print(repr(string))
Output :
'\n'
'Geeks For Geeks'
'\n'
'\n'
'most viewed courses in GFG,its all free'
'\n'
'Top 5 Popular Programming Languages'
'\n'
'Java'
'\n'
'c/c++'
'\n'
'Python'
'\n'
'Javascript'
'\n'
'Ruby'
'\n'
'according to an online survey. '
'\n'
' Programming Languages'
'\n'
To remove the extra whitespace, use the .stripped_strings generator:
Python3
# embedding HTML document in BeautifulSoup-assigned variable
soup = BeautifulSoup(ht_doc, 'html.parser')
# iterating through string in stripped_strings of
# BeautifulSoup assigned variable
for string in soup.stripped_strings:
    print(repr(string))
Output:
'Geeks For Geeks'
'most viewed courses in GFG,its all free'
'Top 5 Popular Programming Languages'
'Java'
'c/c++'
'Python'
'Javascript'
'Ruby'
'according to an online survey.'
'Programming Languages'
Navigating Upwards Through BeautifulSoup :
If we consider a “family tree” analogy, every tag and every string has a parent: the tag that contains it.
Example 1: .parent
The .parent attribute is used for retrieving an element's parent element.
Syntax : (BeautifulSoup Variable).parent
Code:
Python3
ht_doc = """
<html><head><title>Geeks For Geeks</title></head>
<body>
<p class="title"><b>most viewed courses in GFG,its all free</b></p>
<p class ="prog">Top 5 Popular Programming Languages</p>
<a href="https://www.geeksforgeeks.org/java-programming-examples/"\
class="prog" id="link1">Java</a>
<a href="https://www.geeksforgeeks.org/cc-programs/" class="prog" \
id="link2">c/c++</a>
<a href="https://www.geeksforgeeks.org/python-programming-examples/"\
class="prog" id="link3">Python</a>
<a href="https://https://www.geeksforgeeks.org/introduction-to-javascript/"\
class="prog" id="link4">Javascript</a>
<a href="https://www.geeksforgeeks.org/ruby-programming-language/"\
class="prog" id="link5">Ruby</a>
<p>according to an online survey. </p>
<p class="prog"> Programming Languages</p>
</body></html>
"""
from bs4 import BeautifulSoup

# embedding html document
soup = BeautifulSoup(ht_doc, 'html.parser')
# assigning title tag of BeautifulSoup-assigned variable
Itag = soup.title
# to print parent element in Itag variable
print(Itag.parent)
# the parent of the html tag is the BeautifulSoup object itself
htmlTag = soup.html
print(type(htmlTag.parent))
# the BeautifulSoup object has no parent, so this prints None
print(soup.parent)
Output:
<head><title>Geeks For Geeks</title></head>
<class 'bs4.BeautifulSoup'>
None
Example 2: .parents
To iterate over all of an element's parents, the .parents attribute can be used:
Syntax : (BeautifulSoup Variable).parents
Python3
# embedding html doc into BeautifulSoup
soup = BeautifulSoup(ht_doc, 'html.parser')
# embedding a tag into link variable
link = soup.a
print(link)
# iterating through parent in link variable
for parent in link.parents:
    # printing statement for Parent is empty case
    if parent is None:
        print(parent)
    else:
        print(parent.name)
Output:
<a class="prog" href="https://www.geeksforgeeks.org/java-programming-examples/" id="link1">Java</a>
body
html
[document]
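The same walk can be collected into a list when the path from a tag up to the document is needed. The sketch below uses a small stand-in document (hypothetical, not the ht_doc above) and also shows find_parent, which jumps straight to the nearest ancestor with a given name.

```python
from bs4 import BeautifulSoup

# Small stand-in document, assumed only for this illustration
html = "<html><body><p><a>Java</a></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')
link = soup.a

# names of every ancestor, from the nearest one up to the document
path = [parent.name for parent in link.parents]
print(path)

# nearest ancestor called <body>, without looping by hand
body = link.find_parent("body")
print(body.name)
```

The last entry of the path is the BeautifulSoup object itself, whose name is [document].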
Navigating Sideways With BeautifulSoup
.next_sibling and .previous_sibling are the attributes used for navigating between page elements that are on the same level of the parse tree.
Syntax:
(BeautifulSoup Variable).(tag attribute).next_sibling
(BeautifulSoup Variable).(tag attribute).previous_sibling
Code:
Python3
from bs4 import BeautifulSoup
# <b> and <c> are siblings: both are direct children of <a>
sibling_soup = BeautifulSoup(
    "<a><b>Geeks For Geeks</b><c><strong>The Biggest Online Tutorials "
    "Library, It's all Free</strong></c></a>", 'html.parser')
# to retrieve next sibling of b tag
print(sibling_soup.b.next_sibling)
# for retrieving previous sibling of c tag
print(sibling_soup.c.previous_sibling)
Output:
<c><strong>The Biggest Online Tutorials Library, It's all Free</strong></c>
<b>Geeks For Geeks</b>
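In real documents, tags are usually separated by newlines, so the next sibling of a tag is often a whitespace string rather than the next tag. The following sketch, which uses a small stand-in document (hypothetical, not the ht_doc above), shows this and uses find_next_sibling to skip to the next tag of a given name.

```python
from bs4 import BeautifulSoup

# Small stand-in document with newlines between the tags,
# assumed only for this illustration
html = "<body>\n<a id='link1'>Java</a>\n<a id='link2'>Python</a>\n</body>"
soup = BeautifulSoup(html, 'html.parser')
first = soup.a

# the immediate next sibling is the newline string, not the second <a>
print(repr(first.next_sibling))

# find_next_sibling skips ahead to the next sibling tag named "a"
second = first.find_next_sibling("a")
print(second["id"])

# .next_siblings iterates over every following sibling, strings included
following = list(first.next_siblings)
print(len(following))
```

The sibling example above avoids this because its markup has no whitespace between the b and c tags.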