Scraping Wikipedia table with Pandas using read_html()
Last Updated: 02 Apr, 2025
In this article, we will discuss a particular function named read_html(), which reads HTML tables directly from a webpage into Pandas DataFrames without requiring you to parse the site's HTML yourself. This tool is useful for quickly pulling tables from many websites, although the extracted data usually needs further cleaning. Let's see how we can work with this data.
What is pd.read_html?
Pandas read_html() is one of the easiest ways to scrape web data. The data can further be cleaned as per the requirements of the user.
Syntax of pandas.read_html()
Syntax: pandas.read_html(io)
Here, io can be an HTML string, a file, or a URL.
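As a quick illustration of the string form, the sketch below parses a tiny hand-written table. Note that recent pandas versions prefer that a literal HTML string be wrapped in a StringIO object rather than passed directly:

```python
import pandas as pd
from io import StringIO

html = "<table><tr><th>A</th></tr><tr><td>1</td></tr></table>"

# read_html() returns a list with one DataFrame per <table> found
tables = pd.read_html(StringIO(html))
print(len(tables))        # 1
print(tables[0]['A'][0])  # 1
```
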
Example 1: Using an HTML string
In this example, we store a multiline string (written with triple quotes ''') in a variable called html_string, then pass it to read_html(). The function extracts every HTML table it finds and returns them as a list of DataFrames.
Python
import pandas as pd
html_string = '''
<table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
</table>
'''
df_1 = pd.read_html(html_string)
df_1
Output:
Further, if you want to look at the data types, you can do so by calling the info() method as follows:
df_1[0].info()
Example 2: Reading HTML Data From URL
In this example, let us try to read HTML tables from a web page: the Wikipedia article "Demographics of India". From this page, we want to scrape the table of population distribution by states/union territories, in particular its State/UT and Population columns.
There are about 37 tables on the webpage, and to find a particular one we can use the parameter "match". First, let's count how many tables read_html() finds using the len() function:
Python
import pandas as pd
import numpy as np
dfs = pd.read_html('https://en.wikipedia.org/wiki/Demographics_of_India')
len(dfs)
Output:
37
Example 3: Find the specific table from a webpage
Let us pass the string "Population distribution by states/union territories" to the match parameter, so that only tables whose text matches it are returned.
Python
my_table = pd.read_html('https://en.wikipedia.org/wiki/Demographics_of_India',
                        match='Population distribution by states/union territories')
my_table[0].head()
Example 4: Fetch column data
Next, we extract the column 'State/UT' and then the column 'Population'.
Python
states = my_table[0]['State/UT']
states
Similarly, we get the Population column. Note that on this Wikipedia page the header carries a footnote reference, so the actual column name is 'Population[57]':
Python
population = my_table[0]['Population[57]']
population
Example 5: Merging two columns
Let us store the two columns in a new DataFrame.
Python
df1 = pd.DataFrame({'State': states,
'Population': population})
df1
Example 6: Dropping row data
Let's remove the last row (the "Total" row) with the help of drop() in Pandas.
Python
df1.drop(df1.tail(1).index, inplace=True)
df1
Output:
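As an alternative to drop(), slicing with iloc also removes the last row. The sketch below uses a small made-up DataFrame in place of the scraped one:

```python
import pandas as pd

df1 = pd.DataFrame({'State': ['Uttar Pradesh', 'Maharashtra', 'Total'],
                    'Population': [199812341, 112374333, 312186674]})

# iloc[:-1] keeps every row except the last one
df1 = df1.iloc[:-1]
print(df1['State'].tolist())  # ['Uttar Pradesh', 'Maharashtra']
```

Unlike drop(..., inplace=True), this returns a new DataFrame, which some style guides prefer over in-place mutation.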
Example 7: Data Visualisation of table
Here we use the Matplotlib module to plot the scraped table as a horizontal bar chart.
Python
import matplotlib.pyplot as plt
df1.plot(x='State', y='Population',
         kind='barh', figsize=(10, 8))
plt.show()
Example 8: Writing HTML Tables with Python's Pandas
Here, we create a DataFrame and convert it into an HTML file with to_html(). We also pass a few HTML attributes to make the table look nicer.
Python
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_html('write_html.html', index=False,
border=3, justify='center')
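To check that the written file is a valid HTML table, we can read it back with read_html() and compare it with the original; this round trip is a small sketch, reusing the same file name as above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_html('write_html.html', index=False,
           border=3, justify='center')

# read_html() accepts a file path; take the first (only) table
round_trip = pd.read_html('write_html.html')[0]
print(round_trip.equals(df))  # True
```
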
Output:
Example 9: Error while rendering an HTML page
If the HTML page doesn't contain any tables, read_html() raises a ValueError.
Python
import pandas as pd
import numpy as np
dfs = pd.read_html('https://codebestway.wordpress.com/')
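To handle this gracefully, wrap the call in a try/except block. The sketch below uses an in-memory HTML snippet with no tables (standing in for a table-less web page) so it runs without network access:

```python
import pandas as pd
from io import StringIO

# An HTML document that contains no <table> elements
page = StringIO("<html><body><p>No tables here</p></body></html>")

try:
    pd.read_html(page)
    tables_found = True
except ValueError as err:
    tables_found = False
    print(err)  # pandas reports that no tables were found

print(tables_found)  # False
```
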