Working with Excel files using Pandas
Last Updated: 07 Aug, 2024
Excel sheets are intuitive and user-friendly, which makes them a popular choice for manipulating datasets, even among less technical users. If you want to learn how to manipulate and automate Excel files using Python, you are in the right place.
In this article, you will learn how to use Pandas to work with Excel spreadsheets. We will cover:
- Read Excel File using Pandas in Python
- Installing and Importing Pandas
- Reading multiple Excel sheets using Pandas
- Application of different Pandas functions
Reading Excel File using Pandas in Python
Installing Pandas
To install Pandas in Python, we can use the following command in the command prompt:
pip install pandas
To install Pandas in Anaconda, we can use the following command in Anaconda Terminal:
conda install pandas
Importing Pandas
First, we need to import the Pandas module, which can be done by running the following command:
Python
import pandas as pd
Input File: Let's suppose the Excel file, Example.xlsx, contains the following two sheets.
Sheet 1: [Image: Sheet 1]
Sheet 2: [Image: Sheet 2]
Now we can import the Excel file using the read_excel function in Pandas. The first statement below reads the data from the first sheet of the Excel file and stores it in a pandas DataFrame, represented by the variable df. Note that reading .xlsx files also requires the openpyxl package to be installed.
Python
df = pd.read_excel('Example.xlsx')
print(df)
Output:
   Roll No.  English  Maths  Science
0         1       19     13       17
1         2       14     20       18
2         3       15     18       19
3         4       13     14       14
4         5       17     16       20
5         6       19     13       17
6         7       14     20       18
7         8       15     18       19
8         9       13     14       14
9        10       17     16       20
Loading multiple sheets using the concat() method
If there are multiple sheets in the Excel workbook, read_excel imports data from the first sheet by default. To make a data frame with all the sheets in the workbook, a simple approach is to read each sheet into its own data frame and then concatenate them. The read_excel function takes the arguments sheet_name, which specifies the sheet to read (by position or by name), and index_col, which specifies the column to use as the row labels, as shown below.
Example:
The first two read_excel statements load the two sheets, and the concat statement joins them. The final print statement displays the whole data frame:
Python
file = 'Example.xlsx'
# read each sheet, using the first column (Roll No.) as the index
sheet1 = pd.read_excel(file, sheet_name=0, index_col=0)
sheet2 = pd.read_excel(file, sheet_name=1, index_col=0)
# concatenating both the sheets
newData = pd.concat([sheet1, sheet2])
print(newData)
Output:
          English  Maths  Science
Roll No.
1              19     13       17
2              14     20       18
3              15     18       19
4              13     14       14
5              17     16       20
6              19     13       17
7              14     20       18
8              15     18       19
9              13     14       14
10             17     16       20
1              14     18       20
2              11     19       18
3              12     18       16
4              15     18       19
5              13     14       14
6              14     18       20
7              11     19       18
8              12     18       16
9              15     18       19
10             13     14       14
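Because both sheets use the same roll numbers, the concatenated frame above has repeated index labels. One way to keep track of which sheet each row came from is to pass a dict of frames to pd.concat, so that the dict keys become an outer index level. The sketch below builds the two frames in memory from the sample values shown above, standing in for the dict that pd.read_excel(file, sheet_name=None, index_col=0) would return. The key names Sheet1 and Sheet2 and the level name Sheet are assumptions made for illustration.

```python
import pandas as pd

rolls = pd.Index(range(1, 11), name="Roll No.")

# Stand-in for pd.read_excel("Example.xlsx", sheet_name=None, index_col=0),
# which returns a dict mapping sheet names to DataFrames.
# The sheet names used as keys here are assumed.
sheets = {
    "Sheet1": pd.DataFrame(
        {
            "English": [19, 14, 15, 13, 17] * 2,
            "Maths": [13, 20, 18, 14, 16] * 2,
            "Science": [17, 18, 19, 14, 20] * 2,
        },
        index=rolls,
    ),
    "Sheet2": pd.DataFrame(
        {
            "English": [14, 11, 12, 15, 13] * 2,
            "Maths": [18, 19, 18, 18, 14] * 2,
            "Science": [20, 18, 16, 19, 14] * 2,
        },
        index=rolls,
    ),
}

# Plain concatenation repeats the roll numbers 1..10 in the index.
plain = pd.concat(list(sheets.values()))
print(plain.index.is_unique)     # False

# Passing the dict adds its keys as an outer index level.
combined = pd.concat(sheets, names=["Sheet", "Roll No."])
print(combined.index.is_unique)  # True

print(combined.loc["Sheet2"].head())         # rows from the second sheet only
print(combined.loc[("Sheet2", 3), "Maths"])  # 18
```

If the sheet of origin does not matter, pd.concat(..., ignore_index=True) is another option, at the cost of discarding the roll numbers from the index.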
head() and tail() methods in Pandas
To view the first 5 and the last 5 rows of the data frame, we can run the commands below. The head() and tail() methods also accept a number as an argument to set how many rows to show.
Python
print(newData.head())
print(newData.tail())
Output:
          English  Maths  Science
Roll No.
1              19     13       17
2              14     20       18
3              15     18       19
4              13     14       14
5              17     16       20
          English  Maths  Science
Roll No.
6              14     18       20
7              11     19       18
8              12     18       16
9              15     18       19
10             13     14       14
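As a small illustration of the numeric argument, the sketch below rebuilds the Sheet 1 sample values in memory (so it runs without the Excel file) and asks for a different number of rows. The variable names are chosen only for this example.

```python
import pandas as pd

# Sheet 1 sample values from above, built in memory for illustration.
marks = pd.DataFrame(
    {
        "English": [19, 14, 15, 13, 17] * 2,
        "Maths": [13, 20, 18, 14, 16] * 2,
        "Science": [17, 18, 19, 14, 20] * 2,
    },
    index=pd.Index(range(1, 11), name="Roll No."),
)

first_three = marks.head(3)          # first 3 rows instead of the default 5
last_two = marks.tail(2)             # last 2 rows
all_but_last_three = marks.head(-3)  # negative n: everything except the last 3 rows

print(first_three.index.tolist())    # [1, 2, 3]
print(last_two.index.tolist())       # [9, 10]
print(len(all_but_last_three))       # 7
```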
shape attribute
The shape attribute (it is an attribute, not a method, so it takes no parentheses) can be used to view the number of rows and columns in the data frame as follows:
Python
newData.shape
Output:
(20, 3)
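The returned value is an ordinary tuple of (rows, columns), so it can be unpacked or indexed. A minimal sketch, again using the Sheet 1 sample values built in memory rather than read from the file:

```python
import pandas as pd

# Sheet 1 sample values from above, built in memory for illustration.
marks = pd.DataFrame(
    {
        "English": [19, 14, 15, 13, 17] * 2,
        "Maths": [13, 20, 18, 14, 16] * 2,
        "Science": [17, 18, 19, 14, 20] * 2,
    },
    index=pd.Index(range(1, 11), name="Roll No."),
)

n_rows, n_cols = marks.shape   # unpack the (rows, columns) tuple
print(n_rows, n_cols)          # 10 3
print(len(marks) == n_rows)    # True

# Calling shape with parentheses fails, because a tuple is not callable.
try:
    marks.shape()
except TypeError as exc:
    print("TypeError:", exc)
```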
sort_values() method in Pandas
If a column contains numerical data, we can sort the data frame by that column using the sort_values() method in pandas as follows:
Python
sorted_column = newData.sort_values(['English'], ascending=False)
Now, suppose we want the top 5 rows of the sorted data frame. We can use the head() method here:
Python
sorted_column.head(5)
Output:
          English  Maths  Science
Roll No.
1              19     13       17
6              19     13       17
5              17     16       20
10             17     16       20
3              15     18       19
We can do that with any numerical column of the data frame as shown below:
Python
newData['Maths'].head()
Output:
Roll No.
1    13
2    20
3    18
4    14
5    16
Name: Maths, dtype: int64
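sort_values() also accepts several columns, which is useful for breaking ties. The sketch below uses the first five rows of Sheet 2 (built in memory for illustration), sorts by Maths in descending order, and uses English to order the rows that share the same Maths mark. It also shows nlargest() as a shortcut for "sort, then take the top n". The variable names are chosen only for this example.

```python
import pandas as pd

# First five rows of Sheet 2 from above, built in memory for illustration.
marks = pd.DataFrame(
    {
        "English": [14, 11, 12, 15, 13],
        "Maths": [18, 19, 18, 18, 14],
        "Science": [20, 18, 16, 19, 14],
    },
    index=pd.Index(range(1, 6), name="Roll No."),
)

# Sort by Maths (highest first); ties in Maths are ordered by English (highest first).
ranked = marks.sort_values(["Maths", "English"], ascending=[False, False])
print(ranked.index.tolist())       # [2, 4, 1, 3, 5]

# nlargest(n, column) returns the n rows with the largest values in that column.
top_science = marks.nlargest(3, "Science")
print(top_science.index.tolist())  # [1, 4, 2]
```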
Pandas describe() method
Now, suppose our data is mostly numerical. We can get statistical information such as the mean, max, and min of each column using the describe() method as shown below:
Python
newData.describe()
Output:
        English      Maths    Science
count  20.00000  20.000000  20.000000
mean   14.30000  16.800000  17.500000
std     2.29645   2.330575   2.164304
min    11.00000  13.000000  14.000000
25%    13.00000  14.000000  16.000000
50%    14.00000  18.000000  18.000000
75%    15.00000  18.000000  19.000000
max    19.00000  20.000000  20.000000
Each statistic can also be computed individually for any numerical column using the corresponding method, for example:
Python
newData['English'].mean()
Output:
14.3
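When only a few statistics are needed, the agg() method can compute a chosen set for every column at once. A minimal sketch, using the combined 20-row sample data built in memory (the roll-number index is left out here because it is not needed for the statistics):

```python
import pandas as pd

# All 20 sample rows (Sheet 1 followed by Sheet 2), built in memory for illustration.
marks = pd.DataFrame(
    {
        "English": [19, 14, 15, 13, 17] * 2 + [14, 11, 12, 15, 13] * 2,
        "Maths": [13, 20, 18, 14, 16] * 2 + [18, 19, 18, 18, 14] * 2,
        "Science": [17, 18, 19, 14, 20] * 2 + [20, 18, 16, 19, 14] * 2,
    }
)

# One row per requested statistic, one column per subject.
summary = marks.agg(["mean", "median", "min", "max"])
print(summary)

# Individual methods return a single number for a single column.
english_range = marks["English"].max() - marks["English"].min()
print(english_range)  # 8
```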
Other statistical information can also be calculated using the respective methods. Like in Excel, formulas can also be applied, and calculated columns can be created as follows:
Python
# parentheses let the expression continue onto the next line
newData['Total Marks'] = (
    newData["English"] + newData["Maths"] + newData["Science"]
)
newData['Total Marks'].head()
Output:
Roll No.
1    49
2    52
3    52
4    41
5    53
Name: Total Marks, dtype: int64
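With many subject columns, adding them one by one becomes tedious. A possible alternative is to select the columns and aggregate across each row with axis=1. The sketch below uses the first five rows of Sheet 1 built in memory; the Average column and the variable names are additions made only for this illustration.

```python
import pandas as pd

# First five rows of Sheet 1 from above, built in memory for illustration.
marks = pd.DataFrame(
    {
        "English": [19, 14, 15, 13, 17],
        "Maths": [13, 20, 18, 14, 16],
        "Science": [17, 18, 19, 14, 20],
    },
    index=pd.Index(range(1, 6), name="Roll No."),
)

subjects = ["English", "Maths", "Science"]

# axis=1 aggregates across the columns of each row.
marks["Total Marks"] = marks[subjects].sum(axis=1)
marks["Average"] = marks[subjects].mean(axis=1).round(2)  # hypothetical extra column

print(marks["Total Marks"].tolist())  # [49, 52, 52, 41, 53]
print(marks)
```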
After operating on the data in the data frame, we can export the data back to an Excel file using the method to_excel. For this, we need to specify an output Excel file where the transformed data is to be written, as shown below:
Python
newData.to_excel('Output File.xlsx')
Output:
[Image: Final Sheet]
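to_excel writes a single sheet per call when given a file name. To write several data frames into one workbook, one option is pd.ExcelWriter. The sketch below is a round trip under stated assumptions: it writes to an in-memory buffer instead of a file on disk, uses the openpyxl engine (which must be installed), and the sheet names Sheet1 and Sheet2 and the helper round_trip are hypothetical names chosen for this example.

```python
import io

import pandas as pd

rolls = pd.Index(range(1, 6), name="Roll No.")
frames = {
    "Sheet1": pd.DataFrame(
        {
            "English": [19, 14, 15, 13, 17],
            "Maths": [13, 20, 18, 14, 16],
            "Science": [17, 18, 19, 14, 20],
        },
        index=rolls,
    ),
    "Sheet2": pd.DataFrame(
        {
            "English": [14, 11, 12, 15, 13],
            "Maths": [18, 19, 18, 18, 14],
            "Science": [20, 18, 16, 19, 14],
        },
        index=rolls,
    ),
}


def round_trip(frames):
    """Write each frame to its own sheet, then read every sheet back."""
    buffer = io.BytesIO()  # in-memory stand-in for a file such as 'Output File.xlsx'
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name)
    buffer.seek(0)
    # sheet_name=None returns a dict mapping sheet names to DataFrames.
    return pd.read_excel(buffer, sheet_name=None, index_col=0, engine="openpyxl")


try:
    restored = round_trip(frames)
except ImportError:
    restored = None  # openpyxl is not installed in this environment

if restored is not None:
    print(list(restored))  # sheet names read back from the workbook
    print(restored["Sheet2"].head())
```

To write to disk instead, the buffer can be replaced with a path such as 'Output File.xlsx'.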