Using Excel With Pandas
Excel is one of the most popular and widely-used data tools; it's hard to find an organization that
doesn't work with it in some way. From analysts, to sales VPs, to CEOs, various professionals use
Excel for both quick stats and serious data crunching.
With Excel being so pervasive, data professionals must be familiar with it. You'll also want a tool
that can easily read and write Excel files — pandas is perfect for this.
Pandas has excellent methods for reading all kinds of data from Excel files. You can also export
your results from pandas back to Excel, if that's preferred by your intended audience. Pandas is also
great for other routine data analysis tasks, and it is better than Excel at automating data processing,
including the processing of Excel files themselves.
In this tutorial, we are going to show you how to work with Excel files in pandas. We will cover
reading data from Excel files, exploring and visualizing it, computing summary statistics, reading
files that have no header or need only a subset of columns, applying formulas to columns, building
pivot tables, and exporting results back to Excel.
Note that this tutorial does not provide a deep dive into pandas. To explore pandas more, check out
our course.
System prerequisites
We will use Python 3 and Jupyter Notebook to demonstrate the code in this tutorial.
In addition to Python and Jupyter Notebook, you will need several Python modules, including pandas,
matplotlib, and XlsxWriter, which are used later in this tutorial.
There are multiple ways to get set up with all the modules. We cover three of the most common
scenarios below.
If you have Python installed via the Anaconda package manager, you can install the required modules
using the command conda install. For example, to install pandas, you would execute the
command conda install pandas.
If you already have a regular, non-Anaconda Python installed on the computer, you can install the
required modules using pip. Open your command line program and execute the command pip install
<module name> to install a module, replacing <module name> with the actual name of the
module you are trying to install. For example, to install pandas, you would execute the command
pip install pandas.
If you don't have Python already installed, you should get it through the Anaconda package manager.
Anaconda provides installers for Windows, Mac, and Linux computers. If you choose the full
installer, you will get all the modules you need, along with Python and pandas within a single
package. This is the easiest and fastest way to get started.
Our Excel file has three sheets: '1900s,' '2000s,' and '2010s.' Each sheet has data for movies from
those years.
We will use this data set to find the ratings distribution for the movies, visualize the movies with
the highest ratings and net earnings, and calculate statistical information about the movies. We will
be analyzing and exploring this data using pandas, thus demonstrating pandas' capabilities for
working with Excel data.
import pandas as pd
We then use the pandas' read_excel method to read in data from the Excel file. The easiest way to
call this method is to pass the file name. If no sheet name is specified then it will read the first sheet
in the index (as shown below).
excel_file = 'movies.xls'
movies = pd.read_excel(excel_file)
Here, the read_excel method read the data from the Excel file into a pandas DataFrame object.
Pandas defaults to storing data in DataFrames. We then stored this DataFrame into a variable called
movies.
Pandas has a built-in DataFrame.head() method that we can use to easily display the first few rows
of our DataFrame. If no argument is passed, it will display the first five rows. If a number is passed,
it will display that many rows from the top.
movies.head()
   Title                                             Year  Genres               Language  Country
0  Intolerance: Love's Struggle Throughout the Ages  1916  Drama|History|War    NaN       USA
1  Over the Hill to the Poorhouse                    1920  Crime|Drama          NaN       USA
2  The Big Parade                                    1925  Drama|Romance|War    NaN       USA
3  Metropolis                                        1927  Drama|Sci-Fi         German    Germany
4  Pandora's Box                                     1929  Crime|Drama|Romance  German    Germany

   Content Rating  Duration  Aspect Ratio  Budget     Gross Earnings
0  Not Rated       123       1.33          385907.0   NaN
1  NaN             110       1.33          100000.0   3000000.0
2  Not Rated       151       1.33          245000.0   NaN
3  Not Rated       145       1.33          6000000.0  26435.0
4  Not Rated       110       1.33          NaN        9950.0

...

   Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
0  436                       22                        9.0                       481                          691
1  2                         2                         0.0                       4                            0
2  81                        12                        6.0                       108                          226
3  136                       23                        18.0                      203                          12000
4  426                       20                        3.0                       455                          926

   Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score
0  1                      10718       88                69.0                8.0
1  1                      5           1                 1.0                 4.8
2  0                      4849        45                48.0                8.3
3  1                      111841      413               260.0               8.3
4  1                      7431        84                71.0                8.0
5 rows × 25 columns
Excel files quite often have multiple sheets, and the ability to read a specific sheet or all of them is
very important. To make this easy, the pandas read_excel method takes an argument called
sheetname that tells pandas which sheet to read the data from (newer pandas releases spell this
argument sheet_name). For this, you can use either the sheet name or the sheet number. Sheet
numbers start with zero. If the sheetname argument is not given, it defaults to zero and pandas will
import the first sheet.
By default, pandas will automatically assign a numeric index or row label starting with zero. You
may want to leave the default index as such if your data doesn't have a column with unique values
that can serve as a better index. If there is a column that you feel would serve as a better index,
you can override the default behavior by setting the index_col argument to that column. It takes a
numeric value for setting a single column as the index or a list of numeric values for creating a
multi-index.
In the code below, we choose the first column, 'Title', as the index by passing zero to the
index_col argument (index_col=0).
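A minimal sketch of reading one sheet with a chosen index column. It builds a small stand-in workbook instead of using movies.xls, so the file name, sheet contents, and film titles are made up for illustration. It also assumes a recent pandas release, where the argument is spelled sheet_name, and that the openpyxl engine is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for movies.xls: two small sheets.
sheet_1900s = pd.DataFrame({"Title": ["Film A", "Film B"], "Year": [1916, 1920]})
sheet_2000s = pd.DataFrame({"Title": ["Film C"], "Year": [2001]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movies_demo.xlsx")

    # Write both sheets into one workbook.
    with pd.ExcelWriter(path) as writer:
        sheet_1900s.to_excel(writer, sheet_name="1900s", index=False)
        sheet_2000s.to_excel(writer, sheet_name="2000s", index=False)

    # Read the first sheet by position and use column 0 ('Title') as the index.
    movies_sheet1 = pd.read_excel(path, sheet_name=0, index_col=0)

    # Read another sheet by name, again indexed by its first column.
    movies_sheet2 = pd.read_excel(path, sheet_name="2000s", index_col=0)

print(movies_sheet1.head())
```

Passing a position or a name to sheet_name is interchangeable here; the position is handy when sheet names are not known in advance.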
5 rows × 24 columns
As you noticed above, our Excel data file has three sheets. We already read the first sheet into a
DataFrame above. Now, using the same syntax, we will read in the remaining two sheets too.
5 rows × 24 columns
5 rows × 24 columns
Since all three sheets have similar data but for different records (movies), we will create a single
DataFrame from the three DataFrames we created above. We will use the pandas concat method
for this, passing in the three DataFrames we just created and assigning the result to a new
DataFrame object, movies. By keeping the DataFrame name the same as before, we are overwriting
the previously created DataFrame.
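The concatenation step can be sketched as follows. The three small DataFrames are hypothetical stand-ins for the per-sheet DataFrames, and their names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the three per-sheet DataFrames.
movies_sheet1 = pd.DataFrame({"Year": [1916, 1920]}, index=["Film A", "Film B"])
movies_sheet2 = pd.DataFrame({"Year": [2001]}, index=["Film C"])
movies_sheet3 = pd.DataFrame({"Year": [2012, 2015]}, index=["Film D", "Film E"])

# Stack the rows of all three DataFrames into a single DataFrame.
movies = pd.concat([movies_sheet1, movies_sheet2, movies_sheet3])

print(movies)
```

concat keeps the original row labels of each input, so the combined index is simply the three indexes one after another.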
We can check that this concatenation worked by looking at the number of rows in the combined
DataFrame. The shape attribute gives us the number of rows and columns.
movies.shape
(5042, 24)
We can also use the ExcelFile class to work with multiple sheets from the same Excel file. We first
wrap the Excel file using ExcelFile and then read each sheet from it with its parse method (the
wrapped file can also be passed to read_excel).
# Open the workbook once and reuse it for every sheet
xlsx = pd.ExcelFile(excel_file)
movies_sheets = []
for sheet in xlsx.sheet_names:
    # Parse each sheet into its own DataFrame
    movies_sheets.append(xlsx.parse(sheet))
# Combine the per-sheet DataFrames into one
movies = pd.concat(movies_sheets)
If you are reading an Excel file with a lot of sheets and are creating a lot of DataFrames, ExcelFile
is more convenient and efficient in comparison to read_excel. With ExcelFile, you only need to
pass the Excel file once, and then you can use it to get the DataFrames. When using read_excel,
you pass the Excel file every time and hence the file is loaded again for every sheet. This can be a
huge performance drag if the Excel file has many sheets with a large number of rows.
We already introduced the head method in the previous section, which displays a few rows from the
top of the DataFrame. Let's look at a few more methods that come in handy while exploring the data
set.
We can use the shape attribute to find out the number of rows and columns in the DataFrame.
movies.shape
(5042, 25)
This tells us our Excel file has 5042 records and 25 columns. This can be useful for reporting the
number of records and columns and comparing them with the source data set.
We can use the tail method to view the bottom rows. If no parameter is passed, only the bottom
five rows are returned.
movies.tail()
      Title                    Year  Genres                                     Language  Country
1599  War & Peace              NaN   Drama|History|Romance|War                  English   UK
1600  Wings                    NaN   Comedy|Drama                               English   USA
1601  Wolf Creek               NaN   Drama|Horror|Thriller                      English   Australia
1602  Wuthering Heights        NaN   Drama|Romance                              English   UK
1603  Yu-Gi-Oh! Duel Monsters  NaN   Action|Adventure|Animation|Family|Fantasy  Japanese  Japan

      Content Rating  Duration  Aspect Ratio  Budget  Gross Earnings
1599  TV-14           NaN       16.00         NaN     NaN
1600  NaN             30.0      1.33          NaN     NaN
1601  NaN             NaN       2.00          NaN     NaN
1602  NaN             142.0     NaN           NaN     NaN
1603  NaN             24.0      NaN           NaN     NaN

...

      Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
1599  1000.0                    888.0                     502.0                     4528                         11000
1600  685.0                     511.0                     424.0                     1884                         1000
1601  511.0                     457.0                     206.0                     1617                         954
1602  27000.0                   698.0                     427.0                     29196                        0
1603  0.0                       NaN                       NaN                       0                            124

      Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score
1599  1.0                    9277        44.0              10.0                8.2
1600  5.0                    7646        56.0              19.0                7.3
1601  0.0                    726         6.0               2.0                 7.1
1602  2.0                    6053        33.0              9.0                 7.7
1603  0.0                    12417       51.0              6.0                 7.0
5 rows × 25 columns
In Excel, you're able to sort a sheet based on the values in one or more columns. In pandas, you can
do the same thing with the sort_values method. For example, let's sort our movies DataFrame
based on the Gross Earnings column.
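A minimal sketch of this sort, using a small made-up DataFrame in place of the movies data. The name sorted_by_gross matches the variable used in the rest of this section; the titles are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real data comes from the Excel file.
movies = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Gross Earnings": [3000000.0, 26435.0, 1408975.0],
})

# Sort from highest to lowest gross earnings.
sorted_by_gross = movies.sort_values(["Gross Earnings"], ascending=False)

print(sorted_by_gross)
```

sort_values returns a new DataFrame and leaves movies unchanged, which is why the result is stored under a separate name.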
Since we have the data sorted by values in a column, we can do a few interesting things with it. For
example, we can display the top 10 movies by Gross Earnings.
sorted_by_gross["Gross Earnings"].head(10)
1867 760505847.0
1027 658672302.0
1263 652177271.0
610 623279547.0
611 623279547.0
1774 533316061.0
1281 474544677.0
226 460935665.0
1183 458991599.0
618 448130642.0
Name: Gross Earnings, dtype: float64
We can also create a plot for the top 10 movies by Gross Earnings. Pandas makes it easy to visualize
your data with plots and charts through matplotlib, a popular data visualization library. With a
couple of lines of code, you can start plotting. Moreover, matplotlib plots work well inside Jupyter
Notebooks since you can display the plots right under the code.
First, we import the matplotlib module and set matplotlib to display the plots right in the Jupyter
Notebook.
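A minimal sketch of that setup, followed by a tiny horizontal bar plot to confirm it works. The earnings figures are made up, and the Agg backend is an assumption that lets the sketch run outside a notebook.

```python
import matplotlib

# Non-interactive backend so the sketch also runs outside a notebook.
# Inside Jupyter, run the magic command `%matplotlib inline` instead.
matplotlib.use("Agg")

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical earnings figures, for illustration only.
earnings = pd.Series([300.0, 250.0, 100.0], index=["Film A", "Film B", "Film C"])

# Series.plot returns the matplotlib Axes it drew on.
ax = earnings.plot(kind="barh")
ax.set_xlabel("Gross Earnings")

# In a notebook, plt.show() would render the figure at this point.
```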
We will draw a bar plot where each bar will represent one of the top 10 movies. We can do this by
calling the plot method and setting the argument kind to barh. This tells matplotlib to draw a
horizontal bar plot.
sorted_by_gross['Gross Earnings'].head(10).plot(kind="barh")
plt.show()
Let's create a histogram of IMDB Scores to check the distribution of IMDB Scores across all movies.
Histograms are a good way to visualize the distribution of a data set. We use the plot method on the
IMDB Scores series from our movies DataFrame and pass it the argument kind="hist".
movies['IMDB Score'].plot(kind="hist")
plt.show()
This data visualization suggests that most of the IMDB Scores fall between six and eight.
movies.describe()
       Year         Duration     Aspect Ratio  Budget
count  4935.000000  5028.000000  4714.000000   4.551000e+03
mean   2002.470517  107.201074   2.220403      3.975262e+07
std    12.474599    25.197441    1.385113      2.061149e+08
min    1916.000000  7.000000     1.180000      2.180000e+02
25%    1999.000000  93.000000    1.850000      6.000000e+06
50%    2005.000000  103.000000   2.350000      2.000000e+07
75%    2011.000000  118.000000   2.350000      4.500000e+07
max    2016.000000  511.000000   16.000000     1.221550e+10

       Gross Earnings  Facebook Likes - Director  Facebook Likes - Actor 1  Facebook Likes - Actor 2
count  4.159000e+03    4938.000000                5035.000000               5029.000000
mean   4.846841e+07    686.621709                 6561.323932               1652.080533
std    6.845299e+07    2813.602405                15021.977635              4042.774685
min    1.620000e+02    0.000000                   0.000000                  0.000000
25%    5.340988e+06    7.000000                   614.500000                281.000000
50%    2.551750e+07    49.000000                  988.000000                595.000000
75%    6.230944e+07    194.750000                 11000.000000              918.000000
max    7.605058e+08    23000.000000               640000.000000             137000.000000

       Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie  Facenumber in posters
count  5020.000000               5042.000000                  5042.000000             5029.000000
mean   645.009761                9700.959143                  7527.457160             1.371446
std    1665.041728               18165.101925                 19322.070537            2.013683
min    0.000000                  0.000000                     0.000000                0.000000
25%    133.000000                1411.250000                  0.000000                0.000000
50%    371.500000                3091.000000                  166.000000              1.000000
75%    636.000000                13758.750000                 3000.000000             2.000000
max    23000.000000              656730.000000                349000.000000           43.000000

       User Votes    Reviews by Users  Reviews by Crtiics  IMDB Score
count  5.042000e+03  5022.000000       4993.000000         5042.000000
mean   8.368475e+04  272.770808        140.194272          6.442007
std    1.384940e+05  377.982886        121.601675          1.125189
min    5.000000e+00  1.000000          1.000000            1.600000
25%    8.599250e+03  65.000000         50.000000           5.800000
50%    3.437100e+04  156.000000        110.000000          6.600000
75%    9.634700e+04  326.000000        195.000000          7.200000
max    1.689764e+06  5060.000000       813.000000          9.500000
The describe method displays the following information for each of the columns: the count of
non-null values, the mean, the standard deviation, the minimum and maximum, and the 25th, 50th
(median), and 75th percentiles.
Please note that this information will be calculated only for the numeric columns.
We can also use the corresponding method to access this information one statistic at a time. For
example, to get the mean of a particular column, you can use the mean method on that column.
movies["Gross Earnings"].mean()
48468407.526809327
Just like mean, there is a method available for each of the statistics we want to access.
You can read about these methods in our free pandas cheat sheet.
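As a small illustration, here are a few of those per-statistic methods applied to a single column. The values are made up, and one entry is deliberately missing to show that missing values are skipped.

```python
import pandas as pd

# Hypothetical sample column with one missing value.
scores = pd.Series([6.0, 7.0, 8.0, None], name="IMDB Score")

print(scores.count())              # number of non-null values
print(scores.mean())               # arithmetic mean, ignoring the missing value
print(scores.median())             # 50th percentile
print(scores.min(), scores.max())  # smallest and largest values
print(scores.std())                # sample standard deviation
```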
For example, look at the top few rows of this Excel file.
This file obviously has no header, and the first four rows are not actual records and hence should
not be read in. We can tell read_excel there is no header by setting the header argument to None,
and we can skip the first four rows by setting the skiprows argument to four.
5 rows × 25 columns
We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one
can combine different options in a single read statement. To skip rows at the bottom of the sheet, you
can use the skip_footer option (spelled skipfooter in newer pandas releases), which works just like
skiprows, the only difference being that the rows are counted from the bottom upwards.
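A minimal sketch that combines these options on a made-up sheet. The file name and cell contents are hypothetical, only two junk rows are skipped at the top to keep the sample short, and the sketch assumes a recent pandas release (skipfooter spelling) with the openpyxl engine installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sheet: two junk rows on top, two data rows, one footer row.
raw = pd.DataFrame([
    ["Report generated for demo", None],
    ["Internal use only", None],
    ["Film A", 1916],
    ["Film B", 1920],
    ["End of report", None],
])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "no_header_demo.xlsx")
    # Write the cells exactly as they are, without a header row or index.
    raw.to_excel(path, header=False, index=False)

    movies_skip_rows = pd.read_excel(
        path,
        header=None,   # no row holds column names
        skiprows=2,    # ignore the first two rows
        skipfooter=1,  # ignore the last row
    )

print(movies_skip_rows)
```

Because header is None, pandas labels the columns 0 and 1 rather than taking names from the sheet.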
The column names in the previous DataFrame are numeric and were assigned by pandas by default.
We can rename the columns to descriptive names by assigning a list of column names to the columns
attribute of the DataFrame.
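A minimal sketch of the renaming step, using a two-column stand-in for the DataFrame that was read without a header. The data and the two names are assumptions for illustration; the real sheet needs one name per column, 25 in total.

```python
import pandas as pd

# Hypothetical DataFrame read without a header: columns are labelled 0 and 1.
movies_skip_rows = pd.DataFrame([["Film A", 1916], ["Film B", 1920]])

# Assign one descriptive name per column, in order.
movies_skip_rows.columns = ["Title", "Year"]

print(movies_skip_rows.head())
```

The list must have exactly as many names as the DataFrame has columns; otherwise pandas raises a ValueError.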
5 rows × 25 columns
Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a
subset of columns.
Alternatively, you can pass in a list of numbers, which will let you import columns at particular
indexes.
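A minimal sketch of importing columns by position. It uses the usecols argument of read_excel from recent pandas releases, which may differ from the argument name used when this tutorial was written. The sample workbook is made up, and the openpyxl engine is assumed to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample sheet with three columns.
sample = pd.DataFrame({
    "Title": ["Film A", "Film B"],
    "Year": [1916, 1920],
    "Budget": [385907.0, 100000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "subset_demo.xlsx")
    sample.to_excel(path, index=False)

    # Keep only the columns at positions 0 and 2.
    movies_subset_columns = pd.read_excel(path, usecols=[0, 2])

print(movies_subset_columns)
```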
Above, we used pandas to create a new column called Net Earnings and populated it with the
difference of Gross Earnings and Budget. It's worth noting the difference here in how formulas are
treated in Excel versus pandas. In Excel, a formula lives in the cell and updates when the data
changes. With Python, the calculation happens once and the resulting values are stored, so if Gross
Earnings for one movie were manually changed, Net Earnings would not be updated.
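The calculation described above can be sketched as follows, on a small made-up DataFrame. The last step changes an input value to show that the stored result does not follow it.

```python
import pandas as pd

# Hypothetical sample; the real data comes from the Excel file.
movies = pd.DataFrame({
    "Title": ["Film A", "Film B"],
    "Budget": [100000.0, 6000000.0],
    "Gross Earnings": [3000000.0, 26435.0],
})

# Column-wise subtraction: one result per row.
movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]

# Changing an input afterwards does not touch the stored result.
movies.loc[0, "Gross Earnings"] = 0.0

print(movies)
```

To refresh Net Earnings after such a change, the subtraction has to be run again.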
Let's use the sort_values method to sort the data by the new column we created and visualize the
top 10 movies by Net Earnings.
We need to first identify the column or columns that will serve as the index, and the column(s) on
which the summarizing formula will be applied. Let's start small, by choosing Year as the index
column and Gross Earnings as the summarization column and creating a separate DataFrame from
this data.
We now call pivot_table on this subset of data. The method pivot_table takes a parameter
index. As mentioned, we want to use Year as the index.
earnings_by_year = movies_subset.pivot_table(index=['Year'])
earnings_by_year.head()
        Gross Earnings
Year
1916.0             NaN
1920.0       3000000.0
1925.0             NaN
1927.0         26435.0
1929.0       1408975.0
This gave us a pivot table with grouping on Year and summarization on the mean of Gross Earnings,
since mean is the default aggregation that pivot_table applies. Notice that we didn't need to specify
the Gross Earnings column explicitly, as pandas automatically identified it as the values on which
summarization should be applied.
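A small sketch of how the aggregation in pivot_table can be chosen. With no aggfunc argument pandas averages each group; passing aggfunc="sum" gives totals instead. The numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: two movies share the year 1929.
movies_subset = pd.DataFrame({
    "Year": [1927, 1929, 1929],
    "Gross Earnings": [100.0, 200.0, 400.0],
})

# Default aggregation: the mean of each group.
mean_by_year = movies_subset.pivot_table(index=["Year"])

# Explicit aggregation: the total of each group.
sum_by_year = movies_subset.pivot_table(index=["Year"], aggfunc="sum")

print(mean_by_year)
print(sum_by_year)
```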
We can use this pivot table to create some data visualizations. We can call the plot method on the
DataFrame to create a line plot and call the show method to display the plot in the notebook.
earnings_by_year.plot()
plt.show()
We saw how to pivot with a single column as the index. Things get more interesting when we use
multiple columns. Let's create another DataFrame subset, but this time we will choose the
columns Country, Language, and Gross Earnings.
We will use the columns Country and Language as the index for the pivot table. We will use Gross
Earnings as the summarization column; however, we do not need to specify this explicitly, as we saw
earlier.
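A minimal sketch of a pivot table with two index columns, on made-up data. The name earnings_by_co_lang matches the variable plotted in this section; the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample with two grouping columns and one value column.
movies_subset = pd.DataFrame({
    "Country": ["USA", "USA", "Germany"],
    "Language": ["English", "English", "German"],
    "Gross Earnings": [100.0, 300.0, 50.0],
})

# Group by every (Country, Language) pair; the remaining numeric
# column is summarized with the default aggregation.
earnings_by_co_lang = movies_subset.pivot_table(index=["Country", "Language"])

print(earnings_by_co_lang)
```

The result has a two-level index, so a single row is selected with a tuple such as ("USA", "English").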
Let's visualize this pivot table with a bar plot. Since there are still a few hundred records in this
pivot table, we will plot just a few of them.
earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20,8))
plt.show()
Exporting the results to Excel
If you're going to be working with colleagues who use Excel, saving Excel files out of pandas is
important. You can export or write a pandas DataFrame to an Excel file using the pandas to_excel
method. Pandas relies on a separate Python module to write Excel files: xlwt for legacy .xls files,
and openpyxl or XlsxWriter for .xlsx files. The to_excel method is called on the DataFrame we want
to export. We also need to pass a filename to which this DataFrame will be written.
movies.to_excel('output.xlsx')
By default, the index is also saved to the output file. However, sometimes the index doesn't provide
any useful information. For example, the movies DataFrame has a numeric auto-increment index
that was not part of the original Excel data.
movies.head()
   Title                                             Year    Genres               Language  Country
0  Intolerance: Love's Struggle Throughout the Ages  1916.0  Drama|History|War    NaN       USA
1  Over the Hill to the Poorhouse                    1920.0  Crime|Drama          NaN       USA
2  The Big Parade                                    1925.0  Drama|Romance|War    NaN       USA
3  Metropolis                                        1927.0  Drama|Sci-Fi         German    Germany
4  Pandora's Box                                     1929.0  Crime|Drama|Romance  German    Germany

   Content Rating  Duration  Aspect Ratio  Budget     Gross Earnings
0  Not Rated       123.0     1.33          385907.0   NaN
1  NaN             110.0     1.33          100000.0   3000000.0
2  Not Rated       151.0     1.33          245000.0   NaN
3  Not Rated       145.0     1.33          6000000.0  26435.0
4  Not Rated       110.0     1.33          NaN        9950.0

...

   Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie  Facenumber in posters
0  22.0                      9.0                       481                          691                     1.0
1  2.0                       0.0                       4                            0                       1.0
2  12.0                      6.0                       108                          226                     0.0
3  23.0                      18.0                      203                          12000                   1.0
4  20.0                      3.0                       455                          926                     1.0

   User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score  Net Earnings
0  10718       88.0              69.0                8.0         NaN
1  5           1.0               1.0                 4.8         2900000.0
2  4849        45.0              48.0                8.3         NaN
3  111841      413.0             260.0               8.3         -5973565.0
4  7431        84.0              71.0                8.0         NaN
5 rows × 26 columns
You can choose to skip the index by passing along index=False.
movies.to_excel('output.xlsx', index=False)
We need to be able to make our output files look nice before we can send them out to our co-workers.
We can use the pandas ExcelWriter class along with the XlsxWriter Python module to apply the
formatting.
We can use these advanced output options by creating an ExcelWriter object and using this object
to write to the Excel file.
workbook = writer.book
worksheet = writer.sheets['report']
We can apply customizations by calling add_format on the workbook we are writing to. Here we
are setting header format as bold.
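One way to put these pieces together is sketched below; it is not necessarily the exact code behind this tutorial. It assumes the XlsxWriter module is installed, uses a made-up two-row DataFrame and file name, and writes the header cells itself so that the custom format is what ends up in the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample; the real data is the movies DataFrame.
movies = pd.DataFrame({"Title": ["Film A", "Film B"], "Year": [1916, 1920]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output_demo.xlsx")

    # Leaving the with-block saves and closes the file.
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        # Leave row 0 free so the header can be written with a custom format.
        movies.to_excel(writer, sheet_name="report", index=False,
                        header=False, startrow=1)

        workbook = writer.book
        worksheet = writer.sheets["report"]

        # Create a bold format and use it for every header cell.
        header_fmt = workbook.add_format({"bold": True})
        for col_num, name in enumerate(movies.columns):
            worksheet.write(0, col_num, name, header_fmt)

    file_size = os.path.getsize(path)

print(file_size > 0)
```

Writing the header cells directly is a design choice: pandas applies its own format to header cells it writes, so a row-level format alone may not change how they look.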
Finally, we save the output file by calling the close method on the writer object. Older pandas
releases also offered a save method for this, but it has since been removed.
writer.close()
As an example, we saved the data with column headers set as bold. The saved file looks like the
image below.
In the same way, one can use XlsxWriter to apply various kinds of formatting to the output Excel
file.
Conclusion
Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and
can be great companion tools. As we demonstrated, pandas can do a lot of complex data
analysis and manipulation, which, depending on your needs and expertise, can go beyond what you
can achieve if you are just using Excel. One of the major benefits of using Python and pandas over
Excel is that it helps you automate Excel file processing by writing scripts and integrating them with
your automated data workflow. Pandas also has excellent methods for reading all kinds of data from
Excel files. You can export your results from pandas back to Excel too, if that's preferred by your
intended audience.
On the other hand, Excel is such a widely used data tool that it's not wise to ignore it. Acquiring
expertise in both pandas and Excel and making them work together gives you skills that can help you
stand out in your organization.
https://www.dataquest.io/blog/excel-and-pandas/ 15-12-2018