Using Excel With Pandas

The document discusses using pandas to work with Excel files. Pandas has excellent methods for reading data from Excel files and exporting results back to Excel. It can also be used for routine data analysis tasks like exploratory data analysis, data visualization, and machine learning. The document then demonstrates how to read Excel data into pandas, explore the data, visualize it, manipulate it, and export it back to Excel. It uses a movie dataset from multiple Excel sheets to show pandas' capabilities of analyzing and exploring Excel data.
Using Excel with pandas

Excel is one of the most popular and widely-used data tools; it's hard to find an organization that
doesn't work with it in some way. From analysts, to sales VPs, to CEOs, various professionals use
Excel for both quick stats and serious data crunching.

With Excel being so pervasive, data professionals must be familiar with it. You'll also want a tool
that can easily read and write Excel files — pandas is perfect for this.

Pandas has excellent methods for reading all kinds of data from Excel files. You can also export
your results from pandas back to Excel, if that's preferred by your intended audience. Pandas is great
for other routine data analysis tasks, such as:

• quick Exploratory Data Analysis (EDA)
• drawing attractive plots
• feeding data into machine learning tools like scikit-learn
• building machine learning models on your data
• taking cleaned and processed data to any number of data tools

Pandas is better at automating data processing tasks than Excel, including processing Excel files.

In this tutorial, we are going to show you how to work with Excel files in pandas. We will cover the
following concepts.

• setting up your computer with the necessary software
• reading in data from Excel files into pandas
• data exploration in pandas
• visualizing data in pandas using the matplotlib visualization library
• manipulating and reshaping data in pandas
• moving data from pandas into Excel

Note that this tutorial does not provide a deep dive into pandas. To explore pandas more, check out
our course.
System prerequisites
We will use Python 3 and Jupyter Notebook to demonstrate the code in this tutorial.
In addition to Python and Jupyter Notebook, you will need the following Python modules:

• matplotlib - data visualization
• NumPy - numerical data functionality
• OpenPyXL - read/write Excel 2010 xlsx/xlsm files
• pandas - data import, clean-up, exploration, and analysis
• xlrd - read Excel data
• xlwt - write to Excel
• XlsxWriter - write to Excel (xlsx) files

There are multiple ways to get set up with all the modules. We cover three of the most common
scenarios below.

• If you have Python installed via the Anaconda distribution, you can install the required modules
using the conda install command. For example, to install pandas, you would run
conda install pandas.
• If you already have a regular, non-Anaconda Python installed on the computer, you can install the
required modules using pip. Open your command line program and run pip install
<module name>, replacing <module name> with the actual name of the module you are trying to
install. For example, to install pandas, you would run pip install pandas.
• If you don't have Python installed yet, you should get it through Anaconda.
Anaconda provides installers for Windows, Mac, and Linux computers. If you choose the full
installer, you will get all the modules you need, along with Python and pandas, within a single
package. This is the easiest and fastest way to get started.
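
Whichever route you take, it can help to confirm that the modules are importable before continuing. The following is a minimal sketch using only the standard library; the helper name check_modules is made up for this illustration, and the list uses the lowercase import names of the modules listed above.

```python
from importlib.util import find_spec

# Import names of the modules listed in the prerequisites (all lowercase).
REQUIRED = ["matplotlib", "numpy", "openpyxl", "pandas", "xlrd", "xlwt", "xlsxwriter"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


status = check_modules(REQUIRED)
for name, installed in status.items():
    print(f"{name:12s} {'ok' if installed else 'MISSING'}")
```

Any module reported as MISSING can then be installed with conda or pip as described above.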

The data set


In this tutorial, we will use a multi-sheet Excel file we created from Kaggle's IMDB Scores data.
You can download the file here.

Our Excel file has three sheets: '1900s,' '2000s,' and '2010s.' Each sheet has data for movies from
those years.
We will use this data set to find the ratings distribution for the movies, visualize movies with highest
ratings and net earnings and calculate statistical information about the movies. We will be analyzing
and exploring this data using pandas, thus demonstrating pandas capabilities to work with Excel
data.

Read data from the Excel file


We need to first import the data from the Excel file into pandas. To do that, we start by importing the
pandas module.

import pandas as pd

We then use pandas' read_excel method to read in data from the Excel file. The easiest way to
call this method is to pass the file name. If no sheet name is specified, it reads the first sheet
in the workbook (as shown below).

excel_file = 'movies.xls'
movies = pd.read_excel(excel_file)

Here, the read_excel method read the data from the Excel file into a pandas DataFrame object.
Pandas defaults to storing data in DataFrames. We then stored this DataFrame into a variable called
movies.

Pandas has a built-in DataFrame.head() method that we can use to easily display the first few rows
of our DataFrame. If no argument is passed, it displays the first five rows. If a number is passed, it
displays that many rows from the top.

movies.head()
   Title                                             Year  Genres               Language  Country  Content Rating  Duration  Aspect Ratio  Budget     Gross Earnings  ...
0  Intolerance: Love's Struggle Throughout the Ages  1916  Drama|History|War    NaN       USA      Not Rated       123       1.33          385907.0   NaN             ...
1  Over the Hill to the Poorhouse                    1920  Crime|Drama          NaN       USA      NaN             110       1.33          100000.0   3000000.0       ...
2  The Big Parade                                    1925  Drama|Romance|War    NaN       USA      Not Rated       151       1.33          245000.0   NaN             ...
3  Metropolis                                        1927  Drama|Sci-Fi         German    Germany  Not Rated       145       1.33          6000000.0  26435.0         ...
4  Pandora's Box                                     1929  Crime|Drama|Romance  German    Germany  Not Rated       110       1.33          NaN        9950.0          ...

   ...  Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
0  ...  436                       22                        9.0                       481                          691
1  ...  2                         2                         0.0                       4                            0
2  ...  81                        12                        6.0                       108                          226
3  ...  136                       23                        18.0                      203                          12000
4  ...  426                       20                        3.0                       455                          926

   Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score
0  1                      10718       88                69.0                8.0
1  1                      5           1                 1.0                 4.8
2  0                      4849        45                48.0                8.3
3  1                      111841      413               260.0               8.3
4  1                      7431        84                71.0                8.0

5 rows × 25 columns

Excel files quite often have multiple sheets, so the ability to read a specific sheet or all of them is
very important. To make this easy, the pandas read_excel method takes an argument called
sheet_name that tells pandas which sheet to read the data from (older pandas releases spelled this
argument sheetname). You can pass either the sheet name or the sheet number; sheet numbers start at
zero. If the sheet_name argument is not given, it defaults to zero and pandas will import the first sheet.

By default, pandas will automatically assign a numeric index or row label starting with zero. You
may want to leave the default index as it is if your data doesn't have a column with unique values
that can serve as a better index. If there is a column that you feel would serve as a better index,
you can override the default behavior by setting the index_col argument to that column. It takes a
numeric value for setting a single column as the index, or a list of numeric values for creating a
multi-index.

In the code below, we choose the first column, 'Title', as the index by passing zero to the
index_col argument.

movies_sheet1 = pd.read_excel(excel_file, sheet_name=0, index_col=0)


movies_sheet1.head()
Title                                             Year  Genres               Language  Country  Content Rating  Duration  Aspect Ratio  Budget     Gross Earnings  Director             ...
Intolerance: Love's Struggle Throughout the Ages  1916  Drama|History|War    NaN       USA      Not Rated       123       1.33          385907.0   NaN             D.W. Griffith        ...
Over the Hill to the Poorhouse                    1920  Crime|Drama          NaN       USA      NaN             110       1.33          100000.0   3000000.0       Harry F. Millarde    ...
The Big Parade                                    1925  Drama|Romance|War    NaN       USA      Not Rated       151       1.33          245000.0   NaN             King Vidor           ...
Metropolis                                        1927  Drama|Sci-Fi         German    Germany  Not Rated       145       1.33          6000000.0  26435.0         Fritz Lang           ...
Pandora's Box                                     1929  Crime|Drama|Romance  German    Germany  Not Rated       110       1.33          NaN        9950.0          Georg Wilhelm Pabst  ...

Title                                             ...  Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
Intolerance: Love's Struggle Throughout the Ages  ...  436                       22                        9.0                       481                          691
Over the Hill to the Poorhouse                    ...  2                         2                         0.0                       4                            0
The Big Parade                                    ...  81                        12                        6.0                       108                          226
Metropolis                                        ...  136                       23                        18.0                      203                          12000
Pandora's Box                                     ...  426                       20                        3.0                       455                          926

Title                                             Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score
Intolerance: Love's Struggle Throughout the Ages  1                      10718       88                69.0                8.0
Over the Hill to the Poorhouse                    1                      5           1                 1.0                 4.8
The Big Parade                                    0                      4849        45                48.0                8.3
Metropolis                                        1                      111841      413               260.0               8.3
Pandora's Box                                     1                      7431        84                71.0                8.0

5 rows × 24 columns

As noted above, our Excel data file has three sheets. We already read the first sheet into a
DataFrame. Now, using the same syntax, we will read in the other two sheets.

movies_sheet2 = pd.read_excel(excel_file, sheet_name=1, index_col=0)


movies_sheet2.head()
Title                  Year  Genres                   Language  Country  Content Rating  Duration  Aspect Ratio  Budget      Gross Earnings  Director            ...
102 Dalmatians         2000  Adventure|Comedy|Family  English   USA      G               100.0     1.85          85000000.0  66941559.0      Kevin Lima          ...
28 Days                2000  Comedy|Drama             English   USA      PG-13           103.0     1.37          43000000.0  37035515.0      Betty Thomas        ...
3 Strikes              2000  Comedy                   English   USA      R               82.0      1.85          6000000.0   9821335.0       DJ Pooh             ...
Aberdeen               2000  Drama                    English   UK       NaN             106.0     1.85          6500000.0   64148.0         Hans Petter Moland  ...
All the Pretty Horses  2000  Drama|Romance|Western    English   USA      PG-13           220.0     2.35          57000000.0  15527125.0      Billy Bob Thornton  ...

Title                  ...  Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
102 Dalmatians         ...  2000.0                    795.0                     439.0                     4182                         372
28 Days                ...  12000.0                   10000.0                   664.0                     23864                        0
3 Strikes              ...  939.0                     706.0                     585.0                     3354                         118
Aberdeen               ...  844.0                     2.0                       0.0                       846                          260
All the Pretty Horses  ...  13000.0                   861.0                     820.0                     15006                        652

Title                  Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score
102 Dalmatians         1                      26413       77.0              84.0                4.8
28 Days                1                      34597       194.0             116.0               6.0
3 Strikes              1                      1415        10.0              22.0                4.0
Aberdeen               0                      2601        35.0              28.0                7.3
All the Pretty Horses  2                      11388       183.0             85.0                5.8

5 rows × 24 columns

movies_sheet3 = pd.read_excel(excel_file, sheet_name=2, index_col=0)


movies_sheet3.head()
Title                                Year    Genres                              Language  Country  Content Rating  Duration  Aspect Ratio  Budget      Gross Earnings  Director         ...
127 Hours                            2010.0  Adventure|Biography|Drama|Thriller  English   USA      R               94.0      1.85          18000000.0  18329466.0      Danny Boyle      ...
3 Backyards                          2010.0  Drama                               English   USA      R               88.0      NaN           300000.0    NaN             Eric Mendelsohn  ...
3                                    2010.0  Comedy|Drama|Romance                German    Germany  Unrated         119.0     2.35          NaN         59774.0         Tom Tykwer       ...
8: The Mormon Proposition            2010.0  Documentary                         English   USA      R               80.0      1.78          2500000.0   99851.0         Reed Cowan       ...
A Turtle's Tale: Sammy's Adventures  2010.0  Adventure|Animation|Family          English   France   PG              88.0      2.35          NaN         NaN             Ben Stassen      ...

Title                                ...  Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
127 Hours                            ...  11000.0                   642.0                     223.0                     11984                        63000
3 Backyards                          ...  795.0                     659.0                     301.0                     1884                         92
3                                    ...  24.0                      20.0                      9.0                       69                           2000
8: The Mormon Proposition            ...  191.0                     12.0                      5.0                       210                          0
A Turtle's Tale: Sammy's Adventures  ...  783.0                     749.0                     602.0                     3874                         0

Title                                Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score
127 Hours                            0.0                    279179      440.0             450.0               7.6
3 Backyards                          0.0                    554         23.0              20.0                5.2
3                                    0.0                    4212        18.0              76.0                6.8
8: The Mormon Proposition            0.0                    1138        30.0              28.0                7.1
A Turtle's Tale: Sammy's Adventures  2.0                    5385        22.0              56.0                6.1

5 rows × 24 columns

Since all three sheets have similar data but for different records (movies), we will create a single
DataFrame from the three DataFrames we created above. We will use the pandas concat method
for this, passing in a list of the three DataFrames and assigning the result to a new
DataFrame object, movies. By reusing the name movies, we overwrite the
previously created DataFrame.

movies = pd.concat([movies_sheet1, movies_sheet2, movies_sheet3])

We can check that the concatenation worked by looking at the shape attribute of the combined
DataFrame, which gives us the number of rows and columns.

movies.shape
(5042, 24)
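
The same check can be made explicit: after concatenating along rows, the row count should equal the sum of the row counts of the inputs. The sketch below uses two tiny stand-in frames built from titles and years shown above rather than the real sheets.

```python
import pandas as pd

# Two tiny stand-ins for the per-sheet DataFrames (titles and years from the output above).
sheet_a = pd.DataFrame({"Year": [1927, 1929]}, index=["Metropolis", "Pandora's Box"])
sheet_b = pd.DataFrame({"Year": [2000]}, index=["28 Days"])

combined = pd.concat([sheet_a, sheet_b])

# Rows add up; the number of columns is unchanged.
expected_rows = len(sheet_a) + len(sheet_b)
print(combined.shape)  # (3, 1)
print(combined.shape[0] == expected_rows)
```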

Using the ExcelFile class to read multiple sheets

We can also use the ExcelFile class to work with multiple sheets from the same Excel file. We first
wrap the Excel file using ExcelFile and then call its parse method once per sheet (the ExcelFile
object can also be passed to read_excel in place of the file name).

xlsx = pd.ExcelFile(excel_file)
movies_sheets = []
for sheet in xlsx.sheet_names:
    movies_sheets.append(xlsx.parse(sheet))
movies = pd.concat(movies_sheets)

If you are reading an Excel file with a lot of sheets and are creating a lot of DataFrames, ExcelFile
is more convenient and efficient in comparison to read_excel. With ExcelFile, you only need to
pass the Excel file once, and then you can use it to get the DataFrames. When using read_excel,
you pass the Excel file every time and hence the file is loaded again for every sheet. This can be a
huge performance drag if the Excel file has many sheets with a large number of rows.
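
Another option in recent pandas versions is to pass sheet_name=None to read_excel, which returns a dictionary of DataFrames keyed by sheet name. The sketch below is an illustration under assumptions: it builds a small two-sheet workbook in a temporary folder (the file name demo.xlsx and its contents are made up here) and needs an .xlsx engine such as openpyxl installed.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"Title": ["Metropolis"], "Year": [1927]}).to_excel(
        writer, sheet_name="1900s", index=False)
    pd.DataFrame({"Title": ["28 Days"], "Year": [2000]}).to_excel(
        writer, sheet_name="2000s", index=False)

# sheet_name=None reads every sheet and returns {sheet name: DataFrame}.
sheets = pd.read_excel(path, sheet_name=None)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
movies_demo = pd.concat(list(sheets.values()), ignore_index=True)
print(list(sheets), movies_demo.shape)
```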

Exploring the data


Now that we have read in the movies data set from our Excel file, we can start exploring it using
pandas. A pandas DataFrame stores the data in a tabular format, just like the way Excel displays the
data in a sheet. Pandas has a lot of built-in methods to explore the DataFrame we created from the
Excel file we just read in.

We already introduced the head method in the previous section; it displays a few rows from the top
of the DataFrame. Let's look at a few more methods that come in handy while exploring the data
set.

We can use the shape attribute to find out the number of rows and columns in the DataFrame.

movies.shape
(5042, 25)

This tells us our Excel file has 5042 records (rows) and 25 columns. This can be useful for
reporting the number of records and columns and comparing that with the source data set.

We can use the tail method to view the bottom rows. If no parameter is passed, only the bottom
five rows are returned.

movies.tail()
      Title                    Year  Genres                                     Language  Country    Content Rating  Duration  Aspect Ratio  Budget  Gross Earnings  ...
1599  War & Peace              NaN   Drama|History|Romance|War                  English   UK         TV-14           NaN       16.00         NaN     NaN             ...
1600  Wings                    NaN   Comedy|Drama                               English   USA        NaN             30.0      1.33          NaN     NaN             ...
1601  Wolf Creek               NaN   Drama|Horror|Thriller                      English   Australia  NaN             NaN       2.00          NaN     NaN             ...
1602  Wuthering Heights        NaN   Drama|Romance                              English   UK         NaN             142.0     NaN           NaN     NaN             ...
1603  Yu-Gi-Oh! Duel Monsters  NaN   Action|Adventure|Animation|Family|Fantasy  Japanese  Japan      NaN             24.0      NaN           NaN     NaN             ...

      ...  Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
1599  ...  1000.0                    888.0                     502.0                     4528                         11000
1600  ...  685.0                     511.0                     424.0                     1884                         1000
1601  ...  511.0                     457.0                     206.0                     1617                         954
1602  ...  27000.0                   698.0                     427.0                     29196                        0
1603  ...  0.0                       NaN                       NaN                       0                            124

      Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score
1599  1.0                    9277        44.0              10.0                8.2
1600  5.0                    7646        56.0              19.0                7.3
1601  0.0                    726         6.0               2.0                 7.1
1602  2.0                    6053        33.0              9.0                 7.7
1603  0.0                    12417       51.0              6.0                 7.0

5 rows × 25 columns

In Excel, you're able to sort a sheet based on the values in one or more columns. In pandas, you can
do the same thing with the sort_values method. For example, let's sort our movies DataFrame
based on the Gross Earnings column.

sorted_by_gross = movies.sort_values(['Gross Earnings'], ascending=False)

Since we have the data sorted by the values in a column, we can do a few interesting things with it. For
example, we can display the top 10 movies by Gross Earnings.

sorted_by_gross["Gross Earnings"].head(10)
1867 760505847.0
1027 658672302.0
1263 652177271.0
610 623279547.0
611 623279547.0
1774 533316061.0
1281 474544677.0
226 460935665.0
1183 458991599.0
618 448130642.0
Name: Gross Earnings, dtype: float64
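
When only the top few values are needed, pandas also offers Series.nlargest, which avoids sorting the whole column first. The sketch below is illustrative only; it uses four Gross Earnings values copied from the listing above and checks that both approaches agree.

```python
import pandas as pd

# Four Gross Earnings values copied from the listing above, in shuffled order.
gross = pd.Series(
    [623279547.0, 760505847.0, 658672302.0, 652177271.0],
    index=[610, 1867, 1027, 1263],
    name="Gross Earnings",
)

top3 = gross.nlargest(3)                                  # largest three values, descending
same = gross.sort_values(ascending=False).head(3)         # the approach used in the tutorial

print(top3)
print(top3.equals(same))
```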

We can also create a plot for the top 10 movies by Gross Earnings. Pandas makes it easy to visualize
your data with plots and charts through matplotlib, a popular data visualization library. With a
couple of lines of code, you can start plotting. Moreover, matplotlib plots work well inside Jupyter
Notebooks since you can display the plots right under the code.

First, we import the matplotlib module and set matplotlib to display the plots right in the Jupyter
Notebook.

import matplotlib.pyplot as plt


%matplotlib inline

We will draw a bar plot where each bar will represent one of the top 10 movies. We can do this by
calling the plot method and setting the argument kind to barh. This tells matplotlib to draw a
horizontal bar plot.

sorted_by_gross['Gross Earnings'].head(10).plot(kind="barh")
plt.show()

Let's create a histogram of IMDB Scores to check the distribution of IMDB Scores across all movies.
Histograms are a good way to visualize the distribution of a data set. We use the plot method on the
IMDB Scores series from our movies DataFrame and pass it the argument kind="hist".

movies['IMDB Score'].plot(kind="hist")
plt.show()
This data visualization suggests that most of the IMDB Scores fall between six and eight.
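
To read the same information as numbers rather than from a picture, the scores can be binned directly. The sketch below is an illustration, not the tutorial's own step: it uses NumPy's histogram function on eight IMDB Scores taken from sample rows shown in this tutorial, so the counts describe that small sample only, not the full data set.

```python
import numpy as np
import pandas as pd

# IMDB Scores copied from sample rows shown in this tutorial (not the full data set).
scores = pd.Series([8.0, 4.8, 8.3, 8.3, 8.0, 6.3, 7.8, 6.6], name="IMDB Score")

# Ten unit-wide bins from 0 to 10; each bin includes its left edge.
counts, edges = np.histogram(scores, bins=10, range=(0, 10))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    if n:
        print(f"{left:.0f} to {right:.0f}: {n}")
```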

Getting statistical information about the data


Pandas has some very handy methods to look at the statistical data about our data set. For example,
we can use the describe method to get a statistical summary of the data set.

movies.describe()
       Year          Duration      Aspect Ratio  Budget        Gross Earnings
count  4935.000000   5028.000000   4714.000000   4.551000e+03  4.159000e+03
mean   2002.470517   107.201074    2.220403      3.975262e+07  4.846841e+07
std    12.474599     25.197441     1.385113      2.061149e+08  6.845299e+07
min    1916.000000   7.000000      1.180000      2.180000e+02  1.620000e+02
25%    1999.000000   93.000000     1.850000      6.000000e+06  5.340988e+06
50%    2005.000000   103.000000    2.350000      2.000000e+07  2.551750e+07
75%    2011.000000   118.000000    2.350000      4.500000e+07  6.230944e+07
max    2016.000000   511.000000    16.000000     1.221550e+10  7.605058e+08

       Facebook Likes - Director  Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
count  4938.000000                5035.000000               5029.000000               5020.000000               5042.000000                  5042.000000
mean   686.621709                 6561.323932               1652.080533               645.009761                9700.959143                  7527.457160
std    2813.602405                15021.977635              4042.774685               1665.041728               18165.101925                 19322.070537
min    0.000000                   0.000000                  0.000000                  0.000000                  0.000000                     0.000000
25%    7.000000                   614.500000                281.000000                133.000000                1411.250000                  0.000000
50%    49.000000                  988.000000                595.000000                371.500000                3091.000000                  166.000000
75%    194.750000                 11000.000000              918.000000                636.000000                13758.750000                 3000.000000
max    23000.000000               640000.000000             137000.000000             23000.000000              656730.000000                349000.000000

       Facenumber in posters  User Votes    Reviews by Users  Reviews by Crtiics  IMDB Score
count  5029.000000            5.042000e+03  5022.000000       4993.000000         5042.000000
mean   1.371446               8.368475e+04  272.770808        140.194272          6.442007
std    2.013683               1.384940e+05  377.982886        121.601675          1.125189
min    0.000000               5.000000e+00  1.000000          1.000000            1.600000
25%    0.000000               8.599250e+03  65.000000         50.000000           5.800000
50%    1.000000               3.437100e+04  156.000000        110.000000          6.600000
75%    2.000000               9.634700e+04  326.000000        195.000000          7.200000
max    43.000000              1.689764e+06  5060.000000       813.000000          9.500000

The describe method displays the following information for each of the columns.

• the count or number of values
• mean
• standard deviation
• minimum, maximum
• 25%, 50%, and 75% quantiles

Please note that this information will be calculated only for the numeric values.

We can also use the corresponding method to access this information one at a time. For example, to
get the mean of a particular column, you can use the mean method on that column.
movies["Gross Earnings"].mean()
48468407.526809327

Just like mean, there is a method available for each of the statistics we want to access.
You can read about these methods in our free pandas cheat sheet.
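
As a small illustration of how the individual methods line up with describe, the sketch below uses the Gross Earnings values from the first five rows shown earlier, including the missing ones. It also shows why the count differs from column to column in the summary above: missing values (NaN) are skipped.

```python
import pandas as pd

# Gross Earnings for the first five rows shown earlier; NaN marks missing values.
gross = pd.Series(
    [float("nan"), 3000000.0, float("nan"), 26435.0, 9950.0],
    name="Gross Earnings",
)

summary = gross.describe()  # count, mean, std, min, 25%, 50%, 75%, max

print(summary["count"], gross.count())    # NaN values are not counted
print(summary["50%"], gross.median())     # the 50% quantile is the median
print(summary["max"], gross.max())
print(gross.quantile(0.75))               # any other quantile on request
```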

Reading files with no header and skipping records


Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had headers and
no rows that needed skipping. Sometimes, the Excel sheet doesn't have any header row. In such
cases, you can tell pandas not to treat the first row as the header or column names. And if the
Excel sheet's first few rows contain data that should not be read in, you can ask the read_excel
method to skip a certain number of rows, starting from the top.

For example, look at the top few rows of this Excel file.

This file obviously has no header, and the first four rows are not actual records and hence should not be
read in. We can tell read_excel there is no header by setting the header argument to None, and we can
skip the first four rows by setting the skiprows argument to four.

movies_skip_rows = pd.read_excel("movies-no-header-skip-rows.xls", header=None, skiprows=4)
movies_skip_rows.head(5)
   0                    1     2                    3        4        5          6    7     8          9          ...
0  Metropolis           1927  Drama|Sci-Fi         German   Germany  Not Rated  145  1.33  6000000.0  26435.0    ...
1  Pandora's Box        1929  Crime|Drama|Romance  German   Germany  Not Rated  110  1.33  NaN        9950.0     ...
2  The Broadway Melody  1929  Musical|Romance      English  USA      Passed     100  1.37  379000.0   2808000.0  ...
3  Hell's Angels        1930  Drama|War            English  USA      Passed     96   1.20  3950000.0  NaN        ...
4  A Farewell to Arms   1932  Drama|Romance|War    English  USA      Unrated    79   1.37  800000.0   NaN        ...

   ...  15   16   17    18    19     20  21      22   23     24
0  ...  136  23   18.0  203   12000  1   111841  413  260.0  8.3
1  ...  426  20   3.0   455   926    1   7431    84   71.0   8.0
2  ...  77   28   4.0   109   167    8   4546    71   36.0   6.3
3  ...  431  12   4.0   457   279    1   3753    53   35.0   7.8
4  ...  998  164  99.0  1284  213    1   3519    46   42.0   6.6

5 rows × 25 columns

We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one
can combine different options in a single read statement. To skip rows at the bottom of the sheet, you
can use the skipfooter option (spelled skip_footer in older pandas releases), which works just like
skiprows, the only difference being that the rows are counted from the bottom upwards.

The column names in the previous DataFrame are numeric and were assigned by pandas as defaults.
We can replace them with descriptive names by assigning a list of column names to the columns
attribute of the DataFrame.

movies_skip_rows.columns = ['Title', 'Year', 'Genres', 'Language', 'Country',
                            'Content Rating', 'Duration', 'Aspect Ratio', 'Budget',
                            'Gross Earnings', 'Director', 'Actor 1', 'Actor 2', 'Actor 3',
                            'Facebook Likes - Director', 'Facebook Likes - Actor 1',
                            'Facebook Likes - Actor 2', 'Facebook Likes - Actor 3',
                            'Facebook Likes - cast Total', 'Facebook likes - Movie',
                            'Facenumber in posters', 'User Votes', 'Reviews by Users',
                            'Reviews by Crtiics', 'IMDB Score']
movies_skip_rows.head()
   Title                Year  Genres               Language  Country  Content Rating  Duration  Aspect Ratio  Budget     Gross Earnings  ...
0  Metropolis           1927  Drama|Sci-Fi         German    Germany  Not Rated       145       1.33          6000000.0  26435.0         ...
1  Pandora's Box        1929  Crime|Drama|Romance  German    Germany  Not Rated       110       1.33          NaN        9950.0          ...
2  The Broadway Melody  1929  Musical|Romance      English   USA      Passed          100       1.37          379000.0   2808000.0       ...
3  Hell's Angels        1930  Drama|War            English   USA      Passed          96        1.20          3950000.0  NaN             ...
4  A Farewell to Arms   1932  Drama|Romance|War    English   USA      Unrated         79        1.37          800000.0   NaN             ...

   ...  Facebook Likes - Actor 1  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
0  ...  136                       23                        18.0                      203                          12000
1  ...  426                       20                        3.0                       455                          926
2  ...  77                        28                        4.0                       109                          167
3  ...  431                       12                        4.0                       457                          279
4  ...  998                       164                       99.0                      1284                         213

   Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score
0  1                      111841      413               260.0               8.3
1  1                      7431        84                71.0                8.0
2  8                      4546        71                36.0                6.3
3  1                      3753        53                35.0                7.8
4  1                      3519        46                42.0                6.6

5 rows × 25 columns
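
The read and rename steps can also be combined: read_excel accepts a names argument that labels the columns while reading. The sketch below is a self-contained illustration under assumptions; it writes a tiny headerless workbook with two junk rows on top (the file name no_header.xlsx and the junk rows are made up for this example) and needs an .xlsx engine such as openpyxl.

```python
import os
import tempfile

import pandas as pd

# A headerless sheet with two junk rows on top (made-up layout for this demo).
raw = pd.DataFrame([
    ["report", None],
    ["generated", None],
    ["Metropolis", 1927],
    ["Pandora's Box", 1929],
])
path = os.path.join(tempfile.mkdtemp(), "no_header.xlsx")
raw.to_excel(path, header=False, index=False)

# names= labels the columns at read time, so no separate rename step is needed.
df = pd.read_excel(path, header=None, skiprows=2, names=["Title", "Year"])
print(df)
```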

Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a
subset of columns.

Reading a subset of columns


Although read_excel defaults to reading and importing all columns, you can choose to import only
certain columns. By passing usecols=list(range(7)), we are telling the read_excel method to read only
the columns at positions zero through six, that is, the first seven columns (the first column being
indexed zero). Older pandas releases called this argument parse_cols and also accepted a single
integer such as 6.

movies_subset_columns = pd.read_excel(excel_file, usecols=list(range(7)))


movies_subset_columns.head()
   Title                                             Year  Genres               Language  Country  Content Rating  Duration
0  Intolerance: Love's Struggle Throughout the Ages  1916  Drama|History|War    NaN       USA      Not Rated       123
1  Over the Hill to the Poorhouse                    1920  Crime|Drama          NaN       USA      NaN             110
2  The Big Parade                                    1925  Drama|Romance|War    NaN       USA      Not Rated       151
3  Metropolis                                        1927  Drama|Sci-Fi         German    Germany  Not Rated       145
4  Pandora's Box                                     1929  Crime|Drama|Romance  German    Germany  Not Rated       110

Alternatively, you can pass in a list of numbers, which will let you import columns at particular
indexes.
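
For illustration, here are three ways of selecting the same two columns in recent pandas versions: by position, by Excel-style column letters, and by header name. This is a sketch under assumptions; the workbook cols.xlsx is a tiny made-up file written to a temporary folder, and an .xlsx engine such as openpyxl must be installed.

```python
import os
import tempfile

import pandas as pd

# A one-row workbook with four columns, written only for this demo.
path = os.path.join(tempfile.mkdtemp(), "cols.xlsx")
pd.DataFrame({
    "Title": ["Metropolis"],
    "Year": [1927],
    "Genres": ["Drama|Sci-Fi"],
    "Language": ["German"],
}).to_excel(path, index=False)

by_position = pd.read_excel(path, usecols=[0, 2])               # first and third columns
by_letter = pd.read_excel(path, usecols="A,C")                  # Excel-style column letters
by_name = pd.read_excel(path, usecols=["Title", "Genres"])      # header names

print(list(by_position.columns), list(by_letter.columns), list(by_name.columns))
```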

Applying formulas on the columns


One of the much-used features of Excel is to apply formulas to create new columns from existing
column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net
earnings by subtracting Budget from Gross earnings. We could then apply this formula in the Excel
file to all the rows. We can do this in pandas also as shown below.

movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]

Above, we used pandas to create a new column called Net Earnings and populated it with the
difference of Gross Earnings and Budget. It's worth noting the difference here in how formulas are
treated in Excel versus pandas. In Excel, a formula lives in the cell and updates when the data
changes. In pandas, the calculation happens once and only the resulting values are stored, so if Gross
Earnings for one movie were changed afterwards, Net Earnings would not be updated.
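
A short sketch of this behavior, using two rows from the data shown earlier and a made-up replacement value for Metropolis' gross: the derived column keeps its stored result until the assignment is run again.

```python
import pandas as pd

# Two rows taken from the output shown earlier in this tutorial.
df = pd.DataFrame(
    {"Gross Earnings": [3000000.0, 26435.0], "Budget": [100000.0, 6000000.0]},
    index=["Over the Hill to the Poorhouse", "Metropolis"],
)
df["Net Earnings"] = df["Gross Earnings"] - df["Budget"]

# Change an input value (1000000.0 is a made-up figure for illustration).
df.loc["Metropolis", "Gross Earnings"] = 1000000.0
stale = df.loc["Metropolis", "Net Earnings"]      # still the old, stored result

# Re-running the assignment brings the derived column up to date.
df["Net Earnings"] = df["Gross Earnings"] - df["Budget"]
fresh = df.loc["Metropolis", "Net Earnings"]

print(stale, fresh)  # -5973565.0 -5000000.0
```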

Let's use the sort_values method to sort the data by the new column we created and visualize the
top 10 movies by Net Earnings.

sorted_movies = movies[['Net Earnings']].sort_values(['Net Earnings'], ascending=[False])
sorted_movies.head(10)['Net Earnings'].plot.barh()
plt.show()

Pivot Table in pandas


Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table
by grouping the data on an index and applying operations such as sorting, summing, or averaging.
You can use this feature in pandas too.

We need to first identify the column or columns that will serve as the index, and the column(s) on
which the summarizing formula will be applied. Let's start small, by choosing Year as the index
column and Gross Earnings as the summarization column and creating a separate DataFrame from
this data.

movies_subset = movies[['Year', 'Gross Earnings']]


movies_subset.head()
   Year    Gross Earnings
0  1916.0  NaN
1  1920.0  3000000.0
2  1925.0  NaN
3  1927.0  26435.0
4  1929.0  9950.0

We now call pivot_table on this subset of data. The method pivot_table takes a parameter
index. As mentioned, we want to use Year as the index.

earnings_by_year = movies_subset.pivot_table(index=['Year'])
earnings_by_year.head()
Year    Gross Earnings
1916.0  NaN
1920.0  3000000.0
1925.0  NaN
1927.0  26435.0
1929.0  1408975.0

This gave us a pivot table with grouping on Year and summarization on the mean of Gross Earnings
(pivot_table averages by default; pass aggfunc='sum' to get totals instead).
Notice that we didn't need to specify the Gross Earnings column explicitly, as pandas automatically
identified it as the column of values on which summarization should be applied.
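
The choice of aggregation is worth checking, since pivot_table uses the mean when no aggfunc is given. The sketch below uses three Gross Earnings values from rows shown earlier (one film from 1927 and two from 1929) to compare the default with an explicit sum.

```python
import pandas as pd

# One 1927 film and two 1929 films, with values from rows shown earlier.
subset = pd.DataFrame({
    "Year": [1927, 1929, 1929],
    "Gross Earnings": [26435.0, 9950.0, 2808000.0],
})

by_mean = subset.pivot_table(index="Year")  # default aggfunc is the mean
by_sum = subset.pivot_table(index="Year", values="Gross Earnings", aggfunc="sum")

print(by_mean.loc[1929, "Gross Earnings"])  # 1408975.0, the figure in the output above
print(by_sum.loc[1929, "Gross Earnings"])   # 2817950.0
```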

We can use this pivot table to create some data visualizations. We can call the plot method on the
DataFrame to create a line plot and call the show method to display the plot in the notebook.

earnings_by_year.plot()
plt.show()
We saw how to pivot with a single column as the index. Things will get more interesting if we can
use multiple columns. Let's create another DataFrame subset but this time we will choose the
columns, Country, Language and Gross Earnings.

movies_subset = movies[['Country', 'Language', 'Gross Earnings']]


movies_subset.head()
   Country  Language  Gross Earnings
0  USA      NaN       NaN
1  USA      NaN       3000000.0
2  USA      NaN       NaN
3  Germany  German    26435.0
4  Germany  German    9950.0

We will use the columns Country and Language as the index for the pivot table. We will use Gross
Earnings as the summarization column; however, we do not need to specify this explicitly, as we saw
earlier.

earnings_by_co_lang = movies_subset.pivot_table(index=['Country', 'Language'])


earnings_by_co_lang.head()
Country      Language    Gross Earnings
Afghanistan  Dari        1.127331e+06
Argentina    Spanish     7.230936e+06
Aruba        English     1.007614e+07
Australia    Aboriginal  6.165429e+06
             Dzongkha    5.052950e+05

Let's visualize this pivot table with a bar plot. Since there are still a few hundred records in this pivot
table, we will plot just a few of them.

earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20,8))
plt.show()
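
A two-level pivot can also be laid out as a grid by moving one of the grouping columns to the columns argument of pivot_table. The sketch below is an illustration on three rows taken from outputs shown earlier, not on the full data set; combinations that never occur show up as NaN.

```python
import pandas as pd

# Three rows taken from outputs shown earlier in this tutorial.
subset = pd.DataFrame({
    "Country": ["Germany", "Germany", "USA"],
    "Language": ["German", "German", "English"],
    "Gross Earnings": [26435.0, 9950.0, 2808000.0],
})

# Countries become rows, languages become columns, cells hold the mean gross.
wide = subset.pivot_table(index="Country", columns="Language", values="Gross Earnings")
print(wide)
```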

Exporting the results to Excel

If you're going to be working with colleagues who use Excel, saving Excel files out of pandas is
important. You can export or write a pandas DataFrame to an Excel file using the pandas to_excel
method. For .xlsx files such as the one below, pandas writes through the OpenPyXL or XlsxWriter
module; xlwt applies only to the legacy .xls format. The to_excel method is called on the DataFrame
we want to export. We also need to pass the name of the file to which this
DataFrame will be written.

movies.to_excel('output.xlsx')

By default, the index is also saved to the output file. However, sometimes the index doesn't provide
any useful information. For example, the movies DataFrame has a numeric auto-increment index,
that was not part of the original Excel data.

movies.head()
   Title                                             Year    Genres               Language  Country  Content Rating  Duration  Aspect Ratio  Budget     Gross Earnings  ...
0  Intolerance: Love's Struggle Throughout the Ages  1916.0  Drama|History|War    NaN       USA      Not Rated       123.0     1.33          385907.0   NaN             ...
1  Over the Hill to the Poorhouse                    1920.0  Crime|Drama          NaN       USA      NaN             110.0     1.33          100000.0   3000000.0       ...
2  The Big Parade                                    1925.0  Drama|Romance|War    NaN       USA      Not Rated       151.0     1.33          245000.0   NaN             ...
3  Metropolis                                        1927.0  Drama|Sci-Fi         German    Germany  Not Rated       145.0     1.33          6000000.0  26435.0         ...
4  Pandora's Box                                     1929.0  Crime|Drama|Romance  German    Germany  Not Rated       110.0     1.33          NaN        9950.0          ...

   ...  Facebook Likes - Actor 2  Facebook Likes - Actor 3  Facebook Likes - cast Total  Facebook likes - Movie
0  ...  22.0                      9.0                       481                          691
1  ...  2.0                       0.0                       4                            0
2  ...  12.0                      6.0                       108                          226
3  ...  23.0                      18.0                      203                          12000
4  ...  20.0                      3.0                       455                          926

   Facenumber in posters  User Votes  Reviews by Users  Reviews by Crtiics  IMDB Score  Net Earnings
0  1.0                    10718       88.0              69.0                8.0         NaN
1  1.0                    5           1.0               1.0                 4.8         2900000.0
2  0.0                    4849        45.0              48.0                8.3         NaN
3  1.0                    111841      413.0             260.0               8.3         -5973565.0
4  1.0                    7431        84.0              71.0                8.0         NaN

5 rows × 26 columns
You can choose to skip the index by passing index=False.

movies.to_excel('output.xlsx', index=False)
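
The effect of the index argument is easy to see with a round trip. In the sketch below, a two-row DataFrame is written twice to a temporary folder (the file names are made up for this example, and an .xlsx engine such as openpyxl is assumed) and read back: the file written with the index contains one extra, unnamed column.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Title": ["Metropolis", "Pandora's Box"], "Year": [1927, 1929]})
folder = tempfile.mkdtemp()
with_index_path = os.path.join(folder, "with_index.xlsx")
no_index_path = os.path.join(folder, "no_index.xlsx")

df.to_excel(with_index_path)                 # default: the index is written too
df.to_excel(no_index_path, index=False)      # only the data columns are written

with_index = pd.read_excel(with_index_path)
no_index = pd.read_excel(no_index_path)

print(list(with_index.columns))  # an extra unnamed column holds the old index
print(list(no_index.columns))
```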

We need to be able to make our output files look nice before we can send them out to our co-workers.
We can use the pandas ExcelWriter class along with the XlsxWriter Python module to apply the
formatting.

We can use these advanced output options by creating an ExcelWriter object and using this object
to write to the Excel file.

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')

movies.to_excel(writer, index=False, sheet_name='report')

workbook = writer.book

worksheet = writer.sheets['report']

We can apply customizations by calling add_format on the workbook we are writing to. Here we
are setting header format as bold.

header_fmt = workbook.add_format({'bold': True})


worksheet.set_row(0, None, header_fmt)

Finally, we save the output file by calling the close method on the writer object (older pandas
releases also offered a save method, which was removed in pandas 2.0).

writer.close()

As an example, we saved the data with the column headers set as bold. The saved file looks like the
image below.

In the same way, one can use XlsxWriter to apply various kinds of formatting to the output Excel file.
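
As one more illustration of what XlsxWriter formatting can look like, the sketch below widens a column and applies a thousands-separator number format. It is a sketch under assumptions: it needs the XlsxWriter module installed, it writes a one-row made-up workbook (formatted.xlsx) to a temporary folder, and it uses ExcelWriter as a context manager so the file is saved when the with block ends.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Title": ["Metropolis"], "Gross Earnings": [26435.0]})
path = os.path.join(tempfile.mkdtemp(), "formatted.xlsx")

# The file is written and closed automatically when the with block ends.
with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
    df.to_excel(writer, index=False, sheet_name="report")
    workbook = writer.book
    worksheet = writer.sheets["report"]

    money_fmt = workbook.add_format({"num_format": "#,##0"})
    worksheet.set_column("A:A", 30)              # widen the Title column
    worksheet.set_column("B:B", 18, money_fmt)   # width plus a thousands separator

print(os.path.exists(path))
```

The formatting changes only how the cells are displayed in Excel; the stored values stay the same when the file is read back into pandas.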


Conclusion

Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and
can be great companion tools. As we demonstrated, pandas can do a lot of complex data
analysis and manipulation which, depending on your needs and expertise, can go beyond what you
can achieve if you are just using Excel. One of the major benefits of using Python and pandas over
Excel is that it helps you automate Excel file processing by writing scripts and integrating them with your
automated data workflow. Pandas also has excellent methods for reading all kinds of data from
Excel files. You can export your results from pandas back to Excel too, if that's preferred by your
intended audience.

On the other hand, Excel is such a widely used data tool that it's not wise to ignore it. Acquiring
expertise in both pandas and Excel and making them work together gives you skills that can help you
stand out in your organization.

https://www.dataquest.io/blog/excel-and-pandas/ 15-12-2018
