UNIT II
EDA USING PYTHON
Data Manipulation using Pandas – Pandas Objects
– Data Indexing and Selection – Operating on
Data – Handling Missing Data – Hierarchical
Indexing – Combining datasets – Concat, Append,
Merge and Join – Aggregation and grouping –
Pivot Tables – Vectorized String Operations
Installing and Using Pandas
Once Pandas is installed, you can import it and check the
version:
In[1]: import pandas
pandas.__version__
Out[1]: '0.18.1'
Just as we generally import NumPy under the alias np, we will
import Pandas under the alias pd:
In[2]: import pandas as pd
For example, to display all the contents of the pandas
namespace, you can type this:
In [3]: pd.<TAB>
And to display the built-in Pandas documentation, you can use
this:
In [4]: pd?
Introducing Pandas Objects
Pandas objects can be thought of as enhanced versions of NumPy
structured arrays in which the rows and columns are identified
with labels rather than simple integer indices.
Pandas provides a host of useful tools, methods, and functionality
on top of the basic data structures, but nearly everything that
follows will require an understanding of what these structures
are.
Thus, before we go any further, let’s introduce these three
fundamental Pandas data structures: the Series, DataFrame,
and Index.
We will start our code sessions with the standard NumPy and
Pandas imports:
In[1]: import numpy as np
import pandas as pd
Series as generalized NumPy array
The essential difference is the presence of the index: while the NumPy array has
an implicitly defined integer index used to access the values, the Pandas Series
has an explicitly defined index associated with the values.
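The difference between an implicit and an explicit index can be sketched as follows. This is a minimal illustration; the values and the labels "a" to "d" are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed only by its implicit integer position.
arr = np.array([0.25, 0.5, 0.75, 1.0])

# A Series carries an explicit index; here the labels are strings.
ser = pd.Series([0.25, 0.5, 0.75, 1.0], index=["a", "b", "c", "d"])

print(arr[1])     # positional access on the array
print(ser["b"])   # label-based access on the Series
print(ser.index)  # the explicit index object
```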
Series as specialized dictionary
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values,
and a Series is a structure that maps typed keys to a set of typed values.
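A small sketch of the dictionary analogy, assuming a made-up dictionary of fruit counts. The Series is built directly from the dictionary, and unlike a plain dictionary it also supports slicing by label.

```python
import pandas as pd

# Hypothetical data: the keys become the index, the values become the data.
stock = pd.Series({"apples": 3, "bananas": 5, "cherries": 7})

print(stock["bananas"])            # dictionary-style lookup by key
print(stock["apples":"bananas"])   # label slicing includes the end label
print(stock.dtype)                 # all values share one type
```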
Constructing Series objects
The Pandas DataFrame Object
DataFrame as specialized dictionary
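One way to read this heading: a DataFrame behaves like a dictionary that maps a column name to a Series of column data. The sketch below uses made-up names and values.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"Age": [25, 30], "Team": ["Red", "Blue"]},
    index=["Ann", "Bob"],
)

age = df["Age"]          # looking up a column name returns a Series
print(type(age))
print("Age" in df)       # membership tests the column names
print(list(df.keys()))   # keys() returns the column names
```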
Indexing and Selecting Data with Pandas
Indexing in Pandas:
Indexing in pandas means simply selecting particular
rows and columns of data from a DataFrame. Indexing
could mean selecting all the rows and some of the
columns, some of the rows and all of the columns, or
some of each of the rows and columns. Indexing can
also be known as Subset Selection.
Pandas Indexing using [ ], .loc[ ], .iloc[ ], .ix[ ]
There are many ways to pull elements, rows, and columns
from a DataFrame. The indexing methods below look very
similar but behave very differently.
Pandas supports four types of multi-axis indexing:
DataFrame[ ] : the indexing operator; selects columns by name.
DataFrame.loc[ ] : selects by label.
DataFrame.iloc[ ] : selects by integer position.
DataFrame.ix[ ] : selects by label or by integer position
(deprecated in pandas 0.20 and removed in pandas 1.0).
Collectively, they are called the indexers. They are by far the most
common ways to get elements, rows, and columns from a DataFrame.
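The three indexers still available in current pandas can be compared on one small frame. This is a sketch on made-up data; the integer row labels are chosen to differ from the row positions so that .loc and .iloc give visibly different results.

```python
import pandas as pd

# Hypothetical data; labels 10, 20, 30 are not the positions 0, 1, 2.
df = pd.DataFrame(
    {"Age": [25, 30, 22], "Team": ["Red", "Blue", "Red"]},
    index=[10, 20, 30],
)

col = df["Age"]             # [] selects a column by name
by_label = df.loc[20]       # .loc selects the row labelled 20
by_position = df.iloc[0]    # .iloc selects the first row (label 10)
cell = df.loc[30, "Team"]   # row label and column label together

print(by_label["Age"], by_position["Age"], cell)
```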
Selecting a single column
In order to select a single column, we
simply put the name of the column
between the brackets.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving columns by indexing operator
first = data["Age"]
print(first)
Selecting multiple columns
In order to select multiple columns, we
have to pass a list of columns in an
indexing operator.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving multiple columns by indexing operator
first = data[["Age", "College", "Salary"]]
first
Indexing a DataFrame using .loc[ ]:
The .loc indexer selects data by the labels of the rows and
columns. Unlike the plain indexing operator, it can select
subsets of rows or columns, and it can also select
subsets of rows and columns at the same time.
Selecting a single row
To select a single row using .loc[ ], we put a single
row label inside the brackets.
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving row by loc method
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
Selecting multiple rows
In order to select multiple rows, we put
all the row labels in a list and pass
that list to .loc[ ].
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving multiple rows by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
print(first)
Selecting two rows and three columns
To select two rows and three columns, we put the two row
labels in one list and the three column labels in a second
list, like this:
DataFrame.loc[["row1", "row2"], ["column1", "column2", "column3"]]
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving two rows and three columns by loc method
first = data.loc[["Avery Bradley", "R.J. Hunter"],
["Team", "Number", "Position"]]
print(first)
Selecting all of the rows and some columns
To select all of the rows and some
columns, we use a single colon (:) for the rows
and a list of the columns we want, like this:
DataFrame.loc[:, ["column1", "column2", "column3"]]
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col="Name")
# retrieving all rows and some columns by loc method
first = data.loc[:, ["Team", "Number", "Position"]]
print(first)
Indexing a DataFrame using .iloc[ ]:
The .iloc indexer retrieves rows and columns by
position. To use it, we specify the integer positions of
the rows and of the columns that we want, counting from 0.
The df.iloc indexer is very similar to df.loc but only uses
integer locations to make its selections.
Selecting a single row
To select a single row using .iloc[ ], we pass a
single integer inside the brackets.
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv", index_col ="Name")
# retrieving the row at position 3 (the fourth row) by iloc method
row2 = data.iloc[3]
print(row2)
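Beyond a single integer, .iloc[ ] also accepts lists and slices of positions. The sketch below uses a small made-up frame instead of nba.csv so that it runs on its own; the names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data.
df = pd.DataFrame(
    {"Team": ["Red", "Blue", "Red", "Green"],
     "Number": [7, 11, 4, 9],
     "Position": ["PG", "SF", "C", "SG"]},
    index=["Ann", "Bob", "Cy", "Dee"],
)

fourth_row = df.iloc[3]       # a single row, returned as a Series
some_rows = df.iloc[[0, 2]]   # a list of positions returns a DataFrame
block = df.iloc[1:3, 0:2]     # slices exclude the end position

print(block)
```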
Indexing using DataFrame.ix[ ]:
Early in the development of pandas, there existed another indexer, ix. This
indexer was capable of selecting both by label and by integer location. While it
was versatile, it caused a lot of confusion because it was not explicit: integers
can also be labels for rows or columns, so some selections were ambiguous.
Generally, .ix was label based and acted just like the .loc indexer. However,
.ix also supported integer-position selection (as in .iloc) when passed an
integer, but only when the index of the DataFrame was not integer based.
Otherwise, .ix accepted any of the inputs of .loc and .iloc.
The .ix indexer was deprecated in pandas 0.20 and removed in pandas 1.0;
use .loc or .iloc instead.
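Where older code mixed labels and positions through .ix, one possible replacement is to convert between the two explicitly and then use .loc or .iloc. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"Age": [25, 30, 22], "Team": ["Red", "Blue", "Red"]},
    index=["Ann", "Bob", "Cy"],
)

# Row by position, column by label: turn the position into a label first.
age_second_row = df.loc[df.index[1], "Age"]

# Row by label, column by position: turn the label into a position first.
team_of_cy = df.iloc[df.index.get_loc("Cy"), 1]

print(age_second_row, team_of_cy)
```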
Hierarchical Indexing
The index is like an address: it is how any data point in a DataFrame
or Series can be accessed. Rows and columns both have indexes;
the row labels are called the index, and the column labels are
generally called column names.
Hierarchical Indexes
Hierarchical indexing, also known as multi-indexing, means setting more
than one column as the index. The examples here use the
homelessness.csv file.
# importing pandas library as alias pd
import pandas as pd
# calling the pandas read_csv() function.
# and storing the result in DataFrame df
df = pd.read_csv('homelessness.csv')
print(df.head())
Columns in the Dataframe:
# using the pandas columns attribute.
col = df.columns
print(col)
Output:
Index(['Unnamed: 0', 'region', 'state', 'individuals', 'family_members',
       'state_pop'],
      dtype='object')
To make a column the index, we use the set_index() function of pandas. If
we want to make one column the index, we simply pass the name of the
column as a string to set_index(). For multi-indexing, or
hierarchical indexing, we pass a list of column names to set_index().
The code below demonstrates hierarchical indexing in pandas:
# using the pandas set_index() function.
df_ind3 = df.set_index(['region', 'state', 'individuals'])
# sort_index() returns a sorted copy, so assign the result back
df_ind3 = df_ind3.sort_index()
print(df_ind3.head(10))
Now the DataFrame is using hierarchical indexing, or multi-indexing.
Note that here we have made three columns the index ('region', 'state',
'individuals'). The first index, 'region', is called the level 0 index and is at
the top of the hierarchy; the next index, 'state', is the level 1 index,
below level 0; and so on. The indexes form a hierarchy, which is why
this is called hierarchical indexing.
Sometimes we need to convert an index back into a normal column.
The pandas reset_index() function does this; with inplace=True it
modifies the DataFrame directly instead of returning a new one.
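The round trip between columns and index levels can be sketched without the CSV file. The frame below is a made-up stand-in with the same column names; the numbers are illustrative and not taken from homelessness.csv.

```python
import pandas as pd

# Hypothetical stand-in data; values are invented for the example.
df = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain"],
    "state": ["Alaska", "Hawaii", "Arizona", "Idaho"],
    "individuals": [10, 20, 30, 40],
})

# Two columns become index levels; sort_index() returns a sorted copy.
indexed = df.set_index(["region", "state"]).sort_index()
print(indexed.index.names)     # level names, outermost first
print(indexed.index.nlevels)   # number of levels

# reset_index() turns the index levels back into ordinary columns.
restored = indexed.reset_index()
print(restored.columns)
```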
Selecting data in a hierarchical index:
To select data from the DataFrame using the .loc[ ]
indexer, we pass the index labels in a list.
# selecting the 'Pacific' and 'Mountain'
# region from the dataframe.
# selecting data using level(0) index or main index.
df_ind3_region = df_ind3.loc[['Pacific', 'Mountain']]
print(df_ind3_region.head(10))
We cannot pass only level 1 labels to .loc[ ] to get data from the DataFrame;
doing so raises an error. Level 1 and other inner labels can only be used
together with the level 0 (main) label, with the help of a list of
tuples.
# using only the inner index 'state' for getting data.
# This raises a KeyError, because .loc[ ] looks for these labels in level 0.
df_ind3_state = df_ind3.loc[['Alaska', 'California', 'Idaho']]
print(df_ind3_state.head(10))
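The error described above, and two ways around it, can be reproduced on a small made-up frame. The data is hypothetical, and xs() is not part of the original slides; it is shown here only as one option pandas offers for selecting on an inner level.

```python
import pandas as pd

# Hypothetical stand-in data with a two-level index.
df = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain", "Mountain"],
    "state": ["Alaska", "Hawaii", "Arizona", "Idaho"],
    "individuals": [10, 20, 30, 40],
}).set_index(["region", "state"]).sort_index()

# An inner-level label on its own is looked up in level 0 and is not found.
try:
    df.loc[["Alaska"]]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

# Pairing the inner label with its outer label in a tuple works.
alaska = df.loc[[("Pacific", "Alaska")]]

# xs() selects on an inner level when the level is named explicitly.
idaho = df.xs("Idaho", level="state")
print(alaska)
print(idaho)
```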
Using inner-level indexes with the help of a list of tuples:
Syntax:
df.loc[[(level 0, level 1, level 2)]]
# selecting data by passing all levels index.
df_ind3_region_state = df_ind3.loc[[("Pacific", "Alaska", 1434),
("Pacific", "Hawaii", 4131),
("Mountain", "Arizona", 7259),
("Mountain", "Idaho", 1297)]]
df_ind3_region_state
Combine datasets using Pandas merge(), join(), concat() and append()
In Pandas, for a horizontal combination we have merge() and join(), whereas for
vertical combination we can use concat() and append(). Merge and join perform
similar tasks but internally they have some differences, similar to concat and
append.
1. merge() is used for combining data on common columns
or indices.
import pandas as pd
d1 = {'Id': ['A1', 'A2', 'A3', 'A4', 'A5'],
      'Name': ['Vivek', 'Rahul', 'Gaurav', 'Ankit', 'Vishakha'],
      'Age': [27, 24, 22, 32, 28]}
d2 = {'Id': ['A1', 'A2', 'A3', 'A4'],
      'Address': ['Delhi', 'Gurgaon', 'Noida', 'Pune'],
      'Qualification': ['Btech', 'B.A', 'Bcom', 'B.hons']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Case 1. merging data on the common column 'Id'
# Inner join (the default): only Ids present in both data frames
pd.merge(df1, df2)
pd.merge(df1, df2, how='inner')
# Left join: all records from the left DF (df1), matching or not
pd.merge(df1, df2, how='left')
# Right join: all records from the right DF (df2), matching or not
pd.merge(df1, df2, how='right')
# Outer join: all matching and non-matching records from both data frames
pd.merge(df1, df2, how='outer')
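The effect of each join type is easiest to see from the row counts. The sketch below uses two small made-up frames and passes on="Id" explicitly; indicator=True adds a _merge column that records which side each row came from.

```python
import pandas as pd

# Hypothetical data: A3 exists only on the left, A4 only on the right.
left = pd.DataFrame({"Id": ["A1", "A2", "A3"], "Name": ["x", "y", "z"]})
right = pd.DataFrame({"Id": ["A1", "A2", "A4"], "City": ["p", "q", "r"]})

inner = pd.merge(left, right, on="Id", how="inner")
left_join = pd.merge(left, right, on="Id", how="left")
outer = pd.merge(left, right, on="Id", how="outer", indicator=True)

print(len(inner), len(left_join), len(outer))
print(outer[["Id", "_merge"]])
```

Rows without a match on the other side are kept by the left and outer joins, with missing values filled in for the columns that come from the other frame.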
2. join() is used for combining data on a key column
or an index.
import pandas as pd
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K5', 'K3', 'K4', 'K2'],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'B': ['B0', 'B1', 'B2']})
Case 1. join on indexes
By default, the pandas join operation is performed on
indexes. Both data frames have default index values,
so there is no need to specify a join key; the join is implicitly
performed on the indexes.
# default join on indexes
# the default nature of pandas join is left outer join
# lsuffix and rsuffix rename the overlapping 'key' column
df1.join(df2, lsuffix='_l', rsuffix='_r')
When the index values in the two data frames are different, an
inner (equi) join gives an empty result, while the default left join
still returns all the rows of the left DF (df1).
Create two data frames with different index values:
df1 = pd.DataFrame({'key': ['K0', 'K1', 'K5', 'K3', 'K4', 'K2'],
                    'A': ['A0', 'A1', 'A5', 'A3', 'A4', 'A2']},
                   index=[0, 1, 2, 3, 4, 5])
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'B': ['B0', 'B1', 'B2']},
                   index=[6, 7, 8])
# default left join: df1 is the left DF and df2 is the right DF
df1.join(df2, lsuffix='_l', rsuffix='_r')
# inner join
df1.join(df2, lsuffix='_l', rsuffix='_r', how='inner')
# outer join
df1.join(df2, lsuffix='_l', rsuffix='_r', how='outer')
Case 2. join on columns
Data frames can be joined on columns as well, but because join() works on
indexes, we need to convert the join key into the index and then
perform the join; everything else is the same.
# both data frames above name their key column 'key'
df1.set_index('key').join(df2.set_index('key'))
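Joining on a column can be sketched in two ways on small made-up frames. The second form uses the on parameter of join(), which matches a column of the left frame against the index of the right frame.

```python
import pandas as pd

# Hypothetical data: K1 has no match in df2.
df1 = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
df2 = pd.DataFrame({"key": ["K0", "K2"], "B": ["B0", "B2"]})

# Option 1: move the key into the index on both sides, then join.
joined = df1.set_index("key").join(df2.set_index("key"))

# Option 2: keep df1 as it is and match its 'key' column to df2's index.
joined_on = df1.join(df2.set_index("key"), on="key")

print(joined)
print(joined_on)
```

Option 2 keeps the original row index of df1 and leaves 'key' as an ordinary column, which can be convenient when the key should not become the index.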
3. concat() is used for combining data frames across
rows or columns.
Case 1. concat data frames on axis=0, the default
operation
import pandas as pd
m1 = pd.DataFrame({'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
                   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
                   'Marks_scored': [98, 90, 87, 69, 78]},
                  index=[1, 2, 3, 4, 5])
m2 = pd.DataFrame({'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
                   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
                   'Marks_scored': [89, 80, 79, 97, 88]},
                  index=[4, 5, 6, 7, 8])
# rows of m2 are stacked below m1; index labels 4 and 5 appear twice
pd.concat([m1, m2])
# ignore_index=True discards the old labels and renumbers the rows from 0
pd.concat([m1, m2], ignore_index=True)
Case 2. concat operation on axis=1, horizontal
operation
pd.concat([m1, m2], axis=1)
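The three concat() variants can be compared by looking at the resulting index and shape. A sketch on two small made-up frames whose index labels overlap at 2:

```python
import pandas as pd

# Hypothetical data; label 2 appears in both frames.
top = pd.DataFrame({"Name": ["Alex", "Amy"], "Marks": [98, 90]}, index=[1, 2])
bottom = pd.DataFrame({"Name": ["Billy", "Brian"], "Marks": [89, 80]}, index=[2, 3])

stacked = pd.concat([top, bottom])                         # labels kept as they are
renumbered = pd.concat([top, bottom], ignore_index=True)   # fresh index from 0
side_by_side = pd.concat([top, bottom], axis=1)            # aligned on index labels

print(list(stacked.index))
print(list(renumbered.index))
print(side_by_side)
```

With axis=1 the frames are aligned on their index labels, so labels present in only one frame get missing values in the other frame's columns.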
4. append() combines data frames in a vertical
fashion
Note: DataFrame.append() was deprecated in pandas 1.4 and removed in
pandas 2.0; in current versions use pd.concat([m1, m2]) instead.
Case 1. appending data frames, duplicate
index issue
m1 = pd.DataFrame({'Name': ['Vivek', 'Vishakha', 'Ash', 'Natalie', 'Ayoung'],
                   'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5'],
                   'Marks_scored': [98, 90, 87, 69, 78],
                   'Rank': [1, 3, 6, 20, 13]},
                  index=[1, 2, 3, 4, 5])
m2 = pd.DataFrame({'Name': ['Barak', 'Wayne', 'Saurav', 'Yuvraj', 'Suresh'],
                   'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5'],
                   'Marks_scored': [89, 80, 79, 97, 88]},
                  index=[1, 2, 3, 4, 5])
# both data frames use index labels 1 to 5, so the result has duplicate labels
m1.append(m2)
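On pandas versions where append() is no longer available, the same vertical combination and the same duplicate-index issue can be sketched with pd.concat(). The frames below are shortened, made-up versions of m1 and m2.

```python
import pandas as pd

# Hypothetical shortened frames; both use index labels 1 and 2.
m1 = pd.DataFrame({"Name": ["Vivek", "Ash"], "Rank": [1, 6]}, index=[1, 2])
m2 = pd.DataFrame({"Name": ["Barak", "Wayne"]}, index=[1, 2])

combined = pd.concat([m1, m2])
print(combined.index.is_unique)   # duplicate labels are kept

# ignore_index=True renumbers the rows and removes the duplicates.
clean = pd.concat([m1, m2], ignore_index=True)

# verify_integrity=True turns duplicate labels into an error instead.
try:
    pd.concat([m1, m2], verify_integrity=True)
    duplicate_error = False
except ValueError:
    duplicate_error = True
print("duplicate labels rejected:", duplicate_error)
```

Because m2 has no Rank column, the rows that come from m2 hold missing values in Rank.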
Aggregation and grouping
Grouping and aggregating make data analysis easier using
various functions. These methods help us to group and
summarize our data and make complex analysis comparatively easy.
Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and return a summary of the result. Aggregation can be used to get
a summary of columns in our dataset, like the sum, minimum, maximum, etc. of a
particular column. The function used for aggregation is agg(); its parameter is the
function (or list of functions) we want to apply.
Some functions used in aggregation are:
Function      Description
sum()         Compute sum of column values
min()         Compute min of column values
max()         Compute max of column values
mean()        Compute mean of column
size()        Compute column sizes
describe()    Generates descriptive statistics
first()       Compute first of group values
last()        Compute last of group values
count()       Compute count of column values
std()         Standard deviation of column
var()         Compute variance of column
sem()         Standard error of the mean of column
df.sum()
df.agg(['sum', 'min', 'max'])
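The two calls above can be tried on a small made-up frame of marks. The column names Maths and Science follow the grouping example in these notes; the values are invented. The last call shows that agg() also accepts a dictionary that maps each column to its own function.

```python
import pandas as pd

# Hypothetical marks data.
df = pd.DataFrame({"Maths": [9, 8, 7, 9], "Science": [6, 8, 7, 5]})

totals = df.sum()                          # one value per column
summary = df.agg(["sum", "min", "max"])    # one row per function
per_column = df.agg({"Maths": "mean", "Science": "max"})

print(totals)
print(summary)
print(per_column)
```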
Grouping in Pandas
Grouping is used to group data using some criteria from our
dataset. It follows the split-apply-combine strategy:
1. Splitting the data into groups based on some criteria.
2. Applying a function to each group independently.
3. Combining the results into a data structure.
Apply the groupby() function to group the data on the
"Maths" value. To view the first row of each group, use the
first() function.
a = df.groupby('Maths')
a.first()
b = df.groupby(['Maths', 'Science'])
b.first()
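The split-apply-combine steps can be followed on a small made-up frame. The names and marks are hypothetical; rows are split by their Maths value, a function is applied to each group, and the results are combined into one object indexed by the group key.

```python
import pandas as pd

# Hypothetical marks data.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Dee"],
    "Maths": [9, 8, 9, 8],
    "Science": [6, 8, 7, 4],
})

groups = df.groupby("Maths")               # split on the Maths value
first_rows = groups.first()                # first row of each group
mean_science = groups["Science"].mean()    # apply mean, combine into a Series
sizes = groups.size()                      # number of rows in each group

print(first_rows)
print(mean_science)
print(sizes)
```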
Vectorized String Operations
Introducing Pandas String Operations
We saw in previous sections how tools like NumPy and Pandas
generalize arithmetic operations so that we can easily and quickly
perform the same operation on many array elements. For
example:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
Output:
array([ 4, 6, 10, 14, 22, 26])
This vectorization of operations simplifies the syntax of operating
on arrays of data: we no longer have to worry about the size or
shape of the array, but just about what operation we want done.
Eg1:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
Output:
['Peter', 'Paul', 'Mary', 'Guido']
Eg2:
import pandas as pd
# the same data with a missing value added
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names
Output:
0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object
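The reason for moving from a list comprehension to a Series shows up once the data contains a missing value. In this sketch the plain Python loop fails on None, while the vectorized .str.capitalize() skips the missing entry.

```python
import pandas as pd

data = ["peter", "Paul", None, "MARY", "gUIDO"]

# The list comprehension breaks because None has no capitalize() method.
try:
    [s.capitalize() for s in data]
    comprehension_failed = False
except AttributeError:
    comprehension_failed = True
print("list comprehension failed:", comprehension_failed)

# The vectorized string method leaves the missing value in place.
names = pd.Series(data)
capitalized = names.str.capitalize()
print(capitalized)
```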
Tables of Pandas String Methods
If you have a good understanding of string manipulation in
Python, most of Pandas string syntax is intuitive enough
that it's probably sufficient to just list a table of available
methods; we will start with that here, before diving deeper
into a few of the subtleties. The examples in this section
use the following series of names:
monte = pd.Series(['Graham Chapman', 'John Cleese',
'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])
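A few of the string methods can be tried on this series as a quick sketch. Each method is reached through the .str accessor and mirrors the Python string method of the same name.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese',
                   'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])

lower = monte.str.lower()                    # lower-case every name
lengths = monte.str.len()                    # number of characters in each name
starts_t = monte.str.startswith("T")         # boolean mask
last_names = monte.str.split().str.get(-1)   # last word of each name

print(lengths)
print(last_names)
```

Chaining .str calls, as in the last line, applies each step element by element: split() produces a list per name and get(-1) picks the last item of each list.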