ML Lab8
ML Lab8
Artificial Intelligence
Experiment 8
Pandas-I (Series, DataFrame)
Pandas Library
Pandas is an open-source Python Library providing high-performance data manipulation and
analysis tool using its powerful data structures. The name Pandas is derived from the word
Panel Data – an Econometrics from Multidimensional data.
In 2008, developer Wes McKinney started developing pandas when in need of high
performance, flexible tool for analysis of data.
Prior to Pandas, Python was majorly used for data munging and preparation. It had very little
contribution towards data analysis. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin of
data — load, prepare, manipulate, model, and analyze.
Python with Pandas is used in a wide range of fields including academic and commercial
domains including finance, economics, Statistics, analytics, etc.
Key Features of Pandas
Fast and efficient DataFrame object with default and customized indexing.
Tools for loading data into in-memory data objects from different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of date sets.
Label-based slicing, indexing and subsetting of large data sets.
Columns from a data structure can be deleted or inserted.
Group by data for aggregation and transformations.
High performance merging and joining of data.
Time Series functionality.
Standard Python distribution doesn't come bundled with Pandas module. A lightweight
alternative is to install pandas using popular Python package installer, pip.
pip install pandas
Pandas deals with the following three data structures −
Series
DataFrame
Panel
These data structures are built on top of Numpy array, which means they are fast.
Dimension & Description
The best way to think of these data structures is that the higher dimensional data structure is a
container of its lower dimensional data structure. For example, DataFrame is a container of
Series, Panel is a container of DataFrame.
Data Dimension Description
Structure s
Series 1 1D labeled homogeneous array, size immutable.
Data 2 General 2D labeled, size-mutable tabular structure with
Frames potentially heterogeneously typed columns.
Panel 3 General 3D labeled, size-mutable array.
Building and handling two or more dimensional arrays is a tedious task, burden is placed on the
user to consider the orientation of the data set when writing functions. But using Pandas data
structures, the mental effort of the user is reduced.
For example, with tabular data (DataFrame) it is more semantically helpful to think of
the index (the rows) and the columns rather than axis 0 and axis 1.
Mutability
All Pandas data structures are value mutable (can be changed) and except Series all are size
mutable. Series is size immutable.
Note − DataFrame is widely used and one of the most important data structures. Panel is used
much less.
Series
Series is a one-dimensional array like structure with homogeneous data. For example, the
following series is a collection of integers 10, 23, 56, …
10 23 56 17 52 61 73 90 26 72
Key Points
Homogeneous data
Size Immutable
Values of Data Mutable
DataFrame
DataFrame is a two-dimensional array with heterogeneous data. For example,
Name Age Gender Rating
Steve 32 Male 3.45
Lia 28 Female 4.6
Vin 45 Male 3.9
Katie 38 Female 2.78
The table represents the data of a sales team of an organization with their overall performance
rating. The data is represented in rows and columns. Each column represents an attribute and
each row represents a person.
Data Type of Columns
The data types of the four columns are as follows −
Column Type
Name String
Age Integer
Gender String
Rating Float
Key Points
Heterogeneous data
Size Mutable
Data Mutable
Panel
Panel is a three-dimensional data structure with heterogeneous data. It is hard to represent the
panel in graphical representation. But a panel can be illustrated as a container of DataFrame.
Key Points
Heterogeneous data
Size Mutable
Data Mutable
Series is a one-dimensional labeled array capable of holding data of any type (integer, string,
float, python objects, etc.). The axis labels are collectively called index.
pandas.Series
A pandas Series can be created using the following constructor −
pandas.Series( data, index, dtype, copy)
The parameters of the constructor are as follows −
Sr.N Parameter & Description
o
1 data
data takes various forms like ndarray, list, constants
2 index
Index values must be unique and hashable, same length as data.
Default np.arrange(n) if no index is passed.
3 dtype
dtype is for data type. If None, data type will be inferred
4 copy
Copy data. Default False
A series can be created using various inputs like −
Array
Dict
Scalar value or constant
Create an Empty Series
A basic series, which can be created is an Empty Series.
Example
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve the first three element
print s[:3]
Its output is as follows −
a 1
b 2
c 3
dtype: int64
Example 3
Retrieve the last three elements.
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
import pandas as pd
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns.
Features of DataFrame
Potentially columns are of different types
Size – Mutable
Labeled axes (rows and columns)
Can Perform Arithmetic operations on rows and columns
Structure
Let us assume that we are creating a data frame with student’s data.
You can think of it as an SQL table or a spreadsheet data representation.
pandas.DataFrame
A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
The parameters of the constructor are as follows −
Sr.N Parameter & Description
o
1 data
data takes various forms like ndarray, series, map, lists, dict, constants and also
another DataFrame.
2 index
For the row labels, the Index to be used for the resulting frame is Optional Default
np.arange(n) if no index is passed.
3 columns
For column labels, the optional default syntax is - np.arange(n). This is only true if no
index is passed.
4 dtype
Data type of each column.
5 copy
This command (or whatever it is) is used for copying of data, if the default is False.
Create DataFrame
A pandas DataFrame can be created using various inputs like −
Lists
dict
Series
Numpy ndarrays
Another DataFrame
Create an Empty DataFrame
A basic DataFrame, which can be created is an Empty Dataframe.
Example
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print df
Its output is as follows −
0
0 1
1 2
2 3
3 4
4 5
Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'])
print df
Its output is as follows −
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
Example 3
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data,columns=['Name','Age'],dtype=float)
print df
Its output is as follows −
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
Note − Observe, the dtype parameter changes the type of Age column to floating point.
Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of same length. If index is passed, then the length of the index should
equal to the length of the arrays.
If no index is passed, then by default, index will be range(n), where n is the array length.
Example 1
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print df
Its output is as follows −
Age Name
0 28 Tom
1 34 Jack
2 29 Steve
3 42 Ricky
Note − Observe the values 0,1,2,3. They are the default index assigned to each using the
function range(n).
Example 2
Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print df
Its output is as follows −
Age Name
rank1 28 Tom
rank2 34 Jack
rank3 29 Steve
rank4 42 Ricky
Note − Observe, the index parameter assigns an index to each row.
Create a DataFrame from List of Dicts
List of Dictionaries can be passed as input data to create a DataFrame. The dictionary keys are
by default taken as column names.
Example 1
The following example shows how to create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print df
Its output is as follows −
a b c
0 1 2 NaN
1 5 10 20.0
Note − Observe, NaN (Not a Number) is appended in missing areas.
Example 2
The following example shows how to create a DataFrame by passing a list of dictionaries and
the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print df
Its output is as follows −
a b c
first 1 2 NaN
second 5 10 20.0
Example 3
The following example shows how to create a DataFrame with a list of dictionaries, row
indices, and column indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print df1
print df2
Its output is as follows −
#df1 output
a b
first 1 2
second 5 10
#df2 output
a b1
first 1 NaN
second 5 NaN
Note − Observe, df2 DataFrame is created with a column index other than the dictionary key;
thus, appended the NaN’s in place. Whereas, df1 is created with column indices same as
dictionary keys, so NaN’s appended.
Create a DataFrame from Dict of Series
Dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all
the series indexes passed.
Example
import pandas as pd
df = pd.DataFrame(d)
print df
Its output is as follows −
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Note − Observe, for the series one, there is no label ‘d’ passed, but in the result, for the d label,
NaN is appended with NaN.
Let us now understand column selection, addition, and deletion through examples.
Column Selection
We will understand this by selecting a column from the DataFrame.
Example
import pandas as pd
df = pd.DataFrame(d)
print df ['one']
Its output is as follows −
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
Column Addition
We will understand this by adding a new column to an existing data frame.
Example
import pandas as pd
df = pd.DataFrame(d)
# Adding a new column to an existing DataFrame object with column label by passing new
series
print df
Its output is as follows −
Adding a new column by passing as Series:
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
df = pd.DataFrame(d)
print ("Our dataframe is:")
print df
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print df.loc['b']
Its output is as follows −
one 2.0
two 2.0
Name: b, dtype: float64
The result is a series with labels as column names of the DataFrame. And, the Name of the
series is the label with which it is retrieved.
Selection by integer location
Rows can be selected by passing integer location to an iloc function.
import pandas as pd
df = pd.DataFrame(d)
print df.iloc[2]
Its output is as follows −
one 3.0
two 3.0
Name: c, dtype: float64
Slice Rows
Multiple rows can be selected using ‘ : ’ operator.
import pandas as pd
df = pd.DataFrame(d)
print df[2:4]
Its output is as follows −
one two
c 3.0 3
d NaN 4
Addition of Rows
Add new rows to a DataFrame using the append function. This function will append the rows at
the end.
import pandas as pd
df = df.append(df2)
print df
Its output is as follows −
a b
0 1 2
1 3 4
0 5 6
1 7 8
Deletion of Rows
Use index label to delete or drop rows from a DataFrame. If label is duplicated, then multiple
rows will be dropped.
If you observe, in the above example, the labels are duplicate. Let us drop a label and will see
how many rows will get dropped.
import pandas as pd
df = df.append(df2)
print df
Its output is as follows −
ab
134
178
In the above example, two rows were dropped because those two contain the same label 0.
Series Basic Functionality
Sr.No. Attribute or Method & Description
1 axes
Returns a list of the row axis labels
2 dtype
Returns the dtype of the object.
3 empty
Returns True if series is empty.
4 ndim
Returns the number of dimensions of the underlying data, by definition 1.
5 size
Returns the number of elements in the underlying data.
6 values
Returns the Series as ndarray.
7 head()
Returns the first n rows.
8 tail()
Returns the last n rows.
Let us now create a Series and see all the above tabulated attributes operation.
Example
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data series is:")
print df
Its output is as follows −
Our data series is:
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
T (Transpose)
Returns the transpose of the DataFrame. The rows and columns will interchange.
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Row axis labels and column axis labels are:")
print df.axes
Its output is as follows −
Row axis labels and column axis labels are:
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("The data types of each column are:")
print df.dtypes
Its output is as follows −
The data types of each column are:
Age int64
Name object
Rating float64
dtype: object
empty
Returns the Boolean value saying whether the Object is empty or not; True indicates that the
object is empty.
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Is the object empty?")
print df.empty
Its output is as follows −
Is the object empty?
False
ndim
Returns the number of dimensions of the object. By definition, DataFrame is a 2D object.
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print df
print ("The dimension of the object is:")
print df.ndim
Its output is as follows −
Our object is:
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print df
print ("The s of the object is:")
print df.shape
Its output is as follows −
Our object is:
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
import pandas as pd
import numpy as np
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our object is:")
print df
print ("The actual data in our data frame is:")
print df.values
Its output is as follows −
Our object is:
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
The actual data in our data frame is:
[[25 'Tom' 4.23]
[26 'James' 3.24]
[25 'Ricky' 3.98]
[23 'Vin' 2.56]
[30 'Steve' 3.2]
[29 'Smith' 4.6]
[23 'Jack' 3.8]]
Head & Tail
To view a small sample of a DataFrame object, use the head() and tail()
methods. head() returns the first n rows (observe the index values). The default number of
elements to display is five, but you may pass a custom number.
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print df
print ("The first two rows of the data frame is:")
print df.head(2)
Its output is as follows −
Our data frame is:
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80
import pandas as pd
import numpy as np
#Create a DataFrame
df = pd.DataFrame(d)
print ("Our data frame is:")
print df
print ("The last two rows of the data frame is:")
print df.tail(2)
Its output is as follows −
Our data frame is:
Age Name Rating
0 25 Tom 4.23
1 26 James 3.24
2 25 Ricky 3.98
3 23 Vin 2.56
4 30 Steve 3.20
5 29 Smith 4.60
6 23 Jack 3.80