Python Pandas New Syllabus
Its output is as follows −
a b
First 1 2
Second 5 7
print(df2)
a b1
First 1 NaN
Second 5 NaN
Data Series
• A Series is a one-dimensional labeled array capable of
holding data of any type (integer, string, float, Python
objects, etc.). The axis labels are collectively called the index.
• pandas.Series
• A pandas Series can be created using the following
constructor −
pandas.Series(data, index, dtype, copy)
A series can be created using various inputs like −
• Array
• Dict
• Scalar value or constant
The parameters of the constructor are as follows −
Sr.No  Parameter  Description
1      data       Takes various forms such as ndarray, list, dict, or a constant (scalar).
2      index      Index values must be hashable and the same length as data; they need not be unique. Defaults to 0, 1, ..., n-1 if no index is passed.
3      dtype      Data type of the Series. If None, the data type is inferred.
4      copy       Whether to copy the input data. Default False.
If data is an ndarray, then the index passed must be of the same length. If no index is passed, then
by default the index will be range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array)-1].
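Creating a Series from a scalar is listed among the inputs above but not illustrated. A minimal sketch, assuming an arbitrary value of 5 and index labels chosen only for illustration: the scalar is repeated for every label in the index.

```python
import pandas as pd

# A scalar is repeated for every label of the supplied index.
# The value 5 and the labels 0..3 are illustrative choices.
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
```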
Example 1
We did not pass any index, so by default, it assigned the indexes ranging
from 0 to len(data)-1, i.e., 0 to 3.
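The code for this example appears to be missing from the slide. A minimal sketch of what it might look like, assuming the same four-element array that Example 2 uses:

```python
import pandas as pd
import numpy as np

# Same data as Example 2 (an assumption), but no index is passed,
# so pandas assigns the default labels 0 to len(data)-1.
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)
```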
Example 2
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)
Its output is as follows −
100 a
101 b
102 c
103 d
dtype: object
We passed the index values here. Now we can see the customized indexed
values in the output.
Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary
keys are used to construct the index (in insertion order in current
pandas; older versions sorted them). If an index is passed, the values
in data corresponding to the labels in the index will be pulled out.
Example 1
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)
Its output is as follows −
a 0.0
b 1.0
c 2.0
dtype: float64
Observe − Dictionary keys are used to construct index.
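The case where an index is passed along with a dict is described above but not shown. A minimal sketch, reusing the same dictionary and assuming an index that reorders the keys and adds one label ('d') that is not in the dict:

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
# Values are pulled out of the dict in the order of the index labels;
# 'd' is not a key in data, so its value becomes NaN.
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
```

The index order is preserved, and the missing label is filled with NaN rather than raising an error.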
Create a DataFrame from Dict of Series
Dictionary of Series can be passed to form a
DataFrame. The resultant index is the union of
all the series indexes passed.
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
Selecting/Accessing a Subset from a DataFrame using Row/Column Names
df = pd.DataFrame(data,index=['Delhi','Mumbai','Kolkata','Chennai'])
print(df)
Population Average_income
Delhi 10927986 72167810876544
Mumbai 12691836 85007812691836
Kolkata 4631392 422678431392
Chennai 4328063 5261782328063
• To access multiple rows, use:
<DF object>.loc[<start row> : <end row>, :]
Make sure not to miss the COLON AFTER the COMMA.
• To access selective columns, use:
<DF object>.loc[ : , <start column> : <end column>]
Make sure not to miss the COLON BEFORE the
COMMA. As with rows, all columns falling between the
start and end columns, both inclusive, will be listed.
• To access a range of columns from a range of
rows, use:
<DF object>.loc[<start row> : <end row>,
<start column> : <end column>]
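The three slicing forms can be tried on the city DataFrame printed earlier. The slides do not show how `data` was built, so the dictionary below is an assumption reconstructed from the printed values:

```python
import pandas as pd

# Assumed construction; values copied from the printed DataFrame.
data = {
    'Population': [10927986, 12691836, 4631392, 4328063],
    'Average_income': [72167810876544, 85007812691836,
                       422678431392, 5261782328063],
}
df = pd.DataFrame(data, index=['Delhi', 'Mumbai', 'Kolkata', 'Chennai'])

# Rows Mumbai through Kolkata, all columns (label slices include both ends).
rows = df.loc['Mumbai':'Kolkata', :]

# All rows, columns Population through Average_income.
cols = df.loc[:, 'Population':'Average_income']

# A range of rows and a range of columns together.
block = df.loc['Delhi':'Kolkata', 'Population':'Population']

print(rows)
print(cols)
print(block)
```

Because loc slices by label, the end label is included in the result, unlike ordinary Python slicing.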
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Select the row labelled 'b'
print(df.loc['b'])
Its output is as follows −
one 2.0
two 2.0
Name: b, dtype: float64
Obtaining a Subset from a DataFrame using Row/Column Numeric Index Positions
<DF object>.iloc[<start row index> : <end row index>, <start column index> : <end column index>]
• Selection by integer location
• Rows can be selected by passing an integer location to
the iloc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
# Select the row at integer position 2 (the third row)
print(df.iloc[2])
Its output is as follows −
one 3.0
two 3.0
Name: c, dtype: float64
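The range form of iloc given at the top of this section is not exercised by the example above. A minimal sketch on the same DataFrame, with slice positions chosen only for illustration:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Rows at positions 1 and 2, column at position 0.
# Unlike loc, the end position of an iloc slice is excluded.
subset = df.iloc[1:3, 0:1]
print(subset)
```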
Selecting/Accessing individual Values
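The slides do not show code for reading a single value. A minimal sketch, assuming the small DataFrame from the loc/iloc examples, that reads one cell by label with `at`, by position with `iat`, and with `loc`:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

by_label = df.at['b', 'two']     # row label 'b', column 'two'
by_position = df.iat[1, 1]       # second row, second column
via_loc = df.loc['b', 'two']     # loc also accepts a single row/column pair
print(by_label, by_position, via_loc)
```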
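import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
# items() yields (column name, Series) pairs; it replaces iteritems(),
# which was removed in pandas 2.0
for key,value in df.items():
    print(key,value)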
iterrows()
iterrows() returns the iterator yielding each index value along with a
series containing the data in each row.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
    print(row_index,row)
describe() Function
• The describe() function computes a summary of statistics
pertaining to the DataFrame columns.
import pandas as pd
import numpy as np
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve',
   'Smith','Jack','Lee','David','Gasper','Betina','Andres']),
   'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
   'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,
   4.10,3.65])}
#Create a DataFrame
df = pd.DataFrame(d)
# describe() summarises the numeric columns (Age and Rating) by default
print(df.describe())
Aggregation/Descriptive statistics - Dataframe
Data aggregation –
Aggregation is the process of turning the values of a dataset (or a
subset of it) into a single value. In other words, an aggregation is a
function that takes multiple values and returns a single value as a
result. Many aggregations are possible, such as count, sum, min, max,
median, and quartile. These are descriptive statistics and related
operations on a DataFrame. Let us make this clear! If we have a
DataFrame like…
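The aggregations named above can be sketched as follows. The DataFrame here is hypothetical (the first four Age and Rating values from the describe() example), not necessarily the one the slides go on to use.

```python
import pandas as pd

# Hypothetical data for illustration (first four rows of the describe() example).
df = pd.DataFrame({'Age': [25, 26, 25, 23],
                   'Rating': [4.23, 3.24, 3.98, 2.56]})

# Each call reduces every column to a single value.
print(df.count())          # number of non-missing values per column
print(df.sum())
print(df.min())
print(df.max())
print(df.median())
print(df.quantile(0.25))   # first quartile
```

Each call returns a Series with one entry per column, which is what makes these functions aggregations: many input values, one result per column.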