Pandas - Datastructures
import pandas as pd
In [ ]:
Why use pandas?
Relevant data is very important in data science.
Pandas can clean messy data sets, and make them readable and relevant.
In [ ]:
Pandas provides two main data structures for manipulating data. They are:
Series
DataFrame
DataFrame is widely used and one of the most important data structures.
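As a minimal sketch of how the two structures relate (the names `s`, `df`, and `column` are illustrative and not part of the notebook), a Series holds one labeled column of values, while a DataFrame holds several columns that share the same row labels:

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
s = pd.Series([10, 20, 30], name="values")

# A DataFrame is a two-dimensional table; each column is a Series.
df = pd.DataFrame({"values": [10, 20, 30], "labels": ["a", "b", "c"]})

# Selecting a single column gives back a Series.
column = df["values"]
print(type(column).__name__)
```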
In [ ]:
Pandas Series:
In [2]:
# Create a simple Pandas Series from a list.
# If nothing else is specified, the values are labeled with their index number:
# the first value has index 0, the second value has index 1, and so on.
import pandas as pd
a = [10, 20, 30]
sr = pd.Series(a)
print(sr)
0 10
1 20
2 30
dtype: int64
In [2]:
# This label can be used to access a specified value.
# Return the first value of the Series:
print(sr[0])
10
In [3]:
# With the index argument, you can name your own labels.
# Create your own labels:
import pandas as pd
a = [10, 20, 30]
sr = pd.Series(a, index=["x", "y", "z"])
print(sr)
x 10
y 20
z 30
dtype: int64
In [4]:
# we can access an item by referring to the label.
# Return the value of "y":
print(sr["y"])
20
In [5]:
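# We can also access an item by its position.
# Return the value at position 1 (the second item).
# .iloc is used because plain sr[1] on a labeled Series is deprecated in recent pandas versions.
print(sr.iloc[1])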
20
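The two access styles above can be kept apart explicitly. A small sketch, assuming the same `x`/`y`/`z` labels as the notebook (the names `by_label` and `by_position` are illustrative): `.loc` looks up by label and `.iloc` looks up by integer position.

```python
import pandas as pd

sr = pd.Series([10, 20, 30], index=["x", "y", "z"])

# .loc selects by label, .iloc selects by integer position.
by_label = sr.loc["y"]
by_position = sr.iloc[1]

print(by_label, by_position)
```

Using `.loc` and `.iloc` avoids any ambiguity when the labels themselves are integers.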
In [28]:
sr.index
In [27]:
sr.values
In [16]:
# Assigning values to the elements (by position)
sr.iloc[0] = 200
sr
Out[16]:
x 200
y 20
z 30
dtype: int64
In [17]:
#Assigning Values to the Elements
sr['y'] = 2000
sr
Out[17]:
x 200
y 2000
z 30
dtype: int64
In [18]:
#Operations and Mathematical Functions
sr / 2
Out[18]:
x 100.0
y 1000.0
z 15.0
dtype: float64
In [19]:
sr*2
Out[19]:
x 400
y 4000
z 60
dtype: int64
In [20]:
sr+10
Out[20]:
x 210
y 2010
z 40
dtype: int64
In [21]:
import numpy as np
np.log(sr)
Out[21]:
x 5.298317
y 7.600902
z 3.401197
dtype: float64
In [22]:
sr
Out[22]:
x 200
y 2000
z 30
dtype: int64
In [ ]:
Defining Series from NumPy Arrays
In [23]:
import numpy as np
arr = np.array([1, 2, 3, 4])
sr2 = pd.Series(arr)
sr2
Out[23]:
0 1
1 2
2 3
3 4
dtype: int32
In [ ]:
NaN Values
In [24]:
import numpy as np
import pandas as pd
sr3 = pd.Series([5, -3, np.nan, 14])  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
sr3
Out[24]:
0 5.0
1 -3.0
2 NaN
3 14.0
dtype: float64
In [25]:
sr3.isnull()
Out[25]:
0 False
1 False
2 True
3 False
dtype: bool
In [26]:
sr3.notnull()
Out[26]:
0 True
1 True
2 False
3 True
dtype: bool
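The boolean Series returned by `isnull()` and `notnull()` can be used directly for counting and filtering. A short sketch, reusing the same values as `sr3` (the names `n_missing` and `present` are illustrative):

```python
import numpy as np
import pandas as pd

sr3 = pd.Series([5, -3, np.nan, 14])

# True counts as 1, so summing the mask counts the missing values.
n_missing = int(sr3.isnull().sum())

# A boolean mask keeps only the rows where the mask is True.
present = sr3[sr3.notnull()]

print(n_missing)
print(present)
```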
In [27]:
# Create a simple Pandas Series from a dictionary:
# The keys of the dictionary become the labels.
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
sr4 = pd.Series(calories)
print(sr4)
day1 420
day2 380
day3 390
dtype: int64
In [30]:
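# To select only some of the items in the dictionary, use the index argument
# and list only the keys you want to include in the Series.
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
sr5 = pd.Series(calories, index=["day1", "day2"])
print(sr5)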
day1 420
day2 380
dtype: int64
In [32]:
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print(s)
# Observe: the index order is preserved and the missing element is filled with NaN.
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
In [29]:
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print (s)
0 5
1 5
2 5
3 5
dtype: int64
In [ ]:
Methods on Index
In [33]:
import pandas as pd
ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
ser.idxmin()
Out[33]:
'blue'
In [34]:
import pandas as pd
ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
ser.idxmax()
Out[34]:
'white'
In [35]:
serd = pd.Series(range(6), index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
serd
Out[35]:
white 0
white 1
blue 2
green 3
green 4
yellow 5
dtype: int64
In [36]:
serd.index.is_unique
Out[36]:
False
In [ ]:
Other Functionalities on Indexes
In [37]:
ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])
ser
Out[37]:
one 2
two 5
three 7
four 4
dtype: int64
In [38]:
ser.reindex(['three', 'four', 'five', 'one'])
Out[38]:
three 7.0
four 4.0
five NaN
one 2.0
dtype: float64
In [39]:
ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])
ser3
Out[39]:
0 1
3 5
5 6
6 3
dtype: int64
In [40]:
ser3.reindex(range(6), method='ffill')
#‘ffill’ stands for ‘forward fill’
#and will propagate last valid observation forward.
Out[40]:
0 1
1 1
2 1
3 5
4 5
5 6
dtype: int64
In [41]:
ser3.reindex(range(6), method='bfill')
# 'bfill' stands for 'backward fill':
# each missing label takes the next valid observation.
Out[41]:
0 1
1 5
2 5
3 5
4 6
5 6
dtype: int64
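Besides `ffill` and `bfill`, `reindex` also accepts a constant through its `fill_value` parameter. A minimal sketch using the same `ser3` values (the name `filled` is illustrative):

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# Labels 1, 2 and 4 do not exist in ser3, so they receive the constant 0.
# Label 6 is dropped because it is not in the new index.
filled = ser3.reindex(range(6), fill_value=0)

print(filled)
```

Because no NaN is introduced, the integer dtype is kept.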
In [ ]:
DataFrame:
In [7]:
#Create a simple Pandas DataFrame:
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
calories duration
0 420 50
1 380 40
2 390 45
In [10]:
# Pandas uses the loc attribute to return one or more specified rows.
# Refer to the row index:
print(df.loc[0])
calories 420
duration 50
Name: 0, dtype: int64
In [11]:
#use a list of indexes:
print(df.loc[[0, 1]])
calories duration
0 420 50
1 380 40
In [12]:
#Named Indexes
#With the index argument, you can name your own indexes.
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
calories duration
day1 420 50
day2 380 40
day3 390 45
In [13]:
##Locate Named Indexes
#Use the named index in the loc attribute to return the specified row(s).
#Return "day2":
#refer to the named index:
print(df.loc["day2"])
calories 380
duration 40
Name: day2, dtype: int64
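`loc` also accepts a column label as a second argument, and `iloc` does the same by position. A small sketch, assuming the same `day1`..`day3` frame as above (the names `row`, `cell`, and `subset` are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"calories": [420, 380, 390], "duration": [50, 40, 45]},
    index=["day1", "day2", "day3"],
)

row = df.iloc[1]                                # second row, by position
cell = df.loc["day2", "duration"]               # one value: row label, column label
subset = df.loc[["day1", "day3"], ["calories"]]  # lists return a DataFrame

print(cell)
print(subset)
```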
In [51]:
df1 = pd.DataFrame(data, columns= ["calories"])
print(df1)
calories
0 420
1 380
2 390
In [12]:
import pandas as pd
data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)
df.columns
calories duration
day1 420 50
day2 380 40
day3 390 45
Out[12]:
Index(['calories', 'duration'], dtype='object')
In [10]:
df.index
In [9]:
df.values
Out[9]:
array([[420, 50],
[380, 40],
[390, 45]], dtype=int64)
In [14]:
df['duration']=100
df.values
Out[14]:
array([[420, 100],
[380, 100],
[390, 100]], dtype=int64)
In [15]:
df['duration']=[40,50,45]
df.values
Out[15]:
array([[420, 40],
[380, 50],
[390, 45]], dtype=int64)
In [20]:
#Deleting a Column
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
del df['duration']
print(df)
calories duration
0 420 50
1 380 40
2 390 45
calories
0 420
1 380
2 390
In [ ]:
#Transposition of a DataFrame
In [22]:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
df_tran=df.T
print(df_tran)
calories duration
0 420 50
1 380 40
2 390 45
0 1 2
calories 420 380 390
duration 50 40 45
In [ ]:
Dropping
In [42]:
ser = pd.Series(np.arange(4.), index=['red', 'blue', 'yellow', 'white'])
ser
Out[42]:
red 0.0
blue 1.0
yellow 2.0
white 3.0
dtype: float64
In [40]:
ser.drop('yellow')
Out[40]:
red 0.0
blue 1.0
white 3.0
dtype: float64
In [6]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball', 'pen', 'pencil', 'paper'])
frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
In [6]:
frame.drop(['blue','yellow'])
ball pen pencil paper
red 0 1 2 3
white 12 13 14 15
In [9]:
frame.drop(['pen','pencil'],axis=1)
ball paper
red 0 3
blue 4 7
yellow 8 11
white 12 15
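Note that `drop` returns a new object and leaves the original untouched. As a sketch (the name `trimmed` is illustrative), rows and columns can also be dropped in one call with the `index` and `columns` keywords:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["red", "blue", "yellow", "white"],
    columns=["ball", "pen", "pencil", "paper"],
)

# Drop one row and one column at once; frame itself is not modified.
trimmed = frame.drop(index=["blue"], columns=["pen"])

print(trimmed)
print(frame.shape)
```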
In [ ]:
Arithmetic and Data Alignment
In [10]:
s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
print(s1)
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
print(s2)
s1 + s2
white 3
yellow 2
green 5
blue 1
dtype: int64
white 1
yellow 4
black 7
blue 2
brown 1
dtype: int64
Out[10]:
black NaN
blue 3.0
brown NaN
green NaN
white 4.0
yellow 6.0
dtype: float64
In [43]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index=['blue', 'green', 'white', 'yellow'],
columns=['mug','pen','ball'])
In [44]:
print(frame1)
In [45]:
print(frame2)
In [46]:
print(frame1+frame2)
In [48]:
#Flexible Arithmetic Methods
frame3=frame1.add(frame2)
print(frame3)
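One reason to prefer the flexible methods over the `+` operator is the `fill_value` parameter. The sketch below assumes the same `frame1` and `frame2` as above (the names `plain` and `filled` are illustrative): where only one frame has a value, the other is treated as 0, and a cell stays NaN only when it is missing from both.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["red", "blue", "yellow", "white"],
    columns=["ball", "pen", "pencil", "paper"],
)
frame2 = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=["blue", "green", "white", "yellow"],
    columns=["mug", "pen", "ball"],
)

plain = frame1 + frame2                    # NaN wherever either side is missing
filled = frame1.add(frame2, fill_value=0)  # a missing side counts as 0

print(plain)
print(filled)
```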
In [ ]:
Operations between Data Structures
Operations between DataFrame and Series
In [50]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball','pen','pencil','paper'])
frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
In [51]:
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
ser
Out[51]:
ball 0
pen 1
pencil 2
paper 3
dtype: int32
In [52]:
print(frame - ser)
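With the `-` operator, the Series index is matched against the DataFrame columns and the subtraction is repeated for every row. To match against the row labels instead, the flexible method `sub` with `axis=0` can be used. A small sketch with the same `frame` and `ser` (the names `by_columns` and `by_rows` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["red", "blue", "yellow", "white"],
    columns=["ball", "pen", "pencil", "paper"],
)
ser = pd.Series(np.arange(4), index=["ball", "pen", "pencil", "paper"])

# Series labels align with the column names; each row has ser subtracted.
by_columns = frame - ser

# Subtract the 'ball' column from every column, aligning on row labels.
by_rows = frame.sub(frame["ball"], axis=0)

print(by_columns)
print(by_rows)
```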
In [10]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball','pen','pencil','paper'])
print(frame)
In [11]:
print(frame.sum())
ball 24
pen 28
pencil 32
paper 36
dtype: int64
In [61]:
print(frame.mean())
ball 6.0
pen 7.0
pencil 8.0
paper 9.0
dtype: float64
In [62]:
print(frame.describe())
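`sum()` and `mean()` work down each column by default. As a sketch with the same `frame` (the names `row_sums` and `row_means` are illustrative), passing `axis=1` computes them across each row instead:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["red", "blue", "yellow", "white"],
    columns=["ball", "pen", "pencil", "paper"],
)

row_sums = frame.sum(axis=1)    # one value per row label
row_means = frame.mean(axis=1)

print(row_sums)
print(row_means)
```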
In [ ]:
Sorting and Ranking
In [46]:
ser = pd.Series([5, 0, 3, 8, 4], index=['red','blue','yellow','white','green'])
ser
Out[46]:
red 5
blue 0
yellow 3
white 8
green 4
dtype: int64
In [47]:
ser.sort_index()
Out[47]:
blue 0
green 4
red 5
white 8
yellow 3
dtype: int64
In [48]:
ser.sort_index(ascending=False)
Out[48]:
yellow 3
white 8
red 5
green 4
blue 0
dtype: int64
In [52]:
ser.sort_values()
Out[52]:
blue 0
yellow 3
green 4
red 5
white 8
dtype: int64
In [55]:
ser.rank()
Out[55]:
red 4.0
blue 1.0
yellow 2.0
white 5.0
green 3.0
dtype: float64
In [57]:
ser.rank(ascending=False)
Out[57]:
red 2.0
blue 5.0
yellow 4.0
white 1.0
green 3.0
dtype: float64
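The ranks above are floats because `rank()` averages the positions of tied values by default. The Series in the notebook has no ties, so the sketch below uses a small made-up Series (`tied` is hypothetical) to show how the `method` parameter changes tie handling:

```python
import pandas as pd

tied = pd.Series([3, 1, 3], index=["a", "b", "c"])

average = tied.rank()               # ties share the mean of their positions
lowest = tied.rank(method="min")    # ties share the lowest position
first = tied.rank(method="first")   # ties are ranked in order of appearance

print(average.tolist())
print(lowest.tolist())
print(first.tolist())
```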
In [49]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball','pen','pencil','paper'])
frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
In [50]:
frame.sort_index()
ball pen pencil paper
blue 4 5 6 7
red 0 1 2 3
white 12 13 14 15
yellow 8 9 10 11
In [51]:
frame.sort_index(axis=1)
ball paper pen pencil
red 0 3 1 2
blue 4 7 5 6
yellow 8 11 9 10
white 12 15 13 14
In [53]:
frame.sort_values(by='pen')
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
In [ ]:
Filtering Out NaN Values
In [72]:
ser = pd.Series([0, 1, np.nan, 3, 9], index=['red','blue','yellow','white','green'])
print(ser)
ser.dropna()
red 0.0
blue 1.0
yellow NaN
white 3.0
green 9.0
dtype: float64
Out[72]:
red 0.0
blue 1.0
white 3.0
green 9.0
dtype: float64
In [67]:
frame3 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
index=['blue','green','red'],
columns=['ball','mug','pen'])
frame3
In [68]:
frame3.dropna()
In [69]:
frame3.dropna(how='all')
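The outputs of the two `dropna` calls are not shown above, so here is a small sketch of what each variant keeps, using the same `frame3` (the names `any_nan`, `all_nan`, and `no_empty_cols` are illustrative). The last call, with `axis=1`, applies the same rule to columns instead of rows.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame(
    [[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
    index=["blue", "green", "red"],
    columns=["ball", "mug", "pen"],
)

any_nan = frame3.dropna()                         # drop rows with at least one NaN
all_nan = frame3.dropna(how="all")                # drop rows that are entirely NaN
no_empty_cols = frame3.dropna(axis=1, how="all")  # drop columns that are entirely NaN

print(any_nan)
print(all_nan)
print(no_empty_cols)
```

Since every row has a NaN in the `mug` column, the default `dropna()` removes all rows here.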
In [ ]:
Filling in NaN Occurrences
In [70]:
frame3.fillna(0)
Out[70]: ball mug pen
In [71]:
frame3.fillna({'ball':1, 'mug':0, 'pen': 99})
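Instead of hand-picked constants, the per-column replacement values can be computed from the data. A sketch under the assumption that filling with the column mean is acceptable for the analysis at hand (the name `filled` is illustrative):

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame(
    [[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
    index=["blue", "green", "red"],
    columns=["ball", "mug", "pen"],
)

# frame3.mean() is a Series indexed by column name, so each column
# is filled with its own mean. 'mug' has no values, so it stays NaN.
filled = frame3.fillna(frame3.mean())

print(filled)
```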
In [94]:
frame3.stack()
In [ ]:
Hierarchical Indexing and Levelling:
In [ ]:
import numpy as np
import pandas as pd
In [73]:
data=pd.Series(np.random.randn(8), index=[['a','a','a','b','b','c','c','c'],[1,2,3,1,2,1,2,3]])
data
Out[73]:
a 1 0.057011
2 1.585588
3 0.532421
b 1 -1.145096
2 -0.762860
c 1 2.040339
2 0.661820
3 0.562723
dtype: float64
In [ ]:
What is MultiIndex?
A MultiIndex (hierarchical index) lets an axis carry two or more levels of labels, so you can select whole groups of rows or columns by an outer label.
To understand MultiIndex, let’s look at the index of the data in the above example.
In [75]:
data.index
Out[75]:
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 2),
('c', 1),
('c', 2),
('c', 3)],
)
In [77]:
data['a']
Out[77]:
1 0.057011
2 1.585588
3 0.532421
dtype: float64
In [78]:
data['a':'b']
Out[78]:
a 1 0.057011
2 1.585588
3 0.532421
b 1 -1.145096
2 -0.762860
dtype: float64
In [89]:
print(data[:,2])
a 1.585588
b -0.762860
c 0.661820
dtype: float64
In [87]:
print(data[:,1])
a 0.057011
b -1.145096
c 2.040339
dtype: float64
In [90]:
data.unstack()
Out[90]: 1 2 3
In [93]:
data.unstack().stack()
Out[93]:
a 1 0.057011
2 1.585588
3 0.532421
b 1 -1.145096
2 -0.762860
c 1 2.040339
2 0.661820
3 0.562723
dtype: float64
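The selections above can also be written with `loc` and `xs`, which state the intent more explicitly. The sketch below uses fixed integer values instead of random ones so the results are reproducible (the names `one_value`, `inner`, and `wide` are illustrative):

```python
import pandas as pd

data = pd.Series(
    range(8),
    index=[["a", "a", "a", "b", "b", "c", "c", "c"],
           [1, 2, 3, 1, 2, 1, 2, 3]],
)

one_value = data.loc[("b", 2)]  # a full key tuple returns a single value
inner = data.xs(2, level=1)     # cross-section: inner label 2 for every outer label
wide = data.unstack()           # inner level becomes columns; missing cells are NaN

print(one_value)
print(inner)
print(wide)
```

The pair `('b', 3)` does not exist in the index, so that cell is NaN after `unstack()`.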
In [ ]:
DataFrames can have hierarchical indexes.
In [3]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(12).reshape(4,3),
index=[['a','a','b','b'],[1,2,1,2]],
columns=[['anl','anl','theo'],['maths','stat','geo']])
df
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [4]:
df.index.names=['class','exam-no']
df.columns.names=['paper','title']
df
class exam-no
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [ ]:
Selecting in Hierarchical Indexing
You can select subgroups of data. For example, let’s select the columns grouped under the outer label "anl".
In [5]:
df['anl']
class exam-no
a 1 0 1
2 3 4
b 1 6 7
2 9 10
In [ ]:
What is Swaplevel?
Sometimes, you may want to swap the levels of the index. You can use the swaplevel method.
The swaplevel method takes two levels and returns a new object.
In [6]:
df.swaplevel('class', 'exam-no')
exam-no class
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
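`swaplevel` only exchanges the levels; the rows keep their original order, as the output above shows. A small sketch (the names `swapped` and `ordered` are illustrative) that follows it with `sort_index` so the new outer level is grouped:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["anl", "anl", "theo"], ["maths", "stat", "geo"]],
)
df.index.names = ["class", "exam-no"]

swapped = df.swaplevel("class", "exam-no")  # new object; df is unchanged
ordered = swapped.sort_index()              # sort by exam-no, then class

print(swapped.index.tolist())
print(ordered.index.tolist())
```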
In [ ]:
Sorting in Hierarchical Indexing
To sort the indexes by level, you can use the sort_index method.
For example, let’s sort the dataset by level 1.
In [7]:
df.sort_index(level=1)
class exam-no
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
In [ ]:
Summary Statistics in Hierarchical Indexing
Summary statistics on a Series or DataFrame can be computed one level at a time.
If the index has more than one level,
you can calculate the statistics separately for each value of a chosen level.
In [8]:
# The level keyword of sum() is deprecated and was removed in pandas 2.0;
# group by the index level instead.
df.groupby(level='exam-no').sum()
Out[8]: paper anl theo
exam-no
1 6 8 10
2 12 14 16
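The same `groupby(level=...)` pattern works for any index level and any aggregation. A sketch with the same frame (the names `by_class` and `exam_means` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["anl", "anl", "theo"], ["maths", "stat", "geo"]],
)
df.index.names = ["class", "exam-no"]
df.columns.names = ["paper", "title"]

by_class = df.groupby(level="class").sum()       # totals per class
exam_means = df.groupby(level="exam-no").mean()  # averages per exam number

print(by_class)
print(exam_means)
```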