
Pandas - Datastructures

Uploaded by

saraga
Copyright
© © All Rights Reserved

In [ ]:

Pandas is a Python library used for working with data sets.

It has functions for analyzing, cleaning, exploring, and manipulating data.

The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Pandas is mainly used to analyze data. The module is generally imported as follows:

import pandas as pd

In [ ]:
Why use Pandas?
Relevant data is very important in data science.

Pandas allows us to analyze large data sets and draw conclusions based on statistical theory.

Pandas can clean messy data sets and make them readable and relevant.

In [ ]:
Pandas provides two main data structures for manipulating data:

Series
DataFrame

These data structures are built on top of NumPy arrays, which means they are fast.

Series - 1D labeled homogeneous array; values are mutable, size is immutable.

DataFrame - 2D labeled tabular structure with potentially heterogeneously typed columns; values are mutable, size is mutable.

DataFrame is widely used and one of the most important data structures.
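
As a rough illustration of the two structures described above, the sketch below builds one of each. The labels, column names, and values are made up for this example.

```python
import pandas as pd

# A Series holds values of a single dtype under one labeled axis.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# A DataFrame may hold a different dtype in each column.
df = pd.DataFrame({"count": [1, 2, 3], "price": [0.5, 1.0, 1.5]})

# Values are mutable: overwrite one element in place.
s["b"] = 10.0

# A DataFrame is size-mutable: a new column can be added in place.
df["in_stock"] = [True, False, True]

print(s.dtype)    # one dtype for the whole Series
print(df.dtypes)  # one dtype per column
print(df.shape)
```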

In [ ]:
Pandas Series:

A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type.

In [2]:
# Create a simple Pandas Series from a list.
# If nothing else is specified, the values are labeled with their index number:
# the first value has index 0, the second value has index 1, and so on.

import pandas as pd

a = [10, 20, 30]

sr = pd.Series(a)

print(sr)

0 10
1 20
2 30
dtype: int64

In [2]:
# This label can be used to access a specified value.
# Return the first value of the Series:
print(sr[0])

10

In [3]:
# With the index argument, you can name your own labels.
# Create your own labels:
import pandas as pd

sr = pd.Series([10,20,30],index = ["x", "y", "z"])

print(sr)

x 10
y 20
z 30
dtype: int64

In [4]:
# we can access an item by referring to the label.
# Return the value of "y":

print(sr["y"])

20

In [5]:
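# We can also access an item by its position.
# Return the value at position 1 (the second value). Plain sr[1] relies on a
# positional fallback that recent pandas versions deprecate, so use iloc:

print(sr.iloc[1])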

20
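
The two access styles shown above can be made explicit with `loc` (by label) and `iloc` (by position). The following is a minimal sketch that reuses the same labeled Series; the variable names are only illustrative.

```python
import pandas as pd

sr = pd.Series([10, 20, 30], index=["x", "y", "z"])

by_label = sr.loc["y"]         # select by index label
by_position = sr.iloc[1]       # select by 0-based integer position
several = sr.loc[["x", "z"]]   # a list of labels returns a Series

print(by_label, by_position)
print(several)
```

Spelling out `loc` or `iloc` avoids any ambiguity when the index labels are themselves integers.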

In [28]:
sr.index

Out[28]:
Index(['x', 'y', 'z'], dtype='object')

In [27]:
sr.values

Out[27]:
array([10, 20, 30], dtype=int64)

In [16]:
# Assigning values to the elements (by position)
sr.iloc[0] = 200
sr

Out[16]:
x    200
y     20
z     30
dtype: int64

In [17]:
# Assigning values to the elements (by label)
sr['y'] = 2000
sr

Out[17]:
x     200
y    2000
z      30
dtype: int64

In [18]:
# Operations and mathematical functions

sr / 2

Out[18]:
x     100.0
y    1000.0
z      15.0
dtype: float64

In [19]:
sr*2

Out[19]:
x     400
y    4000
z      60
dtype: int64

In [20]:
sr+10

Out[20]:
x     210
y    2010
z      40
dtype: int64

In [21]:
import numpy as np
np.log(sr)

Out[21]:
x    5.298317
y    7.600902
z    3.401197
dtype: float64

In [22]:
sr

Out[22]:
x     200
y    2000
z      30
dtype: int64

In [ ]:
Defining Series from NumPy Arrays

In [23]:
import numpy as np
arr = np.array([1, 2, 3, 4])
sr2 = pd.Series(arr)
sr2

Out[23]:
0    1
1    2
2    3
3    4
dtype: int32

In [ ]:
NaN Values

In [24]:
import numpy as np
import pandas as pd
sr3 = pd.Series([5, -3, np.nan, 14])
sr3

Out[24]:
0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [25]:
sr3.isnull()

Out[25]:
0    False
1    False
2     True
3    False
dtype: bool

In [26]:
sr3.notnull()

Out[26]:
0     True
1     True
2    False
3     True
dtype: bool
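
The boolean masks returned by `isnull()` and `notnull()` are usually combined with other operations. The sketch below, which assumes the same four values as above, counts the missing entries and keeps only the valid ones.

```python
import numpy as np
import pandas as pd

sr3 = pd.Series([5, -3, np.nan, 14])

# The integers are stored as floats because NaN is itself a float value.
print(sr3.dtype)

# A boolean mask sums to the number of True entries.
n_missing = int(sr3.isnull().sum())

# Boolean indexing with notnull() keeps only the valid entries.
valid = sr3[sr3.notnull()]

print(n_missing)
print(valid)
```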

In [27]:
# Create a simple Pandas Series from a dictionary:
# The keys of the dictionary become the labels.

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

sr4 = pd.Series(calories)

print(sr4)

day1 420
day2 380
day3 390
dtype: int64

In [30]:
# To select only some of the items in the dictionary, use the index argument.

#Create a Series using only data from "day1" and "day2":

import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

sr5 = pd.Series(calories, index = ["day1", "day2"])

print(sr5)

day1 420
day2 380
dtype: int64
In [32]:
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data,index=['b','c','d','a'])
print (s)

# Observe: the index order is preserved and the missing element is filled with NaN.

b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

In [29]:
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print (s)

0 5
1 5
2 5
3 5
dtype: int64

In [ ]:
Methods on Index

In [33]:
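import pandas as pd
# The fifth label is cut off in the source; 'green' is assumed here.
ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
ser.idxmin()

Out[33]:
'blue'

In [34]:
import pandas as pd
# The fifth label is cut off in the source; 'green' is assumed here.
ser = pd.Series([5, 0, 3, 8, 4], index=['red', 'blue', 'yellow', 'white', 'green'])
ser.idxmax()

Out[34]:
'white'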

In [35]:
serd = pd.Series(range(6), index=['white', 'white', 'blue', 'green', 'green', 'yellow'])
serd

Out[35]:
white     0
white     1
blue      2
green     3
green     4
yellow    5
dtype: int64

In [36]:
serd.index.is_unique

Out[36]:
False
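
Duplicate labels change what a lookup returns. As a small sketch using the same labels as above, selecting a duplicated label gives back a Series, while a unique label gives a single value.

```python
import pandas as pd

serd = pd.Series(range(6),
                 index=['white', 'white', 'blue', 'green', 'green', 'yellow'])

whites = serd['white']   # duplicated label: a Series with both matches
blue = serd['blue']      # unique label: a single scalar

print(whites)
print(blue)

# Boolean array marking every repeat of an earlier label.
print(serd.index.duplicated())
```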

In [ ]:
Other Functionalities on Indexes

In [37]:
ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])
ser

Out[37]:
one      2
two      5
three    7
four     4
dtype: int64

In [38]:
ser.reindex(['three', 'four', 'five', 'one'])

Out[38]:
three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64

In [39]:
ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])
ser3

Out[39]:
0    1
3    5
5    6
6    3
dtype: int64

In [40]:
ser3.reindex(range(6), method='ffill')
# 'ffill' stands for 'forward fill':
# it propagates the last valid observation forward.

Out[40]:
0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

In [41]:
ser3.reindex(range(6), method='bfill')
# 'bfill' stands for 'backward fill':
# each new label takes the next valid observation.

Out[41]:
0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64
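
For comparison, the sketch below reindexes the same Series without a fill method and with a constant `fill_value`, next to the forward fill shown above. The variable names are only illustrative.

```python
import pandas as pd

ser3 = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# No fill: labels missing from the original index become NaN.
plain = ser3.reindex(range(6))

# Constant fill: missing labels receive the given value instead.
zeros = ser3.reindex(range(6), fill_value=0)

# Forward fill: missing labels copy the previous valid observation.
ffilled = ser3.reindex(range(6), method='ffill')

print(plain)
print(zeros)
print(ffilled)
```

Note that label 6 is dropped in all three results, because `reindex` keeps only the labels that are requested.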

In [ ]:
DataFrame:

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [7]:
#Create a simple Pandas DataFrame:
import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

#load data into a DataFrame object:


df = pd.DataFrame(data)

print(df)

calories duration
0 420 50
1 380 40
2 390 45

In [10]:
# Pandas uses the loc attribute to return one or more specified row(s).
# Refer to the row index:

print(df.loc[0])

calories 420
duration 50
Name: 0, dtype: int64

In [11]:
#use a list of indexes:
print(df.loc[[0, 1]])

calories duration
0 420 50
1 380 40

In [12]:
#Named Indexes
#With the index argument, you can name your own indexes.

#Add a list of names to give each row a name:

import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)

calories duration
day1 420 50
day2 380 40
day3 390 45

In [13]:
##Locate Named Indexes
#Use the named index in the loc attribute to return the specified row(s).

#Return "day2":
#refer to the named index:
print(df.loc["day2"])

calories 380
duration 40
Name: day2, dtype: int64

In [51]:
df1 = pd.DataFrame(data, columns= ["calories"])

print(df1)

calories
0 420
1 380
2 390

In [12]:
import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df)
df.columns

      calories  duration
day1       420        50
day2       380        40
day3       390        45

Out[12]:
Index(['calories', 'duration'], dtype='object')

In [10]:
df.index

Out[10]:
Index(['day1', 'day2', 'day3'], dtype='object')

In [9]:
df.values

Out[9]:
array([[420,  50],
       [380,  40],
       [390,  45]], dtype=int64)

In [14]:
df['duration']=100
df.values

Out[14]:
array([[420, 100],
       [380, 100],
       [390, 100]], dtype=int64)

In [15]:
df['duration']=[40,50,45]
df.values

Out[15]:
array([[420,  40],
       [380,  50],
       [390,  45]], dtype=int64)
In [20]:
#Deleting a Column
import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data)

print(df)

del df['duration']
print(df)

calories duration
0 420 50
1 380 40
2 390 45
calories
0 420
1 380
2 390

In [ ]:
#Transposition of a DataFrame

In [22]:
import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

df = pd.DataFrame(data)

print(df)

df_tran=df.T
print(df_tran)

calories duration
0 420 50
1 380 40
2 390 45
0 1 2
calories 420 380 390
duration 50 40 45

In [ ]:
Dropping

In [42]:
ser = pd.Series(np.arange(4.), index=['red', 'blue', 'yellow', 'white'])
ser

Out[42]:
red       0.0
blue      1.0
yellow    2.0
white     3.0
dtype: float64

In [40]:
ser.drop('yellow')

Out[40]:
red      0.0
blue     1.0
white    3.0
dtype: float64

In [6]:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball', 'pen', 'pencil', 'paper'])
frame

Out[6]:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

In [6]:
frame.drop(['blue','yellow'])

Out[6]:
       ball  pen  pencil  paper
red       0    1       2      3
white    12   13      14     15

In [9]:
frame.drop(['pen','pencil'],axis=1)

Out[9]:
        ball  paper
red        0      3
blue       4      7
yellow     8     11
white     12     15
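
The `drop` calls above return new objects and leave `frame` unchanged, which is why every output still starts from the full table. A minimal sketch, assuming the same frame, drops a row and a column in one call using the `index` and `columns` keywords:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# drop returns a new DataFrame; the original keeps all rows and columns.
smaller = frame.drop(index=['blue'], columns=['pen'])

print(frame.shape)
print(smaller.shape)
print(smaller)
```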

In [ ]:
Arithmetic and Data Alignment

In [10]:
s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
print(s1)
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])
print(s2)
s1 + s2

white 3
yellow 2
green 5
blue 1
dtype: int64
white 1
yellow 4
black 7
blue 2
brown 1
dtype: int64
Out[10]:
black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64

In [43]:
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index=['blue', 'green', 'white', 'yellow'],
columns=['mug','pen','ball'])

In [44]:
print(frame1)

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

In [45]:
print(frame2)

        mug  pen  ball
blue      0    1     2
green     3    4     5
white     6    7     8
yellow    9   10    11

In [46]:
print(frame1+frame2)

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN

In [48]:
#Flexible Arithmetic Methods
frame3=frame1.add(frame2)
print(frame3)

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN
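
Called without extra arguments, `add` gives the same result as `+`. One reason to prefer the method form is its `fill_value` parameter, sketched below on the same two frames: a cell that exists in only one frame is combined with the fill value, while a cell missing from both frames stays NaN.

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])
frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

# Treat a value missing from exactly one frame as 0 before adding.
filled = frame1.add(frame2, fill_value=0)

print(filled)
```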

In [ ]:
Operations between Data Structures
Operations between DataFrame and Series

In [50]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball','pen','pencil','paper'])
frame

Out[50]:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

In [51]:
ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])
ser

Out[51]:
ball      0
pen       1
pencil    2
paper     3
dtype: int32

In [52]:
print(frame - ser)

        ball  pen  pencil  paper
red        0    0       0      0
blue       4    4       4      4
yellow     8    8       8      8
white     12   12      12     12
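
In `frame - ser` the Series index is matched against the DataFrame columns and the subtraction is repeated for every row. To match on the row index instead, one option is the `sub` method with `axis=0`. The sketch below subtracts the `ball` column from every column; that choice of column is only for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

col = frame['ball']   # a Series indexed by the row labels

# axis=0 aligns the Series with the row index rather than the columns.
by_rows = frame.sub(col, axis=0)

print(by_rows)
```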

In [10]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball','pen','pencil','paper'])
print(frame)

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

In [11]:
print(frame.sum())

ball 24
pen 28
pencil 32
paper 36
dtype: int64

In [61]:
print(frame.mean())

ball 6.0
pen 7.0
pencil 8.0
paper 9.0
dtype: float64

In [62]:
print(frame.describe())

            ball        pen     pencil      paper
count   4.000000   4.000000   4.000000   4.000000
mean    6.000000   7.000000   8.000000   9.000000
std     5.163978   5.163978   5.163978   5.163978
min     0.000000   1.000000   2.000000   3.000000
25%     3.000000   4.000000   5.000000   6.000000
50%     6.000000   7.000000   8.000000   9.000000
75%     9.000000  10.000000  11.000000  12.000000
max    12.000000  13.000000  14.000000  15.000000
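
The statistics above are computed per column, which is the default. Passing `axis=1` computes them per row instead, as in this short sketch on the same frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

row_totals = frame.sum(axis=1)    # one total per row label
row_means = frame.mean(axis=1)    # one mean per row label

print(row_totals)
print(row_means)
```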

In [ ]:
Sorting and Ranking

In [46]:
ser = pd.Series([5, 0, 3, 8, 4], index=['red','blue','yellow','white','green'])
ser

Out[46]:
red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

In [47]:
ser.sort_index()

Out[47]:
blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

In [48]:
ser.sort_index(ascending=False)

Out[48]:
yellow    3
white     8
red       5
green     4
blue      0
dtype: int64

In [52]:
ser.sort_values()

Out[52]:
blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

In [55]:
ser.rank()

Out[55]:
red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

In [57]:
ser.rank(ascending=False)

Out[57]:
red       2.0
blue      5.0
yellow    4.0
white     1.0
green     3.0
dtype: float64
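
The ranks come back as floats because tied values share the average of their ranks by default. The sketch below uses a made-up Series with one tie to compare the default with two of the other `method` options.

```python
import pandas as pd

# Hypothetical data with a tie between 'a' and 'b'.
tied = pd.Series([7, 7, 3], index=['a', 'b', 'c'])

average = tied.rank()               # ties share the mean of their ranks
lowest = tied.rank(method='min')    # ties all take the lowest rank
first = tied.rank(method='first')   # ties are ranked in order of appearance

print(average)
print(lowest)
print(first)
```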

In [49]:
frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red', 'blue', 'yellow', 'white'],
columns=['ball','pen','pencil','paper'])
frame

Out[49]:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

In [50]:
frame.sort_index()

Out[50]:
        ball  pen  pencil  paper
blue       4    5       6      7
red        0    1       2      3
white     12   13      14     15
yellow     8    9      10     11

In [51]:
frame.sort_index(axis=1)

Out[51]:
        ball  paper  pen  pencil
red        0      3    1       2
blue       4      7    5       6
yellow     8     11    9      10
white     12     15   13      14

In [53]:
frame.sort_values(by='pen')

Out[53]:
        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

In [ ]:
Filtering Out NaN Values

In [72]:
ser = pd.Series([0, 1, np.nan, 3, 9], index=['red','blue','yellow','white','green'])
print(ser)
ser.dropna()

red       0.0
blue      1.0
yellow    NaN
white     3.0
green     9.0
dtype: float64

Out[72]:
red      0.0
blue     1.0
white    3.0
green    9.0
dtype: float64

In [67]:
frame3 = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
index=['blue','green','red'],
columns=['ball','mug','pen'])
frame3

Out[67]:
       ball  mug  pen
blue    6.0  NaN  6.0
green   NaN  NaN  NaN
red     2.0  NaN  5.0

In [68]:
# By default, dropna removes every row that contains at least one NaN,
# so no rows remain here.
frame3.dropna()

Out[68]:
       ball  mug  pen

In [69]:
frame3.dropna(how='all')

Out[69]:
       ball  mug  pen
blue    6.0  NaN  6.0
red     2.0  NaN  5.0
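
`dropna` can also work on columns or use a minimum count of valid values. The sketch below, assuming the same `frame3`, drops the all-NaN column with `axis=1` and keeps rows with at least two valid values via `thresh`.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# Drop columns in which every value is NaN ('mug').
cols = frame3.dropna(axis=1, how='all')

# Keep only rows that have at least 2 non-NaN values.
rows = frame3.dropna(thresh=2)

print(cols)
print(rows)
```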

In [ ]:
Filling in NaN Occurrences

In [70]:
frame3.fillna(0)

Out[70]:
       ball  mug  pen
blue    6.0  0.0  6.0
green   0.0  0.0  0.0
red     2.0  0.0  5.0

In [71]:
frame3.fillna({'ball':1, 'mug':0, 'pen': 99})

Out[71]:
       ball  mug   pen
blue    6.0  0.0   6.0
green   1.0  0.0  99.0
red     2.0  0.0   5.0
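
Instead of fixed numbers, the per-column fill values can be computed from the data. A common choice, shown here only as a sketch, is to fill each column with its own mean; a column with no valid values has no mean and therefore stays NaN.

```python
import numpy as np
import pandas as pd

frame3 = pd.DataFrame([[6, np.nan, 6], [np.nan, np.nan, np.nan], [2, np.nan, 5]],
                      index=['blue', 'green', 'red'],
                      columns=['ball', 'mug', 'pen'])

# frame3.mean() is a Series of column means; fillna matches it by column label.
filled = frame3.fillna(frame3.mean())

print(filled)
```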

In [94]:
frame3.stack()

Out[94]:
blue  ball    6.0
      pen     6.0
red   ball    2.0
      pen     5.0
dtype: float64

In [ ]:
Hierarchical Indexing and Levelling:

Hierarchical indexing allows us to use multiple index levels on an axis. It is also known as multi-level indexing (MultiIndex).

In [ ]:
import numpy as np
import pandas as pd

In [73]:
data=pd.Series(np.random.randn(8), index=[['a','a','a','b','b','c','c','c'], [1,2,3,1,2,1,2,3]])
data

Out[73]:
a  1    0.057011
   2    1.585588
   3    0.532421
b  1   -1.145096
   2   -0.762860
c  1    2.040339
   2    0.661820
   3    0.562723
dtype: float64

In [ ]:
What is MultiIndex?
A MultiIndex lets each row or column carry more than one level of labels, so you can select data by an outer level, an inner level, or both.
To understand MultiIndex, let's look at the index of the data in the example above.

In [75]:
data.index

Out[75]:
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2),
            ('c', 3)],
           )

In [77]:
data['a']

Out[77]:
1    0.057011
2    1.585588
3    0.532421
dtype: float64

In [78]:
data['a':'b']

Out[78]:
a  1    0.057011
   2    1.585588
   3    0.532421
b  1   -1.145096
   2   -0.762860
dtype: float64

In [89]:
print(data[:,2])

a 1.585588
b -0.762860
c 0.661820
dtype: float64

In [87]:
print(data[:,1])

a 0.057011
b -1.145096
c 2.040339
dtype: float64

In [90]:
data.unstack()

Out[90]:
          1         2         3
a  0.057011  1.585588  0.532421
b -1.145096 -0.762860       NaN
c  2.040339  0.661820  0.562723

In [93]:
data.unstack().stack()

Out[93]:
a  1    0.057011
   2    1.585588
   3    0.532421
b  1   -1.145096
   2   -0.762860
c  1    2.040339
   2    0.661820
   3    0.562723
dtype: float64
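
The selections above can also be written with `loc` and `xs`, which some readers find more explicit. The sketch below uses fixed, made-up values instead of random numbers so that the results are reproducible.

```python
import pandas as pd

# Same two-level index as above, with fixed values for reproducibility.
data = pd.Series([10, 20, 30, 40, 50, 60, 70, 80],
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'],
                        [1, 2, 3, 1, 2, 1, 2, 3]])

one_value = data.loc[('a', 2)]   # both levels given: a single value
inner = data.xs(2, level=1)      # every entry whose inner label is 2
wide = data.unstack()            # inner level becomes the columns

print(one_value)
print(inner)
print(wide)
```

`data.xs(2, level=1)` selects the same entries as `data[:, 2]` in the cell above.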

In [ ]:
DataFrames can have hierarchical indexes on both rows and columns.

In [3]:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(12).reshape(4,3),
index=[['a','a','b','b'],[1,2,1,2]],
columns=[['anl','anl','theo'],['maths','stat','geo']])
df

Out[3]:
      anl      theo
    maths stat  geo
a 1     0    1    2
  2     3    4    5
b 1     6    7    8
  2     9   10   11

In [4]:
df.index.names=['class','exam-no']
df.columns.names=['paper','title']
df

Out[4]:
paper           anl      theo
title         maths stat  geo
class exam-no
a     1           0    1    2
      2           3    4    5
b     1           6    7    8
      2           9   10   11

In [ ]:
Selecting in Hierarchical Indexing
You can select subgroups of data. For example, let's select the columns under the label 'anl':

In [5]:
df['anl']

Out[5]:
title          maths  stat
class exam-no
a     1            0     1
      2            3     4
b     1            6     7
      2            9    10

In [ ]:
What is Swaplevel?
Sometimes, you may want to swap the levels of an index. You can use the swaplevel method.
The swaplevel method takes two levels and returns a new object.

In [6]:
df.swaplevel('class', 'exam-no')

Out[6]:
paper           anl      theo
title         maths stat  geo
exam-no class
1       a         0    1    2
2       a         3    4    5
1       b         6    7    8
2       b         9   10   11

In [ ]:
Sorting in Hierarchical Indexing
To sort the indexes by level, you can use the sort_index method.
For example, let’s sort the dataset by level 1.

In [7]:
df.sort_index(level=1)

Out[7]:
paper           anl      theo
title         maths stat  geo
class exam-no
a     1           0    1    2
b     1           6    7    8
a     2           3    4    5
b     2           9   10   11

In [ ]:
Summary Statistics in Hierarchical Indexing
By default, summary statistics on a Series or DataFrame are computed across the whole axis.
If the index has more than one level, you can instead compute them per level by grouping on that level.
Older pandas versions accepted df.sum(level='exam-no'); that keyword is deprecated in favor of groupby.

In [8]:
df.groupby(level='exam-no').sum()

Out[8]:
paper     anl      theo
title   maths stat  geo
exam-no
1           6    8   10
2          12   14   16
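
The same pattern works for any index level and any aggregation. As a sketch on the same DataFrame, the code below sums per `class` and averages per `exam-no`; the result variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['anl', 'anl', 'theo'], ['maths', 'stat', 'geo']])
df.index.names = ['class', 'exam-no']
df.columns.names = ['paper', 'title']

# Group rows by the outer index level and sum within each group.
per_class = df.groupby(level='class').sum()

# Group rows by the inner index level and average within each group.
per_exam_mean = df.groupby(level='exam-no').mean()

print(per_class)
print(per_exam_mean)
```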
