vertopal.com_12_Pandas

Uploaded by

chauhanjeel57

The name Pandas is derived from "Panel Data".

With Pandas we can filter data, create charts, create pivot tables, and so on, much as in Microsoft Excel.
Pandas is used for data analysis.

NumPy | Pandas
1D array => vector | Series
2D array => matrix | Data Frame
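
The correspondence above can be checked by wrapping NumPy arrays in pandas objects. This is a minimal sketch, assuming NumPy is installed alongside pandas; the array values and the column names "x" and "y" are made up for illustration.

```python
import numpy as np
import pandas as pd

# 1D NumPy array (vector) -> pandas Series
vector = np.array([10, 20, 30])
series = pd.Series(vector)

# 2D NumPy array (matrix) -> pandas DataFrame
matrix = np.array([[1, 2], [3, 4], [5, 6]])
frame = pd.DataFrame(matrix, columns=["x", "y"])  # hypothetical column names

print(series.ndim, series.shape)  # 1 (3,)
print(frame.ndim, frame.shape)    # 2 (3, 2)
```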

Create Series in Pandas


In a Series, the index starts from 0 by default.
import pandas as pd

series1 = pd.Series(data=[10,20,30,40,50])

print(series1)

0 10
1 20
2 30
3 40
4 50
dtype: int64

import pandas as pd

series1 = pd.Series(data=[10,20,30,40,50],index=[1,2,3,4,5])

print(series1)

1 10
2 20
3 30
4 40
5 50
dtype: int64

import pandas as pd

series1 = pd.Series(data=[10,20,30,40,50],index=["a","b","c","d","e"],
dtype=float)

print(series1)
a 10.0
b 20.0
c 30.0
d 40.0
e 50.0
dtype: float64
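
Once a Series has a custom index, values can be fetched either by label or by position. The sketch below reuses the labelled Series from above (without the float dtype) and is only an illustration of `loc` and `iloc`, not part of the original notes.

```python
import pandas as pd

series1 = pd.Series(data=[10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

by_label = series1.loc["c"]          # look up by index label
by_position = series1.iloc[2]        # look up by integer position (0-based)
label_slice = series1.loc["b":"d"]   # label slices include both end points
position_slice = series1.iloc[1:3]   # positional slices exclude the stop

print(by_label, by_position)   # 30 30
print(list(label_slice))       # [20, 30, 40]
print(list(position_slice))    # [20, 30]
```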

Create Data Frame in Pandas


In a Data Frame, the data can be written in dictionary format (column name => list of values).
As in a Series, the index starts from 0 by default.
import pandas as pd

dataset1 = pd.DataFrame(data={"hindi":[10,20,30,40,50],"english":
[60,70,80,90,100]})

print(dataset1)

hindi english
0 10 60
1 20 70
2 30 80
3 40 90
4 50 100

We can Create Data Frame Using Series


import pandas as pd

series1 = pd.Series([23,54,12,47,98],
index=["eng","hindi","sci","maths","ss"])
print(series1)

print("\n")
series2 = pd.Series([76,81,33,51,66],
index=["eng","hindi","sci","maths","ss"])
print(series2)

print("\n")

dataset = pd.DataFrame(data={"column1":series1,"column2":series2})
print(dataset)

eng 23
hindi 54
sci 12
maths 47
ss 98
dtype: int64

eng 76
hindi 81
sci 33
maths 51
ss 66
dtype: int64

column1 column2
eng 23 76
hindi 54 81
sci 12 33
maths 47 51
ss 98 66
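
When a Data Frame is built from several Series, pandas lines the values up by index label. The sketch below uses shortened, hypothetical versions of the two Series whose indexes only partly overlap; labels missing from one Series show up as NaN.

```python
import pandas as pd

series1 = pd.Series([23, 54, 12], index=["eng", "hindi", "sci"])
series2 = pd.Series([76, 81, 51], index=["eng", "hindi", "maths"])  # no "sci", extra "maths"

# Rows are matched on the index label, not on position
dataset = pd.DataFrame(data={"column1": series1, "column2": series2})
print(dataset)
```

The result has one row per label found in either Series, so "maths" has no value in column1 and "sci" has no value in column2.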

Import a Data set Into Pandas


pd.read_csv(path_of_file)
pd.read_json(path_of_file)
pd.read_excel(path_of_file)
pd.read_html(path_or_url)
pd.read_sql(query, connection)
pd.read_table(path_of_file)
... ... ...
import pandas as pd
jeel = pd.read_excel("F:/jeel.xlsx")

print(jeel)

No Name Address
0 101 Jeel Raigadh
1 102 Deep HMT
2 103 Aryan Talod
3 104 Vipul Modasa
4 105 Sachin Gambhoi
5 106 Rohit Vijapur
6 107 Jeel Raigadh
7 108 Deep HMT
8 109 Aryan Talod
9 110 Vipul Modasa
10 111 Sachin Gambhoi
11 112 Rohit Vijapur
12 113 Jeel Raigadh
13 114 Deep HMT
14 115 Aryan Talod
15 116 Vipul Modasa
16 117 Sachin Gambhoi
17 118 Rohit Vijapur
18 119 Vipul Modasa
19 120 Sachin Gambhoi
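
The other readers work the same way as read_excel. Below is a minimal sketch of read_csv; to keep it runnable without a file on disk, the CSV text is a small hypothetical sample held in memory with io.StringIO.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "No,Name,Address\n101,Jeel,Raigadh\n102,Deep,HMT\n103,Aryan,Talod\n"

# read_csv accepts a file path or any file-like object
students = pd.read_csv(io.StringIO(csv_text))
print(students)
print(students.shape)  # (3, 3)
```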

High level Understanding of Data


=> whenever we write a command on the whole Data Frame

Low Level Understanding of Data


=> whenever we write a command on a combination of columns in the Data Frame
ex = we can write a command on one column at a time
ex = we can write a command on two columns at a time
ex = we can write a command on three or more columns at a time
1 column = univariate analysis
2 columns = bivariate analysis
3 or more columns = multivariate analysis
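
The distinction above can be sketched on a small table. The data, the column names and the choice of value_counts and groupby are assumptions made for illustration; other commands would fit the same categories.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "Team": ["X", "X", "Y", "Y"],
    "Category": ["Batsman", "Bowler", "Batsman", "Batsman"],
    "Runs": [1000, 200, 2400, 900],
})

# High level: one command on the whole Data Frame
print(df.shape)  # (4, 3)

# Univariate: one column at a time
category_counts = df["Category"].value_counts()
print(category_counts)

# Bivariate: two columns together
runs_per_team = df.groupby("Team")["Runs"].mean()
print(runs_per_team)

# Multivariate: three columns together
runs_by_team_and_category = df.groupby(["Team", "Category"])["Runs"].sum()
print(runs_by_team_and_category)
```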

Pandas Methods for High Level Understanding


head() => by default it returns the first 5 rows of the data
tail() => by default it returns the last 5 rows of the data
dtypes => it returns the data type of every column
info() => it prints a summary of the data (index range, columns, non-null counts, dtypes, memory usage)
describe() => it returns summary statistics
import pandas as pd

print(jeel.head())

No Name Address
0 101 Jeel Raigadh
1 102 Deep HMT
2 103 Aryan Talod
3 104 Vipul Modasa
4 105 Sachin Gambhoi

import pandas as pd

print(jeel.head(12))

No Name Address
0 101 Jeel Raigadh
1 102 Deep HMT
2 103 Aryan Talod
3 104 Vipul Modasa
4 105 Sachin Gambhoi
5 106 Rohit Vijapur
6 107 Jeel Raigadh
7 108 Deep HMT
8 109 Aryan Talod
9 110 Vipul Modasa
10 111 Sachin Gambhoi
11 112 Rohit Vijapur

import pandas as pd

print(jeel.tail())

No Name Address
15 116 Vipul Modasa
16 117 Sachin Gambhoi
17 118 Rohit Vijapur
18 119 Vipul Modasa
19 120 Sachin Gambhoi

import pandas as pd

print(jeel.tail(7))

No Name Address
13 114 Deep HMT
14 115 Aryan Talod
15 116 Vipul Modasa
16 117 Sachin Gambhoi
17 118 Rohit Vijapur
18 119 Vipul Modasa
19 120 Sachin Gambhoi

import pandas as pd

print(jeel.dtypes)

No int64
Name object
Address object
dtype: object

import pandas as pd

jeel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 No 20 non-null int64
1 Name 20 non-null object
2 Address 20 non-null object
dtypes: int64(1), object(2)
memory usage: 612.0+ bytes

import pandas as pd

print(jeel.describe())

No
count 20.00000
mean 110.50000
std 5.91608
min 101.00000
25% 105.75000
50% 110.50000
75% 115.25000
max 120.00000

Pandas Methods for Low Level Understanding

Fetch data from a column
import pandas as pd

jeel["Address"]

0 Raigadh
1 HMT
2 Talod
3 Modasa
4 Gambhoi
5 Vijapur
6 Raigadh
7 HMT
8 Talod
9 Modasa
10 Gambhoi
11 Vijapur
12 Raigadh
13 HMT
14 Talod
15 Modasa
16 Gambhoi
17 Vijapur
18 Modasa
19 Gambhoi
Name: Address, dtype: object

import pandas as pd

jeel["Address"].describe()

count 20
unique 6
top Gambhoi
freq 4
Name: Address, dtype: object

import pandas as pd

jeel.Address

0 Raigadh
1 HMT
2 Talod
3 Modasa
4 Gambhoi
5 Vijapur
6 Raigadh
7 HMT
8 Talod
9 Modasa
10 Gambhoi
11 Vijapur
12 Raigadh
13 HMT
14 Talod
15 Modasa
16 Gambhoi
17 Vijapur
18 Modasa
19 Gambhoi
Name: Address, dtype: object

idxmin() and idxmax() => return the index label of the minimum / maximum value along an axis


import pandas as pd

arr = pd.DataFrame({"A":[4,5,2,6], "B":[11,2,5,8], "C":[1,8,66,4]})

print(arr)

print(arr.idxmin(axis=0))

print(arr.idxmax(axis=0))

A B C
0 4 11 1
1 5 2 8
2 2 5 66
3 6 8 4
A 2
B 1
C 0
dtype: int64
A 3
B 0
C 2
dtype: int64
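
With axis=0 the result is one row label per column, as shown above. With axis=1 the same methods work across each row and return a column label instead. A short sketch on the same data:

```python
import pandas as pd

arr = pd.DataFrame({"A": [4, 5, 2, 6], "B": [11, 2, 5, 8], "C": [1, 8, 66, 4]})

# axis=1: for every row, which column holds the smallest / largest value
col_of_min = arr.idxmin(axis=1)
col_of_max = arr.idxmax(axis=1)

print(list(col_of_min))  # ['C', 'B', 'A', 'C']
print(list(col_of_max))  # ['B', 'C', 'C', 'B']
```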

Sorting Values in Ascending and Descending Order in Pandas


import pandas as pd

name = pd.Series(["Sanjeev","Keshav","Rahul"])
age = pd.Series([37,42,38])
designation = pd.Series(["Manager","Cleark","Accountant"])

d1 = pd.DataFrame(data={"Name":name,"Age":age,"Designation":designation})
print(d1)
print("\n")

asc1 = d1.sort_values(by='Age')
print(asc1)

print("\n")

desc1 = d1.sort_values(by="Age",ascending=False)
print(desc1)

Name Age Designation
0 Sanjeev 37 Manager
1 Keshav 42 Cleark
2 Rahul 38 Accountant

Name Age Designation
0 Sanjeev 37 Manager
2 Rahul 38 Accountant
1 Keshav 42 Cleark

Name Age Designation
1 Keshav 42 Cleark
2 Rahul 38 Accountant
0 Sanjeev 37 Manager
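
sort_values can also sort by more than one column, with a separate direction for each. The sketch below uses hypothetical staff data with repeated designations so that the second sort key has something to break ties on.

```python
import pandas as pd

# Hypothetical data with repeated designations so the tie-break is visible
staff = pd.DataFrame({
    "Name": ["Asha", "Bimal", "Chitra", "Dev"],
    "Age": [37, 42, 38, 29],
    "Designation": ["Manager", "Clerk", "Manager", "Clerk"],
})

# Sort by Designation A-Z, then by Age from oldest to youngest within each designation
ordered = staff.sort_values(by=["Designation", "Age"], ascending=[True, False])
print(ordered)

# ignore_index=True renumbers the rows 0..n-1 after sorting
renumbered = staff.sort_values(by="Age", ignore_index=True)
print(renumbered)
```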

Example - 1
import pandas as pd

player = pd.Series(data=["hardik Pandya","KL Rahul","Andre Russel",
                         "Jasprit Bumrah","Virat Kohli","Rohit Sharma"])
team = pd.Series(data=["Mumbai Indian","Kings Eleven","Kolkatta Night Rider",
                       "Mumbai Indian","RCB","Mumbai Indian"])
category = pd.Series(data=["Batsman","Batsman","Batsman","Bowler","Batsman","Batsman"])
bidprice = pd.Series(data=[13,12,7,10,17,15])
runs = pd.Series(data=[1000,2400,900,200,3600,3700])

dataframe = pd.DataFrame({"Player":player,"Team":team,"Category":category,
                          "BidPrice":bidprice,"Runs":runs})

dataframe

Player Team Category BidPrice Runs
0 hardik Pandya Mumbai Indian Batsman 13 1000
1 KL Rahul Kings Eleven Batsman 12 2400
2 Andre Russel Kolkatta Night Rider Batsman 7 900
3 Jasprit Bumrah Mumbai Indian Bowler 10 200
4 Virat Kohli RCB Batsman 17 3600
5 Rohit Sharma Mumbai Indian Batsman 15 3700

que => retrieve the first two rows


dataframe.head(2)

Player Team Category BidPrice Runs
0 hardik Pandya Mumbai Indian Batsman 13 1000
1 KL Rahul Kings Eleven Batsman 12 2400

dataframe.iloc[0:2,:]

Player Team Category BidPrice Runs
0 hardik Pandya Mumbai Indian Batsman 13 1000
1 KL Rahul Kings Eleven Batsman 12 2400

que => retrieve the last 3 rows


dataframe.tail(3)

Player Team Category BidPrice Runs
3 Jasprit Bumrah Mumbai Indian Bowler 10 200
4 Virat Kohli RCB Batsman 17 3600
5 Rohit Sharma Mumbai Indian Batsman 15 3700

dataframe.iloc[-3:,:]

Player Team Category BidPrice Runs
3 Jasprit Bumrah Mumbai Indian Bowler 10 200
4 Virat Kohli RCB Batsman 17 3600
5 Rohit Sharma Mumbai Indian Batsman 15 3700

que => find the most expensive player


most_expensive_player = dataframe.loc[dataframe['BidPrice'].idxmax()]
print(most_expensive_player)

Player Virat Kohli
Team RCB
Category Batsman
BidPrice 17
Runs 3600
Name: 4, dtype: object
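
idxmax returns a single row. If the top few players are wanted, nlargest is one option. The sketch rebuilds only the Player and BidPrice columns of Example - 1 so that it runs on its own.

```python
import pandas as pd

# Player and BidPrice columns of Example - 1, rebuilt so the block is self-contained
dataframe = pd.DataFrame({
    "Player": ["hardik Pandya", "KL Rahul", "Andre Russel",
               "Jasprit Bumrah", "Virat Kohli", "Rohit Sharma"],
    "BidPrice": [13, 12, 7, 10, 17, 15],
})

# The two rows with the highest BidPrice, highest first
top_two = dataframe.nlargest(2, "BidPrice")
print(top_two)
```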

que => print the total number of players per team


dataframe.groupby('Team').Player.count()

Team
Kings Eleven 1
Kolkatta Night Rider 1
Mumbai Indian 3
RCB 1
Name: Player, dtype: int64

que => find the player who had the highest BidPrice in each Team


# Selecting several columns from a groupby needs a list; val['Player','BidPrice']
# raises "ValueError: Cannot subset columns with a tuple with more than one
# element. Use a list instead."
# Also, max() would take the maximum of each column separately, so use idxmax()
# to get the row label of the highest BidPrice in each team and look those rows up.
top_rows = dataframe.groupby('Team')['BidPrice'].idxmax()
print(dataframe.loc[top_rows, ['Team','Player','BidPrice']])

Team Player BidPrice
1 Kings Eleven KL Rahul 12
2 Kolkatta Night Rider Andre Russel 7
5 Mumbai Indian Rohit Sharma 15
4 RCB Virat Kohli 17

que => find the average runs of each Team


dataframe.groupby('Team').Runs.mean()

Team
Kings Eleven 2400.000000
Kolkatta Night Rider 900.000000
Mumbai Indian 1633.333333
RCB 3600.000000
Name: Runs, dtype: float64
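
The count and the mean above can be collected in one table with agg. This is a sketch using named aggregation; the output column names players, average_runs and best_runs are made up for illustration, and the data is rebuilt from Example - 1 so the block runs on its own.

```python
import pandas as pd

# Player, Team and Runs columns of Example - 1
dataframe = pd.DataFrame({
    "Player": ["hardik Pandya", "KL Rahul", "Andre Russel",
               "Jasprit Bumrah", "Virat Kohli", "Rohit Sharma"],
    "Team": ["Mumbai Indian", "Kings Eleven", "Kolkatta Night Rider",
             "Mumbai Indian", "RCB", "Mumbai Indian"],
    "Runs": [1000, 2400, 900, 200, 3600, 3700],
})

# Each keyword is an output column: (source column, aggregation function)
summary = dataframe.groupby("Team").agg(
    players=("Player", "count"),
    average_runs=("Runs", "mean"),
    best_runs=("Runs", "max"),
)
print(summary)
```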

que => sort all players according to the BidPrice


dataframe.sort_values(by='BidPrice')

Player Team Category BidPrice Runs
2 Andre Russel Kolkatta Night Rider Batsman 7 900
3 Jasprit Bumrah Mumbai Indian Bowler 10 200
1 KL Rahul Kings Eleven Batsman 12 2400
0 hardik Pandya Mumbai Indian Batsman 13 1000
5 Rohit Sharma Mumbai Indian Batsman 15 3700
4 Virat Kohli RCB Batsman 17 3600

que => retrieve a single value by position (first row, second column)


# iloc[row, column] avoids the FutureWarning raised by dataframe.iloc[0][1]
first_row_team = dataframe.iloc[0, 1]
first_row_team

'Mumbai Indian'

Merge Data Frames in Pandas
import pandas as pd

df1 = pd.DataFrame({'key':['a','b','c','d'],'value':[1,2,3,4]})
print(df1)
df2 = pd.DataFrame({'key':['a','b','e','b'],'value':[5,6,7,8]})
print(df2)
df3 = df1.merge(df2,on='key',how='inner')
print(df3)

key value
0 a 1
1 b 2
2 c 3
3 d 4
key value
0 a 5
1 b 6
2 e 7
3 b 8
key value_x value_y
0 a 1 5
1 b 2 6
2 b 2 8
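
The inner merge above drops keys that appear on only one side ('c', 'd' and 'e'). The sketch below shows two other values of how on the same data; indicator=True is an optional extra that labels where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c', 'd'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['a', 'b', 'e', 'b'], 'value': [5, 6, 7, 8]})

# how='left' keeps every row of df1; keys missing from df2 get NaN in value_y
left = df1.merge(df2, on='key', how='left')
print(left)

# how='outer' keeps keys from both sides; indicator=True adds a _merge column
# with the values 'left_only', 'right_only' or 'both'
outer = df1.merge(df2, on='key', how='outer', indicator=True)
print(outer)
```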

Group By in Pandas
import pandas as pd

# Sample data
data = {
'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles',
'Chicago'],
'Category': ['A', 'A', 'B', 'B', 'A'],
'Sales': [200, 150, 300, 120, 250]
}

df = pd.DataFrame(data)
print(df)

City Category Sales
0 New York A 200
1 Los Angeles A 150
2 New York B 300
3 Los Angeles B 120
4 Chicago A 250

group1 = df.groupby('City').sum()
print(group1)

Category Sales
City
Chicago A 250
Los Angeles AB 270
New York AB 500

group1 = df.groupby('City')["Sales"].sum()
print(group1)

City
Chicago 250
Los Angeles 270
New York 500
Name: Sales, dtype: int64

group2 = df.groupby(['City','Category'])['Sales'].sum()
print(group2)

City Category
Chicago A 250
Los Angeles A 150
B 120
New York A 200
B 300
Name: Sales, dtype: int64
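
The two-key groupby above can also be laid out as a pivot table, with one row per city and one column per category. A minimal sketch on the same data; fill_value=0 is an assumption about how a missing combination (Chicago has no category B) should be shown.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago'],
    'Category': ['A', 'A', 'B', 'B', 'A'],
    'Sales': [200, 150, 300, 120, 250],
})

# Rows = City, columns = Category, cells = sum of Sales
table = df.pivot_table(index='City', columns='Category', values='Sales',
                       aggfunc='sum', fill_value=0)
print(table)
```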

Find the Maximum Value of Each Row in a Data Frame


import pandas as pd

# Step 1: Create a DataFrame
data = {
    'A': [3, 7, 2, 9],
    'B': [6, 2, 8, 1],
    'C': [5, 3, 7, 4]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Step 2: Find the maximum value over each row (axis=1)
df['Max Value'] = df.max(axis=1)

# Display the updated DataFrame
print("\nDataFrame with Maximum Value for each Row:")
print(df)

Original DataFrame:
A B C
0 3 6 5
1 7 2 3
2 2 8 7
3 9 1 4

DataFrame with Maximum Value for each Row:
A B C Max Value
0 3 6 5 6
1 7 2 3 7
2 2 8 7 8
3 9 1 4 9
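
For comparison, the same max method can run down each column instead of across each row. The sketch below rebuilds the Data Frame without the extra 'Max Value' column; taking the overall maximum through to_numpy() is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [3, 7, 2, 9],
    'B': [6, 2, 8, 1],
    'C': [5, 3, 7, 4]
})

column_max = df.max(axis=0)        # one value per column
row_max = df.max(axis=1)           # one value per row
overall_max = df.to_numpy().max()  # single largest value in the whole table

print(list(column_max))  # [9, 8, 7]
print(list(row_max))     # [6, 7, 8, 9]
print(overall_max)       # 9
```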
