PANDAS
What is Pandas?
->It is an open-source Python library built on top of the NumPy library.
->It is designed for data manipulation, data analysis and data cleaning.
It can handle missing data as well.
->It provides flexible and powerful data structures such as Series and DataFrame.
->It is fast and offers high performance and productivity.
Features of Pandas
->Fast and efficient data manipulation and analysis
->Provides time-series functionality
->Easy handling of missing data
->Fast data merging and joining
->Flexible reshaping and pivoting of data
->Data can be loaded from different file objects
->Integrates with NumPy
Data Structures in Pandas
-> Data structures are used to organize, retrieve and manipulate data.
-> The data structures in pandas are Series and DataFrame.
What is a Series?
->A Series is a one-dimensional labeled array.
->It can hold any data type (int, string or Python objects).
->Its axis labels are collectively known as the index.
->A Series contains homogeneous data (all elements share one dtype).
->Series values are mutable, meaning we can modify the elements; the size is described as immutable, meaning
the length is treated as fixed once the Series is declared.
->Syntax: pd.Series(data, index, dtype, copy)
->Parameters : data = it can be a list, dictionary, ndarray or scalar (an empty Series is created if omitted)
index (optional) = the labels for the values
dtype (optional) = the data type of the values
copy (optional) = makes a copy of the input data
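The constructor parameters described above can be tried out in a minimal sketch. The values and labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
scores = pd.Series(data=[10, 20, 30], index=["a", "b", "c"], dtype="float64")

# Values are mutable: an element can be changed through its label
scores["b"] = 25.0

print(scores)
```

Passing dtype="float64" stores the integers as floats, which is why the assigned value 25.0 fits without a type change.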
Different ways to create a Series in pandas
1. Creating an empty Series
import pandas as pd
print(pd.Series()) o/p Series([], dtype: object)
2. Creating a Series from an array
import numpy as np
series_array=np.array(['m','Mukesh','bf','gf'])
pd.Series(series_array)
o/p 0 m
1 Mukesh
2 bf
3 gf
dtype: object
3. Creating a Series from an array with a custom index
series_array=np.array(['m','Mukesh','bf','gf'])
pd.Series(series_array,index=[100,'Love',103,'No'])
o/p 100 m
Love Mukesh
103 bf
No gf
dtype: object
4. Creating a Series from a list
items=['hi', 100,'Mukesh', 1000]   # avoid naming a variable "list": it hides the built-in
pd.Series(items)
O/p 0 hi
1 100
2 Mukesh
3 1000
dtype: object
5. Creating a Series from a dictionary
data_dict ={ 'k1': 1000,
'k2' : 2000,
'k3' : 3000,
'k4' : 4000 }
pd.Series(data_dict)   # the dictionary keys become the index
O/p k1 1000
k2 2000
k3 3000
k4 4000
dtype: int64
6. Creating a Series using NumPy functions
->np.linspace(start, stop, num)
nu_fn=pd.Series(np.linspace(3,33,3))
nu_fn
O/p 0 3.0
1 18.0
2 33.0
dtype: float64
-> np.random.rand(x)
nu_fn=pd.Series(np.random.rand(3))
nu_fn
O/p 0 0.487446
1 0.375540
2 0.011341
dtype: float64
7. Creating a Series using the range function
range_series=pd.Series(range(5))
range_series
O/p 0 0
1 1
2 2
3 3
4 4
dtype: int64
Accessing Data by Position (iloc)
->To access Series elements by position we use iloc (integer-based indexing).
->iloc allows you to select elements by their integer positions.
->Ex: data=[10,20,30,40,50]
pos=pd.Series(data,index=['A','B','C','D','E'])
pos.iloc[4] o/p 50
pos.iloc[-1] o/p 50
pos.iloc[:] o/p A 10
B 20
C 30
D 40
E 50
dtype: int64
pos.iloc[[1,3]] o/p B 20
D 40
Retrieving Data by Label (index) Name (loc)
->Here we use loc (label-based indexing).
-> Ex: pos.loc['A'] o/p 10
pos.loc['A' : 'E'] (slicing) o/p all the elements
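One difference between the two indexers is easy to miss: an iloc slice excludes its stop position, while a loc slice includes its end label. A minimal sketch, reusing the same example Series:

```python
import pandas as pd

pos = pd.Series([10, 20, 30, 40, 50], index=["A", "B", "C", "D", "E"])

# Position-based slice: positions 0 and 1 (the stop position 2 is excluded)
by_position = pos.iloc[0:2]

# Label-based slice: labels A, B and C (the end label is included)
by_label = pos.loc["A":"C"]

print(by_position.tolist())  # [10, 20]
print(by_label.tolist())     # [10, 20, 30]
```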
Changing the type of data
data=[1,2,3,4,5,0]
s=pd.Series(data,dtype=object)
O/p 0 1
1 2
2 3
3 4
4 5
5 0
dtype: object
data=[1,2,3,4,5,0]
s=pd.Series(data,dtype=bool)
O/p 0 True
1 True
2 True
3 True
4 True
5 False
dtype: bool
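The examples above set the type when the Series is created. For a Series that already exists, the type can be converted with astype(), which returns a new Series. A minimal sketch using the same data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 0])

# astype() returns a converted copy; s itself keeps its integer dtype
as_float = s.astype("float64")
as_bool = s.astype(bool)      # 0 becomes False, every other number becomes True

print(as_float.dtype)
print(as_bool.tolist())
```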
What is a DataFrame?
->It is a data structure in the pandas library.
->It is two-dimensional labeled data.
->It has labeled axes, which means both rows and columns have labels,
which makes it easier to access or manipulate specific data.
->It holds heterogeneous data: a DataFrame can contain columns of different
data types (int, float, string, object).
->Its size is mutable: we can add or remove rows and columns in a DataFrame.
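The points above about mixed column types and mutable size can be checked with a small sketch. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: each column has its own data type
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [25, 30],
    "height": [1.65, 1.80],
})
print(df.dtypes)            # one dtype per column

# Size is mutable: add a column, then remove a row
df["city"] = ["Paris", "Berlin"]
df = df.drop(index=0)

print(df.shape)             # (1, 4)
```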
Different ways to create a DataFrame
1. Creating an empty DataFrame:
print(pd.DataFrame())
O/p : Empty DataFrame
Columns: []
Index: []
2. Creating a DataFrame using a list:
items=['hii',1,2,3,'hello']
pd.DataFrame(items)
(or)
print(pd.DataFrame(items))
3. Creating a DataFrame using a list of lists:
list_list=[[1,'Mukesh'],[2,'data_Science'],[3,'job']]
pd.DataFrame(list_list,columns=['hii','Bye'])
4. Creating a DataFrame using a dictionary:
dic={'team':['India','South Africa','Australia','England','New Zealand'],
'Ranking':[1,2,4,3,5]}
pd.DataFrame(dic)
5. Creating a DataFrame using a list of dictionaries:
list_dic=[{1:'Mukesh',2:'Bleson',3:'Srinivasan'},
{1:'Safa',2:'Sreya',3:'Fareedha'}]
pd.DataFrame(list_dic)
6. Creating a DataFrame from a pandas Series:
sd=pd.Series(['hhi',1,3,4])
pd.DataFrame(sd)
7. Creating a DataFrame using a dict of ndarrays:
se={1:np.array([1,2,3]),
'hi':np.array(['ji','ki','li']),
3:np.array([4,5,6])}
pd.DataFrame(se)
8. Creating a DataFrame using a dict of lists:
data = { 'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago'] }
pd.DataFrame(data)
Column Selection
->It is a fundamental operation in data manipulation and analysis.
->Methods to select columns:
1. Selecting a single column (using brackets):
data={'programming':['SQL','Python','Java','Html'],
'level of proficiency':[4,3,2,1],
'Trainers':['self-learn','Madha_kiran','Akila','self_learn']}
df=pd.DataFrame(data)
df
df['programming']     # single brackets return a Series
df[['programming']]   # double brackets return a DataFrame
Selecting a single column (using dot notation):
df.programming
2. Selecting multiple columns (using a list of column names)
df[['programming','Trainers']]
3. Selecting columns by label (loc) or by condition
df.loc[ :, 'programming' : 'Trainers']   # [rows, columns]
4. Selecting columns by position (iloc)
df.iloc[ :,0:3]
5. Selecting columns by data type
df.select_dtypes(include=['int'])
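Selection by data type also accepts broader type names and an exclude argument. The sketch below uses a hypothetical table with an extra float column; "number" matches both integer and float columns.

```python
import pandas as pd

# Hypothetical table with a text, an integer and a float column
df = pd.DataFrame({
    "programming": ["SQL", "Python"],
    "level": [4, 3],
    "hours": [1.5, 2.0],
})

numeric = df.select_dtypes(include="number")      # integer and float columns
non_numeric = df.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```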
Column Addition
1. Adding a new column with a scalar value:
data={'A':[1,2,3],'B':[4,5,6]}
df=pd.DataFrame(data)
df['C']=10
df
2. Adding a new column using a list
df['D']=[9,8,7]
df
3. Adding a column from an ndarray
df['E']=np.array(['kii','kalkii','Prabhs'])
4. Adding a column using arithmetic operations
df['F']=df['A']+df['B']
5. Joining DataFrames
dl=[[10,20,30],[40,50,60],[70,80,90]]
ds=pd.DataFrame(dl)
ds
df=df.join(ds)
Column Deletion:
1. Using the drop function
1.1 Dropping a single column:
data={'A':[1,2,3,4],
'B':[5,6,7,8],
'C':[9,10,11,12],
'D':[10,20,30,40],
'E':[40,50,60,70]}
df=pd.DataFrame(data)
df=df.drop(columns=['E'])
df
1.2 Dropping multiple columns:
df.drop(columns=['D','C'],inplace=True)
df
inplace=True modifies the original DataFrame without creating a copy.
2. Using the del keyword
df=pd.DataFrame(data)   # recreate df, because 'E' was already dropped above
del df['E']
df
3. Using the pop method:
The pop method removes a column and returns it as a Series.
a=df.pop('B')
a
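The deletion methods differ in whether they change the original DataFrame. A minimal sketch with hypothetical data: drop() without inplace=True returns a new DataFrame, while pop() modifies the original and returns the removed column.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# drop() returns a new DataFrame; df still has all three columns
without_c = df.drop(columns=["C"])

# pop() removes column B from df itself and returns it as a Series
popped = df.pop("B")

print(list(without_c.columns))  # ['A', 'B']
print(list(df.columns))         # ['A', 'C']
print(popped.tolist())          # [3, 4]
```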
Descriptive Statistics
data = { 'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'B': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'C': [9, 10, 11, 12, 13, 14, 15, 16, 17, 18],
'D': [13, 14, 15, 16, 17, 18, 19, 20, 21, 22] }
df=pd.DataFrame(data)
df
1. describe()
df.describe()
2. mean()
mean_values=df.mean()
mean_values
3. median()
median_values=df.median()
median_values
4. Standard deviation: std()
std_=df.std()
std_
5. Variance: var()
var_=df.var()
var_
6. Skewness: skew()
skew_=df.skew()
skew_
7. Kurtosis: kurt()
kurt_=df.kurt()
kurt_
8. min()
min_=df.min()
min_
9. max()
max_=df.max()
max_
10. quantile()
quantile_=df.quantile([0.25,0.5,0.75])
quantile_
q1_A=df['A'].quantile(0.25)
q1_A
q3_D=df['D'].quantile(0.75)
q3_D
11. Covariance: cov()
cov_=df.cov()
cov_
12. Correlation: corr()
corr_=df.corr()
corr_
13. sum()
sum_=df.sum()
sum_
14. count()
count_=df.count()
count_
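15. cumsum()
-> It is used to calculate the cumulative sum of the elements along a given axis.
cumsum_=df.cumsum()
cumsum_
# column A, going down the rows: 0 : 1 , 1 : 1+2=3 , 2 : 3+3=6 , 3 : 6+4=10 , 4 : 10+5=15 , ...
cumsum_=df.cumsum(axis=1)
cumsum_
# axis=1 accumulates horizontally, across the columns of each row
16. cummin() -> cumulative minimum
17. cummax() -> cumulative maximum
18. cumprod() -> cumulative product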
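The last three cumulative functions follow the same pattern as cumsum(). A minimal sketch on a small hypothetical Series, where each output element is computed from all elements up to that position:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 2])

running_min = s.cummin()    # smallest value seen so far
running_max = s.cummax()    # largest value seen so far
running_prod = s.cumprod()  # product of all values so far

print(running_min.tolist())   # [3, 1, 1, 1]
print(running_max.tolist())   # [3, 3, 4, 4]
print(running_prod.tolist())  # [3, 3, 12, 24]
```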
Iteration:
Iterating over a DataFrame:
->iterrows()
Ex:
dic={'stu_id' : ['C1','C2','C3','C4'],
'Tool_Proficcency' : ['Power BI','Tableau','Excel','Sql'],
'Ratings' : [4,5,4,3]}
df=pd.DataFrame(dic)
for index, row in df.iterrows():
    print(f'Index: {index}')
    print(f" stu_id: {row['stu_id']}, Tool_proficcency: {row['Tool_Proficcency']}, Ratings: {row['Ratings']}")
    print(f"Row as Series:\n{row}\n")
-It returns (index, Series) pairs, one per row (each row is converted into a Series object).
-It allows you to access the row's data using the column name. Ex: {row['stu_id']}
-Index : the index label of the row
-Series : the row's data, returned as a Series object
Row as Series:
stu_id C1
Tool_Proficcency Power BI
Ratings 4
Name: 0, dtype: object (this is the Series object)
-iterrows() is slower compared to itertuples(),
because iterrows() converts each row into a Series object.
->itertuples()
for row in df.itertuples():
    print(f" stu_id: {row.stu_id}, Tool_proficcency: {row.Tool_Proficcency}, Ratings: {row.Ratings}")
->It returns each row as a named tuple.
->It includes the index as the first field (row.Index) by default; we can exclude it by passing index=False.
->Accessing the row data : using dot notation {row.stu_id}
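The effect of the index parameter of itertuples() can be seen by converting each named tuple to a plain tuple. A minimal sketch on a shortened, hypothetical version of the table:

```python
import pandas as pd

df = pd.DataFrame({"stu_id": ["C1", "C2"], "Ratings": [4, 5]})

# Default (index=True): the row index is the first field of each tuple
with_index = [tuple(row) for row in df.itertuples()]

# index=False: only the column values are returned
without_index = [tuple(row) for row in df.itertuples(index=False)]

print(with_index)
print(without_index)
```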
->items()
->Here we iterate over the DataFrame column by column.
->For each column we get the column name and a Series (the column data).
for col_name,col_data in df.items():
    print(f" Column: {col_name}")
    print(col_data)
Sorting
->Sorting means arranging data in a specific order, such as ascending or descending order.
->We can apply sorting to data types such as numbers, strings and complex objects.
->Sorting algorithms : bubble sort, merge sort, quick sort, insertion sort
Sorting by Values :
->Sort the DataFrame by one or more columns.
df = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
})
sort=df.sort_values(by='Name')
sort = df.sort_values(by='Age',ascending=False) # by default ascending=True
sort=df.sort_values(by=['City','Age'],ascending=False) # sorting by multiple columns
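When sorting by several columns, ascending can also be a list with one flag per column. The sketch below uses hypothetical data with repeated cities so that the second sort key has a visible effect.

```python
import pandas as pd

# Hypothetical data: cities repeat, so Age decides the order within a city
df = pd.DataFrame({
    "Name": ["John", "Anna", "Peter", "Linda"],
    "Age": [28, 24, 35, 32],
    "City": ["Paris", "Paris", "Berlin", "Berlin"],
})

# City ascending (A to Z), then Age descending within each city
result = df.sort_values(by=["City", "Age"], ascending=[True, False])

print(result["Name"].tolist())  # ['Peter', 'Linda', 'John', 'Anna']
```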
Sorting by Index :
sort=df.sort_index(ascending=False)
# sorts by the row index
sort =df.sort_index(axis=1)
# sorts by the column labels
Sorting Values In Place:
->We use inplace=True to modify the original DataFrame.
df.sort_values(by='City',inplace=True)
Sorting a Series :
s = pd.Series([3, 1, 4, 2], index=['d', 'b', 'a', 'c'])
sorts=s.sort_values(ascending=False)
# sorting by the values
sorts=s.sort_index(ascending=False)
# sorting by the index
Groupby
->It is used to split the data into groups based on some criteria
and apply a function to each group independently.
->We use groupby for aggregating data (sum, mean, count, max, min).
->Syntax : df.groupby('col_name').function()
Grouping by a single column
df = pd.DataFrame({
'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
'Sales': [100, 200, 150, 250, 120, 300]
})
group =df.groupby('Region').sum(numeric_only=True)   # sum only the numeric column (Sales)
Grouping by multiple columns:
group =df.groupby(['Region','Product']).sum()
Applying multiple aggregation functions
group= df.groupby('Region').agg({'Sales':['sum','mean','max','count']})
Resetting the Index :
->After grouping, the group keys become the index. We can reset the index to turn them back into ordinary columns.
group= df.groupby('Region').agg({'Sales':['sum','mean','max','count']}).reset_index()
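Passing a dictionary of lists to agg() produces two-level column labels. One alternative, shown here as a sketch rather than as the method these notes prescribe, is named aggregation, which gives flat column names; as_index=False keeps Region as an ordinary column. The names total_sales and mean_sales are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["A", "B", "A", "B", "A", "B"],
    "Region": ["North", "North", "South", "South", "North", "South"],
    "Sales": [100, 200, 150, 250, 120, 300],
})

# Named aggregation: new_column=(source_column, function)
summary = df.groupby("Region", as_index=False).agg(
    total_sales=("Sales", "sum"),
    mean_sales=("Sales", "mean"),
)

print(summary)
```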
Merging/Joining DataFrames
->These operations combine multiple DataFrames into a single DataFrame based on common keys or columns.
Concatenating the DataFrames:
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'ID': [4, 5, 6],
'Name': ['David', 'Edward', 'Frank']
})
concat_df=pd.concat([df1,df2])
concat_df
Merge Function :
->Combines DataFrames based on one or more keys.
->Syntax : pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None)
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Score': [85, 90, 75, 60]
})
merge=pd.merge(df1,df2, on='ID',how='right')
merge
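The how parameter decides which keys survive the merge. A minimal sketch comparing three of its values on the same two tables; indicator=True adds a _merge column that records where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "Name": ["Alice", "Bob", "Charlie", "David"]})
df2 = pd.DataFrame({"ID": [3, 4, 5, 6],
                    "Score": [85, 90, 75, 60]})

inner = pd.merge(df1, df2, on="ID", how="inner")  # only IDs present in both
left = pd.merge(df1, df2, on="ID", how="left")    # all IDs from df1
outer = pd.merge(df1, df2, on="ID", how="outer", indicator=True)  # all IDs

print(inner["ID"].tolist())   # [3, 4]
print(left)                   # Score is NaN for IDs 1 and 2
print(outer)
```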
Join function:
->The join function is used to join DataFrames based on their indexes or a key column.
-> Syntax : left_df.join(right_df, on=None, how='left', lsuffix='', rsuffix='', sort=False)
-> # Using set_index while creating the DataFrames
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'ID': [1, 2, 3, 4]
}).set_index('ID')
# DataFrame 2
df2 = pd.DataFrame({
'Score': [85, 90, 75, 60],
'ID': [3, 4, 5, 6]
}).set_index('ID')
join=df1.join(df2,how='left')
join
-> # Using set_index while performing the join
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'ID': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
'Score': [85, 90, 75, 60],
'ID': [3, 4, 5, 6]
})
# Join DataFrames on 'ID' column
join = df1.set_index('ID').join(df2.set_index('ID'),how='outer')
join
set_index() = a function used to set one or more DataFrame columns as the index (row labels).
Syntax : df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Setting a single column as the index :
data = {
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85, 90, 75, 60]
}
df = pd.DataFrame(data)
set_1 = df.set_index('ID')
set_1
Setting multiple columns as the index:
set_2 =df.set_index(['ID','Score'])
set_2
Keeping the original column
set_keep=df.set_index('ID',drop=False)
set_keep
Resetting the index :
set_reset=set_1.reset_index()   # moves 'ID' back to an ordinary column
set_reset
Concatenation
->It is used to combine multiple sources into one DataFrame.
->Syntax : result = pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None,
names=None, verify_integrity=False, sort=False)
Concatenation along the rows (vertical concatenation)
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})
df2 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})
concat=pd.concat([df1,df2],axis=0,ignore_index=True)
concat
ignore_index : controls the index labels of the concatenated result.
When we concatenate DataFrames vertically, the original indexes are kept by default (ignore_index=False).
To renumber the index in sequential order, pass ignore_index=True.
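The difference that ignore_index makes is easiest to see in the index labels of the result. A minimal sketch with two small hypothetical DataFrames:

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1"]})
df2 = pd.DataFrame({"A": ["A2", "A3"]})

kept = pd.concat([df1, df2])                           # original labels kept
renumbered = pd.concat([df1, df2], ignore_index=True)  # labels renumbered

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

With the default setting the labels 0 and 1 appear twice, so a label lookup such as kept.loc[0] returns two rows.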
Concatenation along columns (horizontal concatenation)
concat=pd.concat([df1,df2],axis=1,ignore_index=True)   # with axis=1, ignore_index=True renumbers the column labels
concat
Concatenating with their own indexes
df3= pd.DataFrame({
'A': ['A0','A1','A2'],
'B':['B0','B1','B2']}, index=[0,1,2])
df4 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
}, index=[3, 4, 5])
result=pd.concat([df3,df4])
result
Concatenating with keys : we can create a hierarchical index in the DataFrame
result=pd.concat([df1,df2],keys=['df1','muk'],axis=1) #axis=0
result
Concatenating DataFrames with different columns
df5 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})
df6 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'C': ['C3', 'C4', 'C5']
})
result = pd.concat([df5, df6], axis=0)   # columns missing from one DataFrame are filled with NaN
result
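When the columns differ, the join parameter from the syntax line decides which columns are kept. A minimal sketch on the same two DataFrames: the default join='outer' keeps every column and fills the gaps with NaN, while join='inner' keeps only the columns they share.

```python
import pandas as pd

df5 = pd.DataFrame({"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]})
df6 = pd.DataFrame({"A": ["A3", "A4", "A5"], "C": ["C3", "C4", "C5"]})

outer = pd.concat([df5, df6], axis=0)                # columns A, B, C
inner = pd.concat([df5, df6], axis=0, join="inner")  # shared column A only

print(list(outer.columns))
print(list(inner.columns))
print(outer["C"].isna().sum())  # rows from df5 have no value for C
```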