04/05/2023, 10:44 pandas - Jupyter Notebook
pandas
pandas takes its name from "panel data" and is a core library for data manipulation and data analysis.
it provides one-dimensional and multi-dimensional data structures for data manipulation.
pandas is a Python library used for working with data sets.
high-performance data analysis tool
works with large data sets
represents data in tabular form
handles missing data
three data structures in pandas
1. Series - one-dimensional
2. DataFrame - two-dimensional
3. Panel - multi-dimensional (data, major axis, minor axis); Panel was removed in pandas 0.25, so current versions provide only Series and DataFrame
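The dimensionality of the first two structures can be checked directly. The sketch below uses made-up values and column names (`a`, `b`) purely for illustration and inspects `ndim` and `shape`.

```python
import pandas as pd

# Illustrative values only; not taken from the notebook's data.
s = pd.Series([10, 20, 30])
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(s.ndim, s.shape)    # a Series has a single axis
print(df.ndim, df.shape)  # a DataFrame has two axes: rows and columns
```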
Create a pandas Series
In [1]:
import pandas as pd
import numpy as np
In [2]:
arr = np.array([1,2,3,4])
print(arr)
[1 2 3 4]
In [3]:
s = pd.Series(arr)
print(s)
print(type(s))
0 1
1 2
2 3
3 4
dtype: int64
<class 'pandas.core.series.Series'>
localhost:8888/notebooks/anaconda3/Python/pandas.ipynb 1/10
In [4]:
print(s[0:5])
0 1
1 2
2 3
3 4
dtype: int64
In [ ]:
a = pd.Series(['a','b','c'])
In [ ]:
a[2]
In [ ]:
a = pd.date_range(start = '2023-03-01', end = '2023-03-28')
In [ ]:
type(a)
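To make the result of `pd.date_range` more concrete, here is a small sketch using the same start and end dates. The `readings` Series and its values are hypothetical, added only to show how the dates can serve as an index.

```python
import pandas as pd

# Same start/end as the cell above; both endpoints are included.
dates = pd.date_range(start="2023-03-01", end="2023-03-28")
print(type(dates).__name__)  # the result is an index of timestamps
print(len(dates))            # one entry per day

# Hypothetical use: a Series indexed by those dates.
readings = pd.Series(range(len(dates)), index=dates)
first_week = readings.loc["2023-03-01":"2023-03-07"]  # label slices include both ends
print(first_week)
```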
pandas DataFrame
In [ ]:
arr = np.array([[1,2,3],[4,5,6]])
print(arr)
In [ ]:
df = pd.DataFrame(arr)
print(df)
In [ ]:
temp = np.random.randint(low = 20, high =100, size = [20,])
name = np.random.choice(['Abhay','Teclov','Geekshub','Ankit'],20)
random = np.random.choice([10,11,13,12,14],20)
In [ ]:
df = pd.DataFrame({"Temp":temp,"Name":name,"Random":random})
df
In [ ]:
a = list(zip(temp, name, random))
print(a)
In [ ]:
df = pd.DataFrame(data = a, columns=['Temp','Name','Random'])
In [ ]:
df
In [ ]:
type(df)
In [ ]:
temp = np.random.randint(low = 20, high =100, size = [20,])
name = np.random.choice(['Abhay','Teclov','Geekshub','Ankit'],20)
random = np.random.choice([10,11,13,12,14],20)
In [ ]:
df = pd.DataFrame({'temp':temp, 'name':name, 'random':random})
In [ ]:
type(df)
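Because the columns above are drawn at random, every run produces different values, and later cells that look rows up by `temp` label may not find them. One possible way to make the data repeatable is NumPy's `Generator` API with a fixed seed; the seed value `0` below is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # the seed value is an arbitrary choice

temp = rng.integers(low=20, high=100, size=20)  # high is exclusive
name = rng.choice(["Abhay", "Teclov", "Geekshub", "Ankit"], size=20)
random = rng.choice([10, 11, 13, 12, 14], size=20)

df = pd.DataFrame({"temp": temp, "name": name, "random": random})
print(df.head())
```

Re-running this cell with the same seed should reproduce the same frame.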
In [ ]:
df.head()
In [ ]:
df.tail()
In [ ]:
df.shape
In [ ]:
df.columns
In [ ]:
df.name
In [ ]:
df['name']
In [ ]:
df['temp'].describe()
In [ ]:
df.info()
In [ ]:
df.values
In [ ]:
df
In [ ]:
df.set_index('temp', inplace = True)
In [ ]:
df
In [ ]:
df.sort_index(axis =0, ascending=False)
In [ ]:
df.sort_values(by ='random', ascending = False)
In [ ]:
df.drop(['random'], axis =1)
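Unlike `set_index(..., inplace=True)`, the `sort_index`, `sort_values` and `drop` calls above return a new DataFrame and leave `df` unchanged. The sketch below shows this on a small made-up frame that stands in for the notebook's random data.

```python
import pandas as pd

# Small made-up frame standing in for the notebook's random data.
df = pd.DataFrame(
    {"name": ["Abhay", "Ankit", "Teclov"], "random": [12, 10, 14]},
    index=pd.Index([45, 72, 31], name="temp"),
)

dropped = df.drop(columns=["random"])  # same effect as df.drop(["random"], axis=1)
print(list(dropped.columns))  # the returned copy has lost the column
print(list(df.columns))       # the original is unchanged

# To keep a result, assign it to a name.
df_sorted = df.sort_values(by="random", ascending=False)
print(df_sorted)
```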
In [ ]:
df.head()
In [ ]:
df.iloc[[0,1]]
In [ ]:
df.iloc[1:3,1]
In [ ]:
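# a boolean mask needs one entry per row: keep the first two rows, drop the rest
df.iloc[[True, True] + [False] * (len(df) - 2)]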
In [ ]:
df.head()
In [ ]:
df.loc[:,:]
In [ ]:
# 39, 84 and 34 are temp values from one random run; use labels that exist in df.index
df.loc[[39,84,34]]
In [ ]:
df.loc[[39,84],'name':'random']
In [ ]:
# a boolean mask needs one entry per row, so pad it to len(df)
df.loc[[True, True, False, True] + [False] * (len(df) - 4)]
In [ ]:
df.loc[df.random > 13]
In [ ]:
df.loc[(df.random > 13) | (df.random == 10),:]
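The difference between position-based `iloc` and label-based `loc` is easier to follow on fixed data. The following is a minimal sketch with a made-up four-row frame; the index labels and values are assumptions chosen to resemble the cells above.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Abhay", "Ankit", "Teclov", "Geekshub"], "random": [14, 10, 12, 13]},
    index=pd.Index([39, 84, 34, 50], name="temp"),
)

by_position = df.iloc[[0, 1]]   # first two rows, whatever their labels
by_label = df.loc[[39, 84]]     # rows whose index label is 39 or 84
print(by_position.equals(by_label))

print(df.iloc[1:3, 1])                 # positions 1 and 2; the end is exclusive
print(df.loc[39:34, "name":"random"])  # label slices include both ends

mask = [True, True, False, True]       # one boolean per row
masked = df.loc[mask]
print(masked)

selected = df.loc[(df["random"] > 13) | (df["random"] == 10)]
print(selected)
```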
In [ ]:
# Merging & concat
d1 = pd.DataFrame([['a', 1], ['b', 2]],columns=['col1', 'number'])
d2 = pd.DataFrame([['c', 3, 'lion'], ['d', 4, 'tiger']],columns=['letter', 'numbe
In [ ]:
d1
In [ ]:
d2
In [ ]:
pd.concat([d1,d2],axis =0)
In [ ]:
pd.concat([d1,d2], axis =0, ignore_index=True)
In [ ]:
pd.concat([d1,d2], axis = 1)
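How `pd.concat` treats columns that exist in only one frame can be sketched as follows. The column names here are assumptions: both frames share `letter` and `number`, and `animal` is a hypothetical name for the third column of the second frame.

```python
import pandas as pd

d1 = pd.DataFrame([["a", 1], ["b", 2]], columns=["letter", "number"])
# "animal" is a hypothetical name for the third column.
d2 = pd.DataFrame([["c", 3, "lion"], ["d", 4, "tiger"]],
                  columns=["letter", "number", "animal"])

stacked = pd.concat([d1, d2], axis=0, ignore_index=True)
print(stacked)       # rows from d1 get NaN in "animal"

side_by_side = pd.concat([d1, d2], axis=1)
print(side_by_side)  # aligned on the row index; column names repeat
```

With `ignore_index=True` the result gets a fresh 0..n-1 index instead of repeating the original row labels.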
In [ ]:
d1 = pd.DataFrame({
"city" : ["lucknow","kanpur","agra","delhi"],
"temperature" : [32,45,30,40]
})
In [ ]:
d1
In [ ]:
d2 = pd.DataFrame({
"city" : ["delhi","lucknow","kanpur"],
"humidity" : [68,65,75]
})
In [ ]:
d2
In [ ]:
df = pd.merge(d1,d2, on='city')
In [ ]:
df
In [ ]:
pd.merge(d1,d2, on=['city'], how ='outer')
In [ ]:
pd.merge(d1, d2, on =['city'], how='left')
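The three merges above differ only in which cities survive. A short sketch with the same two frames compares the row counts; the `indicator=True` argument is an addition not used in the notebook, included to show which side each row came from.

```python
import pandas as pd

d1 = pd.DataFrame({"city": ["lucknow", "kanpur", "agra", "delhi"],
                   "temperature": [32, 45, 30, 40]})
d2 = pd.DataFrame({"city": ["delhi", "lucknow", "kanpur"],
                   "humidity": [68, 65, 75]})

inner = pd.merge(d1, d2, on="city")             # default how="inner"
left = pd.merge(d1, d2, on="city", how="left")  # keeps every row of d1
outer = pd.merge(d1, d2, on="city", how="outer", indicator=True)

print(len(inner), len(left), len(outer))
print(outer)  # the _merge column shows which side each row came from
```

Since "agra" appears only in `d1`, it is dropped by the inner merge and gets a missing `humidity` in the left and outer merges.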
In [ ]:
# dataset from https://github.com/codebasics/py/blob/master/pandas/6_handling_miss
In [ ]:
df1 = pd.read_csv("weather_data.csv")
In [ ]:
df1
In [ ]:
# pip3 install openpyxl
df1.to_excel('df_xl.xlsx', sheet_name = 'weather_data')
In [ ]:
# pip3 install openpyxl (pandas reads .xlsx files through openpyxl; xlrd 2.0+ only reads .xls)
df2 = pd.read_excel('df_xl.xlsx')
In [ ]:
df2
In [ ]:
df2.to_csv('file.csv')
In [ ]:
df2.to_csv('file_noindex.csv', index = False)
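The effect of `index=False` shows up when the file is read back. The sketch below writes to in-memory buffers instead of files so that it is self-contained; the two rows and the column names are made up.

```python
import io
import pandas as pd

# Made-up rows; the column names are assumptions.
df = pd.DataFrame({"day": ["1/1/2017", "1/2/2017"], "temperature": [32, 35]})

with_index = io.StringIO()
df.to_csv(with_index)             # the index is written as an unnamed first column
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

no_index = io.StringIO()
df.to_csv(no_index, index=False)  # only the data columns are written
no_index.seek(0)
cols_without = pd.read_csv(no_index).columns.tolist()

print(cols_with)
print(cols_without)
```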
In [ ]:
df_group = df2.groupby("event")
In [ ]:
df_group
In [ ]:
# iterating over a groupby yields (group key, sub-DataFrame) pairs
for event, group in df_group:
    print(event)
    print(group)
In [ ]:
df_group.get_group('Rain')
In [ ]:
df_group.describe()
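Since `weather_data.csv` is not reproduced here, the grouping steps can be tried on a small stand-in. The frame below is made up; only the column names `event` and `temperature` follow the cells above.

```python
import pandas as pd

# Made-up sample; the real weather_data.csv is not reproduced here.
weather = pd.DataFrame({
    "event": ["Rain", "Sunny", "Rain", "Snow", "Sunny"],
    "temperature": [28, 35, 30, 24, 33],
})

groups = weather.groupby("event")

for event, group in groups:  # each item is a (key, sub-DataFrame) pair
    print(event, len(group))

rain = groups.get_group("Rain")
print(rain)

mean_temp = groups["temperature"].mean()  # one value per event
print(mean_temp)
```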
In [ ]:
def hot_temp(x):
    return x > 30
In [ ]:
df2['hot_temp'] = df2['temperature'].apply(hot_temp)
In [ ]:
df2
In [ ]:
df2['hot_temp'] = df2['temperature'].apply(lambda x: x > 30)
In [ ]:
df2
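For a simple threshold like this, comparing the whole column at once should give the same result as `apply` without calling a Python function per row. A minimal sketch on made-up temperatures:

```python
import pandas as pd

df = pd.DataFrame({"temperature": [28, 35, 30, 31]})  # made-up values

via_apply = df["temperature"].apply(lambda x: x > 30)
vectorized = df["temperature"] > 30  # comparison applied to the whole column

print((via_apply == vectorized).all())
df["hot_temp"] = vectorized
print(df)
```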
In [ ]:
#pivot table
In [ ]:
df2.pivot_table(values = 'temperature', index = 'event', aggfunc = 'mean')
In [ ]:
df2.pivot_table(columns = 'temperature')
In [ ]:
help(pd.DataFrame.pivot_table)
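A pivot table with a single `index` column and `aggfunc='mean'` computes the same numbers as a grouped mean. The sketch below checks this on a made-up frame that reuses the `event` and `temperature` column names.

```python
import pandas as pd

# Made-up sample standing in for the weather data.
weather = pd.DataFrame({
    "event": ["Rain", "Sunny", "Rain", "Snow", "Sunny"],
    "temperature": [28, 35, 30, 24, 33],
})

pivot = weather.pivot_table(values="temperature", index="event", aggfunc="mean")
grouped = weather.groupby("event")["temperature"].mean()

print(pivot)    # a DataFrame with one "temperature" column
print(grouped)  # a Series holding the same means
```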
In [ ]:
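# df3 is not defined anywhere above; assign the DataFrame to export to df3 before running this cell
df3.to_csv("/home/apiiit-rkv/Desktop/dsp unit-3")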
In [ ]:
import pandas as pd
In [ ]:
# read_excel already returns a DataFrame, so no pd.DataFrame() wrapper is needed
df = pd.read_excel("/home/apiiit-rkv/Desktop/marks.xlsx")
df
In [ ]:
#correlation
Correlation coefficients quantify the association between variables or features of a dataset.
These statistics are of high importance for science and technology, and Python has
tools that you can use to calculate them. SciPy, NumPy, and pandas correlation methods are
fast, comprehensive, and well-documented.
What Pearson, Spearman, and Kendall correlation coefficients are
How to use SciPy, NumPy, and pandas correlation functions
How to visualize data, regression lines, and correlation matrices with Matplotlib
1. Negative correlation (red dots): In the plot on the left, the y values tend to decrease
as the x values increase. This shows strong negative correlation, which occurs when
large values of one feature correspond to small values of the other, and vice versa.
2. Weak or no correlation (green dots): The plot in the middle shows no obvious
trend. This is a form of weak correlation, which occurs when an association
between two features is not obvious or is hardly observable.
3. Positive correlation (blue dots): In the plot on the right, the y values tend
to increase as the x values increase. This illustrates strong positive
correlation, which occurs when large values of one feature correspond to
large values of the other, and vice versa.
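The three cases can be reproduced without a plot. The sketch below uses small hand-picked series (not the data behind the plots described above) for which Pearson's r comes out at the extremes: two exact linear relationships and one symmetric relationship with no linear trend.

```python
import pandas as pd

x = pd.Series([-2, -1, 0, 1, 2])

y_neg = -3 * x + 1  # decreases as x increases
y_none = x ** 2     # symmetric around 0, so there is no linear trend
y_pos = 2 * x + 5   # increases as x increases

r_neg = x.corr(y_neg)
r_none = x.corr(y_none)
r_pos = x.corr(y_pos)
print(r_neg, r_none, r_pos)
```

Note that `y_none` depends on `x` completely; a Pearson coefficient near zero only rules out a linear association.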
In [ ]:
import pandas as pd
x = pd.Series(range(10, 20))
x
In [ ]:
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])
y
In [ ]:
x.corr(y) # Pearson's r
In [ ]:
y.corr(x)
In [ ]:
x.corr(y, method='spearman') # Spearman's rho
In [ ]:
x.corr(y, method='kendall')
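To see what these coefficients measure, Pearson's r can be recomputed from its definition, and Spearman's rho can be obtained as Pearson's r on the ranks. This is a sketch using the same `x` and `y`; the closed-form rank formula it checks against holds only when there are no tied values, which is the case here.

```python
import numpy as np
import pandas as pd

x = pd.Series(range(10, 20))
y = pd.Series([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])

# Pearson's r from its definition: co-variation over the product of spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Spearman's rho is Pearson's r computed on the ranks.
rho_from_ranks = x.rank().corr(y.rank())

# Closed form for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
d = x.rank() - y.rank()
n = len(x)
rho_formula = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

print(np.isclose(r_manual, x.corr(y)))
print(np.isclose(rho_from_ranks, rho_formula))
```

The rank-based value should agree with `x.corr(y, method='spearman')` from the cell above.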