Loading Pandas
import pandas as pd
df = pd.read_csv(r'file_location\filename.txt', delimiter="\t")  # raw string so that \f is not read as an escape sequence
The Series is the base data structure of pandas. A series is similar to a NumPy
array, but it differs by having an index, which allows for much richer lookup of
items than a plain zero-based array position.
import pandas as pd
d=pd.Series([11,12,13,14])
d
Multiple items can be retrieved by specifying their labels in a Python list.
import pandas as pd
d[[1,3]]
Pandas: Series
d=pd.Series([11,12,13,14],index=['a','b','c','d'])
d[['a','b']]   # lookup by label; for lookup by position use d.iloc[[0,1]]
d.index
d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,5],index=['a','b','c','d'])
diff=d1-d2
print(diff)
diff.mean()
diff
Pandas: DataFrame
A pandas series can only have a single value associated with each index label.
To have multiple values per index label we can use a data frame. A data frame
represents one or more objects aligned by index label.
Each series becomes a column in the data frame, and each column can have an
associated name.
d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,4],index=['a','b','c','d'])
temp_df=pd.DataFrame({'value1':d1,'value2':d2})
temp_df
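To make "aligned by index label" concrete, here is a minimal sketch in which the two series share only some labels. The names s1, s2, and aligned and all values are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Two Series whose index labels only partly overlap (illustrative values).
s1 = pd.Series([11, 12, 13], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Each Series becomes a column; rows are matched by label, not by position.
aligned = pd.DataFrame({'value1': s1, 'value2': s2})
print(aligned)

# A label that is missing from one Series shows up as NaN in that column.
print(aligned.isnull().sum())
```

Row 'b' pairs 12 with 1 because both values carry the label 'b', even though they sit at different positions in their series.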
Columns in a DataFrame object can be accessed using an array indexer with the name
of the column or a list of column names.
temp_df['value1']
temp_df[['value1','value2']]
Pandas: DataFrame
temp_dfs=pd.DataFrame()
g=temp_df['value1']-temp_df['value2']
print(g)
temp_df['diff']=temp_df['value1']-temp_df['value2']
temp_df
The names of the columns in a DataFrame are accessible via the columns
property
temp_df.columns
Pandas: DataFrame
The DataFrame and Series objects can be sliced to retrieve specific rows
temp_df[0:3]
temp_df.value1[0:3]
Entire rows from a data frame can be retrieved using the .loc and .iloc properties.
.loc looks rows up by index label, whereas .iloc uses the 0-based position.
temp_df.loc['a']
temp_df.iloc[0]
temp_df.iloc[[1,3]].value1   # several rows by position, then one column by name
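The difference between .loc and .iloc is easiest to see when the labels are integers that do not match the row positions. The following is a small sketch; df_demo and its values are assumptions made up for illustration.

```python
import pandas as pd

# Integer labels that deliberately differ from the row positions.
df_demo = pd.DataFrame({'value1': [11, 12, 13, 14]}, index=[3, 2, 1, 0])

row_by_label = df_demo.loc[0]      # the row whose label is 0 (the last row)
row_by_position = df_demo.iloc[0]  # the first row, whatever its label is

print(row_by_label['value1'])      # 14
print(row_by_position['value1'])   # 11
```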
Pandas: DataFrame
import pandas as pd
df2 = pd.read_excel('2010.xlsx')
# or, if the data is stored as CSV:
# df2 = pd.read_csv('2010.csv')
type(df2.IMO[0])
The following code returns a Boolean Series that marks which values in the IMO column are greater than 7
df2.IMO>7
Pandas: DataFrame
To transpose a DataFrame (swap its rows and columns), we use the T attribute
df2=df2.T
df2.loc[['IYR','IMO']]
df2.loc['IYR'][0]
Pandas: DataFrame
Deleting data from DataFrames using drop for rows or del for columns
df2.drop('IYR')
del df2['IMO']
df = df.drop(['IMO'], axis=1)  # axis=1 drops a column; the default axis=0 drops rows
Add column to DataFrames
df2['IMO']=0
df2['IMO']=df2['IMO']+2
Query for DataFrames
If you want the accidents that happened in months later than 6, write the code below:
dfbigger=df2[df2['IMO']>6]
dfbigger=dfbigger.set_index('IYR')
print(dfbigger)
dfbigger=dfbigger.reset_index('IYR')
dfbigger
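The query pattern above can be tried without the original spreadsheet. In this sketch the frame accidents is a hypothetical stand-in for df2; only the column names IYR and IMO come from the text, and the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for df2 (values are made up for illustration).
accidents = pd.DataFrame({'IYR': [2010, 2010, 2010, 2010],
                          'IMO': [3, 7, 12, 6]})

mask = accidents['IMO'] > 6      # Boolean Series, one entry per row
later_months = accidents[mask]   # keeps only the rows where mask is True
print(later_months)

# set_index moves a column into the index; reset_index moves it back.
indexed = later_months.set_index('IYR')
restored = indexed.reset_index()
print(restored)
```

The comparison alone only produces the mask; indexing the frame with the mask is what actually selects the rows.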
DataFrames: preProcess
Count non-NA cells for each column or row
df2.count(axis=0, numeric_only=False)
df4=df2.select_dtypes(include=['object'])
df2[~df2.isin(df4)]
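As a quick check of what count returns along each axis, here is a sketch on a tiny frame. The frame demo, its values, and the STATE column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with some missing cells.
demo = pd.DataFrame({'IYR': [2010, 2010, np.nan],
                     'IMO': [1, np.nan, np.nan],
                     'STATE': ['TX', None, 'CA']})

per_column = demo.count(axis=0)  # non-NA cells in each column
per_row = demo.count(axis=1)     # non-NA cells in each row

print(per_column)
print(per_row)
```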
plot
another sample
import numpy as np
import matplotlib.pyplot as plt
t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.title('some values')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['t','t**2','t**3'])
plt.show()
Plot dataframe
another sample
t=np.arange(0,5,0.2)
df=pd.DataFrame({0:t , 1:t**1.5 , 2:t**2 , 3:t**2.5 , 4:t**3})
legend_labels=['Solid' , 'Dashed' , 'Dotted' , 'Dot-dashed' , 'Points']
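The snippet above builds the data and the labels but stops before plotting. One possible way to finish it, assuming DataFrame.plot is the intended tool, is sketched here; the line_styles list and the axis label are assumptions chosen to match the legend labels.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

t = np.arange(0, 5, 0.2)
df = pd.DataFrame({0: t, 1: t**1.5, 2: t**2, 3: t**2.5, 4: t**3})
legend_labels = ['Solid', 'Dashed', 'Dotted', 'Dot-dashed', 'Points']

# One matplotlib format string per column (assumed to mirror the labels).
line_styles = ['-', '--', ':', '-.', '.']

ax = df.plot(style=line_styles)  # DataFrame.plot draws one line per column
ax.legend(legend_labels)         # replace the default column-name legend
ax.set_xlabel('row index')
```

Call plt.show() afterwards to display the figure in a script; in a notebook the figure is usually rendered automatically.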
matplotlib.pyplot.subplots returns a Figure instance and either a single Axes or an array of Axes,
depending on the number of subplots requested
matplotlib.pyplot.subplots(nrows=1, ncols=1, **kwargs)
import matplotlib.pyplot as plt
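import numpy as np
# x and y are not defined in the original; sample data is assumed here
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)
plt.close('all')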
f, axarr = plt.subplots(2, 2)
axarr[0, 0].plot(x, y)
axarr[0, 0].set_title('Axis [0,0]')
axarr[0, 1].scatter(x, y)
axarr[0, 1].set_title('Axis [0,1]')
axarr[1, 0].plot(x, y ** 2)
axarr[1, 0].set_title('Axis [1,0]')
axarr[1, 1].scatter(x, y ** 2)
axarr[1, 1].set_title('Axis [1,1]')
plt.show()
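Whether subplots hands back one Axes or an array can be checked directly. This sketch only inspects the returned objects and draws nothing; the variable names are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

fig1, ax_single = plt.subplots()               # one subplot: a single Axes object
fig2, ax_grid = plt.subplots(2, 2)             # 2x2 grid: a NumPy array of Axes
fig3, ax_forced = plt.subplots(squeeze=False)  # squeeze=False always gives a 2-D array

print(type(ax_single).__name__)
print(ax_grid.shape)    # (2, 2)
print(ax_forced.shape)  # (1, 1)

plt.close('all')        # release the figures created above
```

Passing squeeze=False is convenient when the number of subplots is computed at run time, because the indexing code stays the same for every grid size.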
Calculate correlations with pandas and plot them with the seaborn package
1-
colNames = ["Age", "type_employer", "fnlwgt", "Education", "Education-Num", "Martial","Occupation",
"Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
"H-per-week", "Country", "Label"]
data = pd.read_csv("adult-data.txt", names=colNames,delimiter=',',header=None)
data
3-
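import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.heatmap(data.corr(numeric_only=True))  # correlate the numeric columns only (needs pandas 1.5 or later)
plt.show()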
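The heatmap simply colors the matrix that DataFrame.corr returns, so it can help to look at that matrix first. The sketch below uses a tiny made-up frame, not the adult data set; the column names are borrowed from colNames, and the values are assumptions chosen so that one column is exactly twice the other.

```python
import pandas as pd

# Made-up values; 'H-per-week' is exactly 2 * 'Age', so the two correlate perfectly.
demo = pd.DataFrame({'Age': [25, 32, 47, 51],
                     'H-per-week': [50, 64, 94, 102],
                     'Sex': ['Male', 'Female', 'Male', 'Female']})

# numeric_only=True leaves out text columns such as 'Sex'.
corr = demo.corr(numeric_only=True)
print(corr)
```

The result is a square frame with one row and one column per numeric variable, which is the shape sns.heatmap expects.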
Read data
from sklearn import preprocessing
import pandas as pd
df2 = pd.read_excel('2010.xlsx')
df2
df2.isnull()
df2.notnull()
df2.isnull()[15:20]
Rows that contain NaN values can be dropped:
df2 = df2.dropna()
df2=df2.dropna(axis='columns', how='all')  # drops columns in which every value is NaN; use axis='index' for rows
df2.isnull().sum()
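The effect of the axis and how arguments is easier to follow on a frame small enough to read at a glance. In this sketch the frame demo, its values, and the EMPTY column are illustrative assumptions.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'IYR': [2010, np.nan, 2010],
                     'IMO': [1, 2, np.nan],
                     'EMPTY': [np.nan, np.nan, np.nan]})

missing_per_column = demo.isnull().sum()                # NaN count per column
no_empty_cols = demo.dropna(axis='columns', how='all')  # drop columns that are entirely NaN
complete_rows = no_empty_cols.dropna()                  # drop rows with any NaN left

print(missing_per_column)
print(complete_rows)
```

Dropping the all-NaN column first matters here: calling dropna() directly on demo would remove every row, because each row has a NaN in EMPTY.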
Delete Missing values or replace
df2.fillna(df2.mean(numeric_only=True),inplace=True)  # each numeric column is filled with its own mean
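Replacing missing values by the column mean can be verified on a small example. The frame demo and its values are assumptions; only the column names IYR and IMO come from the text.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'IYR': [2010.0, np.nan, 2012.0],
                     'IMO': [1.0, 5.0, np.nan]})

# demo.mean() is a Series indexed by column name, so fillna
# fills each column with that column's own mean.
filled = demo.fillna(demo.mean(numeric_only=True))
print(filled)
```

This version returns a new frame and leaves demo unchanged, which is often easier to reason about than inplace=True.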
Suppose a column such as IYR is NaN for some accidents in our dataset. Let's
change NaN to the mean value of