Loading Pandas

The document provides a comprehensive guide on using the Pandas library in Python for data manipulation, including reading files, creating Series and DataFrames, performing arithmetic operations, and querying data. It also covers data preprocessing techniques, handling missing values, and visualizing data using Matplotlib and Seaborn. Key functionalities such as indexing, slicing, and plotting are illustrated with code examples.

Uploaded by

Ali

How to read a file

import pandas as pd
# Raw string, so the backslash in the Windows-style path is not read as an escape (\f is a form feed)
df = pd.read_csv(r'file_location\filename.txt', delimiter = "\t")
or

data = pd.read_csv('output_list.txt', sep=" ", header=None)
# Assign one name per column in the file
data.columns = ["a", "b", "c", "etc."]
or

df = pd.read_csv('output_list.txt', sep=" ", header=None, names=["a", "b", "c"])
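
A minimal sketch of the last variant, so it can be run without a file on disk. The in-memory text below is a hypothetical stand-in for 'output_list.txt' (space-separated, no header row); its values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'output_list.txt': space-separated, no header row
raw = io.StringIO("1 2 3\n4 5 6\n")

# header=None tells pandas the first line is data; names supplies the column labels
df = pd.read_csv(raw, sep=" ", header=None, names=["a", "b", "c"])
print(df)
```

Passing names at read time gives the same column labels as assigning data.columns afterwards, in one step.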


Pandas
In this part we look at how Python can be used to manipulate, clean, and
query data with the pandas data toolkit.

• The pandas Series

• The Series is the base data structure of pandas. A Series is similar to a NumPy
array, but it also carries an index, which allows much richer lookup of
items than a zero-based array position alone.

import pandas as pd
d=pd.Series([11,12,13,14])
d
• Multiple items can be retrieved by passing their index labels in a Python list.

d[[1,3]]   # labels 1 and 3 of the default integer index
Pandas: Series

d=pd.Series([11,12,13,14],index=['a','b','c','d'])
d[['a','b']]     # lookup by label
d.iloc[[0,1]]    # the same items, by position

• We can examine the index of a Series using the .index property:

d.index
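
The two lookup styles can be compared side by side. This is a small sketch that reuses the Series from the slide; the variable names by_label and by_position are chosen here for illustration.

```python
import pandas as pd

d = pd.Series([11, 12, 13, 14], index=["a", "b", "c", "d"])

by_label = d[["a", "b"]]        # label-based lookup
by_position = d.iloc[[0, 1]]    # position-based lookup of the same items

print(by_label.equals(by_position))
print(list(d.index))
```

Using .iloc for positions keeps the intent explicit when the index labels are not integers.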

• Two Series objects can be combined with an arithmetic operation; the values are matched by index label

d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,5],index=['a','b','c','d'])
diff=d1-d2
print(diff)

diff.mean()
diff
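
To see that the operation is matched by label rather than by position, here is a sketch with two Series whose indexes only partly overlap. The values are invented for illustration.

```python
import pandas as pd

d1 = pd.Series([11, 12, 13], index=["a", "b", "c"])
d2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Labels present in only one Series ('a' and 'd') produce NaN in the result
diff = d1 - d2
print(diff)

# mean() skips NaN by default, so only 'b' and 'c' contribute
print(diff.mean())
```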
Pandas: DataFrame

• A pandas Series can only have a single value associated with each index label.

• To have multiple values per index label we can use a DataFrame. A DataFrame
represents one or more Series objects aligned by index label.

• Each Series will be a column in the DataFrame, and each column can have an
associated name

d1=pd.Series([11,12,13,14],index=['a','b','c','d'])
d2=pd.Series([1,2,3,4],index=['a','b','c','d'])
temp_df=pd.DataFrame({'value1':d1,'value2':d2})
temp_df
• Columns in a DataFrame object can be accessed using the [] indexer with the name
of the column or a list of column names

temp_df['value1']
temp_df[['value1','value2']]
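
A short sketch of the difference between the two forms of column access, using the same data as above. The names one and both are chosen here for illustration.

```python
import pandas as pd

d1 = pd.Series([11, 12, 13, 14], index=["a", "b", "c", "d"])
d2 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
temp_df = pd.DataFrame({"value1": d1, "value2": d2})

one = temp_df["value1"]                # a single name returns a Series
both = temp_df[["value1", "value2"]]   # a list of names returns a DataFrame

print(type(one).__name__, type(both).__name__)
```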

• Passing a list to the [] operator of a DataFrame retrieves the specified columns,
whereas the same operation on a Series returns rows (items).

• A new column can be added to a DataFrame simply by assigning another Series to a
column name using the [] indexer

g=temp_df['value1']-temp_df['value2']
print(g)
temp_df['diff']=temp_df['value1']-temp_df['value2']
temp_df
• The names of the columns in a DataFrame are accessible via the .columns property
temp_df.columns
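
As a runnable sketch of adding a column and then listing the column names (same values as above, built here directly from lists):

```python
import pandas as pd

temp_df = pd.DataFrame(
    {"value1": [11, 12, 13, 14], "value2": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

# Assigning to a new column name appends the column at the end
temp_df["diff"] = temp_df["value1"] - temp_df["value2"]

print(list(temp_df.columns))
```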

• DataFrame and Series objects can be sliced to retrieve specific rows

temp_df[0:3]

temp_df.value1[0:3]

• Entire rows from a DataFrame can be retrieved using the .loc and .iloc properties.
.loc looks rows up by index label, whereas .iloc uses the 0-based position.

temp_df.loc['a']

temp_df.iloc[0]

temp_df.iloc[[1,3]].value1   # rows at positions 1 and 3, then one column (temp_df has only 4 rows)
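
The slicing and row lookups above can be checked with a small sketch; it assumes the four-row temp_df built earlier.

```python
import pandas as pd

temp_df = pd.DataFrame(
    {"value1": [11, 12, 13, 14], "value2": [1, 2, 3, 4]},
    index=["a", "b", "c", "d"],
)

first_three = temp_df[0:3]               # slice of rows by position
row_by_label = temp_df.loc["a"]          # row with index label 'a'
row_by_pos = temp_df.iloc[0]             # row at position 0 (the same row here)
subset = temp_df.iloc[[1, 3]]["value1"]  # positions 1 and 3, one column

print(row_by_label.equals(row_by_pos))
print(subset)
```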
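• The following expression returns a boolean Series that marks the rows whose IMO value is greater than 7 (df2 is the DataFrame loaded in the next step)

df2.IMO>7

• Loading data from files into a DataFrame

import pandas as pd
df2 = pd.read_excel('2010.xlsx')
# or, from a CSV file
df2=pd.read_csv('2010.csv')

• Get the type of a column

df2.IMO.dtype       # dtype of the whole column
type(df2.IMO[0])    # type of the first value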
• To transpose a DataFrame (swap rows and columns), we use the .T property

df2=df2.T

• Reading data from rows (after the transpose, the former column names are the row labels)

df2.loc[['IYR','IMO']]

df2.loc['IYR'][0]
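• Deleting data from a DataFrame: use drop for rows (or for columns with axis=1), or del for a column

df2.drop('IYR')                   # drops the row labelled 'IYR' and returns a new DataFrame
del df2['IMO']                    # removes the column 'IMO' in place
df2 = df2.drop(['IMO'], axis=1)   # alternative to del; axis=1 means columns, the default axis=0 means rows
• Add a column to a DataFrame

df2['IMO']=0

• Read a column and write the updated values back

df2['IMO']=df2['IMO']+2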
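
A sketch of drop and column update on a tiny frame. The three rows below are invented for illustration; only the column names IYR, IMO and DAY come from the slides.

```python
import pandas as pd

# Hypothetical rows using the column names from the slides
df = pd.DataFrame({"IYR": [2010, 2010, 2010], "IMO": [1, 7, 12], "DAY": [5, 15, 25]})

without_row = df.drop(0)                 # axis=0 (default): drop the row labelled 0
without_col = df.drop(["IMO"], axis=1)   # axis=1: drop the column 'IMO'

# drop returns a new object, so df itself still has all rows and columns
df["IMO"] = df["IMO"] + 2
print(df)
```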
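Query for DataFrames
• To find the accidents that happened in months later than June (IMO greater than 6), first write the condition:

df2['IMO']>6

• Now mask the rows with the where method, which keeps the original shape and fills the non-matching rows with NaN:

dfbigger=df2.where(df2['IMO']>6)

• Boolean indexing returns only the matching rows, and conditions can be combined with &:

dfbigger=df2[(df2['IMO']>6) & (df2['DAY']>10)]

dfbigger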
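
The difference between where and boolean indexing is easiest to see on a few rows. This is a sketch with invented values; only the column names IMO and DAY come from the slides.

```python
import pandas as pd

# Hypothetical rows for illustration
df2 = pd.DataFrame({"IMO": [3, 7, 9, 12], "DAY": [20, 5, 15, 30]})

# where keeps all four rows and replaces non-matching rows with NaN
masked = df2.where(df2["IMO"] > 6)

# boolean indexing keeps only the rows where both conditions hold
filtered = df2[(df2["IMO"] > 6) & (df2["DAY"] > 10)]

print(masked)
print(filtered)
```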

• Set or reset the index of a DataFrame

dfbigger=dfbigger.set_index('IYR')      # the IYR column becomes the index
print(dfbigger)
dfbigger=dfbigger.reset_index('IYR')    # IYR goes back to being an ordinary column
dfbigger
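
A minimal round trip through set_index and reset_index, on invented values:

```python
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({"IYR": [2010, 2011], "IMO": [7, 9]})

indexed = df.set_index("IYR")      # IYR is now the index, no longer a column
restored = indexed.reset_index()   # the index goes back to being a column

print(indexed)
print(restored)
```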
DataFrames: preprocessing
• Count non-NA cells for each column or row

df2.count(axis=0, numeric_only=False)

• Get numeric columns or object columns

df2.dtypes
df2.select_dtypes(include='number').columns   # public replacement for the private _get_numeric_data()
df2.select_dtypes(include=['object'])

# The non-object columns are the complement of the object columns
df2.select_dtypes(exclude=['object'])
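
A sketch of splitting a frame into numeric and non-numeric columns and counting non-NA cells. The STATE and SPEED columns and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame
df = pd.DataFrame({
    "IYR": [2010, 2011],
    "STATE": ["TX", "CA"],
    "SPEED": [35.5, np.nan],
})

numeric = df.select_dtypes(include=[np.number])   # numeric columns only
other = df.select_dtypes(exclude=[np.number])     # everything else
counts = df.count()                               # non-NA cells per column

print(list(numeric.columns), list(other.columns))
print(counts)
```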

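• Find empty cells and replace them with NaN

DataFrame.replace(to_replace=None, value=None, inplace=False, regex=False)   (older pandas versions also accept limit and method)

import numpy as np
# ^\s*$ matches cells that are empty or contain only whitespace; a bare \s would also hit values such as 'New York'
df2 = df2.replace(r'^\s*$',np.nan, regex=True)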
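
A sketch of the blank-to-NaN replacement on a hypothetical CITY column. The anchored pattern is used so that values with internal spaces are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one whitespace-only cell and one empty cell
df = pd.DataFrame({"CITY": ["New York", " ", "", "Austin"]})

# Only cells that are empty or whitespace-only become NaN
cleaned = df.replace(r"^\s*$", np.nan, regex=True)

print(cleaned)
```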


Plot

• matplotlib.pyplot is a collection of command-style functions that make matplotlib work like MATLAB
import matplotlib.pyplot as plt
plt.plot([1,2,3], [1,2,3], 'go-', linewidth=2)    # green circles joined by a solid line
plt.plot([1,2,3], [1,4,9], 'rs', markersize=14)   # red squares, markers only
plt.show()

• Another sample

import numpy as np
import matplotlib.pyplot as plt
t = np.arange(0., 5., 0.2)
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.title('some values')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['t','t**2','t**3'])
plt.show()
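Plot dataframe

• Another sample

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

t=np.arange(0,5,0.2)
df=pd.DataFrame({0:t , 1:t**1.5 , 2:t**2 , 3:t**2.5 , 4:t**3})
legend_labels=['Solid' , 'Dashed' , 'Dotted' , 'Dot-dashed' , 'Points']

# One style string per column; 'k.' draws black points, matching the 'Points' label
df.plot(style=['r-','g--', 'b:', 'm-.' , 'k.'])
plt.legend(legend_labels)
plt.show()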
matplotlib.pyplot.subplots

matplotlib.pyplot.subplots returns a Figure instance and either a single Axes or an array of Axes,
depending on the number of subplots requested
matplotlib.pyplot.subplots(nrows=1, ncols=1, sharex=False, sharey=False, ...)
import matplotlib.pyplot as plt
import numpy as np

# Simple data to display in various forms
x = np.linspace(0, 2 * np.pi, 400)
y = np.sin(x ** 2)

plt.close('all')

# Just a figure and one subplot
f, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Simple plot')
plt.show()

• A scatter plot displays the relationship between a pair of variables

• Define two subplots stacked vertically, sharing the x axis

f, axarr = plt.subplots(2, sharex=True)
axarr[0].plot(x, y)
axarr[0].set_title('Sharing X axis')
axarr[1].scatter(x, y)
plt.show()

• Define two subplots in one row, sharing the y axis

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing Y axis')
ax2.scatter(x, y)
plt.show()

• Define three subplots sharing both x and y axes

f, (ax1, ax2, ax3) = plt.subplots(3, sharex=True, sharey=True)
ax1.plot(x, y)
ax1.set_title('Sharing both axes')
ax2.scatter(x, y)
ax3.scatter(x, 2 * y ** 2 - 1, color='r')
plt.show()

• Define four axes, returned as a 2-D array

f, axarr = plt.subplots(2, 2)
axarr[0, 0].plot(x, y)
axarr[0, 0].set_title('Axis [0,0]')
axarr[0, 1].scatter(x, y)
axarr[0, 1].set_title('Axis [0,1]')
axarr[1, 0].plot(x, y ** 2)
axarr[1, 0].set_title('Axis [1,0]')
axarr[1, 1].scatter(x, y ** 2)
axarr[1, 1].set_title('Axis [1,1]')
plt.show()
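Plot correlations with the seaborn package

1- Read the data
import pandas as pd
colNames = ["Age", "type_employer", "fnlwgt", "Education", "Education-Num", "Marital","Occupation",
"Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
"H-per-week", "Country", "Label"]
data = pd.read_csv("adult-data.txt", names=colNames,delimiter=',',header=None)
data

2- Install seaborn: conda install seaborn

3- Draw the correlation matrix as a heatmap
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# corr() is a pandas method; numeric_only=True skips the text columns, which recent pandas versions otherwise reject
sns.heatmap(data.corr(numeric_only=True))
plt.show()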
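
The correlation matrix itself comes from pandas, so it can be inspected before plotting. This is a sketch on four invented rows that reuse some of the column names above; it assumes pandas 1.5 or newer, where corr accepts numeric_only.

```python
import pandas as pd

# Hypothetical rows; the real data would come from adult-data.txt
data = pd.DataFrame({
    "Age": [25, 32, 47, 51],
    "Education-Num": [9, 13, 10, 14],
    "Sex": ["M", "F", "F", "M"],
    "H-per-week": [40, 45, 38, 50],
})

# Only numeric columns take part; the text column 'Sex' is left out
corr = data.corr(numeric_only=True)
print(corr)
```

The resulting square DataFrame is what sns.heatmap receives.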
• Read data
from sklearn import preprocessing
import numpy as np
import pandas as pd
df2 = pd.read_excel('2010.xlsx')
df2

• Show numeric columns

df2.select_dtypes(include=[np.number])

• Replace empty or whitespace-only cells with NaN

df2 = df2.replace(r'^\s*$',np.nan, regex=True)
• Drop all empty columns
df2=df2.dropna(axis='columns', how='all')
#df2.isnull().mean()
# Fill the remaining NaN values in numeric columns with the column mean
df2.fillna(df2.mean(numeric_only=True),inplace=True)

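Keep only the columns in which less than half of the values are missing (threshold 0.5)

#df2.columns[df2.isnull().mean() < 0.8]

# isnull().mean() gives the fraction of missing values per column
df2=df2[df2.columns[df2.isnull().mean() < 0.5]]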
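
A sketch of the threshold filter on a small invented frame, where the missing fractions are easy to read off:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: A has no gaps, B is 75% missing, C is 25% missing
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [np.nan, np.nan, np.nan, 1],
    "C": [np.nan, 2, 3, 4],
})

missing_fraction = df.isnull().mean()          # fraction of NaN per column
kept = df[df.columns[missing_fraction < 0.5]]  # keep columns below the threshold

print(missing_fraction)
print(list(kept.columns))
```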
Find missing values

• Now let's see if we have any missing values

df2.isnull()
df2.notnull()
df2.isnull()[15:20]
• It is possible to drop rows or columns that contain NaN values:

df2 = df2.dropna()                           # drops every row that has at least one NaN
df2=df2.dropna(axis='columns', how='all')    # drops only the columns that are entirely NaN

• If a column like IMO2 is all NaN, we can drop it:

df2 = df2.drop(['IMO2'], axis=1)

• Show a summary of the number of null values in each column

df2.isnull().sum()
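
The steps above, sketched on three invented rows. The column names follow the slides; IMO2 is made entirely NaN here on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({
    "IYR": [2010, np.nan, 2010],
    "IMO": [1, 2, np.nan],
    "IMO2": [np.nan, np.nan, np.nan],
})

null_counts = df.isnull().sum()                        # NaN count per column
no_empty_cols = df.dropna(axis="columns", how="all")   # removes IMO2 only
complete_rows = no_empty_cols.dropna()                 # keeps rows without any NaN

print(null_counts)
print(complete_rows)
```

Dropping the all-NaN columns first matters: calling dropna() on the original frame would remove every row, because IMO2 is missing everywhere.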
Delete or replace missing values

• Fill the NaN cells of all numeric columns with the column mean

df2.fillna(df2.mean(numeric_only=True),inplace=True)

• Suppose the IYR value of some accidents is NaN in our dataset. Let's
change NaN to the mean value of that column

# Set a few IYR values to NaN for the demonstration
df2.iloc[[1, 2, 3], df2.columns.get_loc('IYR')] = np.nan   # or by label: df2.loc[[0,11,12,13,14,15,16], 'IYR']=np.nan

df2=df2.fillna({'IYR': df2['IYR'].mean()})
df2[1:10]
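
A sketch of filling one column with its own mean, on invented values. The STATE column is hypothetical and is included only to show that text columns are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration
df = pd.DataFrame({
    "IYR": [2009.0, np.nan, 2011.0, np.nan],
    "STATE": ["TX", "CA", None, "NY"],
})

# mean() ignores NaN, so the mean of IYR is taken over the two known values
iyr_mean = df["IYR"].mean()
df = df.fillna({"IYR": iyr_mean})

print(df)
```

Passing a dict to fillna limits the fill to the listed columns.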
