Intro Pandas

This document provides an introduction to the Pandas library in Python, covering its capabilities for data manipulation, including reading CSV files, data types, attributes, methods, and handling missing values. It also discusses grouping, filtering, slicing, sorting data, and basic plotting functionalities. The document includes code snippets to illustrate various operations and methods available in Pandas.

Uploaded by

duarte.denio
Copyright
© © All Rights Reserved
Introduction to Pandas
import pandas as pd

inspired by http://rcs.bu.edu/examples/python/data_analysis/


Before starting

● Datasets (available in our Google Drive)


○ Salaries.csv
○ flights.csv
● Prerequisites
○ Good command of Python
○ numpy
○ sklearn
Before starting

● Command in Colab:
from google.colab import drive
drive.mount('/content/drive')
fpath='/content/drive/MyDrive/<your path>'
Pandas

● Allows working with data like a table (relational)


● Provides tools for data manipulation: sorting, slicing, aggregation, among others.
● Tools for plotting data
● Statistical information
Pandas - DataFrame
import pandas as pd
# read a csv file
dfSal=pd.read_csv('Salaries.csv')
# show the first 5 rows (default)
dfSal.head()
# show the last 5 rows (default)
dfSal.tail()
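A minimal sketch of the same steps without the course files, assuming a small invented CSV text in place of Salaries.csv (the rows below are hypothetical, not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for Salaries.csv (values are invented)
csv_text = """rank,discipline,phd,service,sex,salary
Prof,B,56,49,Male,186960
Prof,A,12,6,Male,93000
AssocProf,A,23,20,Female,110515
AsstProf,B,5,3,Female,74000
AsstProf,B,3,1,Male,72500
Prof,A,30,25,Female,125000
"""

# read_csv accepts a path or a file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))

first_two = df_demo.head(2)   # first 2 rows instead of the default 5
last_three = df_demo.tail(3)  # last 3 rows
print(first_two)
print(last_three)
```

Passing n to head() or tail() changes how many rows are shown.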
DataFrame data types

Pandas Type    Native Type    Description
object         string         Columns with strings and mixed types
int64          int            numeric
float64        float          numeric with decimals
datetime64     N/A            stores time series


DataFrame data types
Be careful: if the attribute's name is also the name of a DataFrame attribute or method (for example rank), you have to use df['attribute'].xxxxx
dfSal['salary'].dtype
dtype('int64')
dfSal.salary.dtype
dtype('int64')
dfSal.dtypes
rank object
discipline object
phd int64
service int64
sex object
salary int64
dtype: object
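The name clash can be checked on a tiny frame. This is a sketch with invented values; df_demo is a hypothetical stand-in for dfSal:

```python
import pandas as pd

# Hypothetical two-row frame (values are invented)
df_demo = pd.DataFrame({"rank": ["Prof", "AsstProf"], "salary": [186960, 74000]})

# 'salary' clashes with nothing, so both access styles return the same column
same_column = df_demo.salary.equals(df_demo["salary"])

# 'rank' is also the name of the DataFrame.rank() method,
# so attribute access returns the method, not the column
rank_is_method = callable(df_demo.rank)

# bracket access always returns the column
rank_column = df_demo["rank"]

print(same_column, rank_is_method)
print(rank_column)
```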
DataFrame attributes

Attribute    Description
dtypes       types of the columns
columns      column names
axes         list of the row labels and column names
ndim         number of dimensions
size         number of elements
shape        tuple with the dimensionality
values       numpy representation
index        row labels
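A short sketch of these attributes on a hypothetical 3x2 frame (values are invented). Attributes are read without parentheses:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns
df_demo = pd.DataFrame({"phd": [56, 12, 23], "salary": [186960, 93000, 110515]})

print(df_demo.columns.tolist())  # column names
print(df_demo.shape)             # (number of rows, number of columns)
print(df_demo.ndim)              # a DataFrame has 2 dimensions
print(df_demo.size)              # rows * columns
print(df_demo.index)             # row labels
print(df_demo.values)            # underlying NumPy array
```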


DataFrame methods

Method                      Description
head([n]), tail([n])        first or last n rows (default 5)
describe()                  descriptive statistics (numeric columns)
max(), min()                return max/min values for all attributes
df.attribute.max()/min()    return max/min for a given attribute
mean(), median()            return mean/median for all attributes or a given one
std()                       standard deviation
sample([n])                 return a random sample of n rows (default 1)
dropna()                    drop all rows with missing values
drop()                      drop specified labels from rows or columns
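A sketch of some of these methods on a hypothetical frame with one missing value (values are invented). Unlike attributes, methods are called with parentheses:

```python
import pandas as pd

# Hypothetical frame; the last 'service' value is missing
df_demo = pd.DataFrame({
    "rank": ["Prof", "AssocProf", "AsstProf", "Prof"],
    "service": [49, 20, 3, None],
    "salary": [186960, 110515, 74000, 93000],
})

summary = df_demo.describe()                   # statistics for the numeric columns
mean_salary = df_demo["salary"].mean()         # mean of one attribute
median_salary = df_demo.salary.median()        # median of one attribute
complete_rows = df_demo.dropna()               # rows without missing values
without_rank = df_demo.drop(columns=["rank"])  # copy without the 'rank' column
one_random_row = df_demo.sample(1, random_state=0)

print(summary)
print(mean_salary, median_salary)
```

random_state is only there to make the sample reproducible.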


Grouping data

● Pandas groupby can be used to:


○ Split data into groups based on some criteria
○ Calculate statistics (or apply a function) to each group
# grouping by rank (attribute)
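dfRank=dfSal.groupby(['rank'])
# numeric_only=True skips the string columns (recent pandas versions raise an error without it)
dfRank.mean(numeric_only=True)
# or
dfSal.groupby(['rank']).mean(numeric_only=True)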
# we can calculate statistics
dfSal.groupby('rank')[['salary']].mean()
# or
dfSal.groupby('rank').salary.mean()
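A sketch of split-then-aggregate on a hypothetical frame (values are invented), including several statistics at once with agg:

```python
import pandas as pd

# Hypothetical rows standing in for Salaries.csv
df_demo = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "AsstProf"],
    "sex": ["Male", "Female", "Female", "Male"],
    "salary": [100000, 120000, 70000, 80000],
})

# one statistic for one attribute, per group
mean_by_rank = df_demo.groupby("rank")["salary"].mean()

# several statistics at once
stats_by_rank = df_demo.groupby("rank")["salary"].agg(["mean", "min", "max"])

# mean of every numeric column, ignoring string columns such as 'sex'
all_numeric = df_demo.groupby("rank").mean(numeric_only=True)

print(mean_by_rank)
print(stats_by_rank)
```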
Filtering data

● We can subset the data applying Boolean indexing (filter)


dfSalG12=dfSal[dfSal['salary'] > 120000]
dfSalG12.head()
# Any operator: > < == >= <= !=
dfWom=dfSal[dfSal['sex']== 'Female']
dfWom.head()
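Conditions can also be combined. A sketch on a hypothetical frame (values are invented), using & (and), | (or) and ~ (not); each condition needs its own parentheses:

```python
import pandas as pd

# Hypothetical rows standing in for Salaries.csv
df_demo = pd.DataFrame({
    "rank": ["Prof", "Prof", "AssocProf", "AsstProf", "AsstProf"],
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "salary": [186960, 93000, 110515, 74000, 72500],
})

# AND: both conditions must hold
women_above_90k = df_demo[(df_demo["salary"] > 90000) & (df_demo["sex"] == "Female")]

# OR: at least one condition must hold
prof_or_above_100k = df_demo[(df_demo["rank"] == "Prof") | (df_demo["salary"] > 100000)]

# NOT: ~ inverts a Boolean mask
not_male = df_demo[~(df_demo["sex"] == "Male")]

print(women_above_90k)
print(prof_or_above_100k)
print(not_male)
```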
Slicing

● A dataframe can be sliced in several ways


# one particular column
dfSal['salary'] # or dfSal.salary creates a Series
dfSal[['salary']] # creates a dataframe
# two or more columns
dfSal[['rank','salary']] # to store dfRS=dfSal[['rank','salary']]
# Selecting rows by their position
dfSal[10:20] # from the 11th row to the 20th (positions start at 0)
# create a new dataframe from selected rows of another dataframe
s=[dfSal[0:10],dfSal[50:60],dfSal[100:110]] # select the rows
dfSlice=pd.concat(s) # concat them to a new dataframe
# Selecting rows by label and some labels (attributes)
# loc includes the end label, so :20 returns rows 0 to 20 (21 rows)
dfSal.loc[:20,['rank','sex','salary']]
# or by column position: use iloc (end excluded, so the first 20 rows)
dfSal.iloc[:20,[0,4,5]]
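The difference between loc (labels, end included) and iloc (positions, end excluded) is easy to check. A sketch on a hypothetical six-row frame (values are invented):

```python
import pandas as pd

# Hypothetical rows standing in for Salaries.csv
df_demo = pd.DataFrame({
    "rank": ["Prof", "Prof", "AssocProf", "AsstProf", "AsstProf", "Prof"],
    "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "salary": [186960, 93000, 110515, 74000, 72500, 125000],
})

by_label = df_demo.loc[:2, ["rank", "salary"]]  # labels 0, 1 and 2 (3 rows)
by_position = df_demo.iloc[:2, [0, 2]]          # positions 0 and 1 (2 rows)

# After filtering, labels no longer match positions
women = df_demo[df_demo["sex"] == "Female"]     # keeps labels 2, 3 and 5
first_woman = women.iloc[0]                     # first row by position
same_row = women.loc[2]                         # the same row, by label

print(by_label)
print(by_position)
print(women)
```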
Slicing

● More: the iloc method (selection by position)


dfSal.iloc[0] # first row, returned as a Series
dfSal.iloc[i] # (i+1)th row (remember 0 is the first one)
dfSal.iloc[-1] # last row or dfSal.tail(1)
dfSal.iloc[:,0] # first column or dfSal['rank']
dfSal.iloc[:,-1] # last column or dfSal['salary']
dfSal.iloc[0:7] # first 7 rows or dfSal.head(7)
dfSal.iloc[:,0:2] # first 2 columns or dfSal[['rank','discipline']]
dfSal.iloc[[0,5],[1,3]] # 1st and 6th rows and 2nd and 4th columns
Dropping

● Delete rows with drop


dfSal.drop([5,6], axis=0, inplace=True)
dfSal=dfSal.iloc[:100] # Overwrite the df with the first 100 rows
# deleting using conditions
dfSal.drop(dfSal[(dfSal['salary'] >1000) & (dfSal['sex']=='Male')].index, axis=0, inplace=True)
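# delete a column
dfSal.drop(['salary'], axis=1, inplace=True)
# delete multiple columns at once (an alternative to the line above:
# dropping 'salary' a second time raises a KeyError)
dfSal.drop(['sex','salary'], axis=1, inplace=True)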
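Without inplace=True, drop returns a new dataframe and leaves the original untouched. A sketch on a hypothetical frame (values are invented), using the index= and columns= keywords instead of axis:

```python
import pandas as pd

# Hypothetical rows standing in for Salaries.csv
df_demo = pd.DataFrame({
    "rank": ["Prof", "Prof", "AssocProf", "AsstProf", "AsstProf"],
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "salary": [186960, 93000, 110515, 74000, 72500],
})

without_rows = df_demo.drop(index=[1, 2])               # same as axis=0
without_cols = df_demo.drop(columns=["sex", "salary"])  # same as axis=1

# deleting by condition: keep the rows where the mask is False
mask = (df_demo["salary"] > 90000) & (df_demo["sex"] == "Male")
kept = df_demo[~mask]

print(without_rows)
print(without_cols)
print(kept)
```

Assigning the result to a new name makes it possible to try several variants without reloading the file.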
Sorting

● By default, sorting is ascending and returns a new, sorted dataframe


dfSal.sort_values(by='service') # default ascending=True inplace=False
dfSal.sort_values(by=['service','salary']) # sort salary within service
# sort by service ascending and salary descending
dfSal.sort_values(by=['service','salary'], ascending=[True, False])
# sort the dataframe by column label (attribute name)
dfSal.sort_index(axis=1,ascending=True, inplace=True)
dfSal.head(5)

Note: axis=0 refers to rows
axis=1 refers to columns
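A sketch of a two-key sort on a hypothetical frame (values are invented). The row labels travel with the rows, and the original dataframe is not modified:

```python
import pandas as pd

# Hypothetical rows: two service levels, two salaries each
df_demo = pd.DataFrame({
    "service": [3, 20, 3, 20],
    "salary": [74000, 110000, 72000, 105000],
})

# service ascending, then salary descending within each service value
sorted_df = df_demo.sort_values(by=["service", "salary"], ascending=[True, False])

# optional: renumber the rows 0..n-1 after sorting
renumbered = sorted_df.reset_index(drop=True)

print(sorted_df)
print(renumbered)
```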
Missing values (NaN)

● Most of the time missing values are marked as NaN


dfFlig=pd.read_csv('flights.csv')
dfFlig.isnull()

dfFlig[dfFlig.isnull().any(axis=1)].head()

dfSal.iloc[0].isnull().sum() # number of null values in row 0


Missing values methods

Method                       Description
dropna()                     drop missing observations (rows)
dropna(how='all')            drop missing observations (rows) where all attributes are NaN
dropna(axis=1,how='all')     drop columns if all values are missing
dropna(thresh=n)             drop rows that contain fewer than n non-missing values
fillna(0)                    replace missing values with zeros
sample([n])                  return a random sample of n rows (default 1)
isnull()                     returns True if the value is missing
notnull()                    returns True if the value is non-missing
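A sketch of these methods on a hypothetical frame. The column names and values are invented and are not taken from flights.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns with missing values; row 2 is entirely missing
df_demo = pd.DataFrame({
    "dep_delay": [2.0, np.nan, np.nan, 4.0],
    "arr_delay": [11.0, 20.0, np.nan, np.nan],
    "distance": [200.0, 300.0, np.nan, 500.0],
})

missing_per_column = df_demo.isnull().sum()             # count of NaN per column
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]

complete_only = df_demo.dropna()                        # rows without any NaN
not_all_missing = df_demo.dropna(how="all")             # drops only row 2
at_least_two = df_demo.dropna(thresh=2)                 # needs >= 2 non-missing values
filled = df_demo.fillna(0)                              # NaN replaced by 0

print(missing_per_column)
print(rows_with_missing)
```

All of these return new dataframes; df_demo keeps its missing values.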


Graphics with Data Frame

● Pandas DataFrame offers some methods to plot data


○ %matplotlib inline
○ import matplotlib.pyplot as plt
dfSal.plot(x='rank',y='salary')
dfSal['salary'].plot.hist()
