
Pandas Notes (1)

pandas python notes

Uploaded by

vajugoswami

Pandas

What is Pandas?

Pandas is a Python library used for working with data sets.


It has functions for analyzing, cleaning, exploring, and manipulating data.
Why Use Pandas?
Pandas lets us analyze large data sets and draw conclusions based on statistical theory.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
import pandas

mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar)
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2

Example
import pandas as pd

mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)
    cars  passings
0    BMW         3
1  Volvo         7
2   Ford         2
Checking Pandas Version
The version string is stored in the __version__ attribute.
Example
import pandas as pd

print(pd.__version__)
1.0.3
Pandas Series
What is a Series?
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Example
Create a simple Pandas Series from a list:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar)
0    1
1    7
2    2
dtype: int64
Labels
If nothing else is specified, the values are labeled with their index number: the first value
has index 0, the second has index 1, and so on.
This label can be used to access a specified value.
Example
Return the first value of the Series:
print(myvar[0])
1

Create Labels

With the index argument, you can name your own labels.
Example
Create your own labels:
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a, index = ["x", "y", "z"])

print(myvar)
x    1
y    7
z    2
dtype: int64
When you have created labels, you can access an item by referring to the label.
Example
Return the value of "y":
print(myvar["y"])
7
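The lookup by label can be contrasted with a lookup by position. The sketch below reuses the Series from the example above and uses the loc and iloc accessors; the variable names by_label and by_position are only illustrative.

```python
import pandas as pd

# Same data and labels as the example above.
myvar = pd.Series([1, 7, 2], index=["x", "y", "z"])

by_label = myvar.loc["y"]      # look up by label
by_position = myvar.iloc[1]    # look up by position (the second item)

print(by_label, by_position)
```

Both lookups return the same item here because "y" is the second label.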
Key/Value Objects as Series
You can also use a key/value object, like a dictionary, when creating a Series.
Example
Create a simple Pandas Series from a dictionary:
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)
day1    420
day2    380
day3    390
dtype: int64
Note: The keys of the dictionary become the labels.
To select only some of the items in the dictionary, use the index argument and specify only the
items you want to include in the Series.
Example
Create a Series using only data from "day1" and "day2":
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories, index = ["day1", "day2"])

print(myvar)
day1    420
day2    380
dtype: int64
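It may help to see what happens when the index argument names a label that is not a key in the dictionary. In the sketch below, "day4" is a hypothetical label chosen for illustration; Pandas fills the missing entry with NaN, which also turns the values into floats.

```python
import pandas as pd

calories = {"day1": 420, "day2": 380, "day3": 390}

# "day4" is a hypothetical label that is not a key in the dictionary.
partial = pd.Series(calories, index=["day1", "day4"])

print(partial)
print(partial.isna())   # True marks the entry that had no matching key
```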
DataFrames
Data sets in Pandas are usually multi-dimensional tables, called DataFrames. A Series is like a
single column; a DataFrame is the whole table.
Example
Create a DataFrame from two columns of data (each column becomes a Series):
import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)

print(myvar)
   calories  duration
0       420        50
1       380        40
2       390        45
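The example above passes plain lists. As a small variation, the sketch below builds the same table from two explicit Series objects and then checks that a column read back from the DataFrame is a Series again.

```python
import pandas as pd

calories = pd.Series([420, 380, 390])
duration = pd.Series([50, 40, 45])

# Each Series becomes one column; the dictionary keys become column names.
df = pd.DataFrame({"calories": calories, "duration": duration})

print(df)
print(type(df["calories"]).__name__)   # a single column is a Series
```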
Pandas DataFrames
A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table
with rows and columns.
Example
Create a simple Pandas DataFrame:
import pandas as pd

data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}

#load data into a DataFrame object:


df = pd.DataFrame(data)

print(df)
   calories  duration
0       420        50
1       380        40
2       390        45
Locate Row
As you can see from the result above, the DataFrame is like a table with rows and columns.
Pandas uses the loc attribute to return one or more specified rows.
Example
Return row 0:
#refer to the row index:
print(df.loc[0])
calories    420
duration     50
Name: 0, dtype: int64
Example
Return row 0 and 1:

#use a list of indexes:


print(df.loc[[0, 1]])
   calories  duration
0       420        50
1       380        40
Note: When a list of labels is passed inside loc[], the result is a Pandas DataFrame.
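The difference between the two return types can be checked directly. This is a minimal sketch using the same data as above; the names one_row and row_list are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"calories": [420, 380, 390], "duration": [50, 40, 45]})

one_row = df.loc[0]       # a single label returns a Series
row_list = df.loc[[0]]    # a list, even with one label, returns a DataFrame

print(type(one_row).__name__)
print(type(row_list).__name__)
```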
Named Indexes
With the index argument, you can name your own indexes.
Example
Add a list of names to give each row a name:
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)

      calories  duration
day1       420        50
day2       380        40
day3       390        45
Locate Named Indexes

Use the named index in the loc attribute to return the specified row(s).
Example
Return "day2":
#refer to the named index:
print(df.loc["day2"])
calories    380
duration     40
Name: day2, dtype: int64
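Named indexes also work with a list of labels or a label range. The sketch below rebuilds the DataFrame from the example above; note that a label slice in loc includes both endpoints. The names picked and span are only illustrative.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45]
}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])

picked = df.loc[["day1", "day3"]]   # a list of labels
span = df.loc["day1":"day2"]        # a label slice, both ends included

print(picked)
print(span)
```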
Load Files into a DataFrame
If your data sets are stored in a file, Pandas can load them into a DataFrame.
Example
Load a comma separated file (CSV file) into a DataFrame:

import pandas as pd
df = pd.read_csv('data.csv')

print(df)
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
164        60    105       140     290.8
165        60    110       145     300.4
166        60    115       145     310.2
167        75    120       150     320.4
168        75    125       150     330.4
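If data.csv is not at hand, read_csv can be tried on text held in memory. The sketch below is an assumption-laden stand-in: csv_text is a hypothetical three-row sample that copies the column layout and a few values from the output above, and io.StringIO makes the string behave like a file.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as data.csv (not the real file).
csv_text = """Duration,Pulse,Maxpulse,Calories
60,110,130,409.1
60,117,145,479.0
45,109,175,282.4
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)   # whole-number columns load as int64, Calories as float64
```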
Example
Load the CSV into a DataFrame:
import pandas as pd

df = pd.read_csv('data.csv')

print(df.to_string())
Tip: use to_string() to print the entire DataFrame.
If a DataFrame has more rows than the display limit allows, print(df) shows only the first 5 rows
and the last 5 rows:
Example
Print the DataFrame without the to_string() method:

import pandas as pd

df = pd.read_csv('data.csv')

print(df)
max_rows
The number of rows returned is defined in Pandas option settings.
You can check the current limit on your system with the pd.options.display.max_rows option.
Example
Check the number of maximum returned rows:
import pandas as pd

print(pd.options.display.max_rows)
Example
Increase the maximum number of rows to display the entire DataFrame:
import pandas as pd

pd.options.display.max_rows = 9999

df = pd.read_csv('data.csv')

print(df)
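Assigning to pd.options.display.max_rows changes the setting for the rest of the session. As an alternative sketch, pd.option_context raises the limit only inside a with block and restores the previous value afterwards. The 100-row DataFrame below is a hypothetical stand-in for data.csv.

```python
import pandas as pd

# Hypothetical DataFrame with 100 rows, standing in for data.csv.
df = pd.DataFrame({"n": range(100)})

default_limit = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 9999):
    full_text = repr(df)   # rendered with the raised limit, so no rows are cut

# Outside the block the previous limit applies again.
print(pd.get_option("display.max_rows") == default_limit)
print(len(full_text.splitlines()))
```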

Pandas - Cleaning Data


Data Cleaning
Data cleaning means fixing bad data in your data set.
Bad data could be:
Empty cells
Data in wrong format
Wrong data
Duplicates
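Two of the problems listed above, empty cells and duplicates, can be counted before any cleaning is done. The sketch below uses a small hypothetical DataFrame with one empty cell and one repeated row; isna() and duplicated() are the Pandas methods doing the work.

```python
import pandas as pd

# Hypothetical data: one empty cell and one duplicated row.
df = pd.DataFrame({
    "Duration": [60, 60, 45, 45],
    "Calories": [409.1, None, 282.4, 282.4],
})

missing_per_column = df.isna().sum()     # empty cells in each column
duplicate_rows = df.duplicated().sum()   # rows identical to an earlier row

print(missing_per_column)
print(duplicate_rows)
```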
Empty Cells
Empty cells can potentially give you a wrong result when you analyze data.

Remove Rows
One way to deal with empty cells is to remove rows that contain empty cells.
This is usually acceptable when the data set is large, because removing a few rows will not have
a big impact on the result.
Example
Return a new DataFrame with no empty cells:

import pandas as pd

df = pd.read_csv('data.csv')

new_df = df.dropna()
print(new_df.to_string())
If you want to change the original DataFrame, use the inplace = True argument:

Example
Remove all rows with NULL values:
import pandas as pd

df = pd.read_csv('data.csv')

df.dropna(inplace = True)

print(df.to_string())
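The effect of dropna() is easier to see on a small table. The sketch below uses hypothetical data in place of data.csv and also shows the subset argument, which limits the check to chosen columns.

```python
import pandas as pd

# Hypothetical data: row 1 lacks a Pulse value, row 2 lacks a Calories value.
df = pd.DataFrame({
    "Pulse": [110, None, 103],
    "Calories": [409.1, 479.0, None],
})

new_df = df.dropna()                             # drop rows with any empty cell
only_calories = df.dropna(subset=["Calories"])   # only look at one column

print(len(df), len(new_df), len(only_calories))
```

Without inplace = True, the original df keeps all three rows.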
Replace Empty Values
Another way of dealing with empty cells is to insert a new value instead.
This way you do not have to delete entire rows just because of some empty cells.
The fillna() method allows us to replace empty cells with a value:
Example
Replace NULL values with the number 130:
import pandas as pd

df = pd.read_csv('data.csv')

df.fillna(130, inplace = True)


Replace Only For Specified Columns
The example above replaces all empty cells in the whole DataFrame.
To replace empty values in only one column, pass fillna() a dictionary that maps the column name
to the replacement value:
Example
Replace NULL values in the "Calories" column with the number 130:
import pandas as pd

df = pd.read_csv('data.csv')

df.fillna({"Calories": 130}, inplace = True)
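To check that only one column is affected, the sketch below fills a single column in two ways on hypothetical data: with a dictionary passed to fillna(), and by assigning the filled column back. Both avoid calling an inplace method on df["Calories"] directly, which recent Pandas versions warn about.

```python
import pandas as pd

# Hypothetical data with empty cells in two columns.
df = pd.DataFrame({
    "Pulse": [110, None, 103],
    "Calories": [409.1, None, 340.0],
})

# Option 1: a dictionary maps each column name to its fill value.
filled_a = df.fillna({"Calories": 130})

# Option 2: fill the column and assign the result back.
filled_b = df.copy()
filled_b["Calories"] = filled_b["Calories"].fillna(130)

print(filled_a)
```

The empty Pulse cell stays empty in both results.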

Replace Using Mean, Median, or Mode


A common way to replace empty cells is to calculate the mean, median, or mode value of the
column.
Pandas uses the mean(), median(), and mode() methods to calculate the respective values for a
specified column:
Example
Calculate the MEAN, and replace any empty values with it:

import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].mean()

df.fillna({"Calories": x}, inplace = True)


Mean = the average value (the sum of all values divided by number of values).

Example
Calculate the MEDIAN, and replace any empty values with it:
import pandas as pd

df = pd.read_csv('data.csv')

x = df["Calories"].median()

df.fillna({"Calories": x}, inplace = True)


Median = the value in the middle, after you have sorted all values ascending.
Example
Calculate the MODE, and replace any empty values with it:

import pandas as pd
df = pd.read_csv('data.csv')
x = df["Calories"].mode()[0]
df.fillna({"Calories": x}, inplace = True)
Mode = the value that appears most frequently.
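The three statistics can be compared side by side on a small column. The values below are hypothetical, chosen so that the mean, median, and mode all differ; each method skips the empty cell by default.

```python
import pandas as pd

# Hypothetical column with one empty cell.
calories = pd.Series([300.0, 300.0, 340.0, None, 460.0])

mean_value = calories.mean()       # (300 + 300 + 340 + 460) / 4
median_value = calories.median()   # middle of 300, 300, 340, 460
mode_value = calories.mode()[0]    # mode() returns a Series, so take item 0

filled = calories.fillna(median_value)

print(mean_value, median_value, mode_value)
print(filled)
```

mode() returns a Series because several values can tie for most frequent, which is why the examples index it with [0].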
