FDS Notes Unit-4
21CSS202T
Unit-4
HANDLING DATA
Problems faced when handling large data
General techniques for handling large volumes of data
General programming tips for dealing with large data sets
Introduction to Pandas
Data Structure in pandas
Dataframe and Series
Accessing and Slicing of Series and Dataframes
Arithmetic and Logical Operations on Dataframe
Groupby operations on Dataframe
Pivot tables to understand the relationship between variables in the data with different aggregations
Crosstab to understand the relationship between variables in the data
Handling missing data
Time Series
o Date Functionality
o Time Delta
Vectorization concept implementation using pandas
I/O tools of Pandas
Indexing and multi-indexing concepts - Applications
Data Handling
o Categorical data
o Integer data
Computational tools
o Statistical functions
o Windowing Operations
Chart and Table Visualization in Pandas
PROBLEMS FACED WHEN HANDLING LARGE DATA
Data quality: Large data sets can contain errors, duplicates, and incomplete records. Data
validation can help ensure that data is accurate, complete, and properly formatted.
Security and privacy: As the amount of data increases, so do the security and privacy
concerns. Organizations need to put in place strong data processes and governance policies to
ensure data is managed responsibly and ethically.
Cost: Managing large amounts of data can be expensive, especially for organizations that
generate large volumes of data daily. Organizations need to evaluate their storage and
processing needs and adopt cost-effective solutions.
Data integration: Big data comes from many different sources and in many different
formats. Data integration tools can help combine data from different sources and make it
available for analysis.
Accessibility: Organizations need to make data easy and convenient for users of all skill
levels to use.
Finding the right tools: Organizations need to find the right technology to work within their
established ecosystems and address their particular needs.
Uncovering insights: Organizations need to analyze big data to unearth intelligence to drive
better decision making.
Organizational resistance: Companies need to rethink processes, workflows, and even the
way problems are approached.
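GENERAL TECHNIQUES FOR HANDLING LARGE VOLUMES OF DATA
A common technique when a data set does not fit in memory is to process it in chunks. Below is a minimal sketch using pandas; the file name "large_data.csv" and the "Salary" column are illustrative placeholders, not part of these notes.
Example:
import pandas as pd
total = 0
# Read the file 100000 rows at a time instead of loading it all at once
for chunk in pd.read_csv("large_data.csv", chunksize=100000):
    total += chunk["Salary"].sum()  # aggregate each chunk, keep only the running total
print(total)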
INTRODUCTION TO PANDAS
Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and
was created by Wes McKinney in 2008.
Pandas allow us to analyze big data and make conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Data Science: is a branch of computer science where we study how to store, use and analyze
data for deriving information from it.
Installation of Pandas
pip install pandas
Import Pandas
import pandas
Import Pandas with an Alias
import pandas as pd
DATAFRAME
Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns).
A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns like a spreadsheet or SQL table, or a dict of Series objects.
Pandas DataFrame consists of three principal components: the data, rows, and columns.
Example:
import pandas as pd
data = { "calories": [420, 380, 390], "duration": [50, 40, 45] }
df = pd.DataFrame(data)
print(df)
Output:
calories duration
0 420 50
1 380 40
2 390 45
Locate Row
Example: Return row 0:
print(df.loc[0])
Output:
calories 420
duration 50
Example: Return row 0 and 1:
print(df.loc[[0, 1]])
Output:
calories duration
0 420 50
1 380 40
Note: When using a list of labels inside [], as in df.loc[[0, 1]], the result is a Pandas DataFrame; df.loc[0] returns a Series.
Named Indexes
import pandas as pd
data = { "calories": [420, 380, 390], "duration": [50, 40, 45] }
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Output:
calories duration
day1 420 50
day2 380 40
day3 390 45
SERIES
A Pandas Series is like a column in a table.
It is a one-dimensional array holding data of any type.
Syntax: pandas.Series(data=None, index=None, dtype=None, name=None, copy=False,
fastpath=False)
Parameters:
data: array-like - contains the data stored in the Series.
index: array-like or Index (1d)
dtype: str, numpy.dtype, or ExtensionDtype, optional
name: str, optional
copy: bool, default False
Labels
If nothing else is specified, the values are labeled with their index number (0, 1, 2, ...), and a label can be used to access a value.
Example:
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar[0])
Output:
1
Create Labels
With the index argument, you can name your own labels.
Example: Create your own labels
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
Output:
x 1
y 7
z 2
When you have created labels, you can access an item by referring to the label.
Example:
print(myvar["y"])
Output: 7
ACCESSING AND SLICING OF SERIES AND DATAFRAMES
Slicing Pandas DataFrames is a powerful technique, allowing extraction of specific subsets of the data based on labels or integer positions.
Example:
import pandas as pd
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata','Mahindra', 'Maruti', 'Hyundai', 'Renault',
'Tata', 'Maruti'], 'Year': [2012, 2014, 2011, 2015, 2012, 2016, 2014, 2018, 2019],
'Kms Driven': [50000, 30000, 60000, 25000, 10000, 46000, 31000, 15000, 12000],
'City': ['Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Chennai', 'Ghaziabad'],
'Mileage': [28, 27, 25, 26, 28, 29, 24, 21, 24]})
print(data)
loc[]
The loc[] indexer is label-based, which means that we pass the name of the row or column that we want to select.
Example: Selecting Data According to Some Conditions
print(data.loc[(data.Brand == 'Maruti') & (data.Mileage > 25)])
Output:
Brand Year Kms Driven City Mileage
0 Maruti 2012 50000 Gurgaon 28
4 Maruti 2012 10000 Mumbai 28
at[]
Pandas at[] is used to access a single value in a DataFrame by row label and column name.
Syntax: Dataframe.at[row_label, column_label]
Parameters:
row_label: label of the row
column_label: name of the column
Return type: single element at the given location
Example:
position = 2
label = 'Brand'
output = data.at[position, label]
print(output)
Output: Tata
iat[]
Pandas iat[] is used to access a single value in a DataFrame by integer position.
Syntax: Dataframe.iat[row, column]
Parameters:
row: integer position of the row
column: integer position of the column
Return type: single element at the given position
Example:
column = 3
row = 2
output = data.iat[row, column]
print(output)
Output: Mumbai
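iloc[]
Pandas also provides iloc[], which selects purely by integer position. A short sketch on the same car data as above:
Example:
# Rows 0-2 and the first two columns, selected by integer position
print(data.iloc[0:3, 0:2])
Output:
     Brand  Year
0   Maruti  2012
1  Hyundai  2014
2     Tata  2011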
ARITHMETIC OPERATIONS ON DATAFRAME
Addition of 2 Series
import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])
series3 = series1 + series2
print(series3)
Output:
0 7
1 9
2 11
3 13
4 15
Subtraction of 2 Series
import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])
series3 = series1 - series2
print(series3)
Output:
0 -5
1 -5
2 -5
3 -5
4 -5
Multiplication of 2 Series
import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])
series3 = series1 * series2
print(series3)
Output:
0 6
1 14
2 24
3 36
4 50
Division of 2 Series
import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])
series3 = series1 / series2
print(series3)
Output:
0 0.166667
1 0.285714
2 0.375000
3 0.444444
4 0.500000
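When two Series have different lengths, unmatched positions produce NaN; the add() method with fill_value treats the missing entries as a default value instead. A minimal sketch:
import pandas as pd
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([10, 20])
print(s1 + s2)                   # position 2 has no partner, so the result is NaN
print(s1.add(s2, fill_value=0))  # missing entries are treated as 0
Output:
0    11.0
1    22.0
2     NaN
dtype: float64
0    11.0
1    22.0
2     3.0
dtype: float64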
LOGICAL OPERATIONS ON DATAFRAME
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 32, 30, 29],
'Score': [85, 92, 88, 75]
}
df = pd.DataFrame(data)
AND Operation(&):
filtered_df = df[(df['Age'] > 30) & (df['Score'] > 80)]
print(filtered_df)
Output:
Name Age Score
1 Bob 32 92
OR Operation (|):
filtered_df_or = df[(df['Age'] > 30) | (df['Score'] > 80)]
print(filtered_df_or)
Output:
Name Age Score
0 Alice 25 85
1 Bob 32 92
2 Charlie 30 88
NOT Operation (~):
filtered_df_not = df[~(df['Score'] > 80)]
print(filtered_df_not)
Output:
Name Age Score
3 David 29 75
GROUPBY OPERATIONS ON DATAFRAME
Groupby splits the data into groups based on some criteria, applies a function to each group, and combines the results.
Example:
import pandas as pd
l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
print(df)
print("Groupby operation")
print(df.groupby(by=["b"]).sum())
Output:
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
Groupby operation
a c
b
1.0 2 3
2.0 2 5
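groupby() can also apply several aggregations at once through agg(); a short sketch continuing with the same df:
print(df.groupby("b")["c"].agg(["sum", "mean", "count"]))
Output:
     sum  mean  count
b
1.0    3   3.0      1
2.0    5   2.5      2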
PIVOT TABLE
A pivot table summarises a DataFrame by grouping on one or more keys and applying an aggregation to each group. The example below builds the DataFrame; a pivot is applied to it in the sketch after the output.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]})
print(df)
Output:
A B C
0 John Masters 27
1 Boby Graduate 23
2 Mina Graduate 21
3 Peter Masters 23
4 Nicky Graduate 24
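The DataFrame above can be summarised with pivot_table(). A sketch aggregating the mean of C for each value of B (the choice of values, index, and aggfunc here is illustrative, not fixed by the notes):
table = pd.pivot_table(df, values='C', index=['B'], aggfunc='mean')
print(table)
Output:
                  C
B
Graduate  22.666667
Masters   25.000000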
CROSSTAB
crosstab() computes a frequency table of two (or more) factors.
Example:
import pandas
import numpy
a = numpy.array(["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "foo", "foo", "foo"],
dtype=object)
b = numpy.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"],
dtype=object)
c = numpy.array(["dull", "dull", "shiny", "dull", "dull", "shiny", "shiny", "dull", "shiny",
"shiny", "shiny"], dtype=object)
pandas.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
Output:
b    one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2
HANDLING MISSING DATA
Missing data occurs when no information is provided for one or more items or for a whole unit.
Missing data is also referred to as NA (Not Available) values in pandas. For example, users being
surveyed may choose not to share their income or address; in this way many datasets end up with
missing values. Pandas provides the following functions for handling missing data:
1. isnull()
2. notnull()
3. dropna()
4. fillna()
5. replace()
6. interpolate()
1. isnull():
import pandas as pd
import numpy as np
dict = {'First Score':[100, 90, np.nan, 95],'Second Score': [30, 45, 56, np.nan],'Third Score':[np.nan,
40, 80, 98]}
df = pd.DataFrame(dict)
df.isnull()
Output:
First Score Second Score Third Score
0 False False True
1 False False False
2 True False False
3 False True False
2. notnull():
df.notnull()
Output:
First Score Second Score Third Score
0 True True False
1 True True True
2 False True True
3 True False True
3. dropna():
df.dropna()
Output:
   First Score  Second Score  Third Score
1         90.0          45.0         40.0
4. fillna():
df.fillna(0)
Output:
   First Score  Second Score  Third Score
0        100.0          30.0          0.0
1         90.0          45.0         40.0
2          0.0          56.0         80.0
3         95.0           0.0         98.0
5. replace():
df.replace(to_replace = np.nan, value = -99)
Output:
   First Score  Second Score  Third Score
0        100.0          30.0        -99.0
1         90.0          45.0         40.0
2        -99.0          56.0         80.0
3         95.0         -99.0         98.0
6. interpolate():
df.interpolate(method ='linear', limit_direction ='forward')
Output:
   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2         92.5          56.0         80.0
3         95.0          56.0         98.0
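Pandas can also fill gaps by propagating existing values instead of a constant. A minimal sketch using forward fill (ffill()) on the same df:
print(df.ffill())   # propagate the last valid observation forward
Output:
   First Score  Second Score  Third Score
0        100.0          30.0          NaN
1         90.0          45.0         40.0
2         90.0          56.0         80.0
3         95.0          56.0         98.0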
TIME SERIES
Time series data is a sequence of data points in chronological order that is used by businesses to
analyze past data and make future predictions. Common examples of time series data in our
day-to-day lives include:
Measuring weather temperatures
Measuring the number of taxi rides per month
Predicting a company’s stock prices for the next day
DATE FUNCTIONALITY
Extending time series, date functionalities play a major role in financial data analysis:
o Generating a sequence of dates
o Converting a date series to different frequencies
o Create a Range of Dates
Using the date_range() function and specifying the periods and the frequency, we can create a date
series. By default, the frequency of the range is days.
import pandas as pd
print(pd.date_range('1/1/2011', periods=5))
Output:
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'],
dtype='datetime64[ns]', freq='D')
o Change the Date Frequency
import pandas as pd
print(pd.date_range('1/1/2011', periods=5, freq='M'))
Output:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30', '2011-05-31'],
dtype='datetime64[ns]', freq='M')
Example:
import pandas as pd
print(pd.bdate_range(start='1/1/2018', end='1/08/2018'))
Output:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-08'],
dtype='datetime64[ns]', freq='B')
Offset Aliases
Alias   Description                        Alias    Description
B       business day frequency             BQS      business quarter start frequency
D       calendar day frequency             A        annual (year) end frequency
W       weekly frequency                   BA       business year end frequency
M       month end frequency                BAS      business year start frequency
SM      semi-month end frequency           BH       business hour frequency
BM      business month end frequency       H        hourly frequency
MS      month start frequency              T, min   minutely frequency
SMS     semi-month start frequency         S        secondly frequency
BMS     business month start frequency     L, ms    millisecond frequency
Q       quarter end frequency              U, us    microsecond frequency
BQ      business quarter end frequency     N        nanosecond frequency
QS      quarter start frequency
TIME DELTA
Timedeltas are differences in time, expressed in units such as days, hours, minutes, and seconds. Timedelta is a subclass of datetime.timedelta.
Example:
import pandas as pd
td = pd.Timedelta('3 days 06:05:01.000000111')
print(td)
print(td.seconds)
Output:
3 days 06:05:01.000000111
21901
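Timedeltas support arithmetic with Timestamps and with each other. A minimal sketch:
import pandas as pd
ts = pd.Timestamp('2011-01-01 10:00')
td2 = pd.Timedelta('2 days 3 hours')
print(ts + td2)   # shift a timestamp forward by the timedelta
print(td2 * 2)    # scale a timedelta
Output:
2011-01-03 13:00:00
4 days 06:00:00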
VECTORIZATION
Vectorization replaces explicit Python loops with optimized array operations (e.g. NumPy), which is usually far faster, as the timings below show.
Dot Product:
The dot product is an algebraic operation in which two equal-length vectors are multiplied element-wise and the results summed, producing a single number. The dot product is often called the inner product; it results in a scalar.
Example:
import time
import numpy
import array
# 8-byte signed integers
a = array.array('q')
for i in range(100000):
    a.append(i)
b = array.array('q')
for i in range(100000, 200000):
    b.append(i)
# classic dot product of vectors implementation
tic = time.process_time()
dot = 0.0
for i in range(len(a)):
    dot += a[i] * b[i]
toc = time.process_time()
print("dot_product = " + str(dot))
print("Computation time = " + str(1000 * (toc - tic)) + "ms")
n_tic = time.process_time()
n_dot_product = numpy.dot(a, b)
n_toc = time.process_time()
print("\nn_dot_product = " + str(n_dot_product))
print("Computation time = " + str(1000 * (n_toc - n_tic)) + "ms")
Output:
dot_product = 833323333350000.0
Computation time = 61.22865300000058ms
n_dot_product = 833323333350000
Computation time = 2.8878020000000504ms
Outer Product:
The tensor product of two coordinate vectors is termed the outer product. Consider two vectors a
and b with dimensions n x 1 and m x 1; their outer product is an n x m rectangular matrix. If the
two vectors have the same dimension, the resulting matrix is square.
Example:
import time
import numpy
import array
a = array.array('i')
for i in range(200):
    a.append(i)
b = array.array('i')
for i in range(200, 400):
    b.append(i)
# classic outer product implementation with nested loops
tic = time.process_time()
outer_product = numpy.zeros((200, 200))
for i in range(len(a)):
    for j in range(len(b)):
        outer_product[i][j] = a[i] * b[j]
toc = time.process_time()
print("outer_product = " + str(outer_product))
print("Computation time = " + str(1000 * (toc - tic)) + "ms")
n_tic = time.process_time()
outer_product = numpy.outer(a, b)
n_toc = time.process_time()
print("outer_product = " + str(outer_product))
print("\nComputation time = " + str(1000 * (n_toc - n_tic)) + "ms")
Element-wise Product:
Example:
import time
import numpy
import array
a = array.array('i')
for i in range(50000):
    a.append(i)
b = array.array('i')
for i in range(50000, 100000):
    b.append(i)
# classic element-wise product implementation with a loop
vector = numpy.zeros((50000))
tic = time.process_time()
for i in range(len(a)):
    vector[i] = a[i] * b[i]
toc = time.process_time()
print("Element wise Product = " + str(vector))
print("\nComputation time = " + str(1000 * (toc - tic)) + "ms")
n_tic = time.process_time()
vector = numpy.multiply(a, b)
n_toc = time.process_time()
print("Element wise Product = " + str(vector))
print("\nComputation time = " + str(1000 * (n_toc - n_tic)) + "ms")
Output:
Element wise Product = [0.00000000e+00 5.00010000e+04 1.00004000e+05 ... 4.99955001e+09
4.99970000e+09 4.99985000e+09]
Computation time = 37.37993300000042ms
Element wise Product = [ 0 50001 100004 ... 704582713 704732708 704882705]
Computation time = 0.3640780000004895ms
I/O TOOLS OF PANDAS
Pandas provides I/O functions such as read_csv() for reading data files into DataFrames.
Example:
import pandas as pd
df = pd.read_csv("temp.csv")
print(df)
Output:
S.No Name Age City Salary
0 1 Tom 28 Toronto 20000
1 2 Lee 32 HongKong 3000
2 3 Steven 43 Bay Area 8300
3 4 Ram 38 Hyderabad 3900
custom index: use index_col to specify a column in the CSV file to use as the index.
import pandas as pd
df = pd.read_csv("temp.csv", index_col=['S.No'])
print(df)
Output:
        Name  Age       City  Salary
S.No
1        Tom   28    Toronto   20000
2        Lee   32   HongKong    3000
3     Steven   43   Bay Area    8300
4        Ram   38  Hyderabad    3900
custom headers: specify the column names using the names argument.
import pandas as pd
df = pd.read_csv("temp.csv", names=['a', 'b', 'c', 'd', 'e'])
print(df)
Output:
a b c d e
0 S.No Name Age City Salary
1 1 Tom 28 Toronto 20000
2 2 Lee 32 HongKong 3000
3 3 Steven 43 Bay Area 8300
4 4 Ram 38 Hyderabad 3900
INDEXING
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame.
Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the
columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection.
Pandas supports four types of multi-axes indexing:
1. Dataframe[ ] : also known as the indexing operator
2. Dataframe.loc[ ] : used for label-based selection
3. Dataframe.iloc[ ] : used for position (integer)-based selection
4. Dataframe.ix[ ] : used for both label- and integer-based selection (deprecated and removed in recent versions of pandas)
Collectively, they are called the indexers.
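MULTI-INDEXING
A MultiIndex lets a DataFrame carry several index levels. The example below assumes a DataFrame such as the following, reconstructed from the output shown (the original construction is not in the notes):
import pandas as pd
df = pd.DataFrame({'name': ['Saikat', 'Shrestha', 'Sandi', 'Abinash'],
                   'Jobs': ['Software Developer', 'System Engineer', 'Footballer', 'Singer'],
                   'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10.0]})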
pd.MultiIndex.from_frame(df)
Output:
MultiIndex([( 'Saikat', 'Software Developer', 12.4),
('Shrestha', 'System Engineer', 5.6),
( 'Sandi', 'Footballer', 9.3),
( 'Abinash', 'Singer', 10.0)],
names=['name', 'Jobs', 'Annual Salary(L.P.A)'])
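The second example assumes a different DataFrame whose index was extended with the 'series' and 'Ratings' columns; reconstructed from the output, set_index with append=True is one way to produce it:
df = pd.DataFrame({'series': ['Peaky blinders', 'Sherlock', 'The crown', 'Queens Gambit', 'Friends'],
                   'Ratings': [4.5, 5.0, 3.9, 4.2, 5.0]})
df = df.set_index(['series', 'Ratings'], append=True)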
print(df.index)
Output:
MultiIndex([(0, 'Peaky blinders', 4.5),
(1, 'Sherlock', 5.0),
(2, 'The crown', 3.9),
(3, 'Queens Gambit', 4.2),
(4, 'Friends', 5.0)],
names=[None, 'series', 'Ratings'])
DATA HANDLING
1. CATEGORICAL DATA
Categorical data is a set of predefined categories or groups an observation can fall into.
pandas.Categorical(val, categories=None, ordered=None, dtype=None) represents a categorical
variable. Categoricals are a pandas data type corresponding to categorical variables in statistics.
Such variables take on a fixed and limited number of possible values, for example grades, gender,
or blood group type.
Example:
import pandas as pd
data = ['red', 'blue', 'green', 'red', 'blue']
categorical_data = pd.Categorical(data)
print(categorical_data)
Output:
['red', 'blue', 'green', 'red', 'blue']
Categories (3, object): ['blue', 'green', 'red']
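Categories can also be ordered, which enables comparisons and min/max. A minimal sketch:
import pandas as pd
grades = pd.Categorical(['B', 'A', 'C', 'A'],
                        categories=['C', 'B', 'A'], ordered=True)
print(grades.min(), grades.max())   # C A -- order follows the categories list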
2. INTEGER DATA
Pandas provides nullable integer dtypes (e.g. "Int64", with a capital I) that can hold missing values (<NA>) alongside integers, unlike NumPy's int64.
import pandas as pd
s = pd.Series([1, 2, None], dtype="Int64")
s_plus_one = s + 1 # Adds 1 to each element in the series
comparison = s == 1 # Checks if each element in the series is equal to 1
print(s_plus_one)
print(comparison)
Output:
0 2
1 3
2 <NA>
dtype: Int64
0 True
1 False
2 <NA>
dtype: boolean
COMPUTATIONAL TOOLS
Pandas provides computational tools - statistical functions and windowing operations - for analyzing data.
STATISTICAL FUNCTIONS
Percent Change
pct_change() computes the percent change of each element over the previous element.
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randn(8))
ser.pct_change()
Output (values are random and will vary):
0 NaN
1 -1.602976
2 4.334938
3 -0.247456
4 -2.067345
5 -1.142903
6 -1.688214
7 -9.759729
dtype: float64
Covariance
cov() computes the covariance between two Series; with random data the value will vary run to run.
s1 = pd.Series(np.random.randn(1000))
s2 = pd.Series(np.random.randn(1000))
s1.cov(s2)
Output: 0.00068010881743109993
Correlation
corr() computes correlation using one of the following methods:
Method name         Description
pearson (default)   standard correlation coefficient
kendall             Kendall Tau correlation coefficient
spearman            Spearman rank correlation coefficient
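The call below assumes a DataFrame of random values such as the following (the construction is not shown in the notes, and the resulting correlation will vary run to run):
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])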
frame['a'].corr(frame['b'], method='spearman')
Output: -0.0072898851595406388
Data ranking
rank() assigns ranks to the values; ties share the average rank by default.
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b']  # so there's a tie
s.rank()
Output (ranks from one random run):
a 5.0
b 2.5
c 1.0
d 2.5
e 4.0
Calculate Statistics
import pandas as pd
dataset = pd.read_csv('train.csv')   # e.g. the Titanic training set
dataset.head()                       # shows the first five rows
a. Mean:
mean = dataset['Age'].mean()
print(mean)
Output: 29.69911764705882
b. Median
median = dataset['Fare'].median()
print(median)
Output: 14.4542
c. Mode:
mode = dataset['Sex'].mode()
print(mode)
Output: 0 male
d. Count:
count = dataset['Ticket'].count()
print(count)
Output: 891
e. Standard Deviation
std = dataset['Fare'].std()
print(std)
Output: 49.693428597180905
f. Max:
maxValue = dataset['Age'].max()
print(maxValue)
Output: 80.0
g. Min:
minValue = dataset['Fare'].min()
print(minValue)
Output: 0.0000
h. Describe:
dataset.describe()  # summary statistics for the numeric columns
WINDOWING OPERATIONS
Pandas contains a compact set of APIs for performing windowing operations - an operation that
performs an aggregation over a sliding partition of values.
s = pd.Series(range(5))
s.rolling(window=2).sum()
Output:
0 NaN
1 1.0
2 3.0
3 5.0
4 7.0
for window in s.rolling(window=2):
    print(window)
Output:
0 0
dtype: int64
0 0
1 1
dtype: int64
1 1
2 2
dtype: int64
2 2
3 3
dtype: int64
3 3
4 4
dtype: int64
1. Rolling window
s = pd.Series(range(5), index=pd.date_range('2020-01-01', periods=5, freq='1D'))
s.rolling(window='2D').sum()
Output:
2020-01-01 0.0
2020-01-02 1.0
2020-01-03 3.0
2020-01-04 5.0
2020-01-05 7.0
Freq: D, dtype: float64
2. Expanding window
df = pd.DataFrame({'A': ['a', 'b', 'a', 'b', 'a'], 'B': range(5)})
df.groupby('A').expanding().sum()
Output:
B
A
a 0 0.0
2 2.0
4 6.0
b 1 1.0
3 4.0
3. Weighted window
The example below computes a weighted mean over each rolling window using method="table", which passes the whole window to the function at once; it requires the numba engine.
import numpy as np
import pandas as pd
def weighted_mean(x):
    arr = np.ones((1, x.shape[1]))
    arr[:, :2] = (x[:, :2] * x[:, 2]).sum(axis=0) / x[:, 2].sum()
    return arr
df = pd.DataFrame([[1, 2, 0.6], [2, 3, 0.4], [3, 4, 0.2], [4, 5, 0.7]])
df.rolling(2, method="table", min_periods=0).apply(weighted_mean, raw=True, engine="numba")
Output:
0 1 2
0 1.000000 2.000000 1.0
1 1.800000 2.000000 1.0
2 3.333333 2.333333 1.0
3 1.555556 7.000000 1.0
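A simpler weighted window multiplies each window's values by fixed weights through the win_type argument (this requires SciPy). A minimal sketch, not from the notes:
import pandas as pd
s = pd.Series(range(5))
# Each 3-value window is weighted by a triangular window [0.5, 1, 0.5] before averaging
print(s.rolling(window=3, win_type='triang').mean())
Output:
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
dtype: float64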