FDS Notes Unit-4
Uploaded by Disha Singhal

FUNDAMENTALS OF DATA SCIENCE

21CSS202T
Unit-4
HANDLING DATA
 Problem faced when handling large data
 General techniques for handling large volume of data
 General programming tips for dealing large data sets
 Introduction to Pandas
 Data Structure in pandas
 Dataframe and Series
 Accessing and Slicing of Series and Dataframes
 Arithmetic and Logical Operations on Dataframe
 Groupby operations on Dataframe
 Pivot tables to understand the relationship between variables in the data with different
aggregation
 Crosstab to understand the relationship between variables in the data
 Handling missing data
 Time Series
o Date Functionality
o Time Delta
 Vectorization concept implementation using pandas
 I/O tools of Pandas
 Indexing, multi indexing concepts - Application.
 Data Handling
o Categorical data
o Integer data
 Computational tools
o Statistical functions
o Windowing Operations
 Chart and Table Visualization in Pandas
PROBLEMS FACED WHEN HANDLING LARGE DATA
 Data quality: Large data sets can contain errors, duplicates, and incomplete records. Data
validation can help ensure that data is accurate, complete, and properly formatted.
 Security and privacy: As the amount of data increases, so do the security and privacy
concerns. Organizations need to put in place strong data processes and governance policies to
ensure data is managed responsibly and ethically.
 Cost: Managing large amounts of data can be expensive, especially for organizations that
generate large volumes of data daily. Organizations need to evaluate their storage and
processing needs and adopt cost-effective solutions.
 Data integration: Big data comes from many different sources and in many different
formats. Data integration tools can help combine data from different sources and make it
available for analysis.
 Accessibility: Organizations need to make data easy and convenient for users of all skill
levels to use.
 Finding the right tools: Organizations need to find the right technology to work within their
established ecosystems and address their particular needs.
 Uncovering insights: Organizations need to analyze big data to unearth intelligence to drive
better decision making.
 Organizational resistance: Companies need to rethink processes, workflows, and even the
way problems are approached.

GENERAL TECHNIQUES FOR HANDLING LARGE
VOLUMES OF DATA

GENERAL PROGRAMMING TIPS FOR DEALING WITH LARGE
DATA SETS
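One widely used programming tip for large data sets is to process a file in fixed-size chunks instead of loading it all into memory at once. A minimal sketch using pandas (the file name, data, and chunk size are illustrative; here a small sample file is written first so the snippet is self-contained):

```python
import pandas as pd

# Write a small sample file so the sketch is self-contained
# (in practice "big.csv" would be too large to load at once).
pd.DataFrame({"value": range(10)}).to_csv("big.csv", index=False)

# Read the file in chunks of 4 rows and aggregate incrementally,
# so only one chunk is held in memory at a time.
total = 0
for chunk in pd.read_csv("big.csv", chunksize=4):
    total += chunk["value"].sum()

print(total)  # 45
```

The same pattern works for any aggregation that can be updated chunk by chunk (sums, counts, min/max), which is why it appears so often in large-data pipelines.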

INTRODUCTION TO PANDAS
 Pandas is a Python library used for working with data sets.
 It has functions for analyzing, cleaning, exploring, and manipulating data.
 The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis" and
was created by Wes McKinney in 2008.
 Pandas allow us to analyze big data and make conclusions based on statistical theories.
 Pandas can clean messy data sets, and make them readable and relevant.
 Data Science: is a branch of computer science where we study how to store, use and analyze
data for deriving information from it.
Installation of Pandas
pip install pandas
Import Pandas
import pandas
Import Pandas Aliasing
import pandas as pd

DATA STRUCTURE IN PANDAS


It supports two data structures:
1. Dataframe
2. Series

DATAFRAME
 Pandas DataFrame is a two-dimensional size-mutable, potentially heterogeneous tabular data
structure with labeled axes (rows and columns).
 A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns like a spreadsheet or SQL table, or a dict of Series objects.
 Pandas DataFrame consists of three principal components: the data, rows, and columns.

Example:
import pandas as pd
data = { "calories": [420, 380, 390], "duration": [50, 40, 45] }
df = pd.DataFrame(data)
print(df)
Output:
calories duration
0 420 50
1 380 40
2 390 45

Locate Row
Example: Return row 0:
print(df.loc[0])
Output:
calories 420
duration 50
Example: Return row 0 and 1:
print(df.loc[[0, 1]])
Output:
calories duration
0 420 50
1 380 40
Note: When using [], the result is a Pandas DataFrame.

Named Indexes
import pandas as pd
data = { "calories": [420, 380, 390], "duration": [50, 40, 45] }
df = pd.DataFrame(data, index = ["day1", "day2", "day3"])
print(df)
Output:
calories duration
day1 420 50
day2 380 40
day3 390 45

Locate Named Indexes


Example: Return "day2":
print(df.loc["day2"])
Output:
calories 380
duration 40

Load Files Into a DataFrame


import pandas as pd
df = pd.read_csv('data.csv')
print(df)

SERIES
 A Pandas Series is like a column in a table.
 It is a one-dimensional array holding data of any type.
Syntax: pandas.Series(data=None, index=None, dtype=None, name=None, copy=False,
fastpath=False)
Parameters:
data: array- Contains data stored in Series.
index: array-like or Index (1d)
dtype: str, numpy.dtype, or ExtensionDtype, optional
name: str, optional
copy: bool, default False

Example: Create a simple Pandas Series from a list:


import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a)
print(myvar)
Output:
0 1
1 7
2 2

Labels
print(myvar[0])
Output:
1

Create Labels
With the index argument, you can name your own labels.
Example: Create your own labels
import pandas as pd
a = [1, 7, 2]
myvar = pd.Series(a, index = ["x", "y", "z"])
print(myvar)
Output:
x 1
y 7
z 2

When you have created labels, you can access an item by referring to the label.
Example:
print(myvar["y"])
Output: 7

Key/Value Objects as Series


You can also use a key/value object, like a dictionary, when creating a Series.
Example: Create a simple Pandas Series from a dictionary:
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)
print(myvar)
Output:
day1 420
day2 380
day3 390

Note: The keys of the dictionary become the labels.


To select only some of the items in the dictionary, use the index argument and specify only the items
you want to include in the Series.
Example: Create a Series using only data from "day1" and "day2":
import pandas as pd
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories, index = ["day1", "day2"])
print(myvar)
Output:
day1 420
day2 380

DataFrames to Create Series


 Data sets in Pandas are usually multi-dimensional tables, called DataFrames.
 Series is like a column, a DataFrame is the whole table.
Example: Create a DataFrame from two Series
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
myvar = pd.DataFrame(data)
print(myvar)
Output:
calories duration
0 420 50
1 380 40
2 390 45

ACCESSING AND SLICING OF SERIES AND DATAFRAMES


Accessing is used to select a specific row or column from a DataFrame.
 Using Integer Indexing
Example:
import pandas as pd
data = pd.Series([10, 20, 38, 40, 50])
print(data[2])
Output: 38

 Using Custom Index Label


import pandas as pd
data = pd.Series([10, 20, 28, 40, 50], index=['A', 'B', 'C', 'D', 'E'])
print(data['C'])
Output: 28

 Using Multiple Column Names


import pandas as pd
data = {'Name': ['John', 'Anna', 'Peter'], 'Age': [28, 24, 22]}
df = pd.DataFrame(data)
print(df[['Name', 'Age']])
Output:
Name Age
0 John 28
1 Anna 24
2 Peter 22

Slicing Pandas DataFrames is a powerful technique that extracts specific subsets of the data
based on labels or integer positions.
Example:
import pandas as pd
data = pd.DataFrame({'Brand': ['Maruti', 'Hyundai', 'Tata','Mahindra', 'Maruti', 'Hyundai', 'Renault',
'Tata', 'Maruti'], 'Year': [2012, 2014, 2011, 2015, 2012, 2016, 2014, 2018, 2019],
'Kms Driven': [50000, 30000, 60000, 25000, 10000, 46000, 31000, 15000, 12000],
'City': ['Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Chennai', 'Ghaziabad'],
'Mileage': [28, 27, 25, 26, 28, 29, 24, 21, 24]})
print(data)

 loc()
The loc[] indexer is a label-based data selection method, which means that we have to pass the
name of the row or column that we want to select.
Example: Selecting Data According to Some Conditions
print(data.loc[(data.Brand == 'Maruti') & (data.Mileage > 25)])
Output:
Brand Year Kms Driven City Mileage
0 Maruti 2012 50000 Gurgaon 28
4 Maruti 2012 10000 Mumbai 28

Example: Selecting a Range of Rows From the DataFrame


print(data.loc[2: 5])
Output:
Brand Year Kms Driven City Mileage
2 Tata 2011 60000 Mumbai 25
3 Mahindra 2015 25000 Delhi 26
4 Maruti 2012 10000 Mumbai 28
5 Hyundai 2016 46000 Delhi 29

Example: Updating the Value of Any Column


data.loc[(data.Year < 2015), ['Mileage']] = 22
print(data)
Output:
Brand Year Kms Driven City Mileage
0 Maruti 2012 50000 Gurgaon 22
1 Hyundai 2014 30000 Delhi 22
2 Tata 2011 60000 Mumbai 22
3 Mahindra 2015 25000 Delhi 26
4 Maruti 2012 10000 Mumbai 22
5 Hyundai 2016 46000 Delhi 29
6 Renault 2014 31000 Mumbai 22
7 Tata 2018 15000 Chennai 21
8 Maruti 2019 12000 Ghaziabad 24
 iloc()
The iloc[] indexer is an integer-position-based selection method, which means that we have to pass
an integer index to select a specific row/column.
Example: Selecting Rows Using Integer Indices
print(data.iloc[[0, 2, 4, 7]])
Output:
Brand Year Kms Driven City Mileage
0 Maruti 2012 50000 Gurgaon 28
2 Tata 2011 60000 Mumbai 25
4 Maruti 2012 10000 Mumbai 28
7 Tata 2018 15000 Chennai 21
Example: Selecting a Range of Columns and Rows Simultaneously
print(data.iloc[1: 5, 2: 5])
Output:
Kms Driven City Mileage
1 30000 Delhi 27
2 60000 Mumbai 25
3 25000 Delhi 26
4 10000 Mumbai 28

 at[]
Pandas at[] is used to access a single value in a DataFrame for a row/column label pair.
Syntax: DataFrame.at[row_label, column_label]
Parameters:
row_label: label of the row
column_label: name of the column
Return type: Single element at the passed position
Example:
position = 2
label = 'Brand'
output = data.at[position, label]
print(output)
Output: Tata

 iat[]
Pandas iat[] is used to access a single value in a DataFrame by integer position.
Syntax: DataFrame.iat[row, column]
Parameters:
row: integer position of the row
column: integer position of the column
Return type: Single element at the passed position

Example:
column = 3
row = 2
output = data.iat[row, column]
print(output)
Output: Mumbai
ARITHMETIC OPERATIONS ON DATAFRAME
 Addition of 2 Series
import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])
series3 = series1 + series2
print(series3)
Output:
0 7
1 9
2 11
3 13
4 15

 Subtraction of 2 Series
import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])
series3 = series1 - series2
print(series3)
Output:
0 -5
1 -5
2 -5
3 -5
4 -5

 Multiplication of 2 Series
import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])
series3 = series1 * series2
print(series3)
Output:
0 6
1 14
2 24
3 36
4 50

 Division of 2 Series
import pandas as pd
series1 = pd.Series([1, 2, 3, 4, 5])
series2 = pd.Series([6, 7, 8, 9, 10])
series3 = series1 / series2
print(series3)
Output:
0 0.166667
1 0.285714
2 0.375000
3 0.444444
4 0.500000
LOGICAL OPERATIONS ON DATAFRAME
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 32, 30, 29],
'Score': [85, 92, 88, 75]
}
df = pd.DataFrame(data)

 AND Operation(&):
filtered_df = df[(df['Age'] > 30) & (df['Score'] > 80)]
print(filtered_df)
Output:
Name Age Score
1 Bob 32 92
 OR Operation (|):
filtered_df_or = df[(df['Age'] > 30) | (df['Score'] > 80)]
print(filtered_df_or)
Output:
Name Age Score
0 Alice 25 85
1 Bob 32 92
2 Charlie 30 88
 NOT Operation (~):
filtered_df_not = df[~(df['Score'] > 80)]
print(filtered_df_not)
Output:
Name Age Score
3 David 29 75
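The same boolean logic can also be written as a query string with DataFrame.query(), which many find more readable for compound conditions. A small sketch, re-creating the DataFrame above so the snippet is self-contained:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 32, 30, 29],
    "Score": [85, 92, 88, 75],
})

# Equivalent to df[(df["Age"] > 30) & (df["Score"] > 80)]
result = df.query("Age > 30 and Score > 80")
print(result)  # only Bob's row
```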

DATAFRAME FILTER() METHOD


import pandas as pd
data = {
"name": ["Sally", "Mary", "John"],
"age": [50, 40, 30],
"qualified": [True, False, False]
}
df = pd.DataFrame(data)
newdf = df.filter(items=["name", "age"])
print(newdf)
Output:
name age
0 Sally 50
1 Mary 40
2 John 30

GROUPBY OPERATIONS ON DATAFRAME


Pandas groupby is used for grouping the data according to categories and applying a function to
each group.
Syntax: DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True,
group_keys=True, squeeze=False, **kwargs)
Parameters :
by : mapping, function, str, or iterable
axis : int, default 0
level : If the axis is a MultiIndex (hierarchical), group by a particular level or levels
as_index : For aggregated output, return object with group labels as the index. Only relevant for
DataFrame input. as_index=False is effectively “SQL-style” grouped output
sort : Sort group keys. Get better performance by turning this off. Note this does not influence the
order of observations within each group.
group_keys : When calling apply, add group keys to index to identify pieces
squeeze : Reduce the dimensionality of the return type if possible, otherwise return a consistent type
Returns : GroupBy object
Example:
import pandas as pd
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
'Max Speed': [380., 370., 24., 26.]})
print(df.groupby(['Animal']).mean())
Output:
Max Speed
Animal
Falcon 375.0
Parrot 25.0

Example:
import pandas as pd
l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
print(df)
print("Groupby operation")
print(df.groupby(by=["b"]).sum())
Output:
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
Groupby operation
a c
b
1.0 2 3
2.0 2 5
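Beyond a single aggregation such as mean() or sum(), the .agg() method applies several functions per group in one call. A short sketch reusing the Animal data from the first example above:

```python
import pandas as pd

df = pd.DataFrame({"Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
                   "Max Speed": [380.0, 370.0, 24.0, 26.0]})

# Several aggregations per group in one call;
# the result has one column per function name.
stats = df.groupby("Animal")["Max Speed"].agg(["mean", "min", "max"])
print(stats)
```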

PIVOT TABLES TO UNDERSTAND THE RELATIONSHIP


BETWEEN VARIABLES IN THE DATA WITH DIFFERENT
AGGREGATION
 A pivot table in Pandas is a quantitative table that summarizes a large DataFrame, such as a
large dataset. It is a component of data processing.
 In pivot tables, the report may include average, mode, summation, or other statistical
elements.
Syntax: pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc=’mean’,
fill_value=None, margins=False, dropna=True, margins_name=’All’)
Parameters:
o data : DataFrame
o values : column to aggregate, optional
o index: column, Grouper, array, or list of the previous
o columns: column, Grouper, array, or list of the previous
o aggfunc: function, list of functions, dict, default numpy.mean
o If list of functions passed, the resulting pivot table will have hierarchical columns
whose top level are the function names.
o If dict is passed, the key is column to aggregate and value is function or list of
functions
o fill_value[scalar, default None] : Value to replace missing values with
o margins[boolean, default False] : Add all row / columns (e.g. for subtotal / grand totals)
o dropna[boolean, default True] : Do not include columns whose entries are all NaN
o margins_name[string, default ‘All’] : Name of the row / column that will contain the totals
when margins is True.
Returns: DataFrame

Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['John', 'Boby', 'Mina', 'Peter', 'Nicky'],
'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
'C': [27, 23, 21, 23, 24]})
print(df)
Output:
A B C
0 John Masters 27
1 Boby Graduate 23
2 Mina Graduate 21
3 Peter Masters 23
4 Nicky Graduate 24

# Simplest pivot table must have a dataframe and an index/list of index.


table = pd.pivot_table(df, index =['A', 'B'])
print(table)
Output:
                   C
A     B
Boby  Graduate  23.0
John  Masters   27.0
Mina  Graduate  21.0
Nicky Graduate  24.0
Peter Masters   23.0

# Creates a pivot table dataframe


table = pd.pivot_table(df, values ='A', index =['B', 'C'], columns =['B'], aggfunc = np.sum)
print(table)
Output:
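To make the "different aggregation" idea concrete with verifiable numbers, here is a small sketch with an explicit aggfunc and margins=True, which adds an "All" row holding the grand total. The data values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"B": ["Masters", "Graduate", "Graduate", "Masters", "Graduate"],
                   "C": [27, 23, 21, 23, 24]})

# Mean of C per education level, plus an overall 'All' row
table = pd.pivot_table(df, values="C", index="B", aggfunc="mean", margins=True)
print(table)
```

Swapping aggfunc to "sum", "count", or a list such as ["mean", "max"] changes what relationship the table exposes, which is the point of pivoting with different aggregations.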
CROSSTAB TO UNDERSTAND THE RELATIONSHIP
BETWEEN VARIABLES IN THE DATA
crosstab() function: This function computes a frequency table of two (or more) variables,
summarizing the distribution of values in the data. Cross tabulation (crosstab) is an important tool
for analyzing two categorical variables in a dataset: it provides a tabular summary of their joint
frequency distribution, allowing us to see the relationship between them and identify any patterns
or trends.
Syntax: pandas.crosstab(index, columns, values=None, rownames=None, colnames=None,
aggfunc=None, margins=False, margins_name=’All’, dropna=True, normalize=False)
Arguments :
o index : array-like, Series, or list of arrays/Series, Values to group by in the rows.
o columns : array-like, Series, or list of arrays/Series, Values to group by in the columns.
o values : array-like, optional, array of values to aggregate according to the factors. Requires
`aggfunc` be specified.
o rownames : sequence, default None, If passed, must match number of row arrays passed.
o colnames : sequence, default None, If passed, must match number of column arrays passed.
o aggfunc : function, optional, If specified, requires `values` be specified as well.
o margins : bool, default False, Add row/column margins (subtotals).
o margins_name : str, default ‘All’, Name of the row/column that will contain the totals when
margins is True.
o dropna : bool, default True, Do not include columns whose entries are all NaN.
Example:
import pandas as pd
df = pd.DataFrame({
'gender': ['male', 'male', 'female', 'female', 'male', 'female', 'male', 'female'],
'education_level': ['high school', 'college', 'college', 'graduate', 'high school', 'graduate', 'college',
'graduate'],
'score': [75, 82, 88, 95, 69, 92, 78, 85]
})
ct = pd.crosstab(df['gender'], df['education_level'])
print(ct)
Output:
education_level college graduate high school
gender
female 1 3 0
male 2 0 2

Example:
import pandas
import numpy
a = numpy.array(["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "foo", "foo", "foo"],
dtype=object)
b = numpy.array(["one", "one", "one", "two", "one", "one", "one", "two", "two", "two", "one"],
dtype=object)
c = numpy.array(["dull", "dull", "shiny", "dull", "dull", "shiny", "shiny", "dull", "shiny",
"shiny", "shiny"], dtype=object)
pandas.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
Output:
b    one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2
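Crosstab can also report proportions instead of raw counts via the normalize argument, which makes groups of different sizes directly comparable. A sketch reusing the gender/education data from the first example (re-created here so the snippet is self-contained):

```python
import pandas as pd

gender = ["male", "male", "female", "female", "male", "female", "male", "female"]
level = ["high school", "college", "college", "graduate",
         "high school", "graduate", "college", "graduate"]

# normalize='index' turns each row of counts into row proportions
ct = pd.crosstab(pd.Series(gender, name="gender"),
                 pd.Series(level, name="education_level"),
                 normalize="index")
print(ct)
```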
HANDLING MISSING DATA
Missing data occurs when no information is provided for one or more items or for a whole unit.
Missing data is also referred to as NA (Not Available) values in pandas. For example, among users
being surveyed, some may choose not to share their income and others may not share their
address; in this way many datasets end up with missing values. Pandas provides these methods:
1. isnull()
2. notnull()
3. dropna()
4. fillna()
5. replace()
6. interpolate()

1. isnull():
import pandas as pd
import numpy as np
dict = {'First Score':[100, 90, np.nan, 95],'Second Score': [30, 45, 56, np.nan],'Third Score':[np.nan,
40, 80, 98]}
df = pd.DataFrame(dict)
df.isnull()
Output:
First Score Second Score Third Score
0 False False True
1 False False False
2 True False False
3 False True False

2. notnull():
df.notnull()
Output:
First Score Second Score Third Score
0 True True False
1 True True True
2 False True True
3 True False True

3. dropna():
df.dropna()
Output:
First Score  Second Score  Third Score
1 90.0 45.0 40.0

4. fillna():
df.fillna(0)
Output:
First Score  Second Score  Third Score
0 100.0 30.0 0.0
1 90.0 45.0 40.0
2 0.0 56.0 80.0
3 95.0 0.0 98.0

Filling null values with the previous ones:


df.fillna(method ='pad')
Output:
First Score  Second Score  Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 90.0 56.0 80.0
3 95.0 56.0 98.0
Filling null values with the next ones:
df.fillna(method ='bfill')
Output:
First Score  Second Score  Third Score
0 100.0 30.0 40.0
1 90.0 45.0 40.0
2 95.0 56.0 80.0
3 95.0 NaN 98.0

5. replace():
df.replace(to_replace = np.nan, value = -99)
Output:
First Score  Second Score  Third Score
0 100.0 30.0 -99.0
1 90.0 45.0 40.0
2 -99.0 56.0 80.0
3 95.0 -99.0 98.0

6. interpolate():
df.interpolate(method ='linear', limit_direction ='forward')
Output:
First Score  Second Score  Third Score
0 100.0 30.0 NaN
1 90.0 45.0 40.0
2 92.5 56.0 80.0
3 95.0 56.0 98.0
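Besides constant fills like fillna(0), a common practice is to fill each NaN with a statistic of its own column, such as the column mean. A minimal sketch (the scores are illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"First Score": [100, 90, np.nan, 95],
                   "Second Score": [30, 45, 56, np.nan]})

# df.mean() skips NaN, so each missing value is replaced
# by the mean of the observed values in its column
filled = df.fillna(df.mean())
print(filled)
```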

TIME SERIES
Time series data is a sequence of data points in chronological order that is used by businesses to
analyze past data and make future predictions. Common examples of time series data in our
day-to-day lives include:
 Measuring weather temperatures
 Measuring the number of taxi rides per month
 Predicting a company’s stock prices for the next day

Components of Time Series


1. Trend Component: This is a variation that moves up or down in a reasonably predictable
pattern over a long period.
2. Seasonality Component: is the variation that is regular and periodic and repeats itself over a
specific period such as a day, week, month, season, etc.,
3. Cyclical Component: is the variation that corresponds with business or economic 'boom-
bust' cycles or follows their own peculiar cycles
4. Random Component: is the variation that is erratic or residual and does not fall under any of
the above three classifications.

 DATE FUNCTIONALITY
Extending time series, date functionalities play a major role in financial data analysis, for example:
o Generating sequence of dates
o Convert the date series to different frequencies
o Create a Range of Dates
Using the date_range() function and specifying the periods and the frequency, we can create a date
series. By default, the frequency of the range is days.
import pandas as pd
print(pd.date_range('1/1/2011', periods=5))
Output:
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04', '2011-01-05'],
dtype='datetime64[ns]', freq='D')
o Change the Date Frequency
import pandas as pd
print(pd.date_range('1/1/2011', periods=5, freq='M'))
Output:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-30', '2011-05-31'],
dtype='datetime64[ns]', freq='M')
Example:
import pandas as pd
print(pd.bdate_range(start='1/1/2018', end='1/08/2018'))
Output:
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
'2018-01-05', '2018-01-08'],
dtype='datetime64[ns]', freq='B')

 Offset Aliases
Alias Description Alias Description
B business day frequency BQS business quarter start frequency
D calendar day frequency A annual(Year) end frequency
W weekly frequency BA business year end frequency
M month end frequency BAS business year start frequency
SM semi-month end frequency BH business hour frequency
BM business month end frequency H hourly frequency
MS month start frequency T, min minutely frequency
SMS SMS semi month start frequency S secondly frequency
BMS business month start frequency L, ms milliseconds
Q quarter end frequency U, us microseconds
BQ business quarter end frequency N nanoseconds
QS quarter start frequency
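A short sketch using two of the aliases above, 'MS' (month start) and 'B' (business day); the start date is illustrative. Note that 1 January 2011 falls on a Saturday, so the business-day range begins on Monday the 3rd:

```python
import pandas as pd

# Month-start frequency: 1 Jan, 1 Feb, 1 Mar
monthly = pd.date_range("2011-01-01", periods=3, freq="MS")

# Business-day frequency skips the weekend at the start
bdays = pd.date_range("2011-01-01", periods=3, freq="B")

print(monthly)
print(bdays)
```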

TIME DELTA
Timedelta is a subclass of datetime.timedelta.
Example:
import pandas as pd
td = pd.Timedelta('3 days 06:05:01.000000111')
print(td)
print(td.seconds)
Output:
3 days 06:05:01.000000
21901

 Manipulating and Accessing Components of Pandas Timedelta


import pandas as pd
td = pd.Timedelta('7 days 15 min 3 s')
print(td)
print(td.seconds)
Output:
7 days 00:15:03
903

 Creating and Accessing Pandas Timedelta with Specified Units


import pandas as pd
import datetime
td = pd.Timedelta(133, unit='s')
print(td)
print(td.seconds)
Output:
0 days 00:02:13
133
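Timedeltas also support arithmetic with Timestamps and scalars, which is what makes them useful for date math. A small sketch (the dates are illustrative):

```python
import pandas as pd

start = pd.Timestamp("2018-01-01 09:00")
td = pd.Timedelta("1 days 2 hours")

# Shift a timestamp forward by a timedelta
print(start + td)  # 2018-01-02 11:00:00

# Scale a timedelta, and get its full length in seconds
print(td * 2)
print(td.total_seconds())  # 93600.0
```

Note the difference from td.seconds used above: .seconds is only the seconds component within the day, while .total_seconds() converts the whole duration.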

VECTORIZATION CONCEPT IMPLEMENTATION USING


PANDAS
Vectorization is used to speed up Python code by replacing explicit loops with whole-array
operations. Using such functions helps minimize the running time of code.
Classic loop-based methods are more time consuming than the standard functions, as can be
seen by comparing their processing times below.
 outer(a, b): Compute the outer product of two vectors.
 multiply(a, b): Matrix product of two arrays.
 dot(a, b): Dot product of two arrays.
 zeros((n, m)): Return a matrix of given shape and type, filled with zeros.
 process_time(): Return the value (in fractional seconds) of the sum of the system and user
CPU time of the current process. It does not include time elapsed during sleep.

Dot Product:
Dot product is an algebraic operation in which two equal length vectors are being multiplied such that
it produces a single number. Dot Product often called as inner product. This product results in a
scalar number.
Example:
import time
import numpy
import array
# 8 bytes size int
a = array.array('q')
for i in range(100000):
    a.append(i)
b = array.array('q')
for i in range(100000, 200000):
    b.append(i)
# classic dot product of vectors implementation
tic = time.process_time()
dot = 0.0
for i in range(len(a)):
    dot += a[i] * b[i]
toc = time.process_time()
print("dot_product = " + str(dot))
print("Computation time = " + str(1000*(toc - tic)) + "ms")

n_tic = time.process_time()
n_dot_product = numpy.dot(a, b)
n_toc = time.process_time()
print("\nn_dot_product = "+str(n_dot_product))
print("Computation time = "+str(1000*(n_toc - n_tic ))+"ms")
Output:
dot_product = 833323333350000.0
Computation time = 61.22865300000058ms

n_dot_product = 833323333350000
Computation time = 2.8878020000000504ms

Outer Product:
The tensor product of two coordinate vectors is termed as Outer product. Let’s consider two vectors a
and b with dimension n x 1 and m x 1 then the outer product of the vector results in a rectangular
matrix of n x m. If two vectors have same dimension then the resultant matrix will be a square matrix
as shown in the figure.

Example:
import time
import numpy
import array
a = array.array('i')
for i in range(200):
    a.append(i)
b = array.array('i')
for i in range(200, 400):
    b.append(i)
# classic nested-loop outer product
tic = time.process_time()
outer_product = numpy.zeros((200, 200))
for i in range(len(a)):
    for j in range(len(b)):
        outer_product[i][j] = a[i] * b[j]
toc = time.process_time()
print("outer_product = " + str(outer_product))
print("Computation time = " + str(1000*(toc - tic)) + "ms")
# vectorized outer product
n_tic = time.process_time()
outer_product = numpy.outer(a, b)
n_toc = time.process_time()
print("outer_product = " + str(outer_product))
print("\nComputation time = " + str(1000*(n_toc - n_tic)) + "ms")
Output:

Element wise Product:


Element-wise multiplication of two matrices is the algebraic operation in which each element of first
matrix is multiplied by its corresponding element in the later matrix. Dimension of the matrices
should be same.

Example:
import time
import numpy
import array
a = array.array('i')
for i in range(50000):
    a.append(i)
b = array.array('i')
for i in range(50000, 100000):
    b.append(i)
vector = numpy.zeros((50000))
# classic element-by-element loop
tic = time.process_time()
for i in range(len(a)):
    vector[i] = a[i] * b[i]
toc = time.process_time()
print("Element wise Product = " + str(vector))
print("\nComputation time = " + str(1000*(toc - tic)) + "ms")
# vectorized element-wise product
n_tic = time.process_time()
vector = numpy.multiply(a, b)
n_toc = time.process_time()
print("Element wise Product = " + str(vector))
print("\nComputation time = " + str(1000*(n_toc - n_tic)) + "ms")
Output:
Element wise Product = [0.00000000e+00 5.00010000e+04 1.00004000e+05 ... 4.99955001e+09
4.99970000e+09 4.99985000e+09]
Computation time = 37.37993300000042ms
Element wise Product = [ 0 50001 100004 ... 704582713 704732708 704882705]
Computation time = 0.3640780000004895ms
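The examples above use NumPy arrays directly; the same vectorization idea applies to pandas columns, where one whole-column expression replaces an element-by-element loop. A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5, 10)})

# Loop version: element by element (slow on large frames)
loop_result = [x * y for x, y in zip(df["a"], df["b"])]

# Vectorized version: a single whole-column operation
df["product"] = df["a"] * df["b"]

print(df["product"].tolist())  # [0, 6, 14, 24, 36]
```

The two versions compute the same values, but the column expression is executed in compiled code rather than the Python interpreter, which is where the speedup comes from.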

I/O TOOLS OF PANDAS


Pandas has a number of functions for reading table data as DataFrame objects, including
Function               Description
pandas.read_csv        loads CSV data from a file, URL, or file-like object; usually a comma is used as separator
pandas.read_fwf        loads fwf (fixed-width formatted) data, i.e. data in column format with a fixed width
pandas.read_clipboard  reads data from the clipboard and passes it to read_csv; useful for converting tables from web pages, among other things
pandas.read_excel      reads table data from an Excel XLS or XLSX file
pandas.read_hdf        reads HDF5 files
pandas.read_html       reads all tables from the specified HTML document
pandas.read_json       reads data from a JSON file
pandas.read_pickle     reads any object stored in Python pickle format
pandas.read_sql        reads the results of an SQL query (with SQLAlchemy) as a pandas DataFrame
pandas.read_sql_table  reads an entire SQL table (with SQLAlchemy) as a pandas DataFrame (corresponds to a query that selects everything in this table with read_sql)

Example:
import pandas as pd
df=pd.read_csv("temp.csv")
print(df)
Output:
S.No Name Age City Salary
0 1 Tom 28 Toronto 20000
1 2 Lee 32 HongKong 3000
2 3 Steven 43 Bay Area 8300
3 4 Ram 38 Hyderabad 3900

custom index: This specifies a column in the csv file to customize the index using index_col.
import pandas as pd
df=pd.read_csv("temp.csv",index_col=['S.No'])
print(df)
Output:
S.No Name Age City Salary
1 Tom 28 Toronto 20000
2 Lee 32 HongKong 3000
3 Steven 43 Bay Area 8300
4 Ram 38 Hyderabad 3900

Converters: dtype of the columns can be passed as a dict.


import pandas as pd
import numpy as np
df = pd.read_csv("temp.csv", dtype={'Salary': np.float64})
print(df.dtypes)
Output:
S.No int64
Name object
Age int64
City object
Salary float64

header_names: Specify the names of the header using the names argument.
import pandas as pd
df=pd.read_csv("temp.csv", names=['a', 'b', 'c','d','e'])
print(df)
Output:
a b c d e
0 S.No Name Age City Salary
1 1 Tom 28 Toronto 20000
2 2 Lee 32 HongKong 3000
3 3 Steven 43 Bay Area 8300
4 4 Ram 38 Hyderabad 3900

Skiprows: skiprows skips the number of rows specified.


import pandas as pd
df=pd.read_csv("temp.csv", skiprows=2)
print(df)
Output:
2 Lee 32 HongKong 3000
0 3 Steven 43 Bay Area 8300
1 4 Ram 38 Hyderabad 3900
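Each reader has a writing counterpart (to_csv, to_excel, to_json, and so on). A minimal round-trip sketch with to_csv; the file name and data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Tom", "Lee"], "Age": [28, 32]})

# index=False omits the row index from the file,
# so reading it back reproduces the original frame
df.to_csv("out.csv", index=False)

back = pd.read_csv("out.csv")
print(back)
```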

INDEXING
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame.
Indexing could mean selecting all the rows and some of the columns, some of the rows and all of the
columns, or some of each of the rows and columns. Indexing can also be known as Subset Selection.
Pandas supports four types of multi-axes indexing:
1. Dataframe.[ ] : This function is also known as the indexing operator
2. Dataframe.loc[ ] : This function is used for labels.
3. Dataframe.iloc[ ] : This function is used for positions or integer based
4. Dataframe.ix[ ] : This function is used for both label and integer based (deprecated and
removed in recent pandas versions; use .loc or .iloc instead)
Collectively, they are called the indexers.

1. Indexing a Dataframe using indexing operator []:


 Selecting a single columns
import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
first = data["Age"]
print(first)
Output:
 Selecting multiple columns
import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
first = data[["Age", "College", "Salary"]]
print(first)
Output:

2. Indexing a DataFrame using .loc[ ]


 Selecting a single row
import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
first = data.loc["Avery Bradley"]
second = data.loc["R.J. Hunter"]
print(first, "\n\n\n", second)
Output:

 Selecting multiple rows


import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
first = data.loc[["Avery Bradley", "R.J. Hunter"]]
print(first)
Output:

 Selecting two rows and three columns


import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
first = data.loc[["Avery Bradley", "R.J. Hunter"],
["Team", "Number", "Position"]]
print(first)
Output:

 Selecting all of the rows and some columns


import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
first = data.loc[:, ["Team", "Number", "Position"]]
print(first)
Output:

3. Indexing a DataFrame using .iloc[ ] :


 Selecting a single row
import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
row2 = data.iloc[3]
print(row2)
Output:

 Selecting multiple rows


import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
row2 = data.iloc [[3, 5, 7]]
print(row2)
Output:

 Selecting two rows and two columns


import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
row2 = data.iloc [[3, 4], [1, 2]]
print(row2)
Output:
 Selecting all the rows and a some columns
import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
row2 = data.iloc [:, [1, 2]]
print(row2)
Output:

4. Indexing a using Dataframe.ix[ ] :


 Selecting a single row using .ix[] as .loc[]
import pandas as pd
data = pd.read_csv("nba.csv", index_col ="Name")
first = data.ix["Avery Bradley"]
print(first)
Output:

MULTI INDEXING CONCEPTS


A multi-index (hierarchical index) allows a DataFrame or Series to carry more than one level of
row or column labels, so a single axis can represent several variables at once.
 Example: Creating multi-index from arrays
import pandas as pd
arrays = ['Sohom','Suresh','kumkum','subrata']
age= [10, 11, 12, 13]
marks=[90,92,23,64]
multi_index = pd.MultiIndex.from_arrays([arrays,age,marks], names=('names', 'age','marks'))
print(multi_index)
Output:
MultiIndex([( 'Sohom', 10, 90),
( 'Suresh', 11, 92),
( 'kumkum', 12, 23),
('subrata', 13, 64)],
names=['names', 'age', 'marks'])

 Example: Creating multi-index from DataFrame using Pandas.


import pandas as pd
dict = {'name': ["Saikat", "Shrestha", "Sandi", "Abinash"],
'Jobs': ["Software Developer", "System Engineer","Footballer", "Singer"],
'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10]}
df = pd.DataFrame(dict)
print(df)
Output:

pd.MultiIndex.from_frame(df)
Output:
MultiIndex([( 'Saikat', 'Software Developer', 12.4),
('Shrestha', 'System Engineer', 5.6),
( 'Sandi', 'Footballer', 9.3),
( 'Abinash', 'Singer', 10.0)],
names=['name', 'Jobs', 'Annual Salary(L.P.A)'])

 Example: Using DataFrame.set_index([col1,col2,..])


import pandas as pd
data = {
'series': ['Peaky blinders', 'Sherlock', 'The crown', 'Queens Gambit', 'Friends'],
'Ratings': [4.5, 5, 3.9, 4.2, 5],
'Date': [2013, 2010, 2016, 2020, 1994]
}
df = pd.DataFrame(data)
df.set_index(["series", "Ratings"], inplace=True, append=True, drop=False)
print(df)
Output:

print(df.index)
Output:
MultiIndex([(0, 'Peaky blinders', 4.5),
(1, 'Sherlock', 5.0),
(2, 'The crown', 3.9),
(3, 'Queens Gambit', 4.2),
(4, 'Friends', 5.0)],
names=[None, 'series', 'Ratings'])
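Once a MultiIndex is in place, rows can be selected at one or more levels with .loc[] or .xs(). A short sketch with a hypothetical two-level index (the subject/grade data is made up for illustration):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Math", "A"), ("Math", "B"), ("Science", "A")],
    names=["subject", "grade"],
)
df = pd.DataFrame({"count": [10, 5, 7]}, index=idx)

math_rows = df.loc["Math"]           # all rows under the outer level "Math"
one_row = df.loc[("Math", "B")]      # a single row, addressed by a full tuple
grade_a = df.xs("A", level="grade")  # cross-section on the inner level
print(grade_a)
```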

DATA HANDLING

1. CATEGORICAL DATA
Categorical data is a set of predefined categories or groups an observation can fall into.
pandas.Categorical(val, categories=None, ordered=None, dtype=None) : represents a categorical
variable. Categoricals are a pandas data type that corresponds to the categorical variables in
statistics. Such variables take on a fixed and limited number of possible values, for example
grades, gender, blood group type etc.
Example:
import pandas as pd
data = ['red', 'blue', 'green', 'red', 'blue']
categorical_data = pd.Categorical(data)
print(categorical_data)
Output:
['red', 'blue', 'green', 'red', 'blue']
Categories (3, object): ['blue', 'green', 'red']

Convert Pandas Series to Categorical Series


Example:
import pandas as pd
data = ['red', 'blue', 'green', 'red', 'blue']
series1 = pd.Series(data)
categorical_s = series1.astype('category')
print(categorical_s)
Output:
0 red
1 blue
2 green
3 red
4 blue
dtype: category
Categories (3, object): ['blue', 'green', 'red']
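Categoricals can also be ordered, which makes comparisons and sorting follow the category order rather than alphabetical order. A brief sketch (the size data is a made-up example):

```python
import pandas as pd

# Ordered categorical: small < medium < large
sizes = pd.Series(["small", "large", "medium", "small"]).astype(
    pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
)

print(sizes.cat.categories)         # Index(['small', 'medium', 'large'], ...)
# Comparisons respect the declared category order, not string order
print((sizes > "small").tolist())   # [False, True, True, False]
```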

2. INTEGER DATA
Pandas provides nullable integer dtypes (e.g. "Int64", with a capital I) that can store missing
values as <NA>, unlike NumPy's int64, which cannot hold NaN.
import pandas as pd
s = pd.Series([1, 2, None], dtype="Int64")
s_plus_one = s + 1 # Adds 1 to each element in the series
comparison = s == 1 # Checks if each element in the series is equal to 1
print(s_plus_one)
print(comparison)
Output:
0 2
1 3
2 <NA>
dtype: Int64
0 True
1 False
2 <NA>
dtype: boolean
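A common use of the nullable dtype is converting a float column that holds only whole numbers (plus missing values) back to integers; a minimal sketch:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, None])  # float64, because the missing value forces a float dtype
ints = s.astype("Int64")         # nullable integer dtype keeps the missing value as <NA>

print(ints.dtype)                # Int64
print(ints.isna().tolist())      # [False, False, True]
```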

COMPUTATIONAL TOOLS
Pandas ships a set of computational tools, such as statistical functions and windowing operations, for analysing data.

 STATISTICAL FUNCTIONS
 Percent Change
import pandas as pd
import numpy as np
ser = pd.Series(np.random.randn(8))
ser.pct_change()
Output:
0 NaN
1 -1.602976
2 4.334938
3 -0.247456
4 -2.067345
5 -1.142903
6 -1.688214
7 -9.759729
dtype: float64
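Because the example above uses random data, a deterministic sketch makes the computation clearer: pct_change() returns (current - previous) / previous for each element, with NaN for the first one.

```python
import pandas as pd

s = pd.Series([10, 20, 30])
changes = s.pct_change()
print(changes.tolist())  # [nan, 1.0, 0.5]
```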

 Covariance
s1 = pd.Series(np.random.randn(1000))
s2 = pd.Series(np.random.randn(1000))
s1.cov(s2)
Output: 0.00068010881743109993

 Correlation
Method name Description
pearson (default) Standard correlation coefficient
kendall Kendall Tau correlation coefficient
spearman Spearman rank correlation coefficient

frame = pd.DataFrame(np.random.randn(1000, 5), columns=['a', 'b', 'c', 'd', 'e'])


frame.iloc[::2] = np.nan  # .ix is removed in modern pandas; use .iloc
frame['a'].corr(frame['b'])
Output: 0.013479040400098801

frame['a'].corr(frame['b'], method='spearman')
Output: -0.0072898851595406388

 Data ranking
s = pd.Series(np.random.randn(5), index=list('abcde'))
s['d'] = s['b'] # so there's a tie
s.rank()
Output:
a 5.0
b 2.5
c 1.0
d 2.5
e 4.0
dtype: float64
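By default, rank() assigns tied values the average of their ranks (hence the 2.5 above); the method parameter selects other tie-breaking rules. A short deterministic sketch:

```python
import pandas as pd

s = pd.Series([7, 3, 3, 9])
print(s.rank().tolist())                # average of tied ranks: [3.0, 1.5, 1.5, 4.0]
print(s.rank(method="min").tolist())    # ties share the lowest rank: [3.0, 1.0, 1.0, 4.0]
print(s.rank(method="dense").tolist())  # like "min" but with no gaps: [2.0, 1.0, 1.0, 3.0]
```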

 Calculate Statistics
import pandas as pd
dataset = pd.read_csv('train.csv')
dataset.head()
Output:
a. Mean:
mean = dataset['Age'].mean()
print(mean)
Output: 29.69911764705882

b. Median
median = dataset['Fare'].median()
print(median)
Output: 14.4542

c. Mode:
mode = dataset['Sex'].mode()
print(mode)
Output: 0 male

d. Count:
count = dataset['Ticket'].count()
print(count)
Output: 891

e. Standard Deviation
std = dataset['Fare'].std()
print(std)
Output: 49.693428597180905

f. Max:
maxValue = dataset['Age'].max()
print(maxValue)
Output: 80.0

g. Min:
minValue = dataset['Fare'].min()
print(minValue)
Output: 0.0

h. Describe:
dataset.describe()

Output:
 WINDOWING OPERATIONS
Pandas contains a compact set of APIs for performing windowing operations, i.e. operations that
perform an aggregation over a sliding partition of values.
s = pd.Series(range(5))
s.rolling(window=2).sum()
Output:
0 NaN
1 1.0
2 3.0
3 5.0
4 7.0
dtype: float64
for window in s.rolling(window=2):
    print(window)

Output:
0 0
dtype: int64
0 0
1 1
dtype: int64
1 1
2 2
dtype: int64
2 2
3 3
dtype: int64
3 3
4 4
dtype: int64

Pandas supports 4 types of windowing operations:


1. Rolling window: Generic fixed or variable sliding window over the values.
2. Weighted window: Weighted, non-rectangular window supplied by the scipy.signal library.
3. Expanding window: Accumulating window over the values.
4. Exponentially Weighted window: Accumulating and exponentially weighted window over
the values.

1. Rolling window
s = pd.Series(range(5), index=pd.date_range('2020-01-01', periods=5, freq='1D'))
s.rolling(window='2D').sum()
Output:
2020-01-01 0.0
2020-01-02 1.0
2020-01-03 3.0
2020-01-04 5.0
2020-01-05 7.0
Freq: D, dtype: float64

2. Expanding window
df = pd.DataFrame({'A': ['a', 'b', 'a', 'b', 'a'], 'B': range(5)})
df.groupby('A').expanding().sum()
Output:
B
A
a 0 0.0
2 2.0
4 6.0
b 1 1.0
3 4.0

3. Weighted window
Note: a true weighted (non-rectangular) window is created with rolling(win_type=...) using a
scipy.signal window type; the example below instead uses a method="table" rolling apply, which
requires the numba package.
def weighted_mean(x):
    arr = np.ones((1, x.shape[1]))
    arr[:, :2] = (x[:, :2] * x[:, 2]).sum(axis=0) / x[:, 2].sum()
    return arr

df = pd.DataFrame([[1, 2, 0.6], [2, 3, 0.4], [3, 4, 0.2], [4, 5, 0.7]])
df.rolling(2, method="table", min_periods=0).apply(weighted_mean, raw=True, engine="numba")
Output:
0 1 2
0 1.000000 2.000000 1.0
1 1.800000 2.000000 1.0
2 3.333333 2.333333 1.0
3 1.555556 7.000000 1.0

4. Exponentially Weighted window:


df = pd.DataFrame([[1, 2, 0.6], [2, 3, 0.4], [3, 4, 0.2], [4, 5, 0.7]])
df.ewm(0.5).mean()
Output:
0 1 2
0 1.000000 2.000000 0.600000
1 1.750000 2.750000 0.450000
2 2.615385 3.615385 0.276923
3 3.550000 4.550000 0.562500

CHART AND TABLE VISUALIZATION IN PANDAS


1. Basic Line Plot
2. Bar Plot
3. Area Plot
4. Density Plot- Kernel Density Estimation (KDE)
5. Histogram Plot
6. Scatter Plot
7. Box Plots
8. Hexagonal Bin Plots

Examples are in Unit-5
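As a quick preview of the chart types listed above, each is produced with DataFrame.plot via the kind parameter. A minimal sketch with made-up data (requires matplotlib; the Agg backend is used so no display window is needed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts/servers

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 15, 25]})
# kind can be "line", "bar", "area", "kde", "hist", "scatter", "box", "hexbin"
ax = df.plot(kind="line", x="x", y="y", title="Basic Line Plot")
ax.get_figure().savefig("line_plot.png")  # save the chart to a file
```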
