
21AIM44 - DATA SCIENCE

MODULE 1
Basic Concepts: Predictive Modelling, Data Preparation, Importance of Data Preparation, Data Cleaning,
Feature Selection, Data Transforms, Feature Engineering, Dimensionality Reduction, K-fold Cross-Validation,
Data Leakage and avoidance measures.
Python Packages: NumPy, Matplotlib, Pandas, SciPy, Scikit-learn, DataFrame, Loading Machine Learning data.
What is Data Science?
Data Science
 Data Science is a field that derives insights from structured and unstructured data using
different scientific methods and algorithms, and consequently helps in generating
insights, making predictions, and devising data-driven solutions.

 It is the process of extracting information from structured and unstructured data using
data mining techniques (i.e., getting information from raw data).

 It involves extracting the necessary information from the enormous volumes of data behind
products and services in order to build better products, guide development, and more.

 It uses a large amount of data to get meaningful insights using statistics and computation
for decision-making.
Data Science Examples
It helps in understanding what customers would love to purchase or eat
based on their previous order history.

Data Science also helps in making future predictions.


 For example, the airlines can predict the prices for their flights according to the customers’
previous booking history.

Data Science also helps in getting recommendations.


 As an example, Netflix can give recommendations based on the previous browsing history
of videos and ratings given by users to the videos.
Life Cycle of Data Science
 Business Understanding - A well-defined problem statement defines a specific goal and is the key to the
success of the project.

 Data Collection - The data needs to be relevant so that it can solve the business problem correctly

 Data Preparation - It helps in cleaning and bringing the data into shape, which is required for further
analysis and modeling

 Exploratory Data Analysis – Data is analyzed using summary statistics and graphically to understand
key patterns.
 Model Building – There are two types of data modeling, i.e.,
 Descriptive analytics, which involves insights based on historical data, and
 Predictive modeling, which involves future predictions.

 Model Deployment and Maintenance - Once the model is built, it is ready to be deployed in the real world. The
deployment can occur offline, on the web, in the cloud, or in an Android or iOS app.
Who Exactly is a Data Scientist?
 Statistics – To find hidden patterns in data and correlations between different features
in the data.

 Machine Learning – Different algorithms for building a model.

 Computer Science – Software engineering, database systems, Artificial Intelligence,
and numerical analysis.

 Programming – Python or R, and SQL.

 Analytical Thinking – Think analytically to solve business problems.

 Critical Thinking – Have the critical thinking ability to analyze the facts before
concluding.

 Interpersonal Skills – Must have excellent communication skills.

 Business Intuition – Must be able to communicate with clients to understand the
problems.
Tools a Data Scientist Uses
 Python: Python is a versatile programming language that is used most by Data Scientists. Its most important application is
in the field of Machine Learning. It has many libraries that make it perfect for handling Data Science related work.

 R Programming: R is one of the essential statistical programming tools, which is mainly used by Data Scientists to
perform a detailed analysis of large data to find insights.

 SQL: It is also a valuable tool used by a Data Scientist. It helps them in working on DBMS and structured data. A Data
Engineer also uses this tool.

 Tableau / PowerBI: This is a top-rated data visualization tool among Data Scientists because of its amazing reporting
capabilities. This tool makes it simple to visualize the data and show the results to clients.

 Hadoop: It is an open-source and powerful tool for distributed storage and processing of large datasets, used by many Data Scientists.

 SAS [Statistical Analysis System]: SAS is an advanced tool for analysis, which many data analysts use. It has many
powerful features, such as analyzing, extracting, and reporting, which make it a popular tool. Also, it has a great GUI that
anyone can use easily, and Data Scientists use it to convert the data into business insights.
Computer Programming Languages used for Data Science
AI vs ML vs DL
Artificial Intelligence
Artificial intelligence refers to the simulation of human intelligence in machines.

Artificial intelligence (AI) is the ability of a digital computer or computer-controlled
robot to perform tasks commonly associated with intelligent beings.

The goals of artificial intelligence include learning, reasoning, and perception.

Artificial intelligence (AI), also known as machine intelligence, is a branch of
computer science that focuses on building and managing technology that can
learn to autonomously make decisions and carry out actions on behalf of a
human being.
Artificial Intelligence
Examples of Artificial Intelligence in use today
Siri
 Siri is one of the most popular personal assistants offered by Apple on the iPhone and
iPad.
 The friendly female voice-activated assistant interacts with the user on a daily routine.

 She assists us in finding information, getting directions, sending messages, making voice calls,
opening applications, and adding events to the calendar.

Tesla
 Not only smartphones but automobiles are also shifting towards Artificial Intelligence.

 The car has not only achieved many accolades but also offers features like self-driving,
predictive capabilities, and cutting-edge technological innovation.
Machine Learning
 Machine Learning is the method of working with data and automating tasks by training a model so
that it gives new suggestions and makes detections when a similar type of data is provided to it.

 It comes under the AI umbrella; it identifies patterns in data and makes decisions faster,
saving time and human effort so that there is less need for human intervention.

 The steps involved in building a machine learning model include the following:
 Gather training data.
 Prepare data for training.
 Decide which learning algorithm to use.
 Train the learning algorithm.
 Evaluate the learning algorithm’s outputs.
 If necessary, adjust the variables (hyperparameters) that govern the training process in order to improve output.
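
A minimal end-to-end sketch of these steps using scikit-learn is shown below. The dataset (Iris), the k-nearest neighbours model, and the hyperparameter values are illustrative assumptions, not part of the slides.
Example code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Gather training data (here, the bundled Iris dataset).
X, y = load_iris(return_X_y=True)

# 2. Prepare data for training: hold out a test set and scale the features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Decide which learning algorithm to use (k-nearest neighbours here).
model = KNeighborsClassifier(n_neighbors=5)

# 4. Train the learning algorithm.
model.fit(X_train, y_train)

# 5. Evaluate the learning algorithm's outputs.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. If necessary, adjust hyperparameters (e.g. n_neighbors) and repeat.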
Machine Learning
Deep Learning
 Deep learning is a type of machine learning and artificial intelligence (AI) that imitates
the way humans gain certain types of knowledge.

 Deep learning is an important element of data science, which includes statistics and
predictive modeling.

 It is extremely beneficial to data scientists who are tasked with collecting, analyzing, and
interpreting large amounts of data; deep learning makes this process faster and easier.

 Deep Learning is a subset of Machine Learning.

 In Machine Learning, features are provided manually.

 Whereas in Deep Learning, features are learned directly from the data.


Machine Learning vs Deep Learning
Convolutional Neural Network
Recurrent Neural Network
Deep Learning

Corgi or loaf of bread image recognition example. Source: https://imgur.com/gallery/9IqWNIw


Deep Learning
AI vs ML vs DL vs DS

Data Science —
Scientific methods, algorithms and systems to extract knowledge or insights from
big data
•Also known as Predictive or Advanced Analytics
•Algorithmic and computational techniques and tools for handling large data sets
•Increasingly focused on preparing and modeling data for ML & DL tasks
•Encompasses statistical methods, data manipulation and streaming technologies
(e.g. Spark, Hadoop)
•Key skills and tools behind building modern AI technologies
AI vs ML vs DL
Predictive Modelling
• Predictive modeling is a commonly used statistical technique to predict
future behavior.
• Predictive modeling solutions are a form of data-mining technology that
works by analyzing historical and current data and generating a model
to help predict future outcomes.
• Data is collected, a statistical model is formulated, predictions are
made, and the model is validated (or revised) as additional data becomes
available.
• Predictive models analyze past performance to assess how likely a
customer is to exhibit a specific behavior in the future
Predictive Modelling
Predictive modeling is the process of using known results to create, process, and
validate a model that can be used to forecast future outcomes.

It is a tool used in predictive analytics, a data mining technique that attempts to
answer the question “what might possibly happen in the future?”

Predictive modeling is the use of data and statistics to predict outcomes with data
models.

This prediction finds its utility in almost all areas, from sports and TV ratings to
corporate earnings and technological advances.

Predictive modeling is also called predictive analytics.


Goal of Predictive Modelling
The goal of predictive modeling is to develop a model that makes accurate
predictions on new data unseen during training, which is a hard problem.

This is hard because we cannot evaluate the model on data we do not have.

Therefore, we must estimate the performance of the model on unseen data
by training it on only some of the data we have and evaluating it on the rest
of the data.

This is the principle that underlies cross-validation and more sophisticated
techniques that try to reduce the variance in this estimate.
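
A small sketch of this idea: train on only part of the data and evaluate on the held-out rest. The synthetic data and the logistic regression model below are assumptions made purely for illustration.
Example code (illustrative sketch):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # synthetic inputs (assumption)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic target (assumption)

# Train on only some of the data we have...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
model = LogisticRegression().fit(X_train, y_train)

# ...and evaluate on the rest to estimate performance on unseen data.
print("Estimated accuracy on unseen data:", accuracy_score(y_test, model.predict(X_test)))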
Examples
• Health care industry to improve diagnostic practices and properly treat
terminal or chronically ill patients.
• A restaurant estimating the amount of supplies to order may assign
factors such as nearby events and upcoming holidays to this model.
• Human resources departments and companies may use them to hire
employees.
• Banks to detect fraud
Data Preparation
 Data preparation is the process of transforming raw data into a form that is more appropriate for
modeling.

 Data preparation is the process of cleaning and transforming raw data prior to processing and
analysis.

 Key steps include collecting, cleaning, and labeling raw data into a form suitable for algorithms
and for visualisation

 The process of applied machine learning consists of a sequence of steps.


 Step 1: Define Problem.

 Step 2: Prepare Data.

 Step 3: Evaluate Models.

 Step 4: Finalize Model.


Step 1: Define Problem

Learning enough about the project to select the framing or framings of
the prediction task, for example, whether it is classification or regression.

It involves collecting the data that is believed to be useful in making a
prediction and clearly defining the form that the prediction will take.

It also involves taking a close look at the data, and exploring the data using summary
statistics and data visualisation.
Step 2: Prepare Data

This step is concerned with transforming the raw data that was
collected into a form that can be used in modeling.

Data pre-processing techniques generally refer to the addition,
deletion, or transformation of training set data.
Step 3: Evaluate Models
This step is concerned with evaluating machine learning models on
your dataset.

This involves tasks such as selecting a performance metric for evaluating
the skill of a model, establishing a baseline or floor in performance to
which all model evaluations can be compared, and a resampling technique
for splitting the data into training and test sets to simulate how the final
model will be used.
Step 4: Finalize Model

Once a suite of models has been evaluated, you must choose a model
that represents the solution to the project.

This is called model selection and may involve further evaluation of
candidate models on a hold-out validation dataset, or selection via
other project-specific criteria such as model complexity.
What Is Data Preparation

On a predictive modeling project, such as classification or regression,
raw data typically cannot be used directly.

This is because of reasons such as:

 Machine learning algorithms require data to be numbers.
 Some machine learning algorithms impose requirements on the data.
 Statistical noise and errors in the data may need to be corrected.
 Complex nonlinear relationships may be teased out of the data.

What Is Data Preparation

As such, the raw data must be pre-processed prior to being used to fit
and evaluate a machine learning model.

This step in a predictive modeling project is referred to as data
preparation, although it goes by many other names, such as data
wrangling, data cleaning, data pre-processing, and feature engineering.

We can define data preparation as the transformation of raw data into
a form that is more suitable for modeling.
Data Preparation Purpose
 The purpose of data preparation is to ensure that the raw data being readied for processing and analysis is
accurate and consistent, so that the results of analytics applications will be valid.

 Data is commonly created with missing values, inaccuracies, or other errors, and separate
data sets often have different formats that need to be reconciled when they're combined.

 Correcting data errors, validating data quality, and consolidating data sets are big parts of
data preparation projects.

 Data preparation also involves finding relevant data to ensure that analytics applications
deliver meaningful information and actionable insights for business decision-making.
Challenges of data preparation
 Inadequate or nonexistent data profiling
 If data isn't properly profiled, errors, anomalies, and other problems might not be identified, which can result in
flawed analytics.

 Missing or incomplete data


 Data sets often have missing values and other forms of incomplete data; such issues need to be assessed as
possible errors and addressed if so. Ex: Missing Year or Month in DoB

 Invalid data values


 Misspellings, other typos, and wrong numbers are examples of invalid entries that frequently occur in data and
must be fixed to ensure analytics accuracy.

 Name and address standardization


 Names and addresses may be inconsistent in data from different systems, with variations that can affect views of
customers and other entities.
Challenges of data preparation
Inconsistent data across enterprise systems
 Other inconsistencies in data sets drawn from multiple source systems, such as different
terminology and unique identifiers, are also a pervasive issue in data preparation efforts.

Data enrichment
 Deciding how to enrich a data set -- for example, what to add to it -- is a complex task
that requires a strong understanding of business needs and analytics goals.

Maintaining and expanding data prep processes


 Data preparation work often becomes a recurring process that needs to be sustained and
enhanced on an ongoing basis.
Data Preparation Tasks
There are common or standard tasks that you may use or explore during the
data preparation step in a machine learning project.

These tasks include:


 Data Cleaning: Identifying and correcting mistakes or errors in the data.

 Feature Selection: Identifying those input variables that are most relevant to the
task.
 Data Transforms: Changing the scale or distribution of variables.

 Feature Engineering: Deriving new variables from available data.

 Dimensionality Reduction: Creating compact projections of the data.


Steps in the data preparation process
 Data collection:
 Relevant data is gathered from operational systems, data warehouses, data lakes, and other data sources.
 During this step, the data scientists who collect the data should confirm that it is a good fit for the objectives of the planned
analytics applications.

 Data discovery and profiling:


 Explore the collected data to better understand what it contains and what needs to be done to prepare it for the
intended uses.
 To help with that, data profiling identifies patterns, relationships, and other attributes in the data, as well as
inconsistencies, anomalies, missing values, and other issues so they can be addressed.

 Data cleansing:
 The identified data errors and issues are corrected to create complete and accurate data sets.
 For example, as part of cleansing data sets, faulty data is removed or fixed, missing values are filled in and
inconsistent entries are harmonized.
Steps in the data preparation process
 Data structuring:
 At this point, the data needs to be modeled and organized to meet the analytics requirements.

 For example, data stored in comma-separated values (CSV) files or other file formats has to be converted into tables to make
it accessible to BI and analytics tools.

 Data transformation and enrichment:


 In addition to being structured, the data must be transformed into a unified and usable format.

 For example, data transformation may involve creating new fields or columns that aggregate values from existing ones.

 Data enrichment further enhances and optimizes data sets as needed, through measures such as augmenting and adding data.

 Data validation and publishing:


 Automated routines are run against the data to validate its consistency, completeness and accuracy.

 The prepared data is then stored in a data warehouse, a data lake or another repository and either used directly by whoever
prepared it or made available for other users to access.
How to Choose Data Preparation Techniques?
• The steps in a predictive modeling project before and after the data
preparation step inform the data preparation that may be required.
Defining the problem
• Gather data from the problem domain.
• Discuss the project with subject matter experts.
• Select those variables to be used as inputs and outputs for a
predictive model.
• Review the data that has been collected.
• Summarize the collected data using statistical methods.
• Visualize the collected data using plots and charts
• Information known about the choice of algorithms and the discovery
of well-performing algorithms can also inform the selection and
configuration of data preparation methods.
• For example, the choice of algorithms may impose requirements and
expectations on the type and form of input variables in the data.
• This might require variables to have a specific probability distribution,
the removal of correlated input variables, and/or the removal of
variables that are not strongly related to the target variable.
Why Data Preparation is So Important

On a predictive modeling project, machine learning algorithms learn a
mapping from input variables to a target variable.

The most common form of predictive modeling project involves so-called
structured data or tabular data.

This is data as it looks in a spreadsheet or a matrix, with rows of
examples and columns of features for each example.
Why Data Preparation is So Important

We cannot fit and evaluate machine learning algorithms on raw data.

Instead, we must transform the data to meet the requirements of
individual machine learning algorithms.

Structured data in machine learning consists of rows and columns.

Data preparation is a required step in each machine learning project.

The routineness of machine learning algorithms means the majority of
effort on each project is spent on data preparation.

What Is Data in Machine Learning

What we call data are observations of real-world phenomena.

Each piece of data provides a small window into a limited aspect of
reality.

The most common type of input data is typically referred to as tabular
data or structured data.

This is data as you might see it in a spreadsheet, in a database, or in a
comma-separated values (CSV) file.
What Is Data in Machine Learning

The table is composed of rows and columns.

A row represents one example from the problem domain and may be
referred to as an example, an instance, or a case.

A column represents the properties observed about the example and
may be referred to as a variable, a feature, or an attribute.
What Is Data in Machine Learning

Row: A single example from the domain is often called an instance,
example, or sample in machine learning.

Column: A single property is recorded for each example, often called
a variable, predictor, or feature in machine learning.

Input Variables: Columns in the dataset provided to a model in order
to make a prediction.

Output Variable: Column in the dataset to be predicted by a model.

Raw Data Must Be Prepared
Raw data: Data in the form provided from the domain.
In almost all cases, raw data will need to be changed before you can use
it as the basis for modeling with machine learning.
• A feature is a numeric representation of an aspect of raw data.
• Features sit between data and models in the machine learning
pipeline.
• Feature engineering is the act of extracting features from raw data and
transforming them into formats that are suitable for the machine
learning model.
Reasons why you must prepare raw data in a machine
learning project

• Machine Learning Algorithms Expect Numbers


• Machine Learning Algorithms Have Requirements
• Model Performance Depends on Data
• Complex Data: Raw data contains compressed complex nonlinear
relationships that may need to be exposed
• Messy Data: Raw data contains statistical noise, errors, missing values, and
conflicting examples.
Predictive Modeling Is Mostly Data Preparation?

• Data quality is one of the most important problems in data management, since dirty data
often leads to inaccurate data analytics results and incorrect business decisions.
To be an effective machine learning practitioner, you must know:
• The different types of data preparation to consider on a project.
• The top few algorithms for each class of data preparation technique.
• When to use and how to configure top data preparation techniques.
Tour of Data Preparation Techniques

 Data Cleaning

Feature Selection

Data Transforms

Feature Engineering

Dimensionality Reduction
Data Cleaning
 Data cleaning is an operation that is typically performed first, prior to other data
preparation operations.

 Fixing systematic problems or errors in messy data.

 Could involve identifying and addressing specific observations that may be incorrect.

 There are many reasons data may have incorrect values, such as being mistyped, corrupted,
duplicated, and so on.

 Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed.

 This might involve removing a row or a column.


Data Cleaning

Alternately, it might involve replacing observations with new values.

 As such, there are general data cleaning operations that can be performed, such as:
 Using statistics to define normal data and identify outliers.

 Identifying columns that have the same value or no variance and removing them.

 Identifying duplicate rows of data and removing them.

 Marking empty values as missing.

 Imputing missing values using statistics or a learned model.


Overview of Data Cleaning Techniques.
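
A short pandas sketch of some of these cleaning operations. The small DataFrame and its column names are invented for illustration only.
Example code (illustrative sketch):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":   [25, 25, 40, np.nan, 120],
    "city":  ["Delhi", "Delhi", "Pune", "Pune", "Delhi"],
    "const": [1, 1, 1, 1, 1],            # column with a single value (no variance)
})

# Identify duplicate rows of data and remove them.
df = df.drop_duplicates()

# Identify columns that have the same value (no variance) and remove them.
df = df.loc[:, df.nunique() > 1]

# Use simple statistics to define normal data and flag possible outliers
# (values more than 3 standard deviations from the mean; none in this tiny sample).
z = (df["age"] - df["age"].mean()) / df["age"].std()
print(df[z.abs() > 3])

# Impute missing values using a statistic such as the median.
df["age"] = df["age"].fillna(df["age"].median())
print(df)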
Feature Selection
Feature selection refers to techniques for selecting a subset of input features
that are most relevant to the target variable that is being predicted.

This is important as irrelevant and redundant input variables can distract or
mislead learning algorithms, possibly resulting in lower predictive
performance.

Feature selection techniques may generally be grouped into those that use the
target variable (supervised) and those that do not (unsupervised).
Feature Selection

Additionally, the supervised techniques can be further divided into
models that automatically select features as part of fitting the model
(intrinsic),

those that explicitly choose features that result in the best performing
model (wrapper), and

those that score each input feature and allow a subset to be selected
(filter).
Feature Selection

Statistical methods, such as correlation, are popular for scoring input
features.

The features can then be ranked by their scores and a subset with the
largest scores is used as input to a model.

The choice of statistical measure depends on the data types of the
input variables.
Overview of Feature Selection
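
A minimal sketch of filter-style feature selection, scoring each input feature against the target and keeping the highest-scoring subset. The Iris dataset, the ANOVA F-test score function, and k=2 are illustrative choices.
Example code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every input feature against the target and keep the best 2.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Feature scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_selected.shape)   # (150, 2)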
Data Transforms

Data transforms are used to change the type or distribution of data
variables.

This is a large umbrella of different techniques and they may be just as
easily applied to input and output variables.

Recall that data may have one of a few types, such as numeric or
categorical, with subtypes for each,
 such as integer and real-valued (floating-point) values for numeric, and
 nominal, ordinal, and Boolean for categorical.


Data Transforms

Numeric Data Type: Number values.


Integer: Integers with no fractional part.

Float: Floating-point values.

Categorical Data Type: Label values.


Ordinal: Labels with a rank ordering.

Nominal: Labels with no rank ordering.

Boolean: Values True and False.


Overview of Data Variable Types
Nominal or ordinal
• Country?
• Rank?
• Hair color?
• Income level?
• Satisfaction level?
• Nationality?
Data Transforms Methods
 Discretization Transform: Encode a numeric variable as an ordinal variable
 Ordinal Transform: Encode a categorical variable into an integer variable
 One Hot Transform: Encode a categorical variable into binary variables
 Normalization Transform: Scale a variable to the range 0 and 1
 Standardization Transform: Scale a variable to a standard Gaussian
 Power Transform: Change the distribution of a variable to be more Gaussian
 Quantile Transform: Impose a probability distribution such as uniform or Gaussian
Overview of Data Transforms
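
The sketch below applies a few of the transforms listed above with scikit-learn. The small numeric and categorical arrays are made-up values used only to show the effect of each transform.
Example code (illustrative sketch):
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, StandardScaler,
                                   OrdinalEncoder, OneHotEncoder, KBinsDiscretizer)

numeric = np.array([[10.0], [20.0], [30.0], [40.0]])
labels = np.array([["red"], ["green"], ["red"], ["blue"]])

# Normalization: scale a numeric variable to the range 0 to 1.
print(MinMaxScaler().fit_transform(numeric).ravel())

# Standardization: scale a numeric variable to zero mean and unit variance.
print(StandardScaler().fit_transform(numeric).ravel())

# Discretization: encode a numeric variable as an ordinal variable (3 bins).
print(KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(numeric).ravel())

# Ordinal transform: encode a categorical variable as integers.
print(OrdinalEncoder().fit_transform(labels).ravel())

# One hot transform: encode a categorical variable as binary variables.
print(OneHotEncoder().fit_transform(labels).toarray())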
Feature Engineering

Feature engineering refers to the process of creating new input variables
from the available data.

Engineering new features is highly specific to your data and data types.

There are some techniques that can be reused, such as:


 Adding a Boolean flag variable for some state.

 Adding a group or global summary statistic, such as a mean.

 Adding new variables for each component of a compound variable, such as a date-time.
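
A small pandas sketch of the three reusable techniques above. The sales-style data, column names, and the 100-unit threshold are invented for illustration.
Example code (illustrative sketch):
import pandas as pd

df = pd.DataFrame({
    "order_time": pd.to_datetime(["2023-01-05 09:30", "2023-01-06 18:45", "2023-02-10 11:00"]),
    "store": ["A", "A", "B"],
    "amount": [120.0, 80.0, 200.0],
})

# Boolean flag variable for some state.
df["is_large_order"] = df["amount"] > 100

# Group summary statistic, such as a mean per store.
df["store_mean_amount"] = df.groupby("store")["amount"].transform("mean")

# New variables for each component of a compound variable (a date-time here).
df["order_month"] = df["order_time"].dt.month
df["order_hour"] = df["order_time"].dt.hour

print(df)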
Dimensionality Reduction
 The number of input features for a dataset may be considered the dimensionality of the data.

 For example, two input variables together can define a two-dimensional area where each
row of data defines a point in that space.

 This idea can then be scaled to any number of input variables to create large multi-dimensional hyper-volumes.

 The problem is, the more dimensions this space has (e.g. the more input variables), the
more likely it is that the dataset represents a very sparse and likely unrepresentative
sampling of that space.

 This is referred to as the curse of dimensionality.


Dimensionality Reduction

The most common approach to dimensionality reduction is to use a
matrix factorization technique:
Principal Component Analysis

Singular Value Decomposition

Linear Discriminant Analysis


Overview of Dimensionality Reduction Techniques

Manifold learning algorithms: Kohonen self-organizing maps (SOM) and t-Distributed Stochastic Neighbor
Embedding (t-SNE).
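
A minimal sketch of matrix-factorization-based dimensionality reduction using Principal Component Analysis. The Iris dataset and the choice of 2 components are assumptions made for illustration.
Example code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
print("Original shape:", X.shape)        # (150, 4)

# Project the 4 input features onto 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape) # (150, 2)
print("Variance explained:", pca.explained_variance_ratio_)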
Cross-validation

Cross-validation is a technique for evaluating a machine learning
model and testing its performance on unseen data.

It helps to compare and select an appropriate model for the specific
predictive modeling problem.

CV is easy to understand, and easy to implement.

All of this makes cross-validation a powerful tool for selecting the
best model for a specific task.
Cross-validation
• If you have a machine learning model and some data, you want to tell whether your
model can fit the data well.
• You can split your data into training and test sets, train your model with the
training set, and evaluate the result with the test set.
• But you evaluated the model only once, and you cannot be sure whether the good result
is by luck or not. You want to evaluate the model multiple times so you can be
more confident about the model design.
• The procedure has a single parameter called k that refers to the number of
groups that a given data sample is to be split into. As such, the procedure is
often called k-fold cross-validation. When a specific value for k is chosen, it
may be used in place of k in the reference to the model, such as k=10 becoming
10-fold cross-validation.
Cross-validation
• Cross-validation is primarily used in applied machine learning to
estimate the skill of a machine learning model on unseen data.
• That is, to use a limited sample in order to estimate how the model is
expected to perform in general when used to make predictions on data
not used during the training of the model.
• It is a popular method because it is simple to understand and because it
generally results in a less biased or less optimistic estimate of the
model skill than other methods, such as a simple train/test split.
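
A short sketch of estimating model skill with k-fold cross-validation in scikit-learn. The Iris dataset, the logistic regression model, and k=10 are illustrative choices, not prescribed by the slides.
Example code (illustrative sketch):
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k=10 becomes 10-fold cross-validation; each fold is used once as the test set.
scores = cross_val_score(model, X, y, cv=10)
print("Fold accuracies:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))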
Cross-validation Algorithm
1. Divide the dataset into two parts: one for training, the other for testing.
2. Train the model on the training set.
3. Validate the model on the test set.
4. Repeat steps 1-3 a couple of times. This number depends on the CV method that you are using.

Common cross-validation methods:
• Hold-out
• K-folds
• Leave-one-out
• Leave-p-out
• Stratified K-folds
• Repeated K-folds
• Nested K-folds
• Time series CV
Inner Working Process of Cross-Validation
Shuffle the dataset in order to remove any kind of order

Split the data into K number of folds. K= 5 or 10 will work for most of the cases.

Now keep one fold for testing and remaining all the folds for training.

Train(fit) the model on the train set and test(evaluate) it on the test set and note down
the results for that split

Now repeat this process for all the folds, every time choosing a separate fold as test data

So for every iteration our model gets trained and tested on different sets of data

At the end sum up the scores from each split and get the mean score
Inner Working of Cross Validation
K-fold cross validation
In K Fold cross-validation input data is divided into 'K' number of folds,
hence the name K Fold.

Suppose we have divided data into 5 folds i.e. K=5.

Now we have 5 sets of data to train and test our model.

So the model will get trained and tested 5 times, but for every iteration we
will use one fold as test data and rest all as training data.

Note that for every iteration, data in training and test fold changes which
adds to the effectiveness of this method.
K-fold cross-validation
Worked Example
Imagine we have a data sample with 6 observations:
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
• The first step is to pick a value for k in order to determine the number
of folds used to split the data. Here, we will use a value of k=3. That
means we will shuffle the data and then split the data into 3 groups.
Because we have 6 observations, each group will have an equal number
of 2 observations.
For example:
Fold1: [0.5, 0.2]
Fold2: [0.1, 0.3]
Fold3: [0.4, 0.6]
• We can then make use of the sample, such as to evaluate the skill of a
machine learning algorithm.
• Three models are trained and evaluated with each fold given a chance
to be the held out test set.
Model1: Trained on Fold1 + Fold2, Tested on Fold3
Model2: Trained on Fold2 + Fold3, Tested on Fold1
Model3: Trained on Fold1 + Fold3, Tested on Fold2
• The models are then discarded after they are evaluated as they have
served their purpose.
• The skill scores are collected for each model and summarized for use.
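
The same 3-fold procedure can be sketched with scikit-learn's KFold class. Because the fold membership depends on the shuffle seed, the exact groups may differ from the ones listed above.
Example code (illustrative sketch):
import numpy as np
from sklearn.model_selection import KFold

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

# Each fold gets one chance to be the held-out test set.
for i, (train_idx, test_idx) in enumerate(kfold.split(data), start=1):
    print("Model%d: train=%s, test=%s" % (i, data[train_idx], data[test_idx]))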
Stratified K Fold Cross Validation
Stratified K Fold is used when just randomly shuffling and splitting the data
is not sufficient.

In the case of a regression problem, folds are selected so that the mean response
value is approximately equal in all the folds.

In the case of classification problems, folds are selected to have the same
proportion of class labels.

Stratified K Fold is more useful in the case of classification problems, where
it is very important to have the same percentage of labels in every fold.
Stratified K Fold Cross Validation
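
A minimal sketch of stratified splitting with scikit-learn, using an invented imbalanced binary target so that the preserved class proportion is easy to see.
Example code (illustrative sketch):
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(12).reshape(12, 1)
y = np.array([0] * 9 + [1] * 3)   # 9 samples of class 0, 3 of class 1

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Every test fold keeps the same 3:1 class proportion as the full dataset.
    print("Fold %d test-set class counts:" % fold, np.bincount(y[test_idx]))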
Advantages
We end up using all the data for training and testing, and this is very useful
in the case of small datasets.

It covers the variation of the input data by validating the performance of the
model on multiple folds.

Multiple folds also help in the case of unbalanced data.

Model performance analysis for every fold gives us more insights to fine
tune the model

Used for hyperparameter tuning


Data Leakage
Data Leakage occurs when the data used in the training process contains
information about what the model is trying to predict.

Data leakage is when information from outside the training dataset is used to
create the model.

 This additional information can allow the model to learn or know something
that it otherwise would not know and in turn invalidate the estimated
performance of the model being constructed.

 Cross-validation is used to minimize data leakage when developing predictive models.
Data Preparation Without Data Leakage
• Data preparation is the process of transforming raw data into a form that
is appropriate for modeling.
• A naive approach to preparing data applies the transform on the entire
dataset before evaluating the performance of the model. This results in a
problem referred to as data leakage, where knowledge of the hold-out
test set leaks into the dataset used to train the model.
• This can result in an incorrect estimate of model performance when
making predictions on new data.
• A careful application of data preparation techniques is required in order
to avoid data leakage, and this varies depending on the model evaluation
scheme used, such as train-test splits or k-fold cross-validation.
Problem With Naive Data Preparation
 The manner in which data preparation techniques are applied to data matters.

 A common approach is to first apply one or more transforms to the entire dataset.

 Then the dataset is split into train and test sets or k-fold cross-validation is used to fit and evaluate a
machine learning model.
 1. Prepare Dataset
 2. Split Data
 3. Evaluate Models

 Although this is a common approach, it is dangerously incorrect in most cases.

 The problem with applying data preparation techniques before splitting data for model evaluation is
it can lead to data leakage and, in turn, will likely result in an incorrect estimate of a model’s
performance on the problem.
Solution for Naive Data Preparation
 The solution is straightforward.

 Data preparation must be fit on the training dataset only


 1. Split Data.
 2. Fit Data Preparation on Training Dataset.
 3. Apply Data Preparation to Train and Test Datasets.
 4. Evaluate Models.

 The entire modeling pipeline must be prepared only on the training dataset to avoid data leakage.

 This might include data transforms, but also other techniques such as feature selection,
dimensionality reduction, feature engineering, and more.

 This means so-called model evaluation should really be called modeling pipeline evaluation.
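
One way to follow this recipe in practice is to wrap the data preparation and the model in a single scikit-learn Pipeline, so the transform is re-fit on the training folds only during cross-validation. The breast cancer dataset, the MinMaxScaler, and the logistic regression model are illustrative assumptions.
Example code (illustrative sketch):
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# The whole modeling pipeline (preparation + model) is what gets evaluated.
pipeline = Pipeline([
    ("scale", MinMaxScaler()),                    # fit on the training folds only
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, cv=10)
print("Mean accuracy: %.3f" % scores.mean())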
NumPy
• NumPy, which stands for Numerical Python, is a library consisting of
multidimensional array objects and a collection of routines for processing those
arrays.
• Using NumPy, mathematical and logical operations on arrays can be performed
Important Numpy Functions in Python
np.array(): This function is used to create an array in NumPy.
Example code
import numpy as np
arr = np.array([1, 2, 3])
print(arr)

Output:
[1 2 3]

np.arange(): This function is used to create an array with a range of values.
import numpy as np
arr = np.arange(1, 6)
print(arr)
Output:
[1 2 3 4 5]
Explanation: In this example, we created a numpy array with a range of
values which is (1,6), where 6 is excluded.
np.zeros(): This function is used to create an array filled with zeros.
Example code:
import numpy as np
arr = np.zeros(3)
print(arr)
Output:
[0. 0. 0.]
np.ones(): This function is used to create an array filled with ones.

Example Code:
import numpy as np
arr = np.ones(3)
print(arr)

Output:
[1. 1. 1.]
np.random.rand(): This function is used to create an array with
random values between 0 and 1.
Example code:
import numpy as np
arr = np.random.rand(3)
print(arr)

Output:
[0.5488135 0.71518937 0.60276338]
• np.random.randint(): This function is used to create an array with
random integer values between a specified range.
Example Code:
import numpy as np
arr = np.random.randint(0, 10, 5)
print(arr)
Output:
[1 5 8 9 8]
Explanation: In this example, we created an array of size 5 filled with random
integer values that lie between 0 and 10, where 10 is excluded.
np.max(): This function is used to find the maximum value in an array.
Example Code:
import numpy as np
arr = np.array([1, 2, 3])
max_value = np.max(arr)
print(max_value)

Output:
3
np.min(): This function is used to find the minimum value in an array.

Example Code:
import numpy as np
arr = np.array([1, 2, 3])
min_value = np.min(arr)
print(min_value)

Output:
1
Pandas
• The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis"
and was created by Wes McKinney in 2008.

• Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data.

• For example, let’s first import the data into pandas DataFrame df . A Pandas
DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with
rows and columns.

import pandas as pd

df = pd.read_csv("Dummy_Sales_Data_v1.csv")
df.head()
This function helps you to get the first few rows of the dataset.

df.tail()
This function helps you to get the last few rows of the dataset.

df.info()
This function returns a quick summary of the DataFrame.

df.describe()
This function returns descriptive statistics about the data.
df["column"].unique()
• This function returns the unique values in a column (a Series); unique() is called on a single column rather than on the whole DataFrame.

df.isnull()
• This function helps you to check in which row and which column your
data has missing values.

df.fillna()
• This function is used to replace missing values or NaN in the df with
user-defined values. df.fillna() takes 1 required and 5 optional
parameters.
Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and
interactive visualizations in Python. Matplotlib makes easy things easy
and hard things possible.
Example:
import matplotlib.pyplot as plt
import numpy as np

xpoints = np.array([1, 2, 6, 8])


ypoints = np.array([3, 8, 1, 10])

plt.plot(xpoints, ypoints)
plt.show()
import matplotlib.pyplot as plt
import numpy as np

ypoints = np.array([3, 8, 1, 10, 5, 7])

plt.plot(ypoints)
plt.show()

If we do not specify the points on the x-axis, they will get the default
values 0, 1, 2, 3 etc., depending on the length of the y-points.
SciPy and Scikit
• SciPy builds on NumPy and provides a large number of higher-level
science and engineering modules, including optimization, integration,
interpolation, and more.
• SciPy provides algorithms for optimization, integration, interpolation,
eigenvalue problems, algebraic equations, differential equations,
statistics and many other classes of problems.
• Scikit-learn: It is a machine learning library that is built on NumPy,
SciPy, and Pandas. It features various classification, regression, and
clustering algorithms and is designed to interoperate with the Python
numerical and scientific libraries.
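
As a small illustration of the kind of routine SciPy provides, the sketch below minimizes an invented quadratic function with scipy.optimize; the function and starting point are assumptions made only for this example.
Example code (illustrative sketch):
import numpy as np
from scipy.optimize import minimize

# An invented objective: f(x, y) = (x - 2)^2 + (y + 1)^2, minimized at (2, -1).
def objective(v):
    x, y = v
    return (x - 2) ** 2 + (y + 1) ** 2

result = minimize(objective, x0=np.array([0.0, 0.0]))
print("Minimum found at:", result.x)   # approximately [2, -1]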
Load Machine Learning Data
CSV, or comma-separated values, is the most common format in which machine learning data is
presented. Comma-separated values (CSV) is a text file format that uses commas to separate values, and
newlines to separate records. Each record consists of the same number of fields, and these are separated by
commas in the CSV file.
In the CSV file of your machine learning data, there are parts and features that you need to understand. These
include:
• CSV File Header: The header in a CSV file is used in automatically assigning names or labels to each
column of your dataset. If your file doesn't have a header, you will have to manually name your attributes.
• Comments: You can identify comments in a CSV file when a line starts with a hash sign (#). Depending on
the method you choose to load your machine learning data, you will have to determine if you want these
comments to show up, and how you can identify them.
• Delimiter: A delimiter separates multiple values in a field and is indicated by the comma (,). The tab (\t) is
another delimiter that you can use, but you have to specify it clearly.
• Quotes: If field values in your file contain spaces, these values are often quoted and the symbol that denotes
this is the double quotation mark ("). If you choose to use other characters, you need to specify this in your file.
Load Data with Python Standard Library
• With the Python Standard Library, you will be using the csv module and the function
reader() to load your CSV files. Once loaded, the CSV data can be converted to a
NumPy array and used for machine learning.
• For example, below is a small code snippet that, when run, will load a dataset that has
no header and contains only numeric fields, and then convert it to a NumPy array.

# Load CSV
import csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
data = numpy.array(x).astype('float')
print(data.shape)
Load Data File With NumPy
• Another way to load machine learning data in Python is by using
NumPy and the numpy.loadtxt() function.
• In the sample code below, the function assumes that your file has no
header row and all data use the same format. It also assumes that the
file pima-indians-diabetes.data.csv is stored in your current directory.

# Load CSV
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)
• Running the sample code above will load the file as a numpy.ndarray
and produce the following shape of the data:
• (768, 9)
• If your file can be retrieved using a URL, the above code can be
modified to the following, while yielding the same dataset:
# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
raw_data = urlopen(url)
dataset = loadtxt(raw_data, delimiter=",")
print(dataset.shape)
Load Data File With Pandas

• The third way to load your machine learning data is using Pandas and
the pandas.read_csv() function.
• The pandas.read_csv() function is very flexible and the most ideal way
to load machine learning data. It returns a pandas.DataFrame that
enables you to start summarizing and plotting immediately.
• The sample code below assumes that the pima-indians-
diabetes.data.csv file is stored in your current directory.
# Load CSV using Pandas
import pandas
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(filename, names=names)
print(data.shape)
You will notice here that we explicitly identified the names of each attribute for the DataFrame. Running the
sample code above prints the following shape of the data:
(768, 9)
• If your file can be retrieved using a URL, the above code can be
modified to the following, while yielding the same dataset:

# Load CSV using Pandas from URL
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
print(data.shape)

Running the sample code above will download the CSV file, parse it, and produce the following shape of
the loaded DataFrame:
(768, 9)
