DS Module 1
MODULE 1
Basic Concepts: Predictive Modelling, Data Preparation, Importance of Data Preparation, Data Cleaning, Feature Selection, Data Transforms, Feature Engineering, Dimensionality Reduction, K-fold Cross Validation, Data Leakage and avoidance measures. Python Packages: NumPy, Matplotlib, Pandas, SciPy, Scikit-learn, DataFrame, Loading Machine Learning data.
What is Data Science?
Data Science is a field that derives insights from structured and unstructured data, using different scientific methods and algorithms, and consequently helps in generating insights, making predictions, and devising data-driven solutions.
It is the process of extracting information from structured and unstructured data using data mining techniques (for getting information out of raw data).
It involves a great deal of science: pulling the necessary information out of enormous volumes of data about products and services in order to make better products, improve development, and more.
It uses a large amount of data to get meaningful insights using statistics and computation
for decision making.
Data Science Examples
It helps in anticipating what customers would like to purchase or eat based on their previous order history.
Data Collection - The data collected needs to be relevant so that it can solve the business problem correctly.
Data Preparation - It helps in cleaning and bringing the data into shape, which is required for further
analysis and modeling
Exploratory Data Analysis – Data is analyzed using summary statistics and graphically to understand
key patterns.
Model Building – There are two types of data modeling, i.e.,
Descriptive analytics, which involves insights based on historical data, and
Predictive modeling, which involves future predictions.
Model Deployment and Maintenance - Once the model is built, it is ready to be deployed in the real world. Deployment can occur offline, on the web, on the cloud, or in an Android or iOS app.
Who Exactly is a Data Scientist?
Statistics – To find hidden patterns in the data and correlations between different features.
Critical Thinking – Have the critical thinking ability to analyze the facts before
concluding.
R Programming: R is one of the essential statistical programming tools, which is mainly used by Data Scientists to
perform a detailed analysis of large data to find insights.
SQL: It is also a valuable tool used by a Data Scientist. It helps them in working on DBMS and structured data. A Data
Engineer also uses this tool.
Tableau / PowerBI: This is a top-rated data visualization tool among Data Scientists because of its amazing reporting
capabilities. This tool makes it simple to visualize the data and show the results to clients.
Hadoop: It is a powerful open-source framework used by many Data Scientists to store and process very large datasets in a distributed way.
SAS [Statistical Analysis System]: SAS is an advanced tool for analysis, which many data analysts use. It has many powerful features, such as analyzing, extracting, and reporting, which make it a popular tool. Also, it has a user-friendly GUI that anyone can use easily, and Data Scientists use it to convert data into business insights.
Computer Programing Languages used for Data Science
AI vs ML vs DL
Artificial Intelligence
Artificial intelligence refers to the simulation of human intelligence in machines.
Virtual assistants, for example, help us find information, get directions, send messages, make voice calls, open applications, and add events to the calendar.
Tesla
Not only smartphones but automobiles are also shifting towards Artificial Intelligence.
Tesla's cars have not only achieved many accolades but also offer features such as self-driving, predictive capabilities, and genuine technological innovation.
Machine Learning
Machine Learning is the method of dealing with data and automating tasks by training a model so that it gives new suggestions and makes detections when similar data is provided to it.
It comes under the AI umbrella: it learns from the data and makes decisions faster, saving time and human effort so that there is less need for human intervention.
The steps involved in building a machine learning model include the following (a minimal code sketch follows the list):
Gather training data.
Prepare data for training.
Decide which learning algorithm to use.
Train the learning algorithm.
Evaluate the learning algorithm’s outputs.
If necessary, adjust the variables (hyperparameters) that govern the training process in order to improve output.
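A minimal sketch of these steps in Python with scikit-learn; the dataset (the built-in Iris data) and the model (k-nearest neighbours) are illustrative choices, not part of the original material:
# Minimal sketch of the model-building steps (illustrative dataset and model)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# 1. Gather training data
X, y = load_iris(return_X_y=True)
# 2. Prepare data for training (here just a train/test split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 3. Decide which learning algorithm to use
model = KNeighborsClassifier(n_neighbors=5)   # n_neighbors is a hyperparameter
# 4. Train the learning algorithm
model.fit(X_train, y_train)
# 5. Evaluate the learning algorithm's outputs
print(accuracy_score(y_test, model.predict(X_test)))
# 6. If needed, adjust hyperparameters (e.g. n_neighbors) and repeat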
Deep Learning
Deep learning is a type of machine learning and artificial intelligence (AI) that imitates
the way humans gain certain types of knowledge.
Deep learning is an important element of data science, which includes statistics and
predictive modeling.
It is extremely beneficial to data scientists who are tasked with collecting, analyzing, and
interpreting large amounts of data; deep learning makes this process faster and easier.
Data Science —
Scientific methods, algorithms and systems to extract knowledge or insights from
big data
•Also known as Predictive or Advanced Analytics
•Algorithmic and computational techniques and tools for handling large data sets
•Increasingly focused on preparing and modeling data for ML & DL tasks
•Encompasses statistical methods, data manipulation and streaming technologies
(e.g. Spark, Hadoop)
•Key skills and tools behind building modern AI technologies
AI vs ML vs DL
Predictive Modelling
• Predictive modeling is a commonly used statistical technique to predict
future behavior.
• Predictive modeling solutions are a form of data-mining technology that
works by analyzing historical and current data and generating a model
to help predict future outcomes.
• Data is collected, a statistical model is formulated, predictions are
made, and the model is validated (or revised) as additional data becomes
available.
• Predictive models analyze past performance to assess how likely a
customer is to exhibit a specific behavior in the future
Predictive Modelling
Predictive modeling is the process of using known results to create, process, and
validate a model that can be used to forecast future outcomes.
It is a tool used in predictive analytics, a data mining technique that attempts to
answer the question “what might possibly happen in the future?”
Predictive Modeling is the use of data and statistics to predict the outcome of the data
models.
This prediction finds its utility in almost all areas from sports, to TV ratings, corporate
earnings, and technological advances.
Data preparation is the process of cleaning and transforming raw data prior to processing and
analysis.
Key steps include collecting, cleaning, and labeling raw data into a form suitable for algorithms and for visualisation.
It also involves taking a close look at the data and exploring it using summary statistics and data visualisation.
Step 2: Prepare Data
This step is concerned with transforming the raw data that was
collected into a form that can be used in modeling.
Once a suite of models has been evaluated, you must choose a model
that represents the solution to the project.
As such, the raw data must be pre-processed prior to being used to fit
and evaluate a machine learning model.
We can define data preparation as the transformation of raw data into
a form that is more suitable for modeling
Data Preparation Purpose
The purpose of data preparation is to ensure that the raw data being readied for processing and analysis is accurate and consistent, so the results of analytics applications will be valid.
Data is commonly created with missing values, inaccuracies, or other errors, and separate
data sets often have different formats that need to be reconciled when they're combined.
Correcting data errors, validating data quality, and consolidating data sets are big parts of
data preparation projects.
Data preparation also involves finding relevant data to ensure that analytics applications
deliver meaningful information and actionable insights for business decision-making.
Challenges of data preparation
Inadequate or nonexistent data profiling
If data isn't properly profiled, errors, anomalies, and other problems might not be identified, which can result in
flawed analytics.
Data enrichment
Deciding how to enrich a data set -- for example, what to add to it -- is a complex task
that requires a strong understanding of business needs and analytics goals.
Feature Selection: Identifying those input variables that are most relevant to the
task.
Data Transforms: Changing the scale or distribution of variables.
Data cleansing:
The identified data errors and issues are corrected to create complete and accurate data sets.
For example, as part of cleansing data sets, faulty data is removed or fixed, missing values are filled in and
inconsistent entries are harmonized.
Steps in the data preparation process
Data structuring:
At this point, the data needs to be modeled and organized to meet the analytics requirements.
For example, data stored in comma-separated values (CSV) files or other file formats has to be converted into tables to make
it accessible to BI and analytics tools.
For example, data transformation may involve creating new fields or columns that aggregate values from existing ones.
Data enrichment further enhances and optimizes data sets as needed, through measures such as augmenting and adding data.
The prepared data is then stored in a data warehouse, a data lake or another repository and either used directly by whoever
prepared it or made available for other users to access.
How to Choose Data Preparation Techniques?
• The steps in a predictive modeling project before and after the data
preparation step inform the data preparation that may be required.
Defining the problem
• Gather data from the problem domain.
• Discuss the project with subject matter experts.
• Select those variables to be used as inputs and outputs for a
predictive model.
• Review the data that has been collected.
• Summarize the collected data using statistical methods.
• Visualize the collected data using plots and charts
• Information known about the choice of algorithms and the discovery
of well-performing algorithms can also inform the selection and
configuration of data preparation methods.
• For example, the choice of algorithms may impose requirements and
expectations on the type and form of input variables in the data.
• This might require variables to have a specific probability distribution,
the removal of correlated input variables, and/or the removal of
variables that are not strongly related to the target variable.
Why Data Preparation is So Important
We cannot fit and evaluate machine learning algorithms on raw data; the data must first be transformed into a form the algorithms can work with.
A row represents one example from the problem domain and may be
referred to as an example, an instance, or a case.
Data Cleaning
Feature Selection
Data Transforms
Feature Engineering
Dimensionality Reduction
Data Cleaning
Data cleaning is an operation that is typically performed first, prior to other data
preparation operations.
It could involve identifying and addressing specific observations that may be incorrect.
There are many reasons data may have incorrect values, such as being mistyped, corrupted,
duplicated, and so on.
Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed.
Identifying columns that have the same value or no variance and removing them.
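A small pandas sketch of these cleaning operations; the DataFrame and its values are made up purely for illustration:
import pandas as pd
# Made-up data containing a duplicated row and a zero-variance column
df = pd.DataFrame({
    "age":   [25, 32, 25, 41],
    "city":  ["A", "B", "A", "C"],
    "const": [1, 1, 1, 1],        # same value in every row (no variance)
})
df = df.drop_duplicates()                             # remove duplicated observations
n_unique = df.nunique()                               # distinct values per column
df = df.drop(columns=n_unique[n_unique <= 1].index)   # drop zero-variance columns
print(df)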
Feature selection techniques may generally be grouped into those that use the
target variable (supervised) and those that do not (unsupervised).
Feature Selection
Those that explicitly choose features that result in the best performing
model (wrapper) and
Those that score each input feature and allow a subset to be selected
(filter).
Feature Selection
The features can then be ranked by their scores and a subset with the
largest scores is used as input to a model.
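A sketch of the filter approach with scikit-learn's SelectKBest; the synthetic dataset below stands in for real input variables X and a target y:
# Score every input feature against the target and keep the k best (filter method)
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
X, y = make_classification(n_samples=200, n_features=10, random_state=1)  # stand-in data
selector = SelectKBest(score_func=f_classif, k=4)   # keep the 4 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.scores_)      # per-feature scores used for the ranking
print(X_selected.shape)      # (200, 4)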
Recall that data may have one of a few types, such as numeric or categorical, with subtypes for each, such as integer and real-valued floating-point values for numeric, and nominal, ordinal, and boolean for categorical.
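As a brief illustration of a data transform that changes the scale of numeric variables, a sketch using scikit-learn's MinMaxScaler on made-up values:
# Rescale numeric variables to the range [0, 1]
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 800.0]])    # made-up columns on very different scales
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)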
Engineering new features is highly specific to your data and data types.
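For instance, a small pandas sketch of feature engineering; the column names (order_date, price, quantity) are hypothetical and only illustrate the idea:
import pandas as pd
# Hypothetical orders table
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-05", "2023-02-14", "2023-03-01"]),
    "price":    [10.0, 25.0, 8.0],
    "quantity": [3, 1, 5],
})
# New features engineered from existing columns
df["total"] = df["price"] * df["quantity"]      # interaction feature
df["order_month"] = df["order_date"].dt.month   # date-component feature
print(df)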
For example, two input variables together can define a two-dimensional area where each
row of data defines a point in that space.
This idea can then be scaled to any number of input variables to create large multi-
dimensional hyper-volumes.
The problem is, the more dimensions this space has (e.g. the more input variables), the
more likely it is that the dataset represents a very sparse and likely unrepresentative
sampling of that space.
Manifold learning algorithms: Kohonen self-organizing maps (SOM) and t-Distributed Stochastic Neighbor Embedding (t-SNE).
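A minimal sketch of dimensionality reduction using PCA (one common technique alongside the manifold learning methods above); the synthetic data stands in for a real high-dimensional dataset:
# Project a 20-dimensional dataset onto its 3 directions of largest variance
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
X, _ = make_classification(n_samples=100, n_features=20, random_state=0)  # stand-in data
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 3)
print(pca.explained_variance_ratio_)    # variance captured by each component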
Cross-validation
It helps to compare and select an appropriate model for the specific
predictive modeling problem.
Split the data into K number of folds. K= 5 or 10 will work for most of the cases.
Now keep one fold for testing and remaining all the folds for training.
Train(fit) the model on the train set and test(evaluate) it on the test set and note down
the results for that split
Now repeat this process for all the folds, every time choosing a separate fold as test data
So for every iteration our model gets trained and tested on different sets of data
At the end, sum up the scores from each split and take the mean score (a code sketch follows)
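These steps are what scikit-learn's cross_val_score automates; a sketch with an illustrative dataset and model:
# K-fold cross-validation in one call (dataset and model are illustrative)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)   # K = 5 folds
print(scores)         # one score per fold
print(scores.mean())  # mean score across the folds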
Inner Working of Cross Validation
K-fold cross validation
In K Fold cross-validation input data is divided into 'K' number of folds,
hence the name K Fold.
So, with K = 5 for example, the model will get trained and tested 5 times, but for every iteration we will use one fold as test data and the rest as training data.
Note that for every iteration, data in training and test fold changes which
adds to the effectiveness of this method.
K-fold cross-validation
Worked Example
Imagine we have a data sample with 6 observations:
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
• The first step is to pick a value for k in order to determine the number
of folds used to split the data. Here, we will use a value of k=3. That
means we will shuffle the data and then split the data into 3 groups.
Because we have 6 observations, each group will have an equal number
of 2 observations.
For example:
Fold1: [0.5, 0.2]
Fold2: [0.1, 0.3]
Fold3: [0.4, 0.6]
• We can then make use of the sample, such as to evaluate the skill of a
machine learning algorithm.
• Three models are trained and evaluated with each fold given a chance
to be the held out test set.
Model1: Trained on Fold1 + Fold2, Tested on Fold3
Model2: Trained on Fold2 + Fold3, Tested on Fold1
Model3: Trained on Fold1 + Fold3, Tested on Fold2
• The models are then discarded after they are evaluated as they have
served their purpose.
• The skill scores are collected for each model and summarized for use (a code sketch follows).
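The same 3-fold split can be produced with scikit-learn's KFold; a sketch on the six observations above (the exact grouping depends on the shuffle, so the folds may differ from those listed):
import numpy as np
from sklearn.model_selection import KFold
data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
for train_idx, test_idx in kfold.split(data):
    print("train:", data[train_idx], "test:", data[test_idx])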
Stratified K Fold Cross Validation
Stratified K Fold is used when random shuffling and splitting of the data alone are not sufficient.
In case of regression problem folds are selected so that the mean response
value is approximately equal in all the folds.
In the case of classification problems folds are selected to have the same
proportion of class labels.
It covers the variation of input data by validating the performance of the
model on multiple folds
Model performance analysis for every fold gives us more insights to fine-tune the model (a code sketch follows)
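A sketch of stratified splitting with scikit-learn's StratifiedKFold; the small imbalanced label array is made up for illustration:
# Each test fold keeps the same class proportion as the full dataset
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.arange(12).reshape(12, 1)                     # made-up features
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])   # imbalanced labels (8 vs 4)
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
for train_idx, test_idx in skf.split(X, y):
    print("test labels:", y[test_idx])               # two 0s and one 1 per fold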
Data leakage is when information from outside the training dataset is used to
create the model.
This additional information can allow the model to learn or know something
that it otherwise would not know and in turn invalidate the estimated
performance of the model being constructed.
A common approach is to first apply one or more transforms to the entire dataset.
Then the dataset is split into train and test sets or k-fold cross-validation is used to fit and evaluate a
machine learning model.
1. Prepare Dataset
2. Split Data
3. Evaluate Models
The problem with applying data preparation techniques before splitting data for model evaluation is
it can lead to data leakage and, in turn, will likely result in an incorrect estimate of a model’s
performance on the problem.
Solution for Naive Data Preparation
The solution is straightforward.
The entire modeling pipeline must be prepared only on the training dataset to avoid data leakage.
This might include data transforms, but also other techniques such as feature selection,
dimensionality reduction, feature engineering, and more.
This means so-called model evaluation should really be called modeling pipeline evaluation.
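A sketch of this idea with a scikit-learn Pipeline, so that the scaler (and any other preparation step) is fitted only on the training folds inside cross-validation; the dataset and model are illustrative:
# Cross-validate the whole modeling pipeline to avoid data leakage
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
X, y = make_classification(n_samples=300, n_features=10, random_state=7)  # stand-in data
pipeline = Pipeline([
    ("scale", MinMaxScaler()),         # data preparation step
    ("model", LogisticRegression()),   # estimator
])
scores = cross_val_score(pipeline, X, y, cv=5)   # preparation re-fitted on each training fold
print(scores.mean())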
NumPy
• NumPy, which stands for Numerical Python, is a library consisting of
multidimensional array objects and a collection of routines for processing those
arrays.
• Using NumPy, mathematical and logical operations on arrays can be performed
Important Numpy Functions in Python
np.array(): This function is used to create an array in NumPy.
Example code
import numpy as np
arr = np.array([1, 2, 3])
print(arr)
Output:
[1 2 3]
np.ones(): This function is used to create an array filled with ones.
Example Code:
import numpy as np
arr = np.ones(3)
print(arr)
Output:
[1. 1. 1.]
np.random.rand(): This function is used to create an array with
random values between 0 and 1.
Example code:
import numpy as np
arr = np.random.rand(3)
print(arr)
Output:
[0.5488135 0.71518937 0.60276338]
• np.random.randint(): This function is used to create an array with
random integer values between a specified range.
Example Code:
import numpy as np
arr = np.random.randint(0, 10, 5)
print(arr)
Output:
[1 5 8 9 8]
Explanation: In this example, we created an array of size 5 filled with random integer values that lie between 0 (inclusive) and 10 (exclusive).
np.max(): This function is used to find the maximum value in an array.
Example Code:
import numpy as np
arr = np.array([1, 2, 3])
max_value = np.max(arr)
print(max_value)
Output:
3
np.min(): This function is used to find the minimum value in an array.
Example Code:
import numpy as np
arr = np.array([1, 2, 3])
min_value = np.min(arr)
print(min_value)
Output:
1
Pandas
• The name "Pandas" has a reference to both "Panel Data", and "Python Data Analysis"
and was created by Wes McKinney in 2008.
• Pandas is a Python library used for working with data sets. It has functions for
analyzing, cleaning, exploring, and manipulating data.
• For example, let’s first import the data into pandas DataFrame df . A Pandas
DataFrame is a 2 dimensional data structure, like a 2 dimensional array, or a table with
rows and columns.
import pandas as pd
df = pd.read_csv("Dummy_Sales_Data_v1.csv")
df.head()
This function helps you to get the first few rows of the dataset.
df.tail()
This function helps you to get the last few rows of the dataset.
df.info()
This function returns a quick summary of the DataFrame.
df.describe()
This function returns descriptive statistics about the data.
df['column'].unique()
• This function returns the unique values in a column; since unique() is a Series method, it is called on a single column rather than on the whole DataFrame.
df.isnull()
• This function helps you to check in which row and which column your
data has missing values.
df.fillna()
• This function is used to replace missing values or NaN in the df with
user-defined values. df.fillna() takes 1 required and 5 optional
parameters.
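A short sketch tying these DataFrame functions together; the data here is made up so the example runs without the CSV file used above:
import pandas as pd
import numpy as np
# Made-up data with one missing value
df = pd.DataFrame({"product": ["A", "B", "A"], "sales": [100.0, np.nan, 250.0]})
print(df.head())                  # first rows
df.info()                         # quick summary of columns and dtypes
print(df.describe())              # descriptive statistics
print(df["product"].unique())     # unique values of one column
print(df.isnull().sum())          # missing values per column
df["sales"] = df["sales"].fillna(df["sales"].mean())   # replace NaN with the column mean
print(df)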
Matplotlib
Matplotlib is a comprehensive library for creating static, animated, and
interactive visualizations in Python. Matplotlib makes easy things easy
and hard things possible.
Example:
import matplotlib.pyplot as plt
import numpy as np
xpoints = np.array([1, 2, 6, 8])    # x-axis values
ypoints = np.array([3, 8, 1, 10])   # corresponding y-axis values
plt.plot(xpoints, ypoints)
plt.show()
import matplotlib.pyplot as plt
import numpy as np
ypoints = np.array([3, 8, 1, 10, 5, 7])   # only the y-axis points are given
plt.plot(ypoints)
plt.show()
If we do not specify the points on the x-axis, they will get the default
values 0, 1, 2, 3 etc., depending on the length of the y-points.
SciPy and Scikit
• SciPy builds on NumPy and provides a large number of higher-level
science and engineering modules, including optimization, integration,
interpolation, and more.
• SciPy provides algorithms for optimization, integration, interpolation,
eigenvalue problems, algebraic equations, differential equations,
statistics and many other classes of problems.
• Scikit-learn: It is a machine learning library that is built on NumPy,
SciPy, and Pandas. It features various classification, regression, and
clustering algorithms and is designed to interoperate with the Python
numerical and scientific libraries.
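Two tiny sketches of what these libraries offer; both use only standard functions (scipy.integrate.quad for numerical integration, scikit-learn's KMeans for clustering) on made-up inputs:
import numpy as np
from scipy import integrate
from sklearn.cluster import KMeans
# SciPy: numerically integrate sin(x) from 0 to pi (exact answer is 2.0)
value, error = integrate.quad(np.sin, 0, np.pi)
print(value)
# scikit-learn: cluster a few 2-D points into two groups
points = np.array([[1, 1], [1, 2], [8, 8], [9, 8]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)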
Load Machine Learning Data
CSV or comma separated values is the most commonly used format for which machine learning data is
presented. Comma-separated values (CSV) is a text file format that uses commas to separate values, and
newlines to separate records. Each record consists of the same number of fields, and these are separated by
commas in the CSV file.
In the CSV file of your machine learning data, there are parts and features that you need to understand. These
include:
• CSV File Header: The header in a CSV file is used to automatically assign names or labels to each column of your dataset. If your file doesn't have a header, you will have to manually name your attributes.
• Comments: You can identify comments in a CSV file when a line starts with a hash sign (#). Depending on
the method you choose to load your machine learning data, you will have to determine if you want these
comments to show up, and how you can identify them.
• Delimiter: A delimiter separates multiple values in a field and is indicated by the comma (,). The tab (\t) is
another delimiter that you can use, but you have to specify it clearly.
• Quotes: If field values in your file contain spaces, these values are often quoted, and the symbol that denotes this is the double quotation mark ("). If you choose to use other characters, you need to specify this in your file.
Load Data with Python Standard Library
• With the Python Standard Library, you will be using the csv module and its reader() function to load your CSV files. After loading, the CSV data can be converted to a NumPy array and used for machine learning.
• For example, below is a small piece of code that, when run, will load this dataset (which has no header and contains only numeric fields) and convert it to a NumPy array.
# Load CSV
import csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
data = numpy.array(x).astype('float')
print(data.shape)
Load Data File With NumPy
• Another way to load machine learning data in Python is by using
NumPy and the numpy.loadtxt() function.
• In the sample code below, the function assumes that your file has no
header row and all data use the same format. It also assumes that the
file pima-indians-diabetes.data.csv is stored in your current directory.
# Load CSV
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)
• Running the sample code above will load the file as a numpy.ndarray and produce the following shape of the data:
• (768, 9)
• If your file can be retrieved using a URL, the above code can be
modified to the following, while yielding the same dataset:
# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
raw_data = urlopen(url)
dataset = loadtxt(raw_data, delimiter=",")
print(dataset.shape)
Load Data File With Pandas
• The third way to load your machine learning data is using Pandas and
the pandas.read_csv() function.
• The pandas.read_csv() function is very flexible and the most ideal way
to load machine learning data. It returns a pandas.DataFrame that
enables you to start summarizing and plotting immediately.
• The sample code below assumes that the pima-indians-
diabetes.data.csv file is stored in your current directory.
# Load CSV using Pandas
import pandas
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(filename, names=names)
print(data.shape)
You will notice here that we explicitly identified the names of each attribute to the DataFrame. Running the sample code above prints the following shape of the data:
(768, 9)
• If your file can be retrieved using a URL, the above code can be modified to the following, while yielding the same dataset:
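A minimal sketch of that modification, assuming the same Pima Indians dataset URL used in the NumPy example above:
# Load CSV from URL using Pandas
import pandas
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
print(data.shape)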
Running the sample code above will download a CSV file, parse it, and produce the following shape of
the loaded DataFrame:
(768, 9)