Intro To Data Science Summary
Conclusion: After getting to know your variables, handle the missing values and check the variable types, and then use visualizations to analyze your variables.
Lecture 5 - kNN
Lazy Learning – Classification Using Nearest Neighbors
NN classifiers: classify unlabeled examples by assigning them the class of the most similar
labeled examples.
Strengths:
● Simple and effective
● Makes no assumptions about the underlying data distribution
● Fast Training Phase
Weaknesses:
● Does not produce a model, which limits the ability to find novel insights in the relationships among features
● Slow Classification Phase
● Requires a large amount of memory
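A minimal sketch (not from the lecture) of a kNN classifier in scikit-learn, assuming the iris dataset and k = 3 purely for illustration. It shows why the training phase is fast (the data is simply stored) while classification does the distance work and needs the whole training set in memory.

```python
# Illustrative kNN sketch with scikit-learn; dataset and k=3 are assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kNN is distance-based, so features are usually rescaled first.
scaler = StandardScaler().fit(X_train)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(scaler.transform(X_train), y_train)          # "training" just stores the examples
print(knn.score(scaler.transform(X_test), y_test))   # the real work happens at prediction time
```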
Decision Trees
Strengths:
● An all-purpose classifier that does well on most problems
● Can handle numeric or nominal features, missing data
● Uses only the most important features
● Can be used on data with relatively few training examples
Weaknesses:
● Often biased toward splits on features having a large number of levels
● It is easy to overfit or underfit the model
● Can have trouble modeling some relationships due to the reliance on axis-parallel splits
● Small changes in training data can result in large changes to decision logic
● Large trees can be difficult to interpret and the decisions they make may seem
counterintuitive.
Choosing the best split
C5.0 and many other decision tree algorithms use entropy to measure purity; the best split is the one that results in the purest partitions (if a segment of data contains only a single class, it is considered pure).
The entropy of a sample of data indicates how mixed the class values are: the lowest value, 0, means the data is completely homogeneous, while a value of 1 (for a two-class problem) indicates the maximum amount of disorder.
For a given segment of data (S):
Entropy(S) = Σ −p(i) · log2(p(i)) for i = 1 … c
where c is the number of different class levels and p(i) is the proportion of values falling into class level i.
The higher the information gain, the better a feature is at creating homogeneous groups after a split on that feature.
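As a rough illustration of this calculation, the sketch below computes entropy and information gain (entropy before the split minus the weighted entropy after it) in Python; the class labels and the split are made up.

```python
import numpy as np

def entropy(labels):
    """Entropy of a segment: sum over classes of -p(i) * log2(p(i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, partitions):
    """Entropy before the split minus the weighted entropy after it."""
    n = sum(len(part) for part in partitions)
    weighted = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent) - weighted

parent = ["yes"] * 9 + ["no"] * 5        # illustrative labels
left, right = parent[:8], parent[8:]     # a hypothetical split
print(entropy(parent), information_gain(parent, [left, right]))
```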
A decision tree can continue to grow indefinitely, choosing splitting features and dividing into
smaller and smaller partitions until each example is perfectly classified or the algorithm runs out
of features to split on.
If the tree grows overly large, many of the decisions it makes will be overly specific and the
model will have been overfitted to the training data.
Pruning a decision tree reduces its size so that it generalizes better to unseen data.
Pre-pruning: stop the tree from growing and doing needless work once it reaches a certain
number of decisions or if the decision nodes contain only a small number of examples.
Post-pruning: grow a tree that is intentionally too large, then prune it back using criteria based on the error rates at the nodes to reduce the tree to a more appropriate size.
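A hedged sketch of these two ideas using scikit-learn's DecisionTreeClassifier (not C5.0, which the notes reference): pre-pruning through constructor parameters and post-pruning through cost-complexity pruning. The dataset and parameter values are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

pre_pruned = DecisionTreeClassifier(
    max_depth=5,           # stop after a certain number of decisions
    min_samples_leaf=10,   # stop when nodes contain only a small number of examples
    random_state=0,
).fit(X, y)

# Post-pruning: grow fully, then prune back by cost-complexity (ccp_alpha).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)
print(pre_pruned.get_depth(), post_pruned.get_depth())
```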
Defining a cost matrix: a matrix with two rows and two columns that specifies the cost of each kind of decision, so that some classification errors are penalized more heavily than others.
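Rough sketch only: the 2x2 cost matrix described here comes from C5.0; scikit-learn has no direct cost-matrix argument, so class_weight is shown below as a common stand-in that makes errors on one class more expensive. The costs and weights are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Conceptual 2x2 cost matrix: rows = predicted class, columns = actual class.
cost_matrix = [[0, 4],   # predicting 0 when the truth is 1 costs 4 (false negative)
               [1, 0]]   # predicting 1 when the truth is 0 costs 1 (false positive)

X, y = load_breast_cancer(return_X_y=True)
# Approximation of the cost matrix: weight class 1 four times as heavily as class 0.
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 4}, random_state=0).fit(X, y)
print(clf.score(X, y))
```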
Rule-based learners should serve as the baseline for our model (read more about them in the slides).
Data Cleaning
Missing Values
1. Mark invalid or corrupt values as missing in your dataset.
2. Confirm that the presence of marked missing values causes problems for learning
algorithms.
3. You may choose to remove rows with missing data from your dataset and evaluate a
learning algorithm on the transformed dataset.
4. Removing rows can be too limiting on some predictive modeling problems, for example if more than 30% of the data is missing.
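A small pandas sketch of these steps; the column names and the sentinel value 9999 are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, 9999.0, 40.0], "income": [50000, 62000, np.nan]})

# 1. Mark invalid/corrupt values as missing.
df["age"] = df["age"].replace(9999.0, np.nan)

# 2. Inspect how much is missing per column.
print(df.isna().mean())

# 3. Either drop rows with missing data ...
dropped = df.dropna()
# 4. ... or impute instead, which is less limiting when a lot of data is missing.
imputed = df.fillna(df.median(numeric_only=True))
```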
Feature Selection
Feature selection is the process of reducing the number of input variables when developing a
predictive model. It is desirable to reduce the number of input variables to both reduce the
computational cost of modeling and, in many cases, to improve the performance of the model.
Statistical-based feature selection methods involve evaluating the relationship between each
input variable and the target variable using statistics and selecting those input variables that
have the strongest relationship with the target variable.
Unsupervised Selection: Do not use the target variable (e.g. remove redundant variables).
Supervised Selection: Use the target variable (e.g. remove irrelevant variables).
● Intrinsic: Algorithms that perform automatic feature selection during training.
● Filter: Select subsets of features based on their relationship with the target.
● Wrapper: Search subsets of features that perform according to a predictive model.
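A sketch of these three flavours in scikit-learn, with illustrative estimators and feature counts: SelectKBest as a filter, RFE as a wrapper, and Lasso as an intrinsic method (it zeroes out coefficients on its own during training).

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

X, y = load_diabetes(return_X_y=True)

filtered = SelectKBest(score_func=f_regression, k=5).fit(X, y)        # filter
wrapped = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)   # wrapper
intrinsic = Lasso(alpha=0.5).fit(X, y)                                # intrinsic
print(filtered.get_support(), wrapped.support_, intrinsic.coef_)
```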
Feature selection is also related to dimensionality reduction techniques in that both methods
seek fewer input variables to a predictive model. The difference is that feature selection selects
features to keep or remove from the dataset, whereas dimensionality reduction creates a
projection of the data resulting in entirely new input features. As such, dimensionality reduction
is an alternative to feature selection rather than a type of feature selection.
The choice of statistical measure depends on the data types of the input and output variables:
A. This is a regression predictive modeling problem with numerical input variables. The
most common techniques are to use a correlation coefficient, such as Pearson’s for a
linear correlation, or rank-based methods for a nonlinear correlation.
B. This is a classification predictive modeling problem with numerical input variables. This might be the most common example of a classification problem. Again, the most common techniques are correlation-based, although in this case they must take the categorical target into account.
C. This is a regression predictive modeling problem with categorical input variables.
Nevertheless, you can use the same Numerical Input, Categorical Output methods
(described above), but in reverse.
D. This is a classification predictive modeling problem with categorical input variables. The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.
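A sketch matching the cases above, on synthetic data with illustrative choices: a correlation-based score (f_regression) for numerical inputs with a numerical target (case A), and chi-squared or mutual information for categorical inputs with a categorical target (case D).

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_regression, mutual_info_classif

rng = np.random.default_rng(0)
X_num, y_num = rng.normal(size=(100, 5)), rng.normal(size=100)                    # case A
X_cat, y_cat = rng.integers(0, 3, size=(100, 5)), rng.integers(0, 2, size=100)    # case D

print(SelectKBest(f_regression, k=2).fit(X_num, y_num).scores_)   # correlation-based scores
print(SelectKBest(chi2, k=2).fit(X_cat, y_cat).scores_)           # chi-squared scores
print(mutual_info_classif(X_cat, y_cat, discrete_features=True))  # information gain
```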
Feature Importance
Feature importance refers to a class of techniques that assign scores to the input features of a predictive model, indicating the relative importance of each feature when making a prediction.
The scores can be used to better understand the data and the model, and to reduce the number of input features.
Coefficients as Feature Importance: Linear machine learning algorithms fit a model where the
prediction is the weighted sum of the input values. Examples include linear regression, logistic
regression, and extensions that add regularization, such as ridge regression, LASSO, and the
elastic net.
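A minimal sketch of coefficients as importance scores, assuming linear regression on a standard scikit-learn dataset; standardizing the inputs first makes the coefficient magnitudes comparable.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(StandardScaler().fit_transform(X), y)

for name, coef in zip(load_diabetes().feature_names, model.coef_):
    print(f"{name}: {coef:.1f}")   # coefficient magnitude acts as a relative importance score
```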
Data transforms
Data transforms change the type or distribution of data variables and may be applied to input as well as output variables.
Types of transforms:
● Discretization Transform: Encode a numeric variable as an ordinal variable
● Ordinal Transform: Encode a categorical variable into an integer variable
● One Hot Transform: Encode a categorical variable into binary variables
● Binary Hot Transform (a space-efficient adaptation of One Hot Transform)
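A small sketch of the ordinal and one-hot transforms with scikit-learn (the binary transform would need a third-party encoder, so it is only noted in a comment); the toy colour column is made up.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
# A binary encoding would typically come from a third-party package such as
# category_encoders; it is not shown here.

colors = np.array([["red"], ["green"], ["blue"], ["green"]])
print(OrdinalEncoder().fit_transform(colors))            # ordinal transform -> one integer column
print(OneHotEncoder().fit_transform(colors).toarray())   # one hot transform -> one binary column per category
```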
A power transform will make the probability distribution of a variable more Gaussian.
We can apply a power transform directly by calculating the log or square root of the variable, although this may or may not be the best power transform for a given variable.
Instead, we can use a generalized version of the transform that finds a parameter (lambda or λ)
that best transforms a variable to a Gaussian probability distribution.
Uniform: Each bin has the same width in the span of possible values for the variable.
Quantile: Each bin has the same number of values, split based on percentiles.
Clustered: Clusters are identified and examples are assigned to each group.
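A sketch of a generalized power transform and the three binning strategies with scikit-learn; the skewed sample data and bin counts are synthetic, and the 'kmeans' strategy plays the role of clustered binning.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, PowerTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(200, 1))     # a skewed, non-Gaussian variable

# Yeo-Johnson searches for the lambda that makes the variable most Gaussian.
x_gauss = PowerTransformer(method="yeo-johnson").fit_transform(x)

for strategy in ("uniform", "quantile", "kmeans"):    # kmeans ~ "clustered" bins
    binned = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy).fit_transform(x)
    print(strategy, np.bincount(binned.astype(int).ravel()))   # examples per bin
```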
Feature Engineering
Creating new input variables from the available data in order to expose clearer trends in the input data, by adding broader context to a single observation or by decomposing a complex variable.
Some common techniques:
● Adding a Boolean flag variable for some state.
● Adding a group or global summary statistic, such as a mean.
● Adding new variables for each component of a compound variable, such as a date-time.
● Polynomial Transform: Create copies of numerical input variables that are raised to a
power
● Feature Crossing
Feature Crossing
A feature cross is a synthetic feature that encodes nonlinearity in the feature space by
multiplying two or more input features together.
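A minimal sketch of a feature cross via scikit-learn's PolynomialFeatures with interaction terms only; the two toy columns are illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
crossed = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(crossed.fit_transform(X))   # columns: x1, x2, and the cross x1*x2
```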
Dimensionality Reduction
Dimensionality reduction techniques create a projection of the data into a lower-dimensional
space that still preserves the most important properties of the original data.
Techniques from linear algebra can be used for dimensionality reduction. Specifically,
matrix factorization methods can be used to reduce a dataset matrix into its constituent
parts.
Manifold Learning: Techniques from high-dimensional statistics are used to create a low-dimensional projection of high-dimensional data, often for the purpose of data visualization.
Many statistical models suffer from high correlation between covariates. PCA can be used to produce linear combinations of the covariates that are uncorrelated with each other.
In PCA, we simplify a dataset with many variables by turning the original variables into a
smaller number of "Principal Components".
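A short PCA sketch on the iris dataset (an assumed example); it keeps two components and checks that they are uncorrelated.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)      # PCA is sensitive to feature scale

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)              # variance kept by each principal component
print(np.corrcoef(components, rowvar=False))      # components are (near) uncorrelated
```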
Clustering with k-means
The k-means algorithm can be sensitive to the randomly chosen initial cluster centers. Choosing the number of clusters requires a delicate balance: setting k to be very large will improve the homogeneity of the clusters, but risks overfitting the data.
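A minimal k-means sketch; k = 3 is an illustrative choice, and n_init reruns the algorithm from several random centers to reduce the sensitivity mentioned above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)       # within-cluster sum of squares; drops as k grows
print(km.labels_[:10])   # cluster assignment for the first ten examples
```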
Regression
Regression is commonly used for modeling complex relationships among data elements, estimating the impact of a treatment on an outcome, and extrapolating into the future.
Examples of regression models: simple linear regression, multiple regression, logistic regression, Poisson regression.
For simple linear regression, the slope is b = Cov(x, y) / Var(x): the denominator for b is the variance of x and the numerator is the covariance of x and y.
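A tiny NumPy sketch of the slope formula on made-up data; the covariance matrix gives both the numerator and the denominator with a consistent scaling.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_matrix = np.cov(x, y)            # [[Var(x), Cov(x, y)], [Cov(x, y), Var(y)]]
b = cov_matrix[0, 1] / cov_matrix[0, 0]   # slope = Cov(x, y) / Var(x)
a = y.mean() - b * x.mean()               # intercept follows from the means
print(a, b)
```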
Correlations
The correlation between two variables indicates how closely their relationship follows a straight
line. The correlation ranges between -1 and +1. The extreme values indicate a perfectly linear
relationship. A correlation close to zero indicates the absence of a linear relationship. The
following formula defines Pearson's correlation:
ρ(x, y) = Cov(x, y) / (σx · σy)
where σx and σy are the standard deviations of x and y.
Multiple Regression
Multiple regression is an extension of simple linear regression: it finds the values of the beta coefficients that minimize the prediction error of a linear equation.
Minimizing error:
How to solve for the vector β that minimizes the sum of the squared errors between the
predicted and actual y values?
It has been shown in the literature that the best estimate of the vector β can be computed as:
β̂ = (XᵀX)⁻¹ Xᵀ y
where X is the matrix of input values (with a column of 1s for the intercept) and y is the vector of target values.
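A short NumPy sketch of the normal-equation estimate on synthetic data; the linear system is solved directly instead of forming an explicit matrix inverse, for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])   # intercept column + 2 features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + rng.normal(scale=0.1, size=50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # normal equation: (X^T X) beta = X^T y
print(beta_hat)                                # close to [1, 2, -3]
```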