Machine Learning in PySpark
Bharti Motwani
The Data Mining Process
Consists of multiple steps from problem definition to
model deployment
Define purpose → Obtain data → Explore & clean data → Determine DM task → Choose DM methods → Apply methods → Evaluate performance → Deploy model
Defining Purpose
• Should focus on understanding the business and the problem
• Managers are often unclear about the goal of a data mining project
• Determining the goal requires iteration between data exploration and
defining the problem
Obtaining Data
• Most real-world applications combine data from multiple sources
Explore, Clean and Preprocess
Exploring, understanding and visualizing data are perhaps the most important steps in the data mining process.
Visualize and explore the data:
• Are there missing values? If yes, how should we handle them?
• Are there outliers? How should we handle them?
• Are the data summaries what we would expect? Are ranges of values reasonable?
• What does the data look like? Visualize the data using graphing techniques
Some of the key tasks that may be performed are:
• Eliminate variables or otherwise reduce data (apply domain knowledge!)
• Transform variables (“feature engineering”)
Determine Task
• Is it supervised or unsupervised learning (or something else)?
• Is it Regression? Is it Classification?
Apply Methods and Evaluate
• Typically apply multiple methods and compare their performance
• Models will be judged based on how good they are at making predictions for
test data.
Apply Methods and Evaluate
Train
• Portion of data used to develop a model
Validation (tune!)
• Portion of the data used to assess how well the model fits
• Used to adjust (tune) model parameters
Test
• Portion of the data used only at the end of the model building and
selection process
• Assess how well the final model performs on data that was
‘unseen’ during training
Model Deployment
Overarching Framework
Machine Learning
• Supervised Learning: Regression, Classification
• Unsupervised Learning: Clustering, Recommendation System, Frequent Pattern Mining
Supervised Learning
• The process of providing an algorithm with records for which an output variable of
interest is known, so that the algorithm “learns” to predict this value for new
records where the output is not known
• Goal is to predict an outcome, such as purchase/no purchase, fraud/no fraud, sales,
salary and others
Supervised Learning Models
• We build a model that understands how to correctly assign a
label to an example
• Supervised learning models are mathematical functions that
map input data (i.e., features) to predicted outcome labels
(referred to as outcome/output/target variables)
x (input features) → f(x) (model) → y (predicted outcome)
Regression
• Used when the dependent variable (label) is a real number
Example:
• Predicting sales
• Predicting the cost of coffee in 2022
Regression Problem:
Input features Outcome
Classification
• Used when the dependent variable (label) is a specific class (i.e.,
category)
Example:
• Determining if a customer will churn or not
• Determining if a patient is a current smoker, former smoker, or
non-smoker
Classification Problem:
Subscription   Tenure in months   Primary Phone   Churn
2-line plan    12                 Samsung S8      Yes
Family plan    36                 iPhone X        No
Individual     18                 Pixel 4A        No
(Subscription, Tenure in months and Primary Phone are the input features; Churn is the outcome.)
Supervised Learning Pipeline
1. Split the complete data into training and test/validation datasets
Use randomSplit() to split the data
2. Estimate a model on the training dataset
[Link] for Regression Problems
[Link] for Classification Problems
3. Predict using the test dataset
4. Evaluate the model using metrics of accuracy/error
[Link] for evaluating
5. Create and select the best model
[Link] for Hyper-parameter tuning