Machine Learning in PySpark

The document outlines the data mining process, emphasizing the importance of defining the purpose, obtaining and cleaning data, and determining the appropriate machine learning task. It details the steps involved in applying methods, evaluating performance, and deploying models, with a focus on supervised learning techniques such as regression and classification. The document also describes the supervised learning pipeline in PySpark, including data splitting, model estimation, prediction, and evaluation.

Bharti Motwani
The Data Mining Process

The data mining process consists of multiple steps, from problem definition to model deployment:

1. Define purpose
2. Obtain data
3. Explore & clean data
4. Determine DM task
5. Choose DM methods
6. Apply methods
7. Evaluate performance
8. Deploy model
Defining Purpose

• The purpose should focus on business understanding and the problem to be solved

• Managers are often not clear about what the goal of a data mining project is

• Determining the goal requires iteration between data exploration and defining the problem
Obtaining Data

• Most real-world applications combine data from multiple sources

Explore, Clean and Preprocess

Exploring, understanding and visualizing data are perhaps the most important steps in the data mining process.

Visualize and explore the data:

• Are there missing values? If yes, how should we handle them?
• Are there outliers? How should we handle them?
• Are the data summaries what we would expect? Are ranges of values reasonable?
• What does the data look like? Visualize the data using graphing techniques

Some of the key tasks that may be performed are:

• Eliminate variables or otherwise reduce the data (apply domain knowledge!)
• Transform variables (“feature engineering”)
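
The exploration questions above (missing values, outliers, data summaries) can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the slides: the `tenure` column and its values are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import statistics

# Hypothetical column of tenure values in months; None marks a missing value
tenure = [12, 36, 18, None, 24, 30, 240, 15, None, 20]

# Are there missing values?
missing = sum(1 for v in tenure if v is None)
observed = [v for v in tenure if v is not None]

# Are the data summaries what we would expect? Are the ranges reasonable?
summary = {
    "count": len(observed),
    "min": min(observed),
    "max": max(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
}

# Are there outliers? Flag values outside 1.5 * IQR of the quartiles
# (an assumed convention, chosen here only for illustration)
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
outliers = [v for v in observed if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("missing:", missing)
print("summary:", summary)
print("outliers:", outliers)
```

How to handle what this finds (drop, impute, cap, or keep) remains a judgment call that depends on domain knowledge.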
Determine Task

• Is it supervised or unsupervised learning (or something else)?

• Is it Regression? Is it Classification?
Apply Methods and Evaluate

• Typically, multiple methods are applied and their performance is compared

• Models are judged on how good they are at making predictions for test data

Training data
• Portion of the data used to develop a model

Validation data (tune!)
• Portion of the data used to assess how well the model fits
• Used to adjust parameters

Test data
• Portion of the data used only at the end of the model building and selection process
• Used to assess how well the final model performs on data that was ‘unseen’ during training
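
The three partitions can be sketched as a seeded shuffle followed by slicing. This is a minimal plain-Python illustration; the function name `split_data` and the 60/20/20 proportions are assumptions made for the example, not values given in the slides.

```python
import random


def split_data(records, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the records and cut them into training, validation, and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, validation, test


# Hypothetical dataset: 100 record ids
records = list(range(100))
train, validation, test = split_data(records)
print(len(train), len(validation), len(test))
```

Each record lands in exactly one partition, so the test part stays unseen while the model is developed on the training part and tuned on the validation part.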
Model Deployment

• Deployment is the last step of the process, after performance evaluation
Overarching Framework

Machine Learning
• Supervised Learning
  • Regression
  • Classification
• Unsupervised Learning
  • Clustering
  • Recommendation System
  • Frequent Pattern Mining
Supervised Learning

• The process of providing an algorithm with records for which an output variable of interest is known, so that the algorithm “learns” how to predict this value for new records where the output is not known
• The goal is to predict an outcome, such as purchase/no purchase, fraud/no fraud, sales, salary, and others
Supervised Learning Models
• We build a model that understands how to correctly assign a label to an example
• Supervised learning models are mathematical functions that map input data (i.e., features) to predicted outcome labels (also referred to as outcome/output/target variables)

x (input features) → f(x) (model) → y (predicted outcome)
Regression
• When the dependent variable (label) is a real number
Examples:
• Predicting sales
• Predicting the cost of coffee in 2022
Regression problem: input features → outcome
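
As a small illustration of a regression model as a function f(x) that maps an input feature to a real-valued outcome, the sketch below fits a straight line by ordinary least squares in plain Python. The advertising-spend and sales figures are hypothetical, and a single-feature linear model is only one of many possible regression methods.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training records: advertising spend (feature) and sales (label)
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(spend, sales)


def f(x):
    """The learned model: maps an input feature to a predicted real number."""
    return intercept + slope * x


# Predict the outcome for a new record whose label is not known
print(f(6.0))
```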
Classification

• When the dependent variable (label) is a specific class (i.e., category)
Examples:
• Determining if a customer will churn or not
• Determining if a patient is a current smoker, former smoker, or non-smoker
Classification problem:

Input features                                      Outcome
Subscription    Tenure in months    Primary phone   Churn
2-line plan     12                  Samsung S8      Yes
Family plan     36                  iPhone X        No
Individual      18                  Pixel 4A        No
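
To make the churn example concrete, the sketch below encodes the three records above and learns a very simple classifier from them: a single threshold on tenure (a "decision stump"). This is an illustrative stand-in, not the method used in the slides. The assumption that short tenure indicates churn, the dictionary field names, and the new customer record are all hypothetical, and three records are far too few for a real model.

```python
# The three records from the table above: input features plus the Churn label
rows = [
    {"subscription": "2-line plan", "tenure_months": 12, "primary_phone": "Samsung S8", "churn": "Yes"},
    {"subscription": "Family plan", "tenure_months": 36, "primary_phone": "iPhone X", "churn": "No"},
    {"subscription": "Individual", "tenure_months": 18, "primary_phone": "Pixel 4A", "churn": "No"},
]


def fit_stump(training_rows):
    """Pick the tenure threshold with the most correct labels on the training rows.

    Assumes (for illustration only) that tenure below the threshold means churn.
    """
    tenures = sorted({r["tenure_months"] for r in training_rows})
    candidates = [(a + b) / 2 for a, b in zip(tenures, tenures[1:])]
    best_threshold, best_correct = None, -1
    for t in candidates:
        correct = sum(
            ("Yes" if r["tenure_months"] < t else "No") == r["churn"]
            for r in training_rows
        )
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


threshold = fit_stump(rows)


def predict(row):
    """Map a record's features to a predicted class label."""
    return "Yes" if row["tenure_months"] < threshold else "No"


# Accuracy on the training records, then a prediction for a hypothetical new customer
accuracy = sum(predict(r) == r["churn"] for r in rows) / len(rows)
new_customer = {"subscription": "Individual", "tenure_months": 6, "primary_phone": "Pixel 4A"}
print(threshold, accuracy, predict(new_customer))
```

Accuracy measured on the same records used for fitting is optimistic; a held-out test set gives the honest estimate.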
Supervised Learning Pipeline
1. Split the complete data into training and test/validation datasets
   • Use randomSplit() to split the data
2. Estimate a model on the training dataset
   • [Link] for regression problems
   • [Link] for classification problems
3. Predict using the test dataset
4. Evaluate the model using accuracy/error metrics
   • [Link] for evaluating
5. Create and select the best model
   • [Link] for hyper-parameter tuning