0% found this document useful (0 votes)
33 views

Data Science Exercises

Uploaded by

citprocessteam
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
33 views

Data Science Exercises

Uploaded by

citprocessteam
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 22

NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

NAME PLACENTE, JOHN CABRERA

RAPIDMINER: HOW TO ACCESS YOUR DATA

EXERCISE 1: Read data. Prepare CDSA_EX01.xlsx


Objectives:
1. Read CDSA_EX01.xlsx
2. Import in Rapidminer
3. Read and show DATA RESULT
4. Show statistics report.

OUTPUT NO. 01: SHOW YOUR DATA RESULT SCREENSHOT.

OUTPUT NO. 02: SHOW YOUR STATISTICS REPORT.

EXERCISE #2: MODIFYING YOUR FIRST CHART

Use your CDSA_EX01 to show a LINE CHART.


X: TRANSACTION
Y: REPAIRTIME

OUTPUT 1: SHOW RESULT

OUTPUT 2: CUSTOMIZING THE CHART


Title: REPAIR TIME – SERVICES
Subtitle: PHILIPPINES

 SHOW MARKERS (CIRCLE)


 CHANGE THE COLOR TO RED
 SHOW LABEL

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

EXERCISE #3: SAVING YOUR PROCESS.

OBJECTIVE:
1. Create a repository CDSA-CMDI.
2. Save your process.
3. Name the process CDSA-EX01PROCESS

ACTIVITY OUTPUT: Write the step-by-step procedure on how to create repository and save your process.
Include screenshots.

STEP ACTIVITY OWNER DURATION

EXERCISE #4. LET’S DO MORE ANALYSIS

A. Load the data file CDSA_EX02.xlsx


B. Answer the following questions:

1. There are more male customers than female. True or false? Show evidence.
FEMALE

2. There are more customers aged 50 and above. True or false? Show evidence. TRUE

3. The average gross sales is 69883.38. True or false? Show evidence.

4. The maximum cost of goods sold (COGS) is PHP 39,134.69. True or false? Show evidence.

5. What is the average customer satisfaction?

6. How many are the customers in Luzon, Visayas and Mindanao?

EXERCISE #5. LET’S DO MORE CHARTING

A. Using CDSA_EX02.xlsx, perform the following charts.

1. Create a bar chart and line chart.


Bar Chart: Gross Sales
Line Chart: COGS

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

By: AGE

Result:

2. Create Boxplot of Gross Sales


Group By: Location

Result

3. Create donut chart of NETSALES


Group by Age

Result

4. Create SCATTER/Bubble plot


X-axis: Age
Value Column: Gross Sales
Color by: Location

Result:

Questions:
1. What are the ages of customers from Mindanao?

5. Create a heat map of the average satisfaction of customers


Group by AGE.

Result:

CASE STUDY NO. 01

AYO-AYO BANK OF ASIA is studying its business performance in the Philippines. A group of data scientists
now studies the data sets given to them by the corporation. Open the data set and provide the company
with an insight. What do you see and what can you share as a data scientist?

FILE: CDSA_EX03.xlsx

CASE STUDY NO. 02

What is causing the high reject in the factory?

FILE: CDSA_EX03B.xlsx

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

EXERCISE #6: FILTERING OF DATA

A. LOAD DATA CDSA_EX03.xls

OBJECTIVE: GET ALL LOAN DATA ONLY


OPERATOR: Use FILTER EXAMPLES

RESULT:

EXERCISE #7: REMOVING OF DUPLICATE DATA

A. LOAD DATA CDSA_EX04.xls

OBJECTIVE:
1. Remove duplicate data from the file.

RESULT:

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

SHOW NEW DATA WITHOUT THE DUPLICATE

SHOW THE REMOVED DATA SET

EXERCISE #8: WRITE THE NEW DATA TO THE DATA SET

A. From exercise #7, dump the duplicate data to an EXCEL FILE

OPERATOR: WRITE EXCEL

1. CREATE A BLANK EXCEL FILE, “WRITEEXAMPLE2.XLSX”


2. USE THE OPERATOR

RESULT:

EXERCISE #10: LOOK FOR MISSING DATA

A. FROM THE FILE CDSA_EX05.XLSX

OBJECTIVE: USE THE OPERATOR FILTER EXAMPLES TO LOOK FOR MISSING DATA SETS.

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

RESULTS:

EXERCISE #11: GENERATING ATTRIBUTES 1

A. Given the file CDSA_EX02.xlsx:


CREATE COLUMN “RESULT”
If the CSAT is less than 7, then state “FAILED”
If the CSAT is 7 or more, then state “PASS”

Use the operator GENERATE ATTRIBUTE.

RESULT:

EXERCISE #12: GENERATING ATTRIBUTES 1

A. Using CDSA_EX06.xlsx
Compute for CSAT if

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

CSAT = 0.3 X CONTENT + 0.6 X TRAINER + 0.1 X VENUE

Use GENERATE ATTRIBUTE to produce results.

RESULT:

EXERCISE #13: GENERATE COPY OF COLUMNS

A. Using CDSA_EX06.xlsx

Generate a copy of the column CONTENT and name it as CONTENT2.

RESULT:

EXERCISE #14: GENERATE CONCATENATION

A. USING CDSA_EX02, CONCATENATE CUSTOMER AND AGE into CUSTOMER_AGE

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

DATA SCIENCE MAJOR TOOLS

DECISION TREES

 Gives you a TREE based partition of data sets


 Normally this is used to describe the data set thru trees
 Used in root cause analysis

Example:

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

EXERCISE #15: GENERATE DECISION TREES (EXAMPLE 1)

A. Given CDSA_EX07, determine who are giving failing grade to transactions.

USE DECISION TREES.

STUDY RESULTS:

Level 2:

Level 3:

Level 4:

EXERCISE #16: DETERMINE FACTORS THAT AFFECTS HIGH SUGAR LEVEL (EXAMPLE 2)

A. Given CDSA_EX08_DIABETES, determine what factors affects SUGAR LEVEL

USE DECISION TREES.

EXERCISE #17: DETERMINE CHURN FACTORS (EXAMPLE 3)

A. Given CDSA_EX09, determine who what factors affects CHURN (Uumaalis)

USE DECISION TREES.

EXERCISE #18: INSURANCE CHARGES (EXAMPLE 4)

A. Given CDSA_EX10, determine who what factors affects INSURANCE CHARGES

USE DECISION TREES.

 Determine outliers in the data sets

METHOD 1: Use boxplot to detect outliers.

METHOD 2: Use operator DETECT OUTLIER (Distances)

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

EXERCISE #19: OUTLIER DETECTION

A. Given CDSA_EX11, detect the outliers.

Use DETECT OUTLIER (Distances)

METHOD 3: Use statistical process control

UCL = Mean + 3 x stdev


LCL = Mean – 3 x stdev
UCL = Upper control limit
LCL = Lower control limit

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

EXERCISE #20: OUTLIER DETECTION USING STATISTICAL PROCESS CONTROL

A. Given CDSA_EX11, create an SPC chart to monitor outlier

*** NOTE: Use the first 30 CLEAN dataset as baseline for UCL and LCL
RESULT

NELSON RULES FOR OUTLIER DETECTION / ANOMALY DETECTION


Instructions: Draw an example per each rule.
1 | One point is beyond Zone A.
2 | Nine points in a row in Zone C or beyond.
3 | Six points in a row steadily increasing or decreasing
4 | Fourteen points in a row alternating up and down
5 | Two out of three points in a row in Zone A or beyond
6 | Four out of five points in a row in Zone B or beyond
7 | Fifteen points in a row are in Zone C
8 | Eight points in a row on both sides of centerline with none in Zone C.

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

LINEAR REGRESSION

USE LINEAR REGRESSION to determine relationship between a response and factor, such as SALES AND
MARKETING.

EXAMPLE: For the dataset CDSA_EX18.XLSX, can you determine whether marketing is effective in
increasing sales. Also predict sales, if for example, I spend PHP 1,000,000 for marketing– how much is
the predicted sales.

SHOW RESULTS:

Also show accuracy of the performance model.

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

POLYNOMIAL REGRESSION

USE POLYNOMIAL REGRESSION to determine relationship between a response and factor, such as SALES
AND MARKETING.

EXAMPLE: For the dataset LINEARREGRESSION01.XLSX, determine the relationship between YEARS OF
SERVICE AND SALARY

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

SHOW RESULTS:

Also show accuracy of the performance model.

GENERALIZED LINEAR MODELLING

LINEAR REGRESSION AND MULTIPLE LINEAR REGRESSION ANALYSIS


 Helps determine factors that affects outcomes
 Helps in forecasting values based on factors
 Determines what factors are important to response

EXAMPLE DATA:

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

EXERCISE #21: GENERAL LINEAR MODELLING (EXCEL DATA ANALYSIS)

Use general linear modelling to determine the relationship of factors to response MSAT (Marriage
Satisfaction).

DATA: CDSA_EX12.xlsx

USE EXCEL DATA ANALYSIS

RESULTS:

EXERCISE #21: GENERAL LINEAR MODELLING (USING RAPIDMINER)

Use general linear modelling to determine the relationship of factors to response MSAT (Marriage
Satisfaction).

DATA: CDSA_EX12.xlsx

USE RAPIDMINER

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

RESULTS:

MODEL PERFORMANCE TESTING AND VALIDATION

Analysis models are measured via ABSOLUTE ERROR, CORRELATION AND SQUARED
CORRELATION including all other possible measures.

HOW DO WE MEASURE THE PERFORMANCE OF A CERTAIN MODEL?

EXERCISE #22: MEASURE PERFORMANCE OF THE GENERAL LINEAR MODEL (USING PEFORMANCE)

Use general linear modelling to determine the relationship of factors to response MSAT (Marriage
Satisfaction). Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)

DATA: CDSA_EX12.xlsx

USE RAPIDMINER

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

SHOW RESULTS:

Is the model accurate?

NEURAL NETWORK MODELLING

EXERCISE #23: MEASURE PERFORMANCE OF THE NEURAL NETWORK (USING PEFORMANCE)

Use NEURAL NETWORK to determine the relationship of factors to response MSAT (Marriage
Satisfaction). Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)

DATA: CDSA_EX12.xlsx

USE RAPIDMINER

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

RESULTS:

Show the accuracy of the performance model NEURAL NET.

Compare to General Linear Model.

DEEP LEARNING MODELLING

EXERCISE #24: MEASURE PERFORMANCE OF THE NEURAL NETWORK (USING PEFORMANCE)

Use DEEP LEARNING to determine the relationship of factors to response MSAT (Marriage Satisfaction).
Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)

DATA: CDSA_EX12.xlsx

USE RAPIDMINER

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

RANDOM FOREST MODELLING

EXERCISE #25: MEASURE PERFORMANCE OF THE RANDOM FOREST (USING PEFORMANCE)

Use RANDOM FOREST to determine the relationship of factors to response MSAT (Marriage Satisfaction).
Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)

DATA: CDSA_EX12.xlsx

USE RAPIDMINER

DECISION TREE MODELLING

EXERCISE #25: MEASURE PERFORMANCE OF THE DECISION TREE FOREST (USING PEFORMANCE)

Use RANDOM FOREST to determine the relationship of factors to response MSAT (Marriage Satisfaction).
Then determine the model performance using OPERATOR: PERFORMANCE (REGRESSION)

DATA: CDSA_EX12.xlsx

USE RAPIDMINER

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

CASE STUDY #3: LOAN SCORE PREDICTION

EXERCISE #25: MEASURE PERFORMANCE USING DIFFERENT MODELS

GIVEN the file CDSA_EX14, determine the LOAN SCORE PREDICTION model and its performance if the
following models are used:

A. DECISION TREE
B. GLM
C. NEURAL NET
D. RANDOM FOREST
E. DEEP LEARNING
F. GLM VIA EXCEL

Create a table and show absolute error, relative error, correlation and squared correlation.
Determine which factors affect the loan score.

CDAA
DBMA
CDRA
CDSA

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

https://www.nievgen.com/CDAA.html

MODEL ABSOLUTE RELATIVE CORRELATION SQUARED


ERROR ERROR CORRELATION
DECISION T.
GLM
NEURAL
R. FOREST
DEEP LEARNING

HYPOTHESIS TESTING: COMPARISON OF MEANS

In data science modelling, we statistically and probabilistically compare the performance


against a target or against another model.

ONE SAMPLE T

Given the file CDSA_E16.xlsx, determine if the Absolute Percentage Error meets the 3.0% error
criterion/requirement of the company.

Use SPCforExcel in the process.

SHOW RESULT:

Do they meet the requirement in forecast model performance?

TWO SAMPLE T

Given the file CDSA_E17.xlsx, compare the model performance between NEURAL NET and DEEP
LEARNING. Which one is better?

Use SPCforExcel in the process.

SHOW RESULT:

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE
NIEVGEN DATA SCIENCE DEPARTMENT | CDSA

PAIRED T

Given the file CDSA_E19.xlsx, determine if the employee performance survey improved?
Use SPCforExcel in the process.

SHOW RESULT:

ONE WAY ANOVA

Given the file CDSA_E20.xlsx, determine if the MODELS ARE EQUAL OR NOT

SHOW RESULT:

ENGR. JOHN C. PLACENTE | ALL RIGHTS RESERVED | DO NOT SHARE WITH ANYONE

You might also like