REVIEWER
Predictive Analytics is the process of extracting previously unknown, useful information from data through data analysis.
Traditional techniques may be unsuitable due to large data volumes, high dimensionality, and
heterogeneous data sources.
o Data collection from OLTP databases, ERP systems, and external data sources.
Supervised Learning: Builds models from labeled data to predict a known target (e.g., classification, regression).
Unsupervised Learning: Discovers hidden patterns in unlabeled data (e.g., clustering, association rules).
o Text Mining & Sentiment Analysis: Extracts insights from unstructured text such as social media posts, reviews, and customer feedback.
Problem Definition: Establish business goals and define predictive analytics objectives.
Model Evaluation: Assess model performance using metrics like accuracy and precision.
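As a quick illustration, accuracy and precision can be computed directly in R from predicted versus actual labels (the vectors below are made-up example data, not from the notes):

```r
# Hypothetical actual vs. predicted labels for a binary classifier
actual    = c(1, 1, 0, 0, 1, 0, 1, 0)
predicted = c(1, 0, 0, 0, 1, 1, 1, 0)

TP = sum(predicted == 1 & actual == 1)  # true positives
FP = sum(predicted == 1 & actual == 0)  # false positives
TN = sum(predicted == 0 & actual == 0)  # true negatives
FN = sum(predicted == 0 & actual == 1)  # false negatives

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # share of all correct predictions
precision = TP / (TP + FP)                   # share of predicted positives that are correct
```

Here accuracy = 6/8 = 0.75 and precision = 3/4 = 0.75.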
7. Example Applications
Hotline Call Reduction: Reduce call center congestion by predicting call patterns.
Regular evaluation ensures alignment with business objectives and adapts to changing trends.
MODULE 02: DATA PREPROCESSING
Raw data is often incomplete, noisy, or inconsistent, requiring cleaning and transformation.
Data extraction, cleaning, and transformation form the bulk of data warehousing efforts.
4. Data Integration
Combines multiple datasets while addressing schema integration, entity identification, and
redundancy resolution.
Joins (inner, outer, left, right) are used for data merging.
5. Data Transformation
Normalization: Rescales data into a standard range (e.g., min-max, z-score normalization).
6. Data Cleaning
Methods include:
o Ignoring rows with missing values (not ideal for large datasets).
o Filling missing values with mean, median, mode, or inference from other attributes.
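A minimal R sketch of the mean-imputation approach (the `velocity` vector is a made-up example):

```r
# Hypothetical measurements with one missing value
velocity = c(12, 15, NA, 18, 15)
# Fill the missing entry with the mean of the observed values
velocity[is.na(velocity)] = mean(velocity, na.rm = TRUE)
```

The mean of the observed values is 15, so the `NA` is replaced by 15; median or mode imputation works the same way with `median()` or a mode helper.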
9. Outlier Detection
Boxplots and statistical methods (IQR, standard deviation) help detect anomalies.
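The IQR rule (the same rule boxplots use for whiskers) can be sketched in R as follows; the `scores` vector is made-up illustration data:

```r
# Hypothetical sample containing one extreme value
scores = c(65, 70, 72, 75, 78, 80, 82, 150)

Q1 = quantile(scores, 0.25)
Q3 = quantile(scores, 0.75)
IQR_val = Q3 - Q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = Q1 - 1.5 * IQR_val
upper = Q3 + 1.5 * IQR_val
outliers = scores[scores < lower | scores > upper]
```

For this sample only the value 150 falls outside the fences; `boxplot(scores)` would draw it as a separate point.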
Feature Selection: Identifies relevant features using filter, wrapper, and embedded methods.
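A filter method, the simplest of the three, scores each feature independently of any model. A minimal sketch in R, ranking features by absolute correlation with the target (all data and the 0.5 threshold are made-up assumptions):

```r
# Hypothetical data: one informative feature (x1), one noise feature (x2)
set.seed(42)
x1 = 1:20
x2 = rnorm(20)
y  = 2 * x1 + rnorm(20, sd = 0.5)
df = data.frame(x1, x2, y)

# Filter method: score each candidate feature by |correlation| with the target
feature_scores = sapply(df[, c("x1", "x2")], function(col) abs(cor(col, df$y)))
selected = names(feature_scores)[feature_scores > 0.5]  # keep strongly correlated features
```

Wrapper methods would instead search feature subsets by refitting a model, and embedded methods select during training (e.g., lasso penalties).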
These structured notes summarize the core concepts of data preprocessing, making it easier to apply
these techniques in real-world data analysis.
CODES:
MERGE
# Format values to 2 significant digits
options(digits = 2)
# Hypothetical tables sharing the key column "trial"
left_table = data.frame(trial = c(1, 2, 3), result = c("A", "B", "C"))
right_table = data.frame(trial = c(2, 3, 4), score = c(10, 20, 30))
# Inner Join: keep only rows with matching keys
INNER = merge(x = left_table, y = right_table, by = c("trial"))
# Outer Join: keep all rows from both tables
OUTER = merge(x = left_table, y = right_table, by = c("trial"), all = TRUE)
# Left Join: keep all rows from the left table
LEFT = merge(x = left_table, y = right_table, by = c("trial"), all.x = TRUE)
# Right Join: keep all rows from the right table
RIGHT = merge(x = left_table, y = right_table, by = c("trial"), all.y = TRUE)
MIN-MAX NORMALIZATION
# Formula: (v - min(v)) / (max(v) - min(v)) * (new_max - new_min) + new_min
Z-SCORE STANDARDIZATION
# Format values to 2 significant digits
options(digits = 2)
# Formula: (v - mean(v)) / sd(v)
CLASS$z_velocity = (CLASS$velocity - mean(CLASS$velocity)) / sd(CLASS$velocity)
BINNING
options(digits = 1)
# Prepare a data set of related information by developing 6 vectors of 10 entries: 1 primary key, 1 categorical, 4 numeric
# Primary Key
ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# Categorical Data
Nickname = c("Juan", "Thoo", "Dee", "Far", "Phi", "Sec", "Sev", "Ey", "Nine", "Ten")
# Numeric Data
Grades_Math = c(65, 99, 90, 87, 76, 66, 98, 88, 77, 82)
Grades_English = c(65, 65, 65, 99, 99, 99, 89, 80, 78, 88)
Grades_Science = c(90, 80, 70, 65, 68, 69, 70, 81, 84, 86)
Grades_Reading = c(71, 83, 95, 99, 87, 75, 88, 68, 90, 82)
# Combine the six vectors into a data frame
GRADES = data.frame(ID, Nickname, Grades_Math, Grades_English, Grades_Science, Grades_Reading)
# Do min-max normalization on the first numeric data, scale within 1-3, adding a new column
GRADES$transformed_math = (GRADES$Grades_Math - min(GRADES$Grades_Math)) /
  (max(GRADES$Grades_Math) - min(GRADES$Grades_Math)) * (3 - 1) + 1
# 2nd numeric data: Scale another column using z-score, update the column
GRADES$transformed_english = (GRADES$Grades_English - mean(GRADES$Grades_English)) /
  sd(GRADES$Grades_English)
# 3rd numeric data: Transform another column using equal width binning, 2 bins, enter the bins in the syntax
GRADES$transformed_science = cut(GRADES$Grades_Science, breaks = 2, labels = c(1, 2))
# Transform the last numeric data using equal depth binning, update column, 3 bins, declare a variable for the bin
bin.three = 3
GRADES$Grades_Reading = cut(GRADES$Grades_Reading,
  breaks = quantile(GRADES$Grades_Reading, probs = seq(0, 1, length.out = bin.three + 1)),
  include.lowest = TRUE, labels = FALSE)
Syntax: The set of rules that define how code should be written in a programming language.
Z-Score Standardization: Method of rescaling data to have a mean of 0 and a standard deviation of 1.
Data Frame: A table-like structure in R that stores different data types in columns.
Quantile: Values that divide a dataset into equal-sized intervals.
Categorical Data: Data that represents categories rather than numerical values.
Numeric Data: Data represented as numbers that can be used for mathematical operations.