REVIEWER

The document provides a comprehensive overview of predictive analytics and data preprocessing, detailing their importance, techniques, algorithms, and tools. It covers various types of analytics, including descriptive, predictive, and prescriptive, and discusses data preprocessing tasks such as integration, transformation, cleaning, and reduction. Additionally, it includes practical coding examples and a glossary of key terms related to data analysis.


MODULE 01: Predictive Analytics

1. Introduction to Predictive Analytics

 Predictive Analytics involves extracting previously unknown, useful information from data through analysis and modeling.

 Draws from machine learning, AI, statistics, and database systems.

 Traditional techniques may be unsuitable due to large data volumes, high dimensionality, and
heterogeneous data sources.

2. Business Analytics Overview

 Types of Business Analytics:

o Descriptive Analytics: Explains past and current events.

o Predictive Analytics: Forecasts future outcomes.

o Prescriptive Analytics: Recommends best solutions.

 Business Analytics Framework:

o Data collection from OLTP databases, ERP systems, external data sources.

o Data integration using ETL systems.

o Data warehousing and analysis through exploratory analysis, simulation, optimization.

3. Types of Predictive Analytics Algorithms

 Supervised Learning:

o Classification: Predicts categorical outcomes.

o Regression: Predicts numerical values.

 Unsupervised Learning:

o Clustering: Groups data points with similar characteristics.

o Association Analysis: Identifies relationships between variables.

o Sequential Pattern Analysis: Discovers dependencies in event sequences.

o Text Mining & Sentiment Analysis: Extracts insights from unstructured text.

4. Predictive Analytics Techniques

 Classification: Used in tasks like fraud detection and customer segmentation.

 Regression: Applied to forecasting and trend analysis.

 Clustering: Used in market segmentation and customer profiling.


 Association Rule Analysis: Commonly applied in market basket analysis.

 Sequential Pattern Analysis: Helps predict customer behavior patterns.

 Text Mining & Sentiment Analysis: Analyzes social media, reviews, and feedback.

5. Tools for Predictive Analytics

 Top tools include R, Python, RapidMiner, SAS, SPSS, and Weka.

 Selection of tools depends on data type, complexity, and use case.

6. Predictive Analytics Framework

 Problem Definition: Establish business goals and define predictive analytics objectives.

 Data Preparation: Extract, clean, and preprocess data.

 Data Exploration: Use visualization and statistical techniques to understand data.

 Modeling: Build predictive models using suitable algorithms.

 Model Evaluation: Assess model performance using metrics like accuracy and precision.

 Deployment: Implement models for real-world applications and monitor results.
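The evaluation step above can be sketched in R; the labels below are invented for illustration, not taken from the document:

```r
# Hypothetical actual vs. predicted labels for a binary classifier
actual    <- c("yes", "yes", "no", "no", "yes", "no", "no", "yes")
predicted <- c("yes", "no",  "no", "no", "yes", "yes", "no", "yes")

# Confusion matrix: rows = predicted, columns = actual
cm <- table(Predicted = predicted, Actual = actual)

accuracy  <- sum(diag(cm)) / sum(cm)              # correct / total
precision <- cm["yes", "yes"] / sum(cm["yes", ])  # TP / (TP + FP)
```

Here both accuracy and precision come out to 0.75; in practice these metrics are computed on a held-out test set, not the training data.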

7. Example Applications

 Churn Analysis in Telcos: Predict customer churn using subscriber data.

 Manpower Headcount in FMCG: Forecast staffing needs using regression models.

 Market Segmentation: Identify customer clusters for targeted marketing.

 Supermarket Basket Analysis: Optimize product placement using association rules.

 Hotline Call Reduction: Reduce call center congestion by predicting call patterns.

8. Model Deployment and Continuous Improvement

 Successful models require continuous monitoring and refinement.

 Implementation strategies range from generating reports to integrating into automated systems.

 Regular evaluation ensures alignment with business objectives and adapts to changing trends.
MODULE 02: DATA PREPROCESSING

1. Introduction to Data Preprocessing

 Data preprocessing is essential for ensuring quality data for analysis.

 Raw data is often incomplete, noisy, or inconsistent, requiring cleaning and transformation.

2. Why Data Preprocessing is Important

 Poor-quality data leads to inaccurate insights and misleading statistics.

 Data extraction, cleaning, and transformation form the bulk of data warehousing efforts.

3. Major Tasks in Data Preprocessing

 Data Integration: Combines multiple data sources into a unified dataset.

 Data Transformation: Converts data into a format suitable for analysis.

 Data Cleaning: Handles missing values, outliers, and inconsistencies.

 Data Reduction: Reduces data volume while maintaining integrity.

4. Data Integration

 Combines multiple datasets while addressing schema integration, entity identification, and
redundancy resolution.

 Joins (inner, outer, left, right) are used for data merging.

5. Data Transformation

 Normalization: Rescales data into a standard range (e.g., min-max, z-score normalization).

 Encoding & Binning:

o Encoding converts categorical data to numerical values (binary or class-based encoding).

o Binning groups numeric values into discrete intervals (equal-width or equal-depth binning).

 Aggregation & Smoothing: Summarizes data and reduces noise.
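The two encoding styles above can be sketched on a made-up categorical column (the `size` data and level order are assumptions):

```r
# Hypothetical categorical column
df <- data.frame(size = c("small", "large", "small", "medium"))

# Class-based (label) encoding: map each category to an integer code
df$size_code <- as.integer(factor(df$size, levels = c("small", "medium", "large")))

# Binary (one-hot) encoding: one 0/1 indicator column per category
onehot <- model.matrix(~ size - 1, data = df)
```

Label encoding imposes an order on the categories, so one-hot encoding is usually safer for nominal data.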

6. Data Cleaning

 Handles missing data through deletion, imputation, or placeholder values.

 Removes noise using binning, regression, or clustering.

 Identifies and resolves duplicate or inconsistent data.

7. Handling Missing Data

 Methods include:
o Ignoring rows with missing values (not ideal for large datasets).

o Filling missing values with mean, median, mode, or inference from other attributes.
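Both options can be sketched in R on a toy vector (the values are illustrative):

```r
x <- c(4, NA, 6, 8, NA, 10)            # toy data with missing values

# Option 1: drop the missing entries (loses observations)
x_drop <- x[!is.na(x)]

# Option 2: impute with the mean of the observed values
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)
```

Here the observed mean is 7, so both NA entries are replaced by 7.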

8. Handling Noisy Data

 Binning: Smooths data by averaging values within bins.

 Regression: Fits data to a regression function to remove noise.

 Clustering: Groups data and removes outliers.
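Smoothing by bin means can be sketched as follows; the sorted values are a toy example:

```r
v <- c(4, 8, 15, 21, 21, 24, 25, 28, 34)      # sorted toy values

# Equal-depth bins of 3 values each
bins <- split(v, rep(1:3, each = 3))

# Replace each value by the mean of its bin
smoothed <- unlist(lapply(bins, function(b) rep(mean(b), length(b))))
```

Each bin collapses to its mean (9, 22, 29), smoothing out within-bin noise.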

9. Outlier Detection

 Boxplots and statistical methods (IQR, standard deviation) help detect anomalies.
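A minimal IQR check in R, using an invented vector where 95 is the obvious anomaly:

```r
x <- c(10, 12, 11, 13, 12, 95)

q   <- quantile(x, c(0.25, 0.75))     # first and third quartiles
iqr <- q[2] - q[1]

# Values beyond 1.5 * IQR of the quartiles are flagged as outliers
outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
```

`boxplot(x)` applies the same 1.5 * IQR rule visually, drawing flagged points beyond the whiskers.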

10. Data Reduction

 Sampling: Uses subsets of data to improve efficiency.

o Types: Random, Stratified, Upsampling, Downsampling.

 Feature Selection: Identifies relevant features using techniques like filter, wrapper, and
embedded methods.

 Dimensionality Reduction: Removes redundant attributes using techniques like PCA.
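Random and stratified sampling from the list above can be sketched like this (the `group` column is invented):

```r
set.seed(1)                                   # reproducible draws
df <- data.frame(id = 1:100, group = rep(c("A", "B"), each = 50))

# Simple random sample of 10 rows
random_sample <- df[sample(nrow(df), 10), ]

# Stratified sample: 5 rows from each group, preserving group balance
strata     <- split(df, df$group)
stratified <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 5), ]))
```

Stratified sampling guarantees every group is represented, which simple random sampling does not.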

11. Feature Engineering

 Creates new features that improve model performance.

 Techniques include:

o Feature Extraction: Converts raw data into useful features.

o Feature Construction: Combines multiple attributes to create new ones.
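Feature construction can be as simple as combining two attributes into one; the BMI example below is hypothetical:

```r
patients <- data.frame(weight_kg = c(60, 80, 72),
                       height_m  = c(1.60, 1.80, 1.75))

# Construct a new feature from two existing attributes
patients$bmi <- patients$weight_kg / patients$height_m^2
```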

12. Case Study: Data Preprocessing with R

 Practical applications of preprocessing techniques in R include handling missing data, normalization, encoding, binning, and feature selection.

These structured notes summarize the core concepts of data preprocessing, making it easier to apply
these techniques in real-world data analysis.
CODES:

 MIN-MAX NORMALIZATION

# SOLANO, SAYANA MAE

# Format Values

options(digits = 2)

# Loading a CSV File

CLASS = read.csv ("simple.csv")

# Load First/Left Table

left_table = read.csv ("simple.csv")

# Develop the Second/Right Table

trial <- c("A","C","D")

cost <- c(11.4,3.3,1.1)

right_table <- data.frame(trial,cost)

# MERGE

# Inner Join

INNER = merge(x = left_table, y = right_table, by =c("trial"))

# Outer Join

OUTER = merge(x = left_table, y = right_table, by =c("trial"), all = TRUE)

# Left Join

LEFT = merge(x = left_table, y = right_table, by =c("trial"), all.x = TRUE)

# Right Join
RIGHT = merge(x = left_table, y = right_table, by =c("trial"), all.y = TRUE)

# (v - min()) / (max() - min()) * (new_max - new_min) + new_min

# Min-Max Normalization

left_table$mass = (left_table$mass - min(left_table$mass))/(max(left_table$mass) - min(left_table$mass))*(3-1)+1

# Add a Column for the Normalized Velocity

left_table$new_velocity = (left_table$velocity - min(left_table$velocity))/(max(left_table$velocity) - min(left_table$velocity))*(3-1)+1

 Z-SCORE STANDARDIZATION

# FAMILYNAME, GIVEN NAME

# Z-score standardization

# Format Values

options(digits = 2)

# Loading the CSV file

CLASS = read.csv(file = "simple.csv")

# (v - mean())/sd()

# New column for Mass

CLASS$NEW_MASS = (CLASS$mass - mean(CLASS$mass))/sd(CLASS$mass)

# Update the velocity

CLASS$velocity = (CLASS$velocity - mean(CLASS$velocity))/sd(CLASS$velocity)
 BINNING

# Format values to show 1 significant digit

options(digits = 1)

#Prepare a data set of related information by developing 6 vectors of 10 entries: 1 primary key, 1
categorical, 4 numeric

# Primary Key

ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Categorical Data

Nickname = c("Juan", "Thoo", "Dee", "Far", "Phi", "Sec", "Sev", "Ey", "Nine", "Ten")

# Numeric Data

Grades_Math = c(65, 99, 90, 87, 76, 66, 98, 88, 77, 82)

Grades_English = c(65, 65, 65, 99, 99, 99, 89, 80, 78, 88)

Grades_Science = c(90, 80, 70, 65, 68, 69, 70, 81, 84, 86)

Grades_Reading = c(71, 83, 95, 99, 87, 75, 88, 68, 90, 82)

# Integrate the vectors into a data frame

GRADES = data.frame(ID, Nickname, Grades_Math, Grades_English, Grades_Science, Grades_Reading)

# Do min-max normalization on the first numeric data, scale within 1-3, adding a new column

GRADES$transformed_math = (GRADES$Grades_Math - min(GRADES$Grades_Math))/(max(GRADES$Grades_Math) - min(GRADES$Grades_Math))*(3-1)+1

#2nd numeric data: Scale another column using zscore, update the column

GRADES$transformed_english = (GRADES$Grades_English - mean(GRADES$Grades_English))/sd(GRADES$Grades_English)
#3rd numeric data: Transform another column using equal width binning, 2 bins, enter the bins in the syntax

intervals = seq(min(GRADES$Grades_Science), max(GRADES$Grades_Science), length.out = 3)

GRADES$transformed_science = cut(GRADES$Grades_Science, intervals, include.lowest = TRUE, labels = c("Low Grade", "High Grade"))

#Transform the last numeric data using equal depth binning, update column, 3 bins, declare a variable for the bin

bin.three = 3

GRADES$Grades_Reading = cut(GRADES$Grades_Reading, quantile(GRADES$Grades_Reading, (0:bin.three)/bin.three), include.lowest = TRUE, labels = c("Failing", "Average", "Passing"))

Glossary of Coding Terms

 Syntax: The set of rules that define how code should be written in a programming language.

 Normalization: Process of scaling numerical values to a common range.

 Z-Score Standardization: Method of rescaling data to have a mean of 0 and a standard deviation
of 1.

 Binning: Grouping continuous numerical data into discrete categories.

 Encoding: Converting categorical data into numerical format.

 Feature Selection: Identifying the most important attributes in a dataset.

 Feature Extraction: Creating new features based on existing ones.

 Upsampling: Increasing the frequency of underrepresented data points.

 Downsampling: Reducing the frequency of overrepresented data points.

 Regression: A statistical technique for modeling relationships between variables.

 Outlier: A data point significantly different from others in the dataset.

 Clustering: Grouping data points with similar characteristics together.

 Data Imputation: Filling in missing values using statistical or algorithmic methods.

 Primary Key: A unique identifier for each record in a dataset.

 Vector: A one-dimensional array that stores elements of the same type.

 Data Frame: A table-like structure in R that stores different data types in columns.
 Quantile: Values that divide a dataset into equal-sized intervals.

 Cut Function: Used in R to divide continuous data into discrete bins.

 Categorical Data: Data that represents categories rather than numerical values.

 Numeric Data: Data represented as numbers that can be used for mathematical operations.
