Unit 2
Big Data Analytics
Data Preprocessing
Data preprocessing is the process of transforming raw data gathered from diverse sources into a useful, efficient, and understandable format.
It is one of the most important steps in data mining.
The quality of the data must be checked before data mining algorithms are applied.
Raw data can contain many inconsistent values and a great deal of redundant information.
Most Common Problems of Raw Data
Missing Data: usually occurs when there is a problem in the collection phase, such as mistakes in data entry or a glitch caused by system downtime.
Noisy Data: erroneous data and outliers found in the data set.
The main sources of this type of data are human errors and mislabelling during data gathering.
Inconsistent Data: inconsistencies happen when similar data is kept in different formats across different files.
Why Preprocess Data?
Mistakes, redundancies, missing values, and inconsistencies all compromise the integrity of a data set.
Thus the data needs to be “cleaned” before it is processed.
Preprocessing is mainly concerned with ensuring data quality.
Quality of data
Quality can be checked along the following dimensions:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Major tasks in Data Preprocessing
1. Data Cleaning:
Removes incorrect, incomplete, and inaccurate data from data sets and replaces missing values.
i. Missing data:
Methods to handle missing data:
Ignore the tuple (discard records that contain missing values)
Fill in the missing values (e.g., manually, with a global constant, or with the attribute mean), as sketched below
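A minimal sketch of both strategies using pandas (the data frame and column names here are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [23, np.nan, 45, 31, np.nan],
    "income": [48000, 52000, np.nan, 61000, 39000],
})

# Strategy 1: ignore the tuple (drop any row that has a missing value).
dropped = df.dropna()

# Strategy 2: fill in the missing values, here with the attribute mean.
filled = df.fillna(df.mean())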
ii. Noisy Data:
Random error producing unnecessary data points
Meaningless data that cannot be interpreted by machines
Generated by faulty data collection, data entry errors, etc.
Methods to reduce noise:
Binning
Regression
Clustering
Binning: This is a method to smooth or handle noisy data.
It is applied to sorted data, which is partitioned into bins.
After binning, the noisy values in each bin can be replaced, for example by the bin mean or by the nearest bin boundary, as sketched below.
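A minimal sketch of equal-depth binning with smoothing by bin means and by bin boundaries, using NumPy (the price values are illustrative):

import numpy as np

# Binning requires sorted data.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth (equal-frequency) binning: 3 bins of 4 values each.
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 4).reshape(3, 4)

# Smoothing by bin boundaries: each value is replaced by the closer
# of its bin's minimum and maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)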
Regression
Data can be smoothed by fitting it to a regression function, which also helps decide which variables are suitable for the analysis.
The regression used might have one independent variable (linear regression) or more (multiple regression), as in the sketch below.
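A minimal sketch of smoothing with a linear regression on one independent variable, using NumPy (the data is synthetic):

import numpy as np

# Hypothetical noisy observations of y as a function of x.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Fit a line and replace each noisy y with the fitted value.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept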
Clustering
Used to group the data and find outliers: values that fall outside every cluster can be treated as noise, as in the sketch below.
Clustering is a method of unsupervised learning.
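A minimal sketch of outlier detection via clustering, assuming scikit-learn is available (the points and the distance threshold are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points; the last one lies far from every group.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
              [12.0, 0.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned cluster centre; unusually
# large distances flag likely outliers.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
outliers = X[dist > dist.mean() + 2 * dist.std()]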
Data Integration:
Process of combining multiple sources into a single data set.
Some of the problems to be considered during data integration are:
1) Schema integration: integrating the metadata from different sources.
2) Entity identification problem: identifying object instances from different databases that correspond to the same real-world entity.
Eg: a student ID in one database and a student name in another database may belong to the same entity.
3) Detecting and resolving data value conflicts: the same data taken from different databases may differ in representation when merged.
Eg: the date format in one database may be MM/DD/YYYY and in another DD/MM/YYYY, as in the sketch below.
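A minimal pandas sketch of both issues: mapping two differently named keys onto one entity, and normalising inconsistent date formats (the tables and column names are hypothetical):

import pandas as pd

# Two sources describing the same students, with the entity key
# named differently and dates stored as MM/DD/YYYY strings.
a = pd.DataFrame({"student_id": [1, 2], "enrolled": ["03/25/2021", "11/02/2021"]})
b = pd.DataFrame({"cust_id": [1, 2], "name": ["Ada", "Grace"]})

# Entity identification: unify the key names, then merge the sources.
merged = a.merge(b.rename(columns={"cust_id": "student_id"}), on="student_id")

# Data value conflict: normalise all dates to one datetime type.
merged["enrolled"] = pd.to_datetime(merged["enrolled"], format="%m/%d/%Y")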
Data reduction:
This process reduces the volume of the data to make analysis easier.
Storage space is reduced using:
Dimensionality reduction
Numerosity reduction
Data compression
Data reduction is important as it limits the data sets to the most important information.
It increases storage efficiency and reduces the money and time costs associated with working on the data.
Data reduction is a complex process and involves
several steps.
Dimensionality reduction:
Reduces the number of random variables or attributes so that the dimensionality of the data set shrinks.
Attributes are combined and merged without losing the original characteristics of the data.
This reduces storage space and computational time.
Data encoding mechanisms can be used to reduce the size further; see the sketch below.
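A minimal sketch of dimensionality reduction with principal component analysis (PCA), assuming scikit-learn is available (the data is synthetic, with 10 columns driven by only 3 underlying factors):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 10))   # 100 samples, 10 correlated attributes

# Keep just enough components to retain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)      # far fewer columns than the original 10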
Numerosity Reduction:
Here the data is made smaller by reducing its volume.
The original data is represented by a much smaller form, for example a random sample, as sketched below.
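A minimal sketch of numerosity reduction by simple random sampling with pandas (the relation is hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({"value": np.arange(1_000_000)})

# Represent the data by a 1% sample drawn without replacement.
sample = df.sample(frac=0.01, random_state=0)   # 10,000 tuples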
Data Compression:
The compression can be either lossless (the original data can be reconstructed exactly) or lossy (some information is lost, but ideally only unnecessary information is removed).
Data Cube Aggregation: data cubes are multidimensional arrays of values. Aggregation operations that derive a single value for a group of values are used.
Attribute Subset Selection: the most relevant attributes are used and the rest are discarded.
A minimum threshold can be set that all attributes have to reach to be taken into consideration, as in the sketch below.
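A minimal sketch of threshold-based attribute subset selection, assuming scikit-learn is available (the attribute values and the variance threshold are illustrative):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical attributes: the middle column is nearly constant and
# therefore carries little information.
X = np.array([[0.0, 1.00, 10.0],
              [1.0, 1.01, 20.0],
              [2.0, 0.99, 30.0],
              [3.0, 1.00, 40.0]])

# Discard attributes whose variance falls below the threshold.
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)   # keeps only columns 0 and 2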
Data Transformation: transforming the data into forms appropriate for mining.
The change made to the format or structure of the data is called data transformation.
The following are the steps in data transformation.
Smoothing: Remove noise with the help of algorithms
Aggregation: data from multiple sources is integrated and combined for data analysis; the data is stored and presented as a summary.
Normalization: the method of scaling the data so that it can be represented in a smaller, common range, such as [0, 1]; see the sketch below.
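A minimal sketch of min-max normalization, which rescales an attribute to the range [0, 1] (the age values are illustrative):

import numpy as np

ages = np.array([18, 25, 40, 60, 90], dtype=float)
scaled = (ages - ages.min()) / (ages.max() - ages.min())
# 18 maps to 0.0, 90 maps to 1.0, and every other value falls in between.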
Attribute Selection (Construction):
New attributes are created from the given ones to further organize the data set and help in data analysis.
Discretization:
Continuous data is split into intervals, which reduces the data size.
Eg: rather than specifying the exact class time, an interval such as 3 pm - 5 pm can be used, as in the sketch below.
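A minimal sketch of discretization with pandas, splitting continuous class times (24-hour clock) into labelled intervals (the times, bin edges, and labels are illustrative):

import pandas as pd

times = pd.Series([13.5, 15.25, 16.75, 9.0, 17.5])
intervals = pd.cut(times, bins=[8, 12, 15, 17, 20],
                   labels=["morning", "early pm", "3 pm - 5 pm", "evening"])
# 15.25 and 16.75 both fall into the "3 pm - 5 pm" interval.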
Concept Hierarchy Generation:
Attributes are generalized from a lower level to a higher level in a hierarchy (e.g., street to city to country).
Data Mining Primitives
[Diagram: a data mining query, expressed through the primitives below, is submitted to the data mining system.]
Task Relevant Data:
What is the data set I want to mine?
specifies the portions of the database or the set of data
in which the user is interested.
In a relational database, the set of task-relevant data can
be collected via a relational query involving operations
like selection, projection, join, and aggregation.
The data collection process results in a new data relation called the initial data relation.
Knowledge to be Mined:
This specifies the data mining functions to be
performed such as concept description, association,
correlation analysis, classification, prediction,
clustering, outlier analysis, or evolution analysis.
Background knowledge:
This knowledge about the domain to be mined is
useful for guiding the knowledge discovery process
and evaluating the patterns found
Concept Hierarchy:
Rolling Up - Generalization of data: allows data to be viewed at more meaningful and explicit abstractions, making it easier to understand. It also compresses the data, so fewer input/output operations are required.
Drilling Down - Specialization of data: higher-level concept values are replaced by lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.
Interestingness Measures
Simplicity: a measure of how easily a pattern can be comprehended by humans.
It can be defined in terms of the pattern size in bits, or the number of attributes/operators appearing in the pattern.
Certainty (Confidence): Each pattern should have a
measure of certainty associated with it that assesses the
validity of the pattern.
Given a set of task-relevant data tuples, the confidence of "A => B" is defined as
Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
Utility:
The potential usefulness of a pattern is a factor defining its
interestingness. It can be estimated by a utility function, such
as support.
Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
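For example, under hypothetical counts: if there are 10,000 task-relevant tuples, 2,000 of them contain A, and 800 contain both A and B, then Confidence(A => B) = 800 / 2,000 = 40%, while Support(A => B) = 800 / 10,000 = 8%.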
Novelty:
Novel patterns are those that contribute new information or
increased performance to the given pattern set.
Another strategy for detecting novelty is to remove redundant
patterns.
Presentation and Visualization:
This refers to the form in which discovered patterns
are to be displayed, which may include rules, tables,
cross tabs, charts, graphs, decision trees, cubes, or
other visual representations.
Users must be able to specify the forms of
presentation to be used for displaying the discovered
patterns. Some representation forms may be better
suited than others for particular kinds of knowledge.
DMQL
The Data Mining Query Language (DMQL) was proposed by Han, Fu,
Wang, et al. for the DBMiner data mining system.
This is based on Structured Query Language (SQL).
Data Mining Query Languages can be designed to support ad hoc and
interactive data mining.
It is designed with the help of Backus Naur Form (BNF) notation/
grammar.
In this notation, “[ ]” denotes an optional element (zero or one occurrence) and “{ }” denotes zero or more occurrences.
DMQL Syntax for Task Relevant Specification
use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
Characterization (kind of knowledge to be mined)
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example −
A description of customer purchasing habits:
mine characteristics as customerPurchasing
analyze count%
Discrimination (kind of knowledge to be mined)
Syntax is
mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i} where {contrast_condition_i}}
analyze {measure(s)}
Eg: comparing big spenders (customers who spend ≥ $100 on average) with budget spenders (those who spend < $100 on average):
mine comparison as purchaseGroups
for bigSpenders where avg([Link]) ≥ $100
versus budgetSpenders where avg([Link]) < $100
analyze count
Association (kind of knowledge to be mined)
Syntax is
mine associations [ as {pattern_name} ]
{matching {metapattern} }
Eg:
mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.
Classification (kind of knowledge to be mined)
Syntax is
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating:
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
Syntax is
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
Concept Hierarchy Specification
Syntax is
use hierarchy <hierarchy> for <attribute_or_dimension>
Eg:
set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
Interestingness Measures Specification
Syntax is
with <interest_measure_name> threshold = threshold_value
For Example −
with support threshold = 0.05
with confidence threshold = 0.7
Pattern Presentation and Visualization Specification
Syntax is
display as <result_form>
Eg:
display as table
Full Specification of DMQL
use database AllElectronics_db
use hierarchy location_hierarchy for [Link]
mine characteristics as customerPurchasing
analyze count%
in relevance to [Link], [Link], I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and [Link] = "Canada" and [Link] ≥ 100
with noise threshold = 5%
display as table
As the market manager of a company, you would like to characterize the buying habits of customers who purchase items priced at no less than $100, with respect to the customer's age, the type of item purchased, and the place where the item was purchased. You would like to know the percentage of customers having those characteristics. In particular, you are only interested in purchases made in Canada and paid for with an American Express credit card. You would like to view the resulting descriptions in the form of a table.
Data Mining Architectures
Data Sources: Database, World Wide Web (WWW), and
data warehouse are parts of data sources. The data in these sources
may be in the form of plain text, spreadsheets, or other forms of
media like photos or videos. WWW is one of the biggest sources of
data.
Database Server: The database server contains the actual data ready
to be processed. It performs the task of handling data retrieval as per
the request of the user.
Data Mining Engine: It is one of the core components of the data
mining architecture that performs all kinds of data mining techniques
like association, classification, characterization, clustering,
prediction, etc.
Pattern Evaluation Modules: They are responsible for finding interesting patterns in the data, and sometimes they also interact with the database server to produce the results of user requests.
Graphical User Interface: Since the user cannot fully understand the complexity of the data mining process, the graphical user interface helps the user communicate effectively with the data mining system.
Knowledge Base: Knowledge Base is an important part of the data
mining engine that is quite beneficial in guiding the search for the
result patterns. Data mining engines may also sometimes get inputs
from the knowledge base. This knowledge base may contain data
from user experiences. The objective of the knowledge base is to
make the result more accurate and reliable.
Data Mining System Architectures
Types of Data Mining architecture:
No Coupling: The no-coupling data mining architecture retrieves data directly from particular data sources. It does not use a database for retrieving the data, which would otherwise be an efficient and accurate way to do so. The no-coupling architecture is poor and only used for performing very simple data mining processes.
Loose Coupling: In the loose-coupling architecture, the data mining system retrieves data from the database and stores its results back in those systems. This architecture is meant for memory-based data mining.
Semi-Tight Coupling: It tends to use various advantageous features of the data
warehouse systems. It includes sorting, indexing, and aggregation. In this architecture,
an intermediate result can be stored in the database for better performance.
Tight Coupling: In this architecture, a data warehouse is considered one of the most important components, and its features are employed for performing data mining tasks. This architecture provides scalability, performance, and integrated information.
Data Mining Architectures
Classic Architecture: This architecture involves several phases such as data
preprocessing, data cleaning, data transformation, data mining, and result interpretation.
The classic architecture is a widely used architecture that relies on several traditional data mining techniques.
Client-Server Architecture: This architecture involves two main components: a server
and a client. The server is responsible for storing and managing the data while the client
is used to access the data and perform data mining tasks.
Parallel Architecture: This architecture involves the use of multiple processors to
perform data mining tasks simultaneously. This architecture is useful when dealing with
large and complex data sets that require a lot of computing power.
Web-Based Architecture: This architecture involves the use of web-based tools and
technologies for data mining tasks. It enables data mining to be performed remotely
through the internet.
Distributed Architecture: This architecture involves the distribution of data mining tasks
across multiple systems. It enables data mining to be performed more efficiently and
effectively by leveraging the computing power of multiple systems.
Cloud-Based Architecture: This architecture involves the use of cloud-based services
for data mining tasks. It allows users to access data mining tools and resources through
the cloud, which eliminates the need for local hardware and software.
Thank you