Unit 2
Big Data Analytics
Data Preprocessing
Data preprocessing is the process of transforming raw data gathered from diverse sources into a useful, efficient, and understandable format.
It is one of the most important steps in data mining.
The quality of the data must be checked before data mining algorithms are applied.
Raw data can contain many inconsistent values and a great deal of redundant information.
Most Common Problems of Raw Data
Missing Data: usually occurs when there is a problem in the collection phase, such as mistakes in data entry or a glitch caused by system downtime.
Noisy Data: erroneous data and outliers found in the data set.
The main sources of this type of data are human errors and mislabelling during data gathering.
Inconsistent Data: inconsistencies happen when similar data is kept in different formats across different files.
Why Preprocess Data?
Mistakes, redundancies, missing values, and inconsistencies all compromise the integrity of a data set.
Thus the data needs to be “cleaned” before it is processed.
Preprocessing is mainly concerned with ensuring data quality.
Quality of data
Quality can be checked along the following dimensions:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Major tasks in Data Preprocessing
1. Data Cleaning:
Removes incorrect, incomplete, and inaccurate data from data sets and replaces missing values.
i. Missing data:
Methods to handle missing data:
Ignore the tuple (discard records that contain missing values)
Fill in the missing values (e.g., manually, with a global constant, or with the attribute mean), as sketched below
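A minimal sketch of both strategies using pandas (the data frame and column names here are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [23, np.nan, 45, 31, np.nan],
    "income": [48000, 52000, np.nan, 61000, 39000],
})

# Strategy 1: ignore the tuple (drop any row that has a missing value).
dropped = df.dropna()

# Strategy 2: fill in the missing values, here with the attribute mean.
filled = df.fillna(df.mean())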
ii. Noisy Data:
Random error producing unnecessary data points
Meaningless data that cannot be interpreted by machines
Generated by faulty data collection, data entry errors, etc.
Methods to reduce noise:
Binning
Regression
Clustering
Binning: This is a method to smooth or handle noisy data.
It is applied to sorted data, which is partitioned into bins.
After binning, the noisy values in each bin can be replaced, for example by the bin mean or by the nearest bin boundary, as sketched below.
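A minimal sketch of equal-depth binning with smoothing by bin means and by bin boundaries, using NumPy (the price values are illustrative):

import numpy as np

# Binning requires sorted data.
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth (equal-frequency) binning: 3 bins of 4 values each.
bins = prices.reshape(3, 4)

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 4).reshape(3, 4)

# Smoothing by bin boundaries: each value is replaced by the closer
# of its bin's minimum and maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi)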
Regression
Data can be smoothed by fitting it to a regression function, which also helps decide which variables are suitable for the analysis.
The regression used might have one independent variable (linear regression) or more (multiple regression), as in the sketch below.
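A minimal sketch of smoothing with a linear regression on one independent variable, using NumPy (the data is synthetic):

import numpy as np

# Hypothetical noisy observations of y as a function of x.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Fit a line and replace each noisy y with the fitted value.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept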
Clustering
Used to group the data and find outliers: values that fall outside every cluster can be treated as noise, as in the sketch below.
Clustering is a method of unsupervised learning.
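A minimal sketch of outlier detection via clustering, assuming scikit-learn is available (the points and the distance threshold are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points; the last one lies far from every group.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.1, 4.8], [4.9, 5.2],
              [12.0, 0.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its assigned cluster centre; unusually
# large distances flag likely outliers.
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
outliers = X[dist > dist.mean() + 2 * dist.std()]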
Data Integration:
Process of combining multiple sources into a single data set.
Some of the problems to be considered during data integration are:
1) Schema integration: integrating the metadata from different sources.
2) Entity identification problem: identifying object instances from different databases that correspond to the same real-world entity.
Eg: a student ID in one database and a student name in another database may belong to the same entity.
3) Detecting and resolving data value conflicts: the same data taken from different databases may differ in representation when merged.
Eg: the date format in one database may be MM/DD/YYYY and in another DD/MM/YYYY, as in the sketch below.
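A minimal pandas sketch of both issues: mapping two differently named keys onto one entity, and normalising inconsistent date formats (the tables and column names are hypothetical):

import pandas as pd

# Two sources describing the same students, with the entity key
# named differently and dates stored as MM/DD/YYYY strings.
a = pd.DataFrame({"student_id": [1, 2], "enrolled": ["03/25/2021", "11/02/2021"]})
b = pd.DataFrame({"cust_id": [1, 2], "name": ["Ada", "Grace"]})

# Entity identification: unify the key names, then merge the sources.
merged = a.merge(b.rename(columns={"cust_id": "student_id"}), on="student_id")

# Data value conflict: normalise all dates to one datetime type.
merged["enrolled"] = pd.to_datetime(merged["enrolled"], format="%m/%d/%Y")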
Data reduction:
This process reduces the volume of the data to make analysis easier.
Storage space is reduced using:
Dimensionality reduction
Numerosity reduction
Data compression
Data reduction is important as it limits the data sets to the most important information.
It increases storage efficiency and reduces the money and time costs associated with working on the data.
Data reduction is a complex process and involves
several steps.
Dimensionality reduction:
Reduces the number of random variables or attributes so that the dimensionality of the data set shrinks.
Attributes are combined and merged without losing the original characteristics of the data.
This reduces storage space and computational time.
Data encoding mechanisms can be used to reduce the size further; see the sketch below.
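A minimal sketch of dimensionality reduction with principal component analysis (PCA), assuming scikit-learn is available (the data is synthetic, with 10 columns driven by only 3 underlying factors):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = base @ rng.normal(size=(3, 10))   # 100 samples, 10 correlated attributes

# Keep just enough components to retain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)      # far fewer columns than the original 10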
Numerosity Reduction:
Here the data is made smaller by reducing its volume.
The original data is represented by a much smaller form, for example a random sample, as sketched below.
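A minimal sketch of numerosity reduction by simple random sampling with pandas (the relation is hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({"value": np.arange(1_000_000)})

# Represent the data by a 1% sample drawn without replacement.
sample = df.sample(frac=0.01, random_state=0)   # 10,000 tuples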
Data Compression:
The compression can be either lossless (the original data can be reconstructed exactly) or lossy (some information is lost, but ideally only unnecessary information is removed).
Data Cube Aggregation: data cubes are multidimensional arrays of values. Aggregation operations that derive a single value for a group of values are used.
Attribute Subset Selection: the most relevant attributes are used and the rest are discarded.
A minimum threshold can be set that all attributes have to reach to be taken into consideration, as in the sketch below.
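A minimal sketch of threshold-based attribute subset selection, assuming scikit-learn is available (the attribute values and the variance threshold are illustrative):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical attributes: the middle column is nearly constant and
# therefore carries little information.
X = np.array([[0.0, 1.00, 10.0],
              [1.0, 1.01, 20.0],
              [2.0, 0.99, 30.0],
              [3.0, 1.00, 40.0]])

# Discard attributes whose variance falls below the threshold.
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)   # keeps only columns 0 and 2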
Data Transformation: transforming the data into forms appropriate for mining.
The change made to the format or structure of the data is called data transformation.
The following are the steps in data transformation.
Smoothing: Remove noise with the help of algorithms
Aggregation: data from multiple sources is integrated and combined for data analysis; the data is stored and presented as a summary.
Normalization: the method of scaling the data so that it can be represented in a smaller, common range, such as [0, 1]; see the sketch below.
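A minimal sketch of min-max normalization, which rescales an attribute to the range [0, 1] (the age values are illustrative):

import numpy as np

ages = np.array([18, 25, 40, 60, 90], dtype=float)
scaled = (ages - ages.min()) / (ages.max() - ages.min())
# 18 maps to 0.0, 90 maps to 1.0, and every other value falls in between.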
Attribute Selection (Construction):
New attributes are created from the given ones to further organize the data set and help in data analysis.
Discretization:
Continuous data is split into intervals, which reduces the data size.
Eg: rather than specifying the exact class time, an interval such as 3 pm - 5 pm can be used, as in the sketch below.
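A minimal sketch of discretization with pandas, splitting continuous class times (24-hour clock) into labelled intervals (the times, bin edges, and labels are illustrative):

import pandas as pd

times = pd.Series([13.5, 15.25, 16.75, 9.0, 17.5])
intervals = pd.cut(times, bins=[8, 12, 15, 17, 20],
                   labels=["morning", "early pm", "3 pm - 5 pm", "evening"])
# 15.25 and 16.75 both fall into the "3 pm - 5 pm" interval.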
Concept Hierarchy Generation:
Attributes are generalized from a lower level to a higher level in a hierarchy (e.g., street to city to country).
Data Mining Primitives
[Diagram: a data mining query, expressed through the primitives below, is submitted to the data mining system.]
Task Relevant Data:
What is the data set I want to mine?
specifies the portions of the database or the set of data
in which the user is interested.
In a relational database, the set of task-relevant data can
be collected via a relational query involving operations
like selection, projection, join, and aggregation.
The data collection process results in a new data relation called the initial data relation.
Knowledge to be Mined:
This specifies the data mining functions to be
performed such as concept description, association,
correlation analysis, classification, prediction,
clustering, outlier analysis, or evolution analysis.
Background knowledge:
This knowledge about the domain to be mined is
useful for guiding the knowledge discovery process
and evaluating the patterns found
Concept Hierarchy:
Rolling Up - Generalization of data: allows data to be viewed at more meaningful and explicit abstractions, making it easier to understand. It also compresses the data, so fewer input/output operations are required.
Drilling Down - Specialization of data: higher-level concept values are replaced by lower-level concepts. Based on different user viewpoints, there may be more than one concept hierarchy for a given attribute or dimension.
Interestingness Measures
Simplicity: a measure of how easily a pattern can be comprehended by humans.
It can be defined in terms of the pattern size in bits, or the number of attributes/operators appearing in the pattern.
Certainty (Confidence): Each pattern should have a
measure of certainty associated with it that assesses the
validity of the pattern.
Given a set of task-relevant data tuples, the confidence of "A => B" is defined as
Confidence(A => B) = (# tuples containing both A and B) / (# tuples containing A)
Utility:
The potential usefulness of a pattern is a factor defining its
interestingness. It can be estimated by a utility function, such
as support.
Support(A => B) = (# tuples containing both A and B) / (total # of tuples)
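For example, under hypothetical counts: if there are 10,000 task-relevant tuples, 2,000 of them contain A, and 800 contain both A and B, then Confidence(A => B) = 800 / 2,000 = 40%, while Support(A => B) = 800 / 10,000 = 8%.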
Novelty:
Novel patterns are those that contribute new information or
increased performance to the given pattern set.
Another strategy for detecting novelty is to remove redundant
patterns.
Presentation and Visualization:
This refers to the form in which discovered patterns
are to be displayed, which may include rules, tables,
cross tabs, charts, graphs, decision trees, cubes, or
other visual representations.
Users must be able to specify the forms of
presentation to be used for displaying the discovered
patterns. Some representation forms may be better
suited than others for particular kinds of knowledge.
DMQL
The Data Mining Query Language (DMQL) was proposed by Han, Fu,
Wang, et al. for the DBMiner data mining system.
This is based on Structured Query Language (SQL).
Data Mining Query Languages can be designed to support ad hoc and
interactive data mining.
It is designed with the help of Backus Naur Form (BNF) notation/
grammar.
In this notation, “[ ]” denotes an optional element (zero or one occurrence) and “{ }” denotes zero or more occurrences.
DMQL Syntax for Task Relevant Specification
use database database_name
or
use data warehouse data_warehouse_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list
Characterization (kind of knowledge to be mined)
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count, sum, or count%.
For example −
A description of customer purchasing habits:
mine characteristics as customerPurchasing
analyze count%
Discrimination (kind of knowledge to be mined)
Syntax is
mine comparison [as pattern_name]
for {target_class} where {target_condition}
{versus {contrast_class_i} where {contrast_condition_i}}
analyze {measure(s)}
Eg: comparing big spenders (customers who spend ≥ $100 on average) with budget spenders (those who spend < $100 on average):
mine comparison as purchaseGroups
for bigSpenders where avg([Link]) ≥ $100
versus budgetSpenders where avg([Link]) < $100
analyze count
Association (kind of knowledge to be mined)
Syntax is
mine associations [ as {pattern_name} ]
{matching {metapattern} }
Eg:
mine associations as buyingHabits
matching P(X: customer, W) ∧ Q(X, Y) ⇒ buys(X, Z)
where X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.
Classification (kind of knowledge to be mined)
Syntax is
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating:
mine classification as classifyCustomerCreditRating
analyze credit_rating
Prediction
Syntax is
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}
Concept Hierarchy Specification
Syntax is
use hierarchy <hierarchy> for <attribute_or_dimension>
Eg:
set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
Interestingness Measures Specification
Syntax is
with <interest_measure_name> threshold = threshold_value
For Example −
with support threshold = 0.05
with confidence threshold = 0.7
Pattern Presentation and Visualization Specification
Syntax is
display as <result_form>
Eg:
display as table
Full Specification of DMQL
use database AllElectronics_db
use hierarchy location_hierarchy for [Link]
mine characteristics as customerPurchasing
analyze count%
in relevance to [Link], [Link], I.place_made
from customer C, item I, purchase P, items_sold S, branch B
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and [Link] = "Canada" and [Link] ≥ 100
with noise threshold = 5%
display as table
As the market manager of a company, you would like to characterize the buying habits of customers who purchase items priced at no less than $100, with respect to the customer's age, the type of item purchased, and the place where the item was purchased. You would like to know the percentage of customers having those characteristics. In particular, you are only interested in purchases made in Canada and paid for with an American Express credit card. You would like to view the resulting descriptions in the form of a table.
Data Mining Architectures
Data Sources: Database, World Wide Web (WWW), and
data warehouse are parts of data sources. The data in these sources
may be in the form of plain text, spreadsheets, or other forms of
media like photos or videos. WWW is one of the biggest sources of
data.
Database Server: The database server contains the actual data ready
to be processed. It performs the task of handling data retrieval as per
the request of the user.
Data Mining Engine: It is one of the core components of the data
mining architecture that performs all kinds of data mining techniques
like association, classification, characterization, clustering,
prediction, etc.
Pattern Evaluation Modules: They are responsible for finding interesting patterns in the data, and sometimes they also interact with the database server to produce the results of user requests.
Graphical User Interface: Since the user cannot fully understand the complexity of the data mining process, the graphical user interface helps the user communicate effectively with the data mining system.
Knowledge Base: Knowledge Base is an important part of the data
mining engine that is quite beneficial in guiding the search for the
result patterns. Data mining engines may also sometimes get inputs
from the knowledge base. This knowledge base may contain data
from user experiences. The objective of the knowledge base is to
make the result more accurate and reliable.
Data Mining System Architectures
Types of Data Mining architecture:
No Coupling: The no-coupling data mining architecture retrieves data directly from particular data sources. It does not use a database for retrieving the data, which would otherwise be an efficient and accurate way to do so. The no-coupling architecture is poor and only used for performing very simple data mining processes.
Loose Coupling: In the loose-coupling architecture, the data mining system retrieves data from the database and stores its results back in those systems. This architecture is meant for memory-based data mining.
Semi-Tight Coupling: It tends to use various advantageous features of the data
warehouse systems. It includes sorting, indexing, and aggregation. In this architecture,
an intermediate result can be stored in the database for better performance.
Tight Coupling: In this architecture, a data warehouse is considered one of the most important components, and its features are employed for performing data mining tasks. This architecture provides scalability, performance, and integrated information.
Data Mining Architectures
Classic Architecture: This architecture involves several phases such as data
preprocessing, data cleaning, data transformation, data mining, and result interpretation.
The classic architecture is a widely used architecture that relies on several traditional data mining techniques.
Client-Server Architecture: This architecture involves two main components: a server
and a client. The server is responsible for storing and managing the data while the client
is used to access the data and perform data mining tasks.
Parallel Architecture: This architecture involves the use of multiple processors to
perform data mining tasks simultaneously. This architecture is useful when dealing with
large and complex data sets that require a lot of computing power.
Web-Based Architecture: This architecture involves the use of web-based tools and
technologies for data mining tasks. It enables data mining to be performed remotely
through the internet.
Distributed Architecture: This architecture involves the distribution of data mining tasks
across multiple systems. It enables data mining to be performed more efficiently and
effectively by leveraging the computing power of multiple systems.
Cloud-Based Architecture: This architecture involves the use of cloud-based services
for data mining tasks. It allows users to access data mining tools and resources through
the cloud, which eliminates the need for local hardware and software.
Thank you