
GA 5103

Data Warehousing
and
Data Mining

Dr Seema Agarwal
Data Mining – UNIT II

• Introduction
• Data
• Types of Data
• Data Mining Functionalities
• Interestingness of Patterns
• Classification of Data Mining Systems
• Data Mining Task Primitives
• Integration of a Data Mining System with a Data Warehouse
• Issues in Data Mining
• Data Preprocessing
2
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes to
exabytes, etc.
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific
simulation, …
• Society and everyone: news, digital cameras,
3
We are drowning in data, but starving for knowledge!
Evolution of Database Technology
• 1960s:
– Data collection, database creation, IMS and network DBMS
• 1970s:
– Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems

4
What Is Data Mining?

• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
• Alternative name
– Knowledge discovery in databases (KDD)
5
Why Data Mining?—Some Potential Applications

• Data analysis and decision support


– Market analysis and management
• Target marketing, customer relationship management
(CRM), market basket analysis, market segmentation
– Risk analysis and management
• Forecasting, customer retention, quality control,
competitive analysis
– Fraud detection and detection of unusual patterns
(outliers)
6
Market Analysis and Management

• Where does the data come from?


– Credit card transactions, discount coupons,
customer complaint calls
• Target marketing
– Find clusters of “model” customers who share the
same characteristics: interest, income level,
spending habits, etc.
– Determine customer purchasing patterns over time

7
Market Analysis and Management

• Cross-market analysis
– Associations/correlations between product sales, and
prediction based on such associations
• Customer profiling
– What types of customers buy what products
• Customer requirement analysis
– Identifying the best products for different
customers
– Predict what factors will attract new customers
8
Fraud Detection & Mining Unusual Patterns

• Approaches: Clustering & model construction for frauds, outlier analysis


• Applications: Health care, retail, credit card service, telecomm.
– Medical insurance
• Professional patients and rings of doctors
• Unnecessary or correlated screening tests
– Telecommunications:
• Phone call model: destination of the call, duration, time of day or
week. Analyze patterns that deviate from an expected norm
– Retail industry
• Analysts estimate that 38% of retail shrink is due to dishonest
employees

9
Data
• Collection of data objects and their attributes
• An object is an entity
– e.g., in a sales database: customers, store items, sales
– in a university DB: students, faculty, courses
• An attribute is a property or characteristic of an object; it is a data field
– Examples: eye color of a person, height, etc.
– Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
– Object is also known as record, point, case, sample, entity, or instance
Types of Attributes
There are different types of attributes.
• Nominal attributes are symbols or names of things. Each value represents
some kind of category, code, or state, so nominal attributes are also
referred to as categorical.
• Examples: eye color, pin codes
Eye color – black, brown, blue, grey
• Even if nominal attributes are numeric, mathematical operations on them
are not meaningful and they are not meant to be used quantitatively,
i.e., no mean, no median; only the frequency of occurrence is of importance.
Binary attributes
A binary attribute is an attribute with only two states: 0 (absent) or 1 (present).
It is a nominal attribute with only two states; also called Boolean.
• Symmetric if both states are equally valuable and carry the same weight,
e.g., gender: male – 0, female – 1
• Asymmetric if the outcomes of the states are not equally important,
e.g., a test for malaria: positive or negative
Ordinal
An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values may not be known.
• Examples:
Rankings (e.g., taste of potato chips on a scale from 1–10)
Height in {tall, medium, short}
Grades {A+, A, A-, B+, B, C+, C}
Customer satisfaction {satisfied, neutral, dissatisfied}
• Ordinal attributes may be obtained from discretization of numeric
quantities by splitting the value range into a finite number of ordered
categories. Mode and median can be defined.

Nominal, binary, and ordinal attributes are qualitative.
Numeric Attributes
• A numeric attribute is quantitative, i.e., a measurable quantity
represented as an integer or real value. Numeric attributes can be
interval-scaled or ratio-scaled.
• Interval-scaled
The data have the properties of ordinal data, and the interval between
observations is expressed in terms of a fixed unit of measure.
Interval data are always numeric.
Examples: calendar dates, temperatures in Celsius or Fahrenheit
• Differences between values are meaningful; for example, for temperature
(in Fahrenheit) the distance from 30 to 40 is the same as the distance
from 60 to 70.
• Note that ratios don't make any sense: 60 degrees is not twice as hot as
30 degrees (although the attribute values are in a 2:1 ratio).
• Temperature in Celsius or Fahrenheit has no true zero point, i.e.,
0 degrees does not indicate "no temperature".
Ratio scaled
• The data have all the properties of interval data
and the ratio of two values is meaningful.
• Variables such as distance, height, weight, and
time use the ratio scale.
• This scale must contain a zero value that indicates
that nothing exists for the variable at the zero point.
• Mean, median, and mode can be computed
• Examples: temperature in Kelvin, length, time,
counts
Ratio scaled (continued)
• Examples:
– 0 degrees Kelvin – the point at which there is no kinetic energy
– Years of experience
– Number of words

What about age? What about IQ?
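
To make the distinction concrete, here is a minimal Python sketch (the values and variable names below are invented for illustration, not taken from the slides) showing which summary statistics are meaningful for each attribute type:

from statistics import mean, median, mode

eye_color = ["brown", "blue", "brown", "grey", "brown"]            # nominal
satisfaction = ["neutral", "satisfied", "dissatisfied",
                "satisfied", "neutral"]                            # ordinal
height_cm = [162.0, 175.5, 158.0, 181.2, 169.4]                    # ratio-scaled

print(mode(eye_color))      # 'brown' -- for nominal data only frequencies matter

# For ordinal data the median is defined through the ranking, not the labels
rank = {"dissatisfied": 0, "neutral": 1, "satisfied": 2}
label = {v: k for k, v in rank.items()}
print(label[int(median(rank[s] for s in satisfaction))])           # 'neutral'

print(mean(height_cm))      # the mean is meaningful for ratio-scaled data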
Qualitative and Quantitative Data

Data can be qualitative or quantitative.
The appropriate statistical analysis depends on whether the data for the
variable are qualitative or quantitative.
There are more options for statistical analysis when the data are
quantitative.
Qualitative Data
Labels or names used to identify an attribute of each element,
e.g., black or white, male or female.
Referred to as categorical data.
Use either the nominal or ordinal scale of measurement.
Can be either numeric or nonnumeric.
Appropriate statistical analyses are rather limited.
Quantitative Data

Quantitative data indicate how many or how much:
– Discrete, if measuring how many, e.g., number of burgers consumed at a
farewell party.
– Continuous, if measuring how much, e.g., pounds of burger consumed at a
tailgate party.
Quantitative data are always numeric.
Ordinary arithmetic operations are meaningful for quantitative data.
Scales of Measurement

• Data
– Qualitative
• Numerical: Nominal, Ordinal
• Non-numerical: Nominal, Ordinal
– Quantitative
• Numerical: Interval, Ratio


Data Mining: A KDD Process

• Data mining is the core of the knowledge discovery process.
• Flow of the KDD process:
Databases → (Data Cleaning, Data Integration) → Data Warehouse →
(Selection) → Task-relevant Data → Data Mining → Pattern Evaluation
22
Steps of a KDD Process
• Learning the application domain
– Relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of effort!)
– to remove noise or irrelevant data
• Data reduction and transformation
– Find useful features, dimensionality/variable reduction.
– by performing summary or aggregation operations
• Choosing functions of data mining
– Summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
– Visualization, transformation, removing redundant patterns, etc.
– knowledge representation techniques are used to present the mined
knowledge to the user

• Use of discovered knowledge 23


Architecture: Typical Data Mining System

Layers (top to bottom):
• Graphical user interface
• Pattern evaluation
• Data mining engine (interacting with the Knowledge base)
• Database or data warehouse server
• Data cleaning, data integration, and filtering
• Databases and Data Warehouse
24
Architecture of a typical data mining
system.
The architecture of a typical data mining system may have the
following major components
1. Database, data warehouse, or other information repository. This
is one or a set of databases, data warehouses, spreadsheets, or
other kinds of information repositories. Data cleaning and data
integration techniques may be performed on the data.
2. Database or data warehouse server. The database or data
warehouse server is responsible for fetching the relevant data,
based on the user's data mining request.
3. Knowledge base. This is the domain knowledge that is used to
guide the search, or evaluate the interestingness of resulting
patterns. Such knowledge can include concept hierarchies, used
to organize attributes or attribute values into different levels of
abstraction.
4. Data mining engine. This is essential to the data mining system and
ideally consists of a set of functional modules for tasks such as
characterization, association analysis, classification, evolution and
deviation analysis.

5. Pattern evaluation module. This component typically employs
interestingness measures and interacts with the data mining modules
so as to focus the search towards interesting patterns. It may access
interestingness thresholds stored in the knowledge base.
Alternatively, the pattern evaluation module may be integrated with
the mining module, depending on the implementation of the data
mining method used.

6. Graphical user interface. This module communicates between users
and the data mining system, allowing the user to interact with the
system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory
data mining based on the intermediate data mining results.
Data Mining: On What Kinds of Data?
Relational databases: using SQL queries and aggregation functions (sum,
average, max, min)
• Search for trends or patterns, e.g., analyze customer data to predict the
credit risk of new customers based on income, age, and credit history.
• Detect deviations.
Data warehouse: data stored around subjects (customers, items,
suppliers) – gives a historical perspective.
Example: you are a manager of AllElectronics in charge of sales in the
United States and Canada, and you would like to study the buying trends of
customers in Canada. Rather than mining the entire database, you can
select only the data relevant to this task; these are referred to as
relevant attributes. Data warehouses support OLAP roll-up and drill-down
operations along dimensions. Such operations allow data patterns to be
expressed from different angles of view and at multiple levels of
abstraction.
27
• Transactional database
– Online transactions are getting captured
– Typically transaction id & items in transaction
– Basket analysis – which items are sold together ? Bundle
items
• Advanced database and information repository
– Spatial and temporal data - maps
– Time-series data – stock exchange
– Data Streams – surveillance and sensor data
– Multimedia database – images , videos and audio
– WWW – huge repository
– Text data – user comments
Data Mining Functionalities
• Data mining functionalities are used to specify the kind
of patterns to be found in data mining tasks. In
general, data mining tasks can be classified into two
categories
Descriptive and Predictive.
• Descriptive mining tasks characterize the general
properties of the data in the database.
• Predictive mining tasks perform inference on the
current data in order to make predictions.

31
Concept/class description
• Characterization and Discrimination
• Data can be associated with classes or concepts. For
example, in the AllElectronics store,
• classes of items for sale include computers and
printers,
• and concepts of customers include bigSpenders and
budgetSpenders.
• It can be useful to describe individual classes and
concepts in summarized, concise, and yet precise
terms.
Such descriptions of a class or a concept are called
class/concept descriptions.
• These descriptions can be derived via
(1) data characterization, by summarizing the
data of the class under study (often called
the target class) in general terms,
(2) data discrimination, by comparison of the
target class with one or a set of comparative
classes (often called the contrasting classes),
(3) both data characterization and
discrimination.
Data characterization
• Data characterization is a summarization of the general
characteristics or features of a target class of data.
• The data corresponding to the user-specified class are
typically collected by a database query.
• The data are then run through a summarization module to extract the
essence of the data at different levels of abstraction
• The data cube- based OLAP roll-up operation can be used
to perform user-controlled data summarization along a
specified dimension
• The output of data characterization can be presented in
various forms. Examples include pie charts, bar charts,
curves, multidimensional data cubes, and
multidimensional tables.
• For example, one may want to characterize the
OurVideoStore customers who regularly rent
more than 30 movies a year. With concept
hierarchies on the attributes describing the target
class, the attribute-oriented induction method
can be used, for example, to carry out data
summarization.
• With a data cube containing a summarization of the data,
simple OLAP operations fit the purpose of data
characterization.
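
As a rough, hypothetical analogue of such user-controlled summarization along a dimension, the following pandas sketch selects the target class (customers renting more than 30 movies a year) and rolls it up along an age_group dimension; the table and column names are invented, not part of the slides:

import pandas as pd

# Hypothetical rental records
rentals = pd.DataFrame({
    "customer":         ["c1", "c2", "c3", "c4", "c5"],
    "age_group":        ["18-25", "26-35", "18-25", "36-45", "18-25"],
    "rentals_per_year": [35, 12, 40, 31, 8],
})

# Target class: customers who rent more than 30 movies a year
target = rentals[rentals["rentals_per_year"] > 30]

# Summarize ("roll up") the target class along the age_group dimension
summary = target.groupby("age_group")["rentals_per_year"].agg(["count", "mean"])
print(summary)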
Data discrimination
• Data discrimination produces what are
called discriminant rules and is basically the
comparison of the general features of objects
between two classes referred to as the target
class and the contrasting class.
• For example, one may want to compare the
general characteristics of the customers who
rented more than 30 movies in the last year with
those who rented fewer than 5 movies.
• The techniques used for data discrimination are
very similar to the techniques used for data
characterization with the exception that data
discrimination results include comparative
measures.
Association analysis
• Association analysis is the discovery of what are
commonly called association rules.
• It studies the frequency of items occurring
together in transactional databases, and based on
a threshold called support, identifies the frequent
item sets.
• Another threshold, confidence, which is the
conditional probability that an item appears in a
transaction when another item appears, is used
• Association analysis is commonly used for market
basket analysis.
• An association may also involve more than one attribute
or predicate (e.g., age, income, and buys).
• Suppose, as a marketing manager of AllElectronics, you would like to
determine which items are frequently purchased together within the same
transactions. An example of such a rule is

contains(T, "computer") ⇒ contains(T, "software")
[support = 1%, confidence = 50%]

meaning that if a transaction T contains "computer", there is a 50% chance
that it contains "software" as well, and 1% of all of the transactions
contain both.
• The OurVideoStore manager may want to know what games are often rented
together, or whether there is a relationship between renting a certain
type of game and buying popcorn or pop. The rule

RentType(X, "game") AND Age(X, "13-19") → Buys(X, "pop") [s = 2%, c = 55%]
would indicate that 2% of the transactions
considered are of customers aged between 13
and 19 who are renting a game and buying a
pop, and that there is a confidence of 55%
that teenage customers who rent a game also
buy pop.
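
As a rough illustration of how support and confidence are computed, here is a minimal Python sketch; the toy transactions and helper names are made up for illustration, not taken from the slides:

transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer", "printer"},
    {"printer", "paper"},
    {"computer"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Conditional probability that the consequent appears given the antecedent
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(support({"computer", "software"}, transactions))       # 0.4
print(confidence({"computer"}, {"software"}, transactions))  # 0.5

Rules whose support and confidence exceed the minimum thresholds set by domain experts are reported as association rules.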
Use basket analysis to plan
• Marketing or advertising strategies
• Design store layout
– Keep items bought together in close proximity
– Place hardware and software at opposite ends so that customers walk a
long way and notice other items, like security systems
– Plan sale items, e.g., a sale on printers
• For association analysis
– minimum threshold values of support and confidence are set by
domain experts,
– data mining is performed,
– and the resulting rules are examined.
Classification and Prediction
• Classification is the process of finding a set of models (or functions)
which describe and distinguish data classes or concepts, for the purpose
of being able to use the model to predict the class of objects whose
class label is unknown.
• Classification approaches normally use a training set where all
objects are already associated with known class labels. The
classification algorithm learns from the training set and builds a
model. The model is used to classify new objects.
• The derived model is based on the analysis of a set of training
data (i.e., data objects whose class label is known).
• The derived model may be represented in various forms, such as
classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it
in classifying new data
• Prediction:
– models continuous-valued functions, i.e., predicts unknown or
missing values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification—A Two-Step Process
• Model construction: describing a set of predetermined
classes
– Each tuple is assumed to belong to a predefined class, as determined
by the class label attribute (supervised learning)
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying previously unseen objects
– Estimate accuracy of the model using a test set
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set, otherwise over-fitting will
occur
Classification Process: Model Construction

Training data → Classification algorithm → Classifier (model)

Training data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Learned model:
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process: Model Usage in Prediction

Testing data → Classifier → compare predicted label with known label
Unseen data, e.g., (Jeff, Professor, 4) → Tenured?

Testing data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
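
The sketch below (assumed helper names; the tables come from the slides) applies the learned IF-THEN rule to the test set to estimate accuracy, and then to the unseen tuple:

def predict_tenured(rank, years):
    # The rule learned on the training data:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in test_set)
print(f"accuracy = {correct / len(test_set):.0%}")   # 75% (Merlisa is misclassified)

print(predict_tenured("Professor", 4))               # unseen tuple (Jeff) -> 'yes'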
• A decision tree is a flow-chart-like tree
structure, where each node denotes a test on
an attribute value, each branch represents an
outcome of the test, and tree leaves represent
classes or class distributions. Decision trees
can be easily converted to classification rules.
• A neural network is a collection of linear
threshold units that can be trained to
distinguish objects of different classes.
Prediction:
• Prediction has attracted considerable attention given the
potential implications of successful forecasting in a
business context.
• There are two major types of predictions:
– one can either try to predict some unavailable data values or
pending trends
– or predict a class label for some data. The latter is tied to
classification.
– Once a classification model is built based on a training set, the
class label of an object can be foreseen based on the attribute
values of the object and the attribute values of the classes.
– Prediction, however, more often refers to the forecast of missing
numerical values, or increase/decrease trends in time-related data.
– The major idea is to use a large number of past values to
consider probable future values.
Clustering analysis
• Clustering analyzes data objects without consulting a
known class label.
• In general, the class labels are not present in the
training data simply because they are not known to
begin with.
• Clustering can be used to generate such labels. The
objects are clustered or grouped based on the principle
of maximizing the intra class similarity and minimizing
the interclass similarity.
• That is, clusters of objects are formed so that objects
within a cluster have high similarity in comparison to
one another, but are very dissimilar to objects in other
clusters.
• Each cluster that is formed can be viewed as a class of
objects, from which rules can be derived.
Examples of clustering
• Grouping items when doing laundry – dry-clean, whites, bright colored, etc.
The grouped items have important attributes in common in the way they behave.
• Look at people in your neighbourhood. People with similar incomes tend
to live near each other; if my income is known, it gives an idea of the
others' incomes.
• Find the nearest neighbour and use it for prediction.
• A Toyota Corolla is more similar to a Honda Civic than to a BMW.

Objects that are near each other also have similar prediction values.
Thus, if we know the prediction value of one of the objects, it can be
used to predict values for other objects in the cluster.
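
An illustrative sketch of this idea, using k-means from scikit-learn on invented two-dimensional points (the data and feature names are assumptions, not from the slides):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical objects: (annual income in $1000s, spending score)
X = np.array([
    [25, 80], [27, 75], [30, 82],     # one natural group
    [70, 20], [72, 25], [75, 18],     # another natural group
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster label assigned to each object
print(kmeans.cluster_centers_)   # one representative point per cluster

# A new object can be assigned to its nearest cluster, and the values of the
# objects already in that cluster can be used for prediction
print(kmeans.predict(np.array([[28, 78]])))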
Summary -Data Mining Functionalities
• Multidimensional concept description:
Characterization and discrimination – Generalize,
summarize, and contrast data characteristics, e.g., dry
vs. wet regions
• Association – frequent patterns, e.g., Diaper → Beer [0.5%,
75%]
• Classification and prediction – Construct models
(functions) that describe and distinguish classes or
concepts for future prediction
• Clustering - Objects within a cluster have high
similarity in comparison to one another, but are very
dissimilar to objects in other clusters.
• E.g., classify countries based on (climate), or classify
cars based on (gas mileage) – Predict some unknown or
missing numerical values
Are All the “Discovered” Patterns Interesting?
• Data mining may generate thousands of patterns: Not all of them
are interesting
• Interestingness measures
– A pattern is interesting if it is
1. easily understood by humans,
2. valid on new or test data with some degree of certainty,
3. potentially useful,
4. novel,
5. validates some hypothesis that a user seeks to confirm

53
• Objective vs. subjective interestingness measures
– Objective: based on statistics and structures of patterns,
e.g., support, confidence, etc.
– Subjective: based on user’s belief in the data, e.g.,
unexpectedness, novelty.
Data Mining: Classification Schemes

• Different views, different classifications


– Kinds of data to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

55
Data Mining: Classification
• Data to be mined
– Relational, data warehouse, transactional, stream, object-
oriented/relational, active, spatial, time-series, text, multi-media,
heterogeneous, WWW
• Knowledge to be mined
– Characterization, discrimination, association, classification,
clustering, trend/deviation etc.
– based on the granularity or levels of abstraction of the knowledge
mined, including generalized knowledge (at a high level of
abstraction), primitive-level knowledge (at a raw data level)
– knowledge at multiple levels (considering several levels of
abstraction).
– An advanced data mining system should facilitate the discovery
of knowledge at multiple levels of abstraction.

56
Data Mining: Classification
• Techniques utilized
– according to the degree of user interaction involved
(e.g., autonomous systems, interactive exploratory
systems, query-driven systems),
– methods of data analysis employed -Database-
oriented, data warehouse (OLAP), machine learning,
statistics, visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, Web mining,
etc.
57
Major Issues in Data Mining

• Mining methodology
– Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
• Performance issues
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
• Diverse database types
58
Major Issues in Data Mining

• User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels
of abstraction
• Applications and social impacts
– Domain-specific data mining & invisible data
mining
– Protection of data security, integrity, and privacy
59
Data Preprocessing
• Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
60
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
– Part of data reduction but with particular importance, especially for
numerical data
1. Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time
of entry
– not register history or changes of the data
• Missing data may need to be inferred

62
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with:
– a global constant, e.g., “unknown”
– the attribute mean: for example, suppose that the average income of
AllElectronics customers is $28,000; use this value to replace the missing
value for income.
– the attribute mean for all samples belonging to the same class: For example, if
classifying customers according to credit risk, replace the missing value with the
average income value for customers in the same credit risk category as that of the
given tuple.
• the most probable value: inference-based such as Bayesian formula or decision tree
• using the other customer attributes in your data set, you may construct a decision
tree to predict the missing values for income.
63
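
A hedged pandas sketch of the fill-in strategies listed above; the toy customer table and column names are invented for illustration:

import pandas as pd
import numpy as np

# Hypothetical customer table with missing income values
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [28000, np.nan, 15000, np.nan, 30000],
})

# 1) Fill with a global constant
filled_constant = df["income"].fillna(-1)

# 2) Fill with the attribute mean over all samples
filled_mean = df["income"].fillna(df["income"].mean())

# 3) Fill with the attribute mean of samples in the same class (credit risk)
filled_class_mean = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean"))
print(filled_class_mean)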
2. Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data

64
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal
with possible outliers)

65
Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:


– It divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width
of intervals will be: W = (B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately
same number of samples
– Good data scaling
– Managing categorical attributes can be tricky.
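
A small sketch of the two partitioning schemes using pandas (equal-width via pandas.cut, equal-depth via pandas.qcut), applied to the sorted price data shown on the next slide:

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

equal_width = pd.cut(prices, bins=3)    # interval width W = (34 - 4) / 3 = 10
equal_depth = pd.qcut(prices, q=3)      # roughly 4 values per bin

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())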
Binning Methods for Data Smoothing

* Sorted data for price (in Rs): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
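
A minimal Python sketch reproducing the smoothing results above (equi-depth bins of the sorted prices, then smoothing by bin means and by bin boundaries); the variable names are my own:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]   # equi-depth bins

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the closer bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]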
Clustering
• Outliers may be detected by clustering, where
similar values are organized into groups or
“clusters”.
• Intuitively, values which fall outside of the set of
clusters may be considered outliers.
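
One possible sketch of this idea, assuming invented one-dimensional data: after clustering, values that end up in a near-empty cluster far from the bulk of the data can be treated as candidate outliers.

import numpy as np
from sklearn.cluster import KMeans

values = np.array([[4], [8], [9], [15], [21], [24], [25], [26], [28], [29], [210]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)

sizes = np.bincount(km.labels_)       # number of values in each cluster
is_outlier = sizes[km.labels_] <= 1   # values sitting in a near-empty cluster
print(values[is_outlier].ravel())     # -> [210]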
Combined computer and human
inspection
• Outliers may be identified through a combination of
computer and human inspection.
• for example, an information-theoretic measure was
used to help identify outlier patterns in a handwritten
character database for classification.
• The measure's value reflected the “surprise" content
of the predicted character label with respect to the
known label.
• Outlier patterns may be informative or “garbage”.
Patterns whose surprise content is above a threshold
are output to a list.
• A human can then sort through the patterns in the list
to identify the actual garbage ones
Regression
• Data can be smoothed by fitting the data to a
function, such as with regression.
• Linear regression involves finding the “best" line
to fit two variables, so that one variable can be
used to predict the other.
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple linear regression is an extension of linear
regression, where more than two variables are
involved and the data are fit to a
multidimensional surface.
Regression Analysis

[Figure: scatter of (x, y) points with a fitted line y = x + 1; for an input
X1, the fitted value Y1’ on the line is compared with the observed value Y1.]

• Regression analysis: a collective name for techniques for the modeling
and analysis of numerical data
• The parameters are estimated so as to give a “best fit” of the data
• Most commonly the best fit is evaluated by using the least squares
method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data),
inference, hypothesis testing, and modeling of causal relationships
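
A minimal least-squares sketch with NumPy (the data points are made up to lie roughly on y = x + 1; this is an illustration, not the slides' example):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])    # roughly y = x + 1 plus noise

a, b = np.polyfit(x, y, deg=1)   # least-squares estimates of slope and intercept
print(a, b)                      # close to 1 and 1

y_smoothed = a * x + b           # noisy values smoothed onto the fitted line
print(a * 7 + b)                 # predicting y for an unseen x = 7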
3. Data discrepancy detection
– Use metadata (e.g., domain, range, dependency,
distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g.,
postal code, spell-check) to detect errors and make
corrections
• Data auditing: by analyzing data to discover rules and
relationships to detect violators (e.g., correlation and
clustering to find outliers)
72
