
Data Mining

By
Dr. Muawia Mohamed Ahmed
[email protected]
1
Unit-1

Background
Definitions
Data warehouse
Common tasks
Data mining elements
Analysis Levels.
2
Background

Humans have been "manually" extracting information from data for centuries, but the increasing volumes of data in modern times have called for more automatic approaches.
Data sets and the information extracted from them have grown in size and complexity.

3
Direct hands-on data analysis has increasingly
been supplemented and augmented with indirect,
automatic data processing using more complex
and sophisticated tools, methods and models.
Data mining is the process of using computing
power to apply methodologies, including new
techniques of knowledge discovery, to data.

4
Data

Data are any facts, numbers, or text that can be processed by a computer.
Today, organizations are accumulating vast and
growing amounts of data in different formats
and different databases.
This includes:

5
Operational or transactional data, such as sales, cost, inventory, payroll, and accounting.
Non-operational data, such as industry data, forecast data, and macroeconomic data.
Metadata: data about the data itself, such as a logical database design or data dictionary definitions.

6
Relational & Multidimensional database
structure
A hotly debated technical issue is whether it is
better to set up a relational database structure
or a multidimensional one.
In a relational structure, data is stored in
tables, permitting ad hoc queries.
In a multidimensional structure, on the other
hand, sets of cubes are arranged in arrays, with
subsets created according to category.

7
Information
The patterns, associations, or relationships
among all this data can provide information.
For example, analysis of retail point-of-sale transaction data can yield information on which products are selling and when.

8
Knowledge

Information can be converted into knowledge about historical patterns and future trends.
For example, summary information on retail
supermarket sales can be analyzed in light of
promotional efforts to provide knowledge of
consumer buying behaviour.

9
Assignment
Mention the main types of data to be mined, giving short notes on each type.
Flat file, relational database, transaction database, multimedia database, spatial database, time-series database (stock market or logged activities), World Wide Web.
10
Data Mining

Definitions
Data warehouse
Common tasks
Data Mining Elements
Analysis Levels.

11
Data Mining
Data mining is the process of extracting hidden
patterns from data.
As more data is gathered (by some estimates, the amount of data doubles every three years), data mining is becoming an increasingly important tool for transforming this data into information.

12
It is commonly used in a wide range of
applications, such as marketing, fraud
detection and scientific discovery.
Data mining can be applied to data sets of any
size.

13
Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both.
Data analysis software is what supports data mining.

14
Data Warehouses
Data warehousing is defined as a process of
centralized data management and retrieval.
Data warehousing represents an ideal vision of
maintaining a central repository (storing) of all
organizational data.
Centralization of data is needed to maximize
user access and analysis.

15
Data mining is primarily used today by
companies with a strong consumer focus - retail,
financial, communication, and marketing
organizations.
It enables these companies to determine
relationships among "internal" factors such as
price, product positioning, or staff skills, and
"external" factors such as economic indicators,
competition, and customer demographics.

16
It enables them to determine the impact on
sales, customer satisfaction, and corporate
profits.
Finally, it enables them to "drill down" into
summary information to view detail
transactional data.

17
Operational data: data used in the day-to-day operations of a company.
Informational data: supports other functions such as planning and forecasting.
Data mining tools often access data warehouses (informational data) rather than operational data.

18
Operational vs. Informational Data

                Operational Data     Data Warehouse
Application     OLTP                 OLAP
Use             Precise queries      Ad hoc queries
Temporal        Snapshot             Historical
Modification    Dynamic              Static
Orientation     Application          Business
Data            Operational values   Integrated
Size            Gigabits             Terabits
Level           Detailed             Summarized
Access          Often                Less often
Response        Few seconds          Minutes
Data schema     Relational           Star/snowflake
19
Common tasks
Data mining is commonly used to perform four classes of tasks:
Classification:
Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam.
Common algorithms include nearest neighbour and neural networks.
20
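A minimal sketch of classification by nearest neighbour in Python, using scikit-learn; the two numeric features and the tiny training set are invented purely for illustration:

# Toy nearest-neighbour classifier: label emails as spam or legitimate
# from two invented numeric features (e.g. link count, exclamation marks).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[8, 5], [7, 6], [1, 0], [0, 1], [9, 7], [2, 1]]   # feature vectors
y_train = ["spam", "spam", "legit", "legit", "spam", "legit"]

clf = KNeighborsClassifier(n_neighbors=3)   # vote among the 3 nearest examples
clf.fit(X_train, y_train)
print(clf.predict([[6, 5], [1, 1]]))        # -> ['spam' 'legit']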
Clustering:
Like classification, but the groups are not predefined, so the algorithm tries to group similar items together.
Regression:
Attempts to find a function which models the data with the least error.
A common method is to use genetic programming.

21
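Two matching sketches in Python: k-means discovers the groups itself (no predefined labels), and an ordinary least-squares line fit stands in for regression (genetic programming, mentioned above, is one alternative way to search for such a function). All data points are invented for illustration:

# Clustering: k-means groups similar points without predefined labels.
from sklearn.cluster import KMeans
import numpy as np

points = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group
          [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]   # another natural group
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)                                # e.g. [0 0 0 1 1 1]

# Regression: fit a function (here a line) minimising squared error.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
slope, intercept = np.polyfit(x, y, 1)           # least-squares fit: y is roughly 2x
print(slope, intercept)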
Association:
Searches for relationships between variables.
For example, a supermarket might gather data on what each customer buys.
Using association rule learning, the supermarket can work out which products are frequently bought together, which is useful for marketing purposes.
This is sometimes referred to as "market basket analysis".

22
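A minimal market-basket sketch in Python: it counts co-occurrences in a handful of invented baskets and derives the support and confidence of the rule {bread} -> {butter}:

# Count how often items and item pairs occur across baskets, then
# compute support and confidence for the rule bread -> butter.
from itertools import combinations
from collections import Counter

baskets = [{"bread", "butter", "milk"}, {"bread", "butter"},
           {"bread", "jam"}, {"milk", "butter"},
           {"bread", "butter", "jam"}]

item_counts, pair_counts = Counter(), Counter()
for b in baskets:
    item_counts.update(b)
    pair_counts.update(combinations(sorted(b), 2))

n = len(baskets)
support = pair_counts[("bread", "butter")] / n                        # 3/5
confidence = pair_counts[("bread", "butter")] / item_counts["bread"]  # 3/4
print(f"support={support:.2f}, confidence={confidence:.2f}")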
Data Mining Elements
Data mining consists of five major elements:
Extract, transform, and load transaction data
onto the data warehouse system.
Store and manage the data in a multidimensional
database system.
Provide data access to business analysts and
information technology professionals.
Analyze the data with application software.
Present the data in a useful format, such as a
graph or table.
23
Analysis Levels
Different levels of analysis are available:
Artificial neural networks: Non-linear
predictive models that learn through training
and resemble (imitate) biological neural
networks in structure.
Genetic algorithms: Optimization techniques
that use processes such as genetic combination,
mutation (change), and natural selection in a
design based on the concepts of natural
evolution.
24
Decision trees: Tree-shaped structures that
represent sets of decisions.
These decisions generate rules for the
classification of a data set.
Rule induction: The extraction of useful if-
then rules from data based on statistical
significance.
Data visualization: The visual interpretation
of complex relationships in multidimensional
data.
(Graphics tools are used to illustrate data relationships.)
25
Technological infrastructure
Data mining applications are available on systems of all sizes: mainframe, client/server, and PC platforms.
Enterprise-wide applications generally range in
size from 10 gigabytes to over 11 terabytes.
Some companies have the capacity to deliver
applications exceeding 100 terabytes.

26
There are two critical technological drivers:
Size of the database:
The more data being processed and
maintained, the more powerful the system
required.
Query complexity:
The more complex the queries and the greater
the number of queries being processed, the
more powerful the system required.

27
Relational database storage and management technology is adequate for many data mining applications smaller than 50 gigabytes.
However, this infrastructure needs to be significantly enhanced to support larger applications.
Some vendors have added extensive indexing capabilities to improve query performance.

28
Others use new hardware architectures such as
Massively Parallel Processors (MPP) to achieve
order-of-magnitude improvements in query
time.
For example, MPP systems from NCR link
hundreds of high-speed Pentium processors to
achieve performance levels exceeding those of
the largest supercomputers.

29
Data mining identifies trends within data that go
beyond simple analysis.
Through the use of sophisticated algorithms,
non-statistician users have the opportunity to
identify key attributes of processes and target
opportunities.
The term data mining is often used to apply to
the two separate processes of knowledge
discovery and prediction.

30
Knowledge discovery provides explicit
information about the characteristics of the
collected data, using a number of techniques.
Recently, there have been efforts to define standards for data mining, for example the Cross-Industry Standard Process for Data Mining (CRISP-DM) for analysis processes and the Java Data Mining standard.

31
Available open-source software systems like RapidMiner and Weka have become informal standards for defining data-mining processes.

32
Data Mining Issues
Key issues raised by data mining technology are:
1. Business issues: analyzing routine business transactions and classifications.
2. Social issues: data mining makes it possible to analyze routine business transactions and glean a significant amount of information about individuals' buying habits and preferences.
33
3. Mining Methodology Issues:
These pertain to the data mining approaches applied and their limitations.
Broad analysis needs, the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the handling of noise in data are examples of factors that can dictate mining methodology choices.
A choice among methodologies is needed.

34
4. Cost: While system hardware costs have
dropped dramatically within the past few
years, data mining and data warehousing tend
to be self-reinforcing.

This increases the pressure for larger, faster systems, which are more expensive.

35
5. User Interface Issues:
The knowledge discovered by data mining tools is useful as long as it is interesting and, above all, understandable by the user.
This calls for good visualization of data mining results.
36
6. Data Source Issues:
- An excess of data arises when we have more data than we can handle.
- Different types of data are stored in a variety of repositories.
The concern is whether we are collecting the right data in the appropriate amount.
37
Data Mining software
Data mining software is one of a number of
analytical tools for analyzing data.
It allows users to analyze data from many
different dimensions or angles, categorize it, and
summarize the relationships identified.
Technically, data mining is the process of finding
correlations or patterns among dozens of fields in
large relational databases.

38
Analysts separate data mining software into two
groups: data mining tools and data mining
applications.
Data mining tools provide a number of
techniques that can be applied to any business
problem. (SPSS is a general tool)
Data mining applications, on the other hand,
embed techniques inside an application
customized to address a specific business
problem.
39
For example, almost every financial
transaction is processed by a data mining
application to detect fraud.
Both data mining tools and data mining
applications are valuable.
Organizations are using data mining tools and
data mining applications together in an
integrated environment for predictive
analytics.

40
Data Mining Tools
The data mining market consists of software
vendors offering tools that extract predictive
information from large data stores.

Data mining tools provide both developers and business users with an interface for discovering, manipulating, and analysing corporate data.
41
Some tool vendors and products:

Vendor             Product
Angoss Software    Mining Manager 2.1
IBM                DB2 Intelligent Miner
KXEN               Analytic Framework 3
SAS                Enterprise Miner 5.1
SPSS               Clementine 8.5

42
Clementine data mining
Clementine (fruitful data mining) transforms data into actionable results.
The SPSS data mining workbench enables your organization to quickly develop predictive data mining models and deploy those models into your organization's operations, improving decision making.

43
Using Clementine's powerful, visual data
mining interface and your business expertise,
you can quickly interact with your data and
begin discovering patterns you can use to
change your organization for the better.

44
Text mining and Web mining
Recent advances have led to the newest and hottest trends in data mining: text mining and Web mining.
These two data mining technologies open a rich vein of customer data in the form of textual comments from survey research and log files from Web servers.

45
UNIT-2

KDD Process
Data Mining Process
Data Mining Functionalities

46
KDD Vs Data Mining
Data mining is popularly known as Knowledge Discovery in Databases (KDD).
Data mining is actually part of the KDD process.

Data mining is the core of KDD.

47
Knowledge Discovery in Databases (KDD):
process of finding useful information and
patterns in data.
Data Mining: Use of algorithms to extract the
information and patterns derived by the KDD
process.

48
KDD Process
(Figure: the KDD process pipeline; its data mining step applies tasks such as classification, clustering, association, etc.)

49
Selection: Obtain data from various
sources.
Preprocessing: data cleaning.
Transformation: Convert to common
format. Transform to new format.
Data Mining: Obtain desired results by applying data mining tasks and tools.
Interpretation/Evaluation: Present results
to user in meaningful manner.

50
Data Mining Process
Consists of three stages:
(1) The initial exploration,
(2) Model building
(3) Deployment (i.e., the application of the model to new data in order to generate predictions or patterns).
Remember, the ultimate goal of data mining
is prediction and decision making.

51
Stage 1: Exploration.
This stage usually starts with data preparation, which may involve cleaning data, transforming data, and selecting subsets of records.
For data sets with large numbers of variables ("fields"), it may also involve preliminary feature selection to bring the number of variables down to a manageable range (reduction).

52
Analysis is then carried out using a wide variety of graphical and statistical methods, for example Exploratory Data Analysis (EDA), in order to identify the most relevant variables and determine the complexity and/or general nature of the models that can be taken into account in the next stage.

53
EDA
Exploratory Data Analysis (EDA) is used to
identify systematic relations between variables
when there are no (or not complete)
expectations as to the nature of those relations.
In a typical exploratory data analysis process,
many variables are taken into account and
compared, using a variety of techniques in the
search for systematic patterns.

54
Example: there is a positive correlation between the AGE of a person and his/her RISK TAKING.

55
Computational EDA techniques
Computational exploratory data analysis methods include both simple basic statistics and more advanced multivariate exploratory techniques designed to identify patterns in multivariate data sets.

56
Graphical EDA techniques :
A large selection of powerful exploratory data
analytic techniques is also offered by graphical
data visualization methods that can identify
relations, trends, and biases "hidden" in
unstructured data sets.
The most common and historically first widely
used technique explicitly identified as
graphical exploratory data analysis is
brushing.
57
Brushing is an interactive method allowing
one to select on-screen specific data points or
subsets of data and identify their (e.g.,
common) characteristics, or to examine their
effects on relations between relevant variables.
Other graphical exploratory analytic
techniques include function fitting and
plotting.

58
Stage II: Model building:
Choose suitable models to represent the explored data.
Stage III: Deployment:
Ensure that the resultant patterns meet the requirements for prediction and decision making.

59
Data Mining Functionalities
The functionalities are:
- Characterization: summarization of the general features of objects, producing characteristic rules.
- Discrimination: comparison between two classes, a target class and a contrasting class (e.g., comparing customers who rented more than 40 movies with those whose rental count is lower than 5).
60
- Association analysis: the frequency of items occurring together in a transactional database. The confidence threshold is the conditional probability that an item appears in a transaction when another item appears.
- Classification: organization of data into given classes (e.g., safe, risky, very risky).

61
- Prediction: successful forecasting. There are two types of prediction: predicting some unavailable data values, or predicting a class label for some data.
- Clustering: organization of data into classes, but unlike classification, the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering is an unsupervised classification.

62
- Outlier analysis: outliers are data elements that cannot be grouped into a given class or cluster, known as exceptions or surprises.
In some applications they are noise, but they can reveal important knowledge in other domains.

63
- Evolution and deviation analysis:
Evolution analysis pertains to the study of time-related data that changes over time.
Deviation analysis considers differences between measured values and expected values.

64
Unit-3
Data Preprocessing
Data Cleaning
Data Integration
Data Transformation
Data Reduction

65
Data Cleaning
Data may be incomplete, noisy, and inconsistent (e.g., 15/01/2009 vs. 2009/1/15).
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
They attempt to improve the quality of the data.

66
Data cleaning capabilities include:
- Smoothing noisy data
- Eliminating duplicate records
- Identifying missing or incomplete data
- Removing obsolete (unused) data

67
Smoothing Noisy data
Noise is a random error or variance in measured or recorded data.
Noisy data needs smoothing.
In data mining, the binning method is used to smooth data.
Given a numerical attribute such as Price with data 3, 27, 7, 32, 25, 25, 6, 28, 22, binning (with three bins) gives:
68
Partitioning
Bin 1: 3 6 7
Bin 2: 22 25 25
Bin 3: 27 28 32
Smoothing by bin means (rounded to the nearest recorded value):
Bin 1: 6 6 6
Bin 2: 25 25 25
Bin 3: 28 28 28
69
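A short Python sketch reproducing the binning example above: equal-depth partitioning into three bins, then smoothing each bin by its mean snapped to the nearest recorded value, as on the slide:

# Equal-depth binning of the Price data, then smoothing by bin mean.
prices = [3, 27, 7, 32, 25, 25, 6, 28, 22]

values = sorted(prices)
k = 3                                        # number of bins
depth = len(values) // k
bins = [values[i * depth:(i + 1) * depth] for i in range(k)]
print(bins)                                  # [[3, 6, 7], [22, 25, 25], [27, 28, 32]]

for b in bins:
    mean = sum(b) / len(b)
    nearest = min(b, key=lambda v: abs(v - mean))  # nearest recorded value
    print([nearest] * len(b))                # [6, 6, 6] / [25, 25, 25] / [28, 28, 28]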
Clustering:
- Outliers can be detected by clustering.
- Similar data are clustered into groups.
- Values that fall outside the set of clusters may be considered outliers.
(Figure: clustered data points, with an outlier lying outside all clusters.)

70
Data Integration
Combining data from multiple data stores
into a coherent data store as in data
warehousing.
How can equivalent real-world entities from multiple data sources be matched up?
Do Customer_id in one database and cust_number in another refer to the same attribute?

71
Redundancy is another important issue.
An attribute such as annual revenue may be
redundant if it can be derived from another
attribute.
Some redundancy can be detected by
correlation analysis (chi-squared).
A third important issue in data integration is
the detection and resolution of data value
conflicts.

72
Attribute values from different sources may differ due to differences in representation, scaling, or encoding.
A weight attribute may be stored in metric
units in one system, and British units in
another.
Price may be in different currencies.

73
Data Transformation

Data are transformed into forms appropriate for mining.
This involves the following:
• Aggregation
• Generalization
• Normalization
• Feature construction
74
In aggregation, summary operations are applied to the data. For example:
- Daily sales data may be aggregated so as to compute monthly or annual sales.
- The aggregate is a numerical feature.

75
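A small pandas sketch of the daily-to-monthly aggregation described above; the daily sales figures are invented for illustration:

# Aggregate invented daily sales into monthly totals.
import pandas as pd

daily = pd.DataFrame(
    {"sales": [120, 90, 150, 200, 170]},
    index=pd.to_datetime(["2008-01-05", "2008-01-20", "2008-02-03",
                          "2008-02-18", "2008-03-02"]),
)
monthly = daily.groupby(daily.index.to_period("M"))["sales"].sum()
print(monthly)   # 2008-01: 210, 2008-02: 350, 2008-03: 170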
In generalization, low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies.
- A categorical attribute like street can be generalized to city or country.
- Values of a numerical attribute like age may be mapped to higher-level concepts like youth, middle-aged, and senior.

76
In normalization, attribute data are scaled so as to fall within a small specified range.
This is useful for classification and clustering.
Common normalization techniques are:
- Min-max normalization
- Z-score normalization
- Decimal scaling

77
Min-max normalization performs a linear transformation on the original data.
min_A and max_A are the minimum and maximum values of the attribute A.
Min-max normalization maps a value v of A to v′ in the range [new_min_A, new_max_A] by computing:

v′ = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

78
Suppose the minimum and maximum values of the attribute income are $12,000 and $98,000, and the target range is [0.0, 1.0] (new_min_A = 0.0, new_max_A = 1.0).
Then a value v = $73,600 for income is transformed to:

((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716

79
In z-score normalization, the value v of an attribute A is normalized to v′ by:

v′ = (v − μ_A) / σ_A

where μ_A is the mean and σ_A is the standard deviation of A.

80
Suppose the mean and standard deviation of the values of the attribute income are $54,000 and $16,000 respectively. With z-score normalization, a value of $73,600 for income is transformed to:

(73,600 − 54,000) / 16,000 = 1.225

81
In decimal scaling, the formula is:

v′ = v / 10^j

where j is the smallest integer such that the maximum |v′| < 1.
If A ranges from −986 to 917, the maximum absolute value is 986, so j = 3: each value is divided by 1,000.

82
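The three normalization techniques as plain Python functions, checked against the worked income examples from the slides (min-max gives 0.716, z-score gives 1.225):

# Min-max, z-score, and decimal-scaling normalization.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scale(v, j):
    return v / 10 ** j

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scale(-986, 3))                   # -0.986, now inside (-1, 1)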
Feature construction
In feature construction, new attributes are constructed from the given attributes and added in order to help improve accuracy and the understanding of structure in high-dimensional data.
- Example: add the attribute area based on the attributes height and width.

83
Data Reduction
Data mining on huge amounts of data is impractical and takes a long time.
Data reduction is useful for obtaining a reduced data set without losing its integrity.
Operations on reduced data are easier and more efficient.
84
The process of data reduction includes:
- Data cube aggregation
- Attribute subset selection
- Histograms

85
Data Cube Aggregation

Consider a company with branches A, B, C, and D, each branch having annual sales values for four items.
We would like to aggregate the sales for the three years 2006, 2007, and 2008 in a data cube.

86
(Figure: a data cube of three years' sales (2006, 2007, 2008) by branch (A, B, C, D) and item, with sample aggregates: phone 650, computer 980, TV 720, bell 230.)
87
Attribute Subset Selection:
The goal is to find a less complicated, minimal set of attributes that does not alter the original data distribution.
A selection criterion is needed (keep one attribute from each set of repetitive, or likely repetitive, attributes).
Example: keep size from (size, height, width, colour).
88
Heuristic methods include:
- Forward selection
- Backward elimination
- Combination of the above
- Decision Tree induction

89
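A minimal greedy forward-selection sketch in Python: it repeatedly adds the single attribute that most improves a caller-supplied score and stops when no attribute helps. The score function is an assumed callback (for example, cross-validated model accuracy on the chosen attributes):

# Greedy forward selection over a list of attribute names.
def forward_selection(attributes, score):
    selected, best = [], float("-inf")
    remaining = list(attributes)
    while remaining:
        top_score, top_attr = max((score(selected + [a]), a) for a in remaining)
        if top_score <= best:        # no remaining attribute improves the score
            break
        best = top_score
        selected.append(top_attr)
        remaining.remove(top_attr)
    return selected

# e.g. forward_selection(["size", "height", "width", "colour"], my_score)

Backward elimination is the mirror image: start from all attributes and repeatedly drop the one whose removal hurts the score least.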
Unit - 4
Data Mining Techniques

-Classification
-Decision Tree
- Neural Networks
- Genetic Algorithms

90
Classification
Classification is a supervised learning method that induces a classification model from a database (one of the main tasks).
First, conditions are defined for each class.
The data mining system then constructs descriptions for the classes.
Given a case with certain known attribute values, the system will be able to tell what class this case belongs to.
91
Example: classify countries based on climate.
The result could be presented by:
- IF-THEN rules
- A decision tree
Prediction, by contrast, predicts unknown or missing values.

92
Decision Tree
A tree where the root and each internal node is labeled with a question.
• The arcs represent each possible answer to the associated question.
• Each leaf node represents a prediction of a solution to the problem.
It is a popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.

93
Decision Tree from database
Num   Size    Colour   Shape    Concept satisfied
1     med     blue     brick    yes
2     small   red      wedge    no
3     small   red      sphere   yes
4     large   red      wedge    no
5     large   green    pillar   yes
6     large   red      pillar   no
7     large   green    sphere   yes
94
Choose the target attribute: Concept satisfied.
The decision tree for the previous database can be constructed as:

95
{1,2,3,4,5,6,7}: split on Size
  medium -> {1}: Yes
  small  -> {2,3}: split on Shape
              wedge  -> {2}: No
              sphere -> {3}: Yes
  large  -> {4,5,6,7}: split on Shape
              wedge  -> {4}: No
              sphere -> {7}: Yes
              pillar -> {5,6}: split on Colour
                          green -> {5}: Yes
                          red   -> {6}: No
96
Thus we get the following rules from the tree:
IF (size = large AND (shape = wedge OR (shape = pillar AND colour = red)))
   OR (size = small AND shape = wedge)
THEN No
IF (size = large AND ((shape = pillar AND colour = green) OR shape = sphere))
   OR (size = small AND shape = sphere)
   OR (size = medium)
THEN Yes
97
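For comparison, a scikit-learn sketch that induces a decision tree from the same seven-row table (categorical attributes one-hot encoded); note that the automatically induced tree may choose different splits than the hand-built one above:

# Fit and print a decision tree for the size/colour/shape table.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = pd.DataFrame({
    "size":   ["med", "small", "small", "large", "large", "large", "large"],
    "colour": ["blue", "red", "red", "red", "green", "red", "green"],
    "shape":  ["brick", "wedge", "sphere", "wedge", "pillar", "pillar", "sphere"],
    "satisfied": ["yes", "no", "yes", "no", "yes", "no", "yes"],
})

X = pd.get_dummies(rows[["size", "colour", "shape"]])  # one-hot encoding
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, rows["satisfied"])
print(export_text(tree, feature_names=list(X.columns)))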
A Decision Tree Model is a computational
model consisting of three parts:
• Decision Tree
• Algorithm to create the tree
• Algorithm that applies the tree to data
Creation of the tree is the most difficult
part.

98
DT Advantages/Disadvantages
Advantages:
• Easy to understand.
• Easy to generate rules
Disadvantages:
• May suffer from overfitting.
• Classifies by rectangular partitioning.
• Does not easily handle nonnumeric data.
• Can be quite large – pruning is necessary.

99
Neural Networks
A neural network is a collection of processing nodes transferring activity to each other via connections (like the brain).
An artificial neuron is the mathematical representation of a neuron.
A NN is an analytic technique capable of predicting new observations from other observations after executing a process of so-called learning from existing data.
100
(Figure: an artificial neuron with inputs I1 … In, weights W1 … Wn, weighted sum X = Σj Wj·Ij, threshold test X > T?, and output S.)
101
In an artificial neuron, all signals can be 1 or −1 (a binary case, often called classic spin).
The neuron calculates a weighted sum X of the inputs and compares it with a threshold T.
If the sum is higher than the threshold T, the output is set to 1, otherwise to −1.
The output S is either 1 or −1.

102
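The threshold neuron above is a few lines of Python; the weights and threshold here are invented for illustration:

# Threshold neuron: weighted sum of +/-1 inputs compared against T.
def neuron(inputs, weights, threshold):
    x = sum(w * i for w, i in zip(weights, inputs))  # X = sum_j Wj * Ij
    return 1 if x > threshold else -1                # output S

print(neuron([1, -1, 1], [0.5, 0.3, 0.9], threshold=0.6))   # X = 1.1  -> 1
print(neuron([-1, 1, -1], [0.5, 0.3, 0.9], threshold=0.6))  # X = -1.1 -> -1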
The prediction capability of a NN is the most interesting for data mining.
A NN is trained to classify certain patterns into certain groups, and is then used to classify new patterns presented to the net.
This is called the feed-forward approach.

103
Genetic Algorithm
The genetic algorithm is one of the important application techniques in data mining.
A GA is a stochastic search method.
Let n be the number of predicted attributes in the data being mined.
A chromosome is composed of n genes, and each ith gene is partitioned into three fields:
104
1/ Flag F is a binary-valued variable: 1 shows that the corresponding condition is involved in the rule, 0 shows that the condition is removed from the rule.
2/ Relational operator RO is a field indicating the relational operator used in the ith condition.
If the attribute is categorical, this field can involve the operators = and ≠.
If the attribute is continuous, < and ≥.
105
3/ Value field V involves one of the values belonging to the domain of the attribute.

Chromosome representation:
Gene 1: [F1 | RO1 | V1]  …  Gene n: [Fn | ROn | Vn]
106
A GA creates an initial feasible solution and iteratively creates new, "better" solutions.
It is based on evolution and survival of the fittest.
A solution must be represented as an individual: a string I = I1, I2, …, In.
Each character Ij is called a gene.
A population is a set of individuals.

107
Genetic Algorithms
A Genetic Algorithm (GA) is a computational
model consisting of five parts:
• A starting set of individuals, P.
• Crossover: a technique to combine two parents to create offspring.
• Mutation: randomly change an individual.
• Fitness: determine the best individuals.
• Algorithm which applies the crossover and
mutation techniques to P iteratively using the
fitness function to determine the best
individuals in P to keep.

108
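A toy Python sketch wiring the five parts together; the bit-string encoding and the fitness function (count of 1-genes) are invented purely to keep the example self-contained:

# Minimal GA: population, crossover, mutation, fitness, selection loop.
import random

GENES, POP, GENERATIONS = 10, 8, 30

def fitness(ind):                       # toy fitness: number of 1-genes
    return sum(ind)

def crossover(a, b):                    # single-point crossover
    cut = random.randrange(1, GENES)
    return a[:cut] + b[cut:]

def mutate(ind, rate=0.1):              # flip each gene with probability `rate`
    return [1 - g if random.random() < rate else g for g in ind]

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    parents = population[: POP // 2]    # keep the fittest individuals
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print(max(population, key=fitness))     # usually converges to all ones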
GA applications
Financial Data Analysis
Operation and Supply Chain management
Engineering Design.

109
OLAP
On-Line Analytical Processing performs multidimensional analysis of business data and provides the capability for sophisticated data modeling.
It is performed in a data warehouse to support the ad hoc querying needed for DSS.
OLAP tools are classified as ROLAP (Relational OLAP) or MOLAP (Multidimensional OLAP).

110
OLAP provides more complex queries than OLTP.
On-Line Transaction Processing (OLTP) is traditional database/transaction processing.
Dimensional data: the cube view.

111
There are several types of OLAP operations:
- Simple query: looks at a single cell within the cube.
- Slice: looks at a sub-cube to get more specific information (fixing one attribute).
- Dice: looks at a sub-cube defined on two or more dimensions (many attributes).
- Roll up/drill down.

112
OLAP Operations

(Figure: OLAP operations, from a single cell to multiple cells, slice, and dice, with roll-up and drill-down moving between levels of summarization.)

113
Unit - 5

Data Mining Statistics


- Point Estimation
- Maximum Likelihood Estimation (MLE)
- Models Based on Summarization (histograms)
- Variance
- Regression and Correlation

114
Point estimate: an estimate of a population parameter.
- May be used to predict a value for missing data.
Example:
• Relation R contains 100 employees
• 99 have salary information
• The mean salary of these is $50,000
• Use $50,000 as the value of the remaining employee's salary.
(Discuss.)

115
Estimation Error:
Bias: the difference between the expected value and the actual value.

116
Maximum Likelihood Estimate (MLE)
Likelihood is the degree of probability: the joint probability of observing the sample data, obtained by multiplying the individual probabilities.
Likelihood function:

L(Θ | x1, …, xn) = Π(i = 1..n) P(xi | Θ)

117
MLE Example
Toss a coin five times: {H,H,H,H,T}.
Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:

L(0.5) = 0.5^5 = 0.03125
118
However, if the probability of an H is 0.8, then:

L(0.8) = 0.8^4 × 0.2 = 0.08192

119
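The coin example in a few lines of Python, confirming that p = 0.8 makes the observed sequence more likely than p = 0.5:

# Likelihood of the sequence {H,H,H,H,T} as a function of p = P(H).
def likelihood(p, heads=4, tails=1):
    return p ** heads * (1 - p) ** tails

print(likelihood(0.5))   # 0.03125
print(likelihood(0.8))   # 0.08192
# The maximum-likelihood estimate is the observed proportion: 4/5 = 0.8.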
Histograms
Histograms use binning to approximate data distributions.
It is a popular form of data reduction.
The prices of different pieces may repeat; the number of times each price repeats is its count.
A graph of price versus count is then established.

120
(Figure: a histogram of count versus price, with price bins around 5, 15, and 25.)

121
Variance & Standard Deviation
The variance of N observations x1, x2, …, xN is:

σ² = (1/N) Σ(i = 1..N) (xi − x̄)²

where x̄ is the mean value.

122
The standard deviation σ is the square root of the variance σ².
σ is a measure of the spread.
σ = 0 only when there is no spread.

123
Regression
Regression can be used for prediction (including forecasting of time-series data).
The regression equation deals with the following variables:
- The unknown parameters, denoted β; this may be a scalar or a vector of length k.
- The independent variables, X.
- The dependent variable, Y.

124
The regression equation is a function of the variables X and β:

Y = f(X, β)

Assignment:
Give an example of linear regression on a set of data.

125
Chi Squared

χ² = Σ (O − E)² / E

O: observed value
E: expected value based on the hypothesis.
Example:
• O = {50, 93, 67, 78, 87}
• E = 75
• χ² = 15.55, and therefore significant
126
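The slide's figure is easy to verify in Python:

# Chi-squared statistic for the observed values against E = 75.
observed = [50, 93, 67, 78, 87]
expected = 75
chi2 = sum((o - expected) ** 2 / expected for o in observed)
print(round(chi2, 2))    # 15.55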
Correlation

Examines the degree to which the values of two variables behave similarly.
Correlation coefficient r:
• 1 = perfect correlation
• -1 = perfect but opposite correlation
• 0 = no correlation

127
References

128
1. Gajendra Sharma, Data Mining, Data Warehousing and OLAP, Kataria & Sons, Second Edition, 2008.
2. http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/index.htm, January 2009.

129
3. http://www-users.cs.umn.edu/~kumar/dmbook/index.php, December 2008.
4. www.datashaping.com/data_mining.shtml, January 2009.
5. www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm, January 2009.

130
