DWDM_Concept_Demonstration
DATA MINING
Dept: Computer Science & Engineering
Aim: Evolution of data management technologies, introduction to Data
Warehousing concepts.
Theory:
1. Data Governance
o Data asset
o Data governance
o Data steward
2. Data Architecture, Analysis and Design
a. Data analysis
b. Data architecture
c. Data modeling
3. Database Management:
o Data maintenance
o Database administration
o Data access
o Data erasure
o Data privacy
o Data security
o Data cleansing
o Data integrity
o Data quality
o Data integration
o Reference data
o Business intelligence
o Data mart
o Data mining
o Data warehousing
o Records management
o Metadata management
o Metadata
o Metadata discovery
o Metadata publishing
o Metadata registry
Data Warehouse:
A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. A data warehouse has four defining characteristics:
• Subject Oriented
• Integrated
• Nonvolatile
• Time Variant
1. Subject Oriented
Data warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a warehouse that concentrates on
sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.
2. Integrated
Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems
as naming conflicts and inconsistencies among units of measure. When they
achieve this, they are said to be integrated.
3. Nonvolatile
Nonvolatile means that, once entered into the warehouse, data should not change.
This is logical because the purpose of a warehouse is to enable you to analyze what
has occurred.
4. Time Variant
In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A
data warehouse's focus on change over time is what is meant by the term time
variant.
Demo 2
Theory:
Design schemas:
• star schema
• snowflake schema
Reasons for creating a data mart include:
• Ease of creation
• Potential users are more clearly defined than in a full data warehouse
Star schema architecture is the simplest data warehouse design. The main feature of a star schema is a table at the center, called the fact table, and the dimension tables, which allow browsing of specific categories, summarizing, drill-downs and specifying criteria.
Typically, most of the fact tables in a star schema are in database third normal form, while dimension tables are denormalized (second normal form). Although the star schema is the simplest data warehouse architecture, it is the most commonly used in data warehouse implementations across the world today (about 90-95% of cases).
Fact table
The fact table is not a typical relational database table, as it is denormalized on purpose to enhance query response times. The fact table typically contains records that are ready to explore, usually with ad hoc queries. Records in the fact table are often referred to as events, due to the time-variant nature of a data warehouse environment.
The primary key for the fact table is a composite of all the columns except numeric values/scores (like QUANTITY, TURNOVER, exact invoice date and time).
Typical fact tables in a global enterprise data warehouse describe the core business processes, such as sales; usually there are also additional company- or business-specific fact tables.
Dimension table
Nearly all of the information in a typical fact table is also present in one or more
dimension tables. The main purpose of maintaining Dimension Tables is to allow
browsing the categories quickly and easily.
The primary keys of each of the dimension tables are linked together to form the
composite primary key of the fact table. In a star schema design, there is only one
denormalized table for a given dimension.
Fig A.1
The problem is that the more normalized the dimension table is, the more
complicated SQL joins must be issued to query them. This is because in order for a
query to be answered, many tables need to be joined and aggregates generated.
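As a small illustration of how a fact table references its dimension tables, the following Python sketch joins a tiny fact table to a denormalized dimension table; the table contents and column names are hypothetical.

# Dimension table: surrogate key -> denormalized product attributes (made-up data).
product_dim = {
    1: {"product_name": "Portable PC", "category": "Computers"},
    2: {"product_name": "Dress",       "category": "Clothing"},
}

# Fact table: each record references the dimension by its surrogate key.
sales_fact = [
    {"product_key": 1, "month": "Jan-2001", "quantity": 3, "turnover": 4500.0},
    {"product_key": 2, "month": "Jan-2001", "quantity": 5, "turnover": 250.0},
    {"product_key": 1, "month": "Feb-2001", "quantity": 2, "turnover": 3000.0},
]

# "Join" the fact records to the dimension to browse turnover by category,
# the way a star-schema query would.
turnover_by_category = {}
for row in sales_fact:
    category = product_dim[row["product_key"]]["category"]
    turnover_by_category[category] = turnover_by_category.get(category, 0.0) + row["turnover"]

print(turnover_by_category)  # {'Computers': 7500.0, 'Clothing': 250.0}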
Fig A.2
Demonstrations 3
Aim: Develop an application to implement the OLAP roll-up, drill-down, slice and dice operations.
S/w Requirement: ORACLE, DB2.
Theory:
OLAP is an acronym for On-Line Analytical Processing. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity.
Fig A.1 (dimension levels such as Office, Day and Month)
OLAP operations:
The analyst can understand the meaning contained in the databases using multidimensional analysis. By aligning the data content with the analyst's mental model, the chances of confusion and erroneous interpretations are reduced. The analyst can navigate through the database and screen for a particular subset of the data, changing the data's orientations and defining analytical calculations. The user-initiated process of navigating by calling for page displays interactively, through the specification of slices via rotations and drill down/up, is sometimes called "slice and dice". Common operations include slice and dice, drill down, roll up, and pivot.
Dice: The dice operation is a slice on more than two dimensions of a data cube (or
more than two consecutive slices).
Rollup: A rollup involves computing all of the data relationships for one or more
dimensions. To do this, a computational relationship or formula might be defined.
Other operations
Drill through: drills through the bottom level of the cube to its back-end relational tables (using SQL).
Fig A.2
Cube Operation:
The CUBE extension of GROUP BY computes aggregates over all combinations of the grouping columns. An illustrative SQL form (the column names item, city, year and amount are hypothetical):
SELECT item, city, year, SUM(amount)
FROM SALES
GROUP BY CUBE (item, city, year)
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing additional dimensions.
The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
Fig: drill-down, roll-up, slice and dice examples.
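For illustration, the four operations can be sketched in Python on a small in-memory cube; the dimension names and values below are hypothetical.

from collections import defaultdict

# A tiny cube stored as {(city, month, item): units_sold}; the data is made up.
cube = {
    ("Pune",   "Jan", "PC"):    10,
    ("Pune",   "Jan", "Phone"): 25,
    ("Pune",   "Feb", "PC"):     7,
    ("Mumbai", "Jan", "PC"):    15,
    ("Mumbai", "Feb", "Phone"): 30,
}
DIMS = ("city", "month", "item")

def roll_up(cube, keep_dims):
    """Aggregate away every dimension not listed in keep_dims."""
    idx = [DIMS.index(d) for d in keep_dims]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_(cube, dim, value):
    """Select a single value on one dimension (a slice)."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == value}

def dice(cube, selections):
    """Select on two or more dimensions at once (a dice);
    selections maps dimension name -> set of allowed values."""
    idx = {DIMS.index(d): vals for d, vals in selections.items()}
    return {k: v for k, v in cube.items()
            if all(k[i] in vals for i, vals in idx.items())}

print(roll_up(cube, ["city"]))                         # roll-up to city totals
print(slice_(cube, "month", "Jan"))                    # slice: January only
print(dice(cube, {"city": {"Pune"}, "item": {"PC"}}))  # dice: Pune PCs
# Drill-down is the reverse navigation: starting from roll_up(cube, ["city"])
# and returning to the detailed cube (or adding the month dimension back).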
Demonstrations 4
Theory:
The simplest way to view a multidimensional data model is as a cube. The table at the left contains detailed sales data by product, market and time. The cube on the right associates sales numbers (units sold) with dimensions (product type, market and time), with the unit variables organized as cells in an array.
This cube can be expanded to include another array (price), which can be associated with all or only some dimensions.
As the number of dimensions increases, the number of cube cells increases exponentially. Dimensions are hierarchical in nature; for example, the time dimension may contain hierarchies for year, quarter, month, week and day, and the GEOGRAPHY dimension may contain country, state, city, etc.
Fig A.0
Fig A.1
a) Logical Cubes
Logical cubes provide a means of organizing measures that have the same shape,
that is, they have the exact same dimensions. Measures in the same cube have the
same relationships to other logical objects and can easily be analyzed and
displayed together.
b) Logical Measures
Measures populate the cells of a logical cube with the facts collected about
business operations. Measures are organized by dimensions, which typically
include a Time dimension.
Measures are static and consistent while analysts are using them to inform their
decisions. They are updated in a batch window at regular intervals: weekly, daily,
or periodically throughout the day. Many applications refresh their data by adding
periods to the time dimension of a measure, and may also roll off an equal number
of the oldest time periods. Each update provides a fixed historical record of a
particular business activity for that interval. Other applications do a full rebuild of
their data rather than performing incremental updates.
The base level of data determines the most detailed questions that analysts can answer. For example, Time could be rolled up into months, Customer could be rolled up into regions, and Product could be rolled up into items (such as dresses) with an attribute of color. However, this level of aggregate data could not answer the question: At what time of day are women most likely to place an order? An important decision is the extent to which the data has been pre-aggregated before being loaded into a data warehouse.
c) Logical Dimensions
Dimensions contain a set of unique values that identify and categorize data. They
form the edges of a logical cube, and thus of the measures within the cube.
Because measures are typically multidimensional, a single value in a measure must
be qualified by a member of each dimension to be meaningful. For example, the
Sales measure has four dimensions: Time, Customer, Product, and Channel. A
particular Sales value (43,613.50) only has meaning when it is qualified by a
specific time period (Feb01), a customer (Warren Systems), a product (Portable
PCs), and a channel (Catalog).
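As a small illustration, a single cell of such a measure can be sketched in Python as a value keyed by one member of each dimension, using the example values above:

# One cell of the four-dimensional Sales measure, keyed by
# (time period, customer, product, channel).
sales = {
    ("Feb-01", "Warren Systems", "Portable PCs", "Catalog"): 43613.50,
}

# The value is meaningful only when every dimension is specified:
print(sales[("Feb-01", "Warren Systems", "Portable PCs", "Catalog")])  # 43613.5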
d) Logical Hierarchies and Levels
Each level represents a position in the hierarchy. Each level above the base (or most detailed) level contains aggregate values for the levels below it. The members at different levels have a one-to-many parent-child relation. For example, Q1-02 and Q2-02 are the children of 2002; thus 2002 is the parent of Q1-02 and Q2-02.
Suppose a data warehouse contains snapshots of data taken three times a day, that
is, every 8 hours. Analysts might normally prefer to view the data that has been
aggregated into days, weeks, quarters, or years. Thus, the Time dimension needs a
hierarchy with at least five levels.
Similarly, a sales manager with a particular target for the upcoming year might
want to allocate that target amount among the sales representatives in his territory;
the allocation requires a dimension hierarchy in which individual sales
representatives are the child values of a particular territory.
e) Logical Attributes
An attribute provides additional information about the data. Some attributes are
used for display. For example, you might have a product dimension that uses Stock
Keeping Units (SKUs) for dimension members. The SKUs are an excellent way of
uniquely identifying thousands of products, but are meaningless to most people if
they are used to label the data in a report or graph. You would define attributes for
the descriptive labels.
You might also have attributes like colors, flavors, or sizes. This type of attribute
can be used for data selection and answering questions such as: Which colors were
the most popular in women's dresses in the summer of 2002? How does this
compare with the previous summer?
Time attributes can provide information about the Time dimension that may be
useful in some types of analysis, such as identifying the last day or the number of
days in each time period.
In Oracle Database, you can define a logical multidimensional model for relational
tables using the OLAP Catalog or AWXML. The metadata distinguishes level
columns from attribute columns in the dimension tables and specifies the
hierarchical relationships among the levels. It identifies the various measures that
are stored in columns of the fact tables and aggregation methods for the measures.
And it provides display names for all of these logical objects.
Fig A.2
a) Dimension Tables
A star schema stores all of the information about a dimension in a single table.
Each level of a hierarchy is represented by a column or column set in the
dimension table. A dimension object can be used to define the hierarchical
relationship between two columns (or column sets) that represent two levels of a
hierarchy; without a dimension object, the hierarchical relationships are defined
only in metadata. Attributes are stored in columns of the dimension tables.
b) Fact Tables
Measures are stored in fact tables. Fact tables contain a composite primary key,
which is composed of several foreign keys (one for each dimension table) and a
column for each measure that uses these dimensions.
c) Materialized Views
Aggregate data is calculated on the basis of the hierarchical relationships defined in
the dimension tables. These aggregates are stored in separate tables, called
summary tables or materialized views. Oracle provides extensive support for
materialized views, including automatic refresh and query rewrite.
Queries can be written either against a fact table or against a materialized view. If a
query is written against the fact table that requires aggregate data for its result set,
the query is either redirected by query rewrite to an existing materialized view, or
the data is aggregated on the fly.
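The query-rewrite idea can be sketched in Python as follows: a query first looks for a precomputed summary (playing the role of a materialized view) and only aggregates the detail rows on the fly when no suitable summary exists. The names and values below are hypothetical.

# Detail-level fact rows: (product, month, amount); the data is made up.
fact_rows = [
    ("PC", "Jan", 100.0), ("PC", "Feb", 120.0),
    ("Phone", "Jan", 80.0), ("Phone", "Feb", 90.0),
]

# A precomputed summary playing the role of a materialized view:
# total amount per product.
summary_by_product = {"PC": 220.0, "Phone": 170.0}

def total_for_product(product):
    # "Query rewrite": use the precomputed aggregate if it exists...
    if product in summary_by_product:
        return summary_by_product[product]
    # ...otherwise aggregate the detail rows on the fly.
    return sum(amount for p, _, amount in fact_rows if p == product)

print(total_for_product("PC"))      # 220.0, served from the summary
print(total_for_product("Tablet"))  # 0, computed on the fly (no matching rows)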
In an analytic workspace, the cube shape also represents the physical storage of
multidimensional measures, in contrast with two-dimensional relational tables. An
advantage of the cube shape is that it can be rotated: there is no one right way to
manipulate or view the data. This is an important part of multidimensional data
storage, calculation, and display, because different analysts need to view the data in
different ways. For example, if you are the Sales Manager for the Pacific Rim, then
you need to look at the data differently from a product manager or a financial
analyst.
Assume that a company collects data on sales. The company maintains records that
quantify how many of each product was sold in a particular sales region during a
specific time period. You can visualize the sales measure as the cube shown in fig
A.3.
Fig A.3 compares the sales of various products in different cities for January 2001 (shown) and February 2001 (not shown). This view of the data might be used to identify products that are performing poorly in certain markets. Fig A.4 shows sales of various products during a four-month period in Rome (shown) and Tokyo (not shown). This view of the data is the basis for trend analysis.
A cube shape is three-dimensional. Of course, measures can have many more than three dimensions, but three dimensions are the maximum number that can be represented pictorially. Additional dimensions are pictured with additional cube shapes. For clarity, these objects are shown only once in the diagram. Variables and formulas can have any number of dimensions; three are shown here.
Dimensions have several intrinsic characteristics that are important for data analysis; in particular, hierarchical dimensions support rolling data up and down their levels, and a Time dimension supports time-series calculations, for example an expression such as lagdif(sales, 1, time), which returns the difference between the current value of sales and its value one period earlier.
Not all data is hierarchical in nature, however, and you can create data dimensions
that do not have levels. A line item dimension is one such dimension, and the
relationships among its members require a model rather than a multilevel hierarchy.
The extensive data modeling subsystem available in analytic workspaces enables
you to create both simple and complex models, which can be solved alone or in
conjunction with aggregation methods.
No special physical relationship exists among variables that share the same
dimensions. However, a logical relationship exists because, even though they store
different data that may be a different data type, they are identical containers.
Variables that have identical dimensions compose a logical cube.
If you change a dimension, such as adding new time periods to the Time
dimension, then all variables dimensioned by Time are automatically changed to
include these new time periods, even if the other variables have no data for them.
Variables that share dimensions (and thus are contained by the same logical cube)
can also be manipulated together in a variety of ways, such as aggregation,
allocation, modeling, and numeric calculations. This type of calculation is easy and fast in an analytic workspace, while the equivalent single-row calculation in a relational schema can be quite difficult.
Preaggregated data is stored in a compact format in the same container as the base-
level data, and the performance impact of aggregating data on the fly is negligible
when the aggregation rules have been defined according to known good methods.
If aggregate data needed for the result set is stored in the variable, then it is simply
retrieved. If the aggregate data does not exist, then it is calculated on the fly.
Formulas can also be used to calculate other results such as ratios, differences, moving totals, and averages on the fly.
Demonstrations 5
Theory:
Specialization: An entity set may include sub groupings of entities that are distinct
in some way from other entities in the set. For instance, a subset of entities within
an entity set may have attributes that are not shared by all the entities in the entity
set. The ER model provides a means for representing these distinctive entity
groupings. Consider an entity set person, with attributes name, street, and city. A
person may be further classified as one of the following:
• customer
• employee
Each of these person types is described by a set of attributes that includes all the
attributes of entity set person plus possibly additional attributes. For example,
customer entities may be described further by the attribute customer-id, whereas employee entities may be described further by the attributes employee-id and salary.
The process of designating sub groupings within an entity set is called
specialization. The specialization of person allows us to distinguish among persons
according to whether they are employees or customers.
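As a small illustration, specialization maps naturally onto subclassing; the following Python sketch uses hypothetical attribute values:

class Person:
    """Higher-level entity set: attributes shared by all persons."""
    def __init__(self, name, street, city):
        self.name = name
        self.street = street
        self.city = city

class Customer(Person):
    """Specialization of Person with an additional customer_id attribute."""
    def __init__(self, name, street, city, customer_id):
        super().__init__(name, street, city)
        self.customer_id = customer_id

class Employee(Person):
    """Specialization of Person with employee_id and salary attributes."""
    def __init__(self, name, street, city, employee_id, salary):
        super().__init__(name, street, city)
        self.employee_id = employee_id
        self.salary = salary

c = Customer("Asha", "MG Road", "Pune", customer_id="C-101")
e = Employee("Ravi", "FC Road", "Pune", employee_id="E-17", salary=50000)
print(isinstance(c, Person), isinstance(e, Person))  # True True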
Demonstrations 6
Theory:
Association rule mining is defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side, LHS) and the consequent (right-hand side, RHS) of the rule, respectively.
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in Fig A.1 below. An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if milk and bread are bought, customers also buy butter.
To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence.
The support supp(X) of an itemset X is defined as the proportion of transactions in the database which contain X. In the example database below, the itemset {milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions.
The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      1       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0
Fig A.1
The lift of a rule is defined as lift(X ⇒ Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), i.e. the ratio of the observed support to that expected if X and Y were independent (equivalently, the ratio of the observed confidence to the confidence expected by chance). The example rule {milk, bread} ⇒ {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.
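A quick Python check of these numbers on the five transactions of Fig A.1:

# Transactions from Fig A.1, written as sets of item names.
transactions = [
    {"milk", "bread"},
    {"bread", "butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"milk", "bread"}, {"butter"}
confidence = support(X | Y) / support(X)
lift = support(X | Y) / (support(X) * support(Y))
print(support(X | Y), confidence, round(lift, 2))  # 0.2 0.5 1.25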
Many algorithms for generating association rules have been presented over time. Some well-known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. Another step needs to be done afterwards to generate rules from the frequent itemsets found in a database.
ALGORITHM:
Association rule mining aims to find the association rules that satisfy the predefined minimum support and confidence for a given database. The problem is usually decomposed into two subproblems: one is to find those itemsets whose occurrences exceed the minimum support (the frequent, or large, itemsets); the other is to generate rules from those frequent itemsets under the minimum confidence constraint.
Apriori builds the candidate k-itemsets from the frequent (k−1)-itemsets in two steps:
• Join Step: the candidate set Ck is generated by joining Lk−1 with itself.
• Prune Step: any candidate containing a (k−1)-subset that is not frequent is removed, because no superset of an infrequent itemset can be frequent.
Apriori Pseudocode:
Apriori(T, ε) takes the transaction database T and the minimum support threshold ε and returns the frequent itemsets; a Python sketch is given after the example below.
EXAMPLE:
A large supermarket tracks sales data by SKU (item), and thus is able to know what items are typically purchased together. Apriori is a moderately efficient way to build a list of frequently purchased item sets from this data. Let the database of transactions consist of the sets
T1:{1,2,3,4},
T2: {2,3,4},
T3: {2,3},
T4:{1,2,4},
T5:{1,2,3,4}, and
T6: {2,4}.
Each number corresponds to a product such as "butter" or "water". The first step of Apriori is to count up the frequencies, called the supports, of each member item separately:
Item   Support
1      3
2      6
3      4
4      5
With a minimum support of 3, all four items are frequent. The next step is to generate the candidate pairs of frequent items and count their supports:
Item    Support
{1,2}   3
{1,3}   2
{1,4}   3
{2,3}   4
{2,4}   5
{3,4}   3
This is counting up the occurrences of each of those pairs in the database. Since minsup = 3, the pair {1,3} (support 2) is not frequent, so we do not need to generate any 3-sets involving {1,3}: because it is not frequent, no superset of it can possibly be frequent. Counting the remaining candidate 3-sets gives:
Item Support
{1,2,4} 3
{2,3,4} 3
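The following Python sketch of the join-and-prune idea reproduces the supports above for the six transactions (it is a simple, non-optimized rendering, not a production Apriori implementation):

from itertools import combinations

transactions = [
    {1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4},
]
min_support = 3  # absolute support count, as in the example

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets.
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support]
    frequent = {s: sum(s <= t for t in transactions) for s in current}
    k = 2
    while current:
        # Join step: build candidate k-itemsets from frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count supports and keep the frequent candidates.
        current = []
        for c in candidates:
            support = sum(c <= t for t in transactions)
            if support >= min_support:
                frequent[c] = support
                current.append(c)
        k += 1
    return frequent

for itemset, support in sorted(apriori(transactions, min_support).items(),
                               key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), support)
# The output includes {1,2}:3, {1,4}:3, {2,3}:4, {2,4}:5, {3,4}:3,
# {1,2,4}:3 and {2,3,4}:3, matching the tables above.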
Demonstrations 7
Theory:
Classification is a data mining function that assigns items in a collection to target
categories or classes. The goal of classification is to accurately predict the target
class for each case in the data. For example, a classification model could be used to
identify loan applicants as low, medium, or high credit risks.
A classification task begins with a data set in which the class assignments are
known. For example, a classification model that predicts credit risk could be
developed based on observed data for many loan applicants over a period of time.
In addition to the historical credit rating, the data might track employment history,
home ownership or rental, years of residence, number and type of investments, and
so on. Credit rating would be the target, the other attributes would be the
predictors, and the data for each customer would constitute a case.
The simplest type of classification problem is binary classification, in which the target attribute has only two possible values: for example, high credit rating or low credit rating. Multiclass targets have more than two values: for example, low, medium, high, or unknown credit rating.
Classification models are tested by comparing the predicted values to known target
values in a set of test data. The historical data for a classification project is typically
divided into two data sets: one for building the model; the other for testing the
model.
Common classification algorithms include:
• Decision Tree
• Naive Bayes
EXAMPLE:
If you enter the temperature of the day, the output is the class (for example cold, mild or hot) to which that temperature belongs, as the sketch below illustrates.
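A minimal Python sketch of such a classifier; the class names and temperature thresholds are hypothetical (fixed by hand, not learned from data):

def classify_temperature(temp_celsius):
    """Assign a day's temperature to a class using fixed, illustrative thresholds."""
    if temp_celsius < 15:
        return "cold"
    elif temp_celsius < 28:
        return "mild"
    else:
        return "hot"

for t in (8, 22, 35):
    print(t, "->", classify_temperature(t))  # cold, mild, hot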
Demonstrations 8
Theory:
Cluster analysis or clustering is the assignment of a set of observations into subsets
(called clusters) so that observations in the same cluster are similar in some sense.
Clustering is a method of unsupervised learning, and a common technique for
statistical data analysis used in many fields, including machine learning, data
mining, pattern recognition, image analysis and bioinformatics.
Types of clustering
Data clustering algorithms can be hierarchical. Hierarchical algorithms find
successive clusters using previously established clusters. These algorithms can be
either agglomerative ("bottomup") or divisive ("topdown"). Agglomerative
algorithms begin with each element as a separate cluster and merge them into
successively larger clusters. Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but they can also be used as divisive algorithms in hierarchical clustering.
Density-based algorithms regard a cluster as a region in which the density of data objects exceeds a threshold. DBSCAN and OPTICS are two typical algorithms of this kind.
k-means clustering:
ALGORITHM:
• If k and d (the dimension) are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered.
Standard algorithm
The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community. Given an initial set of k means, the algorithm alternates between two steps:
Assignment step: Assign each observation to the cluster with the closest
mean (i.e. partition the observations according to the Voronoi diagram
generated by the means).
Update step: Calculate the new means to be the centroid of the observations
in the cluster.
We recall from the previous lecture, that clustering allows for unsupervised
learning. That is, the machine / software will learn on its own, using the data
(learning set), and will classify the objects into a particular class. For example, if our class (decision) attribute is tumor type and its values are malignant, benign, etc., these will be the classes. They will be represented by cluster1, cluster2, etc. However, the class information is never provided to the algorithm. The class information can be used later on, to evaluate how accurately the algorithm classified the objects.
Fig: data objects (A, B, D, ...) plotted by attribute values such as Texture and Blood Consumption.
After all the objects are plotted, we will calculate the distance between them, and
the ones that are close to each other – we will group them together, i.e. place them
in the same cluster.
Fig: the same objects grouped into Cluster 1 (benign) and Cluster 2 (malignant), plotted by Texture, Blood Consumption and Curvature.
EXAMPLE:
Problem: Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as ρ(a, b) = |x2 – x1| + |y2 – y1|.
Use the k-means algorithm to find the three cluster centers after the second iteration.

Iteration 1: initial means (2, 10), (5, 8), (1, 2)

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
A1 (2, 10)
A2 (2, 5)
A3 (8, 4)
A4 (5, 8)
A5 (7, 5)
A6 (6, 4)
A7 (1, 2)
A8 (4, 9)

First we list all points in the first column of the table above. The initial cluster centers (means) are (2, 10), (5, 8) and (1, 2), chosen randomly.
Next, we calculate the distance from the first point (2, 10) to each of the three means, using the distance function ρ(a, b) = |x2 – x1| + |y2 – y1|:

point (2, 10), mean1 (2, 10):
ρ(point, mean1) = |2 – 2| + |10 – 10|
= 0 + 0
= 0
point (2, 10), mean2 (5, 8):
ρ(point, mean2) = |5 – 2| + |8 – 10|
= 3 + 2
= 5

point (2, 10), mean3 (1, 2):
ρ(point, mean3) = |1 – 2| + |2 – 10|
= 1 + 8
= 9
So, in which cluster should the point (2, 10) be placed? The one where the point has the shortest distance to the mean – that is mean 1 (cluster 1), since the distance is 0.

Cluster 1    Cluster 2    Cluster 3
(2, 10)
Next, we go to the second point (2, 5) and calculate its distance to each of the three means:
ρ((2, 5), mean1 (2, 10)) = |2 – 2| + |10 – 5| = 5
ρ((2, 5), mean2 (5, 8)) = |5 – 2| + |8 – 5| = 6
ρ((2, 5), mean3 (1, 2)) = |1 – 2| + |2 – 5| = 4
So, in which cluster should the point (2, 5) be placed? The one where the point has the shortest distance to the mean – that is mean 3 (cluster 3), since the smallest distance is 4.
Repeating this for all eight points gives the complete table for Iteration 1, with means (2, 10), (5, 8) and (1, 2):

Point        Dist Mean 1   Dist Mean 2   Dist Mean 3   Cluster
A1 (2, 10)        0             5             9          1
A2 (2, 5)         5             6             4          3
A3 (8, 4)        12             7             9          2
A4 (5, 8)         5             0            10          2
A5 (7, 5)        10             5             9          2
A6 (6, 4)        10             5             7          2
A7 (1, 2)         9            10             0          3
A8 (4, 9)         3             2            10          2
In the corresponding plot, the initial cluster centers are shown as red dots and the new cluster centers as red crosses. The new means are the centroids of the clusters just formed: cluster 1 = {A1}, so its mean stays (2, 10); cluster 2 = {A3, A4, A5, A6, A8}, whose mean is (6, 6); and cluster 3 = {A2, A7}, whose mean is (1.5, 3.5).
That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2), Iteration 3, and so on until the means do not change anymore. In Iteration 2 we repeat the process from Iteration 1, this time using the new means just computed.
k-means is restricted to data in Euclidean spaces, but variants of k-means can be used for other types of data.
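A compact Python sketch of this exercise, using the Manhattan distance from the problem statement and the same initial centers:

points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = [(2, 10), (5, 8), (1, 2)]  # initial means: A1, A4, A7

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

for iteration in range(1, 3):  # two iterations, as in the exercise
    # Assignment step: each point goes to its nearest mean.
    clusters = [[] for _ in centers]
    for name, p in points.items():
        nearest = min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
        clusters[nearest].append(p)
    # Update step: each mean becomes the centroid of its cluster.
    centers = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
               for c in clusters]
    print("after iteration", iteration, "centers =", centers)
# After iteration 1 the centers are (2, 10), (6, 6) and (1.5, 3.5);
# after iteration 2 they are (3, 9.5), (6.5, 5.25) and (1.5, 3.5).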
k-medoids method:
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm and the medoid-shift algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize squared error, the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars).
ALGORITHM:
The most common realization of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, which works as follows:
1. Initialize: randomly select k of the n data points as the medoids.
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance).
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost and repeat steps 2 to 4 until there is no change in the medoids.
EXAMPLE:
Cluster the following data set of ten objects into two clusters, i.e. k = 2.
Consider a data set of ten objects as follows:

Object   X   Y
X1       2   6
X2       3   4
X3       3   8
X4       4   7
X5       6   2
X6       6   4
X7       7   3
X8       7   4
X9       8   5
X10      7   6

Step 1
Initialize k = 2 medoids; let us assume c1 = (3, 4) and c2 = (7, 4).
Calculating distance so as to associate each data object to its nearest medoid. Cost
is calculated using Minkowski distance metric with r = 1.
Costs to medoid c1 = (3, 4):
c1       data object   cost (distance)
(3, 4)   (2, 6)        3
(3, 4)   (3, 8)        4
(3, 4)   (4, 7)        4
(3, 4)   (6, 2)        5
(3, 4)   (6, 4)        3
(3, 4)   (7, 3)        5
(3, 4)   (8, 5)        6
(3, 4)   (7, 6)        6

Costs to medoid c2 = (7, 4):
c2       data object   cost (distance)
(7, 4)   (2, 6)        7
(7, 4)   (3, 8)        8
(7, 4)   (4, 7)        6
(7, 4)   (6, 2)        3
(7, 4)   (6, 4)        1
(7, 4)   (7, 3)        1
(7, 4)   (8, 5)        2
(7, 4)   (7, 6)        2
The cost is cost(x, c) = Σ (i = 1..d) |xi – ci|, where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2.
Total cost is the summation of the costs of the data objects from the medoids of their clusters, so here:
Total cost = {cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7))}
             + {cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3))
             + cost((7,4),(8,5)) + cost((7,4),(7,6))}
           = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)
           = 20
Step 2
Select a non-medoid object O′ at random; let us assume O′ = (7, 3).
The candidate medoids are now c1 = (3, 4) and O′ = (7, 3). If c1 and O′ are the new medoids, calculate the total cost involved, using the same formula as in Step 1:
Costs to medoid c1 = (3, 4):
c1       data object   cost (distance)
(3, 4)   (2, 6)        3
(3, 4)   (3, 8)        4
(3, 4)   (4, 7)        4
(3, 4)   (6, 2)        5
(3, 4)   (6, 4)        3
(3, 4)   (7, 4)        4
(3, 4)   (8, 5)        6
(3, 4)   (7, 6)        6

Costs to candidate medoid O′ = (7, 3):
O′       data object   cost (distance)
(7, 3)   (2, 6)        8
(7, 3)   (3, 8)        9
(7, 3)   (4, 7)        7
(7, 3)   (6, 2)        2
(7, 3)   (6, 4)        2
(7, 3)   (7, 4)        1
(7, 3)   (8, 5)        3
(7, 3)   (7, 6)        3
Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
So the cost of swapping medoid c2 for O′ is
S = current total cost – past total cost
  = 22 – 20
  = 2
Since S > 0, moving to O′ would be a bad idea, so the previous choice of medoids was good and the algorithm terminates here (i.e. there is no change in the medoids). Note that some data points may still shift from one cluster to another, depending on their closeness to the medoids.
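A short Python check of the two configuration costs computed above:

objects = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
           (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(medoids):
    """Sum of each non-medoid object's distance to its nearest medoid."""
    return sum(min(manhattan(o, m) for m in medoids)
               for o in objects if o not in medoids)

print(total_cost([(3, 4), (7, 4)]))  # 20 -- the initial medoids
print(total_cost([(3, 4), (7, 3)]))  # 22 -- after the candidate swap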
Demonstrations 9
Theory:
A naive Bayes classifier assumes that the presence (or absence) of a particular
feature of a class is unrelated to the presence (or absence) of any other feature. For
example, a fruit may be considered to be an apple if it is red, round, and about 4" in
diameter. Even though these features depend on the existence of the other features,
a naive Bayes classifier considers all of these properties to independently
contribute to the probability that this fruit is an apple.
An advantage of the naive Bayes classifier is that it requires a small amount of
training data to estimate the parameters (means and variances of the variables)
necessary for classification. Because independent variables are assumed, only the
variances of the variables for each class need to be determined and not the entire
covariance matrix.
The naive Bayes probabilistic model:
The model for a class C given features F1, ..., Fn is the conditional probability p(C | F1, ..., Fn). Using Bayes' theorem,
p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn)
In practice we are only interested in the numerator of that fraction, since the denominator does not depend on C and the values of the features Fi are given, so the denominator is effectively constant. The numerator is equivalent to the joint probability model
p(C, F1, ..., Fn)
which can be rewritten with the chain rule:
p(C, F1, ..., Fn)
= p(C) p(F1, ..., Fn | C)
= p(C) p(F1 | C) p(F2, ..., Fn | C, F1)
= p(C) p(F1 | C) p(F2 | C, F1) p(F3, ..., Fn | C, F1, F2)
= p(C) p(F1 | C) p(F2 | C, F1) p(F3 | C, F1, F2) ... p(Fn | C, F1, F2, ..., Fn−1)
Now the "naive" conditional independence assumptions come into play: assume that each feature Fi is conditionally independent of every other feature Fj for j ≠ i. This means that
p(Fi | C, Fj) = p(Fi | C)
and so the joint model simplifies to
p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C) ... p(Fn | C) = p(C) ∏i p(Fi | C)
This means that under the above independence assumptions, the conditional distribution over the class variable C can be expressed as
p(C | F1, ..., Fn) = (1/Z) p(C) ∏i p(Fi | C)
where Z is a scaling factor that depends only on F1, ..., Fn. Models of this form are much more manageable, since they factor into a so-called class prior p(C) and independent probability distributions p(Fi | C). If there are k classes and if a model for each p(Fi | C = c) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k − 1) + n·r·k parameters. In practice, k = 2 (binary classification) and r = 1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is 2n + 1, where n is the number of binary features used for prediction.
In terms of a training database:
• D : a set of training tuples
• X : a tuple (x1, x2, x3, ..., xn)
• P(X | Ci) = ∏ (k = 1..n) P(xk | Ci)
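A minimal Python sketch of this product rule; the training tuples below are hypothetical (a real implementation would also add Laplace smoothing to avoid zero counts):

from collections import defaultdict

# Hypothetical training tuples: ({feature: value, ...}, class label).
training = [
    ({"age": "<=30",   "income": "high"},   "no"),
    ({"age": "<=30",   "income": "medium"}, "no"),
    ({"age": "31..40", "income": "high"},   "yes"),
    ({"age": ">40",    "income": "medium"}, "yes"),
    ({"age": ">40",    "income": "low"},    "yes"),
]

def train(data):
    class_counts = defaultdict(int)
    feature_counts = defaultdict(int)  # (class, feature, value) -> count
    for features, label in data:
        class_counts[label] += 1
        for f, v in features.items():
            feature_counts[(label, f, v)] += 1
    return class_counts, feature_counts

def predict(x, class_counts, feature_counts):
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / total                  # prior P(Ci)
        for f, v in x.items():              # product of P(xk | Ci)
            score *= feature_counts[(c, f, v)] / cc
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

counts = train(training)
print(predict({"age": "<=30", "income": "high"}, *counts))  # ('no', 0.2)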
EXAMPLE 1:
• P(h | D) : probability that customer D will buy our computer, given that we know his age and income
• P(h) : probability that any customer will buy our computer, regardless of age and income (the prior probability)
• P(D | h) : probability that the customer is 35 yrs old and earns $50,000, given that he has bought our computer (the likelihood)
• P(D) : probability that a person from our set of customers is 35 yrs old and earns $50,000
• Generally we want the most probable hypothesis given the training data (the maximum a posteriori hypothesis):
hMAP = argmax over h of P(h | D) = argmax over h of P(D | h) P(h) / P(D)
2 30 High No Average No
7 35 Medium No Good No
8 25 Low No Good No
9 28 High No Average No
• P(customer is 35 yrs & earns $50,000 | buys computer = yes) = 3/5 = 0.6
• P(customer is 35 yrs & earns $50,000 | buys computer = no) = 1/5 = 0.2
• The predicted hypothesis is the one with the larger value, max(0.6, 0.2) = 0.6, i.e. buys computer = yes.
• Parameter estimation for naive Bayes models uses the method of maximum
likelihood
EXAMPLE 2:
ADVANTAGES:
The advantage of Bayesian spam filtering is that it can be trained on a per-user basis.
The spam that a user receives is often related to the online user's activities. For
example, a user may have been subscribed to an online newsletter that the user
considers to be spam. This online newsletter is likely to contain words that are
common to all newsletters, such as the name of the newsletter and its originating
email address. A Bayesian spam filter will eventually assign a higher probability
based on the user's specific patterns.
The legitimate emails a user receives will tend to be different. For example, in a
corporate environment, the company name and the names of clients or customers
will be mentioned often. The filter will assign a lower spam probability to emails
containing those names.
The word probabilities are unique to each user and can evolve over time with
corrective training whenever the filter incorrectly classifies an email. As a result,
Bayesian spam filtering accuracy after training is often superior to predefined
rules.
It can perform particularly well in avoiding false positives, where legitimate email
is incorrectly classified as spam. For example, if the email contains the word
"Nigeria", which is frequently used in Advance fee fraud spam, a predefined rules
filter might reject it outright. A Bayesian filter would mark the word "Nigeria" as a
probable spam word, but would take into account other important words that
usually indicate legitimate email. For example, the name of a spouse may strongly
indicate the email is not spam, which could overcome the use of the word
"Nigeria."
DISADVANTAGES:
Bayesian spam filtering is susceptible to Bayesian poisoning, a technique used by
spammers in an attempt to degrade the effectiveness of spam filters that rely on
Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails
with large amounts of legitimate text (gathered from legitimate news or literary
sources). Spammer tactics include insertion of random innocuous words that are
not normally associated with spam, thereby decreasing the email's spam score,
making it more likely to slip past a Bayesian spam filter.
Another technique used to try to defeat Bayesian spam filters is to replace text with
pictures, either directly included or linked. The whole text of the message, or some
part of it, is replaced with a picture where the same text is "drawn". The spam filter
is usually unable to analyze this picture, which would contain the sensitive words
like "Viagra". However, since many mail clients disable the display of linked
pictures for security reasons, the spammer sending links to distant pictures might
reach fewer targets. Also, a picture's size in bytes is bigger than the equivalent
text's size, so the spammer needs more bandwidth to send messages directly
including pictures. Finally, some filters are more inclined to decide that a message
is spam if it has mostly graphical contents.
A probably more efficient solution has been proposed by Google and is used by its Gmail email system: performing OCR (optical character recognition) on every mid-to-large-size image and analyzing the text inside.
Demonstrations 10
Theory:
Decision tree learning, used in data mining and machine learning, uses a decision
tree as a predictive model which maps observations about an item to conclusions about the item's target value. In these tree structures, leaves represent classifications
and branches represent conjunctions of features that lead to those classifications. In
decision analysis, a decision tree can be used to visually and explicitly represent
decisions and decision making. In data mining, a decision tree describes data but
not decisions; rather the resulting classification tree can be an input for decision
making. This page deals with decision trees in data mining.
Decision tree learning is a common method used in data mining. The goal is to
create a model that predicts the value of a target variable based on several input
variables. Each interior node corresponds to one of the input variables; there are
edges to children for each of the possible values of that input variable. Each leaf
represents a value of the target variable given the values of the input variables
represented by the path from the root to the leaf.
A tree can be "learned" by splitting the source set into subsets based on an attribute
value test. This process is repeated on each derived subset in a recursive manner
called recursive partitioning. The recursion is completed when the subset at a node
all has the same value of the target variable, or when splitting no longer adds value
to the predictions.
In data mining, trees can be described also as the combination of mathematical and
computational techniques to aid the description, categorisation and generalisation
of a given set of data.
The dependent variable, Y, is the target variable that we are trying to understand,
classify or generalise. The vector x is comprised of the input variables, x1, x2, x3
etc., that are used for that task.
Types of tree: classification trees (where the predicted outcome is the class to which the data belongs) and regression trees (where the predicted outcome is a real number).
Weather data
EXAMPLE:
• Outlook
1. Info = 0.693
2. Gain = 0.940 − 0.693 = 0.247
3. Split info = info([5,4,5]) = 1.577
4. Gain ratio = 0.247 / 1.577 = 0.156
Info = 5/14 × [−(2/5)log2(2/5) − (3/5)log2(3/5)] + 4/14 × [−(4/4)log2(4/4) − (0/4)log2(0/4)] + 5/14 × [−(3/5)log2(3/5) − (2/5)log2(2/5)] = 0.693
Gain = 0.940 − 0.693 = 0.247
Split info = 1.577
Gain ratio = 0.247 / 1.577 = 0.156
The attribute with the highest information gain is selected as the splitting attribute
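The numbers above can be checked with a short Python computation, using the Outlook class counts implied by the Info formula (sunny 2 yes/3 no, overcast 4 yes/0 no, rainy 3 yes/2 no, out of 9 yes/5 no overall):

from math import log2

def entropy(counts):
    """Entropy of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_all = entropy([9, 5])                     # ~0.940 for the full data set
partitions = [[2, 3], [4, 0], [3, 2]]          # Outlook = sunny, overcast, rainy
n = sum(sum(p) for p in partitions)            # 14 examples
info_outlook = sum(sum(p) / n * entropy(p) for p in partitions)   # ~0.693
gain = info_all - info_outlook                 # ~0.247
split_info = entropy([sum(p) for p in partitions])                # info([5,4,5]) ~1.577
gain_ratio = gain / split_info                 # ~0.156

print(round(info_all, 3), round(info_outlook, 3),
      round(gain, 3), round(split_info, 3), round(gain_ratio, 3))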
ADVANTAGES:
Amongst other data mining methods, decision trees have various advantages:
• Able to handle both numerical and categorical data. Other techniques are
usually specialised in analysing datasets that have only one type of variable.
Ex: relation rules can be used only with nominal variables while neural
networks can be used only with numerical variables.
• Robust. Performs well even if its assumptions are somewhat violated by the
true model from which the data were generated.
• Perform well with large data in a short time. Large amounts of data can be
analysed using personal computers in a time short enough to enable
stakeholders to take decisions based on its analysis.
DISADVANTAGES:
• The problem of learning an optimal decision tree is known to be NP-complete. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm, where locally optimal
decisions are made at each node. Such algorithms cannot guarantee to return
the globally optimal decision tree.
• There are concepts that are hard to learn because decision trees do not
express them easily, such as XOR, parity or multiplexer problems. In such
cases, the decision tree becomes prohibitively large. Approaches to solve the
problem involve either changing the representation of the problem domain
(known as propositionalisation) or using learning algorithms based on more
expressive representations (such as statistical relational learning or inductive
logic programming).
WARM-UP EXERCISES:
What is database?
What is DBMS?
What are the advantages of a database?
What is data model?
What is object oriented model?
What is an entity?
What is an entity type?
What is an entity set?
What is weak entity set?
What is relationship?
What is DDL?
What is DML?
What is normalization?
What is functional dependency?
What is 1st NF?
What is 2nd NF?
What is 3rd NF?
What is BCNF?
What is fully functional dependency?
What do you mean by aggregation and atomicity?
What are the different phases of a transaction?
What do you mean by a flat file database?
What is query?
What is the name of buffer where all commands are stored?
Are the results of the relational PRODUCT and JOIN operations the same?
What is slicing and dicing? Explain with real-time usage and the business reasons for its use.
Explain the flow of data starting with OLTP to OLAP, including staging, summary tables, facts and dimensions.
What is the difference between OLAP and OLTP?
What are the different tracing options available? Can tracing options be set for
individual transformations?
The marking patterns should be justifiable to the students without any ambiguity, and the teacher should see that students are not faced with unjust circumstances.
The assessment is done according to the directives of the Principal/ Vice-Principal/ Dean
Academics.