Data Mining
Tuesdays, Thursdays 13:00-14:20 LAS 3033
Fall Semester, 2014
_____________________________________________________________________________________________
THE BIG ASSIGNMENT
_____________________________________________________________________________________________
Name: _________________________________
1.
How is a data warehouse different from a database? How are they similar?
Answer:
Differences between a data warehouse and a database: A data warehouse is a repository of information
collected from multiple sources over a history of time, stored under a unified schema, and used for data
analysis and decision support, whereas a database is a collection of interrelated data that represents the
current status of the stored data. There could be multiple heterogeneous databases where the schema of
one database may not agree with the schema of another. A database system supports ad-hoc query and
on-line transaction processing. Additional differences are detailed in the Han & Kamber textbook
Section 3.1.1: Differences between Operational Databases Systems and Data Warehouses.
Similarities between a data warehouse and a database: Both are repositories of information, storing
huge amounts of persistent data.
2.
Define each of the following data mining functionalities: characterization, discrimination, association and
correlation analysis, classification, prediction, clustering, and evolution analysis. Give examples of each
data mining functionality, using a real-life database that you are familiar with.
Answer:
Characterization is a summarization of the general characteristics or features of a target class of data.
For example, a profile of all the university's first-year computing science students can be produced,
which may include such information as a high GPA and a large number of courses taken.
Discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes. For example, the general features of
students with high GPAs may be compared with the general features of students with low GPAs. The
resulting description could be a general comparative profile of the students, such as: 75% of the students
with high GPAs are fourth-year computing science students, while 65% of the students with low GPAs
are not.
Association is the discovery of association rules showing attribute-value conditions that occur
frequently together in a given set of data. For example, a data mining system may find association
rules like
major(X, "computing science") ⇒ owns(X, "personal computer") [support = 12%; confidence = 98%]
where X is a variable representing a student. The rule indicates that, of the students under study, 12%
(support) major in computing science and own a personal computer, and there is a 98% probability
(confidence, or certainty) that a student in this group owns a personal computer.
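For illustration (not part of the original answer), a minimal Python sketch of how support and confidence are computed from raw records; the student records below are made-up placeholders:

# Support and confidence for the rule
#   major(X, "computing science") => owns(X, "personal computer")
# computed over a tiny, made-up set of student records.
students = [
    {"major": "computing science", "owns_pc": True},
    {"major": "computing science", "owns_pc": True},
    {"major": "biology", "owns_pc": False},
    {"major": "computing science", "owns_pc": False},
]

n = len(students)
antecedent = [s for s in students if s["major"] == "computing science"]
both = [s for s in antecedent if s["owns_pc"]]

support = len(both) / n                   # P(antecedent AND consequent)
confidence = len(both) / len(antecedent)  # P(consequent | antecedent)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")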
Classification differs from prediction in that the former constructs a set of models (or functions) that
describe and distinguish data classes or concepts, whereas the latter builds a model to predict some
missing or unavailable, and often numerical, data values. Their similarity is that they are both tools for
prediction: Classification is used for predicting the class label of data objects and prediction is
typically used for predicting missing numerical data values.
Clustering analyzes data objects without consulting a known class label. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass
similarity. Each cluster that is formed can be viewed as a class of objects. Clustering can also facilitate
taxonomy formation, that is, the organization of observations into a hierarchy of classes that group
similar events together.
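For illustration (not part of the original answer), a minimal k-means sketch of this principle, grouping made-up 2-D points by repeatedly assigning each point to its nearest centroid:

import random

def kmeans(points, k, iters=10):
    centroids = random.sample(points, k)          # initial centroids: k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assign p to its nearest centroid
            i = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                            + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        for j, c in enumerate(clusters):          # recompute centroids as cluster means
            if c:
                centroids[j] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

points = [(1, 1), (1.2, 0.9), (5, 5), (5.1, 4.8), (9, 1)]
print(kmeans(points, k=2)[0])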
Data evolution analysis describes and models regularities or trends for objects whose behavior changes
over time. Although this may include characterization, discrimination, association, classification, or
clustering of time-related data, distinct features of such an analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.
3.
4.
What are the major challenges of mining a huge amount of data (such as billions of tuples) in comparison
with mining a small amount of data (such as a data set of a few hundred tuples)?
Answer:
One challenge to data mining regarding performance issues is the efficiency and scalability of data mining
algorithms. Data mining algorithms must be efficient and scalable in order to effectively extract
information from large amounts of data in databases within predictable and acceptable running times.
Another challenge is the parallel, distributed, and incremental processing of data mining algorithms. The
need for parallel and distributed data mining algorithms has been brought about by the huge size of many
databases, the wide distribution of data, and the computational complexity of some data mining methods.
Due to the high cost of some data mining processes, incremental data mining algorithms incorporate
database updates without the need to mine the entire data again from scratch.
5.
Suppose that the values for a given set of data are grouped into intervals. The intervals and corresponding
frequencies are as follows.
Age      Frequency
1-5      200
5-15     450
15-20    300
20-50    1500
50-80    700
80-110   44

Compute an approximate median value for the data.
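A short sketch (not part of the original answer key) of the grouped-data interpolation, median ≈ L1 + (N/2 - cum_freq_below) / freq_median x width:

intervals = [(1, 5, 200), (5, 15, 450), (15, 20, 300),
             (20, 50, 1500), (50, 80, 700), (80, 110, 44)]

N = sum(f for _, _, f in intervals)    # 3194 observations in total
half, below = N / 2, 0
for lo, hi, freq in intervals:
    if below + freq >= half:           # the median falls in this interval
        median = lo + (half - below) / freq * (hi - lo)
        break
    below += freq

print(round(median, 2))                # ~32.94, inside the 20-50 interval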
6.
In real-world data, tuples with missing values for some attributes are a common occurrence. Describe
various methods for handling this problem.
Answer:
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining task
involves classification or description). This method is not very effective unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of missing values per attribute
varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may not be a
reasonable task for large data sets with many missing values, especially when the value to be filled in is not
easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like "Unknown" or "-". If missing values are replaced by, say, "Unknown", then
the mining program may mistakenly think that they form an interesting concept, since they all have a value
in common, that of "Unknown". Hence, although this method is simple, it is not recommended.
(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal)
values: For example, suppose that the average income of AllElectronics customers is $28,000. Use this
value to replace any missing values for income.
(e) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal)
values, for all samples belonging to the same class as the given tuple: For example, if classifying
customers according to credit risk, replace the missing value with the average income value for customers
in the same credit risk category as that of the given tuple.
(f) Using the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other
customer attributes in the data set, we can construct a decision tree to predict the missing values for
income.
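For illustration (not part of the original answer), a minimal sketch of strategies (d) and (e); the customer records and income values are made-up placeholders:

customers = [
    {"income": 28000, "risk": "low"},
    {"income": 52000, "risk": "low"},
    {"income": None,  "risk": "low"},    # missing value to fill in
    {"income": 31000, "risk": "high"},
]

# (d) attribute mean over all tuples with a known value
known = [c["income"] for c in customers if c["income"] is not None]
overall_mean = sum(known) / len(known)

# (e) attribute mean restricted to tuples of the same class
def class_mean(risk):
    vals = [c["income"] for c in customers
            if c["risk"] == risk and c["income"] is not None]
    return sum(vals) / len(vals)

for c in customers:
    if c["income"] is None:
        c["income"] = class_mean(c["risk"])  # swap in overall_mean for (d)
print(customers)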
7.
8.
Assume a base cuboid of 10 dimensions contains only three base cells: (1) (a1, d2, d3, d4, ..., d9, d10), (2)
(d1, b2, d3, d4, ..., d9, d10), and (3) (d1, d2, c3, d4, ..., d9, d10), where a1 ≠ d1, b2 ≠ d2, and c3 ≠ d3. The
measure of the cube is count.
(a) How many nonempty cuboids will a full data cube contain?
(b) How many nonempty aggregate (i.e., nonbase) cells will a full cube contain?
(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube is
count ≥ 2?
(d) A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization of cell c (i.e., d is
obtained by replacing a * in c by a non-* value) and d has the same measure value as c. A closed cube is
a data cube consisting of only closed cells. How many closed cells are in the full cube?
Answer:
(a) How many nonempty cuboids will a full data cube contain?
2^10 = 1,024.
(b) How many nonempty aggregate (i.e., nonbase) cells will a full cube contain?
(1) Each base cell generates 2^10 - 1 nonempty aggregate cells, so before removing overlaps we have
3 x 2^10 - 3 cells in total.
(2) Among these, 3 x 2^7 cells are each counted twice (each pair of base cells shares 2^7 such cells), and
1 x 2^7 cells, namely those of the form (*, *, *, d4, ..., d9, d10), are each counted three times (all three
base cells share them). Thus we should remove in total 3 x 2^7 + 2 x 2^7 = 5 x 2^7 duplicate cells.
(3) This gives 3 x 2^10 - 5 x 2^7 - 3 = 19 x 2^7 - 3 = 2,429 nonempty aggregate cells.
(c) How many nonempty aggregate cells will an iceberg cube contain if the condition of the iceberg cube
is count ≥ 2?
Analysis: (1) (*, *, d3, d4, ..., d9, d10) has count 2, since it is generated by both cell 1 and cell 2; similarly,
we have (2) (*, d2, *, d4, ..., d9, d10): 2 and (3) (d1, *, *, d4, ..., d9, d10): 2; and (4) (*, *, *, d4, ..., d9,
d10): 3. Each of these four cells, with any subset of the seven remaining dimensions d4, ..., d10 further
generalized to *, still has count ≥ 2.
Therefore we have 4 x 2^7 = 2^9 = 512 such cells.
(d) A cell, c, is a closed cell if there exists no cell, d, such that d is a specialization of cell c (i.e., d is
obtained by replacing a * in c by a non-* value) and d has the same measure value as c. A closed cube
is a data cube consisting of only closed cells. How many closed cells are in the full cube?
There are seven closed cells, as follows:
(1) (a1, d2, d3, d4, ..., d9, d10): 1,
(2) (d1, b2, d3, d4, ..., d9, d10): 1,
(3) (d1, d2, c3, d4, ..., d9, d10): 1,
(4) (*, *, d3, d4, ..., d9, d10): 2,
(5) (*, d2, *, d4, ..., d9, d10): 2,
(6) (d1, *, *, d4, ..., d9, d10): 2, and
(7) (*, *, *, d4, ..., d9, d10): 3.
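The counts above can also be verified by brute force (not part of the original answer): enumerate every generalization of the three base cells and tally how many base cells each one covers.

from itertools import product
from collections import Counter

dims = 10
base = [("a1",) + tuple(f"d{i}" for i in range(2, 11)),
        ("d1", "b2") + tuple(f"d{i}" for i in range(3, 11)),
        ("d1", "d2", "c3") + tuple(f"d{i}" for i in range(4, 11))]

# For every generalization of a base cell, tally how many base cells it covers.
counts = Counter()
for cell in base:
    for mask in product((False, True), repeat=dims):  # True -> replace by '*'
        counts[tuple("*" if m else v for m, v in zip(mask, cell))] += 1

aggregates = [c for c in counts if "*" in c]
print(len(aggregates))                           # 2429 = 19 x 2^7 - 3
print(sum(counts[c] >= 2 for c in aggregates))   # 512 = 4 x 2^7 iceberg cells

def specializes(d, c):   # d is obtained from c by filling in some *'s
    return d != c and all(x == "*" or x == y for x, y in zip(c, d))

closed = [c for c, n in counts.items()
          if not any(counts[d] == n for d in counts if specializes(d, c))]
print(len(closed))                               # 7 closed cells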
9.
Often, the aggregate measure value of many cells in a large data cuboid is zero, resulting in a huge, yet
sparse, multidimensional matrix.
(a) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that
you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve
data from your structures.
(b) Modify your design in (a) to handle incremental data updates. Give the reasoning behind your new
design.
Answer:
(a) Design an implementation method that can elegantly overcome this sparse matrix problem. Note that
you need to explain your data structures in detail and discuss the space needed, as well as how to retrieve
data from your structures.
A way to overcome the sparse matrix problem is to use multiway array aggregation. (Note: this answer is
based on the paper by Zhao, Deshpande, and Naughton entitled "An array-based algorithm for simultaneous
multidimensional aggregates" in Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 159-170,
Tucson, Arizona, May 1997 [ZDN97].)
The first step consists of partitioning the array-based cube into chunks or subcubes that are small enough to
fit into the memory available for cube computation. Each of these chunks is first compressed to remove
cells that do not contain any valid data, and is then stored as an object on disk. For storage and retrieval
purposes, the chunkID + offset can be used as the cell address. The second step involves computing the
aggregates by visiting cube cells in an order that minimizes the number of times that each cell must be
revisited, thereby reducing memory access and storage costs. By first sorting and computing the planes of
the data cube according to their size in ascending order, a smaller plane can be kept in main memory while
fetching and computing only one chunk at a time for a larger plane.
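A minimal sketch of one possible "chunkID + offset" data structure (an assumed design, not taken from [ZDN97]): each chunk keeps only its nonempty cells in a dictionary keyed by the within-chunk offset, so space grows with the number of nonempty cells rather than with the full matrix. Incremental updates (part (b)) touch only the one chunk that contains the updated cell.

class ChunkedCube:
    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.chunks = {}   # chunkID -> {offset: measure value}

    def _address(self, coords):
        cid = tuple(c // self.chunk_size for c in coords)
        off = tuple(c % self.chunk_size for c in coords)
        return cid, off

    def put(self, coords, value):
        cid, off = self._address(coords)
        self.chunks.setdefault(cid, {})[off] = value

    def add(self, coords, delta):      # incremental update: one chunk affected
        self.put(coords, self.get(coords) + delta)

    def get(self, coords):             # absent cells read as 0
        cid, off = self._address(coords)
        return self.chunks.get(cid, {}).get(off, 0)

cube = ChunkedCube(chunk_size=10)
cube.put((3, 17, 25), 42)
cube.add((3, 17, 25), 1)
print(cube.get((3, 17, 25)), cube.get((0, 0, 0)))   # 43 0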
(b) Modify your design in (a) to handle incremental data updates. Give the reasoning behind your new
design.
In order to handle incremental data updates, the data cube is first computed as described in (a).
Subsequently, only the chunk that contains the cells with the new data is recomputed, without needing to
recompute the entire cube. This is because, with incremental updates, only one chunk at a time can be
affected. The recomputed value needs to be propagated to its corresponding higher-level cuboids. Thus,
incremental data updates can be performed efficiently.
10.
Suppose that a data relation describing students at Big University has been generalized to the generalized
relation R in the Table below.
Let the minimum support threshold be 20% and the minimum confidence threshold be 50% (at each of the
levels).
(a) Draw the concept hierarchies for status, major, age, and nationality.
(b) Write a program to find the set of strong multilevel association rules in R using uniform support for all
levels, for the following rule template,
∀S ∈ R, P(S, x) ∧ Q(S, y) ⇒ gpa(S, z) [s, c]
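The original answer for (b) is not included. As a hedged sketch, the simplified single-level miner below finds strong rules of the given template; run once per level of the concept hierarchies with the same (uniform) minimum support, it yields the multilevel rules. Since the generalized relation R is not reproduced in this document, the tuples below are made-up placeholders.

from itertools import combinations
from collections import Counter

# each tuple: (status, major, age, gpa) -- a made-up stand-in for R
R = [("undergrad", "science", "16...20", "good"),
     ("undergrad", "science", "16...20", "good"),
     ("undergrad", "arts", "21...25", "very good"),
     ("grad", "science", "26...30", "good"),
     ("grad", "business", "26...30", "excellent")]

MIN_SUP, MIN_CONF = 0.2, 0.5
n = len(R)

# count antecedent pairs (P, Q) and full (P, Q, gpa) triples
pairs, triples = Counter(), Counter()
for status, major, age, gpa in R:
    attrs = [("status", status), ("major", major), ("age", age)]
    for p, q in combinations(attrs, 2):
        pairs[(p, q)] += 1
        triples[(p, q, ("gpa", gpa))] += 1

for (p, q, g), cnt in triples.items():
    sup, conf = cnt / n, cnt / pairs[(p, q)]
    if sup >= MIN_SUP and conf >= MIN_CONF:
        print(f"{p} ^ {q} => {g} [s={sup:.0%}, c={conf:.0%}]")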
11.
The following table consists of training data from an employee database. The data have been generalized.
For example, "31...35" for age represents the age range of 31 to 35. For a given row entry, count
represents the number of data tuples having the values for department, status, age, and salary given in that
row.
department   status   age      salary      count
sales        senior   31...35  46K...50K   30
sales        junior   26...30  26K...30K   40
sales        junior   31...35  31K...35K   40
systems      junior   21...25  46K...50K   20
systems      senior   31...35  66K...70K   5
systems      junior   26...30  46K...50K   3
systems      senior   41...45  66K...70K   3
marketing    senior   36...40  46K...50K   10
marketing    junior   31...35  41K...45K   4
secretary    senior   46...50  36K...40K   4
secretary    junior   26...30  26K...30K   6
(a) How would you modify the basic decision tree algorithm to take into consideration the count of each
generalized data tuple (i.e., of each row entry)?
The count of each tuple must be integrated into the calculation of the attribute selection measure (such
as information gain), and the count must also be taken into consideration to determine the most common
class among the tuples. (A sketch of count-weighted information gain is given below.)
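A minimal sketch of one reasonable (assumed) implementation of the count-weighted selection measure; every statistic is computed from the count column rather than from the number of rows. Only the first three rows of the table are included for brevity.

from math import log2

# (department, status, age, salary, count) rows from the table above (abridged)
rows = [("sales", "senior", "31...35", "46K...50K", 30),
        ("sales", "junior", "26...30", "26K...30K", 40),
        ("systems", "junior", "21...25", "46K...50K", 20)]

def entropy(rows):
    total = sum(r[-1] for r in rows)
    by_class = {}
    for r in rows:                      # class label = status, weighted by count
        by_class[r[1]] = by_class.get(r[1], 0) + r[-1]
    return -sum(n / total * log2(n / total) for n in by_class.values())

def info_gain(rows, attr_idx):
    total = sum(r[-1] for r in rows)
    by_val = {}
    for r in rows:
        by_val.setdefault(r[attr_idx], []).append(r)
    remainder = sum(sum(x[-1] for x in part) / total * entropy(part)
                    for part in by_val.values())
    return entropy(rows) - remainder

print(info_gain(rows, 0))   # gain of splitting on department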
(b) Use your algorithm to construct a decision tree from the given data.
The resulting tree is:
salary = 26K...30K: junior
salary = 31K...35K: junior
salary = 36K...40K: senior
salary = 41K...45K: junior
salary = 46K...50K:
    department = secretary: junior
    department = sales: senior
    department = systems: junior
    department = marketing: senior
salary = 66K...70K: senior
(c) Given a data tuple with the values "systems", "26...30", and "46K...50K" for the attributes department,
age, and salary, respectively, what would a naive Bayesian classification of the status for the tuple be?
P(X|senior) = 0; P(X|junior) = 0.018. Thus, a naive Bayesian classification predicts junior.
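As a check (not part of the original answer), a short sketch that recomputes both likelihoods from the weighted table above:

rows = [("sales", "senior", "31...35", "46K...50K", 30),
        ("sales", "junior", "26...30", "26K...30K", 40),
        ("sales", "junior", "31...35", "31K...35K", 40),
        ("systems", "junior", "21...25", "46K...50K", 20),
        ("systems", "senior", "31...35", "66K...70K", 5),
        ("systems", "junior", "26...30", "46K...50K", 3),
        ("systems", "senior", "41...45", "66K...70K", 3),
        ("marketing", "senior", "36...40", "46K...50K", 10),
        ("marketing", "junior", "31...35", "41K...45K", 4),
        ("secretary", "senior", "46...50", "36K...40K", 4),
        ("secretary", "junior", "26...30", "26K...30K", 6)]

def likelihood(status, dept, age, salary):
    cls = [r for r in rows if r[1] == status]      # rows of the given class
    n = sum(r[-1] for r in cls)
    p = 1.0
    for idx, val in ((0, dept), (2, age), (3, salary)):
        p *= sum(r[-1] for r in cls if r[idx] == val) / n
    return p

print(likelihood("senior", "systems", "26...30", "46K...50K"))  # 0.0
print(likelihood("junior", "systems", "26...30", "46K...50K"))  # ~0.018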
(d) Design a multilayer feed-forward neural network for the given data. Label the nodes in the input and
output layers.
No standard answer; every feasible solution is correct. As stated in the Han & Kamber book, discrete-valued
attributes may be encoded such that there is one input unit per domain value. For the hidden layer, the
number of units should be smaller than the number of input units but larger than the number of output units.
(e) Using the multilayer feed-forward neural network obtained above, show the weight values after one
iteration of the backpropagation algorithm, given the training instance (sales, senior, 31...35, 46K...50K).
Indicate your initial weight values and biases and the learning rate used.
No standard answer; every feasible solution is correct.
12.
That is, each pixel's RGB vector gets replaced by that of the closest prototypical colour.
Take a six-pixel image with RGB values (0.22, 0.37, 0.8), (0.19, 0.8, 0.19), (0.6, 0.1, 0.05), (0.8, 0.3,
0.22), (0.7, 0.32, 0.8), (1, 0.4, 0.34). What is the bag-of-colours representation of the image? What is the
representation after norming by Euclidean length?
Answer:
The colour labels of the points are, in order, blue, green, red, red, magenta, and red. The bag-of-colours
representation would thus be

blue  green  magenta  red
1     1      1        3

before normalization and

blue  green  magenta  red
0.29  0.29   0.29     0.87

after normalization.
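For illustration, a sketch of the quantization and counting step; the prototype palette (pure blue, green, red, and magenta corners of RGB space) is an assumption, since the question's list of prototypical colours is not reproduced above:

prototypes = {"blue": (0, 0, 1), "green": (0, 1, 0),
              "red": (1, 0, 0), "magenta": (1, 0, 1)}
pixels = [(0.22, 0.37, 0.8), (0.19, 0.8, 0.19), (0.6, 0.1, 0.05),
          (0.8, 0.3, 0.22), (0.7, 0.32, 0.8), (1, 0.4, 0.34)]

def nearest(p):   # closest prototype by squared Euclidean distance
    return min(prototypes, key=lambda c: sum((a - b) ** 2
               for a, b in zip(p, prototypes[c])))

bag = {}
for p in pixels:
    label = nearest(p)
    bag[label] = bag.get(label, 0) + 1
norm = sum(v * v for v in bag.values()) ** 0.5       # Euclidean length
print(bag)                                           # blue 1, green 1, red 3, magenta 1
print({k: round(v / norm, 2) for k, v in bag.items()})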
13.
The UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/) currently maintains 298 data
sets as a service to the machine learning community. You may view all data sets through their searchable
interface. Their old web site (http://mlearn.ics.uci.edu/MLRepository.html) is still available. For a general
overview of the Repository, please visit http://archive.ics.uci.edu/ml/about.html.
For each of the following four datasets, use WEKA to induce a classifier (show all induced rules) using 3
different discretization methods per dataset, and compare the results (accuracy mean, accuracy standard
deviation).
http://archive.ics.uci.edu/ml/datasets/Iris
Iris Data Set: for flower classification.
http://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
Diabetes 130-US hospitals for years 1999-2008 Data Set: This data has been prepared to analyze factors
related to readmission as well as other outcomes pertaining to patients with diabetes.
http://archive.ics.uci.edu/ml/datasets/Wine+Quality
Wine Quality Data Set: modeling wine quality based on physicochemical tests.
http://archive.ics.uci.edu/ml/datasets/Tennis+Major+Tournament+Match+Statistics
Tennis Major Tournament Match Statistics Data Set: This is a collection of 8 files containing the match
statistics for both women and men at the four major tennis tournaments of the year 2013. Each file has 42
columns and a minimum of 76 rows.
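The assignment calls for WEKA; as a hedged stand-in for the same comparison, the sketch below uses scikit-learn's KBinsDiscretizer with three discretization strategies and reports 10-fold cross-validated accuracy mean and standard deviation on the Iris data:

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for strategy in ("uniform", "quantile", "kmeans"):   # 3 discretization methods
    model = make_pipeline(
        KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy),
        DecisionTreeClassifier(random_state=0))
    scores = cross_val_score(model, X, y, cv=10)
    print(f"{strategy:8s} accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")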