Lecture 3
Data mining derives its name from the similarities between searching for
valuable business information in a large database — for example, finding
linked products in gigabytes of store scanner data — and mining a mountain
for a vein of valuable ore. Both processes require either sifting through an
immense amount of material or intelligently probing it to find exactly where
the value resides.
Data Warehouse and Data Mining 2023-2024
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate
the interestingness of resulting patterns. Such knowledge can include
concept hierarchies, used to organize attributes or attribute values into
different levels of abstraction. Knowledge such as user beliefs, which can
be used to assess a pattern’s interestingness based on its unexpectedness,
may also be included. Other examples of domain knowledge are
additional interestingness constraints or thresholds, and metadata (e.g.,
describing data from multiple heterogeneous sources).
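A concept hierarchy of the kind mentioned above can be pictured as a simple lookup structure. The sketch below is a minimal, hypothetical example (the attribute, place names, and level names are chosen only for illustration) that rolls a city value up to higher levels of abstraction:

```python
# A minimal sketch of a concept hierarchy for a location attribute.
# The hierarchy city -> country -> continent is hypothetical,
# used only to illustrate levels of abstraction.
city_to_country = {
    "Baghdad": "Iraq",
    "Cairo": "Egypt",
    "Paris": "France",
}
country_to_continent = {
    "Iraq": "Asia",
    "Egypt": "Africa",
    "France": "Europe",
}

def generalize(city, level):
    """Roll a city value up to the requested abstraction level."""
    if level == "city":
        return city
    country = city_to_country[city]
    if level == "country":
        return country
    return country_to_continent[country]  # level == "continent"

print(generalize("Baghdad", "country"))  # Iraq
print(generalize("Paris", "continent"))  # Europe
```

In a real system such hierarchies are stored in the knowledge base and consulted by the mining engine when generalizing attribute values.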
Prepared by Dr. Dunia H. Hameed Page 27
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as characterization, association and
correlation analysis, classification, prediction, cluster analysis, outlier
analysis, and evolution analysis.
4. User Interface:
This module communicates between users and the data mining system,
allowing the user to interact with the system by specifying a data mining
query or task, providing information to help focus the search, and
performing exploratory data mining based on the intermediate data
mining results. In addition, this component allows the user to browse
database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
This step is concerned with how the data are generated and collected. In
general, there are two distinct possibilities. The first is when the data-
generation process is under the control of an expert (modeler): this
approach is known as a designed experiment. The second possibility is
when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely,
random data generation, is assumed in most data-mining applications.
Typically, the sampling distribution is completely unknown after data are
collected, or it is partially and implicitly given in the data-collection
procedure. It is very important, however, to understand how the
data-collection process affects the data's theoretical distribution, since
such a priori
knowledge can be very useful for modeling and, later, for the final
interpretation of results. Also, it is important to make sure that the data
used for estimating a model and the data used later for testing and
applying a model come from the same, unknown, sampling distribution.
If this is not the case, the estimated model cannot be successfully used in
a final application of the results.
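One practical way to check whether training and test data plausibly come from the same sampling distribution is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. The sketch below (standard library only; the sample data are synthetic and purely illustrative) computes this statistic for a split drawn from one distribution and for a deliberately shifted sample:

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical cumulative distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x.
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The maximum ECDF gap occurs at one of the observed values.
    return max(abs(ecdf(a, v) - ecdf(b, v)) for v in set(a + b))

random.seed(0)
same = [random.gauss(0, 1) for _ in range(500)]
train, test = same[:250], same[250:]                # split of one distribution
shifted = [random.gauss(2, 1) for _ in range(250)]  # a different distribution

print(ks_statistic(train, test))     # small: same underlying distribution
print(ks_statistic(train, shifted))  # large: distributions clearly differ
```

A small statistic for the train/test split and a large one for the shifted sample is exactly the situation described above: only in the first case can a model estimated on one part be trusted on the other.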
In the observational setting, data are usually "collected" from the existing
databases, data warehouses, and data marts. Data preprocessing usually
includes at least two common tasks: