Data Warehousing and Data Mining
ECAP446
Edited by: Sartaj Singh
Objectives
After studying this unit, you will be able to:
Introduction
Data warehouses generalize and consolidate data in multidimensional space. The construction of data warehouses involves data cleaning, data integration, and data transformation, and can be viewed as an important preprocessing step for data mining. Furthermore, data warehouses provide online analytical processing (OLAP) tools for the interactive analysis of multidimensional data at varied granularities, which facilitates effective data generalization and data mining. Numerous other data mining tasks, such as association, classification, prediction, and clustering, can be combined with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis and OLAP and will provide an effective platform for data mining. Data warehousing and OLAP therefore form an essential step in the knowledge discovery process. This chapter focuses on an overview of data warehouse and OLAP technology.
Example: A typical data warehouse is organized around major subjects, such as customer,
vendor, product, and sales rather than concentrating on the day-to-day operations and
transaction processing of an organization.
• The goal is to execute statistical queries and provide results that can influence
decision-making in favor of the Enterprise.
• These systems are thus called Online Analytical Processing Systems (OLAP).
1.2 The need for a Separate Data Warehouse
Because operational databases store huge amounts of data, you may wonder, “Why not perform
online analytical processing directly on such databases instead of spending additional time and
resources to construct a separate data warehouse?” A major reason for such a separation is to help
promote the high performance of both systems. An operational database is designed and tuned for known tasks and workloads, such as indexing and hashing on primary keys, searching for particular records, and optimizing "canned" queries. Data warehouse queries, on the other hand, are often complex. They involve the aggregation of large groups of data at summarized levels and may require
the use of special data organization, access, and implementation methods based on multidimensional
views. Processing OLAP queries in operational databases would substantially degrade the
performance of operational tasks.
Moreover, an operational database supports the concurrent processing of multiple transactions.
Concurrency control and recovery mechanisms (e.g., locking and logging) are required to ensure the
consistency and robustness of transactions. An OLAP query often needs read-only access to data
records for summarization and aggregation. Concurrency control and recovery mechanisms, if
applied for such OLAP operations, may jeopardize the execution of concurrent transactions and thus
substantially reduce the throughput of an OLTP system.
Finally, the separation of operational databases from data warehouses is based on the different
structures, contents, and uses of the data in these two systems. Decision support requires historic
data, whereas operational databases do not typically maintain historic data. In this context, the data
in operational databases, though abundant, are usually far from complete for decision making.
Decision support requires consolidation (e.g., aggregation and summarization) of data from
heterogeneous sources, resulting in high-quality, clean, integrated data. In contrast, operational
databases contain only detailed raw data, such as transactions, which need to be consolidated before
analysis. Because the two systems provide quite different functionalities and require different kinds
of data, it is presently necessary to maintain separate databases.
A recommended method for the development of data warehouse systems is to implement the
warehouse incrementally and evolutionarily, as shown in Figure 1.
First, a high-level corporate data model is defined within a reasonably short period (such as one or
two months) that provides a corporate-wide, consistent, integrated view of data among different
subjects and potential usages. This high-level model, although it will need to be refined in the further
development of enterprise data warehouses and departmental data marts, will greatly reduce future
integration problems. Second, independent data marts can be implemented in parallel with the
enterprise warehouse based on the same corporate data model set noted before. Third, distributed
data marts can be constructed to integrate different data marts via hub servers. Finally, a multitier
data warehouse is constructed where the enterprise warehouse is the sole custodian of all warehouse
data, which is then distributed to the various dependent data marts.
Table 1 Comparison of OLTP systems (operational databases) and OLAP systems (data warehouses)
Database design: an OLTP system adopts an entity-relationship (ER) data model and an application-oriented database design, whereas an OLAP system adopts either a star or snowflake model and a subject-oriented database design.
Focus: data in (OLTP) versus information out (OLAP).
Number of records accessed: tens (OLTP) versus millions (OLAP).
DB size: 100 MB to GB (OLTP) versus 100 GB to TB (OLAP).
Metric: transaction throughput (OLTP) versus query response time (OLAP).
What exactly is the difference between an operational database and a data warehouse? Explain with the help of a suitable example.
An entity-relationship data model is appropriate for on-line transaction processing, but a data warehouse requires a concise, subject-oriented schema that facilitates OLAP. Data warehouses and OLAP tools are based on a
multidimensional data model. This model views data in the form of a data cube. A data cube allows
data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts. In
general terms, dimensions are the perspectives or entities concerning which an organization wants
to keep records. Each dimension may have a table associated with it, called a dimension table, which
further describes the dimension. For example, a dimension table for an item may contain the
attributes item name, brand, and type. Dimension tables can be specified by users or experts, or
automatically generated and adjusted based on data distributions.
A multidimensional data model is typically organized around a central theme, such as sales. This
theme is represented by a fact table. Facts are numeric measures. Think of them as the quantities by
which we want to analyze relationships between dimensions. Examples of facts for a sales data
warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and the
amount budgeted. The fact table contains the names of the facts, or measures, as well as keys to each
of the related dimension tables. You will soon get a clearer picture of how this works when we look
at multidimensional schemas.
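As an illustration of how a fact table is summarized along dimensions (the rows, column names, and figures below are invented for the sketch and assume the pandas library), the following Python fragment aggregates the dollars_sold measure by time and item, which is exactly the kind of group-by behind a 2-D view such as Table 2:

    import pandas as pd

    # A miniature sales fact table: each row holds dimension values and one measure.
    sales = pd.DataFrame({
        "time":         ["Q1", "Q1", "Q2", "Q2", "Q1"],
        "item":         ["home entertainment", "computer", "computer", "phone", "phone"],
        "location":     ["Vancouver", "Vancouver", "Toronto", "Vancouver", "Toronto"],
        "dollars_sold": [605, 825, 968, 38, 14],   # measure in thousands (invented values)
    })

    # 2-D view: summarize dollars_sold by time and item, aggregating over all locations.
    view_2d = sales.pivot_table(index="time", columns="item",
                                values="dollars_sold", aggfunc="sum", fill_value=0)
    print(view_2d)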
Table 2 2-D View of Sales Data for AllElectronics According to time and item
Note: The sales are from branches located in the city of Vancouver. The measure displayed is
dollars sold (in thousands).
Table 3 3-D View of Sales Data for AllElectronics According to time, item, and location
Table 2 and Table 3 show the data at different degrees of summarization. In the data warehousing
research literature, a data cube like those shown in Figure 2 and Figure 3 is often referred to as a
cuboid. Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the
given dimensions. The result would form a lattice of cuboids, each showing the data at a different
level of summarization, or group-by. The lattice of cuboids is then referred to as a data cube. Figure
4 shows a lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier.
Figure 2 A 3-D data cube representation of the data in Table 3 according to time, item, and location.
Figure 3 A 4-D data cube representation of sales data, according to time, item, location, and supplier.
The cuboid that holds the lowest level of summarization is called the base cuboid. For example, the
4-D cuboid in Figure 3 is the base cuboid for the given time, item, location, and supplier dimensions.
Figure 2 is a 3-D (non-base) cuboid for time, item, and location summarized for all suppliers. The 0-
D cuboid, which holds the highest level of summarization, is called the apex cuboid. In our example,
this is the total sales, or dollars sold, summarized over all four dimensions. The apex cuboid is
typically denoted by all.
Figure 4 Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier.
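To make the lattice concrete, the short sketch below (an illustration, not part of the original unit) enumerates every cuboid, i.e., every subset of the dimensions time, item, location, and supplier, giving the 2^4 = 16 group-bys that make up the 4-D data cube of Figure 4:

    from itertools import combinations

    dimensions = ["time", "item", "location", "supplier"]

    # Every subset of the dimensions defines one cuboid (one possible group-by).
    cuboids = []
    for k in range(len(dimensions) + 1):
        for subset in combinations(dimensions, k):
            cuboids.append(subset)

    print(len(cuboids))   # 16 cuboids for 4 dimensions (2 to the power 4)
    print(cuboids[0])     # ()  -> the apex (0-D) cuboid, "all"
    print(cuboids[-1])    # ('time', 'item', 'location', 'supplier') -> the base cuboid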
Example: “Date” can be grouped into “day”, “month”, “quarter”, “year” or “week”,
which forms a lattice structure.
Example: Suppose an organization sells products throughout the world. The four major dimensions are time, item, location, and branch.
Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension tables of
the snowflake model may be kept in the normalized form to reduce redundancies. Such a table is
easy to maintain and saves storage space. However, this space savings is negligible in comparison to
the typical magnitude of the fact table. Furthermore, the snowflake structure can reduce the
effectiveness of browsing, since more joins will be needed to execute a query. Consequently, the
system performance may be adversely impacted. Hence, although the snowflake schema reduces
redundancy, it is not as popular as the star schema in data warehouse design.
A snowflake schema for AllElectronics sales is given in Figure 6. Here, the sales fact table is identical
to that of the star schema in Figure 5. The main difference between the two schemas is in the definition
of dimension tables. The single dimension table for an item in the star schema is normalized in the
snowflake schema, resulting in new item and supplier tables. For example, the item dimension table
now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is
linked to the supplier dimension table, containing supplier key and supplier type information.
Similarly, the single dimension table for location in the star schema can be normalized into two new
tables: location and city. The city key in the new location table links to the city dimension. Notice
that, when desirable, further normalization can be performed on province or state and country in the
snowflake schema shown in Figure 6.
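As a hedged sketch of what the normalization means for query processing (the table and column names loosely mirror the AllElectronics example but the rows are invented, and pandas is assumed), a query that groups sales by supplier type needs the extra item-to-supplier join in the snowflake schema:

    import pandas as pd

    # Snowflake-style dimension tables (illustrative rows only).
    item = pd.DataFrame({"item_key": [1, 2], "item_name": ["Sony-TV", "Ball-pen"],
                         "brand": ["Sony", "Bic"], "type": ["TV", "stationery"],
                         "supplier_key": [10, 20]})
    supplier = pd.DataFrame({"supplier_key": [10, 20],
                             "supplier_type": ["electronics", "office"]})
    sales = pd.DataFrame({"item_key": [1, 1, 2], "time_key": [100, 101, 100],
                          "dollars_sold": [1200, 800, 5]})

    # In a star schema the fact table joins directly to a denormalized item dimension;
    # in the snowflake schema the same query needs the additional item -> supplier join.
    result = (sales.merge(item, on="item_key")
                   .merge(supplier, on="supplier_key")
                   .groupby("supplier_type")["dollars_sold"].sum())
    print(result)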
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension
tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema
or a fact constellation.
A fact constellation schema is shown in Figure 7. This schema specifies two fact tables, sales, and
shipping. The sales table definition is identical to that of the star schema (Figure 5). The shipping
table has five dimensions, or keys—item key, time key, shipper key, from location, and to location—
and two measures—dollars cost and units shipped. A fact constellation schema allows dimension tables
to be shared between fact tables. For example, the dimension tables for time, item, and location are
shared between the sales and shipping fact tables.
Consider a concept hierarchy for the dimension location. City values for location include Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to the province or state to
which it belongs. For example, Vancouver can be mapped to British Columbia, and Chicago to Illinois. The
provinces and states can in turn be mapped to the country (e.g., Canada or the United States) to which they
belong. These mappings form a concept hierarchy for the dimension location, mapping a set of low-level
concepts (i.e., cities) to higher-level, more general concepts (i.e., countries). This concept hierarchy is
illustrated in
Figure 8.
1. Roll-up (drill-up):- The roll-up operation performs aggregation on a data cube either by climbing
up the hierarchy or by dimension reduction.
Suppose Delhi, New York, Patiala, and Los Angeles win 5, 2, 3, and 5 medals respectively. In this example, we roll up on Location from cities to countries.
2. Drill-down:- Drill-down is the reverse of roll-up. It navigates from a higher-level summary to more detailed, lower-level data. Drill-down can be performed either by stepping down a concept hierarchy for a dimension or by introducing an additional dimension.
3. Slice and dice:- The slice operation performs a selection on a single dimension of the given cube, resulting in a sub-cube. For example, slicing the cube where Medal = 5 gives:
Location Medal
Delhi 5
Los Angeles 5
The dice operation defines a sub-cube by performing a selection on two or more dimensions. For example, we may dice the cube where Medal = 3 or Location = New York.
4. Pivot:- Pivot is also known as rotate. It rotates the data axis to view the data from different
perspectives.
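The sketch below (invented medal data, pandas assumed) mimics the four operations on a small cube: roll-up aggregates cities to countries, drill-down returns to the city level, dice selects on two dimensions, and pivot rotates the axes:

    import pandas as pd

    medals = pd.DataFrame({
        "city":    ["Delhi", "Patiala", "New York", "Los Angeles"],
        "country": ["India", "India", "USA", "USA"],
        "year":    [2020, 2020, 2020, 2021],
        "medal":   [5, 3, 2, 5],
    })

    roll_up    = medals.groupby("country")["medal"].sum()            # city -> country
    drill_down = medals.groupby(["country", "city"])["medal"].sum()  # back down to cities
    dice       = medals[(medals["medal"] == 3) | (medals["city"] == "New York")]
    pivot      = medals.pivot_table(index="country", columns="year",
                                    values="medal", aggfunc="sum", fill_value=0)
    print(roll_up, drill_down, dice, pivot, sep="\n\n")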
• The top-down view allows the selection of the relevant information necessary for the data
warehouse. This information matches current and future business needs.
• The data source view exposes the information being captured, stored, and managed by
operational systems. This information may be documented at various levels of detail and
accuracy, from individual data source tables to integrated data source tables. Data sources
are often modeled by traditional data modeling techniques, such as the entity-relationship
model or CASE (computer-aided software engineering) tools.
• The data warehouse view includes fact tables and dimension tables. It represents the
information that is stored inside the data warehouse, including pre-calculated totals and
counts, as well as information regarding the source, date, and time of origin, added to
provide historical context.
• Finally, the business query view is the data perspective in the data warehouse from the
end-user’s viewpoint.
Building and using a data warehouse is a complex task because it requires business skills, technology
skills, and program management skills. Regarding business skills, building a data warehouse involves
understanding how systems store and manage their data, how to build extractors that transfer data
from the operational system to the data warehouse, and how to build warehouse refresh software
that keeps the data warehouse reasonably up-to-date with the operational system's data. Using a data
warehouse involves understanding the significance of the data it contains, as well as understanding
and translating the business requirements into queries that can be satisfied by the data warehouse.
Regarding technology skills, data analysts are required to understand how to make assessments from
quantitative information and derive facts based on conclusions from historic information in the data
warehouse. These skills include the ability to discover patterns and trends, extrapolate trends based
on history and look for anomalies or paradigm shifts, and to present coherent managerial
recommendations based on such analysis. Finally, program management skills involve the need to
interface with many technologies, vendors, and end-users to deliver results in a timely and cost-
effective manner.
Because data warehouse construction is a difficult and long-term task, its implementation scope
should be clearly defined. The goals of an initial data warehouse implementation should be
specific, achievable, and measurable. This involves determining the time and budget allocations, the
subset of the organization that is to be modeled, the number of data sources selected, and the
number and types of departments to be served.
Once a data warehouse is designed and constructed, the initial deployment of the warehouse
includes initial installation, roll-out planning, training, and orientation. Platform upgrades and maintenance must also be considered. Building a warehouse also requires a variety of tools for constructing and managing data warehouses, which include the access, integration, consolidation, and transformation of multiple heterogeneous databases, ODBC/OLEDB connections, Web access and service facilities, and
reporting and OLAP analysis tools. It is prudent to make the best use of the available infrastructures
rather than constructing everything from scratch.
OLAP-based exploration of multidimensional data: Effective data mining needs exploratory data
analysis. A user will often want to traverse through a database, select portions of relevant data, analyze them at different granularities, and present knowledge/results in different forms.
Multidimensional data mining provides facilities for mining on different subsets of data and at
varying levels of abstraction—by drilling, pivoting, filtering, dicing, and slicing on a data cube
and/or intermediate data mining results. This, together with data/knowledge visualization tools,
greatly enhances the power and flexibility of data mining.
Online selection of data mining functions: Users may not always know the specific kinds of
knowledge they want to mine. By integrating OLAP with various data mining functions,
multidimensional data mining provides users with the flexibility to select desired data mining
functions and swap data mining tasks dynamically.
The base cuboid contains all three dimensions, city, item, and year. It can return the total sales for
any combination of the three dimensions. The apex cuboid, or 0-D cuboid, refers to the case where
the group-by is empty. It contains the total sum of all sales. The base cuboid is the least generalized
(most specific) of the cuboids. The apex cuboid is the most generalized (least specific) of the cuboids
and is often denoted as all. If we start at the apex cuboid and explore downward in the lattice, this is
equivalent to drilling down within the data cube. If we start at the base cuboid and explore upward,
this is akin to rolling up.
Online analytical processing may need to access different cuboids for different queries. Therefore, it
may seem like a good idea to compute in advance all or at least some of the cuboids in a data cube.
A major challenge related to this pre-computation, however, is that the required storage space may
explode if all the cuboids in a data cube are pre-computed, especially when the cube has many
dimensions. The storage requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is referred to as the curse of
dimensionality.
“How many cuboids are there in an n-dimensional data cube?” If there were no hierarchies associated
with each dimension, then the total number of cuboids for an n-dimensional data cube, as we have
seen, is 2^n. However, in practice, many dimensions do have hierarchies. For example, time is usually
explored not at only one conceptual level (e.g., year), but rather at multiple conceptual levels such as
in the hierarchy "day < month < quarter < year." For an n-dimensional data cube, the total number of cuboids that can be generated is
Total number of cuboids = (L1 + 1) x (L2 + 1) x ... x (Ln + 1),
where Li is the number of levels associated with dimension i. One is added to Li to include the virtual
top level, all. This formula is based on the fact that, at most, one abstraction level in each dimension
will appear in a cuboid. For example, the time dimension as specified before has four conceptual
levels, or five if we include the virtual level all. If the cube has 10 dimensions and each dimension has five levels (including all), the total number of cuboids that can be generated is 5^10 ≈ 9.8 x 10^6. The size of each cuboid also depends on the cardinality (i.e., number of distinct values) of each dimension.
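A quick way to check the formula is to compute it directly; the minimal sketch below reproduces the 10-dimension, 5-level figure (it manipulates counts only, no warehouse data is involved):

    import math

    def total_cuboids(levels_per_dim):
        # Total cuboids = product of (Li + 1) over all dimensions,
        # where the +1 accounts for the virtual top level "all".
        return math.prod(L + 1 for L in levels_per_dim)

    # 10 dimensions, each with 4 conceptual levels (5 including "all"):
    print(total_cuboids([4] * 10))   # 5**10 = 9,765,625, i.e. about 9.8 * 10**6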
3. Partial materialization: Selectively compute a proper subset of the whole set of possible cuboids.
Alternatively, we may compute a subset of the cube, which contains only those cells that satisfy some
user-specified criterion, such as where the tuple count of each cell is above some threshold. We will
use the term subcube to refer to the latter case, where only some of the cells may be precomputed for
various cuboids. Partial materialization represents an interesting trade-off between storage space and
response time.
The partial materialization of cuboids or subcubes should consider three factors: (1) identify the
subset of cuboids or subcubes to materialize; (2) exploit the materialized cuboids or subcubes during
query processing; and (3) efficiently update the materialized cuboids or subcubes during load and
refresh. The selection of the subset of cuboids or subcubes to materialize should take into account the
queries in the workload, their frequencies, and their accessing costs. Alternatively, we can compute
an iceberg cube, which is a data cube that stores only those cube cells with an aggregate value (e.g.,
count) that is above some minimum support threshold. Another common strategy is to materialize a
shell cube. This involves precomputing the cuboids for only a small number of dimensions (e.g.,
three to five) of a data cube.
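As a rough sketch of the iceberg-cube idea (the records below are invented, and a real system would use a specialized algorithm rather than this brute-force loop), only those group-by cells whose count reaches the minimum support threshold are kept:

    from itertools import combinations
    from collections import Counter

    # Each record carries a value for three dimensions (invented data).
    records = [
        ("Q1", "computer", "Vancouver"),
        ("Q1", "computer", "Toronto"),
        ("Q1", "phone",    "Vancouver"),
        ("Q2", "computer", "Vancouver"),
    ]
    dims = ("time", "item", "location")
    min_count = 2   # a cell is materialized only if its count is at least 2

    cells = Counter()
    for rec in records:
        # Aggregate the record into every cuboid (every subset of the dimensions).
        for k in range(len(dims) + 1):
            for idx in combinations(range(len(dims)), k):
                cells[tuple((dims[i], rec[i]) for i in idx)] += 1

    iceberg_cube = {cell: c for cell, c in cells.items() if c >= min_count}
    print(iceberg_cube)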
Bitmap indexing is advantageous compared to hash and Tree indices. It is especially useful for low-
cardinality domains because comparison, join, and aggregation operations are then reduced to bit
arithmetic, which substantially reduces the processing time. Bitmap indexing leads to significant
reductions in space and input/output (I/O) since a string of characters can be represented by a single
bit. For higher-cardinality domains, the method can be adapted using compression techniques.
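A small sketch of the idea behind a bitmap index on a low-cardinality column (the column values are invented): each distinct value gets one bit vector, and a selection then reduces to bit arithmetic:

    # Column values for 8 rows of a low-cardinality attribute (invented data).
    city = ["Vancouver", "Toronto", "Vancouver", "Chicago",
            "Toronto", "Vancouver", "Chicago", "Toronto"]

    # Build one bit vector per distinct value: bit i is set if row i holds that value.
    bitmaps = {}
    for row_id, value in enumerate(city):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)

    # A query such as "city = Vancouver OR city = Chicago" becomes a bitwise OR.
    mask = bitmaps["Vancouver"] | bitmaps["Chicago"]
    matching_rows = [i for i in range(len(city)) if mask & (1 << i)]
    print(matching_rows)   # rows 0, 2, 3, 5, 6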
The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast,
join indexing registers the joinable rows of two relations from a relational database. For example, if
two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations,
respectively. Hence, the join index records can identify joinable tuples without performing costly join
operations. Join indexing is especially useful for maintaining the relationship between a foreign key
and its matching primary keys, from the joinable relation.
Figure 12 Linkages between a sales fact table and location and item dimension tables.
We defined a star schema for AllElectronics of the form “sales star [time, item, branch, location]:
dollars_sold = sum (sales_in_dollars).” An example of a join index relationship between the sales fact
table and the location and item dimension tables is shown in Figure 12. For example, the “Main
Street” value in the location dimension table joins with tuples T57, T238, and T884 of the sales fact
table. Similarly, the “Sony-TV” value in the item dimension table joins with tuples T57 and T459 of
the sales fact table. The corresponding join index tables are shown in Figure 13.
Figure 13 Join index tables based on the linkages between the sales fact table and the location and item.
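A brief sketch of building such a join index in Python (the record identifiers loosely mirror Figure 12 but the rows themselves are invented): the joinable (RID, SID) pairs are computed once, so later queries can find joinable tuples without re-running the join:

    # Location dimension records (RID, street) and sales fact records (SID, street).
    location = [("L1", "Main Street"), ("L2", "Lake Avenue")]
    sales    = [("T57", "Main Street"), ("T238", "Main Street"),
                ("T459", "Lake Avenue"), ("T884", "Main Street")]

    # Join index: pairs of record identifiers whose rows join on the street attribute.
    join_index = [(rid, sid)
                  for rid, street_l in location
                  for sid, street_s in sales
                  if street_l == street_s]

    print(join_index)
    # [('L1', 'T57'), ('L1', 'T238'), ('L1', 'T884'), ('L2', 'T459')]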
“Which of these four cuboids should be selected to process the query?” Finer-granularity data cannot be
generated from coarser-granularity data. Therefore, cuboid 2 cannot be used because the country is
a more general concept than province or state. Cuboids 1, 3, and 4 can be used to process the query
because (1) they have the same set or a superset of the dimensions in the query, (2) the selection clause
in the query can imply the selection in the cuboid, and (3) the abstraction levels for the item and
location dimensions in these cuboids are at a finer level than brand and province or state,
respectively.
“How would the costs of each cuboid compare if used to process the query?” Using cuboid 1 would
likely cost the most because both item name and city are at a lower level than the brand and province
or state concepts specified in the query. If there are not many year values associated with items in the
cube, but there are several item names for each brand, then cuboid 3 will be smaller than cuboid 4,
and thus cuboid 3 should be chosen to process the query. However, if efficient indices are available
for cuboid 4, then cuboid 4 may be a better choice. Therefore, some cost-based estimation is required
to decide which set of cuboids should be selected for query processing.
Summary
• A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile data
collection organized in support of management decision-making.
• A data warehouse contains back-end tools and utilities for populating and refreshing the
warehouse. These cover data extraction, data cleaning, data transformation, loading,
refreshing, and warehouse management.
• A multidimensional data model is typically used for the design of corporate data
warehouses and departmental data marts.
• A data cube consists of a lattice of cuboids, each corresponding to a different degree of
summarization of the given multidimensional data.
• Concept hierarchies organize the values of attributes or dimensions into gradual abstraction
levels. They are useful in mining at multiple abstraction levels.
• Full materialization refers to the computation of all of the cuboids in the lattice defining a
data cube.
• OLAP query processing can be made more efficient with the use of indexing techniques.
Keywords
Data Sources: Data sources refer to any electronic repository of information that contains data of
interest for management use or analytics.
Data Warehouse: It is a relational database that is designed for query and analysis rather than for
transaction processing.
Data Mart: Data marts contain a subset of organization-wide data that is valuable to specific groups
of people in an organization.
Dimensions: Dimensions contain a set of unique values that identify and categorize data.
Hierarchy: A hierarchy is a way to organize data at different levels of aggregation.
Star Schema: A star schema is a convention for organizing the data into dimension tables, fact tables,
and materialized views.
Self Assessment
1) OLTP stands for
(a) On-Line Transactional Processing
(b) On Link Transactional Processing
(c) On-Line Transnational Process
(d) On-Line Transactional Program
2) Data warehouse is
Review Questions
1. Describe materialized views with the help of a suitable example.
2. What are the differences between the three main types of data warehouse usage:
information processing, analytical processing, and data mining? Discuss the motivation
behind OLAP mining (OLAM).
3. Describe OLAP operations in the multi-dimensional data model.
4. “Concept hierarchies that are common to many applications may be predefined in the data
mining system”. Explain.
5. “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection
of data in support of management’s decision-making process.” Discuss.
6. Discuss the differences between operational database systems and data warehouses.
Answers: Self-Assessment
1. a 2. c
3. c 4. relational database
5. dependent data mart
Further Readings
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis, Fundamentals of
Data Warehouses, Publisher: Springer.
The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition.
Data Warehousing Fundamentals for IT Professionals.
Sam Anohory, Dennis Murray, Data Warehousing in the Real World, Addison Wesley, First
Edition, 2000.
https://www.javatpoint.com/data-mining-cluster-vs-data-warehousing
https://www.classcentral.com/subject/data-warehousing
https://www.tutorialspoint.com/dwh/dwh_data_warehousing.htm
https://www.oracle.com/in/database/what-is-a-data-warehouse
Objectives
After this unit you will be able to:
Introduction
Data mining refers to the extraction of hidden predictive information from large databases. Data mining techniques can deliver the benefits of automation on existing software and hardware platforms. Data mining tools can answer business questions that were traditionally too time-consuming to resolve.
Data
Data are the raw facts, figures, numbers, or text that can be processed by a computer. Today,
organizations are gathering massive and growing amounts of data in different formats and
different databases.
Data includes operational or transactional data (day-to-day operation data such as inventory data and on-line shopping data), non-operational data, and metadata, i.e., data about data.
Information
The patterns, relations, or associations among all types of data can provide information. For example, in a retail setting, analysis of sales transaction data can reveal which products are selling and when.
Knowledge
Information can be converted into knowledge.
Data collected in large data repositories can become "data tombs". Data tombs are converted into "golden nuggets" of knowledge with the use of data mining tools, as shown in Figure 1. Golden nuggets means "small but valuable facts". Data mining is also referred to as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
Figure: The data mining process (Business Goal, Identify Data, Prepare Data, Present Data).
3. Data selection: Data relevant to the analysis task are retrieved from the database.
4. Data transformation: Data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
5. Data mining: This is an essential step in which intelligent methods are applied to extract data patterns.
6. Pattern evaluation: This step identifies the truly interesting patterns representing knowledge based on some interestingness measures.
7. Knowledge presentation: Knowledge representation and visualization methods are used to
present the mined knowledge to the user.
Figure: Types of data sources for mining, including flat files, transactional databases, spatial databases, and multimedia databases.
Relational Databases: Data mining algorithms that use relational databases can be more flexible than data mining procedures written specifically for flat files, and they can benefit from SQL for data selection. Relational databases are among the most commonly available and richest information repositories. Applications: knowledge extraction, the relational online analytical processing (ROLAP) model, etc.
Data Warehouse: A data warehouse is defined as a collection of data integrated from numerous sources that supports queries and is useful for decision making. There are three types of data warehouses: the enterprise data warehouse, the data mart, and the virtual warehouse. Two approaches can be used to update data in a data warehouse: the query-driven approach and the update-driven approach. A data warehouse is generally modeled by a multidimensional data structure called a data cube. Each attribute or collection of attributes is represented by a dimension in the schema, and each fact table stores aggregated measures such as avg_sales along with the key values of all the dimension tables. Applications of a data warehouse are commercial decision making, data extraction, etc.
Multimedia databases: This database consists of audio, video, and other related files. Such multimedia information cannot be stored in traditional databases, so object-oriented databases are used to store this type of information. Multimedia data are characterized by high dimensionality, which makes data mining even more challenging. Mining them may require computer vision, computer graphics, image interpretation, and natural language processing. Various applications of multimedia databases are digital libraries, video-on-demand, news-on-demand, musical databases, etc.
Spatial Databases : A spatial database is a database that is optimized for storing and querying
data that represents objects defined in a geometric space. Most spatial databases allow the
representation of simple geometric objects such as points, lines, and polygons. It stores data in the
form of coordinates, topology, lines, polygons, etc. Applications of spatial databases are Maps,
Global positioning, etc.
Time series data : A temporal database is a database that has certain features that support time-
sensitive status for entries. Where some databases are considered current databases and only
support factual data considered valid at the time of use, a temporal database can establish at what
times certain entries are accurate. Data mining in time-series databases includes the study of trends and correlations between the values of different variables, as well as the prediction of trends. Typical applications of time-series data are stock market data and logging activities.
World Wide Web: The World Wide Web (WWW) is a collection of documents and resources, including audio, video, and text, which are identified by Uniform Resource Locators (URLs), accessed through web browsers, linked by HTML pages, and reachable via the network. It is the most heterogeneous and dynamic data repository. Data on the WWW are organized in interlinked documents, which can be audio, video, text, etc. Online shopping, job search, research, and studying are among the various applications of the WWW.
Predictive: Predictive tasks infer the values of unknown or missing attributes; based on earlier examples, the model estimates the absent features.
Example: Judging from the findings of a patient's medical examinations whether he is suffering from a particular disease.
Class/Concept Descriptions:
Data can be associated with classes or concepts. It can be useful to describe individual classes and concepts in summarized, descriptive, and yet precise terms. These class or concept descriptions are referred to as class/concept descriptions.
Data Characterization:
This refers to the summary of general characteristics or features of the class that is under the study.
For example, to study the characteristics of a software product whose sales increased by 15% two years ago, one can collect the data related to such products by running SQL queries.
Data Discrimination:
It compares the general features of the class under study with those of one or more contrasting classes. The output of this process can be represented in many forms, e.g., bar charts, curves, and pie charts.
Patterns that occur frequently in different transactions are termed as frequent patterns. There are
several kinds of frequent patterns, including item-sets, and subsequences. A frequent item-set
usually refers to a set of items that appear together frequently in a transactional data set, such as
pencil and eraser, which are frequently bought together in stationery stores by many customers.
The mining of frequent patterns leads to the detection of motivating relations and associations
within the data.
Association Analysis
Suppose, as a marketing manager, you would like to determine which items are frequently purchased together within the same transactions.
IF - THEN Rules
The IF part of the rule, where the condition is specified, is called the rule antecedent or precondition. The THEN part of the rule is called the rule consequent. The antecedent consists of one or more attribute tests, which are logically ANDed. The consequent consists of the class prediction.
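As a small illustration (the attribute names and values are invented for the sketch), an IF-THEN classification rule can be written directly as a function: the ANDed attribute tests form the antecedent and the returned label is the consequent:

    def buys_computer_rule(age: str, income: str) -> str:
        # IF age = "youth" AND income = "high" THEN buys_computer = "yes" (illustrative rule)
        if age == "youth" and income == "high":   # antecedent: ANDed attribute tests
            return "yes"                          # consequent: class prediction
        return "no"

    print(buys_computer_rule("youth", "high"))    # -> yes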
Decision Tree
A decision tree is a structure that represents the data in a hierarchical form comprising a root node, branches, and leaf nodes. Every non-leaf node represents a test condition on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label, which the classifier uses for prediction. The topmost node in the tree is the root node. The following decision tree is for the concept buys_computer; it indicates whether a customer of an organization is likely to buy a computer or not. Each internal node represents a test on an attribute, and each leaf node represents a class.
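A minimal sketch of training such a classifier, assuming scikit-learn is available; the tiny buys_computer-style data set and its attribute encodings are invented for illustration:

    from sklearn.tree import DecisionTreeClassifier

    # Toy training set: [age_code, income_code], with age 0=youth, 1=middle-aged, 2=senior
    # and income 0=low, 1=medium, 2=high (encodings invented for the example).
    X = [[0, 2], [1, 2], [2, 1], [2, 0], [0, 0], [1, 1]]
    y = ["no", "yes", "yes", "no", "no", "yes"]   # class label: buys_computer

    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X, y)                     # learn the test conditions at the internal nodes

    print(tree.predict([[1, 2]]))      # predict for a middle-aged, high-income customer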
Cluster Analysis
Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label. In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of "maximizing the intra-class similarity and minimizing the inter-class similarity".
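A brief sketch of clustering with k-means, assuming scikit-learn; the 2-D points and the number of clusters are chosen arbitrarily for illustration:

    from sklearn.cluster import KMeans

    # Unlabeled observations, e.g. customers described by two numeric attributes.
    X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.1],
         [8.0, 8.5], [8.3, 7.9], [7.7, 8.2]]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)

    print(labels)                   # a cluster label is generated for each object
    print(kmeans.cluster_centers_)  # centroids around which intra-cluster similarity is high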
Outlier Analysis
Outlier Analysis is a process that involves identifying the anomalous observation in the dataset. Let
us first understand what outliers are. Outliers are nothing but an extreme value that deviates from
the other observations in the dataset. Data objects that do not match with the general behavior or
model of the data. Most analyses discard outliers as noise or exceptions. Outliers may be detected
using statistical tests or using distance measures where objects that are a substantial distance from
any other cluster are considered outliers. The examination of outlier data is referred to as outlier
mining.
Data mining tasks can be broadly divided into two categories:
• Predictive: classification, regression, time series analysis, and prediction.
• Descriptive: association rules, sequence discovery, clustering, and summarization.
Classification
During the learning phase, it constructs a model that classifies the data based on the training set
and the values known as class labels and uses these labels for classifying new data.
Data classification is a two-step process:
Learning step (where a classification model is constructed)
Classification step (where the model is used to predict class labels for given data).
In the learning step (or training phase), a classification algorithm builds the classifier by analyzing or "learning from" a training set. A tuple, X, is represented by an N-dimensional attribute vector, X = (x1, x2, ..., xN).
Each tuple, X, is assumed to belong to a predefined class as determined by another database
attribute called the class label attribute. The individual tuples making up the training set are
referred to as training tuples and are randomly sampled from the database under analysis.
pattern recognition
Clustering
Clustering is the unsupervised learning process of grouping different objects based upon their similarity. While clustering, one thing we need to remember is that cluster quality is good only if the intra-cluster similarity is high and the inter-cluster similarity is low. During cluster analysis, we initially partition the data set into groups based on data similarity and then assign labels to the groups. It is the task of segmenting a heterogeneous group into several similar sub-groups or clusters. Similar data items are grouped in one cluster, while dissimilar items fall in another cluster.
Bank customer
Stock exchange.
Summarization
Data summarization is, in simple terms, a short conclusion drawn from a larger body of data. Data summarization has great importance in data mining. It is the abstraction or generalization of data, resulting in a smaller set that gives a general overview of the data. Alternatively, summary-type information can be derived from the data.
Association
Association is a data mining function that discovers the probability of the co-occurrence of items in
a collection. The relationships between co-occurring items are expressed
as association rules. Association rules are often used to analyze sales transactions. Association is another prevalent data mining task and is also termed market basket analysis. The goal
of the association task is as follows:
• Finding of frequent item-sets
• Finding of association rules.
A frequent item set may look like
{Product = “coke”, Product = “Fries”, Product = “Squash”}.
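A tiny sketch of the first goal, finding frequent itemsets, by brute-force counting over a handful of invented transactions (a production system would use an algorithm such as Apriori or FP-growth, but the counting idea is the same):

    from itertools import combinations
    from collections import Counter

    transactions = [
        {"coke", "fries", "squash"},
        {"coke", "fries"},
        {"coke", "squash"},
        {"fries", "squash", "pencil"},
    ]
    min_support = 2   # an itemset is "frequent" if it occurs in at least 2 transactions

    counts = Counter()
    for t in transactions:
        for k in (1, 2, 3):
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1

    frequent = {itemset: c for itemset, c in counts.items() if c >= min_support}
    print(frequent)   # e.g. ('coke', 'fries'): 2, ('coke', 'squash'): 2, ...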
Regression
Regression is a data mining function that predicts a number. For example, a regression model could
be used to predict the value of a house based on location, number of rooms, lot size, and other
factors. A regression task begins with a data set in which the target values are known. The task of
regression is related to classification. The key difference is that the predicted attribute is a continuous number. Regression methods have been studied extensively in the field of statistics. Linear regression and logistic regression are the most prevalent regression methods.
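A compact sketch of a regression task with scikit-learn, predicting a house price from two numeric factors; the feature values and prices below are invented for the example:

    from sklearn.linear_model import LinearRegression

    # Training data: [number_of_rooms, lot_size] with known target prices (invented values).
    X = [[3, 120], [4, 200], [2, 80], [5, 260], [3, 150]]
    y = [250_000, 380_000, 180_000, 470_000, 290_000]

    model = LinearRegression()
    model.fit(X, y)                    # learn from cases where the target value is known

    print(model.predict([[4, 180]]))   # predict a continuous number for a new house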
• Ubiquitous Data Mining: This technique involves the extraction of data from mobile devices to obtain information about individuals. Despite several challenges of this type, such as complexity, privacy, and cost, this method has great prospects in various industries, especially in studying human-computer interactions.
• Spatial and Geographic Data Mining: Type of data mining which includes extracting
information from environmental, astronomical, and geographical data which also includes
images taken from outer space. This type of data mining can reveal various aspects such as
distance and topology which is mainly used in geographic information systems and other
navigation applications.
• Time Series and Sequence Data Mining: The primary application of this type of data
mining is the study of cyclical and seasonal trends. This method is mainly being used by
retail companies to assess customers' buying patterns and behaviors.
The evolution of data mining can be summarized as follows:
Past: statistical and machine learning techniques; numerical and structured data stored in traditional databases; evolution of fourth-generation programming languages (4GL) and various related techniques; applications mainly in business.
Current: statistical, machine learning, artificial intelligence, and pattern recognition techniques; heterogeneous data formats, including structured, semi-structured, and unstructured data; high-speed networks, high-end storage devices, and parallel distributed computing, etc.; applications in business, the web, medical diagnosis, etc.
• Incorporation of background knowledge − To guide the discovery process and to express the discovered patterns, background knowledge can be used. Background knowledge may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction.
• Data mining query languages and ad hoc data mining − Data Mining Query language
that allows the user to describe ad hoc mining tasks, should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. In the absence of data cleaning methods, the accuracy of the discovered patterns will be poor.
• Pattern evaluation − The patterns discovered from the collected data may not be interesting, because they either represent common knowledge or lack novelty.
Performance Issues
• There can be performance-related issues such as follows −
• Parallel, distributed, and incremental mining algorithms − The factors such as the huge
size of databases, wide distribution of data, and complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions which are further processed in a parallel
fashion. Then the results from the partitions are merged. Incremental algorithms update the results as the database is updated, without mining the data again from scratch.
• Mining information from heterogeneous databases and global information systems − The
data is available at different data sources on LAN or WAN. These data sources may be
structured, semi-structured, or unstructured. Therefore mining the knowledge from them
adds challenges to data mining.
The Ford Motor Company case illustrated how inadequate companies can be at protecting their customers' personal information. Companies are making profits from their customers' personal data, but they do not want to spend a large amount of money to design a sophisticated security
system to protect that data. At least half of Internet users interviewed by Statistical Research, Inc.
claimed that they were very concerned about the misuse of credit card information given online,
the selling or sharing of personal information by different websites, and cookies that track
consumers’ Internet activity.
Data mining can also be used to discriminate against a certain group of people in the population.
For example, if through data mining, a certain group of people was determined to carry a high risk
for a deadly disease (e.g., HIV or cancer), then the insurance company may refuse to sell an insurance
policy to them based on this information. The insurance company’s action is not only unethical but
may also have a severe impact on our health care system as well as the individuals involved. If
these high-risk people cannot buy insurance, they may die sooner than expected because they
cannot afford to go to the doctor as often as they should. Also, the government may have to step in
and provide insurance coverage for those people, which would drive up health care costs.
have to spend on any one particular case, thus allowing them to deal with more problems. Hopefully, this would make the country a safer place. Data mining may also help reduce terrorist acts by allowing government officers to identify and locate potential terrorists early, thus preventing another incident like the World Trade Center tragedy from occurring on American soil.
Data mining can also benefit society by allowing researchers to collect and analyze data more
efficiently. For example, it took researchers more than a decade to complete the Human Genome
Project. But with data mining, similar projects could be completed in a shorter amount of time. Data
mining may be an important tool that aids researchers in their search for new medications,
biological agents, or gene therapy that would cure deadly diseases such as cancers or AIDS.
Summary
• The information and knowledge gained can be used for applications ranging from
business management, production control, and market analysis, to engineering design and
science exploration.
• Data mining can be viewed as a result of the natural evolution of information technology.
• An evolutionary path has been witnessed in the database industry in the development of
data collection and database creation, data management, and data analysis functionalities.
• Data mining refers to extracting or “mining” knowledge from large amounts of data. Some
other terms like knowledge mining from data, knowledge extraction, data/pattern
analysis, data archaeology, and data dredging are also used for data mining.
• Knowledge discovery is a process and consists of an iterative sequence of data cleaning,
data integration, data selection data transformation, data mining, pattern evaluation, and
knowledge presentation.
Keywords
Data: Data are any facts, numbers, or text that can be processed by a computer.
Data mining : Data mining is the practice of automatically searching large stores of data to
discover patterns and trends that go beyond simple analysis.
Data cleaning : To remove noisy and inconsistent data.
Data integration: Multiple data sources may be combined.
Data selection: Data relevant to the analysis task are retrieved from the database.
Data transformation: Where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations.
KDD: Knowledge Discovery in Databases; many people treat data mining as a synonym for this popularly used term.
Knowledge presentation: Visualisation and knowledge representation techniques are used to
present the mined knowledge to the user.
Pattern evaluation: To identify the truly interesting patterns representing knowledge based on
some interestingness measures.
Self Assessment
Choose the appropriate answers:
1. KDD stands for
(a) Knowledge Design Database
(b) Knowledge Discovery Database
(c) Knowledge Discovery Design
(d) Knowledge Design Development
d) None of these
1. B 2. D 3. B 4. B 5. B
Review Questions
1. What is data mining? How does data mining differ from traditional database access?
2. Discuss, in brief, the characterization of data mining algorithms.
3. Briefly explain the various tasks in data mining.
4. Distinguish between the KDD process and data mining.
5. Define data, information, and knowledge.
6. Explain the process of knowledge discovery.
7. Elaborate on the issues of data mining.
Further Readings
www.stanford.edu/class/stats315b/Readings/DataMining.pdf
Objectives
After this lecture, you will be able to
Introduction
A data warehouse provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. In the last several years, many firms have spent millions of dollars building enterprise-wide data warehouses, as doing so is seen as a way to retain customers by learning more about their needs.
In simple terms, a data warehouse refers to a database that is maintained separately from an
organization’s operational databases. Data warehouse systems allow for the integration of a variety
of application systems. They support information processing by providing a solid platform of
consolidated, historical data for analysis.
An ODS is an integrated database of operational data. Its sources include legacy systems, and it
contains current or near-term data. An ODS may contain 30 to 60 days of information, while a data
warehouse typically contains years of data. ODSs are used in some data warehouse architectures to
provide near-real-time reporting capability if the Data Warehouse’s loading time or architecture
prevents it from being able to provide near-real-time reporting capability. The ODS then only
provides access to the current, fine-grained, and non-aggregated data, which can be queried in an
integrated manner without burdening the transactional systems. However, more complex analyses
requiring high-volume historical and/or aggregated data are still conducted on the actual data
warehouse.
Characteristics of Operational Data Store Systems
The following are the characteristics of Operational Data Stores(ODS):
Figure: Applications of an ODS (consolidation, real-time reporting, system integration, troubleshooting).
• Consolidation: The ODS approach can bring together disparate data sources into a single
repository. It lacks the benefits of other repositories such as data lakes and data warehouses, but an
operational data store has the advantage of being fast and light. An ODS can consolidate data from
different sources, different systems, or even different locations.
• Real-time reporting: An operational data store will generally hold very recent versions of business
data. Combined with the right BI tools, businesses can perform real-time BI tasks, such as tracking
orders, managing logistics, and monitoring customer activity.
• Troubleshooting: The current-state view of an ODS makes it easier to identify and diagnose issues when they occur. For example, if a record differs between two source systems, the ODS will hold both versions of the data, allowing for easy comparison between the two systems. Automated processes can spot these problems and take action.
• System integration: Integration requires a continuous flow of data between systems, and ODS can
provide the platform for this kind of exchange. It's possible to build business rules on an ODS so
that a data change in one system triggers a corresponding action in another system.
A user might create an order on the e-commerce system, which should create a
corresponding order on the logistics system. But this might have the wrong details due to
an integration error.
Working of ODS
The extraction of data from source databases needs to be efficient, and the quality of records needs
to be maintained. Since the data is refreshed generally and frequently, suitable checks are required
to ensure the quality of data after each refresh. An ODS is a read-only database other than regular
refreshing by the OLTP systems. Customers should not be allowed to update ODS information.
Populating an ODS involves an acquisition phase of extracting, transforming, and loading data from the OLTP source systems; this procedure is known as ETL. After populating the database, analyzing it for anomalies and testing it for performance are essential before the ODS system can go online.
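A minimal ETL-style sketch in Python, assuming pandas and a CSV export from a hypothetical OLTP source; the file name, column names, and quality checks are illustrative only:

    import pandas as pd

    def refresh_ods(source_csv: str) -> pd.DataFrame:
        # Extract: read the latest export from the OLTP source (hypothetical file).
        raw = pd.read_csv(source_csv)

        # Transform: keep only the columns the ODS needs and standardize types.
        ods = raw[["order_id", "customer_id", "order_date", "amount"]].copy()
        ods["order_date"] = pd.to_datetime(ods["order_date"])

        # Quality checks after each refresh: no duplicate keys, no negative amounts.
        assert ods["order_id"].is_unique, "duplicate order_id found in refresh"
        assert (ods["amount"] >= 0).all(), "negative amount found in refresh"

        # Load: in a real system this frame would be written to the ODS tables.
        return ods

    # ods_snapshot = refresh_ods("epos_orders_today.csv")   # hypothetical source file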
• An ODS contains only a short window of data, whereas a data warehouse includes the entire history of data.
• An ODS is used for detailed decision-making and operational reporting, whereas a data warehouse is used for long-term decision-making and management reporting.
• An ODS serves as a conduit for data between operational and analytics systems, whereas a data warehouse serves as a repository for cleansed and consolidated data sets.
• An ODS is updated often as the transaction system generates new data, whereas a data warehouse is usually updated in batch processing mode on a set schedule.
You know about a store room and a warehouse. What exactly is the difference between an ODS and a data warehouse? Explain with the help of a suitable example.
• Suppose we are loading the EPOS sales transactions; we need to perform the following checks:
➢ Strip out all the columns that are not required within the warehouse.
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of the
third-party system software, C programs, and shell scripts. The size and complexity of a warehouse
manager vary between specific solutions.
Components
A warehouse manager includes the following: −
• The controlling process
• Stored procedures or C with SQL
• Backup/Recovery tool
• SQL scripts
• It stores query profiles to allow the warehouse manager to determine which indexes and
aggregations are appropriate.
• Performs operations associated with management of user queries
• Constructed using vendor end-user data access tools, data warehouse monitoring tools,
database facilities, and custom-built programs
• Complexity determined by facilities provided by end-user access tools and database
• Operations:
➢ Directing queries to appropriate tables
➢ Scheduling execution of queries
support activities and therefore require different types of tools. When it comes time to start creating reports out of the data in your warehouse and to start making decisions with this data, you are going to need a good query tool. Managed query tools shield end users from the complexities of SQL and database structure by inserting a meta-layer between the user and the database.
Features for Query Tools
• Cross-Browsing of Dimension Attributes: A real dimension table, such as a list of all of your products or customers, takes the form of a large dimension table with many, many attributes (fields). Cross-browsing refers to the capability of a query tool to present the valid values of, say, the product brand, subject to a constraint elsewhere on that dimension table.
• Open Aggregate Navigation: Aggregate navigation is the ability to automatically choose pre-stored summaries, or aggregates, in the course of processing a user's SQL requests. Aggregate navigation must be performed silently and anonymously.
• Multipass SQL: Breaking a single complex request into several small requests is called multipass SQL. It also allows drilling across several conformed data marts in different databases, where the processing of a single galactic SQL statement would otherwise be impossible.
• Semi-Additive Summations: Semi-additive measures are values that you can summarize across any related dimension except time. Stock levels, for example, are semi-additive; if you had 100 in stock yesterday and 50 in stock today, your total stock is 50, not 150. It does not make sense to add up the measure over time; you need to find the most recent value (see the sketch after this list).
• Show Me What Is Important: Your query tools must help you automatically sift through the data to show you only what is important. At the low end, you simply need to show data rows in your reports that meet certain threshold criteria.
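The sketch below (invented figures, pandas assumed) contrasts an additive measure, which may be summed over time, with a semi-additive measure such as stock level, for which the most recent value is taken instead:

    import pandas as pd

    stock = pd.DataFrame({
        "day":         ["Mon", "Tue", "Wed"],       # rows are in chronological order
        "sales":       [20, 35, 15],                # additive: summing over time is meaningful
        "stock_level": [100, 50, 80],               # semi-additive: summing over time is not
    })

    total_sales   = stock["sales"].sum()            # 70: a valid aggregation over time
    current_stock = stock["stock_level"].iloc[-1]   # 80: take the most recent value, not a sum

    print(total_sales, current_stock)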
OLAP Tools
Online Analytical Processing Server (OLAP) is based on the multidimensional data model. It allows
managers and analysts to get an insight into the information through fast, consistent, and
interactive access to information.
• IBM Cognos: It provides tools for reporting, analysis, monitoring of events and metrics.
• SAP NetWeaver BW: Known as SAP NetWeaver Business Warehouse, it, like IBM
Cognos, delivers reporting, analysis, and interpretation of business data. It runs on
industry-standard RDBMSs and SAP's HANA in-memory DBMS.
• Microsoft Analysis Services: Microsoft Analysis Services is used by organizations to
make sense of data that is spread across multiple databases or held in a discrete
form.
• MicroStrategy Intelligence Server: MicroStrategy Intelligence Server helps businesses to
standardize on a single open platform, which in turn reduces their
maintenance and operating costs.
ROLAP: The 'R' in ROLAP stands for Relational, so the full form of ROLAP is Relational
Online Analytical Processing. The salient feature of ROLAP is that the data is stored in relational
databases. Some of the top ROLAP servers are as follows:
• IBM Cognos
• SAP NetWeaver BW
• Microsoft Analysis Services
• Essbase
• Jedox OLAP Server
ROLAP engines include the commercial IBM Informix Metacube (www.ibm.com) and the
Micro-strategy DSS server (www.microstrategy.com), as well as the open-source product
Mondrian (mondrian.sourceforge.net).
HOLAP: It stands for Hybrid Online Analytical Processing. HOLAP bridges the shortcomings
of both MOLAP and ROLAP by combining their capabilities. How does it combine them? It
divides the data of the database between relational and specialized storage. Some of
the top HOLAP servers are as follows:
• IBM Cognos
• SAP NetWeaver BW
Older Detail Data: This data has a long time horizon and is therefore typically migrated to a
"less-expensive" alternate storage medium.
Current Detail Data: This data represents data of a recent nature and always has a shorter time
horizon than older detail data. Although it can be voluminous, it is almost always stored on disk
to permit faster access. The current detail data is central in importance because it:
• Reflects the most recent happenings, which are commonly the most interesting.
• Is voluminous, as it is stored at the lowest level of granularity.
• Is almost always stored on disk, which is fast to access but expensive and
difficult to manage.
Lightly summarized data: Lightly summarized data represents data distilled from current detail
data. It is summarized according to some unit of time and always resides on disk. It is extracted
from the low level of detail found at the current detail level and is usually stored on disk
storage. When building the data warehouse, the designer must decide what unit of time the
summarization is done over and which attributes the summarized data will
contain.
Highly summarized data is compact and directly available and can even be found outside the
warehouse. Highly summarized data represents data distilled from lightly summarized data; it is
always compact, easily accessible, and resides on disk. A final component of the data
warehouse is metadata. Metadata is best described as data about data. It provides
information about the structure of a data warehouse as well as the various algorithms used in data
summarization. It provides a descriptive view, or "blueprint", of how data is mapped from the
operational level to the data warehouse.
WORM (write once, read many) technologies use media that is not rewritable.
Backup data
A data warehouse is a complex system and it contains a huge volume of data. Therefore it is
important to back up all the data so that it becomes available for recovery in the future as per
requirement.
Backup Terminologies
• Complete backup − It backs up the entire database at the same time. This backup includes
all the database files, control files, and journal files.
• Partial backup − As the name suggests, it does not create a complete backup of the
database. Partial backups are very useful for large databases because they allow a strategy
whereby various parts of the database are backed up in a round-robin fashion on a day-to-
day basis, so that the whole database is backed up effectively once a week (a simple
scheduling sketch follows this list).
• Cold backup − Cold backup is taken while the database is completely shut down. In a
multi-instance environment, all the instances should be shut down.
• Hot backup − Hot backup is taken when the database engine is up and running. The
requirements of hot backup vary from RDBMS to RDBMS.
• Online backup − It is quite similar to hot backup.
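As a rough illustration of the round-robin strategy described under partial backup, the following Java sketch picks one part of the database to back up each day so that the whole database is covered once a week; the partition names are purely hypothetical.

    import java.time.DayOfWeek;
    import java.time.LocalDate;
    import java.util.List;

    public class RoundRobinBackup {
        // Hypothetical partitions of a large warehouse database, one per weekday.
        static final List<String> PARTITIONS = List.of(
                "sales_q1", "sales_q2", "sales_q3", "sales_q4",
                "customers", "products", "metadata");

        // Choose the partition to back up on the given date so that the whole
        // database is covered once a week.
        static String partitionFor(LocalDate date) {
            DayOfWeek day = date.getDayOfWeek();   // MONDAY = 1 ... SUNDAY = 7
            return PARTITIONS.get(day.getValue() - 1);
        }

        public static void main(String[] args) {
            System.out.println("Back up today: " + partitionFor(LocalDate.now()));
        }
    }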
3.7 Metadata
• Metadata is data about data that defines the data warehouse. It is used for building,
maintaining, and managing the data warehouse. In the data warehouse architecture,
metadata plays an important role as it specifies the source, usage, values, and features of
data warehouse data. It also defines how data can be changed and processed. For example,
a line in the sales database may contain:
4030 KJ732 299.90
This is meaningless data until we consult the metadata that tells us it represents:
• Model number: 4030
• Sales agent ID: KJ732
• Total sales amount: 299.90
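To see how metadata gives meaning to such a raw line, the short, purely illustrative Java sketch below pairs the values 4030, KJ732, and 299.90 with the column descriptions a metadata repository would supply.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class MetadataDemo {
        public static void main(String[] args) {
            // Raw line from the sales database: meaningless on its own.
            String[] rawRow = {"4030", "KJ732", "299.90"};

            // Column descriptions supplied by the metadata repository (illustrative).
            String[] columnMeaning = {"Model number", "Sales agent ID", "Total sales amount"};

            Map<String, String> described = new LinkedHashMap<>();
            for (int i = 0; i < rawRow.length; i++) {
                described.put(columnMeaning[i], rawRow[i]);
            }
            // Prints {Model number=4030, Sales agent ID=KJ732, Total sales amount=299.90}
            System.out.println(described);
        }
    }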
Metadata is the final element of the data warehouse and differs in nature from data drawn from the
operational environment. It is used as:
• A directory to help the DSS analyst locate the contents of the data warehouse.
• A guide to the mapping of data as it is transformed from the operational environment to the
data warehouse environment.
• A guide to the algorithms used for summarization between the current detail data and the
lightly summarized data, and between the lightly summarized data and the highly summarized data.
Categories of metadata
• Technical Metadata: This kind of metadata contains information about the warehouse
that is used by data warehouse designers and administrators.
• Business Metadata: This kind of metadata contains details that give end-users an easy
way to understand the information stored in the data warehouse.
Data warehouse metadata includes table and column names, their detailed descriptions,
their connection to business-meaningful names, the most recent data load date, the
business meaning of a data item, and the number of users that are currently logged in.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a
data warehouse system, as shown in the figure:
Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored
initially in corporate relational databases or legacy databases, or it may come from an information
system outside the corporate walls.
Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies
and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-
called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous
schemata, and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
Data Warehouse layer: Information is saved in one logically centralized repository: the
data warehouse. The data warehouse can be accessed directly, but it can also be used as a
source for creating data marts, which partially replicate data warehouse contents and are designed
for specific enterprise departments. Metadata repositories store information on sources, access
procedures, data staging, users, data mart schemas, and so on.
Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports,
dynamically analyze information, and simulate hypothetical business scenarios. It should feature
aggregate information navigators, complex query optimizers, and user-friendly GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the
reconciled layer, and the data warehouse layer (containing both data warehouses and data marts).
The reconciled layer sits between the source data and data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for the
whole enterprise. At the same time, it separates the problems of source data extraction and
integration from those of data warehouse population. In some cases, the reconciled layer is also
used directly to better accomplish some operational tasks, such as producing daily reports that
cannot be satisfactorily prepared using the corporate applications.
Four-Tier Architecture
User: At the end-user layer, data in the ODS, data warehouse, and data marts can be accessed by
using a variety of tools such as query and reporting tools, data visualization tools, and analytical
applications.
Presentation layer: Its functions include receiving input data, interpreting users' instructions,
sending requests to the data services layer, and displaying the data obtained from the data
services layer in a way users can understand. It is the layer closest to users and provides an
interactive operation interface.
Discuss the various factors that play a vital role in designing a good data warehouse.
Business logic: It is located between the presentation layer (PL) and the data access layer, playing a
connecting role in the data exchange. The layer's concerns are focused primarily on the business
rules, business processes, and business needs related to the system. It is also known as the domain layer.
Data Access: This is the innermost layer; it implements persistence logic and is responsible
for access to the database. Operations on the data include finding, adding, deleting, and modifying
records. This layer works independently, without relying on other layers. The data access layer (DAL)
extracts the appropriate data from the database and passes it to the upper layer.
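The separation of these layers can be made concrete with a small, purely illustrative Java sketch; the interface, class, and method names below are assumptions used only to show how the presentation layer, business logic layer, and data access layer depend on one another.

    import java.util.List;
    import java.util.stream.Collectors;

    // Data access layer (DAL): the only layer that talks to the database.
    interface CustomerDao {
        List<String> findCustomersByRegion(String region);
    }

    // Business logic layer: applies business rules and delegates persistence to the DAL.
    class CustomerService {
        private final CustomerDao dao;

        CustomerService(CustomerDao dao) {
            this.dao = dao;
        }

        List<String> activeCustomersIn(String region) {
            // Illustrative business rule: ignore blank customer names.
            return dao.findCustomersByRegion(region).stream()
                    .filter(name -> !name.trim().isEmpty())
                    .collect(Collectors.toList());
        }
    }

    // Presentation layer: formats results for the user; it never touches the database directly.
    class CustomerReportScreen {
        void show(List<String> customers) {
            customers.forEach(System.out::println);
        }
    }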
Summary
• OLAP servers may adopt a relational OLAP (ROLAP), a multidimensional OLAP
(MOLAP), or a hybrid OLAP (HOLAP) implementation.
• Data warehousing is the consolidation of data from disparate data sources into a single
target database to be utilized for analysis and reporting purposes.
• The primary goal of data warehousing is to analyze the data for business intelligence
purposes. For example, an insurance company might create a data warehouse to
capture policy data for catastrophe exposure.
• The data is sourced from front-end systems that capture the policy information into the
data warehouse.
• The data might then be analyzed to identify windstorm exposures in coastal areas prone to
hurricanes and determine whether the insurance company is overexposed.
• The goal is to utilize the existing information to make accurate business decisions.
Keywords
Data Sources: Data sources refer to any electronic repository of information that contains data of
interest for management use or analytics.
Data Warehouse Architecture: It is a description of the elements and services of the warehouse,
with details showing how the components will fit together and how the system will grow over
time.
Data Warehouse: It is a relational database that is designed for query and analysis rather than for
transaction processing.
Job Control: This includes job definition, job scheduling (time and event), monitoring, logging,
exception handling, error handling, and notification.
Metadata: Metadata, or “data about data”, is used not only to inform operators and users of the
data warehouse about its status and the information held within it, but also to support building,
maintaining, and managing the data warehouse.
Self Assessment
1) OLTP stands for
(a) On Line Transactional Processing
(b) On Link Transactional Processing
(c) On Line Transactions Processing
(d) On Line Transactional Program
3) The data from the operational environment enter ........................ of data warehouse.
A) Current detail data
B) Older detail data
C) Lightly Summarized data
D) Highly summarized data
4) .............................are designed to overcome any limitations placed on the warehouse by the nature
of the relational data model.
A) Operational database
B) Relational database
C) Multidimensional database
D) Data repository
5) Data warehouse contains ________data that is seldom found in the operational environment
Select one:
a)informational
b)normalized
c)denormalized
d)summary
6) _______ are numeric measurements or values that represent a specific business aspect or activity
Select one:
a)Dimensions
b)Schemas
c)Facts
d)Tables
a)Relational data
b)Operational data
c)Informational data
d)Meta data
13) ................. stores the data based on the already familiar relational DBMS technology.
14) Which of the following features usually applies to data in a data warehouse?
A.Data are often deleted
B.Most applications consist of transactions
C.Data are rarely deleted
D.Relatively few records are processed by applications
Review Questions
1. What is a data warehouse? How is it better than traditional information-gathering techniques?
2. Describe the data warehouse environment.
3. List and explain the different layers in the data warehouse architecture.
4. Differentiate between ROLAP and MOLAP.
5.Describe three-tier data warehouse architecture in detail.
1. A 2. C 3. A 4. C 5. D
6. C 7. D 8. B 9. D 10. Business Area
Further Readings
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing Data Mining and OLAP, Tata McGraw Hill, 1997
Alex Berson, Stephen J. Smith, Data warehousing, Data Mining & OLAP, Tata McGraw Hill,
Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel Processing,
Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan Kaufmann
Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis, Fundamentals of
Data Warehouses, Publisher: Springer
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales, and
Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing Inc, Second
Edition, 2004.
www.en.wikipedia.org
www.web-source.net
www.webopedia.com
Objectives
• Understand the use of Rapid Miner.
• Know the different Data Mining products.
• Learn the steps for installing Rapid Miner.
• Understand what WEKA is.
• Learn the installation process of WEKA.
Introduction
Rapid Miner provides an environment for machine learning and data mining processes. It follows a
modular operator concept which allows the design of complex nested operator chains for a huge
number of learning problems. It allows data handling to be transparent to the operators.
Weka is an open-source tool designed and developed by the scientists/researchers at the University
of Waikato, New Zealand. WEKA stands for Waikato Environment for Knowledge Analysis. It is
developed by the international scientific community and distributed under the free GNU GPL
license. It provides a lot of tools for data preprocessing, classification, clustering, regression
analysis, association rule creation, feature extraction, and data visualization. It is a powerful tool
that supports the development of new algorithms in machine learning.
4.1 RapidMiner
The idea behind the RapidMiner tool is to create one place for everything. RapidMiner is an
integrated enterprise artificial intelligence framework that offers AI solutions to positively impact
businesses. It is used as a data science software platform for data extraction, data mining, deep
learning, machine learning, and predictive analytics. It is widely used in many business and
commercial applications as well as in various other fields such as research, training, education,
rapid prototyping, and application development. All major machine learning processes, such as
data preparation, model validation, results visualization, and optimization, can be carried out
by using RapidMiner.
Facilities of RapidMiner
1. RapidMiner provides its own collection of datasets, but it also provides options to set up a database
in the cloud for storing large amounts of data. You can store and load data from Hadoop,
Cloud, RDBMS, NoSQL, etc. Apart from this, you can load your CSV data very easily and start
using it as well.
RapidMiner Products
The following are the various products of RapidMiner:
• RapidMiner Studio
• RapidMiner Auto Model
RapidMiner Studio
• With RapidMiner Studio, one can access, load, and analyze both traditional structured data
and unstructured data like text, images, and media.
• It can also extract information from these types of data and transform unstructured data into
structured.
RapidMiner Auto Model
• Auto Model is an advanced version of RapidMiner Studio that accelerates the process of
building and validating data models.
• Majorly three kinds of problems can be resolved with Auto Model namely prediction,
clustering, and outliers.
RapidMiner Installation
Go to the RapidMiner downloads page and click on the version of RapidMiner Studio that fits your system:
• Windows 32bits
• Windows 64bits
• Mac OS 10.8+
• Linux
• Select a destination folder (or leave the default). Please ensure that the folder path does not
contain + or % characters. By clicking Install, the wizard extracts and installs RapidMiner
Studio. When the installation completes, click Next and then Finish to close the wizard and
start RapidMiner Studio.
• Read the terms of the license agreement and click I Accept to continue.
The download will start automatically. When it finishes, click “RUN” so that the installation
can begin.
Agree with Terms and Select Destination
You’ll see the welcome screen for the installation. Click “Next” to move to the terms of Use, and if
you agree with them, click “I Agree”. Finally, select the destination folder for the application (You’ll
need 224.5MB of free space to install) and click “Install”.
A progress bar will show the progress of the installation, and when it finishes (takes less than 5
min) you’ll see the “Completing the RapidMiner Studio Setup Wizard”. Click the Finish button to
finish.
Click the Finish button to finish the installation process, and congratulations! You are ready to use
the application! To open it, just look for it on the desktop, or search for “RapidMiner Studio” on the
Windows Start Menu.
Once you launch RapidMiner Studio, a Welcome screen appears, prompting with two options:
Table: Welcome screen options
Option: I already have an account or license key
Description: If you previously registered, enter your rapidminer.com credentials to log in
so that you can download or install your license(s).
Figure 9:
Once you validate, you'll receive a success message in your browser:
Figure 10:
Return to the RapidMiner Studio application. Complete the installation by clicking I'm ready.
Figure 11
Enter your email address and password to log in with your RapidMiner.com account and then
click Login and Install.
4.3 Weka
WEKA is open-source software that provides tools for data preprocessing, implementations of
several machine learning algorithms, and visualization tools so that you can develop machine
learning techniques and apply them to real-world data mining problems. What WEKA offers is
summarized in the following diagram:
If you observe the beginning of the flow of the image, you will understand that there are many
stages in dealing with Big Data to make it suitable for machine learning −
First, you will start with the raw data collected from the field. This data may contain several null
values and irrelevant fields. You use the data preprocessing tools provided in WEKA to cleanse
the data. Then, you would save the preprocessed data in your local store for applying ML
algorithms. Next, depending on the kind of ML model that you are trying to develop you would
select one of the options such as Classify, Cluster, or Associate. The Attributes Selection allows the
automatic selection of features to create a reduced dataset. Note that under each category, WEKA
provides the implementation of several algorithms. You would select an algorithm of your choice,
set the desired parameters, and run it on the dataset. Then, WEKA would give you the statistical
output of the model processing. It also provides a visualization tool to inspect the data.
The various models can be applied to the same dataset. You can then compare the outputs of
different models and select the best that meets your purpose. Thus, the use of WEKA results in
quicker development of machine learning models on the whole.
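Because WEKA's algorithms can also be called from your own Java code, the following sketch shows one plausible way to do these steps programmatically: load an ARFF file, mark the class attribute, and evaluate a J48 decision tree with 10-fold cross-validation. The file path is an assumption and should be changed to point at a data set on your machine.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaQuickStart {
        public static void main(String[] args) throws Exception {
            // Load a sample data set (path is an assumption; use any ARFF file you have).
            DataSource source = new DataSource("data/weather.nominal.arff");
            Instances data = source.getDataSet();

            // Tell WEKA which attribute is the class; here it is the last one ("play").
            data.setClassIndex(data.numAttributes() - 1);

            // Build a J48 decision tree and evaluate it with 10-fold cross-validation.
            J48 tree = new J48();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }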
Weka – Installation
To install WEKA on your machine, visit WEKA’s official website and download the installation file.
WEKA supports installation on Windows, Mac OS X, and Linux. You just need to follow the
instructions on this page to install WEKA for your OS.
The steps for installing on Mac are as follows −
The GUI Chooser application allows you to run five different types of applications as listed here −
• Explorer
• Experimenter
• KnowledgeFlow
• Workbench
• Simple CLI
Install Weka and explore all the options on the Weka home screen.
3) The License Agreement terms will open. Read it thoroughly and click on “I Agree”.
Figure 17:
Figure 18:
5) Select the destination folder and click on Next.
Figure 19:
Figure 20:
7) After the installation is complete, the following window will appear. Click on Next.
Figure 21:
Figure 22:
9) The WEKA tool and Explorer window opens.
Figure 23:
Summary
• RapidMiner is a free-of-charge, open-source software tool for data and text mining.
• In addition to Windows operating systems, RapidMiner also supports Macintosh, Linux, and
Unix systems.
• RapidMiner Studio is a visual data science workflow designer accelerating the prototyping &
validation of models.
• With RapidMiner Studio, you can access, load, and analyze any type of data – both traditional
structured data and unstructured data.
• Weka is a collection of machine learning algorithms for data mining tasks.
• Weka algorithms can either be applied directly to a dataset or called from your own Java code.
• Weka contains tools for data pre-processing, classification, regression, clustering, association
rules, and visualization.
Keywords
Numeric: All kinds of number values, including date, time, integer, and real numbers.
Text: Nominal data type that allows for more granular distinction (to differentiate from
polynomial).
Delete Repository Entry: An operator to delete a repository entry within a process.
Specificity: Specificity relates to the classifier’s ability to identify negative results; it is computed as
TN / (TN + FP), the fraction of actual negatives that are correctly identified.
Arff Format: An Arff file contains two sections - header and data.
Self Assessment
1. A database where all of the values for a particular column are stored contiguously is called?
a. Column-oriented storage
b. In-memory database
c. Partitioning
d. Data Compression
2. Data mining can also be applied to other forms such as ................
i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) Spatial data
4. Data mining is used to refer to the ______ stage in knowledge discovery in the database.
a. selection.
b. retrieving.
c. discovery.
d. coding.
5. Which of the following can be considered as the correct process of Data Mining?
a. Infrastructure, Exploration, Analysis, Interpretation, Exploitation
b. Exploration, Infrastructure, Analysis, Interpretation, Exploitation
c. Exploration, Infrastructure, Interpretation, Analysis, Exploitation
d. Exploration, Infrastructure, Analysis, Exploitation, Interpretation
6. Which of the following is an essential process in which the intelligent methods are applied to
extract data patterns?
a. Warehousing
b. Data Mining
c. Text Mining
d. Data Selection
7. In data mining, how many categories of functions are included?
a. 5
b. 4
c. 2
d. 3
8. The issues like efficiency, scalability of data mining algorithms comes under_______
a. Performance issues
b. Diverse data type issues
c. Mining methodology and user interaction
d. All of the above
9. Which of the following statements about the query tools is correct?
a. Tools developed to query the database
b. Attributes of a database table that can take only numerical values
c. Both A and B
d. None of the above
10. Which one of the following refers to the binary attribute?
a. This takes only two values. In general, these values will be 0 and 1, and they can be coded
as one bit
b. The natural environment of a certain species
c. Systems that can be used without knowledge of internal operations
d. All of the above
11. Which of the following is the data mining tool?
a. Borland C.
b. Weka.
c. Borland C++.
d. Visual C.
12. Which one of the following issues must be considered before investing in data mining?
a. Compatibility
b. Functionality
c. Vendor consideration
d. All of the above
Review Questions
1. What is RapidMiner? Explain the various facilities provided by RapidMiner.
2. Explain the process of creating a user account in Rapid Miner.
3. Elaborate on various Rapid Miner products.
4. Write down the installation steps of Rapid Miner.
5. How do you install Weka in a Windows environment? Write all the steps required for the
installation.
Answers:
1 a 2 d
3 b 4 c
5 a 6 b
7 c 8 a
9 a 10 a
11 b 12 d
Further Readings
Hofmann, M., & Klinkenberg, R. (Eds.). (2016). RapidMiner: Data mining use cases and
business analytics applications. CRC Press.
Kotu, V., & Deshpande, B. (2014). Predictive analytics and data mining: concepts and practice
with rapidminer. Morgan Kaufmann.
Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., & Cunningham, S. J. (1999).
Weka: Practical machine learning tools and techniques with Java implementations.
Garner, S. R. (1995, April). Weka: The Waikato environment for knowledge analysis.
In Proceedings of the New Zealand computer science research students conference (Vol. 1995, pp.
57-64).
https://rapidminer.com/get-started/
https://rapidminer.software.informer.com/5.3/
https://waikato.github.io/weka-wiki/downloading_weka/
https://www.tutorialspoint.com/weka/weka_installation.htm
Objectives
After this lecture, you will be able to
• Understand the basics of rapidminer and its products.
• Learn the working of the Weka tool.
• Understand various features of the RapidMiner and Weka tools.
• Know the Interface of RapidMiner and Weka.
Introduction
Several open-source and proprietary-based data mining and data visualization tools exist which are
used for information extraction from large data repositories and for data analysis. Some of the data
mining tools which exist in the market are Weka, Rapid Miner, Orange, R, KNIME, ELKI, GNU
Octave, Apache Mahout, SCaViS, Natural Language Toolkit, Tableau, etc.
5.1 RapidMiner
RapidMiner is an integrated enterprise artificial intelligence framework that offers AI solutions to
positively impact businesses. RapidMiner is widely used in many business and commercial
applications as well as in various other fields such as research, training, education, rapid
prototyping, and application development. All major machine learning processes, such as data
preparation, model validation, results visualization, and optimization, can be carried out by using
RapidMiner.
RapidMiner comes with:
• Over 125 mining algorithms
• Over 100 data cleaning and preparation functions
• Over 30 charts for data visualization, and a selection of metrics to evaluate model
performance
RapidMiner Products
There are many products of RapidMiner that are used to perform multiple operations. Some of the
products are:
• RapidMiner Studio: With RapidMiner Studio, one can access, load, and analyze both
traditional structured data and unstructured data like text, images, and media. It can also
extract information from these types of data and transform unstructured data into
structured data.
• RapidMiner Auto Model: Auto Model is an advanced version of RapidMiner Studio that
accelerates the process of building and validating data models. You can customize the
processes and put them into production based on your needs. Majorly, three kinds of
problems can be resolved with Auto Model, namely prediction, clustering, and outliers.
• RapidMiner Turbo Prep: Data preparation is time-consuming, and RapidMiner Turbo Prep
is designed to make the preparation of data much easier. It provides a user interface where
your data is always visible front and center, where you can make changes step by step
and instantly see the results, with a wide range of supporting functions to prepare the data
for model-building or presentation.
TOOL CHARACTERISTICS
• Usability: Easy to use
• Tool orientation: The tool is designed for general-purpose analysis
• Data mining type: This tool is made for structured data mining, text mining, image mining,
audio mining, video mining, data gathering, and social network analysis.
• Manipulation type: This tool is designed for data extraction, data transformation, data
analysis, data visualization, data conversion, and data cleaning.
All data that you load will be contained in an example set. Each example is described by attributes
(a.k.a. features), and all attributes have value types and specific roles. The value types define
how the data is treated:
• Numeric data has an order (2 is closer to 1 than to 5).
• Nominal data has no order (red is as different from green as from blue).
Table 1: Different Value Type
Role Description
The Repository
This is where you store your data and processes. Only if you load data from the repository can
RapidMiner show you which attributes exist. Add data via the “Add Data” button or the
“Store” operator. You can load data via drag ‘n’ drop or the “Retrieve” operator. If you have a
question starting with “Why does RapidMiner not show me …?”, then the answer most likely is
“Because you did not load your data into the Repository!”
Figure 2: Repository
A repository is simply a folder that holds all of your RapidMiner data sets (we call them
"ExampleSets"), processes, and other file objects that you will create using RapidMiner Studio. This
folder can be stored locally on your computer, or on a RapidMiner Server.
Example: The simplest process that can be created with RapidMiner just retrieves data
from a repository and delivers it as a result to the user to allow for inspection.
Benefits of RapidMiner
The main benefits of RapidMiner are its robust features, user-friendly interface, and maximization
of data usage. Learn more of its benefits below:
Not only does the system allow the usage of any data, but it also allows users to create models and
plans out of it, which can then be used as a basis for decision-making and the formulation of
strategies. RapidMiner has data exploration features, such as descriptive statistics, graphs, and
visualization, which allow users to get valuable insights out of the information they have
gained. RapidMiner is also powerful enough to provide analytics based on real-life data
transformation settings. This means that users can manipulate their data any way they want since
they have control of its formatting and the system. Because of this, they can create the optimal data
set when performing predictive analytics.
Finding an operator
Once you get familiar with operator names, you can find them more easily using the filter at the top
of the operator window.
5.5 Weka
Weka is data mining software that uses a collection of machine learning algorithms. The
algorithms can either be applied directly to a dataset or called from your own Java code. Weka
contains tools for data pre-processing, classification, regression, clustering, association rules,
and visualization. It is also well-suited for developing new machine learning schemes.
WEKA is open-source software that provides tools for data preprocessing, implementations of
several machine learning algorithms, and visualization tools so that you can develop machine
learning techniques and apply them to real-world data mining problems. What WEKA offers is
summarized in the following diagram −
If you observe the beginning of the flow of the image, you will understand that there are many
stages in dealing with Big Data to make it suitable for machine learning −
First, you will start with the raw data collected from the field. This data may contain several null
values and irrelevant fields. You use the data preprocessing tools provided in WEKA to cleanse
the data. Then, you would save the preprocessed data in your local store for applying ML
algorithms.
Next, depending on the kind of ML model that you are trying to develop you would select one of
the options such as Classify, Cluster, or Associate. The Attributes Selection allows the automatic
selection of features to create a reduced dataset. Note that under each category, WEKA provides
the implementation of several algorithms. You would select an algorithm of your choice, set the
desired parameters, and run it on the dataset.
Then, WEKA would give you the statistical output of the model processing. It also provides a
visualization tool to inspect the data. The various models can be applied to the same dataset. You
can then compare the outputs of different models and select the best one that meets your
purpose. Thus, the use of WEKA results in quicker development of machine learning models on
the whole.
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can
either be applied directly to a dataset or called from your own Java code. Weka contains tools for
data pre-processing, classification, regression, clustering, association rules, and visualization. The
WEKA GUI Chooser application will start and you would see the following screen −
The GUI Chooser application allows you to run five different types of applications as listed here −
• Explorer
• Experimenter
• KnowledgeFlow
• Workbench
• Simple CLI
When you click on the Explorer button in the Applications selector, it opens the following
screen –
Preprocess Tab
Initially, as you open the explorer, only the Preprocess tab is enabled. The first step in machine
learning is to preprocess the data. Thus, in the Preprocess option, you will select the data file,
process it, and make it fit for applying the various machine learning algorithms.
Classify Tab
The Classify tab provides you several machine learning algorithms for the classification of your
data. To list a few, you may apply algorithms such as Linear Regression, Logistic Regression,
Support Vector Machines, Decision Trees, RandomTree, RandomForest, NaiveBayes, and so on.
The list is very exhaustive and provides both supervised and unsupervised machine learning
algorithms.
Cluster Tab
Under the Cluster tab, there are several clustering algorithms provided - such as SimpleKMeans,
FilteredClusterer, HierarchicalClusterer, and so on.
Associate Tab
Under the Associate tab, you would find Apriori, FilteredAssociator, and FPGrowth.
Example: They can be used, for example, to store an identifier with each instance in a dataset.
Visualize Tab
The Visualize option allows you to visualize your processed data for analysis. WEKA provides
several ready-to-use algorithms for testing and building your machine learning applications. To use
WEKA effectively, you must have a sound knowledge of these algorithms, how they work, which
one to choose under what circumstances, what to look for in their processed output, and so on. In
short, you must have a solid foundation in machine learning to use WEKA effectively in building
your apps.
We start with the first tab, which you use to preprocess the data. This is common to all algorithms that
you would apply to your data for building the model and is a common step for all subsequent
operations in WEKA.
For a machine-learning algorithm to give acceptable accuracy, you must cleanse your data first.
This is because the raw data collected from the field may contain null values, irrelevant columns,
and so on.
First, you will learn to load the data file into the WEKA Explorer. The data can be loaded from the
following sources −
Analyze your data with WEKA Explorer using various learning schemes and interpret
received results.
Now, navigate to the folder where your data files are stored. WEKA installation comes up with
many sample databases for you to experiment with. These are available in the data folder of the
WEKA installation.
For learning purposes, select any data file from this folder. The contents of the file would be loaded
in the WEKA environment. We will very soon learn how to inspect and process this loaded data.
Before that, let us look at how to load the data file from the Web.
Developed by the University of Waikato, New Zealand, Weka stands for Waikato
Environment for Knowledge Analysis.
We will open the file from a public URL Type the following URL in the popup box −
https://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/weather.nominal.arff
You may specify any other URL where your data is stored. The Explorer will load the data from the
remote site into its environment.
Set the connection string to your database, set up the query for data selection, process the query,
and load the selected records in WEKA.WEKA supports a large number of file formats for the
data. The types of files that it supports are listed in the drop-down list box at the bottom of the
screen. This is shown in the screenshot given below.
As you would notice it supports several formats including CSV and JSON. The default file type is
Arff.
Arff Format
An Arff file contains two sections - header and data.
As an example for Arff format, the Weather data file loaded from the WEKA sample databases is
shown below −
The attributes can take nominal values as in the case of outlook shown here −
@attribute outlook {sunny, overcast, rainy}
The attributes can take real values as in this case −
@attribute temperature real
You can also set a target or class attribute, called play, as shown here −
@attribute play {yes, no}
The Target assumes two nominal values yes or no.
When you open the file, your screen looks like as shown here −
Understanding Data
Let us first look at the highlighted Current relation sub-window. It shows the name of the database
that is currently loaded. You can infer two points from this sub-window −
• The relation (dataset) currently loaded is the weather data, which contains 14 instances (rows).
• The dataset contains 5 attributes (fields), which are listed in the Attributes sub-window.
On the left side, notice the Attributes sub-window that displays the various fields in the database.
The weather database contains five fields - outlook, temperature, humidity, windy, and play. When
you select an attribute from this list by clicking on it, further details on the attribute itself are
displayed on the right-hand side.
Let us select the temperature attribute first. When you click on it, you would see the following
screen
• The table underneath this information shows the nominal values for this field as hot, mild,
and cool.
• It also shows the count and weight in terms of a percentage for each nominal value.
At the bottom of the window, you see the visual representation of the class values.
If you click on the Visualize All button, you will be able to see all features in one single window as
shown here −
Removing Attributes
Many a time, the data that you want to use for model building comes with many irrelevant fields.
For example, the customer database may contain his mobile number, which is irrelevant in analyzing
his credit rating.
To remove Attribute/s select them and click on the Remove button at the bottom.
The selected attributes would be removed from the database. After you fully preprocess the data,
you can save it for model building.
Next, you will learn to preprocess the data by applying filters to this data.
Applying Filters
Some machine learning techniques, such as association rule mining, require categorical data.
To illustrate the use of filters, we will use weather-numeric.arff database that contains
two numeric attributes - temperature and humidity.
We will convert these to nominal by applying a filter to our raw data. Click on the Choose button
in the Filter subwindow and select the following filter −
weka→filters→supervised→attribute→Discretize
After you are satisfied with the preprocessing of your data, save the data by clicking the Save ...
button. You will use this saved file for model building.
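The same supervised discretization can also be applied through WEKA's Java API; the sketch below is a minimal example under the assumption that the numeric weather file is available at the given path.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    public class DiscretizeExample {
        public static void main(String[] args) throws Exception {
            // Load the numeric weather data (path is an assumption).
            Instances data = new DataSource("data/weather.numeric.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);   // supervised filter needs a class

            // Convert the numeric attributes (temperature, humidity) to nominal bins.
            Discretize discretize = new Discretize();
            discretize.setInputFormat(data);                // must be called before useFilter
            Instances nominalData = Filter.useFilter(data, discretize);

            System.out.println(nominalData.toSummaryString());
        }
    }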
The two tools can be compared on the following criteria:
• Usability
• Speed
• Visualization
• Algorithms supported
• Data Set Size
• Memory Usage
• Primary Usage
• Interface Type Supported
Data Set Size: RapidMiner supports small and large data sets, while Weka supports only small
data sets.
Summary
• A perspective consists of a freely configurable selection of individual user interface
elements, the so-called views.
• RapidMiner will eventually also ask you automatically whether switching to another perspective
seems appropriate.
• All work steps or building blocks for different data transformation or analysis tasks are
called operators in RapidMiner. Those operators are presented in groups in the Operator
View on the left side.
• One of the first steps in a process for data analysis is usually to load some data into the
system. RapidMiner supports multiple methods for accessing datasets.
• It is always recommended to use the repository whenever this is possible instead of files.
• Open Recent Process opens the process which is selected in the list below the actions.
Alternatively, you can open this process by double-clicking inside the list.
• WEKA supports many different standard data mining tasks such as data pre-processing,
classification, clustering, regression, visualization and feature selection.
• The WEKA application allows novice users a tool to identify hidden information from
database and file systems with simple to use options and visual interfaces.
Keywords
Process: A connected set of Operators that help you to transform and analyze your data.
Port:To build a process, you must connect the output from each Operator to the input of the next
via a port.
Repository: your central data storage entity. It holds connections, data, processes, and results,
either locally or remotely.
Operators: The elements of a Process, each Operator takes input and creates output, depending on
the choice of parameters.
Filters: Processes that transform instances and sets of instances are called filters.
New Process: Starts a new analysis process.
b) Experiments
c) workflow and visualization.
d) All
10) Which of the following is the data mining tool?
a) RapidMiner
b) Weka.
c) Both a and b.
d) None
1. A 2. C 3. B 4. D 5. B
6. D 7. A 8. D 9. D 10. C
Further Readings
Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., & Cunningham, S. J. (1999).
Weka: Practical machine learning tools and techniques with Java implementations.
Markov, Z., & Russell, I. (2006). An introduction to the WEKA data mining system. ACM
SIGCSE Bulletin, 38(3), 367-368.
Kotu, V., & Deshpande, B. (2014). Predictive analytics and data mining: concepts and
practice with rapidminer. Morgan Kaufmann.
Hofmann, M., & Klinkenberg, R. (Eds.). (2016). RapidMiner: Data mining use cases and
business analytics applications. CRC Press.
Web Links
https://www.cs.waikato.ac.nz/ml/weka/
https://www.analyticsvidhya.com/learning-paths-data-science-business-analytics-
business-intelligence-big-data/weka-gui-learn-machine-learning/
https://storm.cis.fordham.edu/~gweiss/data-mining/weka.html
https://docs.rapidminer.com/9.5/server/configure/connections/creating-other-
conns.html
https://docs.rapidminer.com/latest/studio/connect/
Objectives
After this lecture, you will be able to
• Know the process of accessing and loading information from the Repository into the
Process using retrieve Operator.
• Implementation of storage operator to store data and model.
• Various methods of visualizing the data.
• Creation of a new repository and usage of an existing repository.
Introduction
Following the directed dialogue or using the drag and drop feature to import data to your
repository is easy. Simply drag the file from your file browser onto the canvas and follow the on-
screen instructions. Check that the data types are right and that the target or label is correctly
flagged. Note that this "import then open" approach differs significantly from other ways of
opening data.
Data from Microsoft Excel spreadsheets can be loaded using this operator. Excel 95, 97, 2000, XP,
and 2003 data can be read by this operator. The user must specify which of the workbook's
spreadsheets will be used as a data table. Each row must represent an example, and each column
must represent an attribute in the table.Please keep in mind that the first row of the Excel sheet can
be used for attribute names that are defined by a parameter. The data table can be put anywhere on
the sheet and can include any formatting instructions, as well as empty rows and columns. Empty
cells or cells with only "?" can be used to show missing data values in Excel.
The easiest and shortest way to import an Excel file is to use the import configuration wizard from the
Parameters panel. An alternative approach, which can take a little more time, is to set all of the
parameters in the Parameters panel directly instead of using the wizard. Before creating a process
that uses the Excel file, please double-check that it can be read correctly.
To get started with RapidMiner Studio, build a local repository on your computer.
From the Repositories view, select Import Excel Sheet from the pull-down to import the training
data set.
The Import Wizard launches. Browse to the location where you saved customer-churn-data.xlsx,
select the file, and click Next.
The wizard will walk you through the process of importing data into the repository. Verify that
you are importing the right Excel sheet by looking at the tabs at the end. Although there is only one
sheet in this file, RapidMiner Data, it is always a good idea to double-check. If there were more
sheets, it might look like this:
Step 2 also allows you to select a range of cells for import. For this tutorial, you want all
cells (the default). Click Next.
RapidMiner has preselected the first row as the row that contains column names in this process.
You could fix it here if it were inaccurate, but this is not necessary for your data set. Accept the
default and move on to the next step, which defines the data that will be imported. The example set
consists of the entire spreadsheet, with each row or example representing one customer.
This phase has four essential components. The first specifies which columns should be imported. The
column names (or attributes, as RapidMiner refers to them) are those that were defined in the
previous phase by the Name annotation; these are Gender, Age, Payment Method, Churn, and
LastTransaction. The data types for each attribute are described by the drop-down boxes in the third
row. The data type determines which values are permitted for an attribute (polynomial, numeric,
integer, etc.).
The pull-down showing the attribute role in the fourth row is the most important task for this import.
This pull-down allows you to tell RapidMiner which attribute is the main focus of your model. You set
the role of Churn to label because it is the column you want to forecast. Based on what it has learned,
RapidMiner can then predict a value of the label for each example.
To finish the import, navigate to the data folder in Getting Started and give the file a name. It is
worth noting that while it is in the RapidMiner repository, it will be saved in RapidMiner's special
format, which means it will not have a file name extension by convention. Click on Finish.
When you import data in RapidMiner, you need to select the attribute type "label" for the
column you wish to classify.
When you click Finish, the data set loads into your repository and RapidMiner switches to the
Results perspective, where your data is displayed. The following parameters need to be set to import
your data successfully.
Personal research tasks should be organized into new folders in the repository, which should be
named appropriately.
Input
An Excel file is expected as a file object; it can be generated by using other operators that have
file output ports, such as the Read File operator.
Output
This port provides a tabular version of the Excel file as well as metadata. This output is
comparable to that of the Retrieve operator.
Parameters
• excel_file: The path of the Excel file is specified here. It can be selected using the choose a
file button.
• sheet_selection: This option allows you to change the sheet selection between sheet number and
sheet name.
• sheet_number: The number of the sheet which you want to import should be specified
here. Range: integer
• sheet_name: The name of the sheet which you want to import should be specified here. Range:
string
• imported_cell_range: This is a mandatory parameter. The range of cells to be imported from the
specified sheet is given here. It is specified in 'xm:yn' format where 'x' is the column of the first cell
of the range, 'm' is the row of the first cell of the range, 'y' is the column of the last cell of the range,
'n' is the row of the last cell of the range. 'A1:E10' will select all cells of the first five columns from
rows 1 to 10.
• first_row_as_names: If this option is set to true, it is assumed that the first line of the Excel file
has the names of the attributes. The attributes are then named automatically, and the first line of the
Excel file is not treated as a data row.
• annotations: If the first row as names parameter is not set to true, annotations can be added using
the 'Edit List' button of this parameter which opens a new menu. This menu allows you to select
any row and assign an annotation to it. Name, Comment, and Unit annotations can be assigned. If
row 0 is assigned Name annotation, it is equivalent to setting the first row as names parameter to
true. If you want to ignore any rows you can annotate them as Comments.
• date_format: The date and time format are specified here. Many predefined options exist; users
can also specify a new format. If text in an Excel file column matches this date format, that column
is automatically converted to date type. Some corrections are automatically made in the date type
values. For example, a value '32-March' will automatically be converted to '1-April'. Columns
containing values that can't be interpreted as numbers will be interpreted as nominal, as long as
they don't match the date and time pattern of the date format parameter. If they do, this column of
the Excel file will be automatically parsed as the date and the according attribute will be
of date type.
• time_zone: This is an expert parameter. A long list of time zones is provided; users can select any
of them.
• Locale: This is an expert parameter. A long list of locales is provided; users can select any of them.
• read_all_values_as_polynominal: This option allows you to disable the type handling for this
operator; every column will be read as a polynominal attribute. To parse an Excel date afterwards,
use 'date_parse(86400000 * (parse(date_attribute) - 25569))' (- 24107 for Mac Excel 2007). Range: boolean
• A further parameter describes the metadata of the ExampleSet created from the specified Excel
file: column index, name, type, and role can be specified here. The Read Excel operator tries to
determine an appropriate type for each attribute by reading the first few lines and checking the
occurring values. If all values are integers, the attribute will become an integer; similarly, if all
values are real numbers, the attribute will become of type real. Columns containing values that
cannot be interpreted as numbers will be interpreted as nominal, as long as they do not match the
date and time pattern of the date format parameter; if they do, that column of the Excel file will
automatically be parsed as a date and the corresponding attribute will be of date type.
Automatically determined types can be overridden here. Values that do not comply with the
expected value type are considered missing values and are replaced by '?'. For example, if 'back'
is written in an integer column, it will be treated as a missing value. A question mark (?) or
an empty cell in the Excel file is also read as a missing value.
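The constants in the date expression quoted above come from the Excel serial-date convention: 25569 is the number of days between the Excel epoch (30 December 1899) and the Unix epoch (1 January 1970), and 86400000 is the number of milliseconds per day. The small Java sketch below performs the same conversion; the serial number used is only an illustration.

    import java.util.Date;

    public class ExcelSerialDate {
        // Convert an Excel serial date (days since 1899-12-30, Windows convention)
        // to a java.util.Date. 25569 days separate the Excel epoch from the Unix
        // epoch (1970-01-01); 86400000 is the number of milliseconds in a day.
        static Date fromExcelSerial(double serial) {
            long millis = (long) ((serial - 25569) * 86400000L);
            return new Date(millis);   // midnight UTC of that day, shown in the local time zone
        }

        public static void main(String[] args) {
            // Serial 44197 corresponds to 1 January 2021.
            System.out.println(fromExcelSerial(44197));
        }
    }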
Create two datasets, merge the results of both datasets, and store the merged results using the
Store operator.
There are two quick and easy ways to store a RapidMiner model in a repository:
1. Right-click on the tab in the Results panel and you should see an option to store the model in the
repository.
2. Use the Store operator in your process to write the model to a repository location.
Either way will get you a model object in the repository which you can use or view any time you like:
Output
The RapidMiner Object with the path defined in the repository entry parameter is returned.
Parameters
repository_entry
The location of the RapidMiner Object to be loaded. This parameter points to a repository entry,
which will be returned as the Operator's output.
Repository locations are resolved relative to the Repository folder containing the current Process.
Folders in the Repository are separated by a forward slash ('/'). A '..' references the parent folder. A
leading forward slash references the root folder of the Repository containing the current Process. A
leading double forward slash ('//') is interpreted as an absolute path starting with the name of a
Repository. The list below shows the different methods:
• 'MyData' looks up an entry 'MyData' in the same folder as the current Process
• '../Input/MyData' looks up an entry 'MyData' located in a folder 'Input' next to the folder
containing the current Process
• '/data/Model' looks up an entry 'Model' in a top-level folder 'data' in the Repository
holding the current Process
Create a student data set, load it into the local repository of RapidMiner Studio, and generate
the results.
2D Scatter Plot
The resulting interface demonstrates a wide range of visualization options; in this case, we'll use
RapidMiner's advanced plotting capabilities. In the advanced charts tab, set the domain dimension of
the iris dataset to a1, the axis to a2, and the color dimension to the label. This adds a dimension to
the axis of the 2D scatter plot, as shown in Figure 14.
Line Chart
If you'd rather make a line chart, the process is the same, but you'll need to change the format to
lines as shown in Figure 15.
Scatter plots are most useful when the objects in the dataset are static points in time. When plotting
time series, line charts are extremely useful.
Histogram
A histogram, similar to a scatter plot, is a chart that plots the frequency of occurrence of particular
values. In more detail, this means that a collection of bins is generated over the range of values of
one of the axes. Consider the iris dataset and the a2, a3, and a4 axes once more. In RapidMiner, go to
the charts tab and pick a2, a3, and a4 to plot a histogram. The number of bins can now be set to 40,
as illustrated in Figure 16.
Figure 16: Histogram
Pie Chart
A pie chart is typically depicted as a circle divided into sectors, with each sector representing a
percentage of a given quantity. Each of the sectors, or slices, is often annotated with a percentage
indicating how much of the sum falls into each of the categories. In data analytics, pie charts can be
useful for observing the proportion of data points that belong to each of the categories.
Figure 17: Pie Chart
To make this chart in RapidMiner, go to the Design tab, then to the Charts tab at the top, then to the
Pie chart. Pie charts only accept one value as input, and in this case the labeled column should be
used as the input. An example of such an interface is shown in Figure 14. Pie charts, in
particular, are a simple way to display how well a dataset is balanced. The Iris dataset is perfectly
balanced, as you can see in Figure 17, with the area equally split between the three classes in the
dataset.
Did you know? To display numerical data, pie graphs are circular graphs divided into sectors
or slices.
Box Plots
Bubble Charts
Bubble charts can also be modeled with RapidMiner. Bubble charts are a type of two-dimensional
diagram that represents three-dimensional data; the radius of each circle represents the third
dimension. Using the Iris dataset, Figure 19 shows an example of this diagram in RapidMiner. Bubble
charts are important because they provide a two-dimensional representation of the dataset, allowing
viewers to maintain a sense of the third dimension based on the size of the circle surrounding the
object.
Figure 19: Bubble Chart of the Iris Dataset
Summary
• Store operator stores an IO Object in the data repository.
• The IO Object provided at the input port is delivered through this output port without any
modifications.
• The stored object can be used by other processes by using the Retrieve operator.
• The behavior of an Operator can be modified by changing its parameters.
• There are two quick and easy ways to store a RapidMiner model in a repository
Keywords
Operators: The elements of a Process, each Operator takes input and creates output, depending on
the choice of parameters.
Parameters: Options for configuring the behavior of an Operator.
Help: Displays a help text for the current Operator.
Repository_entry: This parameter is used to specify the location where the input IO Object is to be
stored.
Bubble charts: These charts are used for representing three-dimensional data in the form of a two-
dimensional diagram.
a) Export
b) Repository Access
c) Import
d) Evaluation
3.Data, processes, and results are stored in the________
a) Process.
b) Repository.
c) Program
d) RapidMiner Server
4. The Repository can be used to store:
a) Data
b) Processes
c) Results
d) All of the above
5. Which of the following is a way of getting an operator into the Process Panel?
a) CSV File
a) Import Data
b) Export Data
c) Merge Data
d) None
8. RapidMiner Studio can blend structured data with unstructured data and then leverage all
the data for _____________analysis.
a) Descriptive
b) Diagnostic
c) Predictive
d) Prescriptive
9. ______________ is an advanced version of RapidMiner Studio that speeds up the process of
building and validating data models.
a) Process View
b) Operator View
c) Repository View
d) None
12. Which of the following is/are used to define an operator?
a) Meta-Data View
b) Help View
c) Comment View
d) Problems View
6. D 7. A 8. C 9. B 10. A
Review Questions
Q1) How do you connect to data using RapidMiner Studio? What are the different file formats
supported by RapidMiner?
Q2) Write down the different ways to store a RapidMiner model in a repository.
Q3) Create your own dataset and write down the steps for importing and storing the data you
created using the Store and Import operators.
Q4) With an appropriate example, explain the different ways of importing data into RapidMiner
Studio.
Q5) When and how do you add a 'label' attribute to a dataset? Explain with an example.
Q6) Create your repository and load data into it. Process the data and represent the results using
three different charts available in RapidMiner.
Q7) Write down the step-by-step procedure to display the data using a line chart.
Further Readings
Kotu, V., & Deshpande, B. (2014). Predictive analytics and data mining: concepts and
practice with RapidMiner. Morgan Kaufmann.
Hofmann, M., & Klinkenberg, R. (Eds.). (2016). RapidMiner: Data mining use cases and
business analytics applications. CRC Press.
Ertek, G., Tapucu, D., & Arin, I. (2013). Text mining with RapidMiner. RapidMiner: Data
mining use cases and business analytics applications, 241.
Ryan, J. (2016). RapidMiner for text analytic fundamentals. Text Mining and Visualization:
Case Studies Using Open-Source Tools, 40, 1.
Siregar, A. M., Kom, S., Puspabhuana, M. K. D. A., Kom, S., & Kom, M. (2017). Data Mining:
Pengolahan Data Menjadi Informasi dengan RapidMiner. CV Kekata Group.
Objectives
After this lecture, you will be able to
• Learn the need for preprocessing of data.
• Know the major Tasks in Data Preprocessing
• Understand the concept of missing data and various methods for handling
missing data.
• Learn the concept of data discretization and data reduction.
• Understand the various strategies of data reduction.
Introduction
Preprocessing data is a data mining technique for transforming raw data into a usable and efficient
format. The measures taken to make data more suitable for data mining are referred to as data
preprocessing. The steps involved in Data Preprocessing are normally divided into two categories:
• Selecting the data objects and attributes relevant to the analysis.
• Creating or changing attributes, for example by adding or removing attributes.
In data mining, data preprocessing is one of the most important steps in the well-known
knowledge discovery from data (KDD) process. Data taken directly from the source will
contain errors and inconsistencies and, most importantly, will not be ready to be used by a data
mining tool.
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value-added
• Interpretability
• Accessibility
Accuracy: Data is accurate when the stored values reflect real-world values. Example: If a
customer's age is 32 but the system says she's 34, the data is inaccurate.
Completeness: When data meets comprehensiveness expectations, it is considered "complete."
Assume you ask the customer to include his or her name. You can make a customer's middle name
optional, but the data is full as long as you have their first and last names.
Consistency: The same information can be held in many locations across several systems. The data
is called "consistent" if the information matches.
Example: It's inconsistent if your human resources information system claims an employee
no longer works there, but your payroll system claims he's still getting paid.
Timeliness: Is your data readily accessible when you need it? This is called the "timeliness"
dimension of data quality. Let's say you need financial data every quarter; the data is timely if it
arrives when it's expected to.
The major tasks in data preprocessing are: Data Cleaning, Data Integration, Data Transformation,
Data Reduction, and Data Discretization.
Data Cleaning: Data cleaning is the process of detecting and correcting (or removing) noisy,
incomplete, and inconsistent data so that it can be used for analysis.
Data Integration: Data integration is the process of combining data from multiple sources into a
coherent data store.
Example: many tuples have no recorded value for several attributes, such as customer
income in sales data
Some data cleaning techniques for missing values (a minimal pandas sketch of these options
follows the list):
1. Ignore the tuple. This is usually done when the class label is missing. The approach is not
effective unless the tuple contains several attributes with missing values.
2. Fill in the missing value manually. This method works well with small data sets that have only a
few missing values.
3. Use a global constant, such as "Unknown" or minus infinity, to replace all missing attribute
values.
4. Use the attribute mean to fill in the missing value. If the average customer income is $25,000,
you can use this amount to fill in the missing income value.
5. Fill in the missing value with the most probable value.
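The sketch below illustrates options 1, 3, and 4 with pandas; the sales table, the column names, and
the values are hypothetical and chosen only so that the attribute mean comes out to $25,000 as in
the text.
import numpy as np
import pandas as pd

# Hypothetical sales data with one missing customer income value.
df = pd.DataFrame({"customer": ["A", "B", "C", "D"],
                   "income": [22000.0, np.nan, 27000.0, 26000.0]})

dropped = df.dropna()                                     # option 1: ignore the tuple
constant = df.fillna({"income": -1})                      # option 3: a global sentinel constant
mean_filled = df.fillna({"income": df["income"].mean()})  # option 4: attribute mean (25000.0)

print(mean_filled)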
For example, sorted numeric values partitioned into equal-frequency bins and then smoothed by
bin boundaries (each value replaced by the closest bin boundary) might look like:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
A short sketch of this kind of binning-based smoothing is shown below.
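The sorted value list in the sketch is an assumption chosen so that smoothing by bin boundaries
reproduces the bins shown above; it is not prescribed by the text.
import numpy as np

# Assumed example: sorted values partitioned into equal-frequency bins of size 4.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(np.mean(b), 2)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closest bin boundary.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_means)        # [[9.0, 9.0, 9.0, 9.0], ...]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]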
Data mining combines techniques from a variety of fields, including database and data
warehouse technology, statistics, machine learning, high-performance computing, pattern
recognition, neural networks, data visualisation, information retrieval, image and signal
processing, and spatial and temporal data analysis
A. Tight Coupling
The method of ETL - Extraction, Transformation, and Loading - is used to integrate data from
various sources into a single physical location in tight coupling.
B. Loose Coupling
In loose coupling, the data remains in the source databases rather than being copied to a single
location. An interface is provided that takes a user's query, converts it into a format that the source
databases can understand, and then sends the query directly to the source databases to obtain the
response.
Data Integration Issues
We must deal with many issues while integrating the data, which are discussed below:
How do we match the real-world entities from the data, given that the data is unified from
heterogeneous sources? For example, we have customer data from two separate data sources: one
data source identifies the entity by a customer id, while the other uses a customer number. How
does the data analyst or the machine know that these two fields refer to the same entity?
One of the major concerns during data integration is redundancy. Unimportant data or data that is
no longer needed is referred to as redundant data. It can also happen if there are attributes in the
data set that can be extracted from another attribute.
Example: If one data set contains the customer's age and another data set contains the
customer's date of birth, age will be a redundant attribute since the date of birth could be used to
derive it.
The extent of redundancy is also increased by attribute inconsistencies. Correlation analysis can be
used to detect redundancy: attributes are analyzed to see whether they are interdependent, that is,
how strongly one attribute implies the other.
Data conflict occurs when data from various sources is combined and does not fit together. For
example, the attribute values can vary between data sets. The disparity may be because they are
represented differently in different data sets. Assume that the price of a hotel room is expressed in
different currencies in different cities. This type of problem is detected and fixed during the data
integration process.
1. Smoothing: The term "smoothing" refers to the process of removing noise from data.
2. Aggregation: Aggregation is the application of summary or aggregation operations to data.
3. Generalization: In generalization, low-level data is replaced with high-level data by climbing
concept hierarchies.
4. Normalization: In normalization, attribute data is scaled to fall within a small range, such as 0.0
to 1.0.
5. Attribute Construction: In attribute construction, new attributes are created from a given set of
attributes.
Aggregation of Data
By performing an aggregation process on a large collection of data, data aggregation reduces the
volume of the data set.
Example: We have a data set of sales reports of an enterprise that records quarterly sales for each
year. We can aggregate the data to get the annual sales report of the enterprise, as the sketch below
shows.
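A minimal pandas sketch of this kind of aggregation; the quarterly sales figures are hypothetical.
import pandas as pd

# Hypothetical quarterly sales of an enterprise.
quarterly = pd.DataFrame({
    "year":    [2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [224, 408, 350, 586, 310, 412, 395, 602],
})

# Aggregation: collapse the quarterly figures into annual sales.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)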
Generalization
A nominal attribute has a finite set of distinct values with no ordering between them. Job category,
age category, geographic region, and item category are examples of nominal attributes. Nominal
attributes form a concept hierarchy when a group of attributes is organized into levels; for example,
a concept hierarchy can be created by combining attributes such as street, city, state, and country.
A concept hierarchy divides the data into several levels of abstraction. At the schema level, the
concept hierarchy can be created by defining a partial or total ordering between the attributes.
Alternatively, a concept hierarchy can be created by explicitly grouping the data values of a portion
of an intermediate level.
Normalization
Data normalization entails transforming the values of a data variable into a specific range. In other
words, it is the process of mapping data values into a smaller range, such as [-1, 1] or [0.0, 1.0].
The following are some examples of normalization techniques:
A. Min-Max Normalization
Min-max normalization is a linear transformation of the original data. Assume that the minimum
and maximum values of an attribute A are min_A and max_A, respectively. A value v of A is
mapped to v' in the new range [new_min_A, new_max_A] by the formula:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
B. Z-score Normalization
This approach normalizes a value v of attribute A using the mean and standard deviation of A:
v' = (v - Ā) / σ_A
Here Ā and σ_A are the mean and standard deviation of attribute A, respectively. For instance,
suppose attribute A has a mean of $54,000 and a standard deviation of $16,000. Using z-score
normalization, the value $73,600 is transformed to (73,600 - 54,000) / 16,000 = 1.225.
C. Decimal Scaling
This method normalizes a value v of attribute A by moving the decimal point. The number of
places moved depends on the maximum absolute value of A:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1. A short sketch of these three normalization
techniques follows.
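In the sketch below, the value list is an assumption made for illustration; only the mean of $54,000,
the standard deviation of $16,000, and the value $73,600 come from the text.
import numpy as np

# Hypothetical income values.
values = np.array([33000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to the new range [0.0, 1.0].
min_max = (values - values.min()) / (values.max() - values.min())
print(min_max)

# Z-score normalization: v' = (v - mean) / std, using the figures given in the text.
mean, std = 54000.0, 16000.0
z_score = (values - mean) / std
print(z_score[2])            # (73600 - 54000) / 16000 = 1.225

# Decimal scaling: v' = v / 10^j with j the smallest integer such that max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(values).max())))
print(values / 10 ** j)      # j = 5, so 73600 becomes 0.736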
Bottom-up Discretization
Bottom-up discretization or merging is described as a process that begins by considering all
continuous values as potential split-points and then removes some by merging neighborhood
values to form intervals.
Discretization on an attribute can be done quickly to create a definition hierarchy, which is a
hierarchical partitioning of the attribute values.
Supervised Discretization
Unsupervised Discretization
Unsupervised discretization algorithms are the simplest algorithms to make use of, because the
only parameter you would specify is the number of intervals to use; or else, how many values
should be included in each interval.
1. Data Cube Aggregation: In the construction of a data cube, aggregation operations are applied to
the data. This method is used to condense data into a more manageable form. As an example,
consider data collected for a study from 2012 to 2014 that records the company's revenue every
three months; rather than keeping the quarterly figures, the data can be aggregated into annual
revenue.
2. Dimension Reduction: Whenever we come across weakly relevant or redundant attributes, we
keep only the attributes needed for our analysis. Dimension reduction shrinks the data by
removing obsolete or redundant features. The following methods are used for dimension
reduction:
• Step-wise Forward Selection
• Step-wise Backward Selection
• Combination of Forward and Backward Selection
Step-wise Forward Selection
The selection begins with an empty set of attributes; at each step, the best of the remaining original
attributes is added to the reduced set, based on its relevance to the analysis (a sketch using scikit-
learn's forward selector appears after the steps).
Initial attribute Set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
• Step-1: {X1}
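A hedged sketch of step-wise forward selection using scikit-learn's SequentialFeatureSelector; the
Iris data, the logistic regression estimator, and the choice of two selected features are assumptions
made only for illustration.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Step-wise forward selection: start from an empty set and greedily add the
# attribute that improves cross-validated accuracy the most.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes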
Data Compression
Using various encoding methods, the data compression technique reduces the size of files
(Huffman Encoding & run-length Encoding). Based on the compression techniques used, we can
split it into two forms.
• Lossless Compression
• Lossy Compression
Lossless Compression – Encoding techniques such as Run-Length Encoding allow for simple data
reduction. In lossless data compression, algorithms recover the exact original data from the
compressed data (see the sketch below).
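A minimal sketch of run-length encoding in Python, showing that the exact original string is
recovered after decoding; the sample string is an arbitrary assumption.
from itertools import groupby

def rle_encode(text):
    """Run-length encode a string into (symbol, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Reverse the encoding; the exact original data is recovered (lossless)."""
    return "".join(ch * count for ch, count in pairs)

original = "AAAABBBCCDAAA"
encoded = rle_encode(original)          # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 3)]
assert rle_decode(encoded) == original  # lossless: decoding gives back the original
print(encoded)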
Lossy Compression – Examples of this type of compression include the Discrete Wavelet Transform
and PCA (principal component analysis). The JPEG image format, for example, is a lossy
compression, but we can still recover an image that is visually close to the original. In lossy
compression the decompressed data differ from the original data, but they are still useful for
retrieving information.
• Regression
Simple linear regression and multiple linear regression are two types of regression. When there is
only one independent attribute, the regression model is referred to as simple linear regression, and
when there are multiple independent attributes, it is referred to as multiple linear regression.
• Log-Linear Model
Did you Know? On sparse data, both regression and the log-linear model can be used, but
their implementation is limited.
Non-Parametric Methods:Histograms, clustering, sampling, and data cube aggregation are
examples of non-parametric methods for storing reduced representations of data.
Summary
• Data transformation is the process of converting data into a format that allows for effective
data mining.
• Normalization, discretization, and concept hierarchy are the most powerful ways of
transforming data.
• Data normalization is the process of scaling data values so that they fall within a smaller range.
• The data values of a numeric variable are replaced with interval labels when data
discretization is used.
• The data is transformed into multiple levels of abstraction by using concept hierarchies.
• Accuracy, completeness, consistency, timeliness, believability, and interpretability are all terms
used to describe data quality. These qualities are evaluated based on the data's intended use.
• Data cleaning routines aim to fix errors in the data by filling in missing values, smoothing noisy
data, and identifying outliers. Data cleaning is normally done in two steps, one after the other:
discrepancy detection and data transformation.
Keywords
Data cleaning: To remove noise and inconsistent data.
Data integration: Multiple data sources may be combined.
Data transformation: Where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations.
Data discretization: It transforms numeric data by mapping values to interval or concept labels.
2. Data cleaning is
A. Large collection of data mostly stored in a computer system
B. The removal of noise errors and incorrect input from a database
C. The systematic description of the syntactic structure of a specific database. It describes the
structure of the attributes of the tables and foreign key relationships.
D. None of these
7. Which data mining task can be used for predicting wind velocities as a function of temperature,
humidity, air pressure, etc.?
A. Cluster Analysis
B. Regression
C. Classification
D. Sequential pattern discovery
A. Data integration
B. Data reduction
C. Data Transformation
D. Data cleaning
10. Careful integration can help reduce and avoid __________and inconsistencies in resulting data
set.
A. Noise
B. Redundancies
C. Error
D. None
11. _________________is the process of changing the format, structure, or values of data.
A. Data integration
B. Data reduction
C. Data Transformation
D. Data cleaning
12. In Binning, we first sort data and partition into (equal-frequency) bins and then which of the
following is not a valid step
A. smooth by bin boundaries
B. smooth by bin median
C. smooth by bin means
D. smooth by bin values
1. D 2. B 3. A 4. D 5. A
6. C 7. B 8. A 9. A 10. B
Review Questions
Q1) Data quality can be assessed in terms of several issues, including accuracy, completeness, and
consistency. For each of the above three issues, discuss how data quality assessment can depend on
the intended use of the data, giving examples. Propose two other dimensions of data quality.
Q2) In real-world data, tuples with missing values for some attributes are a common occurrence.
Describe various methods for handling this problem.
Q3) Discuss issues to consider during data integration.
Q4) With examples, explain the various methods of normalization.
Q5) Elaborate on various data reduction strategies by giving the example of each strategy.
Q6) Explain the concept of data discretization along with its various methods.
Q7) What is the need for data preprocessing? Explain the different data preprocessing tasks in
detail.
Further Readings
Kotu, V., & Deshpande, B. (2014). Predictive analytics and data mining: concepts and
practice with RapidMiner. Morgan Kaufmann.
Hofmann, M., & Klinkenberg, R. (Eds.). (2016). RapidMiner: Data mining use cases and
business analytics applications. CRC Press.
Ertek, G., Tapucu, D., & Arin, I. (2013). Text mining with RapidMiner. RapidMiner: Data
mining use cases and business analytics applications, 241.
Ryan, J. (2016). RapidMiner for text analytic fundamentals. Text Mining and Visualization:
Case Studies Using Open-Source Tools, 40, 1.
Siregar, A. M., Kom, S., Puspabhuana, M. K. D. A., Kom, S., & Kom, M. (2017). Data Mining:
Pengolahan Data Menjadi Informasi dengan RapidMiner. CV Kekata Group.
https://towardsdatascience.com/data-preprocessing-in-data-mining-machine-learning-
79a9662e2eb
https://www.geeksforgeeks.org/numerosity-reduction-in-data-mining/
https://cs.ccsu.edu/~markov/ccsu_courses/datamining-3.html
https://www.cse.wustl.edu/~zhang/teaching/cs514/Spring11/Data-prep.pdf
Objectives
After this lecture, you will be able to
• Learn the need for preprocessing of data.
• Need and implementation of Remove Duplicate Operator.
• Learn the implementation of different methods of handling missing values.
• Understand the various methods of renaming an attribute.
• Learn the implementation of the Apriori Algorithm in Weka.
Introduction
Data pre-processing is a data mining technique that entails converting raw data into a format that
can be understood. Real-world data is often incomplete, unreliable, and/or deficient in specific
habits or patterns, as well as containing numerous errors. The most important information at the
pre-processing phase is about the missing values.Preprocessing data is a data mining technique for
transforming raw data into a usable and efficient format. The measures taken to make data more
suitable for data mining are referred to as data preprocessing. The steps involved in Data
Preprocessing are normally divided into two categories:
• Selecting the data objects and attributes relevant to the analysis.
• Creating or changing attributes, for example by adding or removing attributes.
The Remove Duplicates operator removes duplicate examples by comparing all examples with each
other on the basis of the selected attributes; these attributes can be selected through the attribute
filter type parameter and other related parameters. Assume that two attributes, 'att1' and 'att2', are
selected, with three and two possible values, respectively. There are then a total of six (three times
two) unique combinations of these two attributes, so the resulting Example Set can have at most 6
examples. This operator works on all attribute types. The following Iris dataset contains 150
examples before the Remove Duplicates operator is applied. The following are the various
parameters applicable to the Remove Duplicates operator.
Consider any dataset from sample dataset in Rapid miner and implement Remove Duplicate
operator on different attributes.
Input
An Example Set is expected at this input port. In the attached Example Process, it is the output of
the Retrieve operator. The output of other operators can also be used as input.
Output
• Example set Output(Data Table)
The duplicate examples in the provided Example Set are removed, and the resulting Example Set is
delivered via this port.
This port is used to deliver duplicated examples from the given Example Set.
Parameters
attribute_filter_type
This parameter lets you choose the attribute selection filter, which is the tool you'll use to pick the
appropriate attributes. It comes with the following features:
• All: This choice simply selects all of the Example Set's attributes. This is the standard-
setting.
• single: Selecting a single attribute is possible with this choice. When you select this choice,
a new parameter (attribute) appears in the Parameters panel.
• subset: This choice allows you to choose from a list of multiple attributes. The list contains
all of the Example Set's attributes; appropriate attributes can be easily selected. If the
metadata is unknown, this option will not work.
• regular_expression: This option allows you to pick attributes using a regular expression.
Other parameters (regular expression, use except expression) become available in the
Parameters panel when this choice is selected.
• value_type: This option allows you to pick all attributes of a given type. It's worth noting
that types are arranged in a hierarchy; real and integer types, for example, are also numeric
types. When selecting attributes via this option, users should have a basic understanding of
the type hierarchy. Other parameters in the Parameters panel become available when this
choice is selected.
• block_type: This option works similarly to the value type option. This choice allows you
to choose all of the attributes for a specific block form. Other parameters (block type, use
block type exception) become available in the Parameters panel when this choice is
selected.
• no_missing_values: This choice simply selects all of the Example Set's attributes that do
not have a missing value in any of the examples. All attributes with a single missing value
are deleted.
• numeric value filter: When the numeric value filter choice is selected, a new parameter
(numeric condition) appears in the Parameters panel. The examples of all numeric
attributes that satisfy the numeric condition are chosen. Please note that regardless of the
numerical situation, all nominal attributes are chosen.
attribute
This choice allows you to choose the desired attribute. If the metadata is identified, the attribute
name can be selected from the attribute parameter's drop-down box.
attributes
This choice allows you to pick the appropriate attributes. This will bring up a new window with
two lists. The left list contains all attributes, which can be transferred to the right list, which
contains the list of selected attributes for which the conversion from nominal to numeric will be
performed; all other attributes will remain unchanged.
Regular_expression
This expression will be used to pick the attributes whose names match this expression. Regular
expressions are a powerful tool, but they require a thorough introduction for beginners. It's always
a good idea to use the edit and display regular expression menu to define the regular expression.
value_type
A drop-down menu allows you to pick the type of attribute you want to use. You may choose one
of the following types: nominal, text, binominal, polynomial, or file path.
The following figure shows the design view of the Iris dataset.
block_type
A drop-down list may be used to choose the block type of attributes to be used. 'single value' is the
only possible value.
numeric_condition
This is where you specify the numeric condition used for testing examples of numeric attributes.
The numeric condition '> 6', for example, keeps all nominal attributes and all numeric attributes
having a value greater than 6 in every example. Conditions such as '> 6 && < 11' or '<= 5 || < 0' are
also possible, but && and || cannot be used together in the same numeric condition.
Did you know? Conditions like '(> 0 && < 2) || (> 10 && < 12)' are not allowed, since they use
both && and ||. Use a blank space after '>', '=' and '<'; for example, '<5' will not work, use '< 5'
instead.
include_special_attributes
Special attributes are attributes with special roles that identify the examples, whereas regular
attributes simply describe the examples. The special attributes are id, label, prediction, cluster,
weight, and batch.
Originally there are a total of 150 examples in the Iris dataset; after applying the Remove
Duplicates operator we are left with 147 examples, because the remaining 3 examples were
duplicates and were removed. Two examples are considered duplicates if the selected attributes
have the same values. The attribute filter type parameter and other related parameters can be used
to select these attributes; assume again that two attributes, 'att1' and 'att2', are chosen, with three
and two possible values, respectively. An analogous idea in pandas is sketched below.
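RapidMiner's Remove Duplicates operator is configured in the GUI, but the underlying idea can be
sketched with pandas; the frame and the attribute names 'att1' and 'att2' below are hypothetical.
import pandas as pd

# Hypothetical Example Set with duplicate rows on the selected attributes.
df = pd.DataFrame({"att1": ["x", "x", "y", "y", "x"],
                   "att2": [1, 1, 2, 2, 1]})

# Keep the first occurrence of every (att1, att2) combination; the rest are duplicates.
deduplicated = df.drop_duplicates(subset=["att1", "att2"])
duplicates = df[df.duplicated(subset=["att1", "att2"])]

print(len(df), "->", len(deduplicated))   # 5 -> 2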
Rename
The Rename operator is used to rename one or more attributes of the input Example Set. It's
important to remember that attribute names must be unique. The Rename operator does not
change an attribute's type or role.
For example, suppose you have an attribute named 'alpha' with integer type and regular role.
Renaming it to 'beta' only changes its name; it keeps its integer type and regular role. Use the Set
Role operator to change an attribute's role. A variety of type conversion operators for changing the
type of an attribute are available under 'Data Transformation/Type Conversion'.
The Rename operator takes two parameters; one is the existing attribute name and the other is
the new name you want the attribute to be renamed to.
Input
An Example Set is expected at this input port. In the attached Example Process, it is the output of
the Retrieve operator. The output of other operators can also be used as input. Since attributes are
specified in the metadata, metadata must be attached to the input data; the Retrieve operator
returns metadata in addition to the data.
Output
This port produces an Example Set with renamed attributes.
Original
This port passes the Example Set that was provided as input to the output without any changes.
This is commonly used to reuse the same Example Set through several operators or to display the
Example Set in the Results Workspace.
Parameters
old name
This parameter is used to specify which attribute's name should be modified.
new name
This parameter is used to specify the attribute's new name. Special characters may also be used in a
name.
This Example Process makes use of the 'Golf' data collection. The 'Wind' attribute has been renamed
to '#*#', and the 'Play' attribute has been renamed to 'Game'. To demonstrate that special characters
can be used to rename attributes, the 'Wind' attribute is renamed to '#*#'. Attribute names, on the
other hand, should always be meaningful and specific to the type of data stored in them.
Rename By replacing
This operator can be used to rename a collection of attributes by substituting a substitute for parts
of the attribute names.
The Rename by Replacing operator substitutes the required replacement for parts of the attribute
names. This operator is often used to remove unwanted characters from attribute names, such as
whitespaces, parentheses, and other characters. The replace what parameter specifies which part of
the attribute name should be changed; it is a regular expression, which is a very powerful tool. The
replace by parameter can be an arbitrary string, and empty strings are also permitted. Capturing
groups of the regular expression in the replace what parameter can be accessed in the replace by
parameter using $1, $2, $3, and so on. Please take a look at the attached Example Process for more
details.
The Retrieve operator is used to load the 'Sonar' data set. A breakpoint has been placed here so that
you can see the Example Set. The Example Set has 60 regular attributes with names such as
attribute_1, attribute_2, and so on. The Rename by Replacing operator is applied to it. Since the
attribute filter type parameter is set to 'all', this operator can rename any attribute. In the names of
the 'Sonar' attributes, 'attribute_' is replaced by 'att-' followed by the first capturing group. As a
result, the attributes are renamed to att-1, att-2, and so on; a pandas analogue is sketched below.
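The same regex-based renaming idea can be sketched in Python with pandas; the three
'attribute_n' column names below are illustrative stand-ins for the 60 Sonar attributes.
import re
import pandas as pd

# Hypothetical frame with Sonar-style attribute names.
df = pd.DataFrame(columns=["attribute_1", "attribute_2", "attribute_3"])

# Replace 'attribute_' with 'att-' and keep the captured number via the group reference.
df.columns = [re.sub(r"attribute_(\d+)", r"att-\1", name) for name in df.columns]
print(list(df.columns))  # ['att-1', 'att-2', 'att-3']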
Create your dataset and rename the attributes of your dataset by considering different rename
operators.
Rename by Example Values
The Subprocess operator is the first step in this Example Process. The Example Set is returned by
the Subprocess operator. The names of the attributes are 'label', 'att1', and 'att2', as you can see, and
the first example contains the values 'new label', 'new name1', and 'new name2'. On this Example
Set, the Rename by Example Values operator is used to rename the attributes to the values of the
first example. The row number parameter is set to 1 in this example. You'll notice that the attributes
have been renamed appropriately after the process has been completed. Furthermore, the first
example has been removed from the Example Set.
Replace Missing Values
Missing values can be replaced with the attribute's minimum, maximum, or average value. Zero
may also be used to fill in gaps in the data. Any replacement value can be specified as a substitute
for missing values.
Output
The missing values are substituted for the required values of the selected attributes, and the
resultant Example Set is provided via this port.
Original
This port passes the Example Set that was provided as input to the output without any changes.
This is commonly used to reuse the same Example Set through several operators or to display the
Example Set in the Results Workspace.
Parameters
attribute filter type
This parameter lets you choose the attribute selection filter, which is the tool you'll use to pick the
appropriate attributes. It comes with the following features:
• All: This choice simply selects all of the Example Set's attributes. This is the standard-
setting.
• single: Selecting a single attribute is possible with this choice. When you select this choice,
a new parameter (attribute) appears in the Parameters panel.
• subset: This choice allows you to choose from a list of multiple attributes. The list contains
all of the Example Set's attributes; appropriate attributes can be easily selected. If the
metadata is unknown, this option will not work. When you select this choice, a new
parameter appears in the Parameters panel.
• numeric value filter: When the numeric value filter choice is selected, a new parameter
(numeric condition) appears in the Parameters panel. The examples of all numeric
attributes that satisfy the numeric condition are chosen. Please note that regardless of the
numerical situation, all nominal attributes are chosen.
• no_missing_values: This choice simply selects all of the Example Set's attributes that do
not have a missing value in any of the examples. All attributes with a single missing value
are deleted.
• value_type: This option allows you to pick all attributes of a given type. It's worth noting
that types are arranged in a hierarchy; the numeric type, for example, includes both real and
integer types. When selecting attributes via this option, users should have a basic
understanding of the type hierarchy. Other parameters in the Parameters panel become
visible when it is picked.
• block_type: The block type option works similarly to the value type option. This choice
allows you to choose all of the attributes for a specific block form. Other parameters (block
type, use block type exception) become available in the Parameters panel when this choice
is selected.
In the Titanic data set, this process replaces all missing values. All missing numerical values are
replaced by average values, for example the age 29.881 for missing ages. All missing nominal
values are replaced by the term MISSING. The Titanic data set has no missing dates, but if it did,
they would have been replaced by the data set's first date.
Loading Data
Open the Preprocess tab in the WEKA Explorer, click the Open file... button, and pick the
supermarket.arff database from the installation folder. After the data has been loaded, you can see
the screen below.
A total of 4627 instances and 217 attributes are stored in the database. It's easy to see how
complicated it will be to find a connection between so many variables. The Apriori algorithm,
fortunately, automates this task.
Associate
Then, on the Associate tab, select the Choose option. As shown in the screenshot, choose the
Apriori association.
On the weather info, run Apriori. What is the support for this item set based on the output?
outlook = rainy humidity = normal windy = FALSE play = yes.
Click the Start button after you've set the parameters. After a while, the results will appear as
shown in the screenshot below.
Run Apriori on this data with default settings. Comment on the rules that are generated.
Several of them are quite similar. How are their support and confidence values related?
Summary
• The Remove Duplicates operator compares all examples in an Example Set against each
other using the listed attributes to remove duplicates.
• The special attributes are attributes with special roles which identify the examples. Special
attributes are id, label, prediction, cluster, weight, and batch.
• The Rename operator is used for renaming one or more attributes of the input Example
Set.
• To change the role of an attribute, use the Set Role operator.
• Whitespaces, brackets, and other unnecessary characters are commonly removed from
attribute names using rename by replacing.
• Missing values can be replaced with the attribute's minimum, maximum, or average
value.
• Using invert selection all of the previously selected attributes have been unselected, and
previously unselected attributes have been selected.
Keywords
Output Ports: The duplicate examples are removed from the given Example Set and the resultant
Example Set is delivered through this port.
old name: This parameter is used to specify which attribute's name should be modified.
Rename by replacing: The Rename by Replacing operator substitutes the required replacement for
parts of the attribute names.
replace what: The replace what parameter specifies which part of the attribute name should be
changed.
no missing values : This choice simply selects all of the Example Set's attributes that do not have a
missing value in any of the examples.
Create view : Instead of modifying the underlying data, you can build a View. To allow this choice,
simply select this parameter.
Invert selection : When set to true, this parameter acts as a NOT gate, reversing the selection.
7. The ____________operator estimates values for the missing values by applying a model learned
for missing values.
A. Replace Missing Values
B. Impute Missing Values
C. Remove Missing values
D. Hide Missing Values
8. Which of the following attribute_filter_type allows the selection of multiple attributes through a
list
A. value_type
B. subset
C. All
D. block_type
9. Which of the following attribute_filter_type option selects all Attributes of the ExampleSet which
do not contain a missing value in any Example.
A. numeric_value_filter
10. Missing data can_________ the effectiveness of classification models in terms of accuracy and
bias.
A. Reduce
B. Increase
C. Maintain
D. Eliminate
11. _____________is a rule-based machine learning method for discovering interesting relations
between variables in large databases.
A. Association rule learning
B. Market Based Analysis
C. Rule Learning
D. None
12. Let X and Y be data items, where ________________ specifies the probability that a
transaction contains both X and Y (X ∪ Y).
A. Support
B. Confidence
C. Both
D. None
13. Let X and Y be data items, where ____________ specifies the conditional probability that a
transaction having X also contains Y.
A. Support
B. Confidence
C. Both
D. None
14. The……………algorithm is one such algorithm in ML that finds out the probable associations
and creates association rules.
A. Decision Tree
B. KNN
C. Apriori
D. All of the above
15. You can select Apriori in Weka by clicking which of the following tab:
A. Classify
B. Cluster
C. Associate
1. A 2. C 3. A 4. C 5. B
6. A 7. B 8. B 9. C 10. A
Review Questions
Q1) Why is it necessary to remove duplicate values from a dataset? Explain the step-by-step
process of removing duplicate values from a dataset.
Q2) With examples, elucidate the different renaming operators available in RapidMiner.
Q3) Explain the step-by-step process of generating association rules in Weka using the Apriori
algorithm.
Q4) "Missing values lead to incorrect analysis of the data." Justify the statement with an
appropriate example.
Q5) Elucidate the different methods of handling missing data in RapidMiner.
Further Readings
Hofmann, M., & Klinkenberg, R. (Eds.). (2016). RapidMiner: Data mining use cases and
business analytics applications. CRC Press.
Kotu, V., & Deshpande, B. (2014). Predictive analytics and data mining: concepts and
practice with rapidminer. Morgan Kaufmann.
Chisholm, A. (2013). Exploring data with Rapidminer. Packt Publishing Ltd.
Nisbet, R., Elder, J., & Miner, G. (2009). Handbook of statistical analysis and data mining
applications. Academic Press.
Smith, T. C., & Frank, E. (2016). Introducing machine learning concepts with WEKA.
In Statistical genomics (pp. 353-378). Humana Press, New York, NY.
https://rapidminer.com/wp-content/uploads/2014/10/RapidMiner-5-Operator-
Reference.pdf
https://www.myexperiment.org/workflows/1345.html
https://docs.rapidminer.com/latest/studio/operators/blending/attributes/names_and_roles
/rename_by_replacing.html
https://docs.rapidminer.com/latest/studio/operators/blending/attributes/names_and_roles
/rename_by_example_values.html
https://eeisti.fr/grug/ATrier/GSI/MachineLearningOptimisation/Algorithmes_ML/Aprior
i/weka_a_priori.pdf
Objectives
After this lecture, you will be able to
• Understand the concept of frequent patterns and association rules.
• Learn how to calculate support and confidence.
• Understand the basic concepts of the Apriori algorithm.
• Learn the working of the Apriori algorithm.
• Understand the process of finding frequent patterns using FP-tree.
• Know the applications of association rule mining.
Introduction
Imagine that you are a sales manager, and you are talking to a customer who recently bought a PC
and a digital camera from the store. What should you recommend to her next? Frequent patterns
and association rules are the knowledge that you want to mine in such a scenario. The term
"association mining" refers to the process of looking for frequently occurring items in a data set.
Typically, interesting connections and relationships between itemsets in transactional and
relational databases are discovered through frequent itemset mining. In a nutshell, frequent pattern
mining shows which items appear together in a transaction or relation.
For example, a set of items, such as milk and bread, that appear frequently together in a
transaction data set is a frequent item-set.
A subsequence, such as buying first a PC, then a digital camera, and then a memory card, is a
(frequent) sequential pattern if it occurs frequently in a shopping history database.
< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart}
{Order Confirmation} {Return to Shopping} >
Subgraphs, subtrees, and sublattices are examples of structural structures that can be combined
with itemsets or subsequences to form a substructure. A (frequent) structural pattern is a
substructure that appears regularly in a graph database. In mining associations, correlations, and
many other interesting relationships among data, finding frequent patterns is critical.
Association rule learning is based on the principle of if/then statements, for example "If A, then B".
The If part is referred to as the antecedent, and the Then part is referred to as the consequent.
Single cardinality refers to rules in which we discover a relationship between two items; as the
number of items grows, so does the cardinality of the rules that can be generated.
We used association rules to find mistakes that often occur together while solving exercises.
The purpose of looking for these associations is for the teacher to ponder and, maybe, to review the
course material or emphasize subtleties while explaining concepts to students. Thus, it makes sense
to use a support threshold that is not too low.
For example, the information that customers who purchase computers also tend to buy antivirus
software at the same time is represented in the following association rule:
Computer=>antivirus_Software [support=2%, confidence=60%].
A support of 2% for the rule means that 2% of all the transactions under analysis show that
computer and antivirus software are purchased together. A confidence of 60% means that 60% of
the customers who purchased a computer also bought the software. There are several metrics for
measuring the relationships between thousands of data objects. The two main measures are as
follows:
Support
Confidence
Support
Support is the frequency of an itemset, that is, how often it appears in the dataset. It is the fraction
of the transactions T that contain the itemset X. For an itemset X and a set of transactions T:
Support(X) = Freq(X) / |T|
Confidence
The term "confidence" refers to how much the rule has been proven correct. Or, when the frequency
of X is already known, how often the items X and Y appear together in the dataset. It's the ratio of
the number of records that contain X to the number of transactions that contain X.
Confidence=Freq(X,Y)/Freq(X)
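A minimal Python sketch of both measures; the four transactions are hypothetical and chosen only
to make the arithmetic easy to check.
# Hypothetical transaction data.
transactions = [
    {"computer", "antivirus", "printer"},
    {"computer", "antivirus"},
    {"computer", "speaker"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of the combined itemset divided by the support of the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"computer"}))                      # 3 of 4 transactions = 0.75
print(confidence({"computer"}, {"antivirus"}))    # 2 of 3 computer buyers ≈ 0.67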
Table 1: DataItems along with Transaction ids
TID Items
1 Bread, Milk
It could be useful for the OurVideoStore manager to know what movies are often
rented together or if there is a relationship between renting a certain type of movie
and buying popcorn or pop. The discovered association rules are of the form:
P→Q [s, c], where P and Q are conjunctions of attribute value-pairs, and s (for
support) is the probability that P and Q appear together in a transaction and c (for
confidence) is the conditional probability that Q appears in a transaction when P is
present. For example, the hypothetic association rule
Rent Type (X, “game”) ^ Age(X,“13-19”) → Buys(X, “pop”) [s=2% ,c=55%]
would indicate that 2% of the transactions considered are of customers aged
between 13 and 19 who are renting a game and buying pop and that there is a
certainty of 55% that teenage customers, who rent a game, also buy pop.
• Join Operation: To find Lk, a set of candidate k-item-sets is generated by joining Lk-1 with
itself.
Apriori Algorithm
The Apriori algorithm was the first algorithm proposed for frequent itemset mining. It was later
improved by R. Agrawal and R. Srikant and became known as Apriori. To reduce the search space,
this algorithm uses two steps: "join" and "prune." It is an iterative method for identifying the most
frequent itemsets.
Apriori says: if the support of an itemset I is below the minimum support threshold, then I is not
frequent; moreover, every superset of an infrequent itemset is also infrequent and can therefore be
ignored (the antimonotone property).
The Apriori Algorithm for data mining includes the following steps:
1. Join Step: By joining each item with itself, this move generates (K+1) itemset from K-
itemsets.
2. Prune Step: This step counts all of the items in the database. If a candidate item fails to
fulfill the minimum support requirements, it is classified as infrequent and hence
withdrawn. This step aims to reduce the size of the candidate itemsets.
Steps In Apriori
The Apriori algorithm is a series of steps that must be followed to find the most frequent itemsets in
a database. This data mining technique repeats the join and prune steps iteratively until the largest
frequent itemset is found. A minimum support threshold is either specified in the problem or
assumed by the user.
1) Each object is treated as a 1-itemsets candidate in the first iteration of the algorithm. Each item's
occurrences will be counted by the algorithm.
2) Set a minimum level of support, min sup. The set of 1 – itemsets whose occurrence meets the
minimum sup requirement is determined. Only those candidates with a score greater than or equal
to min sup are advanced to the next iteration, while the rest are pruned.
3) Next, min sup is used to find 2-itemset frequent itemsets. The 2-itemset is formed in the join
phase by forming a group of 2 by combining items with itself.
4) The min-sup threshold value is used to prune the 2-itemset candidates. After pruning, the table
contains only the 2-itemsets that meet min-sup.
5) Using the join and prune steps, the next iteration creates 3-itemsets. This iteration uses the
antimonotone property: all 2-itemset subsets of a candidate 3-itemset must themselves be frequent.
If all 2-itemset subsets are frequent, the superset can be frequent; otherwise, it is pruned.
6) Making 4-itemset by joining a 3-itemset with itself and pruning if its subset does not meet the
min sup criteria would be the next move. When the most frequent itemset is reached, the algorithm
is terminated.
Table 2: Itemsets
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
Table 3: Support count of each item
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Prune Step: Table 3 shows that the I5 item does not meet min_sup=3, thus it is deleted,
only I1, I2, I3, I4 meet min_sup count.
3. Step 2: Form a 2-itemset. Find the occurrences of 2-itemset in Table 2.
Table 4: Two Itemset Data
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: Table 4 reveals that the itemsets {I1, I4} and {I3, I4} do not meet min_sup, so they
are deleted.
Table 5: Frequent Two Itemset data
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: We can see that for the itemset {I1, I2, I3}, Table 5 contains all of its
2-itemset subsets {I1, I2}, {I2, I3}, and {I1, I3}, indicating that {I1, I2, I3} is frequent.
Table 6: Three Itemset Data
Item
I1,I2,I3
I1,I2,I4
I1,I3,I4
6. Generate Association Rules: From the frequent itemsets discovered above, association rules can
be generated; for example, from {I1, I2, I3} the rule {I1, I3} => {I2} has confidence 3/3 = 100% and
the rule {I1, I2} => {I3} has confidence 3/4 = 75%. Rules that meet the minimum confidence
threshold are retained.
Disadvantages of the Apriori algorithm:
• If the itemsets are large and the minimum support is kept low, it requires a lot of computation.
• The whole database must be scanned repeatedly.
A from-scratch sketch of the algorithm on the transactions of Table 2 is given below.
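The sketch below applies the join-and-prune idea to the transactions of Table 2 with a minimum
support count of 3. It uses a simplified candidate-generation step (unioning pairs of frequent
(k-1)-itemsets) rather than the full Apriori candidate pruning, so it is an illustration of the idea, not
a production implementation.
# Transactions T1-T6 from Table 2, with a minimum support count of 3.
transactions = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
                {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
min_sup = 3

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets (the prune step removes I5, whose count is only 2).
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support_count({i}) >= min_sup}]

# Repeat join (simplified: union pairs of frequent (k-1)-itemsets of size k)
# and prune (drop candidates below min_sup) until no frequent itemsets remain.
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support_count(c) >= min_sup})
    k += 1

for level, sets in enumerate(frequent[:-1], start=1):
    print(f"Frequent {level}-itemsets:", [sorted(s) for s in sets])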
1) The first step is to scan the database and count the occurrences of each item. This step is identical
to Apriori's first step. The support count or frequency of a 1-itemset is the number of transactions
in the database that contain that item.
2) The FP tree is built in the second stage. To do so, start by making the tree's root. Null is
used to represent the root.
3) The following move is to re-scan the database and look over the transactions. Examine the
first transaction to determine the itemset contained therein. The highest-counting itemset
is at the top, followed by the next-lowest-counting itemset, and so on. It means that the
tree's branch is made up of transaction itemsets arranged in descending order of count.
4) The database's next transaction is examined. Its itemsets are again sorted in descending order of
count. If any itemset of this transaction is already present in another branch (for example, in the
first transaction), this transaction's branch shares that common prefix starting from the root. This
means that the common itemset is linked to the new node of another itemset in this transaction.
5) In addition, as transactions occur, the count of the itemset is increased. As nodes are
generated and connected according to transactions, the count of both the common node
and new node increases by one.
6) The next move is to mine the FP Tree that has been developed. The lowest node, as well as
the relations between the lowest nodes, are analyzed first. The frequency pattern length 1
is represented by the lowest node. The conditional pattern base is a sub-database that
contains prefix paths in the FP tree that start at the lowest node (suffix).
7) Create a Conditional FP Tree based on the number of itemsets in the route. The
Conditional FP Tree considers the itemsets that meet the threshold support.
8) The Conditional FP Tree generates Frequent Patterns.
Table 7: Transactions
T1 I1,I2,I3
T2 I2,I3,I4
T3 I4,I5
T4 I1,I2,I4
T5 I1,I2,I3,I5
T6 I1,I2,I3,I4
1. Count the support of each item in the database.
Table 8: Support count of each item
Item Count
I1 4
I2 5
I3 4
I4 4
I5 2
2. Sort the itemset in descending order.
Table 9: Itemset in descending order of their frequency
Item Count
I2 5
I1 4
I3 4
I4 4
3. Build FP tree.
a) Considering the root node null.
b) The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1}, where I2
is linked as a child to root, I1 is linked to I2, and I3 is linked to I1.
c) T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to I2 , and I4 is
linked to I3. But this branch would share the I2 node as common as it is already used in
T1.
d) Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as a child to I3.
The count is {I2:2}, {I3:1}, {I4:1}.
e) T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
f) T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node, hence
it will be incremented by 1. Similarly I1 will be incremented by 1 as it is already linked
with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
g) T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
h) T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}. A
sketch of this tree construction is given below.
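The sketch below builds the FP tree only (passes 1 and 2 of the procedure); mining the conditional
pattern bases and conditional FP trees is not shown. The Node class and the printing helper are
assumptions made for illustration.
from collections import Counter

# Same transactions as above; minimum support count of 3.
transactions = [["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
                ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"]]
min_sup = 3

# Pass 1: count item frequencies and keep frequent items in descending-count order.
counts = Counter(item for t in transactions for item in t)
order = {item: rank for rank, (item, c) in enumerate(counts.most_common()) if c >= min_sup}

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

# Pass 2: insert each transaction (frequent items only, sorted by descending
# frequency) into the tree, sharing common prefixes and incrementing counts.
root = Node(None, None)
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in order), key=order.get):
        child = node.children.setdefault(item, Node(item, node))
        child.count += 1
        node = child

def show(node, depth=0):
    """Print the tree as an indented item:count outline."""
    if node.item is not None:
        print("  " * (depth - 1) + f"{node.item}:{node.count}")
    for child in node.children.values():
        show(child, depth + 1)

show(root)   # I2:5, I1:4, I3:3, I4:1, ... matching the counts in the walk-through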
Discuss how the discovery of different patterns through different data mining algorithms and
visualization techniques can suggest a simple pedagogical policy.
Summary
• The discovery of frequent patterns, associations, and correlation relationships among massive
amounts of data is useful in selective marketing, decision analysis, and business management.
• A popular area of application is market basket analysis, which studies customers’ buying
habits by searching for itemsets that are frequently purchased together.
• The process of "association rule mining" entails first identifying frequent itemsets (groups
of items, such as A and B, that satisfy a minimum support threshold, or percentage of task-
related tuples), and then generating strong association rules in the form of A=>B.
• For frequent itemset mining, several powerful and scalable algorithms have been created, from
which association and correlation rules can be extracted. These algorithms can be classified into
three categories: (1) Apriori-like algorithms, (2) frequent pattern growth–based algorithms such as
FP-growth, and (3) algorithms that use the vertical data format, such as Eclat.
Keywords
Antecedent: An antecedent is something that's found in the data, and a consequent is an item that is
found in combination with the antecedent.
Support: Support indicates how frequently the if/then relationship appears in the database.
Confidence: Confidence tells about the number of times these relationships are true.
Frequent Itemsets: The sets of items that have at least minimum support.
Apriori Property: Any subset of frequent item-set must be frequent.
Join Operation: To find Lk, a set of candidate k-item-sets is generated by joining Lk-1 with itself.
Sequence analysis algorithms: This type summarizes frequent sequences or episodes in data.
Association algorithms: This type of algorithm finds correlations between different attributes in a
dataset. The most common application of this kind of algorithm is for creating association rules,
which can be used in a market analysis.
A. Itemset
B. Support
C. Confidence
D. Support Count
A. Support
B. Confidence
C. Support Count
D. Rules
3. An itemset whose support is greater than or equal to a minimum support threshold is ______
A. (a)Itemset
B. (b)Frequent Itemset
C. (c)Infrequent items
D. (d)Threshold values
A. It mines all frequent patterns through pruning rules with lesser support
B. It mines all frequent patterns through pruning rules with higher support
C. It mines all frequent patterns by constructing a FP tree
D. It mines all frequent patterns by constructing an itemsets
A. Apriori
B. FP Growth
C. Naive Bayes
D. Decision Trees
11. For the question given below consider the data Transactions :
A. <I1>, <I2>, <I4>, <I5>, <I6>, <I1, I4>, <I2, I4>, <I2, I5>, <I4, I5>, <I4, I6>, <I2, I4, I5>
B. <I2>, <I4>, <I5>, <I2, I4>, <I2, I5>, <I4, I5>, <I2, I4, I5>
C. <I11>, <I4>, <I5>, <I6>, <I1, I4>, <I5, I4>, <I11, I5>, <I4, I6>, <I2, I4, I5>
D. <I1>, <I4>, <I5>, <I6>
A. Primary.
B. Candidate.
C. Secondary.
D. Superkey.
14. This approach is best when we are interested in finding all possible interactions among a
set of attributes.
A. Decision tree
B. Association rules
C. K-Means algorithm
D. Genetic learning
A. Apriori
B. FP growth
C. Decision trees
D. Eclat
1. A 2. C 3. B 4. C 5. C
6. B 7. D 8. C 9. B 10. B
Review Questions
Q1) The 'database' below has four transactions. What association rules can be found in this set if
the minimum support (i.e. coverage) is 60% and the minimum confidence (i.e. accuracy) is 80%?
Trans_id Itemlist
T1 {K, A, D, B}
T2 {D, A, C, E, B}
T3 {C, A, B, E}
T4 {B, A, D}
Q2) With appropriate example explain the concept of support and confidence in Association rule
mining.
Q3) Discuss the various applications of Association rule learning.
Q4) Explain, step by step, the working of the Apriori algorithm.
Q5) Differentiate between the Apriori and FP-tree algorithms.
Q6) Elucidate the process of mining frequent itemsets using an FP-tree.
Further Readings
Gkoulalas-Divanis, A., & Verykios, V. S. (2010). Association rule hiding for data mining (Vol.
41). Springer Science & Business Media.
Ventura, S., & Luna, J. M. (2016). Pattern mining with evolutionary algorithms (pp. 1-190).
Berlin: Springer.
Fournier-Viger, P., Lin, J. C. W., Nkambou, R., Vo, B., & Tseng, V. S. (2019). High-utility
pattern mining. Springer.
Elizabeth Vitt, Michael Luckevich, Stacia Misner (2010). “Business Intelligence”. O’Reilly
Media, Inc.
Rajiv Sabhrwal, Irma Becerra-Fernandez (2010). “Business Intelligence”. John Wiley & Sons
https://www.javatpoint.com/apriori-algorithm-in-machine-learning
https://www.softwaretestinghelp.com/apriori-algorithm/
https://towardsdatascience.com/fp-growth-frequent-pattern-generation-in-data-mining-
with-python-implementation-244e561ab1c3
https://www.mygreatlearning.com/blog/understanding-fp-growth-algorithm/
https://www.kdnuggets.com/2016/04/association-rules-apriori-algorithm-tutorial.html
Objectives
After this lecture, you will be able to
• Learn the various measures of similarity.
• Know the concept of unsupervised learning.
• Understand the working of the K-Means Algorithm.
• Understand the working of the K-medoids and CLARANS clustering algorithms.
• Understand the working of various hierarchical clustering algorithms.
• Learn the concept of cluster analysis.
• Understand the concept of outliers and the different types of outliers.
Introduction
Clustering is the process of grouping objects that are identical to one another. It can be used to
determine if two things have similar or dissimilar properties. Clustering aids in the division of data
into subsets. The data in each of these subsets is comparable, and these subsets are referred to as
clusters. We can make an informed conclusion on who we believe is most suited for this product
now that the data from our consumer base has been separated into clusters.
Outliers are cases that are out of the ordinary because they fall outside the data's normal
distribution. The distance from the center of a normal distribution indicates how typical a given
point is.
The taste, size, color, and other characteristics of vegetables, for example, can be used to
determine their similarity.
To evaluate the similarities or differences between two objects, most clustering methods use
distance measures. The most commonly used distance measures are:
Euclidean Distance
Euclidean distance is the standard metric for geometry problems. It is, simply put, the ordinary
straight-line distance between two points, and it is one of the most widely used measures in cluster
analysis; K-means is one of the algorithms that use it. Mathematically, it is computed as the square
root of the sum of squared differences between the coordinates of two objects.
Manhattan Distance
This defines the sum of absolute differences between two coordinate pairs. To evaluate the distance
between two points P and Q, measure the distances between the points along the X-axis and the
Y-axis separately and add them. In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
Minkowski Distance
It is a generalization of the Euclidean and Manhattan distance measures. A point in an
N-dimensional space is defined as (x1, x2, ..., xN).
Consider the following two points, P1 and P2:
P1: (X1, X2, ..., XN)
P2: (Y1, Y2, ..., YN)
The Minkowski distance between P1 and P2 is then calculated as
D(P1, P2) = (|X1 – Y1|^p + |X2 – Y2|^p + ... + |XN – YN|^p)^(1/p),
where p is the order of the distance; p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
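These three measures can be illustrated with a short Python sketch (using NumPy and SciPy); the point coordinates below are arbitrary values chosen only for illustration.

import numpy as np
from scipy.spatial.distance import minkowski

p1 = np.array([1.0, 2.0, 3.0])   # arbitrary illustration points
p2 = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((p1 - p2) ** 2))   # root of squared coordinate differences
manhattan = np.sum(np.abs(p1 - p2))           # sum of absolute coordinate differences
minkowski_3 = minkowski(p1, p2, p=3)          # order-3 Minkowski distance

print(euclidean)     # 5.0
print(manhattan)     # 7.0
print(minkowski_3)   # (3**3 + 4**3) ** (1/3), roughly 4.5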
Assume a machine is shown pictures of both dogs and cats that it has never seen before. The
machine has no prior understanding of the characteristics of dogs and cats, and we do not classify
the images for it. However, it can group them based on their similarities, patterns, and differences,
allowing it to divide the images into two parts. The first part may contain all photos of
dogs, while the second part may contain all photos of cats. This allows the model to work on its own
to discover patterns and information that were previously undetected. It mainly deals with
unlabelled data.
Recommender systems, which involve grouping together users with similar viewing patterns
to recommend similar content.
Step-2: Select K random points or centroids. These points can be either points from the input
dataset or any other points.
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters. To
do this for two centroids, we draw a median line between them.
Step-4: Calculate the variance and place a new centroid for each cluster. Points on the left side of
the median line are nearer to the blue centroid, and points to the right of the line are closer to the
yellow centroid; we color them blue and yellow for clear visualization.
Step-5: Repeat the third step, that is, reassign each data point to the new closest centroid of each
cluster. To choose the new centroids, we compute the center of gravity of the points in each cluster
and take these as the new centroids.
Step-6: If any reassignment occurs, go to Step-4; else go to FINISH. In the illustration, one yellow
point lies on the left side of the line and two blue points lie to the right of it, so these three points
are reassigned to the new centroids.
Step-7: The model is ready. With the new centroids we again draw the median line and reassign the
data points. As the model is ready, we can remove the assumed centroids, leaving the two final
clusters.
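The same procedure can be sketched with scikit-learn's KMeans; the two-dimensional points and the choice of K = 2 below are assumptions made only for illustration.

import numpy as np
from sklearn.cluster import KMeans

# A small, made-up 2-D dataset (each row is one data point)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Choose K and let the algorithm pick initial centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# Assignment and centroid recomputation are repeated until convergence
kmeans.fit(X)

print(kmeans.labels_)            # cluster index assigned to each point
print(kmeans.cluster_centers_)   # final centroids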
Algorithm
1. Initialize: select k random points out of the n data points as the medoids.
2. Associate each data point with the closest medoid using any common distance metric.
3. While the cost decreases:
For each medoid m and for each data point o which is not a medoid:
1. Swap m and o, associate each data point with the closest medoid, and recompute the
cost.
2. If the total cost is more than that in the previous step, undo the swap.
Let’s consider the following example:
Step 1: Let the randomly chosen two medoids be C1 -(4, 5) and C2 -(8, 5) respectively.
Step 2: Calculating cost. Each non-medoid point's dissimilarity to the medoids is measured and
tabulated:
Figure 13: Similarity and dissimilarity index based upon selected centroids
Each point is assigned to the medoid cluster with less dissimilarity. Cluster C1 is represented by
points 1, 2, 5; cluster C2 is represented by the points 0, 3, 6, 7, 8.
The cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
Each point is assigned to that cluster whose dissimilarity is less. So, points 1, 2, 5 go to
cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
The New cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22
Swap cost = New cost – Previous cost = 22 – 20 = 2, which is greater than 0.
As the swap cost is not less than zero, we undo the swap. Hence C1 and C2 remain the final
medoids, and the clustering would be as follows.
Advantages
1. The K-Medoid algorithm is fast and converges in a fixed number of steps.
2. It is simple to understand and easy to implement.
3. PAM is less sensitive to outliers than other partitioning algorithms.
Disadvantages
1. The key drawback of K-Medoid algorithms is that they cannot cluster non-spherical
(arbitrary shaped) groups of points. This is because it relies on minimizing the distances
between non-medoid objects and the medoid (cluster center) – in other words, it clusters
based on compactness rather than connectivity.
2. Since the first k medoids are chosen at random, it can produce different results for
different runs on the same dataset.
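The swap-based procedure described above can be sketched directly in NumPy. This is a simplified, unoptimized PAM-style implementation for illustration only; the sample coordinates loosely mirror the worked example above and are not meant to reproduce its exact figures.

import numpy as np

def pam(X, k, max_iter=100):
    n = len(X)
    # Manhattan distance matrix between every pair of points
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    medoids = list(np.random.choice(n, k, replace=False))   # step 1: random medoids

    def total_cost(meds):
        # assign every point to its closest medoid and sum the distances
        return dist[:, meds].min(axis=1).sum()

    cost = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for mi in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = o                 # tentatively swap the medoid with point o
                new_cost = total_cost(candidate)
                if new_cost < cost:               # keep the swap only if the cost decreases
                    medoids, cost = candidate, new_cost
                    improved = True
        if not improved:                          # no beneficial swap left: stop
            break
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, cost

X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]], dtype=float)
meds, labels, cost = pam(X, k=2)
print(meds, labels, cost)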
CLARA draws a sample of nodes at the beginning of the search; neighbors are taken from the
chosen sample, so the search is restricted to a specific area of the original data. CLARANS, by
contrast, does not confine the search to a localized area: it stops the search when a local minimum is
found, finds several local optima, and outputs the clustering with the best local optimum.
Advantages
• Experiments show that CLARANS is more effective than both PAM and CLARA.
• Handles outliers
Disadvantages
• The computational complexity of CLARANS is O(n²), where n is the number of objects.
• The clustering quality depends on the sampling method.
1. Calculate the degree of similarity between one cluster and all others (calculate proximity
matrix)
2. Consider each data point as an individual cluster.
3. Combine clusters that are very similar or identical to one another.
4. For each cluster, recalculate the proximity matrix.
5. Steps 3 and 4 should be repeated until only one cluster remains.
Step 1: Treat each alphabet as a separate cluster and measure the distance between each cluster and
the others.
Step 2: Comparable clusters are combined to create a single cluster in the second step. Let's assume
clusters (B) and (C) are very close, so we merge them in the second stage, just as we did with
clusters (D) and (E), and we end up with clusters [(A), (BC), (DE), (F)].
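A short SciPy sketch of agglomerative clustering on six points labelled A–F; the coordinates are invented purely to mirror the A–F example above, and single linkage is just one possible choice of merge criterion.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Invented 2-D coordinates for the six points A-F used in the example
points = np.array([[1.0, 1.0],   # A
                   [1.5, 1.2],   # B
                   [1.6, 1.1],   # C
                   [5.0, 5.0],   # D
                   [5.1, 5.2],   # E
                   [9.0, 2.0]])  # F
labels = ["A", "B", "C", "D", "E", "F"]

Z = linkage(points, method="single")            # merge the closest clusters first
print(fcluster(Z, t=3, criterion="maxclust"))   # cut the tree into three clusters

dendrogram(Z, labels=labels)                    # the dendrogram shows the merge hierarchy
plt.show()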
10.7 BIRCH
BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) is a clustering algorithm
that clusters large datasets by first producing a small and compact summary of the large dataset
that preserves as much information as possible. Instead of clustering the larger dataset, this smaller
overview is clustered.
Why BIRCH?
Clustering algorithms like K-means clustering do not perform clustering very efficiently and it is
difficult to process large datasets with a limited amount of resources (like memory or a slower
CPU). Regular clustering algorithms do not scale well in terms of running time and quality as the
size of the dataset increases. This is where BIRCH clustering comes in.
Before we implement BIRCH, we must understand two important terms: the Clustering Feature (CF), a compact summary of a group of points, and the CF Tree, a height-balanced tree of such summaries.
If the branching factor of a leaf node can not exceed 3 then the following tree results.
If the branching factor of a non-leaf node can not exceed 3, then the root is split, and the height of
the CF Tree increases by one.
Advantages
• BIRCH performs faster than existing algorithms (CLARANS and K-means) on large datasets.
• Scans whole data only once.
• Handles outliers better.
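scikit-learn provides a BIRCH implementation; the sketch below shows its usual parameters (threshold, branching_factor, n_clusters) on synthetic data, with all values chosen only for illustration.

from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data standing in for a "large" dataset
X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

birch = Birch(threshold=0.5,        # max radius of a sub-cluster in a leaf node
              branching_factor=50,  # max number of CF sub-clusters per node
              n_clusters=3)         # final number of clusters after the CF tree is built
labels = birch.fit_predict(X)

print(labels[:10])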
Clustering Approaches
The following are the two clustering approaches that we are going to discuss:
1) DBSCAN
2) Graph-Based Clustering
DBSCAN
The density-based clustering algorithm has been extremely useful in identifying non-linear shape
structures. The most commonly used density-based algorithm is DBSCAN (Density-Based Spatial
Clustering of Applications with Noise). It makes use of the density reachability and density
connectivity concepts.
Density Reachability- A point "p" is said to be density reachable from a point "q" if point "p" is
within ε distance from point "q" and "q" has a sufficient number of points in its neighbors which are
within distance ε.
Density Connectivity - Points "p" and "q" are said to be density connected if there exists a point
"r" that has a sufficient number of points in its neighborhood and both "p" and "q" are within
ε distance of it. This is a chaining process: if "q" is a neighbor of "r", "r" is a neighbor of "s", and "s" is a
neighbor of "t", which in turn is a neighbor of "p", then "q" is connected to "p".
Algorithm
The set of data points is X = {x1, x2, x3, ..., xn}. Two parameters are needed by DBSCAN: the
neighborhood radius ε (eps) and the minimum number of points required to form a cluster (minPts).
1) Begin at a random starting point that has not been visited before.
2) Extract the ε-neighborhood of this point (all points within ε distance of it).
3) If there are enough points in the neighborhood, the clustering process begins and the point is
marked as visited; otherwise, it is labeled as noise (later this point may still become part of a cluster).
4) If a point is found to be part of a cluster, its neighbors are also part of the cluster, and the process
from step 2 is repeated for all neighborhood points. This process is repeated until all of the cluster's
points have been calculated.
5) A new, previously unexplored point is retrieved and processed, resulting in the detection of a
new cluster or noise.
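A minimal scikit-learn sketch of this procedure; the eps and minPts (min_samples) values are illustrative and would normally be tuned to the dataset.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a non-linear shape that K-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5)   # eps = neighborhood radius, min_samples = minPts
labels = db.fit_predict(X)

print(set(labels))   # cluster ids; the label -1 marks points treated as noise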
Advantages
1) No need to specify the number of clusters ahead of time.
2) Capable of detecting noise data during clustering.
3) Clusters of any size and form can be found using the DBSCAN algorithm.
Disadvantages
1) In the case of varying density clusters, the DBSCAN algorithm fails.
2) Fails in the case of datasets with a neck.
3) Doesn't work well for data with a lot of dimensions.
Between-Graph: Between-graph clustering methods divide a set of graphs into different clusters.
A set of graphs representing chemical compounds can be grouped into clusters based on
their structural similarity
Within Graph: Within-graph clustering methods divide the nodes of a graph into clusters.
In a social networking graph, these clusters could represent people with the same/similar
hobbies
K-Spanning Tree
Following are the steps to obtain a K-Spanning Tree:
• Obtain the Minimum Spanning Tree (MST) of the input graph G (Figure 27: MST). If the edge
weights represent similarity rather than distance, the spanning tree with the maximum possible
sum of edge weights is used instead.
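Under the usual k-spanning-tree formulation, the k − 1 heaviest edges of the MST are then removed so that k connected components (the clusters) remain. A NetworkX sketch under that assumption, with an invented toy graph, is shown below.

import networkx as nx

# A small weighted graph; lower weight = more similar (weights act as distances)
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "b", 1), ("b", "c", 2), ("a", "c", 2),
    ("c", "d", 9),                      # weak link between the two groups
    ("d", "e", 1), ("e", "f", 2), ("d", "f", 2),
])

k = 2
mst = nx.minimum_spanning_tree(G, weight="weight")

# Remove the k-1 heaviest MST edges; the remaining components are the clusters
heaviest = sorted(mst.edges(data=True), key=lambda e: e[2]["weight"], reverse=True)[:k - 1]
mst.remove_edges_from([(u, v) for u, v, _ in heaviest])

clusters = list(nx.connected_components(mst))
print(clusters)   # e.g. [{'a', 'b', 'c'}, {'d', 'e', 'f'}]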
This approach of segmenting the database via clustering analysis is often used as an exploratory
technique because the end-user/analyst does not need to specify ahead of time how records should
be grouped together.
Cluster completeness: A clustering result satisfies completeness if all the data points that are
members of a given class are elements of the same cluster.
Rag-Bag: Elements with low relevance to the categories (e.g., noise) should preferably be assigned
to the less homogeneous clusters (macro-scale, low-resolution, coarse-grained, or top-level clusters
in a hierarchy).
How can I evaluate whether the clustering results are good or not when I try out a clustering
method on the data set?
Figure 31: Outlier
• Choice of distance measure among objects and the model of the relationship among
objects are often application-dependent
E.g., in clinical data a small deviation could be an outlier, while in marketing analysis much larger
fluctuations may still be considered normal.
3. Handling noise in outlier detection
• Noise may distort the normal objects and blur the distinction between normal objects and
outliers. It may help hide outliers and reduce the effectiveness of outlier detection
4. Understandability
• Specify the degree of an outlier: the unlikelihood of the object being generated by a normal
mechanism
Types of Outliers
The following are the different types of outliers:
• Global Outliers
• Contextual Outliers
• Collective Outliers.
Global Outliers
A data point is considered a global outlier if its value is far outside the entirety of the data set in
which it is found.
A global outlier is a measured sample point that has a very high or a very low value relative to all
the values in a dataset. For example, if 9 out of 10 points have values between 20 and 30, but the
10th point has a value of 85, the 10th point may be a global outlier.
Contextual Outliers
If an individual data point is different in a specific context or condition (but not otherwise), then it
is termed as a contextual outlier.
Attributes of data objects should be divided into two groups:
• Contextual attributes: define the context, e.g., time and location
• Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature
Contextual outliers are hard to spot without background information. If you had no idea that the
values were summer temperatures, a reading might be considered a valid data point.
Collective Outliers
A subset of data objects collectively deviates significantly from the whole data set, even if the
individual data objects are not outliers themselves. For example, when several computers keep
sending denial-of-service packets to one another, each packet may look normal on its own, but
together they form a collective outlier.
Applications include, for example, intrusion detection.
Give an example of a situation in which global outliers, contextual outliers, and collective
outliers are all relevant. What are the characteristics, as well as the environmental and behavioral
characteristics? In collective outlier detection, how is the link between items modeled?
Summary
• Types of outliers include global outliers, contextual outliers, and collective outliers.
• An object may be more than one type of outlier.
• A density-based method clusters objects based on the notion of density. It grows clusters
either according to the density of neighborhood objects (e.g., in DBSCAN) or according to a
density function (e.g., in DENCLUE). OPTICS is a density-based method that generates an
augmented ordering of the data's clustering structure.
• Clustering evaluation assesses the feasibility of clustering analysis on a data set and the
quality of the results generated by a clustering method.
• A cluster of data objects can be treated as one group.
• Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
• Outliers are the data points that cannot be fitted in any type of cluster.
• The analysis of outlier data is referred to as outlier analysis or outlier mining.
Keywords
Clustering: Clustering is a method of data analysis which groups data points so as to maximize the
intraclass similarity and minimize the interclass similarity.
Outliers: outliers are data items that did not (or are thought not to have) come from the assumed
population of data.
Unsupervised learning: This term refers to the collection of techniques where groupings of the data
are defined without the use of a dependent variable. Cluster analysis is an example.
Cluster Homogeneity: A clustering result satisfies homogeneity if all of its clusters contain only
data points that are members of a single class.
Cluster Evaluation: Cluster evaluation assesses the feasibility of clustering analysis on a data set
and the quality of the results generated by a clustering method.
Hopkins statistic: The Hopkins statistic is used to assess the clustering tendency of a data set by
measuring the probability that the data set was generated by a uniform data distribution.
Contextual attributes: Attributes that define the context, e.g., time and location.
6. Which of the following clustering type has characteristics shown in the below figure?
A. Partitional
B. Hierarchical
C. Naive Bayes
D. None of the mentioned
9. ______________specifies the maximum number of data points a sub-cluster in the leaf node of the
CF tree can hold.
A. n_clusters
B. Branching_factor
C. Threshold
D. All Of the above
1. D 2. B 3. A 4. C 5. A
6. B 7. B 8. B 9. B 10. A
Review Questions
Q1) Give an application example where global outliers, contextual outliers, and collectiveoutliers
are all interesting. What are the attributes, and what are the contextual andbehavioral attributes?
Q2) Briefly describe and give examples of each of the following approaches to clustering:
partitioning methods, hierarchical methods, and density-based methods.
Q3) For example explain the different types of outliers.
Q4) Define hierarchal clustering along with its various types.
Q5) Explain the K-mean algorithm.
Q6) Differentiate between K-means and K-Mediod.
Q7) Elucidate the step-by-step working of K-Mediod with example.
Further Reading
Gan, G., Ma, C., & Wu, J. (2020). Data clustering: theory, algorithms, and applications. Society
for Industrial and Applied Mathematics.
Long, B., Zhang, Z., & Philip, S. Y. (2010). Relational data clustering: models, algorithms, and
applications. CRC Press.
Celebi, M. E. (Ed.). (2014). Partitional clustering algorithms. Springer.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering data
mining: from concept to implementation. Prentice-Hall, Inc.
Cios, K. J., Pedrycz, W., Swiniarski, R. W., & Kurgan, L. A. (2007). Data mining: a knowledge
discovery approach. Springer Science & Business Media.
Funatsu, K. (Ed.). (2011). New fundamental technologies in data mining. BoD–Books on
Demand.
https://www.tutorialspoint.com/data_mining/dm_cluster_analysis.htm
https://towardsdatascience.com/17-clustering-algorithms-used-in-data-science-mining-
49dbfa5bf69a
https://www.analytixlabs.co.in/blog/types-of-clustering-algorithms/
https://www.anblicks.com/blog/an-introduction-to-outliers/
Objectives
After this Unit, you will be able to
• Understand the concept of supervised learning.
• Learn the concept of classification and various methods used for classification.
• Know the basic concept of binary classification.
• Understand the working of naïve Bayes classifier.
• Analyze the use and working of Association based and Rule-based classification.
• Know the working of the KNN algorithm.
• Understand the working of the Decision tree and Random forest algorithm.
• Learn the concept of Cross-Validation.
Introduction
In the process of data mining, large data sets are first sorted, then patterns are identified and
relationships are established to perform data analysis and solve problems. Attributes represent
different features of an object. Different types of attributes are:
• Binary: Possesses only two values i.e. True or False
11.2 Classification
Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the data.
You may wish to use classification to predict whether the weather on a particular day will be
“sunny”, “rainy” or “cloudy”. Popular classification techniques include decision trees and neural
networks.
Classification is a two-step process:
(a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute
is a credit rating, and the learned model or classifier is represented in the form of a classification
rule. In the learning step (or training phase), a classification algorithm builds the classifier by
analyzing or "learning from" a training set. A tuple, X, is represented by an n-dimensional attribute
vector, X = (x1, x2, ..., xn).
(b) Classification: Test data are used to estimate the accuracy of the classification rules. If the
accuracy is considered acceptable, the rules can be applied to the classification of new data
tuples. The model is used to predict class labels; testing the constructed model on test data gives an
estimate of the accuracy of the classification rules.
2. Relevance analysis: Many of the attributes in the data may be irrelevant to the
classification or prediction task. For example, data recording the day of the week on which
a bank loan application was filled is unlikely to be relevant to the success of the
application. Furthermore, other attributes may be redundant. Hence, relevance analysis
may be performed on the data to remove any irrelevant or redundant attributes from the
learning process. In machine learning, this step is known as feature selection. Including
such attributes may otherwise slow down, and possibly mislead, the learning step. Ideally, the time
spent on relevance analysis, when added to the time spent on learning from the resulting "reduced"
attribute subset, should be less than the time that would have been spent on learning from the
original set of attributes.
• Discriminative
• Generative
Discriminative
It is a very basic classifier that determines just one class for each row of data. It tries to model the
class boundary directly from the observed data and therefore depends heavily on the quality of the
data rather than on class distributions.
Logistic Regression
Suppose there are a few students and their results are as follows:
Student 1 : Test Score: 9/10, Grades: 8/10 Result: Accepted
Student 2 : Test Score: 3/10, Grades: 4/10, Result: Rejected
Student 3: Test Score: 7/10, Grades: 6/10, Result: to be tested
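A hedged sketch of how such a model could be fitted with scikit-learn: the two students with known results form the training set and Student 3 is the case to be predicted. A real model would, of course, need far more training data.

from sklearn.linear_model import LogisticRegression

# Training data: [test score, grades] out of 10
X_train = [[9, 8],   # Student 1 - Accepted
           [3, 4]]   # Student 2 - Rejected
y_train = [1, 0]     # 1 = Accepted, 0 = Rejected

model = LogisticRegression().fit(X_train, y_train)

# Student 3: test score 7, grades 6
print(model.predict([[7, 6]]))         # predicted class label
print(model.predict_proba([[7, 6]]))   # probability of each class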
“cancer not detected” is the normal state of a task that involves a medical test and “cancer
detected” is the abnormal state.
The class for the normal state is assigned the class label 0 and the class with the abnormal state is
assigned the class label 1.
Bayesian Theorem
Given training data X, posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized. A simplified
assumption is that attributes are conditionally independent (i.e., there is no dependence relation
between attributes):
P(X | Ci) = Π (k = 1 to n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
Example
For instance, to compute P(X/Ci), we consider the following:
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Data sample
X = (age <= 30, income = medium, student = yes, credit_rating = fair).
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 =0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) :P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
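The arithmetic of this example can be reproduced in a few lines of Python; all numbers are taken directly from the probabilities computed above.

# Prior probabilities estimated from the training data
p_yes = 9 / 14
p_no = 5 / 14

# Conditional probabilities of the attribute values in X, given each class
likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)   # age<=30, income=medium, student=yes, credit=fair | yes
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)   # the same attribute values | no

score_yes = likelihood_yes * p_yes   # about 0.028
score_no  = likelihood_no * p_no     # about 0.007

print("buys_computer = yes" if score_yes > score_no else "buys_computer = no")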
Advantages
“Classification is a data mining technique used to predict group membership for data
instances”. Discuss.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on
the occurrences of other items in the transaction.
Table 1:Set of transactions
Before we start defining the rule, let us first see the basic definitions.
• Support Count (σ) – Frequency of occurrence of an itemset. Here σ({Milk, Bread, Diaper}) = 2.
• Frequent Itemset – An itemset whose support is greater than or equal to the minsup
threshold.
• Association Rule– An implication expression of form X -> Y, where X and Y are any 2
itemsets.
{Milk, Diaper}->{Beer}
Support (s) –
The number of transactions that include the items in both the {X} and {Y} parts of the rule, as a
fraction of the total number of transactions. It is a measure of how frequently the collection of items
occurs together across all transactions.
Support, s = σ(X ∪ Y) / |T|, where |T| is the total number of transactions.
It is interpreted as the fraction of transactions that contain both X and Y.
Confidence (c) –
It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number
of transactions that include all items in {X}:
Confidence, c = σ(X ∪ Y) / σ(X)
It measures how often the items in Y appear in transactions that also contain the items in X.
{Milk, Diaper}=>{Beer}
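A small Python sketch that computes support and confidence for the rule {Milk, Diaper} => {Beer}; the five transactions below are an assumed market-basket set of the kind shown in Table 1.

# Assumed transaction list of the kind shown in Table 1
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

X = {"Milk", "Diaper"}   # antecedent
Y = {"Beer"}             # consequent

count_xy = sum(1 for t in transactions if X | Y <= t)   # transactions containing X and Y
count_x  = sum(1 for t in transactions if X <= t)       # transactions containing X

support = count_xy / len(transactions)   # 2/5 = 0.4
confidence = count_xy / count_x          # 2/3, roughly 0.67

print(support, confidence)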
Points to remember:
Name            Blood Type   Give Birth   Can Fly   Live in Water   Class
dogfish shark   cold         yes          no        yes             ?
A lemur triggers rule R3, so it is classified as a mammal. A turtle triggers both R4 and R5.
A dogfish shark triggers none of the rules.
The Following are the classification rules generated from the given decision tree.
Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No.
Explain how you will remove the training data covered by rule R.
1. Lazy learning algorithm: KNN is a lazy learning algorithm since it doesn't have a
dedicated training process and instead uses all of the data for training and classification.
2. Non-parametric learning algorithm: KNN is also a non-parametric learning algorithm
since it makes no assumptions about the underlying data.
John 35 35K 3 No
Hannah 63 200K 1 No
Tom 59 170K 1 No
David 37 50K 2 ?
As per the distance calculation, nearest to David is Rachel so the predicted class for David is the
same as Rachel.
Table 4: Predicted class for David
John 35 35 3 No
Rachel 22 50 2 Yes
Hannah 63 200 1 No
Tom 59 170 1 No
Nellie 25 40 4 Yes
David 37 50 2 Yes
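A sketch of the same prediction with scikit-learn's KNeighborsClassifier using k = 1; the feature columns are assumed to be age, income (in thousands), and number of credit cards, as in the table above.

from sklearn.neighbors import KNeighborsClassifier

# Assumed columns: [age, income in K, number of credit cards]
X_train = [[35, 35, 3],    # John   - No
           [22, 50, 2],    # Rachel - Yes
           [63, 200, 1],   # Hannah - No
           [59, 170, 1],   # Tom    - No
           [25, 40, 4]]    # Nellie - Yes
y_train = ["No", "Yes", "No", "No", "Yes"]

knn = KNeighborsClassifier(n_neighbors=1)   # k = 1: use the single nearest neighbour
knn.fit(X_train, y_train)

print(knn.predict([[37, 50, 2]]))   # David -> ['Yes'] (his nearest neighbour is Rachel)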
• Information Gain
Information Gain
Information gain is the change in entropy measured after segmenting a dataset on an attribute. It
determines how much information a feature provides about a class. We split a node and build the
decision tree based on information gain: the node/attribute with the highest information gain is
split first, as the algorithm always seeks to maximize the information gain. It can be measured using
the formulas below.
Info_A(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)
where
Info(D) = − Σ (i = 1 to m) pi log2(pi)
Example
Table 5: Data for calculating attribute selection measures
RID   age           income   student   credit_rating   buys_computer
1     youth         high     no        fair            no
2     youth         high     no        excellent       no
3     middle-aged   high     no        fair            yes
4     senior        medium   no        fair            yes
5     senior        low      yes       fair            yes
6     senior        low      yes       excellent       no
7     middle-aged   low      yes       excellent       yes
8     youth         medium   no        fair            no
9     youth         low      yes       fair            yes
10    senior        medium   yes       fair            yes
11    youth         medium   yes       excellent       yes
12    middle-aged   medium   no        excellent       yes
13    middle-aged   high     yes       fair            yes
14    senior        medium   no        excellent       no
The class label attribute, buys_computer, has two distinct values (namely, {yes, no}); therefore,
there are two distinct classes (i.e., m = 2).There are nine tuples of class, yes, and five tuples of class
no.
Info(D) = − (9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits.
Next, we look at the distribution of yes and no tuples for each category of age. For the age
category "youth," there are two yes tuples and three no tuples. For the category "middle-aged,"
there are four yes tuples and zero no tuples. For the category "senior," there are three yes tuples
and two no tuples. Therefore
Info_age(D) = (5/14) × Info(2, 3) + (4/14) × Info(4, 0) + (5/14) × Info(3, 2) = 0.694 bits,
and Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits.
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and
Gain(credit_rating) = 0.048 bits. Because age has the highest information gain among the attributes,
it is selected as the splitting attribute. Node N is labeled with age.
A test on income splits the data into three partitions, namely low, medium, and high, containing
four, six, and four tuples, respectively. Therefore
SplitInfo_income(D) = − (4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557,
Gain(income) = 0.029, and
GainRatio(income) = Gain(income) / SplitInfo_income(D) = 0.029 / 1.557 = 0.019.
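These quantities can be checked with a short Python function; the class counts per partition are taken from the example above.

from math import log2

def info(counts):
    """Entropy Info(D) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class distribution of buys_computer: 9 yes, 5 no
info_D = info([9, 5])                                        # about 0.940 bits

# yes/no counts within each age category: youth, middle-aged, senior
partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in partitions)    # about 0.694 bits
gain_age = info_D - info_age                                 # about 0.246 bits

# Gain ratio for income: three partitions of sizes 4, 6 and 4
split_info_income = info([4, 6, 4])                          # about 1.557
gain_ratio_income = 0.029 / split_info_income                # about 0.019

print(round(info_D, 3), round(gain_age, 3), round(gain_ratio_income, 3))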
The Gini index is used in CART. Using the notation previously described, the Gini index measures
the impurity of D, a data partition or set of training tuples, as:
Gini(D) = 1 − Σ (i = 1 to m) pi²
where pi is the probability that a tuple in D belongs to class Ci.
Advantages of the Decision Tree
• It is easy to comprehend because it follows the same steps that a person would take when
making a decision in the real world.
• It can be extremely helpful in resolving decision-making issues.
• It is beneficial to consider all of the potential solutions to a problem.
• In comparison to other algorithms, data cleaning is not required as much.
Disadvantages of the Decision Tree
A company is trying to decide whether to bid for a certain contract or not. They estimate that
merely preparing the bid will cost £10,000. If the company bids, they estimate that there is a
50% chance that their bid will be put on the "short-list"; otherwise their bid will be rejected. Once
"short-listed", the company will have to supply further detailed information (entailing costs
estimated at £5,000). After this stage, their bid will either be accepted or rejected. The company
estimates that the labor and material costs associated with the contract are £127,000. They are
considering three possible bid prices, namely £155,000, £170,000, and £190,000. They estimate that
the probability of these bids being accepted (once they have been short-listed) is 0.90, 0.75, and 0.35
respectively. What should the company do and what is the expected monetary value of your
suggested course of action?
Random Forest is a classifier that builds several decision trees on various subsets of the given
dataset and combines them to improve the predictive accuracy on that dataset. Instead of relying
on one decision tree, the random forest takes the prediction from each tree and predicts the final
output based on the majority vote of those predictions. A greater number of trees in the forest
leads to higher accuracy and prevents the problem of overfitting.
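A minimal scikit-learn sketch of a random forest; the Iris dataset and the parameter values are used only for illustration.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each trained on a bootstrap sample of the training data;
# the final class is decided by majority vote of the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # accuracy on the held-out test set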
• One-versus-all
• One-versus-one
One-versus-all
The one-versus-all method is usually implemented using a “Winner-Takes-All” (WTA) strategy. It
constructs M binary classifier models where M is the number of classes. The i th binary classifier is
trained with all the examples from ith class Wi with positive labels (typically +1), and the examples
from all other classes with negative labels (typically -1).
One-versus-one
The one-versus-one method is usually implemented using a “Max-Wins” voting (MWV) strategy.
This method constructs one binary classifier for every pair of distinct classes and so, all together it
constructs M(M - 1)/2 binary classifiers. The binary classifier Cij is trained with examples from ith
class Wi and jth class Wj only, where examples from class Wi take positive labels while examples
from class Wj take negative labels.
Some researchers also proposed “all-together” approaches that solve the multi-category
classification problem in one step by considering all the examples from all classes together at once.
However, the training speed of “all-together” methods is usually slow.
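Both decomposition strategies are available as wrappers in scikit-learn; the sketch below applies them to a three-class problem, so one-versus-one builds 3(3 − 1)/2 = 3 binary classifiers. The base learner and dataset are placeholders.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)   # M = 3 classes

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # M binary classifiers
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)    # M(M-1)/2 binary classifiers

print(len(ovr.estimators_))   # 3
print(len(ovo.estimators_))   # 3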
11.13 Cross-Validation
Cross-validation is a technique in which we train our model using a subset of the data set and
then evaluate it using the complementary subset of the data set. The following are the three steps
involved in cross-validation:
1. Reserve a portion of the data set as the validation set.
2. Train the model using the remaining part of the data set.
3. Evaluate the model's performance on the reserved (validation) set.
Methods of Cross-Validation
Validation
In this process, 50% of the data set is used for training, and the other 50% is used for testing.
The biggest disadvantage of this approach is that we train on only 50% of the dataset; the
remaining 50% of the data may contain crucial information that we leave out when training our
model, resulting in higher bias.
K-Fold Cross-Validation
This approach divides the data set into k subsets (also known as folds), trains the model on k − 1 of
the subsets, and holds out the remaining subset for evaluation of the trained model. We iterate k
times, with a different subset reserved for testing each time.
A common choice is k = 10: a very small value of k behaves like a simple hold-out validation, while
setting k equal to the number of samples gives the Leave-One-Out (LOOCV) method.
• Since K-fold cross-validation repeats the train/test split K times, it is K times faster than
Leave One Out cross-validation.
• Examining the detailed outcomes of the assessment procedure is easier.
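A short sketch of 10-fold cross-validation with scikit-learn; the decision tree model and the Iris dataset are placeholders chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

model = DecisionTreeClassifier(random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)   # k = 10 folds

# Each fold is held out once for testing while the other k-1 folds train the model
scores = cross_val_score(model, X, y, cv=kfold)

print(scores)          # accuracy for each of the 10 train/test splits
print(scores.mean())   # average accuracy across folds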
Advantages of Cross-Validation
11.14 Overfitting
When a statistical model fits the training data too closely (much like fitting ourselves into
oversized pants!), it is said to be overfitted. An overfitted model begins to learn from the noise and
inaccuracies in the data collection, and then fails to correctly categorize new data because it has
absorbed too many details and too much noise. Non-parametric and non-linear methods are
especially prone to overfitting because they have more flexibility in constructing models from the
dataset.
Summary
• Bayesian classifiers are statistical classifiers. They can predict class membership
probabilities, such as the probability that a given sample belongs to a particular class.
• Classification and prediction methods can be compared and evaluated according to the
criteria of Predictive accuracy, Speed, Robustness, Scalability and Interpretability.
• A decision tree is a flow-chart-like tree structure, where each internal node denotes a test
on an attribute, each branch represents an outcome of the test, and leaf nodes represent
classes or class distributions. The topmost node in a tree is the root node.
• The learning of the model is ‘supervised’ if it is told to which class each training sample
belongs. This contrasts with unsupervised learning (or clustering), in which the class labels
of the training samples are not known, and the number or set of classes to be learned may
not be known in advance.
• Naïve Bayes classifiers assume that the effect of a variable value on a given class is
independent of the values of other variables. This assumption is called class conditional
independence. It is made to simplify the computation and in this sense considered to be
“Naïve”.
• Non-parametric and non-linear approaches are the causes of overfitting since these types
of machine learning algorithms have more flexibility in constructing models based on the
dataset and can thus create unrealistic models.
• Cross-validation is a technique in which we train our model using the subset of the data-
set and then evaluate using the complementary subset of the data-set.
Keywords
Bayes theorem: Bayesian classification is based on Bayes theorem.
Bayesian belief networks: These are graphical models, which unlike naive Bayesian classifiers,
allow the representation of dependencies among subsets of attributes
Bayesian classification: Bayesian classifiers are statistical classifiers.
Classification: Classification is a data mining technique used to predict group membership for data
instances.
Decision Tree: A decision tree is a flow-chart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or
class distributions. The topmost node in a tree is the root node.
Decision tree induction: The automatic generation of decision rules from examples is known as rule
induction or automatic rule induction.
Overfitting: Decision trees that are too large are susceptible to a phenomenon called overfitting.
Prediction: Prediction is similar to classification, except that for prediction, the results lie in the
future.
Supervised learning: The learning of the model is ‘supervised’ if it is told to which class each
training sample belongs.
Review Questions
Q1. What do you mean by classification in data mining? Write down the applications of
classification in business.
Q2. Discuss the issues regarding the classification.
Q3.What is a decision tree? Explain with the help of a suitable example.
Q4. Write down the basic algorithm for decision learning trees.
Q5. Write short notes on the followings:
(a) Bayesian classification
(b) Bayes theorem
6. B 7. D 8. A 9. B 10. D
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
Alex Berson, Data Warehousing Data Mining and OLAP, Tata Mcgraw Hill, 1997
Alex Berson, Stephen J. Smith, Data warehousing, Data Mining & OLAP, Tata
McGraw Hill, Publications, 2004.
Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel
Processing, Kluwer Academic Publishers, 1998.
J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers,
1993.
Jiawei Han, Micheline Kamber, Data Mining – Concepts and Techniques, Morgan
Kaufmann Publishers, First Edition, 2003.
Matthias Jarke, Maurizio Lenzerini, Yannis Vassiliou, Panos Vassiliadis,
Fundamentals of Data Warehouses, Publisher: Springer
Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales,
and Customer Support), John Wiley & Sons, 1997.
Michael J. A. Berry, Gordon S Linoff, Data Mining Techniques, Wiley Publishing
Inc, Second Edition, 2004.
https://www.geeksforgeeks.org/basic-concept-classification-data-mining/
https://www.tutorialspoint.com/data_mining/dm_classification_prediction.htm
https://www.javatpoint.com/data-mining-techniques
Objective
After this lecture, you will be able to
Introduction
Weka is a set of data mining-related machine learning techniques. It includes data preparation,
categorization, regression, clustering, mining of association rules, and visualization tools. It
provides access to a vast variety of classification algorithms. One of the advantages of using the
Weka platform to solve your machine learning challenges is the wide variety of machine learning
algorithms available. Weka comes with many built-in functions for constructing a variety of
machine learning techniques, ranging from linear regression to neural networks. With just a press
of a button, you can apply even the most complex algorithms to your dataset! Not only that, but
Weka also allows you to use Python and R to access some of the most popular machine learning
library algorithms.
Steps to be followed
1. Using the choose file option, we must first load the necessary dataset into the Weka tool.
We're going to run the weather-nominal dataset here.
2. Now, on the top left side, go to the classify tab, click the choose button, and pick the Naive
Bayes algorithm.
3. To adjust the parameters, click the choose button on the right side, and in this example,
we'll accept the default values.
4. From the “Test” options in the main panel, we choose Percentage split as our
measurement process. We'll use the percentage split of 66 percent to get a clear estimate of
the model's accuracy since we don't have a separate test data set. There are 14 examples in
our dataset, with 9 being used for training and 5 for testing.
5. We'll now click "start" to begin creating the model. The assessment statistic will appear in
the right panel after the model has been completed.
6. It's worth noting that the classification accuracy of the model is about 60%. This means
that by making certain changes, we would be able to improve the accuracy. (Either during
preprocessing or when choosing current classification parameters.)
Furthermore, we can use our models to classify new instances. In the main panel's "Test options",
select the "Supplied test set" radio button, then click the "Set" button.
This will open a pop-up window that will allow us to open the test instance file. It can be used to
further increase the accuracy of the module by using different test sets.
Use the built-in airline.arff file from the Weka datasets and implement the Naïve Bayes algorithm on it.
Certain algorithms are greyed out depending on whether the problem is a classification or
regression problem (piecewise linear function M5P for regression). Besides that, certain decision
trees have distinct Weka implementations (for example, J48 implements C4.5 in Java)
In Weka, creating a decision tree is relatively easy. Simply follow the steps below :
3. To classify the unclassified data, go to the "Classify" tab. Select "Choose" from the drop-
down menu. Select "trees -> J48" from this menu.
4. To begin, press the Start Button. The performance of the classifier will be visible on the
right-hand panel. The run information is shown in the panel as follows:
Scheme: The classification algorithm that was used is referred to as the scheme.
Instances: Number of data rows in the dataset.
Attributes: There are five attributes in the dataset.
The decision tree is defined by the number of leaves and the size of the tree.
Time taken to build the model: Time for the output.
Full classification of the J48 pruned with the attributes and number of instances.
“A decision tree divides nodes based on all available variables and then chooses the split that
produces the most homogeneous sub-nodes.” The homogeneity of a sample at a split is calculated
using Information Gain.
Right-click on the result and choose to visualize the tree from the list.
5. The output is in the form of a decision tree. The main attribute is “outlook”.
If the outlook is sunny, the tree further analyzes the humidity. If the humidity is high, the class
label is play = "no"; if the humidity is normal, play = "yes".
If the outlook is overcast, the class label play is "yes". The number of instances that obey this
classification is 4.
If the outlook is rainy, further classification takes place on the attribute "windy". If windy = true,
then play = "no". The number of instances that obey the classification for outlook = rainy and
windy = true is 2.
How would this instance be classified using the decision tree? outlook = sunny, temperature
= cool, humidity = high, windy = TRUE.
• Web page content mining: The typical search of a web page by content is known as
content mining.
• Search result mining: The term "search result mining" refers to the second search of sites
found in a prior search.
1. Agent-based approach
2. Database approach
If a user wants to search for a particular book, then the search engine provides a list of
suggestions.
Web structure mining can be very useful to companies to determine theconnection between
two commercial websites.
Preprocessing: Preprocessing entails translating the following information from diverse data
sources:
• usage information
• content information
• structure information
into the data abstractions required for pattern identification from the many available data sources.
Pattern Discovery: Pattern recognition uses methods and algorithms from a variety of domains,
including statistics, data mining, machine learning, and pattern recognition.
Pattern Analysis: Pattern analysis is the final step in the whole Web Usage mining process. Its goal
is to eliminate uninteresting rules or patterns from the set discovered during the pattern discovery
phase. The application for which Web mining is done usually determines the exact analytic
methodology.The most common form of pattern analysis consists of:
• A knowledge query technique, such as SQL, is the most prevalent kind of pattern
analysis.
• Another method is to input user data into a data cube and use it for Online Analytical
Processing (OLAP).
• Visualization techniques like graphing patterns or assigning colors to distinct values can
often reveal broad patterns or trends in data.
• Patterns having pages of a specific usage kind, content type, or pages that fit a specific
hyperlink structure can be filtered out using content and structure information.
Relevance of Data
It is assumed that a specific person is only concerned with a limited area of the internet, while the
rest of the web contains data that is unfamiliar to the user and may lead to unexpected results.
Summary
• Naive Bayes is a classification algorithm. Traditionally it assumes that the input values are
nominal, although numerical inputs are supported by assuming a distribution.
• Naïve Bayes classifier is a statistical classifier. It assumes that the values of attributes in the
classes are independent.
• J48 can deal with both nominal and numeric attributes
• Decision trees are also known as Classification And Regression Trees (CART).
• Each node in the tree represents a question derived from the features present in your
dataset.
• Weka is free open-source software that comes with several machine learning algorithms
that can be accessed via a graphical user interface.
• Pattern analysis is the final step in the whole Web Usage mining process.
• Naïve Bayes classifiers assume that the effect of a variable value on a given class is
independent of the values of other variables. This assumption is called class conditional
independence. It is made to simplify the computation and in this sense considered to be
“Naïve”.
Keywords
Bayesian classification: Bayesian classifiers are statistical classifiers.
Classification: Classification is a data mining technique used to predict group membership for data
instances.
Decision Tree: A decision tree is a flow-chart-like tree structure, where each internal node denotes a
test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or
class distributions. The topmost node in a tree is the root node.
Predictive accuracy: This refers to the ability of the model to correctly predict the class label of new
or previously unseen data.
Intelligent search agents: Intelligent search agents automatically search for information according
to a particular query using domain characteristics and user profiles.
Web page content mining: The typical search of a web page by content is known as content mining.
Instances: Number of data rows in the dataset.
B. Select Classifier
C. Create Classifier
D. Build Classifier
5. In Weka which of the following options is/are available under test options.
A. Use Training Set
B. Supplied Test Set
C. Cross-Validation
D. All of the above.
6. ____________ work by learning answers to a hierarchy of if/else questions leading to a decision.
A. Decision trees
B. KNN
C. Naïve Bayes
D. Perceptron
7. Which of the following steps are used to load the dataset in Weka.
A. Open Weka GUI
B. Select the “Explorer” option.
C. Select “Open file” and choose your dataset.
D. All of the above
8. While implementing a decision tree in Weka, from the drop-down list select _________, which will
open all the tree algorithms.
A. Tree
B. Trees
C. Decision Tree
1. A 2. D 3. B 4. A 5. D
6. A 7. D 8. B 9. A 10. A
Review Questions
Q1. Create a dataset and apply a decision tree algorithm on it and display the results in the tree
form.
Q2. Consider training and predicting with a naive Bayes classifier for two document classes. The
word “booyah” appears once for class 1, and never for class 0. When predicting new data, if the
classifier sees “booyah”, what is the posterior probability of class 1?
Q3. Elucidate the concept of web mining along with its various categories.
Q4. With the help of example explain the various challenges of Web mining.
Q5. Consider the weather.arff file and discuss why does it make more sense to test the feature
“outlook” first? Implement the dataset using decision tree.
Further Reading
Thuraisingham, B. (1999). Data Mining: Technologies, Techniques, Tools, and Trends. CRC
Press.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2005). Practical machine learning tools and
techniques. Morgan Kaufmann, 578.
Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., & Cunningham, S. J. (1999).
Weka: Practical machine learning tools and techniques with Java implementations.
Ngo, T. (2011). Data mining: practical machine learning tools and techniques, by Ian H.
Witten, Eibe Frank, Mark A. Hall. ACM SIGSOFT Software Engineering Notes, 36(5), 51-52.
Kaluža, B. (2013). Instant Weka How-to. Packt Publishing Ltd.
Veart, D. (2013). First, Catch Your Weka: A Story of New Zealand Cooking. Auckland
University Press.
https://machinelearningmastery.com/use-classification-machine-learning-algorithms-
weka/
https://scienceprog.com/building-and-evaluating-naive-bayes-classifier-with-weka/
https://weka.sourceforge.io/doc.dev/weka/classifiers/bayes/package-summary.html
https://www.analyticsvidhya.com/blog/2020/03/decision-tree-weka-no-coding/
https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/J48.html
Objectives
After this unit, you will be able to
• Understand the concept of unsupervised learning.
• Learn the various clustering algorithms.
• Know the difference between clustering and classification.
• Implement K-Means and hierarchical clustering using Weka.
Introduction
Clustering is the organization of data in classes. However, unlike classification, it is used to place
data elements into related groups without advance knowledge of the group definitions, i.e., class
labels are unknown and it is up to the clustering algorithm to discover acceptable classes. Clustering
is also called unsupervised classification because the classification is not dictated by given class
labels. There are many clustering approaches all based on the principle of maximizing the similarity
between objects in the same class (intra-class similarity) and minimizing the similarity between
objects of different classes (inter-class similarity). Clustering can also facilitate taxonomy formation,
that is, the organization of observations into a hierarchy of classes that group similar events together.
Example: For a data set with two attributes: AGE and HEIGHT, the following rule represents
most of the data assigned to cluster 10:
If AGE >= 25 and AGE <= 40 and HEIGHT >= 5.0ft and HEIGHT <= 5.5ft then CLUSTER = 10
In unsupervised learning “The outcome or output for the given inputs is unknown”
Example: If the machine wants to differentiate between fishes and birds. If we do not tell the
machine which animal is a fish and which is a bird (that is until we do not label them) machine will
have to differentiate them by using similarities or patterns found in their attributes.
Clustering Requirements
The reasons why clustering is significant in data mining are as follows:
1. Scalability
To work with huge databases, we need highly scalable clustering techniques.
4. Interpretability
The results should be complete, usable, and interpretable.
5. High dimensionality
Instead of merely dealing with low-dimensional data, the algorithm should be able to deal with high-
dimensional space.
• Agglomerative
• Divisive
1. Agglomerative
The bottom-up technique is another name for this method. We begin by placing each object in its
own group. The method then continues to merge objects or groups that are close together, and it
keeps doing so until all of the groups have been merged into one or the termination condition has
been met.
2. Divisive
The opposite of Agglomerative, Divisive starts with all of the points in one cluster and splits them to
make other clusters. These algorithms generate a distance matrix for all existing clusters and link
them together based on the linkage criteria. A dendrogram is used to show the clustering of data
points.
Partitional Method
This is one of the most popular methods for creating clusters among analysts. Clusters are partitioned
based on the properties of the data points in partitioning clustering. For this clustering procedure,
we must provide the number of clusters to be produced. These clustering algorithms use an iterative
procedure to allocate data points between clusters based on their distance from one another. The
following are the algorithms that fall within this category: –
1. K-Means Clustering
One of the most extensively used methods is K-Means clustering. Based on the distance metric used
for clustering, it divides the data points into k clusters. The user is responsible for determining the
value of ‘k.' The distance between the data points and the cluster centroids is calculated. The cluster
is assigned to the data point that is closest to the cluster's centroid. It computes the centroids of those
clusters again after each iteration, and the procedure repeats until a pre-determined number of
iterations have been completed or the centroids of the clusters have not changed after each iteration.
Density-Based Methods
Clusters are produced using this method depending on the density of the data points represented in
the data space. Clusters are locations that become dense as a result of the large number of data points
that reside there.
The data points in the sparse region (the region with the few data points) are referred to as noise or
outliers. These methods allow for the creation of clusters of any shape. Examples of density-based
clustering methods are as follows:
Average Case: Based on the data and algorithm implementation, the average case is the same as the
best/worst case.
Grid-Based methods
The collection of information is represented in a grid structure that consists of grids in grid-based
clustering (also called cells). This method's algorithms take a different approach from the others in
terms of their overall approach. They're more concerned about the value space that surrounds the
data points than the data points themselves. One of the most significant benefits of these algorithms
is their reduced computational complexity. As a result, it's well-suited to coping with massive data
sets. It computes the density of the cells after partitioning the data sets into cells, which aids in cluster
identification. The following are a few grid-based clustering algorithms: –
1. Classification is used for supervised learning whereas clustering is used for unsupervised learning.
2. The process of classifying the input instances based on their corresponding class labels is known
as classification whereas grouping the instances based on their similarity without the help of class
labels is known as clustering.
3. As classification uses class labels, training and testing datasets are needed for verifying the model
created, whereas there is no need for separate training and testing datasets in clustering.
4. Classification is more complex as compared to clustering as there are many levels in the
classification phase whereas only grouping is done in clustering.
5. Classification examples are Logistic regression, Naive Bayes classifier, Support vector machines,
etc. Whereas clustering examples are k-means clustering algorithm, Fuzzy c-means clustering
algorithm, Gaussian (EM) clustering algorithm, etc.
Task: What criteria can be used to decide the number of clusters in k-means statistical analysis?
1) Go to the Preprocess tab in WEKA Explorer and select Open File. Select the dataset
"vote.arff."
2) Select the "Choose" button from the "Cluster" menu. Choose "SimpleKMeans" as the
clustering algorithm.
4. In the left panel, select Start. The algorithm's output is displayed on a white screen. Let us analyze
the run information:
The properties of the dataset and the clustering process are described by the terms Scheme, Relation,
Instances, and Attributes. The vote.arff dataset has 435 instances and 13 attributes in this scenario.
The number of iterations of the KMeans clusterer is 5. The sum of squared errors is 1098.0; as the
number of clusters grows, this error will decrease. A table is used to represent the 5 final
clusters with their centroids. Cluster centroids are 168.0, 47.0, 37.0, 122.0, 33.0, and 28.0 in our example.
The algorithm will assign the class label to the cluster. Cluster 0 represents republican and Cluster 1
represents democrat. The Incorrectly clustered instance is 14.023% which can be reduced by
ignoring the unimportant attributes.
6. Use the “Visualize” tab to visualize the Clustering algorithm result. Go to the tab and click on any
box. Move the Jitter to the max.
Caution: The number of clusters to use necessitates a precise balancing act. Larger k values can
increase cluster homogeneity, but they also risk overfitting.
2. Select the "Choose" button from the "Cluster" menu. Choose "HierarichicalCluster" as the
clustering algorithm.
3. In the left panel, select Start. The algorithm's output is displayed on a white screen. Let us
analyze the run information:
The properties of the dataset and the clustering process are described by the terms Scheme,
Relation, Instances, and Attributes. The iris.arff dataset has 150 instances and 5 attributes
in this scenario.
The algorithm will assign the class label to the cluster. Cluster 0 represents Iris-setosa and
Cluster 1 represents Iris-versicolor. The Incorrectly clustered instance is 33.3333 % which
can be reduced by ignoring the unimportant attributes.
5 . Use the “Visualize” tab to visualize the Clustering algorithm result. Go to the tab and click on any
box. Move the Jitter to the max. The X-axis and Y-axis represent the attribute.
Task: Create your dataset and implement a hierarchical clustering algorithm on it and visualize
the results.
Summary
• The learning of the model is ‘supervised’ if it is told to which class each training sample
belongs. In contrasts with unsupervised learning (or clustering), in which the class labels
of the training samples are not known, and the number or set of classes to be learned may
not be known in advance.
• The properties of the dataset and the clustering process are described by the terms Scheme,
Relation, Instances, and Attributes.
• Classification is more complex than clustering because it involves several stages, whereas clustering involves only the grouping step.
• Data clustering can also aid marketers in identifying distinct customer groups and in categorizing customers based on their purchasing habits.
• One data point can only belong to one cluster in hard clustering. In soft clustering, however,
the result is a probability likelihood of a data point belonging to each of the pre-defined
clusters.
• Unsupervised learning is another machine learning method that uses unlabeled input data
to discover patterns.
Keywords
Unsupervised Learning: Unsupervised Learning is a machine learning technique in which the users
do not need to supervise the model.
Dendrogram: In a hierarchical clustering algorithm, we develop the hierarchy of clusters in the form of
a tree, and this tree-shaped structure is known as the dendrogram.
Clustering: Grouping of unlabeled examples is called clustering.
Grid-based Methods: In this method, the data space is formulated into a finite number of cells that
form a grid-like structure.
Similarity measure: You can measure similarity between examples by combining the examples'
feature data into a metric, called a similarity measure.
High Dimensional Data: High-dimensional data is characterized by multiple dimensions. There can
be thousands, if not millions, of dimensions.
D. None
5. DBSCAN is an example of which of the following clustering methods?
A. Density-Based Methods
B. Partitioning Methods
C. Grid-based Methods
D. Hierarchical Based Methods
6. STING is an example of which of the following clustering methods?
A. Density-Based Methods
B. Partitioning Methods
C. Grid-based Methods
D. Hierarchical Based Methods
7. Which of the following comes under applications of clustering?
A. Marketing
B. Libraries
C. City Planning
D. All of the above
8. In _____________ there is no need of a training and testing dataset.
A. Classification
B. Clustering
C. Support vector Machines
D. None
9. Implementations of k-means only allow ____________ values for attributes.
A. Numerical
B. Categorical
C. Polynomial
D. Text
10. The WEKA SimpleKMeans algorithm uses the __________ distance measure to compute distances
between instances and clusters.
A. Euclidean
B. Manhattan
C. Correlation
D. Eisen
11. To perform clustering, select the _________ tab in the Explorer and click on the _________ button.
A. Choose, Cluster
B. Cluster, Choose
C. Create, Cluster
D. Cluster, Create
12. Click on the ___________ tab to visualize the relationships between variables.
A. View
B. Show
C. Visualize
D. Graph
13. Examine the results in the Cluster output panel, which give us:
A. The number of iterations of the K-Means algorithm to reach a local optimum.
B. The centroids of each cluster.
C. The evaluation of the clustering when compared to the known classes.
D. All of the Above
14. In ___________________ Weka can evaluate clustering on separate test data if the cluster
representation is probabilistic.
A. Supplied test set or Percentage split
B. Classes to clusters evaluation
C. Use training set
D. Visualize the cluster assignments
15. In this mode, Weka first ignores the class attribute and generates the clustering.
A. Supplied test set or Percentage split
B. Classes to clusters evaluation
C. Use training set
D. Visualize the cluster assignments
Review Questions
Q1) Briefly describe and give examples of each of the following approaches to clustering:
partitioning methods, hierarchical methods, density-based and grid-based methods.
Q2) Elucidate the step-by-step working of K-Means with examples.
Q3) Explain the concept of unsupervised learning with examples. “Clustering is known as
unsupervised learning” justify the statement with an appropriate example.
Q4) Differentiate between classification and clustering. Discuss the various applications of
clustering.
Q5) With example discuss the various partitioning-based clustering methods.
Q6) Elucidate the various types of hierarchical clustering algorithms by giving the example of each
type.
Answers: 1. A, 2. B, 3. A, 4. C, 5. A, 6. C, 7. D, 8. B, 9. A, 10. A, 11. B, 12. C, 13. D, 14. A, 15. B
Further Readings
King, R. S. (2015). Cluster analysis and data mining: An introduction. Stylus Publishing, LLC.
Wu, J. (2012). Advances in K-means clustering: a data mining thinking. Springer Science &
Business Media.
Mirkin, B. (2005). Clustering for data mining: a data recovery approach. Chapman and
Hall/CRC.
Tan, P. N., Chawla, S., Ho, C. K., & Bailey, J. (Eds.). (2012). Advances in Knowledge Discovery
and Data Mining, Part II: 16th Pacific-Asia Conference, PAKDD 2012, Kuala Lumpur, Malaysia,
May 29-June 1, 2012, Proceedings, Part II (Vol. 7302). Springer.
Liu, H., & Motoda, H. (2012). Feature selection for knowledge discovery and data mining (Vol.
454). Springer Science & Business Media.
Witten, I. H., Frank, E., Trigg, L. E., Hall, M. A., Holmes, G., & Cunningham, S. J. (1999).
Weka: Practical machine learning tools and techniques with Java implementations.
A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.
https://www.tutorialspoint.com/data_mining/dm_cluster_analysis.htm
https://www.upgrad.com/blog/cluster-analysis-data-mining/
https://data-flair.training/blogs/clustering-in-data-mining/
Objectives
After studying this unit, you will be able to:
• Understand the importance of data warehouses in financial data analysis and the retail
industry.
• Know the importance of data warehouses in the Indian Railway reservation system and
other industries.
• Learn the importance of weather forecasting in real life.
Introduction
Online transaction processing (OLTP) systems address an organization's operational data needs,
which are critical to the day-to-day operation of a business. However, they are not well suited to
supporting the decision-support or business questions that managers frequently face. Such questions
are best served by online analytical processing (OLAP) systems, which support analytics such as
aggregation, drill-down, and slicing/dicing of data. By storing and managing data in a multidimensional
manner, data warehouses aid OLAP applications. Extract, Transform, and Load (ETL) tools are used to
extract and load data into an OLAP warehouse from numerous OLTP data sources. The requirement
for a data warehouse is driven by a business need and a business strategy. A data warehouse is a
foundation for sophisticated data analysis; it aids business decision-making by allowing managers
and other company users to analyze data and conduct analysis more efficiently.
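To make the OLAP terms above concrete, the following toy sketch (not drawn from the case studies in this unit; the table, column names, and figures are invented, and Python with pandas is assumed) shows what roll-up, drill-down, and slicing look like on a small sales table.

# Toy pandas sketch of OLAP-style operations; all data here is invented.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["TV", "Radio", "TV", "Radio"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "amount":  [1200, 300, 900, 450],
})

# Roll-up: summarize amount by region, aggregating away product and quarter.
print(sales.groupby("region")["amount"].sum())

# Drill-down: break the same totals out by region and product.
print(sales.groupby(["region", "product"])["amount"].sum())

# Slice: restrict the data to one quarter before aggregating.
print(sales[sales["quarter"] == "Q1"].groupby("region")["amount"].sum())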
Financial data warehouses work in the same way as regular data warehouses do. After data is
acquired and put in a data warehouse, it is structured into a specific schema that categorizes the data.
When the data needs to be analyzed, this allows for easy access.
A financial data warehouse built on current technologies can increase data quality and help an institution
gain insight into customer behavior. These insights often improve the effectiveness of marketing
campaigns and loyalty programs. Financial analytics initiatives work best when they are business-driven
and begin by establishing clear requirements, as the following case study illustrates.
The Challenge
Every client stakeholder wanted an answer to the same question: How are sales going? There was no
single location where executives, finance managers, and sales managers could get a comprehensive
picture of the sales team's operations and evaluate key data such as present performance, prior
performance, and pipeline. The client's IT department was having trouble keeping its governance
and planning initiatives on track. They had taken measures to centralize the numerous data sources
they needed to deal with and get the ball rolling on accurate, actionable reporting on those assets,
but their infrastructure was outdated and no longer capable of handling the increased information
demands. They engaged Ironside to help them upgrade their old data warehouse to a more current
system that would help them achieve three important department goals:
The Journey
The Ironside Information Management team agreed to collaborate with the client's database
engineers to make the move to a modern data warehouse a success. The Ironside resources assigned
to the project outlined a full-scale migration and redesign plan that would bring about 20 tables from
a variety of data sources, including TM1, A-Track, Remedy, Oracle GL, and Excel, into an IBM
PureData for Analytics (Netezza) implementation capable of meeting and exceeding the client's
requirements, based on discovery conversations with IT leadership. Ironside was also tasked with
reengineering both the ETL processes required to transport and transform all of the numerous
information streams, as well as the reporting layer that makes that information available for analysis,
as part of the migration.
So far, Ironside has completed the following phases:
• Gathered requirements and use cases for the new data warehouse.
• Documented the existing warehouse logic, including all data extraction,
transformations, schedules, and so on.
• Served as the solution architect for a data warehouse redesign proof of concept.
• Set specifications, with the client's database engineers, for the new ETL operations
that feed the POC design.
• Rebuilt the reporting layer to work in the new environment.
• Tested data handling and report output using the POC system.
The Results
The initial findings from the proof-of-concept and testing phases of the project are very promising,
and Ironside’s team is confident that the final solution will deliver the modern data-handling
functionality that the client needs to continue their success.
The positive outcomes are:
1. Improved Cognos reporting usability and efficiency.
2. Support for time-comparison reporting, such as comparing data from the previous month to the
current day.
3. A performance boost during testing of roughly 60 times over what was achievable in the old
environment.
Ironside and the customer are now making final modifications and preparing to roll out the full-scale data
warehouse solution, using this strong proof of concept as a springboard. With this level of data
performance at their disposal, the IT staff will be able to meet the demanding expectations of the
financial services industry and deliver the kind of answers that will propel the company forward.
1. Analytic Types
Many businesses are interested in using financial data for a range of analytics purposes. Predictive
and real-time analytics are the most popular types of analytics in finance. The goal of predictive
analytics is to find patterns in financial data to anticipate future events. Real-time analytics is
employed in a variety of applications, including consumer intelligence, fraud detection, and more.
2. Capturing Customer Data
Customers today use various channels on multiple devices, making data collection more challenging
and likely to become more complex in the future. Data warehouses enable businesses to record every
interaction with a consumer, providing unprecedented insight into what motivates customers.
3. Personalization
Many financial institutions have pushed to engage more with clients to stay competitive. One
approach to achieve this is to send more tailored messages, which is possible thanks to data science.
The ability to create more tailored interactions means that you can reach out to customers at the most
appropriate times and with the most effective message.
4. Risk Management
By consolidating historical transaction and account data in one place, a data warehouse helps financial
institutions assess credit and market risk, monitor exposure across products, and spot unusual patterns
that may indicate fraud.
Example: Data warehouses are largely utilized in the investing and insurance industries to
assess consumer and market trends, as well as other data patterns. Data warehouses are critical in
two key sub-sectors: forex and stock markets, where a single point discrepancy can result in
enormous losses across the board. Data warehouses are typically shared in these industries and focus
on real-time data streaming.
Notes: The Data Warehouse Data Model for Retail and Distribution (Retail DWH model ®) is a
standard industry data warehouse model that covers traditional BI requirements, regulatory needs,
and Big Data Analytics requirements for retailers and wholesalers.
2. Establish an information system for executives to use in making strategic decisions, such as:
• Where in the store do perishable commodities such as fruits and vegetables need to be
moved quickly?
• How long can meat meals be kept on the rack at Apex's Illinois location?
3. Roll-up and drill-down capabilities in the Central Repository provide sales by category, region,
store, and more.
Example: Data warehouses are commonly employed for distribution and marketing. They also
aid in tracking items, customer purchasing patterns, and promotions, as well as in deciding
pricing strategy.
• Railway executives (and even railway programmers) are unable to develop ad hoc queries.
• The application database is fragmented over several servers around the country, making it
difficult for railway users to locate data in the first place.
• Ad hoc querying of PRS OLTP systems is prohibited by database administrators to prevent
analytical users from running queries (e.g., how many passengers are booked for a specific
journey day) that slow down the mission-critical PRS databases.
Aim of study
We have recorded all of the information on the trains scheduled and the users booking tickets, as
well as the condition of the trains, seats, and so on, in our study of the railway reservation system.
This database is useful for applications that allow customers to book train tickets and check train
details and status from their home, avoiding the inconvenience of having to go to the railway station
for every question. Passengers can use the Railway Reservation System to inquire about:
1. The trains that are available based on their origin and destination.
2. Ticket booking and cancellation, as well as the status of a booked ticket.
The goal of this case study is to design and create a database that stores information about various
trains, train status, and passengers.
The major goal of keeping a database for the Railway Reservation System is to limit the number of
manual errors that occur during ticket booking and cancellation, and to make it easy for customers and
providers to keep track of information about clients and available seats. Many flaws in
manual record-keeping can be eliminated through automation.
The data will be obtained and processed quickly. The proposed system could be web-enabled in the
future to allow clients to make a variety of inquiries about trains between stations; without such a
system, issues arise from time to time and frequently lead to disputes with customers.
Some assumptions have been made to implement this sample case study, which are as follows:
1. The number of trains is limited to five.
2. The reservation is only available for the following seven days from the present date.
3. There are only two types of tickets available for purchase: AC and General.
4. There are a total of 10 tickets available in each category (AC and General).
5. The maximum number of tickets that can be given a waiting status is two.
6. The time between halt stations and their reservations is not taken into account.
Process
A train list and passenger information must be kept up to date. During the booking process, the
passenger's train number, train date, and category are read. An appropriate record is fetched from the
Train Status table based on the values provided by the passenger. If the desired category is AC, the
total number of AC seats is compared with the number of booked AC seats to determine whether a
ticket can be booked.
The same check applies to the General category. If a ticket can be booked, the passenger's details are
read and stored in the Passenger database. During the cancellation procedure, the passenger's ticket ID
is read and the Passenger database is checked for a matching record; if one exists, the record is deleted.
If the deleted ticket was confirmed, the Passenger table is then searched for the first record with
waiting status for the same train and category, and its status is updated to confirmed. A short code
sketch of this logic appears below.
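The following sketch is a hypothetical, simplified rendering of the booking and cancellation process just described; the data structures and function names are invented for illustration (a single train is assumed), while the limits follow the case-study assumptions of 10 tickets per category and at most 2 tickets with waiting status.

# Hypothetical sketch of booking and cancellation for one train (Python).
TOTAL_SEATS = 10      # tickets available in each category (AC and General)
MAX_WAITING = 2       # maximum tickets that can be given waiting status

train_status = {"AC": {"booked": 0}, "General": {"booked": 0}}
passengers = []       # each entry: {"ticket_id", "category", "status"}

def book(ticket_id, category):
    status = train_status[category]
    waiting = sum(1 for p in passengers
                  if p["category"] == category and p["status"] == "waiting")
    if status["booked"] < TOTAL_SEATS:
        status["booked"] += 1
        passengers.append({"ticket_id": ticket_id, "category": category,
                           "status": "confirmed"})
        return "confirmed"
    if waiting < MAX_WAITING:
        passengers.append({"ticket_id": ticket_id, "category": category,
                           "status": "waiting"})
        return "waiting"
    return "rejected"

def cancel(ticket_id):
    for i, p in enumerate(passengers):
        if p["ticket_id"] == ticket_id:
            record = passengers.pop(i)
            if record["status"] == "confirmed":
                train_status[record["category"]]["booked"] -= 1
                # Promote the first waiting passenger of the same category.
                for q in passengers:
                    if (q["category"] == record["category"]
                            and q["status"] == "waiting"):
                        q["status"] = "confirmed"
                        train_status[record["category"]]["booked"] += 1
                        break
            return True
    return False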
Telephone Industry: The telephone industry uses both offline and online data, resulting in a large
amount of historical data that must be aggregated and integrated.
Apart from those processes, a data warehouse is required for the study of fixed assets, the study of
customer calling habits for salespeople to push advertising campaigns, and the tracking of customer
inquiries.
Transportation Industry: Client data is recorded in data warehouses in the transportation industry,
allowing traders to experiment with target marketing, where marketing campaigns are created with
the needs of the customer in mind. They are used in the industry's internal environment to monitor
customer feedback, performance, crew management on board, and customer financial information
for pricing strategies.
Weather Forecasting
The first step in constructing a data warehouse is to gather forecaster knowledge to construct the
subject system. Weather forecasting is the use of science and technology to anticipate the state of the
atmosphere in a specific region. Temperature, rain, cloudiness, wind speed, and humidity are all
factors to consider. Weather warnings are a type of short-range forecast that is used to protect
people's lives.
According to forecasters, there are currently six categories of subjects in our subject system:
1. Target Element: The target element of the forecast, such as rainfall, temperature, humidity, and
wind, must be included in the subject system.
2. Index Data: Some index data are useful; they are usually flags or indicators of future weather, such
as the surface wind direction at Wutaishan or the 500 hPa height change over western Siberia.
3. Statistical Data: Subjects such as the average of 5 stations, the maximum temperature gradient in
an area, or the minimum relative humidity over the last 5 days should be included in the subject
system to analyze statistical features of a region or period.
4. Transformed Data: Data produced by orthogonal transformations, such as the wavelet transform,
and data filtered with low-pass, high-pass, and other techniques.
5. Weather System: Even though the forecaster's knowledge is based on the study of model output,
synoptic expertise is vital, so the weather system should be part of the subject system. Highs, lows,
large-gradient areas, and saddle areas for scalar elements, together with convergence and divergence
centers, are all examples of weather systems.
Challenges
Weather forecasts still have their limitations despite the use of modern technology and improved
techniques to predict the weather. Weather forecasting is complex and not always accurate, especially
for days further in the future, because the weather can be chaotic and unpredictable. If weather
patterns are relatively stable, the persistence method of forecasting provides a relatively useful
technique to predict the weather for the next day. Weather observation techniques have improved
and there have been technological advancements in predicting the weather in recent times. Despite
this major scientific and technical progress, many challenges remain regarding long-term weather
predictability. The accuracy of individual weather forecasts varies significantly.
Summary
• Every business, including retail, is being disrupted by the digital revolution, which is
offering tremendous new opportunities.
• The major goal of keeping a database for the Railway Reservation System is to limit the
number of manual errors that occur during ticket booking and cancellation, and to make it
easy for customers and providers to keep track of information about clients and available
seats.
• A data warehouse is a centralized storage location for all or large portions of the data
collected by an organization's numerous business systems.
• The hospitality industry employs data warehousing services to create and assess advertising
and marketing programs that target customers based on their feedback and travel patterns.
• Weather forecasting is complex and not always accurate, especially for days further in the
future, because the weather can be chaotic and unpredictable.
• Data warehousing and analytics have various advantages, including increased operational
efficiency, improved customer experience, and customer loyalty and retention.
Keywords
Key attribute: The key attribute is the attribute in a dimension that identifies the columns in the
dimension's main table that are used in foreign key relationships to the fact table.
Integrated: A data warehouse is developed by combining data from multiple heterogeneous sources,
such as flat files and relational databases, which consequently improves data analysis.
Data Mart: A data mart performs the same function as a data warehouse, but with a smaller scope.
It could be tailored to a certain department or line of business.
10. The main purpose of maintaining a database for the Railway Reservation System is
A. To reduce the manual errors involved in the booking and canceling of tickets
B. Make it convenient for the customers and providers to maintain the data about their customers
C. Data about the seats available
D. All of the Above
11. ___________ is the prediction of the state of the atmosphere for a given location using the
application of science and technology.
A. Weather forecasting
B. Temperature forecasting
C. Rain forecasting
D. Humidity forecasting
12. _________________ are a special kind of short-range forecast carried out for the protection of
human life.
A. Weather Messages
B. Weather Warnings
C. Weather Information
D. None
13. On an everyday basis, many use _________ to determine what to wear on a given day.
A. Rain forecasting
B. Temperature forecasting
C. Weather forecasting
D. None
14. Which of the following are steps in making a weather forecast?
A. Observation and analysis
B. Extrapolation to find the future state of the atmosphere.
C. Prediction of particular variables.
D. All of the Above
15. ______________ is the traditional and basic approach adopted in weather prediction.
A. Synoptic Weather Prediction
B. Numerical Weather Prediction
C. Statistical Weather Prediction
D. None
Review Questions
Q1) What is the significance of weather forecasting in today’s business? Explain the various methods
used in weather prediction.
Q2) Discuss the various applications of Data Warehouse by giving the example of each application.
Q3) “Every business, including retail, is being disrupted by the digital revolution, which is offering
tremendous new opportunities”. Justify the statement by giving the appropriate example.
Q4) Discuss a scenario that shows how a data warehouse is helpful in the railway reservation system.
Q5) Elucidate the role of Data Warehouse in financial data analysis.
Q6) Discuss the various challenges faced during weather forecasting.
Answers: 1. A, 2. D, 3. A, 4. D, 5. D, 6. A, 7. B, 8. D, 9. D, 10. D, 11. A, 12. B, 13. C, 14. D, 15. A
Further Readings
Wang, J. (Ed.). (2005). Encyclopedia of data warehousing and mining. iGi Global.
Inmon, W. H. (2005). Building the data warehouse. John Wiley & sons.
Adamson, C., & Venerable, M. (1998). Data warehouse design solutions. J. Wiley & Sons.
Blackwood, B. D. (2015). QlikView for Finance. Packt Publishing Ltd.
Pover, K. (2016). Mastering QlikView Data Visualization. Packt Publishing Ltd.
Kimball, R., & Ross, M. (2011). The data warehouse toolkit: the complete guide to
dimensional modeling. John Wiley & Sons.
https://www.import.io/post/using-a-data-warehouse-in-financial-services-how-web-data-integration-helps-you-win-in-the-financial-industry/
https://www.csub.edu/training/pgms/fdwp2/index.html
https://www.voicendata.com/retail-industry-needs-data-warehousing-analytics/
https://dwh-models.com/solutions/retail/