Data Mining and Business Intelligence
Taught By –
Mrs. Veena Kiragi
School of Computer Science Engg. and Applications
DY Patil International University
What is KDD?
• Knowledge Discovery from Data/Databases (KDD), also referred to as pattern
analysis or knowledge extraction.
• Sources of data –
– Business, such as e-commerce,
– Science, such as the biomedical field,
– Society, such as news, digital cameras, smartphones, and apps.
Contd.
• Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD.
• In the decades since the internet became ubiquitous, enormous volumes of data
have accumulated.
The KDD process: Data Selection → Data Transformation → Data Mining →
Pattern Evaluation → Knowledge Presentation
• Data Selection, where data relevant to the analysis task are retrieved from the
database.
• Data Transformation, where data are transformed and consolidated into forms
appropriate for mining.
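The sequence of KDD steps above can be illustrated with a toy pipeline. This is only a sketch; the sample records, the region filter, and the "interestingness" threshold are all hypothetical.

```python
# Toy sketch of the KDD process steps (hypothetical data and thresholds).

# Raw "database": transaction records, some irrelevant to the task.
records = [
    {"item": "laptop", "amount": 1200, "region": "west"},
    {"item": "phone",  "amount": 800,  "region": "east"},
    {"item": "laptop", "amount": 1100, "region": "west"},
    {"item": "desk",   "amount": 300,  "region": "west"},
]

# Data selection: retrieve only the records relevant to the analysis task.
selected = [r for r in records if r["region"] == "west"]

# Data transformation: consolidate into a form appropriate for mining.
amounts_by_item = {}
for r in selected:
    amounts_by_item.setdefault(r["item"], []).append(r["amount"])

# Data mining: extract a simple pattern (average sale per item).
patterns = {item: sum(v) / len(v) for item, v in amounts_by_item.items()}

# Pattern evaluation: keep only "interesting" patterns (here, average > 500).
interesting = {k: v for k, v in patterns.items() if v > 500}

# Knowledge presentation: report the result to the user.
for item, avg in interesting.items():
    print(f"{item}: average sale {avg:.0f}")
```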
Database Management System (DBMS)
• A DBMS manages the data, the database engine, and the database schema,
allowing data to be manipulated or extracted by users and other programs.
Data Mining Functionalities
• Data mining functionalities are used to specify the kinds of patterns to be found
in data mining tasks. They fall into two categories: descriptive and predictive.
• Descriptive mining tasks characterize properties of the data in a target data set.
• Predictive mining tasks perform induction on the current data in order to make
predictions.
The main data mining functionalities are:
• Characterization and Discrimination
• Mining of frequent patterns, associations, and correlations
• Classification and Regression
• Clustering Analysis
• Outlier Analysis
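As a small illustration of one functionality, mining frequent patterns can be sketched by counting item pairs that co-occur in transactions. The transaction data and the support threshold below are hypothetical; this is a naive pairwise count, not a full Apriori implementation.

```python
from collections import Counter
from itertools import combinations

# Toy transaction database (hypothetical) for frequent-pattern mining.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "butter"},
    {"bread"},
]
min_support = 2  # a pattern must appear in at least 2 transactions

# Count every pair of items that occurs together in some transaction.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Keep only the pairs that meet the minimum support threshold.
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
```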
Data Warehouse
• Data warehouse systems are valuable tools in today’s competitive, fast-evolving world.
A data warehouse is a central repository of information that can be analyzed to make
more informed decisions.
• Data warehouse systems allow for integration of a variety of application systems. They
support information processing by providing a solid platform of consolidated historic
data for analysis.
A data warehouse is the place where the valuable data assets of an organization
are stored.
Contd.
• A data warehouse is a type of data management system that is designed to enable and
support business intelligence (BI) activities.
• Data warehouses are solely intended to perform queries and analysis and often contain
large amounts of historical data.
• The data within a data warehouse is usually derived from a wide range of sources such as
application log files and transaction applications.
• A data warehouse centralizes and consolidates large amounts of data from multiple sources.
• Its analytical capabilities allow organizations to derive valuable business insights from
their data to improve decision-making.
Contd.
• Over time, it builds a historical record that can be invaluable to data scientists and
business analysts.
Subject Oriented
• A data warehouse is organized around major subjects such as customer, supplier,
product, and sales.
• Hence, data warehouses typically provide a simple and concise view of particular
subject issues by excluding data that are not useful in the decision support process.
Integrated
• A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
• Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.
• Data warehouses create consistency among different data types from disparate
sources.
Time-variant
• Data are stored to provide information from an historic perspective (e.g., the past 5–10
years).
• Every key structure in the data warehouse contains, either implicitly or explicitly, a
time element.
Non-volatile
• A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment.
• Due to this separation, a data warehouse does not require transaction processing,
recovery, and concurrency control mechanisms.
• It usually requires only two operations in data accessing: initial loading of data and
access of data.
• Once data is in a data warehouse, it is stable: it is not updated, though it may
eventually be archived or deleted.
What are the benefits of using a data warehouse?
• A data warehouse provides a consolidated, historical, analysis-ready view of
enterprise data, supporting better-informed decisions.
• Building on this, data warehousing is the process of constructing and using
data warehouses.
Contd…
• A data warehouse architecture is typically organized as a three-tier system.
• Data is stored in two different ways: 1) data that is accessed frequently is
stored in very fast storage (like SSD drives), and 2) data that is infrequently
accessed is stored in a cheap object store, like Amazon S3.
• The data warehouse will automatically make sure that frequently accessed data
is moved into the "fast" storage so query speed is optimized.
• The top tier is the front-end client that presents results of queries through
reporting, analysis, and data mining tools.
Source: https://www.youtube.com/watch?v=q9oAZwhuUy4
OLAP and OLTP Systems
• OLTP: Online Transaction Processing
• OLAP: Online Analytical Processing
Comparison of OLTP and OLAP Systems
Source: https://www.youtube.com/watch?v=q9oAZwhuUy4
OLTP Systems
• These systems are called online transaction processing (OLTP) systems.
• The major task of these systems is to perform online transaction and query
processing.
• They cover most of the day-to-day operations of an organization, such as
purchasing, inventory, manufacturing, banking, payroll, registration, and
accounting.
• An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making.

OLAP Systems
• These systems are known as online analytical processing (OLAP) systems.
• Data warehouse systems serve users or knowledge workers in the role of data
analysis and decision making.
• Such systems can organize and present data in various formats in order to
accommodate the diverse needs of different users.
• An OLAP system manages large amounts of historic data, provides facilities
for summarization and aggregation, and stores and manages information at
different levels of granularity.
OLTP Systems
• An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design.
• An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historic data or data in different organizations.
• The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery
mechanisms.

OLAP Systems
• An OLAP system typically adopts either a star or a snowflake model and a
subject-oriented database design.
• An OLAP system often spans multiple versions of a database schema, due to
the evolutionary process of an organization.
• OLAP systems also deal with information that originates from different
organizations, integrating information from many data stores.
• Because of their huge volume, OLAP data are stored on multiple storage media.
Comparison of OLTP and OLAP Systems
Contd.
Feature                     | OLTP                                | OLAP
Access                      | read/write                          | mostly read
Focus                       | data in                             | information out
Number of records accessed  | tens                                | millions
Number of users             | thousands                           | hundreds
DB size                     | GB to high-order GB                 | ≥ TB
Priority                    | high performance, high availability | high flexibility, end-user autonomy
Metric                      | transaction throughput              | query throughput, response time
Dimensional Data Models
• The entity-relationship data model is commonly used in the design of relational
databases, where a database schema consists of a set of entities and the relationships
between them.
• The most popular data model for a data warehouse is a multidimensional model,
which can exist in the form of a star schema, a snowflake schema, or a fact
constellation schema.
1. Star Schema
• The most common modeling paradigm is the star schema, in which the data warehouse
contains:
(1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
• The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
• A star schema for AllElectronics sales is shown in the figure. Sales are
considered along four dimensions: time, item, branch, and location.
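The star-schema query pattern can be sketched with pandas: a query joins the central fact table to whichever dimension tables it needs, then aggregates a measure. The table and column names below are hypothetical, loosely modeled on the AllElectronics example.

```python
import pandas as pd

# Hypothetical star schema: one fact table plus two dimension tables.
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 11, 10],
    "dollars_sold": [1200.0, 800.0, 1500.0],
})
time_dim = pd.DataFrame({"time_key": [1, 2], "year": [2008, 2009]})
item_dim = pd.DataFrame({"item_key": [10, 11], "type": ["laptop", "phone"]})

# A query joins the fact table outward to each dimension ("starburst" joins).
joined = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(item_dim, on="item_key"))

# Aggregate the measure: total dollars sold per year and item type.
report = joined.groupby(["year", "type"])["dollars_sold"].sum()
print(report)
```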
2. Snowflake Schema
• The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
• The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies.
• Such a table is easy to maintain and saves storage space. However, this space savings is
negligible in comparison to the typical magnitude of the fact table.
• Furthermore, the snowflake structure can reduce the effectiveness of browsing, since
more joins will be needed to execute a query. Consequently, the system performance
may be adversely impacted.
Contd.
• Hence, although the snowflake schema reduces redundancy, it is not as popular
as the star schema in data warehouse design.
• A snowflake schema for AllElectronics sales is given in the figure. Here, the
sales fact table is identical to that of the star schema. The main difference
between the two schemas is in the definition of dimension tables. The single
dimension table for item in the star schema is normalized in the snowflake
schema, resulting in new item and supplier tables.
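The normalization step that turns a star dimension into a snowflake can be sketched in pandas. The item/supplier columns are hypothetical, modeled on the AllElectronics example: supplier attributes repeated on every item row are split into their own table, leaving only a foreign key behind.

```python
import pandas as pd

# Hypothetical denormalized item dimension (star schema): supplier
# attributes are repeated on every item row.
item_dim = pd.DataFrame({
    "item_key":      [10, 11, 12],
    "item_name":     ["laptop", "phone", "tablet"],
    "supplier_key":  [1, 1, 2],
    "supplier_name": ["Acme", "Acme", "Globex"],
})

# Snowflaking: normalize the supplier attributes into their own table...
supplier_dim = (item_dim[["supplier_key", "supplier_name"]]
                .drop_duplicates()
                .reset_index(drop=True))

# ...and keep only the foreign key in the item table.
item_dim_norm = item_dim.drop(columns=["supplier_name"])

# The cost: queries now need an extra join to recover supplier names.
recovered = item_dim_norm.merge(supplier_dim, on="supplier_key")
print(len(supplier_dim))  # redundancy removed: 2 supplier rows instead of 3
```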
3. Fact Constellation
• Sophisticated applications may require multiple fact tables to share dimension
tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
• This schema specifies two fact tables, sales and shipping. The sales table
definition is identical to that of the star schema. The shipping table has five
dimensions, or keys (item key, time key, shipper key, from location, and to
location) and two measures (dollars cost and units_shipped).
Contd.
• A fact constellation schema allows dimension tables to be shared between fact tables.
• For example, the dimensions tables for time, item, and location are shared between the sales and
shipping fact tables.
• In data warehousing, there is a distinction between a data warehouse and a data mart. A data
warehouse collects information about subjects that span the entire organization, such as
customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.
• For data warehouses, the fact constellation schema is commonly used, since it
can model multiple, interrelated subjects. A data mart, on the other hand, is a
departmental subset of the data warehouse that focuses on selected subjects, and
thus its scope is department-wide.
• For data marts, the star or snowflake schema is commonly used, since both are
geared toward modeling single subjects, although the star schema is more
popular and efficient.
Data Warehouse Models/ Schemas
• From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
1. Enterprise Warehouse
• An enterprise warehouse collects all of the information about subjects spanning the
entire organization.
• It typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
• It requires extensive business modeling and may take years to design and build.
2. Data Mart
• A data mart contains a subset of corporate-wide data that is of value to a specific
group of users.
• The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales.
• Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based.
• The implementation cycle of a data mart is more likely to be measured in weeks rather
than months or years.
Contd.
• However, it may involve complex integration in the long run if its design and planning
were not enterprise-wide.
• Independent data marts are sourced from data captured from one or more operational
systems or external information providers, or from data generated locally within a
particular department or geographic area.
• Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual Warehouse
• A virtual warehouse is a set of views over operational databases.
• For efficient query processing, only some of the possible summary views may be
materialized.
Data Cube: A Multidimensional Data Model
• Data warehouses and OLAP tools are based on a multidimensional data model. This model
views data in the form of a data cube.
• A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
• In general terms, dimensions are the perspectives or entities with respect to which an organization
wants to keep records.
• Each dimension may have a table associated with it, called a dimension table, which further
describes the dimension.
• For example, a dimension table for item may contain the attributes item name, brand, and type.
• Dimension tables can be specified by users or experts, or automatically generated and adjusted
based on data distributions.
Contd.
• A multidimensional data model is typically organized around a central theme, such
as sales.
• Examples of facts for a sales data warehouse include dollars sold (sales amount in
dollars), units sold (number of units sold), and amount budgeted.
Example
• Imagine that you have collected the data for your analysis. These data consist of the
AllElectronics sales per quarter, for the years 2008 to 2010.
Contd.
• You are, however, interested in the annual sales (total per year), rather than the total
per quarter.
• Thus, the data can be aggregated so that the resulting data summarize the total
sales per year instead of per quarter.
• The resulting data set is smaller in volume, without loss of information necessary
for the analysis task.
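The quarter-to-year aggregation described above is a one-line group-by. The quarterly sales figures below are hypothetical, standing in for the AllElectronics data.

```python
import pandas as pd

# Hypothetical quarterly sales (dollars in thousands).
quarterly = pd.DataFrame({
    "year":    [2008] * 4 + [2009] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "sales":   [224, 408, 350, 586, 300, 420, 380, 600],
})

# Aggregate away the quarter dimension: total sales per year.
# The result is smaller in volume but loses no information needed
# for an annual-sales analysis.
annual = quarterly.groupby("year")["sales"].sum()
print(annual)
```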
2-D DATA CUBE
• In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension
(organized in quarters) and the item dimension (organized according to the types of items sold). The fact
or measure displayed is dollars sold (in thousands).
3-D DATA CUBE
• Suppose we would like to view the data according to time and item, as well as
location, for the cities Chicago, New York, Toronto, and Vancouver.
• The 3-D data in the table are represented as a series of 2-D tables.
Contd.
• For example, Figure below shows a data cube for multidimensional analysis of
sales data with respect to annual sales per item type for each AllElectronics branch.
Each cell holds an aggregate data value, corresponding to the data point in
multidimensional space.
Operations on Cubes
• Roll-Up
• Drill-Down
• Slice & Dice
• Pivot
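The cube operations listed above can be sketched on a flat table with pandas. This is only an analogy (a real OLAP engine precomputes aggregates); the mini-cube of cities, items, and sales below is hypothetical.

```python
import pandas as pd

# Hypothetical mini data cube as a flat table: two dimensions shown
# (city, item) plus a year column, with one measure (sales).
cube = pd.DataFrame({
    "city": ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "item": ["phone", "laptop", "phone", "laptop"],
    "year": [2008, 2008, 2008, 2008],
    "sales": [100, 200, 150, 250],
})

# Roll-up: climb a concept hierarchy (here, aggregate cities away).
rollup = cube.groupby("item")["sales"].sum()

# Slice: fix one dimension to a single value (city = "Toronto").
slice_ = cube[cube["city"] == "Toronto"]

# Dice: fix several dimensions at once, producing a subcube.
dice = cube[(cube["city"] == "Toronto") & (cube["item"] == "phone")]

# Pivot (rotate): reorient the view, e.g. items as rows, cities as columns.
pivot = cube.pivot_table(index="item", columns="city", values="sales")
print(pivot)
```

Drill-down is the reverse of roll-up: moving from `rollup` back toward the detailed `cube` rows (or to an even finer granularity, if stored).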
Extraction, Transformation, and Loading (ETL) /ETL Operations
• Data warehouse systems use back-end tools and utilities to populate and refresh their data.
• These tools and utilities include the following functions:
– Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources.
– Data cleaning, which detects errors in the data and rectifies them when possible.
– Data transformation, which converts data from legacy or host format to warehouse format.
– Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions.
– Refresh, which propagates the updates from the data sources to the warehouse.
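The ETL functions above can be sketched as a chain of small Python functions. Everything here is hypothetical (the source formats, the cleaning rule, the warehouse structure); it only illustrates how extract, clean, transform, and load compose.

```python
# Minimal ETL sketch (hypothetical sources and warehouse format).

def extract(sources):
    """Data extraction: gather rows from multiple heterogeneous sources."""
    return [row for src in sources for row in src]

def clean(rows):
    """Data cleaning: detect errors; here, simply drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Data transformation: convert source formats to the warehouse format."""
    return [{"item": r["item"].lower(), "amount": float(r["amount"])}
            for r in rows]

def load(warehouse, rows):
    """Load: consolidate and summarize into the warehouse (totals per item)."""
    for r in rows:
        warehouse[r["item"]] = warehouse.get(r["item"], 0.0) + r["amount"]

# Two "source systems" with slightly different conventions.
source_a = [{"item": "Laptop", "amount": "1200"}]
source_b = [{"item": "PHONE", "amount": "800"}, {"item": None, "amount": "5"}]

warehouse = {}
load(warehouse, transform(clean(extract([source_a, source_b]))))
print(warehouse)
```

A refresh step would rerun this pipeline on new source rows and merge the updates into `warehouse`.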
Data Preprocessing
• How can the data be preprocessed in order to help improve the quality of the data?
• How can the data be preprocessed in order to help improve the mining results?
• How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?
58
Why do we need Data Pre-processing? (Overview)
• In large real-world databases, three of the most common problems are
incomplete, noisy, and inconsistent data.
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and accessibility.
Why are there data problems?
• There are many possible reasons for inaccurate data (e.g., having incorrect
attribute values).
• The instrument used for data collection may be faulty, or there may be human
or computer errors occurring at data entry.
• Users may purposely submit incorrect data values for mandatory fields when they
do not wish to submit personal information (e.g., by choosing the default value
“January 1” displayed for birthday). This is known as disguised missing data.
Errors in data transmission can also occur.
• Incorrect data may also result from inconsistencies in naming conventions or data
codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also
require data cleaning.
• Data integration merges data from multiple sources into a coherent data store such
as a data warehouse.
• Data transformations (e.g., normalization) may be applied, where data are scaled
to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and
efficiency of mining algorithms involving distance measurements.
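The normalization mentioned above (min-max scaling into a range like 0.0 to 1.0) is a short formula: v' = (v - min) / (max - min) × (new_max - new_min) + new_min. A minimal sketch, with hypothetical sample values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

# Hypothetical attribute values, e.g. customer ages.
ages = [20, 30, 40, 60]
print(min_max_normalize(ages))  # smallest -> 0.0, largest -> 1.0
```

Scaling all attributes into the same range keeps one attribute (e.g. income) from dominating distance computations over another (e.g. age).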
• Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.
• These techniques are not mutually exclusive; they may work together. For example,
data cleaning can involve transformations to correct wrong data, such as by
transforming all entries for a date field to a common format.
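The date-field example above (cleaning via transformation to a common format) can be sketched with the standard library. The input formats listed are hypothetical; a real pipeline would enumerate whatever formats its sources actually produce.

```python
from datetime import datetime

# Hypothetical date entries in inconsistent formats from different sources.
raw_dates = ["2009-03-01", "03/01/2009", "01 Mar 2009"]
known_formats = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"]

def to_common_format(value):
    """Try each known input format; emit one common format (ISO 8601)."""
    for fmt in known_formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"unrecognized date format: {value!r}")

print([to_common_format(d) for d in raw_dates])
```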
Forms of data preprocessing