
Data Mining and Business Intelligence

Unit 1: Data Warehousing

Taught By –
Mrs. Veena Kiragi
School of Computer Science Engg. and Applications
DY Patil International University

What is KDD?
• Knowledge Discovery from Data/Databases (KDD) is also known as pattern analysis or knowledge extraction.

• In today's world, there is a large volume of data.

• Sources of data –
– Business, such as e-commerce,
– Science, such as the biomedical field,
– Society, such as news, digital cameras, smartphones, and apps.
Contd.
• Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD.

• Others view data mining as merely an essential step in the process of knowledge discovery.

• Aim of this subject – to learn
– to extract meaningful information from large-scale data, and
– interesting patterns or knowledge from a given dataset.
Why is KDD/Data Science becoming popular now?

• Converting data into meaningful information is the work of a data scientist.

• Before the Internet era, data existed in very small amounts and could be managed in a few files, such as Excel sheets.

• Now, 10–15 years into the Internet era, enormous volumes of data are present.

• The emergence of Facebook in 2004, YouTube in 2005, and then Instagram, Twitter, etc. drew people onto the internet.
Contd.
• The affordability of the internet for the common man has revolutionized this field.

• As a result, a large amount of data is being added to databases every second.

• Significant developments in computers and high-performance systems have helped in handling this huge amount of data.

• There is a need for data scientists to develop and explore advanced algorithms so that meaningful information can be extracted from these large databases in quick time.
Contd.
• Most data scientist tasks nowadays relate to datasets collected from people's activity on the internet.

• Any activity by an individual, such as watching a video on YouTube, checking reels on Instagram, or browsing feeds on Facebook, leaves an imprint on the internet.

• That imprint is recorded by companies to provide better service to the customer. A data scientist does this by observing the activity of millions of people.
Introduction to KDD Process
The KDD process proceeds through the following steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

[Figure: Data mining as a step in the process of knowledge discovery. Source: Han et al., Data Mining: Concepts and Techniques.]
Data Cleaning – to remove noise and inconsistent data.

Data Integration – where multiple data sources may be combined.

Data Selection – where data relevant to the analysis task are retrieved from the database.

Data Transformation – where data are transformed and consolidated into forms appropriate for mining.

Data Mining – an essential process where intelligent methods are applied to extract data patterns.

Pattern Evaluation – to identify the truly interesting patterns representing knowledge based on interestingness measures.

Knowledge Presentation – where visualization and knowledge representation techniques are used to present mined knowledge to users.
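To make these steps concrete, here is a minimal sketch of the KDD process in Python with pandas. The file names, column names, and the interestingness threshold are hypothetical, and each step is deliberately simplified.

```python
import pandas as pd

# Data cleaning: drop rows with missing values and remove duplicates (noise/inconsistency).
sales = pd.read_csv("sales.csv").dropna().drop_duplicates()

# Data integration: combine a second (hypothetical) source with the sales data.
customers = pd.read_csv("customers.csv")
data = sales.merge(customers, on="customer_id")

# Data selection: keep only the attributes relevant to the analysis task.
data = data[["customer_id", "region", "amount", "quantity"]]

# Data transformation: consolidate into a form appropriate for mining
# (here, min-max scaling of the purchase amount to [0, 1]).
data["amount_scaled"] = (data["amount"] - data["amount"].min()) / (
    data["amount"].max() - data["amount"].min()
)

# Data mining: a trivial "pattern" -- average spend per region.
patterns = data.groupby("region")["amount"].mean()

# Pattern evaluation: keep only patterns passing an (arbitrary) interestingness threshold.
interesting = patterns[patterns > patterns.mean()]

# Knowledge presentation: report the result to the user.
print(interesting)
```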
Introduction to Data Mining

• Data mining is the process of discovering interesting patterns and knowledge from large amounts of data.

• Data are the most elementary descriptions of things, events, activities, etc.

• The data sources can include databases, data warehouses, the Web, other information repositories, or data that are streamed into the system dynamically.

• The data mining field is continuously evolving and will certainly continue to embrace new data types as they emerge.
Why there is a need for Data Mining

• A huge amount of data is being generated every second.

• There is always competitive pressure on owners to provide better services to the customer.

• Data mining helps to develop smart marketing decisions, run accurate campaigns, make predictions, and more.

• With the help of data mining, companies can analyze customer behavior and gain insights into it. This leads to great success and a data-driven business.
Database Management System (DBMS)

• It consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.

• A database is a logical grouping of data, or a collection of related data (e.g., structured and unstructured).

• A DBMS is a computerized data-keeping system.

• A DBMS helps provide data security, data integrity, concurrency, and uniform data administration procedures.
Contd.

• A DBMS serves as an interface between an end user and a database, allowing users to create, read, update, and delete data in the database (a sketch of these four operations follows below).

• A DBMS manages the data, the database engine, and the database schema, allowing data to be manipulated or extracted by users and other programs.

• The most widely used types of DBMS software are relational, distributed, hierarchical, object-oriented, and network.
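As a small illustration of the create, read, update, and delete operations mentioned above, here is a hedged sketch using Python's built-in sqlite3 module; the students table and its columns are invented for the example.

```python
import sqlite3

# An in-memory database with a hypothetical "students" table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new record.
cur.execute("INSERT INTO students (name) VALUES (?)", ("Asha",))

# Read: query existing records.
print(cur.execute("SELECT * FROM students").fetchall())

# Update: modify an existing record.
cur.execute("UPDATE students SET name = ? WHERE id = ?", ("Asha K.", 1))

# Delete: remove a record.
cur.execute("DELETE FROM students WHERE id = ?", (1,))

conn.commit()
conn.close()
```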
Data Mining Functionalities
• Data mining functionalities are used to specify the kinds of patterns to be found
in data mining tasks.

• In general, such tasks can be classified into two categories: descriptive and predictive.

• Descriptive mining tasks characterize properties of the data in a target data set.

• Predictive mining tasks perform induction on the current data in order to make predictions.
The main data mining functionalities are:
• Characterization and discrimination
• Mining of frequent patterns, associations, and correlations
• Classification and regression
• Clustering analysis
• Outlier analysis
Data Warehouse
• Data warehouse systems are valuable tools in today’s competitive, fast-evolving world.
A data warehouse is a central repository of information that can be analyzed to make
more informed decisions.

• A data warehouse refers to a data repository that is maintained separately from an organization's operational databases.

• Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historic data for analysis.

• Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions.
A data warehouse is the place where the valuable data assets of an organization are stored.

[Figure: examples of an organization's data assets.]
Contd.
• A data warehouse is a type of data management system that is designed to enable and
support business intelligence (BI) activities.

• Data warehouses are solely intended to perform queries and analysis and often contain
large amounts of historical data.

• The data within a data warehouse is usually derived from a wide range of sources such as
application log files and transaction applications.

• A data warehouse centralizes and consolidates large amounts of data from multiple sources.

• Its analytical capabilities allow organizations to derive valuable business insights from
their data to improve decision-making.

Contd.
• Over time, it builds a historical record that can be invaluable to data scientists and business analysts.

• Because of these capabilities, a data warehouse can be considered an organization's "single source of truth."

• According to William H. Inmon, a leading architect in the construction of data warehouse systems, "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making process."

• These four keywords (subject-oriented, integrated, time-variant, and nonvolatile) distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems.
Subject Oriented
• A data warehouse is organized around major subjects such as customer, supplier,
product, and sales.

• Rather than concentrating on the day-to-day operations and transaction processing of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.

• Hence, data warehouses typically provide a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
Integrated
• A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.

• Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.

• Data warehouses create consistency among different data types from disparate
sources.

Time-variant
• Data are stored to provide information from an historic perspective (e.g., the past 5–10
years).

• Every key structure in the data warehouse contains, either implicitly or explicitly, a
time element.

• Data warehouse analysis looks at change over time.

Non-volatile
• A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment.

• Due to this separation, a data warehouse does not require transaction processing,
recovery, and concurrency control mechanisms.

• It usually requires only two operations in data accessing: initial loading of data and
access of data.

• Once data is in a data warehouse, it’s stable and doesn’t change. It can only be deleted.

What are the benefits of using a data warehouse?

• Benefits of a data warehouse include the following:

a) Informed decision making

b) Consolidated data from many sources

c) Historical data analysis

d) Data quality, consistency, and accuracy

e) Separation of analytics processing from transactional databases, which improves the performance of both systems.
Data Warehousing Architecture
• A data warehouse is a semantically consistent data store that serves as a physical
implementation of a decision support data model.

• It stores the information an enterprise needs to make strategic decisions.

• A data warehouse is also often viewed as an architecture, constructed by integrating data from multiple heterogeneous sources to support structured and/or ad hoc queries, analytical reporting, and decision making.

• Based on this, data warehousing is the process of constructing and using data warehouses.
Contd…
• A data warehouse architecture is made up of a three-tier system.

• Data is stored in two different ways: 1) data that is accessed frequently is stored in very fast storage (like SSD drives), and 2) data that is infrequently accessed is stored in a cheap object store, like Amazon S3.

• The data warehouse will automatically make sure that frequently accessed data is moved into the "fast" storage so query speed is optimized.
• The top tier is the front-end client that presents the results of queries through reporting, analysis, and data mining tools.

• The middle tier is an OLAP server consisting of the analytics engine that is used to access and analyze the data.

• It is typically implemented using either (1) a relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations), or (2) a multidimensional OLAP (MOLAP) model.

• The bottom tier of the architecture is the warehouse database server (almost always a relational database system), where data is loaded and stored.
[Figure: data warehouse architecture diagrams. Source: https://www.youtube.com/watch?v=q9oAZwhuUy4]
OLAP and OLTP Systems
• OLAP : On Line Analytical Processing

• OLTP : On Line Transaction Processing

Comparison of OLTP and OLAP Systems

[Figure: comparison of OLTP and OLAP systems. Source: https://www.youtube.com/watch?v=q9oAZwhuUy4]
OLTP systems:
• These systems are called online transaction processing (OLTP) systems.
• The major task of these systems is to perform online transaction and query processing.
• They cover most of the day-to-day operations of an organization, such as purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.
• An OLTP system is customer-oriented.
• An OLTP system manages current data that, typically, are too detailed to be easily used for decision making.
• An OLTP system usually adopts an entity-relationship (ER) data model and an application-oriented database design.
• An OLTP system focuses mainly on the current data within an enterprise or department, without referring to historic data or data in different organizations.
• The access patterns of an OLTP system consist mainly of short, atomic transactions. Such a system requires concurrency control and recovery mechanisms.

OLAP systems:
• These systems are known as online analytical processing (OLAP) systems.
• Data warehouse systems serve users or knowledge workers in the role of data analysis and decision making.
• Such systems can organize and present data in various formats in order to accommodate the diverse needs of different users.
• An OLAP system is market-oriented.
• An OLAP system manages large amounts of historic data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity.
• An OLAP system typically adopts either a star or a snowflake model and a subject-oriented database design.
• An OLAP system often spans multiple versions of a database schema, due to the evolutionary process of an organization.
• OLAP systems also deal with information that originates from different organizations, integrating information from many data stores.
• Because of their huge volume, OLAP data are stored on multiple storage media.
• Access to OLAP systems is mostly read-only operations (because most data warehouses store historic rather than up-to-date information).
Comparison of OLTP and OLAP Systems

Feature – OLTP vs. OLAP:
• Characteristic: operational processing vs. informational processing
• Orientation: transaction vs. analysis
• User: clerk, DBA, database professional vs. knowledge worker (e.g., manager, executive, analyst)
• Function: day-to-day operations vs. long-term informational requirements, decision support
• DB design: ER-based, application-oriented vs. star/snowflake, subject-oriented
• Data: current, guaranteed up-to-date vs. historic, accuracy maintained over time
• Summarization: primitive, highly detailed vs. summarized, consolidated
• View: detailed, flat relational vs. summarized, multidimensional
• Unit of work: short, simple transaction vs. complex query
Contd.
• Access: read/write vs. mostly read
• Focus: data in vs. information out
• Number of records accessed: tens vs. millions
• Number of users: thousands vs. hundreds
• DB size: GB to high-order GB vs. ≥ TB
• Priority: high performance, high availability vs. high flexibility, end-user autonomy
• Metric: transaction throughput vs. query throughput, response time
Dimensional Data Models
• The entity-relationship data model is commonly used in the design of relational
databases, where a database schema consists of a set of entities and the relationships
between them.

• Such a data model is appropriate for online transaction processing.

• A data warehouse, however, requires a concise, subject-oriented schema that facilitates online data analysis.

• The most popular data model for a data warehouse is a multidimensional model, which can exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
1. Star Schema
• The most common modeling paradigm is the star schema, in which the data warehouse
contains:
(1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.

• The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.

• A star schema for AllElectronics sales is shown in the figure here. Sales are considered along four dimensions: time, item, branch, and location.

• The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars sold and units sold.

• Each dimension is represented by only one table, and each table contains a set of attributes.
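To make the star layout concrete, here is a hedged sketch in Python using the built-in sqlite3 module. Only the time and item dimensions are written out (branch and location would follow the same pattern), and the exact column names are assumptions based on the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: one per dimension, each with its own attributes.
cur.execute(
    "CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER)"
)
cur.execute(
    "CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT)"
)

# Central fact table: keys to each dimension plus the two measures.
cur.execute("""
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        dollars_sold REAL,
        units_sold INTEGER
    )
""")

# A typical star-schema query joins the fact table to its dimensions.
rows = cur.execute("""
    SELECT t.year, i.type, SUM(f.dollars_sold)
    FROM sales_fact f
    JOIN time_dim t ON f.time_key = t.time_key
    JOIN item_dim i ON f.item_key = i.item_key
    GROUP BY t.year, i.type
""").fetchall()
print(rows)  # empty here, since no rows were loaded
```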
2. Snowflake Schema
• The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.

• The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies.

• Such a table is easy to maintain and saves storage space. However, this space savings is
negligible in comparison to the typical magnitude of the fact table.

• Furthermore, the snowflake structure can reduce the effectiveness of browsing, since
more joins will be needed to execute a query. Consequently, the system performance
may be adversely impacted.

Contd.
• Hence, although the snowflake schema reduces redundancy, it is not as popular as the
star schema in data warehouse design.
• A snowflake schema for
AllElectronics sales is given in
Figure.
• Here, the sales fact table is
identical to that of the star
schema.
• The main difference between the
two schemas is in the definition
of dimension tables.
• The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables.
3. Fact Constellation
• Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
• This schema specifies two fact tables, sales and shipping.

• The sales table definition is identical to that of the star schema.

• The shipping table has five dimensions, or keys (item key, time key, shipper key, from location, and to location) and two measures (dollars cost and units shipped).
Contd.
• A fact constellation schema allows dimension tables to be shared between fact tables.

• For example, the dimensions tables for time, item, and location are shared between the sales and
shipping fact tables.

• In data warehousing, there is a distinction between a data warehouse and a data mart. A data
warehouse collects information about subjects that span the entire organization, such as
customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.

• For data warehouses, the fact constellation schema is commonly used, since it can model
multiple, interrelated subjects. A data mart, on the other hand, is a department subset of the data
warehouse that focuses on selected subjects, and thus its scope is departmentwide.

• For data marts, the star or snowflake schema is commonly used, since both are geared toward
modeling single subjects, although the star schema is more popular and efficient.

Data Warehouse Models/Schemas

• From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse.
1. Enterprise Warehouse
• An enterprise warehouse collects all of the information about subjects spanning the entire organization.

• It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.

• It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

• An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms.

• It requires extensive business modeling and may take years to design and build.
2. Data Mart
• A data mart contains a subset of corporate-wide data that is of value to a specific
group of users.

• The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales.

• The data contained in data marts tend to be summarized.

• Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based.

• The implementation cycle of a data mart is more likely to be measured in weeks rather
than months or years.

Contd.
• However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.

• Depending on the source of data, data marts can be categorized as independent or dependent.

• Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area.

• Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual Warehouse
• A virtual warehouse is a set of views over operational databases.

• For efficient query processing, only some of the possible summary views may be materialized.

• A virtual warehouse is easy to build but requires excess capacity on operational database servers.
Data Cube: A Multidimensional Data Model
• Data warehouses and OLAP tools are based on a multidimensional data model. This model
views data in the form of a data cube.

• A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.

• In general terms, dimensions are the perspectives or entities with respect to which an organization
wants to keep records.

• Each dimension may have a table associated with it, called a dimension table, which further
describes the dimension.

• For example, a dimension table for item may contain the attributes item name, brand, and type.

• Dimension tables can be specified by users or experts, or automatically generated and adjusted
based on data distributions.

Contd.
• A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by facts, which are numeric measures.

• Think of facts as the quantities by which we want to analyze relationships between dimensions.

• Examples of facts for a sales data warehouse include dollars sold (sales amount in dollars), units sold (number of units sold), and amount budgeted.
Example

• Imagine that you have collected the data for your analysis. These data consist of the
AllElectronics sales per quarter, for the years 2008 to 2010.

Contd.

• You are, however, interested in the annual sales (total per year), rather than the total
per quarter.

• Thus, the data can be aggregated so that the resulting data summarize the total
sales per year instead of per quarter.

• The resulting data set is smaller in volume, without loss of information necessary
for the analysis task.

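As a rough illustration, this quarter-to-year aggregation can be sketched in Python with pandas; the sales figures below are invented, since the original AllElectronics table is not reproduced here.

```python
import pandas as pd

# Hypothetical quarterly sales (in thousands of dollars).
quarterly = pd.DataFrame({
    "year":    [2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [400, 460, 440, 500, 420, 480, 460, 520],
})

# Aggregate quarters into annual totals: a smaller data set with no
# loss of information for an analysis that only needs yearly sales.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```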
2-D DATA CUBE
• In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension (organized in quarters) and the item dimension (organized according to the types of items sold). The fact or measure displayed is dollars sold (in thousands).

• A simple 2-D data cube is, in fact, a table or spreadsheet, such as one for sales data from AllElectronics.

• In particular, we will look at the AllElectronics sales data for items sold per quarter in the city of Vancouver.
3-D DATA CUBE
• Suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver.

• The 3-D data in the table are represented as a series of 2-D tables.
Contd.

• Data cubes store multidimensional aggregated information.

• For example, the figure below shows a data cube for multidimensional analysis of sales data with respect to annual sales per item type for each AllElectronics branch. Each cell holds an aggregate data value, corresponding to the data point in multidimensional space.
Operations on Cubes

The main OLAP operations on data cubes are:
• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)

[Figures: illustrations of these cube operations.]
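A hedged sketch of these four operations in Python with pandas, on an invented miniature sales cube; pandas is not an OLAP server, so each snippet only mimics the operation's effect.

```python
import pandas as pd

# A hypothetical miniature sales cube with three dimensions and one measure.
cube = pd.DataFrame({
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["phone", "phone", "computer", "computer"],
    "sales":   [100, 150, 200, 250],
})

# Roll-up: aggregate to a coarser level, here by dropping the city and item dimensions.
rollup = cube.groupby("quarter")["sales"].sum()

# Drill-down: the reverse -- view the data at a finer level (quarter and city).
drilldown = cube.groupby(["quarter", "city"])["sales"].sum()

# Slice: select along one dimension, yielding a subcube.
slice_q1 = cube[cube["quarter"] == "Q1"]

# Dice: select along two or more dimensions.
dice = cube[(cube["quarter"] == "Q1") & (cube["city"] == "Vancouver")]

# Pivot (rotate): reorient the axes of the view.
pivot = cube.pivot_table(index="city", columns="quarter", values="sales", aggfunc="sum")
print(pivot)
```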
Extraction, Transformation, and Loading (ETL) Operations

• Data warehouse systems use back-end tools and utilities to populate and refresh their data.
• These tools and utilities include the following functions:

– Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources.

– Data cleaning, which detects errors in the data and rectifies them when possible.

– Data transformation, which converts data from legacy or host format to warehouse format.

– Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions.

– Refresh, which propagates the updates from the data sources to the warehouse.

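A minimal, hedged sketch of these five functions in Python; all file names, column names, and table names are hypothetical.

```python
import sqlite3
import pandas as pd

# Extraction: gather data from multiple, heterogeneous sources
# (two hypothetical files in different formats).
orders = pd.read_csv("orders.csv")
customers = pd.read_json("customers.json")

# Data cleaning: detect errors and rectify them when possible.
orders = orders.dropna(subset=["order_id"]).drop_duplicates(subset="order_id")

# Data transformation: convert from the source format to the warehouse
# format (e.g., normalize dates to a single ISO format).
orders["order_date"] = pd.to_datetime(orders["order_date"]).dt.strftime("%Y-%m-%d")

# Load: consolidate (here, a join) and build the warehouse table.
fact = orders.merge(customers, on="customer_id")
warehouse = sqlite3.connect("warehouse.db")
fact.to_sql("orders_fact", warehouse, if_exists="replace", index=False)

# Refresh: propagate updates from the data sources to the warehouse.
delta = pd.read_csv("orders_delta.csv")
delta.merge(customers, on="customer_id").to_sql(
    "orders_fact", warehouse, if_exists="append", index=False
)
```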
Data Preprocessing

What is the aim?

• How can the data be preprocessed in order to help improve the quality of the data?

• How can the data be preprocessed in order to help improve the mining results?

• How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?

Why do we need Data Pre-processing? (Overview)
• In large real-world databases, three of the most common problems are inaccurate, incomplete, and inconsistent data.

• Data preprocessing turns such data into accurate, complete, and consistent data.
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and accessibility.
Why are there data problems?
• There are many possible reasons for inaccurate data (e.g., having incorrect attribute values).

• The instrument used for data collection may be faulty, there may be human or
computer errors occurring at data entry.

• Users may purposely submit incorrect data values for mandatory fields when they
do not wish to submit personal information (e.g., by choosing the default value
“January 1” displayed for birthday). This is known as disguised missing data.
Errors in data transmission can also occur.

• Incorrect data may also result from inconsistencies in naming conventions or data
codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also
require data cleaning.

• Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
What are Data Pre-processing techniques/tasks?
• Data cleaning can be applied to remove noise and correct inconsistencies in data.

• Data integration merges data from multiple sources into a coherent data store such
as a data warehouse.

• Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range, like 0.0 to 1.0 (see the sketch after this list). This can improve the accuracy and efficiency of mining algorithms involving distance measurements.

• Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.

• These techniques are not mutually exclusive; they may work together. For example,
data cleaning can involve transformations to correct wrong data, such as by
transforming all entries for a date field to a common format.

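As a quick illustration of the normalization mentioned in the transformation bullet above, here is a hedged sketch of min-max scaling to [0.0, 1.0]; the input values are invented.

```python
# Min-max normalization: scale each value v into [0.0, 1.0] via
# v' = (v - min) / (max - min). The input values here are invented.
values = [200.0, 300.0, 400.0, 600.0, 1000.0]

lo, hi = min(values), max(values)
scaled = [(v - lo) / (hi - lo) for v in values]
print(scaled)  # [0.0, 0.125, 0.25, 0.5, 1.0]
```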
Forms of data preprocessing
