Data Mining and Business Intelligence
Taught By –
Mrs. Veena Kiragi
School of Computer Science Engg. and Applications
DY Patil International University
What is KDD?
• Knowledge Discovery from Data/Databases (KDD), also referred to as pattern
analysis or knowledge extraction.
• Sources of data –
– Business, such as e-commerce,
– Science, such as the biomedical field,
– Society, such as news, digital cameras, smartphones, and apps.
Contd.
• Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD.
• In the decades since the internet became ubiquitous, enormous volumes of data
have accumulated.
The KDD process: Data Selection → Data Transformation → Data Mining →
Pattern Evaluation → Knowledge Presentation
• Data Selection, where data relevant to the analysis task are retrieved from the
database.
• Data Transformation, where data are transformed and consolidated into forms
appropriate for mining.
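The sequence of KDD steps above can be illustrated with a toy pipeline. This is only a sketch; the sample records, the region filter, and the "interestingness" threshold are all hypothetical.

```python
# Toy sketch of the KDD process steps (hypothetical data and thresholds).

# Raw "database": transaction records, some irrelevant to the task.
records = [
    {"item": "laptop", "amount": 1200, "region": "west"},
    {"item": "phone",  "amount": 800,  "region": "east"},
    {"item": "laptop", "amount": 1100, "region": "west"},
    {"item": "desk",   "amount": 300,  "region": "west"},
]

# Data selection: retrieve only the records relevant to the analysis task.
selected = [r for r in records if r["region"] == "west"]

# Data transformation: consolidate into a form appropriate for mining.
amounts_by_item = {}
for r in selected:
    amounts_by_item.setdefault(r["item"], []).append(r["amount"])

# Data mining: extract a simple pattern (average sale per item).
patterns = {item: sum(v) / len(v) for item, v in amounts_by_item.items()}

# Pattern evaluation: keep only "interesting" patterns (here, average > 500).
interesting = {k: v for k, v in patterns.items() if v > 500}

# Knowledge presentation: report the result to the user.
for item, avg in interesting.items():
    print(f"{item}: average sale {avg:.0f}")
```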
Database Management System (DBMS)
• A DBMS manages the data, the database engine, and the database schema,
allowing data to be manipulated or extracted by users and other programs.
Data Mining Functionalities
• Data mining functionalities are used to specify the kinds of patterns to be found
in data mining tasks. They fall into two categories: descriptive and predictive.
• Descriptive mining tasks characterize properties of the data in a target data set.
• Predictive mining tasks perform induction on the current data in order to make
predictions.
The main data mining functionalities are:
• Characterization and Discrimination
• Mining of frequent patterns, associations, and correlations
• Classification and Regression
• Clustering Analysis
• Outlier Analysis
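As a small illustration of one functionality, mining frequent patterns can be sketched by counting item pairs that co-occur in transactions. The transaction data and the support threshold below are hypothetical; this is a naive pairwise count, not a full Apriori implementation.

```python
from collections import Counter
from itertools import combinations

# Toy transaction database (hypothetical) for frequent-pattern mining.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk", "butter"},
    {"bread"},
]
min_support = 2  # a pattern must appear in at least 2 transactions

# Count every pair of items that occurs together in some transaction.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Keep only the pairs that meet the minimum support threshold.
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)
```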
Data Warehouse
• Data warehouse systems are valuable tools in today’s competitive, fast-evolving world.
A data warehouse is a central repository of information that can be analyzed to make
more informed decisions.
• Data warehouse systems allow for integration of a variety of application systems. They
support information processing by providing a solid platform of consolidated historic
data for analysis.
A data warehouse is the place where the valuable data assets of an organization
are stored.
Contd.
• A data warehouse is a type of data management system that is designed to enable and
support business intelligence (BI) activities.
• Data warehouses are solely intended to perform queries and analysis and often contain
large amounts of historical data.
• The data within a data warehouse is usually derived from a wide range of sources such as
application log files and transaction applications.
• A data warehouse centralizes and consolidates large amounts of data from multiple sources.
• Its analytical capabilities allow organizations to derive valuable business insights from
their data to improve decision-making.
Contd.
• Over time, it builds a historical record that can be invaluable to data scientists and
business analysts.
Subject Oriented
• A data warehouse is organized around major subjects such as customer, supplier,
product, and sales.
• Hence, data warehouses typically provide a simple and concise view of particular
subject issues by excluding data that are not useful in the decision support process.
Integrated
• A data warehouse is usually constructed by integrating multiple heterogeneous
sources, such as relational databases, flat files, and online transaction records.
• Data cleaning and data integration techniques are applied to ensure consistency in
naming conventions, encoding structures, attribute measures, and so on.
• Data warehouses create consistency among different data types from disparate
sources.
Time-variant
• Data are stored to provide information from an historic perspective (e.g., the past 5–10
years).
• Every key structure in the data warehouse contains, either implicitly or explicitly, a
time element.
Non-volatile
• A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment.
• Due to this separation, a data warehouse does not require transaction processing,
recovery, and concurrency control mechanisms.
• It usually requires only two operations in data accessing: initial loading of data and
access of data.
• Once data is in a data warehouse, it is stable: it is not updated, though it may
eventually be archived or deleted.
What are the benefits of using a data warehouse?
• A data warehouse provides a consolidated, historical, analysis-ready view of
enterprise data, supporting better-informed decisions.
• Building on this, data warehousing is the process of constructing and using
data warehouses.
Contd…
• A data warehouse architecture is typically organized as a three-tier system.
• Data is stored in two different ways: 1) data that is accessed frequently is
stored in very fast storage (like SSD drives), and 2) data that is infrequently
accessed is stored in a cheap object store, like Amazon S3.
• The data warehouse will automatically make sure that frequently accessed data
is moved into the "fast" storage so query speed is optimized.
• The top tier is the front-end client that presents results of queries through
reporting, analysis, and data mining tools.
Source: https://www.youtube.com/watch?v=q9oAZwhuUy4
OLAP and OLTP Systems
• OLTP: Online Transaction Processing
• OLAP: Online Analytical Processing
Comparison of OLTP and OLAP Systems
Source: https://www.youtube.com/watch?v=q9oAZwhuUy4
OLTP Systems
• These systems are called online transaction processing (OLTP) systems.
• The major task of these systems is to perform online transaction and query
processing.
• They cover most of the day-to-day operations of an organization, such as
purchasing, inventory, manufacturing, banking, payroll, registration, and
accounting.
• An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making.

OLAP Systems
• These systems are known as online analytical processing (OLAP) systems.
• Data warehouse systems serve users or knowledge workers in the role of data
analysis and decision making.
• Such systems can organize and present data in various formats in order to
accommodate the diverse needs of different users.
• An OLAP system manages large amounts of historic data, provides facilities
for summarization and aggregation, and stores and manages information at
different levels of granularity.
OLTP Systems
• An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design.
• An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historic data or data in different organizations.
• The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery
mechanisms.

OLAP Systems
• An OLAP system typically adopts either a star or a snowflake model and a
subject-oriented database design.
• An OLAP system often spans multiple versions of a database schema, due to
the evolutionary process of an organization.
• OLAP systems also deal with information that originates from different
organizations, integrating information from many data stores.
• Because of their huge volume, OLAP data are stored on multiple storage media.
Comparison of OLTP and OLAP Systems
Contd.
Feature                     | OLTP                                | OLAP
Access                      | read/write                          | mostly read
Focus                       | data in                             | information out
Number of records accessed  | tens                                | millions
Number of users             | thousands                           | hundreds
DB size                     | GB to high-order GB                 | ≥ TB
Priority                    | high performance, high availability | high flexibility, end-user autonomy
Metric                      | transaction throughput              | query throughput, response time
Dimensional Data Models
• The entity-relationship data model is commonly used in the design of relational
databases, where a database schema consists of a set of entities and the relationships
between them.
• The most popular data model for a data warehouse is a multidimensional model,
which can exist in the form of a star schema, a snowflake schema, or a fact
constellation schema.
1. Star Schema
• The most common modeling paradigm is the star schema, in which the data warehouse
contains:
(1) a large central table (fact table) containing the bulk of the data, with no
redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
• The schema graph resembles a starburst, with the dimension tables displayed in a radial
pattern around the central fact table.
• A star schema for AllElectronics sales is shown in the figure. Sales are
considered along four dimensions: time, item, branch, and location.
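The star-schema query pattern can be sketched with pandas: a query joins the central fact table to whichever dimension tables it needs, then aggregates a measure. The table and column names below are hypothetical, loosely modeled on the AllElectronics example.

```python
import pandas as pd

# Hypothetical star schema: one fact table plus two dimension tables.
sales_fact = pd.DataFrame({
    "time_key": [1, 1, 2],
    "item_key": [10, 11, 10],
    "dollars_sold": [1200.0, 800.0, 1500.0],
})
time_dim = pd.DataFrame({"time_key": [1, 2], "year": [2008, 2009]})
item_dim = pd.DataFrame({"item_key": [10, 11], "type": ["laptop", "phone"]})

# A query joins the fact table outward to each dimension ("starburst" joins).
joined = (sales_fact
          .merge(time_dim, on="time_key")
          .merge(item_dim, on="item_key"))

# Aggregate the measure: total dollars sold per year and item type.
report = joined.groupby(["year", "type"])["dollars_sold"].sum()
print(report)
```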
2. Snowflake Schema
• The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The
resulting schema graph forms a shape similar to a snowflake.
• The major difference between the snowflake and star schema models is that the
dimension tables of the snowflake model may be kept in normalized form to reduce
redundancies.
• Such a table is easy to maintain and saves storage space. However, this space savings is
negligible in comparison to the typical magnitude of the fact table.
• Furthermore, the snowflake structure can reduce the effectiveness of browsing, since
more joins will be needed to execute a query. Consequently, the system performance
may be adversely impacted.
Contd.
• Hence, although the snowflake schema reduces redundancy, it is not as popular
as the star schema in data warehouse design.
• A snowflake schema for AllElectronics sales is given in the figure. Here, the
sales fact table is identical to that of the star schema. The main difference
between the two schemas is in the definition of dimension tables. The single
dimension table for item in the star schema is normalized in the snowflake
schema, resulting in new item and supplier tables.
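The normalization step that turns a star dimension into a snowflake can be sketched in pandas. The item/supplier columns are hypothetical, modeled on the AllElectronics example: supplier attributes repeated on every item row are split into their own table, leaving only a foreign key behind.

```python
import pandas as pd

# Hypothetical denormalized item dimension (star schema): supplier
# attributes are repeated on every item row.
item_dim = pd.DataFrame({
    "item_key":      [10, 11, 12],
    "item_name":     ["laptop", "phone", "tablet"],
    "supplier_key":  [1, 1, 2],
    "supplier_name": ["Acme", "Acme", "Globex"],
})

# Snowflaking: normalize the supplier attributes into their own table...
supplier_dim = (item_dim[["supplier_key", "supplier_name"]]
                .drop_duplicates()
                .reset_index(drop=True))

# ...and keep only the foreign key in the item table.
item_dim_norm = item_dim.drop(columns=["supplier_name"])

# The cost: queries now need an extra join to recover supplier names.
recovered = item_dim_norm.merge(supplier_dim, on="supplier_key")
print(len(supplier_dim))  # redundancy removed: 2 supplier rows instead of 3
```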
3. Fact Constellation
• Sophisticated applications may require multiple fact tables to share dimension
tables. This kind of schema can be viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
• This schema specifies two fact tables, sales and shipping. The sales table
definition is identical to that of the star schema. The shipping table has five
dimensions, or keys (item key, time key, shipper key, from location, and to
location) and two measures (dollars cost and units_shipped).
Contd.
• A fact constellation schema allows dimension tables to be shared between fact tables.
• For example, the dimensions tables for time, item, and location are shared between the sales and
shipping fact tables.
• In data warehousing, there is a distinction between a data warehouse and a data mart. A data
warehouse collects information about subjects that span the entire organization, such as
customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide.
• For data warehouses, the fact constellation schema is commonly used, since it
can model multiple, interrelated subjects. A data mart, on the other hand, is a
departmental subset of the data warehouse that focuses on selected subjects, and
thus its scope is department-wide.
• For data marts, the star or snowflake schema is commonly used, since both are
geared toward modeling single subjects, although the star schema is more
popular and efficient.
Data Warehouse Models/ Schemas
• From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
1. Enterprise Warehouse
• An enterprise warehouse collects all of the information about subjects spanning the
entire organization.
• It typically contains detailed data as well as summarized data, and can range in size
from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
• It requires extensive business modeling and may take years to design and build.
2. Data Mart
• A data mart contains a subset of corporate-wide data that is of value to a specific
group of users.
• The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales.
• Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based.
• The implementation cycle of a data mart is more likely to be measured in weeks rather
than months or years.
Contd.
• However, it may involve complex integration in the long run if its design and planning
were not enterprise-wide.
• Independent data marts are sourced from data captured from one or more operational
systems or external information providers, or from data generated locally within a
particular department or geographic area.
• Dependent data marts are sourced directly from enterprise data warehouses.
3. Virtual Warehouse
• A virtual warehouse is a set of views over operational databases.
• For efficient query processing, only some of the possible summary views may be
materialized.
Data Cube: A Multidimensional Data Model
• Data warehouses and OLAP tools are based on a multidimensional data model. This model
views data in the form of a data cube.
• A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts.
• In general terms, dimensions are the perspectives or entities with respect to which an organization
wants to keep records.
• Each dimension may have a table associated with it, called a dimension table, which further
describes the dimension.
• For example, a dimension table for item may contain the attributes item name, brand, and type.
• Dimension tables can be specified by users or experts, or automatically generated and adjusted
based on data distributions.
Contd.
• A multidimensional data model is typically organized around a central theme, such
as sales.
• Examples of facts for a sales data warehouse include dollars sold (sales amount in
dollars), units sold (number of units sold), and amount budgeted.
Example
• Imagine that you have collected the data for your analysis. These data consist of the
AllElectronics sales per quarter, for the years 2008 to 2010.
Contd.
• You are, however, interested in the annual sales (total per year), rather than the total
per quarter.
• Thus, the data can be aggregated so that the resulting data summarize the total
sales per year instead of per quarter.
• The resulting data set is smaller in volume, without loss of information necessary
for the analysis task.
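The quarter-to-year aggregation described above is a one-line group-by. The quarterly sales figures below are hypothetical, standing in for the AllElectronics data.

```python
import pandas as pd

# Hypothetical quarterly sales (dollars in thousands).
quarterly = pd.DataFrame({
    "year":    [2008] * 4 + [2009] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "sales":   [224, 408, 350, 586, 300, 420, 380, 600],
})

# Aggregate away the quarter dimension: total sales per year.
# The result is smaller in volume but loses no information needed
# for an annual-sales analysis.
annual = quarterly.groupby("year")["sales"].sum()
print(annual)
```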
2-D DATA CUBE
• In this 2-D representation, the sales for Vancouver are shown with respect to the time dimension
(organized in quarters) and the item dimension (organized according to the types of items sold). The fact
or measure displayed is dollars sold (in thousands).
3-D DATA CUBE
• Suppose we would like to view the data according to time and item, as well as
location, for the cities Chicago, New York, Toronto, and Vancouver.
• The 3-D data in the table are represented as a series of 2-D tables.
Contd.
• For example, Figure below shows a data cube for multidimensional analysis of
sales data with respect to annual sales per item type for each AllElectronics branch.
Each cell holds an aggregate data value, corresponding to the data point in
multidimensional space.
Operations on Cubes
• Roll-Up
• Drill-Down
• Slice & Dice
• Pivot
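The cube operations listed above can be sketched on a flat table with pandas. This is only an analogy (a real OLAP engine precomputes aggregates); the mini-cube of cities, items, and sales below is hypothetical.

```python
import pandas as pd

# Hypothetical mini data cube as a flat table: two dimensions shown
# (city, item) plus a year column, with one measure (sales).
cube = pd.DataFrame({
    "city": ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "item": ["phone", "laptop", "phone", "laptop"],
    "year": [2008, 2008, 2008, 2008],
    "sales": [100, 200, 150, 250],
})

# Roll-up: climb a concept hierarchy (here, aggregate cities away).
rollup = cube.groupby("item")["sales"].sum()

# Slice: fix one dimension to a single value (city = "Toronto").
slice_ = cube[cube["city"] == "Toronto"]

# Dice: fix several dimensions at once, producing a subcube.
dice = cube[(cube["city"] == "Toronto") & (cube["item"] == "phone")]

# Pivot (rotate): reorient the view, e.g. items as rows, cities as columns.
pivot = cube.pivot_table(index="item", columns="city", values="sales")
print(pivot)
```

Drill-down is the reverse of roll-up: moving from `rollup` back toward the detailed `cube` rows (or to an even finer granularity, if stored).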
Extraction, Transformation, and Loading (ETL) /ETL Operations
• Data warehouse systems use back-end tools and utilities to populate and refresh their data.
• These tools and utilities include the following functions:
– Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources.
– Data cleaning, which detects errors in the data and rectifies them when possible.
– Data transformation, which converts data from legacy or host format to warehouse format.
– Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions.
– Refresh, which propagates the updates from the data sources to the warehouse.
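The ETL functions above can be sketched as a chain of small Python functions. Everything here is hypothetical (the source formats, the cleaning rule, the warehouse structure); it only illustrates how extract, clean, transform, and load compose.

```python
# Minimal ETL sketch (hypothetical sources and warehouse format).

def extract(sources):
    """Data extraction: gather rows from multiple heterogeneous sources."""
    return [row for src in sources for row in src]

def clean(rows):
    """Data cleaning: detect errors; here, simply drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows):
    """Data transformation: convert source formats to the warehouse format."""
    return [{"item": r["item"].lower(), "amount": float(r["amount"])}
            for r in rows]

def load(warehouse, rows):
    """Load: consolidate and summarize into the warehouse (totals per item)."""
    for r in rows:
        warehouse[r["item"]] = warehouse.get(r["item"], 0.0) + r["amount"]

# Two "source systems" with slightly different conventions.
source_a = [{"item": "Laptop", "amount": "1200"}]
source_b = [{"item": "PHONE", "amount": "800"}, {"item": None, "amount": "5"}]

warehouse = {}
load(warehouse, transform(clean(extract([source_a, source_b]))))
print(warehouse)
```

A refresh step would rerun this pipeline on new source rows and merge the updates into `warehouse`.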
Data Preprocessing
• How can the data be preprocessed in order to help improve the quality of the data?
• How can the data be preprocessed in order to help improve the mining results?
• How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?
58
Why do we need Data Pre-processing? (Overview)
• In large real-world databases, three of the most common problems are
incomplete, noisy, and inconsistent data.
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete: lacking attribute values, lacking certain attributes
of interest, or containing only aggregate data
– noisy: containing errors or outliers
– inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
• A multi-dimensional measure of data quality:
– A well-accepted multi-dimensional view:
• accuracy, completeness, consistency, timeliness, believability, value
added, interpretability, accessibility
– Broad categories:
• intrinsic, contextual, representational, and accessibility.
Why are there data problems?
• There are many possible reasons for inaccurate data (e.g., having incorrect
attribute values).
• The instrument used for data collection may be faulty, or there may be human
or computer errors occurring at data entry.
• Users may purposely submit incorrect data values for mandatory fields when they
do not wish to submit personal information (e.g., by choosing the default value
“January 1” displayed for birthday). This is known as disguised missing data.
Errors in data transmission can also occur.
• Incorrect data may also result from inconsistencies in naming conventions or data
codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also
require data cleaning.
• Data integration merges data from multiple sources into a coherent data store such
as a data warehouse.
• Data transformations (e.g., normalization) may be applied, where data are scaled
to fall within a smaller range like 0.0 to 1.0. This can improve the accuracy and
efficiency of mining algorithms involving distance measurements.
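The normalization mentioned above (min-max scaling into a range like 0.0 to 1.0) is a short formula: v' = (v - min) / (max - min) × (new_max - new_min) + new_min. A minimal sketch, with hypothetical sample values:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

# Hypothetical attribute values, e.g. customer ages.
ages = [20, 30, 40, 60]
print(min_max_normalize(ages))  # smallest -> 0.0, largest -> 1.0
```

Scaling all attributes into the same range keeps one attribute (e.g. income) from dominating distance computations over another (e.g. age).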
• Data reduction can reduce data size by, for instance, aggregating, eliminating
redundant features, or clustering.
• These techniques are not mutually exclusive; they may work together. For example,
data cleaning can involve transformations to correct wrong data, such as by
transforming all entries for a date field to a common format.
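The date-field example above (cleaning via transformation to a common format) can be sketched with the standard library. The input formats listed are hypothetical; a real pipeline would enumerate whatever formats its sources actually produce.

```python
from datetime import datetime

# Hypothetical date entries in inconsistent formats from different sources.
raw_dates = ["2009-03-01", "03/01/2009", "01 Mar 2009"]
known_formats = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"]

def to_common_format(value):
    """Try each known input format; emit one common format (ISO 8601)."""
    for fmt in known_formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # not this format; try the next one
    raise ValueError(f"unrecognized date format: {value!r}")

print([to_common_format(d) for d in raw_dates])
```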
Forms of data preprocessing