
Data Mining and Warehousing
Data, Information & Knowledge
• Data are the raw facts (alphanumeric values) obtained through different acquisition methods. Data in their simplest form consist of raw alphanumeric values.
• Information is created when data are processed, organized, or structured to provide context and meaning. Information is essentially processed data.
• Knowledge is what we know. Knowledge is unique to each individual and is the accumulation of past experience and insight that shapes the lens through which we interpret, and assign meaning to, information.
• Wisdom is the ability to make sensible decisions and judgments because of your knowledge or experience.
Types of Knowledge
• Explicit knowledge

Explicit knowledge is knowledge covering topics that are easy to systematically document (in writing) and share at scale: what we think of as structured information. When explicit knowledge is well managed, it can help a company make better decisions, save time, and sustain an increase in performance.

• Implicit knowledge

Implicit knowledge is, essentially, learned skills or know-how. It is gained by taking explicit knowledge and applying it to a specific situation. If explicit knowledge is a book on the mechanics of flight and a layout diagram of an airplane cockpit, implicit knowledge is what happens when you apply that information in order to fly the plane.
• Declarative knowledge

Declarative knowledge, which can also be understood as propositional knowledge, refers to static information and facts that are specific to a given topic and can be easily accessed and retrieved. It is a type of knowledge where the individual is consciously aware of their understanding of the subject matter.

• Procedural knowledge

Procedural knowledge focuses on the 'how' behind which things operate and is demonstrated through one's ability to do something. Where declarative knowledge focuses more on the 'who, what, where, or when', procedural knowledge is less articulated and is shown through action or documented through manuals.
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society

• Major sources of abundant data


• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!


• “Necessity is the mother of invention”—Data mining—Automated analysis of
massive data sets
Evolution of Database Technology
• 1960s:
• Data collection, database creation, IMS and network DBMS
• 1970s:
• Relational data model, relational DBMS implementation
• 1980s:
• RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
• Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
• Data mining, data warehousing, multimedia databases, and Web databases
• 2000s
• Stream data management and mining
• Data mining and its applications
• Web technology (XML, data integration) and global information systems
What is Data Mining?
• Data mining refers to extracting or mining knowledge from large amounts of data.

• Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems.

• It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.

• Data mining is one of the most useful techniques that help entrepreneurs, researchers, and individuals to extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD).

• The overall goal of the data mining process is to extract information from a data set and transform it into
an understandable structure for further use.
Advantages of Data Mining
• The Data Mining technique enables organizations to obtain knowledge-based data.

• Data mining enables organizations to make modifications in operation and production.

• Compared with other statistical data applications, data mining is cost-efficient.

• Data Mining helps the decision-making process of an organization.

• It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.

• It is a quick process that makes it easy for new users to analyze data in a short time.
Data Mining Applications
Classification of Data Mining Systems
• Data mining refers to the process of extracting important data from raw data. It analyses data patterns in huge sets of data with the help of software tools. Ever since its development, data mining has been adopted by researchers in the research and development field.

• With data mining, businesses can gain more profit. It has helped in determining business objectives for making clear decisions.

• To understand the system and meet the desired requirements, data mining can be classified into the following systems:
Challenges of Data Mining
• Security and Social Challenges: Decision-Making strategies are done through data collection-sharing, so it
requires considerable security.

• User Interface: The knowledge discovered using data mining tools is useful only if it is interesting and
above all understandable by the user.

• Mining Methodology Challenges: These challenges are related to data mining approaches and their
limitations.

(i) Versatility of the mining approaches,

(ii) Diversity of data available,

(iii) Dimensionality of the domain,

(iv) Control and handling of noise in data, etc.


Challenges of Data Mining
• Complex Data: Real-world data is heterogeneous and it could be multimedia data
containing images, audio and video, complex data, temporal data, spatial data,
time series, natural language text etc. It is difficult to handle these various kinds of
data and extract the required information.

• Performance: The performance of a data mining system depends on the efficiency of the algorithms and techniques being used. Algorithms and techniques that are not well designed degrade the performance of the data mining process.
What is Knowledge Discovery?
• The following diagram shows the process of knowledge discovery
The list of steps involved in the knowledge discovery process −

• Data Cleaning − In this step, the noise and inconsistent data is removed.

• Data Integration − In this step, multiple data sources are combined.

• Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.

• Data Transformation − In this step, data is transformed or consolidated into appropriate forms for mining by performing summary or aggregation operations.

• Data Mining − In this step, intelligent methods are applied in order to extract data patterns.

• Pattern Evaluation − In this step, data patterns are evaluated.

• Knowledge Presentation − In this step, knowledge is represented.


Architecture of Data Mining
• Knowledge Base: This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to
organize attributes or attribute values into different levels of abstraction.

• Data Mining Engine: This is essential to the data mining system and ideally consists of a set of
functional modules for tasks such as association and correlation analysis, classification, prediction,
cluster analysis and evolution analysis.

• Pattern Evaluation Module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.

• User interface: This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing information to
help focus the search.
Data Preprocessing:

• Need of Data Preprocessing

• Data Cleaning Process

• Data Integration Process

• Data Reduction Process

• Data Transformation Process


Need of Data Preprocessing

• Data preprocessing refers to the set of techniques applied to databases to remove noisy, missing, and inconsistent data. The data preprocessing techniques involved in data mining are data cleaning, data integration, data reduction, and data transformation.
Data Cleaning Process

• Data in the real world is usually incomplete, inconsistent, and noisy. The data cleaning process includes procedures that aim at filling in missing values, smoothing out noise by detecting outliers, and rectifying inconsistencies in the data.

• Let us discuss the basic methods of data cleaning


1. Missing Values

Assume that you are dealing with data such as sales and customer data and you observe that several attributes have missing values. One cannot compute on data with missing values, so the problem has to be handled. There are some methods which sort out this problem. Let us go through them one by one.

1.1 Ignore the tuple: This is usually done when the class label is missing. It is not effective when the percentage of missing values per attribute varies considerably.

1.2 Enter the missing value manually or fill it with a global constant: When the database contains many missing values, filling them in manually is not feasible because it is time-consuming. Another method is to fill them with some global constant.

1.3 Fill the missing value with the attribute mean or the most probable value: Filling the missing value with the attribute mean is another option (see the sketch below). Filling with the most probable value uses regression or decision tree induction.
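As an illustration of option 1.3, the sketch below fills the missing values of a numeric attribute with the attribute mean using pandas; the customer/income table is made up for this example.

```python
# A minimal sketch of handling missing values by filling a numeric attribute
# with its mean (the customer/income data below is hypothetical).
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income":      [50000, None, 62000, None, 48000],
})

# Replace the missing income values with the attribute mean.
df["income"] = df["income"].fillna(df["income"].mean())
print(df)
```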
2. Noisy Data
• Noise refers to any error in a measured variable. If a numerical attribute is given
you need to smooth out the data by eliminating noise. Some data smoothing
techniques are as follows,
2.1. Binning:
• Smoothing by bin means: In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin.
• Smoothing by bin median: In this method, each bin value is replaced by its bin
median value.
• Smoothing by bin boundaries: In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each value in the bin is then replaced with the closest boundary value.
• Let us understand with an example,
• Data for price: 15, 8, 21, 26, 21, 9, 25, 4, 34, 28, 24, 29
• Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

Partition into equal-frequency bins (depth of 4): Bin 1: 4, 8, 9, 15; Bin 2: 21, 21, 24, 25; Bin 3: 26, 28, 29, 34

Smoothing by bin means: Bin 1: 9, 9, 9, 9; Bin 2: 23, 23, 23, 23; Bin 3: 29, 29, 29, 29

Smoothing by bin medians: Bin 1: 9, 9, 9, 9; Bin 2: 24, 24, 24, 24; Bin 3: 29, 29, 29, 29

Smoothing by bin boundaries: Bin 1: 4, 4, 4, 15; Bin 2: 21, 21, 25, 25; Bin 3: 26, 26, 26, 34
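A small sketch of the example above, assuming equal-frequency bins of depth 4 and rounding the bin means to whole numbers as in the slide:

```python
# Equal-frequency binning of the price data, with smoothing by bin means
# and by bin boundaries (means are rounded, matching the example).
prices = [15, 8, 21, 26, 21, 9, 25, 4, 34, 28, 24, 29]

def make_bins(values, depth):
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = make_bins(prices, depth=4)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```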
2.2. Regression

• Regression is used to predict values. Linear regression uses the formula of a straight line to predict the value of y for a given value of x, whereas multiple linear regression predicts the value of a variable from the given values of two or more other variables.
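A minimal sketch of linear regression as a prediction/smoothing tool, using NumPy's least-squares polynomial fit on made-up x/y values:

```python
# Fit a straight line y = slope * x + intercept to hypothetical data points.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit of degree 1
print(f"y = {slope:.2f} * x + {intercept:.2f}")
print("prediction at x = 6:", slope * 6 + intercept)
```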
3) Data Integration Process
• Data integration is a data preprocessing technique that involves combining data from multiple heterogeneous data sources into a coherent data store and providing a unified view of the data. These sources may include multiple data cubes, databases, or flat files.
• 3.1 Issues in Data Integration
• There are a number of issues to consider during data integration: schema integration, redundancy, and detection and resolution of data value conflicts. These are explained briefly below.
• 3.1.1. Entity identification:
• Integrate metadata from different sources.
• Matching real-world entities from multiple data sources is referred to as the entity identification problem.
• For example, how can the data analyst or the computer be certain that customer id in one database and customer number in another refer to the same attribute?
• 3.1.2. Redundancy problem:
• An attribute may be redundant if it can be derived from another attribute or set of attributes.
• Inconsistencies in attribute naming can also cause redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis.
• 3.1.3. Detection and resolution of data value conflicts:
• This is the third important issue in data integration. Attribute values from different sources may differ for the same real-world entity. An attribute in one system may be recorded at a lower level of abstraction than the "same" attribute in another.
4) Data Transformation Process
• In the data transformation process, data are transformed from one format to another format that is more appropriate for data mining. Some data transformation strategies are:
• Smoothing: Smoothing is a process of removing noise from the data.
• Aggregation: Aggregation is a process where summary or aggregation operations are applied to the data.
• Generalization: In generalization, low-level data are replaced with high-level data by climbing concept hierarchies.
• Normalization: Normalization scales attribute data so that it falls within a small specified range, such as 0.0 to 1.0 (see the sketch below).
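For example, min-max normalization rescales an attribute into the range 0.0 to 1.0; the income values below are made up for illustration.

```python
# A minimal sketch of min-max normalization into [0.0, 1.0].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

incomes = [12000, 35000, 58000, 73000, 98000]
print(min_max_normalize(incomes))   # 12000 -> 0.0, 98000 -> 1.0
```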
5) Data Reduction Process
• Data warehouses usually store large amounts of data, so data mining operations take a long time to process it. Data reduction techniques help to minimize the size of the dataset without affecting the result. The following methods are commonly used for data reduction:
• Data cube aggregation:- Refers to a method where aggregation operations are performed on data to
create a data cube, which helps to analyze business trends and performance.
• Attribute subset selection:- Refers to a method where redundant attributes or dimensions or irrelevant
data may be identified and removed.
• Data Compression:- Refers to a method where encoding techniques are used to minimize the size of the
data set.
• Numerosity reduction:- Refers to a method where smaller data representation replaces the data.
• Discretization and concept hierarchy generation:- Refers to methods where higher conceptual values
replace raw data values for attributes. Data discretization is a type of numerosity reduction for the
automatic generation of concept hierarchies.
Data Cube
• A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
• A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, like sales or transactions. A fact table represents this theme. Facts are numerical measures.
• Dimensions are the perspectives or entities with respect to which a data cube is defined. Facts are generally quantities, which are used for analyzing the relationships between dimensions.
The figure below shows the data cube for All Electronics sales.

Each dimension has a dimension table which contains a further description of that dimension. For example, a branch dimension may have branch_name, branch_code, branch_address, etc.
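As a rough illustration (not the AllElectronics data, which is not reproduced here), the sketch below aggregates a hypothetical sales fact table along the branch and quarter dimensions with pandas, which is the essence of data cube aggregation:

```python
# Aggregate the measure (amount) of a made-up fact table along two dimensions.
import pandas as pd

sales = pd.DataFrame({
    "branch":  ["B1", "B1", "B2", "B2", "B1", "B2"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "amount":  [100, 150, 200, 120, 80, 60],
})

cube = sales.pivot_table(index="branch", columns="quarter",
                         values="amount", aggfunc="sum", fill_value=0)
print(cube)
```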
Attribute Subset Selection in Data Mining
• Attribute subset selection is a technique used for data reduction in the data mining process. Data reduction reduces the size of the data so that it can be used for analysis purposes more efficiently.
• This is a kind of greedy approach in which a significance level is decided (a commonly used significance level is 5%) and models are tested repeatedly until the p-value (probability value) of every selected attribute is less than or equal to the chosen significance level.
• Methods of Attribute Subset Selection-
1. Stepwise Forward Selection.
2. Stepwise Backward Elimination.
3. Combination of Forward Selection and Backward Elimination.
4. Decision Tree Induction.
• Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (having the minimum p-value) is chosen and added to the minimal set. In each iteration, one attribute is added to the reduced set.
• Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set of attributes.
• Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most commonly used technique for attribute selection.
• Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flow-chart-like structure with nodes denoting a test on an attribute. Each branch corresponds to an outcome of the test, and leaf nodes denote a class prediction. Attributes that are not part of the tree are considered irrelevant and are discarded.
Step-Wise Forward Selection

• Suppose there are the following attributes in the data set, of which a few attributes are redundant (a code sketch follows the steps).
Initial attribute set: {X1, X2, X3, X4, X5, X6}
Initial reduced attribute set: { }
Step-1: {X1}
Step-2: {X1, X2}
Step-3: {X1, X2, X5}
Final reduced attribute set: {X1, X2, X5}
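The sketch below mimics this walkthrough. The p_value() scorer is a hypothetical placeholder with hand-picked scores; in practice it would come from refitting a statistical model with each candidate attribute added.

```python
# A minimal sketch of stepwise forward selection driven by p-values.
def p_value(selected, candidate):
    # Hypothetical significance scores, for illustration only.
    scores = {"X1": 0.01, "X2": 0.02, "X3": 0.30, "X4": 0.40, "X5": 0.03, "X6": 0.25}
    return scores[candidate]

def forward_selection(attributes, alpha=0.05):
    selected, remaining = [], list(attributes)
    while remaining:
        best = min(remaining, key=lambda a: p_value(selected, a))
        if p_value(selected, best) > alpha:
            break                      # no remaining attribute is significant
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(["X1", "X2", "X3", "X4", "X5", "X6"]))
# -> ['X1', 'X2', 'X5'] with the hypothetical scores above
```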
Data Compression
• Data compression is a technique used in data mining to reduce the size of a dataset while still preserving the most important information. This can be beneficial in situations where the dataset is too large to be processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
• Types of data compression techniques
• There are various types of data compression techniques, but the two common types that always stand out are:
1. Lossy
2. Lossless
• Lossy compression

To understand the lossy compression technique, we must first understand the difference between data and information. Data is a raw, often unorganized collection of facts or values and can mean numbers, text, symbols, etc. Information, on the other hand, brings context by carefully organizing those facts. Lossy compression keeps the essential information but discards some of the raw detail, so the original data cannot be reconstructed exactly.

• Lossless Compression

Lossless compression, unlike lossy compression, does not remove any data; instead, it transforms the data to reduce its size, so the original can be reconstructed exactly.
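As a simple illustration of a lossless scheme (not one the slides name), run-length encoding transforms the data into (value, count) pairs and can restore the original exactly:

```python
# A minimal run-length encoding sketch: compress, then reconstruct losslessly.
def rle_encode(text):
    encoded, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        encoded.append((text[i], j - i))
        i = j
    return encoded

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

original = "AAAABBBCCDAA"
packed = rle_encode(original)
print(packed)                          # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
assert rle_decode(packed) == original  # lossless: exact reconstruction
```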
Numerosity Reduction
• In numerosity reduction, the data volume is decreased by choosing an alternative, smaller form of data representation.
• These techniques can be parametric or non-parametric.
• In parametric methods, a model is used to estimate the data, so that only the model parameters need to be stored instead of the actual data (for example, log-linear models).
• Non-parametric methods store a reduced representation of the data using techniques such as histograms, clustering, and sampling (see the sketch below).
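For instance, simple random sampling without replacement is a non-parametric way to keep a smaller representation of the data; the dataset below is made up.

```python
# A minimal sketch of numerosity reduction by simple random sampling.
import random

data = list(range(1, 10001))          # 10,000 hypothetical records
sample = random.sample(data, k=500)   # keep a 5% sample as the reduced representation
print(len(sample), sample[:10])
```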
Discretization
• Data discretization refers to a method of converting a large number of data values into a smaller number of intervals so that the evaluation and management of the data become easier.
• In other words, data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimum data loss.
• Example (see also the sketch below):
Age: 5, 7, 8, 12, 15, 18, 25, 35, 45, 60, 72, 82
Discretization: child (5, 7, 8), young (12, 15, 18, 25), mature (35, 45), old (60, 72, 82)
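A minimal sketch of the age example, assuming cut points at 10, 25, and 50 (the slide does not state the exact interval boundaries):

```python
# Map each continuous age value to a labelled interval.
def discretize_age(age):
    if age <= 10:          # assumed boundary for "child"
        return "child"
    elif age <= 25:        # assumed boundary for "young"
        return "young"
    elif age <= 50:        # assumed boundary for "mature"
        return "mature"
    return "old"

ages = [5, 7, 8, 12, 15, 18, 25, 35, 45, 60, 72, 82]
print([(a, discretize_age(a)) for a in ages])
```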
Association Rule
• Association rule mining is a procedure which aims to observe
frequently occurring patterns, correlations, or associations from
datasets found in various kinds of databases such as relational
databases, transactional databases, and other forms of repositories.
• Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
• The Association rule is a learning technique that helps identify the
dependencies between two data items. Based on the dependency, it
then maps accordingly so that it can be more profitable.
• Association rules are created by thoroughly analyzing data and looking for frequent if/then patterns. Then, depending on the following two parameters, the important relationships are observed:

• Support: Support indicates how frequently the if/then relationship appears in the database.

• Confidence: Confidence indicates the number of times these relationships have been found to be true.
Market Basket Analysis
• A data mining technique that is used to uncover purchase patterns in
any retail setting is known as Market Basket Analysis.
• In simple terms, market basket analysis in data mining analyzes the combinations of products that are bought together.
• This technique gives a careful study of the purchases made by a customer in a supermarket.
• This concept identifies the pattern of items frequently purchased together by customers. The analysis can help companies promote deals, offers, and sales, and data mining techniques help to achieve this analysis task.
How does Association Rule Learning work?
• Association rule learning works on the concept of If and Else statements, such as if A then B.
• To measure the associations between thousands of data items, there are several metrics.
• These metrics are given below:
Support
Confidence
Lift
• Support: Support is the frequency of A, or how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. If T is the total number of transactions, it can be written as:
Supp(X) = freq(X) / T
• Confidence: Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset when the occurrence of X is already given. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X => Y) = freq(X, Y) / freq(X)
• Lift: Lift measures the strength of a rule and is defined by the formula below (a worked sketch follows):
Lift(X => Y) = Supp(X, Y) / (Supp(X) * Supp(Y))
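The sketch below computes support, confidence, and lift for a rule X => Y over a small, made-up set of market-basket transactions:

```python
# Compute support, confidence and lift for the rule {bread} => {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset):
    count = sum(1 for t in transactions if itemset <= t)  # subset test
    return count / len(transactions)

def confidence(x, y):
    return support(x | y) / support(x)

def lift(x, y):
    return support(x | y) / (support(x) * support(y))

x, y = {"bread"}, {"milk"}
print("support:", support(x | y))      # 3/5 = 0.6
print("confidence:", confidence(x, y)) # 0.6 / 0.8 = 0.75
print("lift:", lift(x, y))             # 0.6 / (0.8 * 0.8) = 0.9375
```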
