SCSA3001 Data Mining And Data Warehousing

DATA MINING
Introduction - Steps in KDD - System Architecture - Types of data - Data mining
functionalities - Classification of data mining systems - Integration of a data mining
system with a data warehouse - Issues - Data Preprocessing - Data Mining
Applications.
INTRODUCTION

What is Data?
• Data is a collection of data objects and their attributes.
• An attribute is a property or characteristic of an object. Examples: eye color of a person,
temperature, etc. An attribute is also known as a variable, field, characteristic, or feature.
• A collection of attributes describes an object. An object is also known as a record, point, case,
sample, entity, or instance.
Data sets are made up of data objects. A data object represents an entity—in a sales database, the
objects may be customers, store items, and sales; in a medical database, the objects may be
patients; in a university database, the objects may be students, professors, and courses. Data
objects are typically described by attributes. Data objects can also be referred to as samples,
examples, instances, data points, or objects. If the data objects are stored in a database, they
are data tuples. That is, the rows of a database correspond to the data objects, and the columns
correspond to the attributes.
Attribute:
An attribute can be seen as a data field that represents a characteristic or feature of a data
object. For a customer object, attributes can be customer ID, address, etc.
A set of attributes used to describe a given object is known as an attribute vector or feature
vector.
Type of attributes:
Identifying attribute types is the first step of data preprocessing. We differentiate between the
different types of attributes and then preprocess the data. The attribute types are described below.
1. Qualitative (Nominal (N), Ordinal (O), Binary (B)).
2. Quantitative (Discrete, Continuous)

Figure 1.1 Type of attributes

Qualitative Attributes
1. Nominal Attributes – related to names:
The values of a nominal attribute are names of things or symbols. The values represent a
category or state, which is why nominal attributes are also referred to as categorical attributes.
There is no order among the values of a nominal attribute.
Example

Table 1.1 Nominal Attributes

2. Binary Attributes: Binary data has only two values or states, for example yes or no, affected or
unaffected, true or false.
i) Symmetric: Both values are equally important (Gender).
ii) Asymmetric: The two values are not equally important (Result).

Table 1.2 binary Attributes

Ordinal Attributes: An ordinal attribute has values with a meaningful sequence or ranking (order)
between them, but the magnitude of the difference between successive values is not known. The
order shows which value ranks higher but does not indicate by how much.

Table 1.3 Ordinal Attributes

Quantitative Attributes
1. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented
by integer or real values. Numeric attributes are of two types, interval-scaled and ratio-scaled.
i) An interval-scaled attribute has values whose differences are interpretable, but it has no
true reference point (zero point). Values on an interval scale can be added and subtracted but
cannot be meaningfully multiplied or divided. Consider temperature in degrees Centigrade: if
the temperature on one day is twice that of another day, we cannot say that the first day is
twice as hot as the other.
ii) A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is
ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values
are ordered, and we can also compute the difference between values as well as the mean, median,
mode, interquartile range and five-number summary.
2. Discrete: A discrete attribute has a finite or countably infinite set of values. The values can
be numerical or categorical.
Example

Table 1.4 Discrete Attributes

3. Continuous: Continuous data can take an infinite number of values within a range and is
typically represented as floating-point numbers. For example, there are infinitely many values
between 2 and 3.
Example:

Table 1.5 Continuous Attributes
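
The difference between interval-scaled and ratio-scaled attributes can be checked with a short
calculation. The sketch below is illustrative only: the two temperatures and the helper name
celsius_to_kelvin are assumptions chosen for the example, not values from these notes.

```python
# Interval vs. ratio scale: the same two temperatures on two scales.
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius (interval scale) to kelvin (ratio scale)."""
    return c + 273.15

# Hypothetical readings for two days, in degrees Celsius.
day_a_c, day_b_c = 10.0, 20.0

# Differences are meaningful on both scales and agree with each other.
diff_c = day_b_c - day_a_c
diff_k = celsius_to_kelvin(day_b_c) - celsius_to_kelvin(day_a_c)

# Ratios are only meaningful on the scale with a true zero point.
ratio_c = day_b_c / day_a_c
ratio_k = celsius_to_kelvin(day_b_c) / celsius_to_kelvin(day_a_c)

print(f"difference: {diff_c:.1f} C, {diff_k:.1f} K")
print(f"naive Celsius ratio: {ratio_c:.2f}, Kelvin ratio: {ratio_k:.3f}")
```

The Celsius ratio suggests "twice as hot", while the ratio on the kelvin scale is only slightly
above 1, which is why ratios should not be computed on interval-scaled attributes.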

STEPS INVOLVED IN KDD PROCESS

Data mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial
extraction of implicit, previously unknown and potentially useful information from data stored in
databases.

Figure 1.2 KDD Process

1. Data Cleaning: Data cleaning is the removal of noisy and irrelevant data from the
collection.
• Cleaning in case of missing values.
• Cleaning noisy data, where noise is a random error or variance.
• Cleaning with data discrepancy detection and data transformation tools.
2. Data Integration: Data integration combines heterogeneous data from multiple sources
into a common source (data warehouse).
• Data integration using data migration tools.
• Data integration using data synchronization tools.
• Data integration using the ETL (Extract-Transform-Load) process.

3. Data Selection: Data selection is the process in which data relevant to the analysis is
identified and retrieved from the data collection.
• Data selection using neural networks.
• Data selection using decision trees.
• Data selection using naive Bayes.
• Data selection using clustering, regression, etc.
4. Data Transformation: Data transformation is the process of transforming data into the
form required by the mining procedure. It is a two-step process:
• Data mapping: assigning elements from the source base to the destination to capture
transformations.
• Code generation: creating the actual transformation program.
5. Data Mining: Data mining is the application of intelligent techniques to extract potentially
useful patterns.
• Transforms task-relevant data into patterns.
• Decides the purpose of the model using classification or characterization.
6. Pattern Evaluation: Pattern evaluation is the identification of the truly interesting patterns
that represent knowledge, based on given measures.
• Finds the interestingness score of each pattern.
• Uses summarization and visualization to make the data understandable to the user.
7. Knowledge Representation: Knowledge representation is the use of visualization tools to
present data mining results.
• Generate reports.
• Generate tables.
• Generate discriminant rules, classification rules, characterization rules, etc.
Note:
• KDD is an iterative process in which evaluation measures can be enhanced, mining can be
refined, and new data can be integrated and transformed in order to get different and more
appropriate results.
• Preprocessing of databases consists of data cleaning and data integration.
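
The seven KDD steps can be sketched as a chain of small functions. This is only a toy
illustration under stated assumptions: the records, the field names and every function name
(clean, integrate, select, transform, mine, evaluate) are hypothetical, and real systems use far
richer techniques at each step.

```python
from collections import Counter

# Two hypothetical data sources holding the same kind of records.
store_a = [{"id": 1, "age": 25, "item": "computer"},
           {"id": 2, "age": None, "item": "printer"}]
store_b = [{"id": 3, "age": 41, "item": "computer"},
           {"id": 4, "age": 37, "item": "software"}]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: generalize the numeric age into an age group."""
    return [{"age_group": "under_40" if r["age"] < 40 else "40_plus",
             "item": r["item"]} for r in records]

def mine(records):
    """Data mining: count how often each (age_group, item) pattern occurs."""
    return Counter((r["age_group"], r["item"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a count threshold."""
    return {p: c for p, c in patterns.items() if c >= min_count}

data = transform(select(integrate(clean(store_a), clean(store_b)), ["age", "item"]))
patterns = evaluate(mine(data), min_count=1)

# Knowledge representation: present the result as a simple report.
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

Because KDD is iterative, any of these functions could be rerun with different parameters (for
example a higher min_count) after inspecting the report.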

SYSTEM ARCHITECTURE
Data mining is a very important process where potentially useful and previously unknown
information is extracted from large volumes of data. There are a number of components involved
in the data mining process. These components constitute the architecture of a data mining system.
Data Mining Architecture

The major components of any data mining system are data source, data warehouse server, data
mining engine, pattern evaluation module, graphical user interface and knowledge base.

Figure 1.3 system Architecture

a) Data Sources
Database, data warehouse, World Wide Web (WWW), text files and other documents are the
actual sources of data. You need large volumes of historical data for data mining to be successful.
Organizations usually store data in databases or data warehouses. Data warehouses may contain
one or more databases, text files, spreadsheets or other kinds of information repositories.
Sometimes, data may reside even in plain text files or spreadsheets. World Wide Web or the
Internet is another big source of data.
Different Processes
The data needs to be cleaned, integrated and selected before it is passed to the database or data
warehouse server. Because the data comes from different sources and in different formats, it cannot
be used directly for data mining: it might be incomplete or unreliable. So the data first needs to
be cleaned and integrated. Moreover, more data than required is usually collected from the
different data sources, so only the data of interest needs to be selected and passed to the server.
These processes are not simple. A number of techniques may be applied to the data as part of
cleaning, integration and selection.
b) Database or Data Warehouse Server
The database or data warehouse server contains the actual data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data based on the data mining request
of the user.
c) Data Mining Engine
The data mining engine is the core component of any data mining system. It consists of a number
of modules for performing data mining tasks including association, classification,
characterization, clustering, prediction, time-series analysis etc.
d) Pattern Evaluation Modules
The pattern evaluation module is mainly responsible for measuring the interestingness of each
pattern against a threshold value. It interacts with the data mining engine to focus the search
on interesting patterns.
e) Graphical User Interface
The graphical user interface module communicates between the user and the data mining system.
This module helps the user use the system easily and efficiently without knowing the real
complexity behind the process. When the user specifies a query or a task, this module interacts
with the data mining system and displays the result in an easily understandable manner.
f) Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the
search or evaluating the interestingness of the result patterns. The knowledge base might even
contain user beliefs and data from user experiences that can be useful in the process of data
mining. The data mining engine might get inputs from the knowledge base to make the result
more accurate and reliable. The pattern evaluation module interacts with the knowledge base on a
regular basis to get inputs and also to update it.
Summary
Every component of a data mining system has its own role and importance in completing
data mining efficiently.

DATA MINING FUNCTIONALITIES

Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. For example, in an electronics store, classes of
items for sale include computers and printers, and concepts of customers include big spenders and
budget spenders.
Data characterization
Data characterization is a summarization of the general characteristics or features of a target class
of data.
Data discrimination
Data discrimination is a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes.
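
As a minimal sketch of the two ideas, the code below summarizes a target class
(characterization) and then compares it with a contrasting class (discrimination). The customer
records, the attribute names and the functions characterize and discriminate are hypothetical,
and mean values stand in for whatever summary a real system would produce.

```python
from statistics import mean

# Hypothetical customer records for the electronics store example.
customers = [
    {"segment": "big_spender", "age": 45, "yearly_spend": 5200},
    {"segment": "big_spender", "age": 38, "yearly_spend": 4800},
    {"segment": "budget_spender", "age": 23, "yearly_spend": 600},
    {"segment": "budget_spender", "age": 31, "yearly_spend": 900},
]

def characterize(records, segment):
    """Characterization: summarize the general features of one target class."""
    rows = [r for r in records if r["segment"] == segment]
    return {"count": len(rows),
            "mean_age": mean(r["age"] for r in rows),
            "mean_spend": mean(r["yearly_spend"] for r in rows)}

def discriminate(records, target, contrast):
    """Discrimination: compare the target class with a contrasting class."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {key: t[key] - c[key] for key in ("mean_age", "mean_spend")}

summary = characterize(customers, "big_spender")
gap = discriminate(customers, "big_spender", "budget_spender")
print(summary)
print(gap)
```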
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data. There are many kinds of frequent
patterns, including itemsets, subsequences, and substructures.
Association analysis
Suppose, as a marketing manager, you would like to determine which items are frequently
purchased together within the same transactions. An example of such a rule is:
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
Here X is a variable representing a customer. Confidence = 50% means that if a customer buys a
computer, there is a 50% chance that she will buy software as well.
Support = 1% means that 1% of all of the transactions under analysis showed that computer and
software were purchased together.
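
Support and confidence can be computed directly from a list of transactions. The sketch below
uses a small made-up transaction list, so its numbers differ from the 1% and 50% in the rule
above. The function names support and confidence are chosen for this example.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, over support of antecedent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: buys(X, "computer") => buys(X, "software")
s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

Here 2 of the 5 transactions contain both items (support 40%), and 2 of the 3 transactions that
contain a computer also contain software (confidence about 67%).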
Classification:
There is a large variety of data mining systems available. Data mining systems may integrate
techniques from the following:
• Spatial Data Analysis
• Information Retrieval
• Pattern Recognition
• Image Analysis
• Signal Processing
• Computer Graphics
• Web Technology
• Business
• Bioinformatics
DATA MINING SYSTEM CLASSIFICATION
Data mining is an interdisciplinary field, and a data mining system can be classified according
to the disciplines it draws on:
• Database Technology
• Statistics
• Machine Learning
• Information Science
• Visualization
• Other Disciplines

Figure 1.4 system Architecture

Apart from these, a data mining system can also be classified based on the kind of (a) databases
mined, (b) knowledge mined, (c) techniques utilized, and (d) applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models, types of data, etc., and
the data mining system can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a
relational, transactional, object-relational, or data warehouse mining system.

Classification Based on the kind of Knowledge Mined

We can classify a data mining system according to the kind of knowledge mined. This means the
data mining system is classified on the basis of functionalities such as:
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Outlier Analysis
• Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques used. We can describe
these techniques according to the degree of user interaction involved or the methods of analysis
employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted. These applications
are as follows:
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
Data Mining Task Primitives
Each user will have a data mining task in mind, that is, some form of data analysis that he or she
would like to have performed. A data mining task can be specified in the form of a data mining
query, which is input to the data mining system. A data mining query is defined in terms of data
mining task primitives. These primitives allow the user to interactively communicate with the data
mining system during discovery in order to direct the mining process, or examine the findings
from different angles or depths. The primitives are:
• The set of task-relevant data to be mined: This specifies the portions of the database or the
set of data in which the user is interested. This includes the database attributes or data
warehouse dimensions of interest (referred to as the relevant attributes or dimensions).
• The kind of knowledge to be mined: This specifies the data mining functions to be
performed, such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.
• The background knowledge to be used in the discovery process: This knowledge about the
domain to be mined is useful for guiding the knowledge discovery process and for evaluating the
patterns found. Concept hierarchies are a popular form of background knowledge, which allow
data to be mined at multiple levels of abstraction. User beliefs regarding relationships in the
data are another form of background knowledge.
• The interestingness measures and thresholds for pattern evaluation: They may be used to guide
the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of
knowledge may have different interestingness measures. For example, interestingness measures for
association rules include support and confidence. Rules whose support and confidence values are
below user-specified thresholds are considered uninteresting.
• The expected representation for visualizing the discovered patterns: This refers to the form in
which discovered patterns are to be displayed, which may include rules, tables, charts, graphs,
decision trees, and cubes.
A data mining query language can be designed to incorporate these primitives, allowing users to
flexibly interact with data mining systems. Having a data mining query language provides a
foundation on which user-friendly graphical interfaces can be built.
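
One way to picture the five primitives is as the fields of a query object. The sketch below is
not a real data mining query language: the class MiningQuery, its field names and its default
thresholds are hypothetical and only mirror the list above.

```python
from dataclasses import dataclass, field

@dataclass
class MiningQuery:
    """Hypothetical container holding the five task primitives."""
    relevant_attributes: list                               # task-relevant data
    knowledge_kind: str                                     # kind of knowledge to be mined
    concept_hierarchy: dict = field(default_factory=dict)   # background knowledge
    min_support: float = 0.01                               # interestingness thresholds
    min_confidence: float = 0.5
    presentation: str = "rules"                             # expected representation

    def is_interesting(self, support: float, confidence: float) -> bool:
        """A rule below either user-specified threshold is treated as uninteresting."""
        return support >= self.min_support and confidence >= self.min_confidence

query = MiningQuery(
    relevant_attributes=["age", "income", "item"],
    knowledge_kind="association",
    concept_hierarchy={"city": "country"},
)
print(query.is_interesting(support=0.01, confidence=0.5))
print(query.is_interesting(support=0.005, confidence=0.9))
```

Grouping the primitives in one object is a design choice for the example. It shows how a
graphical interface could collect the same five inputs and pass them to the mining engine.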

Figure 1.5 Data mining tasks

INTEGRATING A DATA MINING SYSTEM WITH A DB/DW SYSTEM

If a data mining system is not integrated with a database or a data warehouse system, it has no
such system to communicate with. This is known as the no-coupling scheme. In this scheme, the
main focus is on data mining design and on developing efficient and effective algorithms for
mining the available data sets.
The list of integration schemes is as follows:
• No coupling: In this scheme, the data mining system does not utilize any of the database or
data warehouse functions. It fetches the data from a particular source and processes that data
using some data mining algorithms. The data mining result is stored in another file.
• Loose coupling: In this scheme, the data mining system may use some of the functions of the
database or data warehouse system. It fetches the data from the data repository managed by
these systems and performs data mining on that data. It then stores the mining result either in a
file or in a designated place in a database or in a data warehouse.
• Semi-tight coupling: In this scheme, the data mining system is linked with a database or a
data warehouse system and, in addition, efficient implementations of a few data mining
primitives can be provided in the database.
• Tight coupling: In this scheme, the data mining system is smoothly integrated into
the database or data warehouse system. The data mining subsystem is treated as one functional
component of an information system.
MAJOR ISSUES IN DATA WAREHOUSING AND MINING
• Mining methodology and user interaction
– Mining different kinds of knowledge in databases
– Interactive mining of knowledge at multiple levels of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad hoc data mining
– Expression and visualization of data mining results
– Handling noise and incomplete data
– Pattern evaluation: the interestingness problem
• Performance and scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed and incremental mining methods
• Issues relating to the diversity of data types
– Handling relational and complex types of data
– Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
– Application of discovered knowledge
– Domain-specific data mining tools
Issues:
Data mining is not an easy task, as the algorithms used can get very complex and data is not
always available in one place. It needs to be integrated from various heterogeneous data sources.
These factors also create some issues. Here we discuss the major issues regarding:
• Mining Methodology and User Interaction
• Performance Issues
• Diverse Data Types Issues
The following diagram describes the major issues.

Figure 1.6 Data Mining Issues

Mining Methodology and User Interaction Issues:

This refers to the following kinds of issues:

• Mining different kinds of knowledge in databases: Different users may be interested in
different kinds of knowledge. Therefore data mining needs to cover a broad range of knowledge
discovery tasks.
• Interactive mining of knowledge at multiple levels of abstraction: The data mining process
needs to be interactive because this allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
• Incorporation of background knowledge: Background knowledge can be used to guide the
discovery process and to express the discovered patterns, not only in concise terms but also at
multiple levels of abstraction.
• Data mining query languages and ad hoc data mining: A data mining query language that
allows the user to describe ad hoc mining tasks should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
• Presentation and visualization of data mining results: Once patterns are discovered, they
need to be expressed in high-level languages and visual representations. These representations
should be easily understandable.
• Handling noisy or incomplete data: Data cleaning methods are required to handle noise and
incomplete objects while mining the data regularities. Without data cleaning methods, the
accuracy of the discovered patterns will be poor.
• Pattern evaluation: The patterns discovered should be interesting. A pattern may be
uninteresting because it represents common knowledge or lacks novelty.
Performance Issues:
There can be performance-related issues such as the following:
• Efficiency and scalability of data mining algorithms: In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be efficient
and scalable.
• Parallel, distributed, and incremental mining algorithms: Factors such as the huge size of
databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the
data into partitions, which are processed in parallel. The results from the partitions are then
merged. Incremental algorithms incorporate database updates without mining the data
again from scratch.
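
The partition, merge and incremental-update ideas can be sketched with simple item counting.
This is an illustration under assumptions: the function names are hypothetical, item counting
stands in for a real mining algorithm, and a thread pool is used only to show the structure
(threads in CPython do not give real speed-up for CPU-bound work).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_items(partition):
    """Mine one partition: count how often each item occurs."""
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts

def mine_partitioned(transactions, n_partitions=2):
    """Split the data, mine the partitions concurrently, then merge the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for counts in partial_counts:
        total += counts
    return total

def update_incrementally(total, new_transactions):
    """Incremental step: fold in new data without re-mining the old data."""
    return total + count_items(new_transactions)

# Hypothetical transactions already in the database, then one new arrival.
old = [["computer", "software"], ["computer"], ["printer"]]
counts = mine_partitioned(old)
counts = update_incrementally(counts, [["computer", "printer"]])
print(counts)
```

The merged and incrementally updated counts match what a single pass over all four transactions
would produce, which is the property such algorithms need to preserve.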
Diverse Data Types Issues:
• Handling of relational and complex types of data: The database may contain complex data
objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one
system to mine all these kinds of data.
• Mining information from heterogeneous databases and global information systems: The
data is available at different data sources on a LAN or WAN. These data sources may be
structured, semi-structured or unstructured. Therefore mining knowledge from them adds
challenges to data mining.
DATA PREPROCESSING
Data preprocessing is a data mining technique that involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method
of resolving such issues. It prepares raw data for further processing.
Data preprocessing is used in database-driven applications such as customer relationship
management and rule-based applications (like neural networks).
Data goes through a series of steps during preprocessing:
• Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing noisy data, or resolving inconsistencies in the data.
• Data Integration: Data with different representations are put together and conflicts within
the data are resolved.
• Data Transformation: Data is normalized, aggregated and generalized.
• Data Reduction: This step aims to present a reduced representation of the data in a data
warehouse.
• Data Discretization: This step reduces the number of values of a continuous attribute
by dividing the range of the attribute into intervals.
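
Three of these steps can be sketched on a single numeric attribute. The example below is a
minimal illustration: the ages list is made up, and mean imputation, min-max normalization and
equal-width binning are common choices assumed here rather than methods prescribed by these
notes. It also assumes the values are not all identical, since that would give a zero range.

```python
from statistics import mean

# Hypothetical attribute with missing values (None).
ages = [23, None, 35, 47, None, 59]

def fill_missing(values):
    """Data cleaning: replace missing values with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Data discretization: map each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

cleaned = fill_missing(ages)
normalized = min_max_normalize(cleaned)
bins = equal_width_bins(cleaned, n_bins=3)
print(cleaned)
print(bins)
```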
Integration of a data mining system with a data warehouse:
When a data mining (DM) system has to communicate with database (DB) and data warehouse (DW)
systems, possible integration schemes include no coupling, loose coupling, semi-tight coupling,
and tight coupling. We examine each of these schemes, as follows:

1. No coupling: No coupling means that a DM system will not utilize any function of a DB or
DW system. It may fetch data from a particular source (such as a file system), process data using
some data mining algorithms, and then store the mining results in another file.
2. Loose coupling: Loose coupling means that a DM system will use some facilities of a DB or
DW system, fetching data from a data repository managed by these systems, performing data
mining, and then storing the mining results either in a file or in a designated place in a database
or data warehouse. Loose coupling is better than no coupling because it can fetch any portion of
data stored in databases or data warehouses by using query processing, indexing, and other system
facilities.
However, many loosely coupled mining systems are main memory-based. Because mining does
not take advantage of the data structures and query optimization methods provided by DB or DW
systems, it is difficult for loose coupling to achieve high scalability and good performance with
large data sets.
3. Semi-tight coupling: Semi-tight coupling means that besides linking a DM system to a
DB/DW system, efficient implementations of a few essential data mining primitives (identified by
the analysis of frequently encountered data mining functions) can be provided in the DB/DW
system. These primitives can include sorting, indexing, aggregation, histogram analysis, multiway
join, and precomputation of some essential statistical measures, such as sum, count, max,
min, and standard deviation.
4. Tight coupling: Tight coupling means that a DM system is smoothly integrated into the
DB/DW system. The data mining subsystem is treated as one functional component of an
information system. Data mining queries and functions are optimized based on mining query
analysis, data structures, indexing schemes, and query processing methods of a DB or DW
system.
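
The difference between loose and semi-tight coupling can be sketched with Python's built-in
sqlite3 module standing in for the DB system. The sales table and its rows are hypothetical, and
this illustrates the idea only, not how any particular data mining product is built.

```python
import sqlite3

# An in-memory database plays the role of the DB/DW system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("ann", "computer", 900.0), ("ann", "software", 100.0),
     ("bob", "printer", 150.0), ("bob", "computer", 850.0)],
)

# Loose coupling: use the DB only to fetch the relevant portion of the data,
# then compute the statistics in the mining program's own memory.
rows = conn.execute("SELECT amount FROM sales WHERE item = 'computer'").fetchall()
amounts = [amount for (amount,) in rows]
loose_stats = (len(amounts), sum(amounts), min(amounts), max(amounts))

# Semi-tight coupling (in spirit): the aggregation primitives
# (count, sum, min, max) are evaluated inside the database system.
semi_tight_stats = conn.execute(
    "SELECT COUNT(*), SUM(amount), MIN(amount), MAX(amount) "
    "FROM sales WHERE item = 'computer'"
).fetchone()

conn.close()
print(loose_stats)
print(semi_tight_stats)
```

Both approaches give the same numbers. With large data sets, the second avoids moving every row
into the mining program's memory, which is the scalability argument made above.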

Figure 1.7 Integration of a data mining system with a data warehouse:

DATA MINING APPLICATIONS

Here is the list of areas where data mining is widely used:
• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high quality,
which facilitates systematic data analysis and data mining. Some of the typical cases are as follows:
• Design and construction of data warehouses for multidimensional data analysis and data
mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.

Retail Industry
Data mining has wide application in the retail industry because the industry collects large amounts
of data on sales, customer purchasing history, goods transportation, consumption and services. It
is natural that the quantity of data collected will continue to expand rapidly because of the
increasing ease, availability and popularity of the web.
Data mining in the retail industry helps in identifying customer buying patterns and trends that
lead to improved quality of customer service and good customer retention and satisfaction. Here is
the list of examples of data mining in the retail industry:
• Design and construction of data warehouses based on the benefits of data mining.
• Multidimensional analysis of sales, customers, products, time and region.
• Analysis of the effectiveness of sales campaigns.
• Customer retention.
• Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-emerging industries, providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data
transmission, etc. Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has become
very important in helping to understand the business.
Data mining in the telecommunication industry helps to identify telecommunication patterns,
catch fraudulent activities, make better use of resources, and improve quality of service. Here is
the list of examples for which data mining improves telecommunication services:
• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential pattern analysis.
• Mobile telecommunication services.
• Use of visualization tools in telecommunication data analysis.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as genomics,
proteomics, functional genomics and biomedical research. Biological data mining is a very
important part of bioinformatics. The following are aspects in which data mining contributes to
biological data analysis:
• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search and comparative analysis of multiple nucleotide
sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.
Other Scientific Applications
The applications discussed above tend to handle relatively small and homogeneous data sets for
which statistical techniques are appropriate. Huge amounts of data have been collected from
scientific domains such as geosciences, astronomy, etc. Large data sets are also being
generated by fast numerical simulations in various fields such as climate and
ecosystem modelling, chemical engineering, fluid dynamics, etc. The following are applications of
data mining in the scientific field:
• Data warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain-specific knowledge.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a major
issue. The increased usage of the internet and the availability of tools and tricks for intruding
into and attacking networks have made intrusion detection a critical component of network
administration.
Here is the list of areas in which data mining technology may be applied for intrusion detection:
• Development of data mining algorithms for intrusion detection.
• Association and correlation analysis, and aggregation to help select and build discriminating
attributes.
• Analysis of stream data.
• Distributed data mining.
• Visualization and query tools.

Data Mining System Products

There are many data mining system products and domain-specific data mining applications. New
data mining systems and applications continue to be added to the existing ones. Efforts are also
being made to standardize data mining languages.
Choosing a Data Mining System
The selection of a data mining system depends on the following features −
• Data Types − The data mining system may handle formatted text, record-based data, and
relational data. The data could also be ASCII text, relational database data or data warehouse
data. Therefore, we should check exactly which formats the data mining system can handle.
• System Issues − We must consider the compatibility of a data mining system with different
operating systems. A data mining system may run on only one operating system or on several.
There are also data mining systems that provide web-based user interfaces and allow XML data as
input.
• Data Sources − Data sources refer to the data formats on which the data mining system will
operate. Some data mining systems may work only on ASCII text files, while others work on
multiple relational sources. A data mining system should also support ODBC connections or
OLE DB for ODBC connections.
• Data Mining functions and methodologies − Some data mining systems provide only one data
mining function, such as classification, while others provide multiple data mining functions
such as concept description, discovery-driven OLAP analysis, association mining, linkage
analysis, statistical analysis, classification, prediction, clustering, outlier analysis,
similarity search, etc.
• Coupling data mining with databases or data warehouse systems − Data mining systems need
to be coupled with a database or a data warehouse system. The coupled components are integrated
into a uniform information processing environment. The types of coupling are listed below −
o No Coupling
o Loose Coupling
o Semi-tight Coupling
o Tight Coupling
• Scalability − There are two scalability issues in data mining −
o Row (Database size) Scalability − A data mining system is considered row scalable if, when
the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute the
same query.
o Column (Dimension) Scalability − A data mining system is considered column scalable if
the mining query execution time increases linearly with the number of columns.
• Visualization Tools − Visualization in data mining can be categorized as follows −
o Data Visualization
o Mining Results Visualization
o Mining Process Visualization
o Visual Data Mining
• Data Mining query language and graphical user interface − An easy-to-use graphical user
interface is important to promote user-guided, interactive data mining. Unlike relational database
systems, data mining systems do not share a common underlying data mining query language.
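The two scalability criteria in the list above can be checked against measured query times. The sketch below is one possible reading, not a standard test: the function names, the timing figures and the 20% tolerance used to decide what counts as "linear" are all assumptions made for illustration.

```python
def is_row_scalable(base_seconds, scaled_seconds, growth_factor=10):
    """Row scalability: after the row count grows by growth_factor,
    the same query takes at most growth_factor times as long."""
    return scaled_seconds <= growth_factor * base_seconds


def is_column_scalable(column_counts, seconds, tolerance=0.2):
    """Column scalability, approximated as: the time per column stays
    within `tolerance` of its smallest observed value (assumed criterion)."""
    per_column = [t / c for c, t in zip(column_counts, seconds)]
    return max(per_column) <= (1 + tolerance) * min(per_column)


# Hypothetical measurements: a query takes 2 s on the original table
# and 18 s after the table is enlarged 10 times.
row_ok = is_row_scalable(2.0, 18.0)

# Hypothetical timings for 10, 20 and 40 columns.
col_ok = is_column_scalable([10, 20, 40], [1.0, 2.1, 4.0])
col_bad = is_column_scalable([10, 20, 40], [1.0, 4.0, 16.0])
```

In the last call the time quadruples each time the column count doubles, so the per-column cost is not constant and the check fails.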
Trends in Data Mining
Data mining concepts are still evolving. Here are the latest trends seen in this field −
• Application exploration.
• Scalable and interactive data mining methods.
• Integration of data mining with database systems, data warehouse systems and web database
systems.
• Standardization of data mining query language.
• Visual data mining.
• New methods for mining complex types of data.
• Biological data mining.
• Data mining and software engineering.
• Web mining.
• Distributed data mining.
• Real-time data mining.
• Multi-database data mining.
• Privacy protection and information security in data mining.

PART-A

Q. No Questions Competence BT Level

1. Define Data mining. List out the steps in data mining. Remember BTL-1

2. Compare Discrete versus Continuous Attributes. Analyze BTL-4

3. Give the applications of Data Mining. Understand BTL-2

4. Analyze the issues in Data Mining Techniques. Apply BTL-3

5. Generalize in detail about Numeric Attributes. Create BTL-6

6. Evaluate the major tasks of data preprocessing. Evaluate BTL-5

7. Define an efficient procedure for cleaning the noisy data. Remember BTL-1

8. Distinguish between data similarity and dissimilarity. Understand BTL-2

9. Show the Displays of Basic Statistical Descriptions of Data. Analyze BTL-4

10. Formulate what is data discretization. Create BTL-6

PART-B

Q. No Questions Competence BT Level

1. i) Describe the issues of data mining. (7)
   ii) Describe in detail the applications of data mining. (6)
   Remember BTL-1

2. i) State and explain the various classifications of data mining systems with example. (7)
   ii) Explain the various data mining functionalities in detail. (6)
   Analyze BTL-4

3. i) Describe the steps involved in Knowledge Discovery in Databases (KDD). (7)
   ii) Draw the diagram and describe the architecture of a data mining system. (6)
   Remember BTL-1


4. Suppose that the data for analysis include the attribute age. The age values for the data
   tuples are 13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
   36, 40, 45, 46, 52, 70.
   i) Use smoothing by binning with a bin depth of 3. Illustrate your steps. (6)
   ii) Classify the various methods for data smoothing. (7)
   Create BTL-6

5. (i) Discuss whether or not each of the following activities is a data mining task. (5)
   1. Credit card fraud detection using transaction records.
   2. Dividing the customers of a company according to their gender.
   3. Computing the total sales of a company.
   4. Predicting the future stock price of a company using historical records.
   5. Monitoring seismic waves for earthquake activities.
   (ii) Discuss descriptive and predictive data mining tasks with illustrations. (8)
   Understand BTL-2

6. i) Generalize why we need a data preprocessing step in data mining. (8)
   ii) Explain the various methods of data cleaning and data reduction techniques. (7)
   Evaluate BTL-5

7. i) Compose in detail the various data transformation techniques. (7)
   ii) Develop a short note on discretization techniques. (6)
   Create BTL-6
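
For the age data in question 4 of PART-B, the binning step can be sketched as follows. This is only one way to approach it: the question does not say which smoothing variant to use, so the sketch assumes equal-depth bins smoothed by bin means, and the function name is hypothetical. With 26 values and a depth of 3, the last bin holds only two values.

```python
def smooth_by_bin_means(values, depth):
    """Sort the values, split them into consecutive bins of `depth` items,
    and replace every value by the mean of its bin (rounded to 2 decimals)."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]  # last bin may be shorter
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean, 2)] * len(bin_values))
    return smoothed


# Age values exactly as listed in the question.
ages = [13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

smoothed_ages = smooth_by_bin_means(ages, depth=3)
for start in range(0, len(ages), 3):
    print(ages[start:start + 3], "->", smoothed_ages[start:start + 3])
```

Smoothing by bin medians or bin boundaries follows the same binning step and differs only in the value that replaces each bin member.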

TEXT / REFERENCE BOOKS

1. Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 2nd Edition,
Elsevier, 2007.
2. Alex Berson and Stephen J. Smith, “Data Warehousing, Data Mining & OLAP”, Tata McGraw
Hill, 2007.
3. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”, Pearson
Education, 2007.
4. K.P. Soman, Shyam Diwakar and V. Ajay, “Insight into Data Mining: Theory and Practice”,
Eastern Economy Edition, Prentice Hall of India, 2006.
5. G. K. Gupta, “Introduction to Data Mining with Case Studies”, Eastern Economy Edition,
Prentice Hall of India, 2006.
6. Daniel T. Larose, “Data Mining Methods and Models”, Wiley-Interscience, 2006.
