DM - Unit-1 - Fundamentals of Data Mining
UNIT-1
FUNDAMENTALS OF DATA MINING
★HISTORY:
● The origins of data mining can be traced back to the 1950s when the first
computers were developed and used for scientific and mathematical
research. As the capabilities of computers and data storage systems
improved, researchers began to explore the use of computers to analyze and
extract insights from large data sets.
● One of the earliest and most influential pioneers of data mining was Dr.
Herbert Simon, a Nobel laureate in economics and one of the founding
figures of artificial intelligence. In the 1950s and 1960s, Simon and his
colleagues developed a number of algorithms and techniques for extracting
useful information and insights from data, including clustering,
classification, and decision trees.
● In the 1980s and 1990s, the field of data mining continued to evolve, and
new algorithms and techniques were developed to address the challenges of
working with large and complex data sets. The development of data mining
software and platforms, such as SAS, SPSS, and RapidMiner, made it easier
for organizations to apply data mining techniques to their data.
● In recent years, the availability of large data sets and the growth of cloud
computing and big data technologies have made data mining even more
powerful and widely used. Today, data mining is a crucial tool for many
organizations and industries and is used to extract valuable insights and
information from data sets in a wide range of domains.
1. Association
2. Classification
● Decision Tree
● SVM (Support Vector Machine)
● Generalized Linear Models
● Bayesian Classification
● Classification by Backpropagation
● K-NN Classifier
● Rule-Based Classification
● Frequent-Pattern Based Classification
● Rough Set Theory
● Fuzzy Logic
➢ Decision Trees: A decision tree is a flowchart-like tree structure, where
each internal node represents a test on an attribute value, each branch denotes an
outcome of the test, and the leaves represent classes or class distributions.
Decision trees can be easily converted into classification rules. Decision
tree induction is a nonparametric approach for building classification
models: it requires no prior assumptions about the probability distributions
of the class and the other attributes. Decision trees, especially smaller
ones, are relatively easy to interpret, and their accuracy is comparable to
that of other classification techniques on many simple data sets. They
provide an expressive representation for learning discrete-valued functions;
however, they do not generalize well to certain types of Boolean problems.
This figure was generated on the Iris data set of the UCI Machine Learning
Repository. Three class labels are present in the data set:
Setosa, Versicolor, and Virginica.
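The short sketch below shows how such a tree can be induced on the Iris data and printed as rules. It assumes Python with scikit-learn; the exact tree depends on the library version and parameters, so it will not match the referenced figure exactly.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Each internal node tests one attribute value; each leaf is a class label.
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print(export_text(clf, feature_names=iris.feature_names))  # rule-like view of the tree
print("Test accuracy:", clf.score(X_test, y_test))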
➢ K-NN Classifier: To classify a new document, its nearest neighbors among
the training documents are examined; if a large enough proportion of them
belong to a particular class, the new document is also assigned to that
class, otherwise not. Moreover, finding the nearest neighbors can be sped up
using traditional indexing methods.
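A minimal k-NN sketch (assuming Python with scikit-learn) of the majority vote just described: a new observation receives the class held by most of its k closest training points.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# If enough of the 5 nearest neighbors share a class, that class wins the vote.
print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))  # -> class 0 (Setosa)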
3. Prediction
4. Clustering
5. Regression
6. Neural Networks
Recent developments include the extraction of rules from trained neural
networks; these advances contribute to the usefulness of neural networks for
classification in data mining. An artificial neural network is an adaptive
system that changes its structure based on the information that flows through
the network during a learning phase. The ANN relies on the principle of
learning by example. There are two classical types of neural networks: the
perceptron and the multilayer perceptron.
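As an illustration of learning by example, here is a minimal perceptron sketch in plain Python/NumPy; the toy AND data set, learning rate, and epoch count are assumptions chosen only for demonstration.

import numpy as np

# Tiny linearly separable toy data (the logical AND function).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)  # weights
b = 0.0          # bias
lr = 0.1         # learning rate

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        # Perceptron rule: adjust the structure (weights) only on mistakes.
        w += lr * (target - pred) * xi
        b += lr * (target - pred)

print(w, b)  # learned weights and bias separating the two classes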
7. Outlier Detection
A database may contain data objects that do not comply with the general
behavior or model of the data; these objects are outliers. The analysis of
outlier data is known as outlier mining. An outlier may be detected using
statistical tests that assume a distribution or probability model for the
data, or using distance measures, where objects having only a small fraction
of "close" neighbors in space are considered outliers. Rather than relying
on statistical or distance measures, deviation-based techniques identify
outliers by examining differences in the principal characteristics of
objects in a group.
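A minimal sketch of the statistical approach described above, assuming NumPy and roughly normally distributed data; the 3-standard-deviation cutoff is a common convention, not a fixed rule.

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(10, 0.5, 200), [55.0]])  # 55.0 is planted as an outlier

# Flag values lying more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])  # -> [55.]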
8. Genetic Algorithm
Data mining turns raw data into meaningful information. Data Mining can be
applied to any type of data, e.g. Data Warehouses, Transactional Databases,
Relational Databases, Multimedia Databases, Spatial Databases, Time-series
Databases, and the World Wide Web.
● Data mining provides competitive advantages in the knowledge economy. It
does this by providing the maximum knowledge needed to rapidly make
valuable business decisions despite the enormous amounts of available data.
● Many measurable benefits have been achieved in different application areas
through data mining. Realizing those benefits, however, means confronting
several significant challenges:
1] Data Quality
The quality of data used in data mining is one of the most significant
challenges. The accuracy, completeness, and consistency of the data affect
the accuracy of the results obtained. The data may contain errors, omissions,
duplications, or inconsistencies, which may lead to inaccurate results.
2] Data Complexity
Data complexity refers to the vast amounts of data generated by various
sources, such as sensors, social media, and the internet of things (IoT). The
complexity of the data may make it challenging to process, analyze, and
understand. In addition, the data may be in different formats, making it
challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques
such as clustering, classification, and association rule mining. These
techniques help to identify patterns and relationships in the data, which can
then be used to gain insights and make predictions.
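As a sketch of one of the techniques named above, the following applies k-means clustering (assuming scikit-learn) to synthetic unlabeled points standing in for a complex data set.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data with three hidden groups.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)  # cluster assignment for every point

print(km.cluster_centers_)  # one center per discovered group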
4] Scalability
Data mining algorithms must be scalable to handle large datasets efficiently.
As the size of the dataset increases, the time and computational resources
required to perform data mining operations also increase. Moreover, the
algorithms must be able to handle streaming data, which is generated
continuously and must be processed in real-time.
To address this challenge, data mining practitioners use distributed
computing frameworks such as Hadoop and Spark. These frameworks
distribute the data and processing across multiple nodes, making it possible
to process large datasets quickly and efficiently.
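A minimal PySpark sketch of this idea: the same aggregation runs unchanged whether the data fits on one machine or is partitioned across a cluster. The file name sales.csv and its region column are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scalable-mining").getOrCreate()

# Spark splits the file into partitions and reads them in parallel.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# The group-by is executed across the worker nodes, then combined.
df.groupBy("region").count().show()

spark.stop()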
5] Interpretability
Data mining algorithms can produce complex models that are difficult to
interpret. This is because the algorithms use a combination of statistical and
mathematical techniques to identify patterns and relationships in the data.
Moreover, the models may not be intuitive, making it challenging to
understand how the model arrived at a particular conclusion.
To address this challenge, data mining practitioners use visualization
techniques to represent the data and the models visually. Visualization makes
it easier to understand the patterns and relationships in the data and to
identify the most important variables.
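One simple visualization of this kind, assuming matplotlib and scikit-learn: plot which input variables a fitted tree model relied on most, which is usually easier to read than the model itself.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# A bar chart of feature importances highlights the most influential variables.
plt.barh(iris.feature_names, clf.feature_importances_)
plt.xlabel("importance")
plt.tight_layout()
plt.show()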
6] Ethics
Data mining raises ethical concerns related to the collection, use, and
dissemination of data. The data may be used to discriminate against certain
groups, violate privacy rights, or perpetuate existing biases. Moreover, data
mining algorithms may not be transparent, making it challenging to detect
biases or discrimination.
Several key trends are shaping the future of data mining, with potential
implications for businesses and society:
2. Federated Learning
Federated learning trains models across decentralized devices or data silos
without moving the raw data to a central server.
3. Augmented Analytics
Augmented analytics uses machine learning to automate data preparation,
insight discovery, and explanation for analysts.
4. Explainable AI
Explainable AI covers techniques that make a model's predictions transparent
and easier to audit.
5. Edge Computing
Edge computing involves processing data closer to its source, such as IoT
devices or sensors, rather than relying solely on centralized cloud
infrastructure. In the future, edge computing will enable organizations to
perform data mining tasks directly on the edge devices, enabling faster
insights and decision-making in scenarios where real-time processing is
critical.
➢ Performance Issues
● Mining Social Media Data: This is an interesting research area, because
data from platforms such as Twitter and Facebook can be analyzed to derive
interesting trends and patterns.
● Mining Spatiotemporal Data: Data related to both space and time is
spatiotemporal data. Spatiotemporal data mining retrieves interesting
patterns and knowledge from such data; it helps, for example, to estimate
the value of land, date rocks and precious stones, and predict weather
patterns. Spatiotemporal data mining has many practical applications, such
as GPS in mobile phones, timers, Internet-based map services, weather
services, satellites, RFID tags, and sensors.
● Mining Data Streams: Stream data changes dynamically; it is noisy and
inconsistent, and it contains multidimensional features of different data
types, so it is typically stored in NoSQL database systems. The volume of
stream data is very high, which is the main challenge for effective stream
mining. Mining data streams involves tasks such as clustering, outlier
analysis, and the online detection of rare events, as in the sketch below.
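A minimal stream-mining sketch of online rare-event detection: Welford's algorithm keeps a running mean and variance in constant memory, so each arriving value can be checked without storing the stream. The 4-standard-deviation threshold and warm-up length are assumptions.

import math

n, mean, m2 = 0, 0.0, 0.0

def observe(x):
    """Update running statistics with one stream element; return True if x looks rare."""
    global n, mean, m2
    rare = False
    if n > 10:  # wait for a short warm-up before judging
        std = math.sqrt(m2 / (n - 1))
        rare = std > 0 and abs(x - mean) > 4 * std
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)  # Welford's update for the running variance
    return rare

stream = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1, 9.9, 10.0, 10.2, 99.0]
print([x for x in stream if observe(x)])  # -> [99.0]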
2. Transactional Databases
● A transactional database is a collection of data organized by time
stamps, dates, etc., to represent transactions in databases.
● This type of database can roll back or undo an operation when a
transaction is not completed or committed, as sketched after this list.
● It is a highly flexible system where users can modify information without
changing any sensitive information.
● It follows the ACID properties of DBMS.
● Application: Banking, Distributed systems, Object databases, etc.
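A minimal roll-back sketch using Python's built-in sqlite3 module; the accounts table and the simulated failure are illustrative. If anything fails mid-transfer, rollback() undoes the partial work, so the database never exposes a half-completed transaction.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    raise RuntimeError("simulated failure before the second update")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()
except Exception:
    conn.rollback()  # the partial debit is undone

print(conn.execute("SELECT * FROM accounts").fetchall())
# -> [('alice', 100.0), ('bob', 50.0)]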
3. Multimedia Databases
● Multimedia databases consist of audio, video, image, and text media.
● They can be stored in object-oriented databases.
● They are used to store complex information in pre-specified formats.
● Application: Digital libraries, video-on-demand, news-on-demand,
musical databases, etc.
4. Spatial Database
● Store geographical information.
● Store data in the form of coordinates, topology, lines, polygons, etc.
● Application: Maps, Global positioning, etc.
5. Time-series Databases
● Time-series databases contain stock exchange data and user-logged
activities.
● They handle arrays of numbers indexed by time, date, etc.
● They require real-time analysis.
● Examples: eXtremeDB, Graphite, InfluxDB, etc.
6. WWW
● The WWW (World Wide Web) is a collection of documents and resources such
as audio, video, and text, identified by Uniform Resource Locators (URLs),
linked through HTML pages, and accessible via the Internet.
● It is the most heterogeneous repository, as it collects data from
multiple sources.
● It is dynamic in nature, as the volume of data is continuously increasing
and changing.
● Application: Online shopping, Job search, Research, studying, etc.
★DATA WAREHOUSES:
A data warehouse is a subject-oriented, integrated, time-variant, and
non-volatile collection of data. Non-volatile means that once data is in the
data warehouse, it will not change; historical data in a data warehouse
should never be altered.
● The top-down approach starts with the overall design and planning. It
is useful in cases where the technology is mature and well known, and
where the business problems that must be solved are clear and well
understood.
● The bottom-up approach starts with experiments and prototypes. This
is useful in the early stage of business modeling and technology
development. It allows an organization to move forward at
considerably less expense and to evaluate the benefits of the
technology before making significant commitments.
● In the combined approach, an organization can exploit the planned
and strategic nature of the top-down approach while retaining the
rapid implementation and opportunistic application of the bottom-up
approach.
The warehouse design process consists of the following steps:
● Choose a business process to model, for example, orders, invoices,
shipments, inventory, account administration, sales, or the general
ledger. If the business process is organizational and involves multiple
complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be
chosen.
Tier-1:
The bottom tier is a warehouse database server, which is almost always a
relational database system. This tier also contains a metadata repository,
which stores information about the data warehouse and its contents.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either
a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis,
prediction, and so on).
1. Enterprise warehouse: collects all of the information about subjects
spanning the entire organization.
2. Data mart: contains a subset of corporate-wide data that is of value to a
specific group of users.
3. Virtual warehouse: a set of views over operational databases, where only
selected summary views may be materialized.
➢Metadata Repository:
Metadata are data about data. When used in a data warehouse, metadata are
the data that define warehouse objects. Metadata are created for the data
names and definitions of the given warehouse. Additional metadata are
created and captured for time-stamping any extracted data, recording the
source of the extracted data, and noting missing fields that have been added
by data cleaning or integration processes.