
DATA MINING (03610335)

UNIT-1
FUNDAMENTALS OF DATA MINING

★HISTORY:
● The origins of data mining can be traced back to the 1950s when the first
computers were developed and used for scientific and mathematical
research. As the capabilities of computers and data storage systems
improved, researchers began to explore the use of computers to analyze and
extract insights from large data sets.
● One of the earliest and most influential pioneers of data mining was Dr.
Herbert Simon, a Nobel laureate in economics who is widely regarded as one
of the founding figures of artificial intelligence. In the 1950s and 1960s, Simon
and his colleagues developed a number of algorithms and techniques for
extracting useful information and insights from data, including clustering,
classification, and decision trees.
● In the 1980s and 1990s, the field of data mining continued to evolve, and
new algorithms and techniques were developed to address the challenges of
working with large and complex data sets. The development of data mining
software and platforms, such as SAS, SPSS, and RapidMiner, made it easier
for organizations to apply data mining techniques to their data.
● In recent years, the availability of large data sets and the growth of cloud
computing and big data technologies have made data mining even more
powerful and widely used. Today, data mining is a crucial tool for many
organizations and industries and is used to extract valuable insights and
information from data sets in a wide range of domains.

★WHAT IS DATA MINING?


● Data mining is the process of discovering patterns and relationships in large
datasets using techniques such as machine learning and statistical analysis.
The goal of data mining is to extract useful information from large datasets
and use it to make predictions or inform decision-making. Data mining is
important because it allows organizations to uncover insights and trends in
their data that would be difficult or impossible to discover manually.
★DATA MINING TECHNIQUES:
● Data mining refers to extracting or mining knowledge from large amounts of
data. In other words, data mining is the science, art, and technology of
exploring large and complex bodies of data in order to discover useful
patterns. Theoreticians and practitioners are continually seeking improved
techniques to make the process more efficient, cost-effective, and accurate.
Many other terms carry a similar or slightly different meaning to data mining,
such as knowledge mining from data, knowledge extraction, data/pattern
analysis, and data dredging.
● Data mining is often treated as a synonym for another popularly used term,
Knowledge Discovery from Data, or KDD. Others view data mining as
simply an essential step in the process of knowledge discovery, in which
intelligent methods are applied in order to extract data patterns.
● Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in
Databases” in 1989. However, the term ‘data mining’ became more popular
in the business and press communities. Currently, Data Mining and
Knowledge Discovery are used interchangeably.
● Nowadays, data mining is used in almost all places where a large amount of
data is stored and processed.

● Data Mining Techniques:

1. Association

● Association analysis is the finding of association rules showing
attribute-value conditions that occur frequently together in a given set of
data. Association analysis is widely used for market basket or transaction
data analysis. Association rule mining is a significant and exceptionally
dynamic area of data mining research. One method of association-based
classification, called associative classification, consists of two steps. In the
first step, association rules are generated using a modified version of the
standard association rule mining algorithm known as Apriori. The second
step constructs a classifier based on the association rules discovered.
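As a brief illustration (not part of the original notes), here is a minimal pure-Python sketch of the Apriori idea on a hypothetical market-basket data set; the transactions and the support and confidence thresholds are made up for illustration:

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_support, min_confidence = 0.6, 0.7   # illustrative thresholds

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent_items = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Level 2: candidate pairs built only from frequent items (the Apriori pruning idea).
candidates = {a | b for a, b in combinations(frequent_items, 2)}
frequent_pairs = [p for p in candidates if support(p) >= min_support]

# Derive association rules A -> B from the frequent pairs.
for pair in frequent_pairs:
    for item in pair:
        antecedent, consequent = frozenset([item]), pair - {item}
        confidence = support(pair) / support(antecedent)
        if confidence >= min_confidence:
            print(set(antecedent), "->", set(consequent),
                  f"(support={support(pair):.2f}, confidence={confidence:.2f})")
```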

2. Classification

● Classification is the process of finding a set of models (or functions) that
describe and distinguish data classes or concepts, for the purpose of being
able to use the model to predict the class of objects whose class label is
unknown. The derived model is based on the analysis of a set of training data
(i.e., data objects whose class label is known). The derived model may be
represented in various forms, such as classification (if-then) rules, decision
trees, and neural networks. Data mining has different types of classifiers:

● Decision Tree
● SVM (Support Vector Machine)
● Generalized Linear Models
● Bayesian Classification

● Classification by Backpropagation
● K-NN Classifier
● Rule-Based Classification
● Frequent-Pattern Based Classification
● Rough set theory
● Fuzzy Logic
➢ Decision Trees: A decision tree is a flow-chart-like tree structure, where
each node represents a test on an attribute value, each branch denotes an
outcome of a test, and tree leaves represent classes or class distributions.
Decision trees can be easily transformed into classification rules. Decision
tree induction is a nonparametric methodology for building classification
models. In other words, it does not require any prior assumptions regarding
the type of probability distribution satisfied by the class and other attributes.
Decision trees, especially smaller trees, are relatively easy to interpret.
Their accuracy is also comparable to that of other classification techniques
for many simpler data sets. They provide an expressive representation for
learning discrete-valued functions. However, they do not generalize well to
certain types of Boolean problems.

The decision-tree figure referenced here was generated on the Iris data set
from the UCI Machine Learning Repository. Three class labels are available
in the data set: Setosa, Versicolor, and Virginica.
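A minimal sketch of the same idea, assuming scikit-learn is available; it fits a small decision tree to the Iris data and prints the learned if-then rules:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # a small tree is easy to interpret
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
# Decision trees can be transformed directly into if-then classification rules.
print(export_text(tree, feature_names=iris.feature_names))
```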

➢ Support Vector Machine (SVM) Classifier Method: The Support Vector
Machine is a supervised learning method used for classification and also for
regression. When the output of the support vector machine is a continuous
value, the learning methodology is said to perform regression; when the
learning methodology predicts a category label of the input object, it is
called classification. The independent variables may or may not be
quantitative. Kernel functions transform linearly non-separable data in one
domain into another domain where the instances become linearly separable.
Kernel functions may be linear, quadratic, Gaussian, or anything else that
achieves this specific purpose. A linear classification technique is a
classifier that uses a linear function of its inputs to base its decision on.
Applying the kernel function arranges the data instances within the
multidimensional space in such a way that there is a hyperplane separating
instances of one class from those of another. The advantage of Support
Vector Machines is that they can make use of certain kernels to transform the
problem, so that linear classification techniques can be applied to nonlinear
data. Once we manage to divide the data into two different classes, the aim
is to find the best hyperplane that separates the two kinds of instances.
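A minimal sketch, assuming scikit-learn is available, that compares a linear kernel with an RBF (Gaussian) kernel on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compare a linear kernel with an RBF (Gaussian) kernel.
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    model.fit(X_train, y_train)
    print(kernel, "kernel accuracy:", model.score(X_test, y_test))
```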
➢ Generalized Linear Models: Generalized Linear Models (GLM) is a
statistical technique for linear modeling. GLM provides extensive coefficient
statistics and model statistics, as well as row diagnostics. It also supports
confidence bounds.
➢ Bayesian Classification: A Bayesian classifier is a statistical classifier.
Bayesian classifiers can predict class membership probabilities, for instance,
the probability that a given sample belongs to a particular class. Bayesian
classification is based on Bayes' theorem. Studies comparing classification
algorithms have found a simple Bayesian classifier, known as the naive
Bayesian classifier, to be comparable in performance with decision tree and
neural network classifiers. Bayesian classifiers have also displayed high
accuracy and speed when applied to large databases. Naive Bayesian
classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is termed
class conditional independence. It is made to simplify the calculations
involved, and in this sense is considered “naive”. Bayesian belief networks
are graphical models which, unlike naive Bayesian classifiers, allow the
representation of dependencies among subsets of attributes. Bayesian belief
networks can also be used for classification.
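A minimal sketch, assuming scikit-learn is available, showing that a naive Bayesian classifier outputs class-membership probabilities rather than only hard labels:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)

# The classifier returns class-membership probabilities, not just hard labels.
print("Class probabilities for the first test sample:",
      nb.predict_proba(X_test[:1])[0].round(3))
print("Test accuracy:", nb.score(X_test, y_test))
```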

➢ Classification by Backpropagation: A backpropagation network learns by
iteratively processing a set of training samples, comparing the network's
prediction for each sample with the actual known class label. For each
training sample, the weights are modified so as to minimize the mean squared
error between the network's prediction and the actual class. These
modifications are made in the “backward” direction, i.e., from the output
layer, through each hidden layer, down to the first hidden layer (hence the
name backpropagation). Although it is not guaranteed, in general the weights
will eventually converge, and the learning process stops.
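A minimal sketch, assuming scikit-learn is available; its MLPClassifier trains a small multilayer network with backpropagation-style weight updates (the data set and layer size are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Weights are updated backwards, layer by layer, to reduce the prediction error.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)
print("Test accuracy:", mlp.score(X_test, y_test))
```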
➢ K-Nearest Neighbor (K-NN) Classifier Method: The k-nearest neighbor
(K-NN) classifier is considered an example-based classifier, which means
that the training documents are used for comparison rather than an explicit
class representation, such as the class profiles used by other classifiers. As
such, there is no real training phase. When a new document has to be
classified, the k most similar documents (neighbors) are found, and if a large
enough proportion of them are assigned to a particular class, the new
document is also assigned to that class; otherwise it is not. Additionally,
finding the nearest neighbors can be sped up using traditional classification
strategies.
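A minimal from-scratch sketch of the K-NN idea on a tiny hypothetical 2-D data set, using Euclidean distance and a majority vote; note that there is no training step at all:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny hypothetical 2-D data set with two classes.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, query=(2, 2)))   # -> "A"
print(knn_predict(points, labels, query=(6, 5)))   # -> "B"
```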

➢ Rule-Based Classification: Rule-based classification represents the
knowledge in the form of if-then rules. A rule is assessed according to its
accuracy and coverage. If more than one rule is triggered, we need conflict
resolution in rule-based classification. Conflict resolution can be performed
on three different parameters: size ordering, class-based ordering, and
rule-based ordering. There are some advantages of a rule-based classifier
(a minimal sketch follows this list):
● Rules are easier to understand than a large tree.
● Rules are mutually exclusive and exhaustive.
● Each attribute-value pair along a path forms a conjunction; each leaf
holds the class prediction.
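A minimal sketch of a rule-based classifier expressed as ordered if-then rules over hypothetical attributes:

```python
def classify_animal(record):
    """Toy rule-based classifier written as ordered if-then rules."""
    if record.get("lays_eggs") and record.get("has_feathers"):
        return "bird"
    if record.get("has_fur") and record.get("gives_milk"):
        return "mammal"
    if record.get("lives_in_water") and record.get("has_gills"):
        return "fish"
    return "unknown"   # default rule when no other rule fires

print(classify_animal({"lays_eggs": True, "has_feathers": True}))   # -> bird
print(classify_animal({"has_fur": True, "gives_milk": True}))       # -> mammal
```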
➢ Frequent-Pattern Based Classification: Frequent pattern discovery

(or FP discovery, FP mining, or Frequent itemset mining) is part of data


mining. It describes the task of finding the most frequent and relevant
patterns in large datasets. The idea was first presented for mining transaction
databases. Frequent patterns are defined as subsets (item sets, subsequences,
or substructures) that appear in a data set with a frequency no less than a
user-specified or auto-determined threshold.
➢ Rough Set Theory: Rough set theory can be used for classification to
discover structural relationships within imprecise or noisy data. It applies to
discrete-valued attributes; continuous-valued attributes must therefore be
discretized prior to their use. Rough set theory is based on the establishment
of equivalence classes within the given training data. All the data samples
forming an equivalence class are indiscernible, that is, the samples are
identical with respect to the attributes describing the data. Rough sets can
also be used for feature reduction (where attributes that do not contribute
towards the classification of the given training data can be identified and
removed) and relevance analysis (where the contribution or significance of
each attribute is assessed with respect to the classification task). The
problem of finding the minimal subsets (reducts) of attributes that can
describe all the concepts in the given data set is NP-hard. However,
algorithms to decrease the computational intensity have been proposed. In
one method, for example, a discernibility matrix is used which stores the
differences between attribute values for each pair of data samples. Rather
than searching the entire training set, the matrix is searched to detect
redundant attributes.

➢ Fuzzy Logic: Rule-based systems for classification have the disadvantage
that they involve sharp cut-offs for continuous attributes. Fuzzy logic is
valuable for data mining frameworks performing grouping/classification. It
provides the benefit of working at a high level of abstraction. In general, the
use of fuzzy logic in rule-based systems involves the following (a minimal
sketch follows this list):
● Attribute values are converted to fuzzy values.
● For a given new data set/example, more than one fuzzy rule may
apply. Every applicable rule contributes a vote for membership in the
categories. Typically, the truth values for each predicted category are
summed.
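A minimal sketch of the idea, using hypothetical membership cut-offs for an income attribute; each applicable fuzzy rule contributes a vote and the strongest summed vote wins:

```python
def fuzzy_income(income):
    """Map a crisp income value to fuzzy membership degrees (hypothetical cut-offs)."""
    low = max(0.0, min(1.0, (50_000 - income) / 20_000))
    high = max(0.0, min(1.0, (income - 40_000) / 20_000))
    return {"low": round(low, 2), "high": round(high, 2)}

def classify_credit(income):
    m = fuzzy_income(income)
    # Each applicable fuzzy rule votes for a category; the truth values are summed.
    votes = {"reject": m["low"], "approve": m["high"]}
    return max(votes, key=votes.get), votes

print(fuzzy_income(48_000))      # the value is partly "low" and partly "high"
print(classify_credit(48_000))   # -> ('approve', {'reject': 0.1, 'approve': 0.4})
```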

3. Prediction

Data prediction is a two-step process, similar to that of data classification.
However, for prediction we do not use the term "class label attribute"
because the attribute whose values are being predicted is continuous-valued
(ordered) rather than categorical (discrete-valued and unordered). The
attribute can be referred to simply as the predicted attribute. Prediction can
be viewed as the construction and use of a model to assess the class of an
unlabeled object, or to assess the value or value ranges of an attribute that a
given object is likely to have.

4. Clustering

Unlike classification and prediction, which analyze class-labeled data
objects or attributes, clustering analyzes data objects without consulting a
known class label. In general, the class labels do not exist in the training
data simply because they are not known to begin with. Clustering can be
used to generate such labels. The objects are clustered based on the
principle of maximizing intra-class similarity and minimizing inter-class
similarity. That is, clusters of objects are formed so that objects within a
cluster have high similarity to one another, but are very dissimilar to objects
in other clusters. Each cluster that is generated can be viewed as a class of
objects, from which rules can be derived. Clustering can also facilitate
taxonomy formation, that is, the organization of observations into a hierarchy
of classes that group similar events together.
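A minimal sketch, assuming scikit-learn is available, that clusters the Iris measurements without ever looking at the class labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)   # the class labels are ignored entirely

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Each generated cluster can be treated as a class of objects.
print("Cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])
print("Cluster centers:\n", kmeans.cluster_centers_.round(2))
```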

5. Regression

Regression can be defined as a statistical modeling method in which
previously obtained data are used to predict a continuous quantity for new
observations. This classifier is also known as the continuous-value
classifier. There are two types of regression models: linear regression and
multiple linear regression models.
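A minimal sketch, assuming scikit-learn is available, fitting a linear regression to hypothetical spend-versus-sales figures and predicting a continuous value for a new observation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical past observations: advertising spend (in $1000s) versus sales.
spend = np.array([[10], [15], [20], [25], [30]])
sales = np.array([25, 32, 41, 48, 55])

model = LinearRegression().fit(spend, sales)
print("Predicted sales for a $22k spend:", model.predict([[22]]).round(1)[0])
```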

6. Artificial Neural network (ANN) Classifier Method

An artificial neural network (ANN), also referred to as simply a "neural
network" (NN), is a computational model inspired by biological neural
networks. It consists of an interconnected collection of artificial neurons. A
neural network is a set of connected input/output units where each
connection has a weight associated with it. During the learning phase, the
network learns by adjusting the weights so as to be able to predict the
correct class label of the input samples. Neural network learning is also
referred to as connectionist learning due to the connections between units.
Neural networks involve long training times and are therefore more
appropriate for applications where this is feasible. They require a number of
parameters that are typically best determined empirically, such as the
network topology or "structure". Neural networks have been criticized for
their poor interpretability, since it is difficult for humans to interpret the
symbolic meaning behind the learned weights. These features initially made
neural networks less desirable for data mining.
The advantages of neural networks, however, include their high tolerance to
noisy data as well as their ability to classify patterns on which they have not
been trained. In addition, several algorithms have recently been developed
for the extraction of rules from trained neural networks. These factors
contribute to the usefulness of neural networks for classification in data
mining.
An artificial neural network is an adaptive system that changes its structure
based on information that flows through the network during a learning phase.
The ANN relies on the principle of learning by example. There are two
classical types of neural networks: the perceptron and the multilayer
perceptron.

7. Outlier Detection

A database may contain data objects that do not comply with the general
behavior or model of the data. These data objects are outliers. The
investigation of outlier data is known as outlier mining. An outlier may be
detected using statistical tests which assume a distribution or probability
model for the data, or using distance measures where objects having only a
small fraction of "close" neighbors in space are considered outliers. Rather
than using statistical or distance measures, deviation-based techniques
identify outliers by examining differences in the principal characteristics of
objects in a group.
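A minimal sketch of a simple statistical outlier test on hypothetical sensor readings; values far from the mean are flagged, and the 2.5-standard-deviation cut-off is an arbitrary rule of thumb:

```python
import numpy as np

# Hypothetical sensor readings; one value clearly deviates from the rest.
values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 25.4, 10.2, 9.7])

z_scores = (values - values.mean()) / values.std()
outliers = values[np.abs(z_scores) > 2.5]   # rule-of-thumb cut-off of 2.5 standard deviations
print("Outliers:", outliers)                # -> [25.4]
```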

8. Genetic Algorithm

Genetic algorithms are adaptive heuristic search algorithms that belong to
the larger class of evolutionary algorithms. Genetic algorithms are based on
the ideas of natural selection and genetics. They are an intelligent
exploitation of random search, provided with historical data, to direct the
search into the region of better performance in the solution space. They are
commonly used to generate high-quality solutions for optimization and
search problems. Genetic algorithms simulate the process of natural
selection, which means the individuals that can adapt to changes in their
environment are able to survive, reproduce, and go on to the next generation.
In simple words, they simulate "survival of the fittest" among individuals of
consecutive generations to solve a problem. Each generation consists of a
population of individuals, and each individual represents a point in the
search space and a possible solution. Each individual is represented as a
string of characters/integers/floats/bits. This string is analogous to a
chromosome.
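A minimal sketch of a genetic algorithm in pure Python; the fitness function (counting 1-bits), population size, mutation rate, and other settings are purely illustrative:

```python
import random

random.seed(0)
LENGTH, POP_SIZE, GENERATIONS = 20, 30, 40

def fitness(chrom):
    return sum(chrom)                            # toy objective: maximize the number of 1-bits

def crossover(a, b):
    cut = random.randint(1, LENGTH - 1)          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.02):
    return [bit ^ 1 if random.random() < rate else bit for bit in chrom]

population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]
for generation in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)   # "survival of the fittest"
    if fitness(population[0]) == LENGTH:         # perfect chromosome found
        break
    parents = population[:POP_SIZE // 2]         # selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children              # next generation

population.sort(key=fitness, reverse=True)
print("generations:", generation + 1, "best fitness:", fitness(population[0]))
```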

★APPLICATION OF DATA MINING:


● Data is a set of discrete objective facts about an event or a process that have
little use by themselves unless converted into information. We have been
collecting numerous data, from simple numerical measurements and text
documents to more complex information such as spatial data, multimedia
channels, and hypertext documents.
● Nowadays, large quantities of data are being accumulated. The amount of
data collected is said to almost double every year. To extract data or seek
knowledge from this massive data, data mining techniques are used. Data
mining is used in almost all places where a large amount of data is stored
and processed. For example, banks typically use data mining to find out
which of their prospective customers could be interested in credit cards,
personal loans, or insurance. Since banks have the transaction details and
detailed profiles of their customers, they analyze all this data and try to find
patterns that help them predict that certain customers could be interested in
personal loans, etc.
● Basically, the motive behind mining data, whether commercial or scientific,
is the same – the need to find useful information in data to enable better
decision-making or a better understanding of the world around us.
● “Extraction of interesting information or patterns from data in large
databases is known as data mining.”
● Technically, data mining is the computational process of analyzing data from
different perspectives, dimensions, angles and categorizing/summarizing it

into meaningful information. Data Mining can be applied to any type of data
e.g. Data Warehouses, Transactional Databases, Relational Databases,
Multimedia Databases, Spatial Databases, Time-series Databases, World
Wide Web.
● Data mining provides competitive advantages in the knowledge economy. It
does this by providing the maximum knowledge needed to rapidly make
valuable business decisions despite the enormous amounts of available data.
● There are many measurable benefits that have been achieved in different
application areas from data mining. So, let’s discuss different applications of
Data Mining:

➢ Scientific Analysis: Scientific simulations generate large volumes of data
every day. This includes data collected from nuclear laboratories, data about
human psychology, etc. Data mining techniques are capable of analyzing
these data. Now we can capture and store new data faster than we can
analyze the old data already accumulated. Examples of scientific analysis:
● Sequence analysis in bioinformatics
● Classification of astronomical objects
● Medical decision support.
➢ Intrusion Detection: A network intrusion refers to any unauthorized
activity on a digital network. Network intrusions often involve stealing
valuable network resources. Data mining techniques play a vital role in
intrusion detection and in finding network attacks and anomalies. These
techniques help in selecting and refining useful and relevant information
from large data sets. Data mining techniques help classify relevant data for
an intrusion detection system. An intrusion detection system generates
alarms about foreign invasions observed in the network traffic. For
example:
● Detect security violations
● Misuse Detection
● Anomaly Detection

➢ Business Transactions: Every transaction in the business world is
recorded for perpetuity. Such transactions are usually time-related and can
be inter-business deals or intra-business operations. The effective and
timely use of this data for competitive decision-making is definitely one of
the most important problems to solve for businesses that struggle to survive
in a highly competitive world. Data mining helps to analyze these business
transactions and identify marketing approaches and decision-making
strategies. Examples:
● Direct mail targeting
● Stock trading
● Customer segmentation
● Churn prediction (Churn prediction is one of the most popular Big Data use
cases in business)
➢ Market Basket Analysis: Market basket analysis is a technique based on a
careful study of the purchases made by a customer in a supermarket. This
concept identifies patterns of items frequently purchased together by
customers. The analysis helps companies promote deals, offers, and sales,
and data mining techniques help to achieve this analysis task. Examples:
● Data mining concepts are in use for Sales and marketing to provide better
customer service, to improve cross-selling opportunities, to increase
direct mail response rates.
● Customer Retention in the form of pattern identification and prediction of
likely defections is possible by Data mining.
● Risk Assessment and Fraud area also use the data-mining concept for
identifying inappropriate or unusual behavior etc.

➢ Education: For analyzing the education sector, data mining uses
Educational Data Mining (EDM) methods. These methods generate patterns
that can be used both by learners and educators. By using EDM we can
perform educational tasks such as:
● Predicting students' admission to higher education
● Profiling students
● Predicting student performance
● Evaluating teachers' teaching performance
● Curriculum development
● Predicting student placement opportunities
➢ Research: Data mining techniques can perform prediction, classification,
clustering, association, and grouping of data with precision in the research
area. Rules generated by data mining help in finding results. In most
technical research in data mining, we create a training model and a testing
model. The train/test methodology is a strategy to measure the precision of
the proposed model. It is called train/test because we split the data set into
two sets: a training data set and a testing data set. The training data set is
used to build the model, whereas the testing data set is used to evaluate it.
Examples:

● Classification of uncertain data.


● Information-based clustering.
● Decision support system
● Web Mining
● Domain-driven data mining

● IoT (Internet of Things) and cybersecurity
● Smart farming with IoT (Internet of Things)
➢ Healthcare and Insurance: The pharmaceutical sector can examine its
recent sales force activity and its outcomes to improve the targeting of
high-value physicians and figure out which promotional activities will have
the best effect in the upcoming months, whereas in the insurance sector, data
mining can help to predict which customers will buy new policies, identify
behavior patterns of risky customers, and identify fraudulent behavior of
customers.
● Claims analysis, i.e., which medical procedures are claimed together.
● Identify successful medical therapies for different illnesses.
● Characterize patient behavior to predict office visits.
➢ Transportation: A diversified transportation company with a large direct
sales force can apply data mining to identify the best prospects for its
services. A large consumer merchandise organization can apply data mining
to improve its sales process to retailers.
● Determine the distribution schedules among outlets.
● Analyze loading patterns.
➢ Financial/Banking Sector: A credit card company can leverage its vast
warehouse of customer transaction data to identify customers most likely to
be interested in a new credit product.
● Credit card fraud detection.
● Identify ‘Loyal’ customers.
● Extraction of information related to customers.
● Determine credit card spending by customer groups.

★CHALLENGES OF DATA MINING:


● Data mining, the process of extracting knowledge from data, has become
increasingly important as the amount of data generated by individuals,
organizations, and machines has grown exponentially. However, data mining
is not without its challenges. In this article, we will explore some of the main
challenges of data mining.

1] Data Quality
The quality of data used in data mining is one of the most significant
challenges. The accuracy, completeness, and consistency of the data affect
the accuracy of the results obtained. The data may contain errors, omissions,
duplications, or inconsistencies, which may lead to inaccurate results.

Moreover, the data may be incomplete, meaning that some attributes or


values are missing, making it challenging to obtain a complete
understanding of the data.
Data quality issues can arise due to a variety of reasons, including data entry
errors, data storage issues, data integration problems, and data transmission
errors. To address these challenges, data mining practitioners must apply
data cleaning and data preprocessing techniques to improve the quality of
the data. Data cleaning involves detecting and correcting errors, while data

preprocessing involves transforming the data to make it suitable for data
mining.
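A minimal data-cleaning sketch, assuming pandas is available; the customer records and the quality problems in them are hypothetical:

```python
import pandas as pd

# Hypothetical raw customer records with typical quality problems.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, None, None, 290, 41],       # missing values and an impossible age
    "city":        ["Pune", "pune", "pune", "Delhi", None],
})

clean = (raw
         .drop_duplicates(subset="customer_id")          # remove duplicated records
         .assign(city=lambda d: d["city"].str.title()))  # fix inconsistent casing

clean.loc[clean["age"] > 120, "age"] = None              # treat impossible ages as missing
clean["age"] = clean["age"].fillna(clean["age"].median())  # impute missing ages
print(clean)
```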

2] Data Complexity
Data complexity refers to the vast amounts of data generated by various
sources, such as sensors, social media, and the internet of things (IoT). The
complexity of the data may make it challenging to process, analyze, and
understand. In addition, the data may be in different formats, making it
challenging to integrate into a single dataset.
To address this challenge, data mining practitioners use advanced techniques
such as clustering, classification, and association rule mining. These
techniques help to identify patterns and relationships in the data, which can
then be used to gain insights and make predictions.

3] Data Privacy and Security


Data privacy and security is another significant challenge in data mining. As
more data is collected, stored, and analyzed, the risk of data breaches and
cyber-attacks increases. The data may contain personal, sensitive, or
confidential information that must be protected. Moreover, data privacy
regulations such as GDPR, CCPA, and HIPAA impose strict rules on how
data can be collected, used, and shared.
To address this challenge, data mining practitioners must apply data
anonymization and data encryption techniques to protect the privacy and

security of the data. Data anonymization involves removing personally


identifiable information (PII) from the data, while data encryption involves
using algorithms to encode the data to make it unreadable to unauthorized
users.
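A minimal sketch of pseudonymization (one simple form of anonymization), assuming pandas is available; the records and the hashing scheme are purely illustrative:

```python
import hashlib
import pandas as pd

records = pd.DataFrame({
    "email": ["asha@example.com", "ravi@example.com"],   # personally identifiable information
    "age_group": ["25-34", "35-44"],
    "purchases": [12, 7],
})

def pseudonymize(value: str) -> str:
    """Replace a PII value with a truncated one-way SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

anonymized = records.assign(email=records["email"].map(pseudonymize))
print(anonymized)
```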

4] Scalability
Data mining algorithms must be scalable to handle large datasets efficiently.
As the size of the dataset increases, the time and computational resources
required to perform data mining operations also increase. Moreover, the
algorithms must be able to handle streaming data, which is generated
continuously and must be processed in real-time.
To address this challenge, data mining practitioners use distributed
computing frameworks such as Hadoop and Spark. These frameworks
distribute the data and processing across multiple nodes, making it possible
to process large datasets quickly and efficiently.

5] Interpretability
Data mining algorithms can produce complex models that are difficult to
interpret. This is because the algorithms use a combination of statistical and
mathematical techniques to identify patterns and relationships in the data.
Moreover, the models may not be intuitive, making it challenging to
understand how the model arrived at a particular conclusion.
To address this challenge, data mining practitioners use visualization

techniques to represent the data and the models visually. Visualization makes
it easier to understand the patterns and relationships in the data and to
identify the most important variables.

6] Ethics
Data mining raises ethical concerns related to the collection, use, and
dissemination of data. The data may be used to discriminate against certain
groups, violate privacy rights, or perpetuate existing biases. Moreover, data
mining algorithms may not be transparent, making it challenging to detect
biases or discrimination.

★FUTURE OF DATA MINING:


● Data mining has come a long way since its inception, evolving from a niche
field to a cornerstone of modern data analytics. As technology continues to
advance at a rapid pace, the future of data mining holds immense promise,
with emerging trends and innovations poised to revolutionize how we extract
insights from vast amounts of data.

In this blog, we'll explore some of the key trends shaping the future of data
mining and their potential implications for businesses and society.

1. Deep Learning and Neural Networks

Deep learning, a subset of machine learning, has emerged as a powerful tool


for data mining tasks, particularly in areas such as image recognition, natural

language processing, and speech recognition. Deep neural networks are


capable of learning complex patterns and relationships in data, enabling
more accurate predictions and insights.

2. Federated Learning

Federated learning is a decentralized approach to machine learning, where


models are trained collaboratively across multiple devices or servers without
exchanging raw data. This approach offers privacy advantages by keeping
data localized and reducing the risk of data breaches. In the future, federated
learning could revolutionize data mining by enabling organizations to
leverage insights from distributed datasets while preserving data privacy and
security.

3. Augmented Analytics

Augmented analytics combines machine learning and natural language


processing capabilities to automate data preparation, analysis, and insights
generation tasks. By automating routine data mining processes and surfacing
actionable insights in plain language, augmented analytics empowers
business users to make data-driven decisions more effectively.

4. Explainable AI

Explainable AI focuses on making machine learning models more


interpretable and transparent, enabling users to understand how models

arrive at their predictions or decisions. In the context of data mining,


explainable AI techniques help users trust and validate the insights generated
by machine learning models, leading to more informed decision-making. In
the future, explainable AI will play a crucial role in enhancing the
trustworthiness and reliability of data mining models, particularly in
regulated industries such as healthcare and finance.

5. Edge Computing

Edge computing involves processing data closer to its source, such as IoT
devices or sensors, rather than relying solely on centralized cloud
infrastructure. In the future, edge computing will enable organizations to
perform data mining tasks directly on the edge devices, enabling faster
insights and decision-making in scenarios where real-time processing is
critical.

★ISSUES IN DATA MINING:

➢ Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −

​Mining different kinds of knowledge in databases − Different users may


be interested in different kinds of knowledge. Therefore it is necessary for
data mining to cover a broad range of knowledge discovery tasks.
​Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based on
the returned results.
​Incorporation of background knowledge − To guide the discovery process
and to express the discovered patterns, the background knowledge can be
used. Background knowledge may be used to express the discovered patterns
not only in concise terms but at multiple levels of abstraction.
​Data mining query languages and ad hoc data mining − Data Mining
Query language that allows the user to describe ad hoc mining tasks, should
be integrated with a data warehouse query language and optimized for
efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns
are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
​Handling noisy or incomplete data − The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities. If the data cleaning methods are not there then the accuracy of
the discovered patterns will be poor.
Pattern evaluation − The patterns discovered may be uninteresting to a given
user because they represent common knowledge or lack novelty; measures of
interestingness are therefore needed to evaluate the discovered patterns.

➢ Performance Issues

There can be performance-related issues such as follows −

​Efficiency and scalability of data mining algorithms − In order to


effectively extract the information from huge amount of data in databases,
data mining algorithm must be efficient and scalable.
​Parallel, distributed, and incremental mining algorithms − The factors
such as huge size of databases, wide distribution of data, and complexity of
data mining methods motivate the development of parallel and distributed
data mining algorithms. These algorithms divide the data into partitions
which are further processed in a parallel fashion. Then the results from the
partitions are merged. Incremental algorithms incorporate database updates
without mining the data again from scratch.

➢ Diverse Data Types Issues

There can be diverse data types issues such as follows −

Handling of relational and complex types of data − The database may


contain complex data objects, multimedia data objects, spatial data, temporal
data, etc. It is not possible for one system to mine all these kinds of data.

​Mining information from heterogeneous databases and global


information systems − The data is available at different data sources on a
LAN or WAN. These data sources may be structured, semi-structured, or
unstructured. Therefore, mining knowledge from them adds challenges to
data mining.

★KDD PROCESS IN DATA MINING:


➢What is KDD in Data Mining:
● KDD (Knowledge Discovery in Databases) is a process of discovering
useful knowledge and insights from large and complex datasets. The KDD
process involves a range of techniques and methodologies, including data
preprocessing, data transformation, data mining, pattern evaluation, and
knowledge representation. KDD and data mining are closely related
processes, with data mining being a key component and subset of the KDD
process.
● The KDD process aims to identify hidden patterns, relationships, and trends
in data that can be used to make predictions, decisions, and
recommendations. KDD is a broad and interdisciplinary field used in various
industries, such as finance, healthcare, marketing, e-commerce, etc. KDD is
very important for organizations and businesses as it enables them to derive
new insights and knowledge from their data, which can be further used to
improve decision-making, enhance the customer experience, improve
business processes, support strategic planning, optimize operations, and
drive business growth.
➢KDD Process in Data Mining:
● The KDD process in data mining is a multi-step process that involves
various stages to extract useful knowledge from large datasets. The
following are the main steps involved in the KDD process -
● Data Selection - The first step in the KDD process is identifying and
selecting the relevant data for analysis. This involves choosing the relevant
data sources, such as databases, data warehouses, and data streams, and
determining which data is required for the analysis.
● Data Preprocessing - After selecting the data, the next step is data
preprocessing. This step involves cleaning the data, removing outliers, and

removing missing, inconsistent, or irrelevant data. This step is critical, as the


data quality can significantly impact the accuracy and effectiveness of the
analysis.

● Data Transformation - Once the data is preprocessed, the next step is to


transform it into a format that data mining techniques can analyze. This step
involves reducing the data dimensionality, aggregating the data, normalizing
it, and discretizing it to prepare it for further analysis.
● Data Mining - This is the heart of the KDD process and involves applying
various data mining techniques to the transformed data to discover hidden
patterns, trends, relationships, and insights. A few of the most common data
mining techniques include clustering, classification, association rule mining,
and anomaly detection.
● Pattern Evaluation - After the data mining, the next step is to evaluate the
discovered patterns to determine their usefulness and relevance. This
involves assessing the quality of the patterns, evaluating their significance,
and selecting the most promising patterns for further analysis.
● Knowledge Representation - This step involves representing the
knowledge extracted from the data in a way humans can easily understand
and use. This can be done through visualizations, reports, or other forms of
communication that provide meaningful insights into the data.
● Deployment - The final step in the KDD process is to deploy the knowledge
and insights gained from the data mining process to practical applications.
This involves integrating the knowledge into decision-making processes or
other applications to improve organizational efficiency and effectiveness.
● In summary, the KDD process in data mining involves several steps to
extract useful knowledge from large datasets. It is a comprehensive and
iterative process that requires careful consideration of each step to ensure the

accuracy and effectiveness of the analysis. The main steps involved in the
KDD process (selection, preprocessing, transformation, mining, pattern
evaluation, knowledge representation, and deployment) are usually drawn as
a flow diagram; a minimal code sketch of the same flow follows.
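A minimal end-to-end sketch of those steps, assuming scikit-learn is available; the data set, the PCA transformation, and the decision-tree miner are illustrative choices only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)                     # data selection
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kdd = Pipeline([
    ("scale", StandardScaler()),       # preprocessing: normalize attribute ranges
    ("reduce", PCA(n_components=10)),  # transformation: reduce dimensionality
    ("mine", DecisionTreeClassifier(max_depth=4, random_state=0)),  # data mining
])
kdd.fit(X_train, y_train)

# Pattern evaluation / knowledge representation: report the quality of the model.
print(classification_report(y_test, kdd.predict(X_test)))
```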

★TYPES OF DATA IN DATA MINING:


● Mining Multimedia Data: Multimedia data objects include image data,
video data, audio data, website hyperlinks, and linkages. Multimedia data
mining tries to find out interesting patterns from multimedia databases.
This includes processing digital data and performing tasks like image
processing, image classification, video and audio data mining, and pattern
recognition. Multimedia data mining is becoming a very interesting research
area because data from social media platforms such as Twitter and Facebook
can be analyzed through it to derive interesting trends and patterns.

● Mining Web Data: Web mining is essential to discover crucial patterns


and knowledge from the Web. Web content mining analyzes data of
several websites which includes the web pages and the multimedia data
such as images in the web pages. Web mining is done to understand the
content of web pages, unique users of the website, unique hypertext links,
web page relevance and ranking, web page content summaries, time that
the users spent on the particular website, and understand user search
patterns. Web mining also finds out the best search engine and determines
the search algorithm used by it. So it helps improve search efficiency and
finds the best search engine for the users.
● Mining Text Data: Text mining is the subfield of data mining, machine
learning, Natural Language processing, and statistics. Most of the
information in our daily life is stored as text such as news articles,
technical papers, books, email messages, blogs. Text Mining helps us to
retrieve high-quality information from text such as sentiment analysis,
document summarization, text categorization, text clustering. We apply
machine learning models and NLP techniques to derive useful
information from the text. This is done by finding out the hidden patterns

and trends by means such as statistical pattern learning and statistical


language modeling. In order to perform text mining, we need to preprocess
the text by applying the techniques of stemming and lemmatization in order
to convert the textual data into data vectors.

● Mining Spatiotemporal Data: The data that is related to both space and
time is Spatiotemporal data. Spatiotemporal data mining retrieves
interesting patterns and knowledge from spatiotemporal data.
Spatiotemporal Data mining helps us to find the value of the lands, the
age of the rocks and precious stones, predict the weather patterns.
Spatiotemporal data mining has many practical applications like GPS in
mobile phones, timers, Internet-based map services, weather services,
satellite, RFID, sensor.
● Mining Data Streams: Stream data is data that changes dynamically; it is
noisy and inconsistent and contains multidimensional features of different
data types, so it is often stored in NoSQL database systems. The volume of
stream data is very high, and this is the main challenge for effective mining
of stream data. While mining data streams, we need to perform tasks such as
clustering, outlier analysis, and the online detection of rare events in data
streams.

★TYPES OF DATABASE DATA IN DATA MINING:


1. Relational Databases:
● A Relational database is defined as the collection of data organized in
tables with rows and columns.
● Physical schema in Relational databases is a schema which defines the
structure of tables.
● Logical schema in Relational databases is a schema which defines the
relationship among tables.
● Standard API of relational database is SQL.
● A relational database is a type of structured data that organizes data
into one or more tables, with each table consisting of rows and
columns. The rows represent individual records, and the columns
represent fields or attributes within those records.
● The main feature of a relational database is the ability to establish
relationships between different tables using a common field called a
primary key. This allows data to be linked and queried across multiple
tables, enabling more efficient data retrieval and manipulation.
● Relational databases are widely used in many different industries,
such as finance, healthcare, retail and e-commerce. They are also used
to support transactional systems, data warehousing, and business
intelligence.

● Relational databases are typically managed by a database management


system (DBMS) such as MySQL, Oracle, SQL Server, and
PostgreSQL. The DBMS provides tools for creating, modifying, and
querying the database, as well as for managing access and security.
● Some advantages of relational databases include:
● Data Integrity: Relational databases have built-in mechanisms for
maintaining data integrity, such as constraints and triggers.
● Data Consistency: Relational databases ensure that the data is
consistent across the entire system.
● Data Security: Relational databases provide various levels of access
control and security features to protect the data.
● Efficient Data Retrieval: Relational databases provide a powerful
query language (SQL) to retrieve data efficiently.
● Scalability: Relational databases can be easily scaled to support large
data sets and high-performance requirements.
Some disadvantages of relational databases include:
● Complexity: Relational databases can be complex to set up and
manage, especially for large and complex data sets.
● Latency: Relational databases may not be well-suited for real-time,
high-throughput data processing.
● Application: Data Mining, ROLAP model, etc.
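As a brief illustration of the relational model and its SQL interface described above, here is a minimal sketch using Python's built-in sqlite3 module with a hypothetical customers table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory relational database
cur = conn.cursor()

# Rows are records, columns are attributes; customer_id is the primary key.
cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meera", "Pune")])

# SQL is the standard query interface of a relational database.
for row in cur.execute("SELECT city, COUNT(*) FROM customers GROUP BY city"):
    print(row)
conn.close()
```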

2. Transactional Databases
● A transactional database is a collection of data organized by time
stamps, dates, etc., to represent transactions in databases.
● This type of database has the capability to roll back or undo its
operations when a transaction is not completed or committed.
● It is a highly flexible system where users can modify information
without changing any sensitive information.
● Follows ACID property of DBMS.
● Application: Banking, Distributed systems, Object databases, etc.
3. Multimedia Databases
● Multimedia databases consist of audio, video, image, and text media.
● They can be stored in object-oriented databases.
● They are used to store complex information in pre-specified formats.
● Application: Digital libraries, video-on demand, news-on demand,
musical database, etc.
4. Spatial Database
● Store geographical information.
● Stores data in the form of coordinates, topology, lines, polygons, etc.
● Application: Maps, Global positioning, etc.

5. Time-series Databases
● Time-series databases contain stock exchange data and user-logged
activities.
● They handle arrays of numbers indexed by time, date, etc.
● They require real-time analysis.
● Application: eXtremeDB, Graphite, InfluxDB, etc.
6. WWW
● WWW refers to the World Wide Web, a collection of documents and
resources such as audio, video, and text, identified by Uniform
Resource Locators (URLs), linked by HTML pages, browsed through
web browsers, and accessible via the Internet.
● It is the most heterogeneous repository as it collects data from
multiple sources.
● It is dynamic in nature, as the volume of data is continuously
increasing and changing.
● Application: Online shopping, Job search, Research, studying, etc.

★DATA WAREHOUSES:

A data warehouse is a subject-oriented, integrated, time-variant and


non-volatile collection of data in support of management's decision making
process.

Subject-Oriented: A data warehouse can be used to analyze a particular


subject area. For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources.


For example, source A and source B may have different ways of identifying
a product, but in a data warehouse, there will be only a single way of
identifying a product.

Time-Variant: Historical data is kept in a data warehouse. For example, one


can retrieve data from 3 months, 6 months, 12 months, or even older data
from a data warehouse. This contrasts with a transaction system, where
often only the most recent data is kept. For example, a transaction system
may hold the most recent address of a customer, whereas a data warehouse
can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So,
historical data in a data warehouse should never be altered.

➢ Data Warehouse Design Process:

A data warehouse can be built using a top-down approach, a bottom-up


approach, or a combination of both.

● The top-down approach starts with the overall design and planning. It
is useful in cases where the technology is mature and well known, and
where the business problems that must be solved are clear and well
understood.
● The bottom-up approach starts with experiments and prototypes. This
is useful in the early stage of business modeling and technology
development. It allows an organization to move forward at
considerably less expense and to evaluate the benefits of the
technology before making significant commitments.
● In the combined approach, an organization can exploit the planned
and strategic nature of the top-down approach while retaining the
rapid implementation and opportunistic application of the bottom-up
approach. The warehouse design process consists of the following
steps:
● Choose a business process to model, for example, orders, invoices,
shipments, inventory, account administration, sales, or the general
ledger. If the business process is organizational and involves multiple
complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be
chosen.

● Choose the grain of the business process. The grain is the


fundamental, atomic level of data to be represented in the fact table
for this process, for example, individual transactions, individual daily
snapshots, and so on.
● Choose the dimensions that will apply to each fact table record.
Typical dimensions are time, item, customer, supplier, warehouse,
transaction type, and status.
● Choose the measures that will populate each fact table record. Typical
measures are numeric additive quantities like dollars sold and units
sold. (A minimal star-schema sketch follows this list.)
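A minimal star-schema sketch using Python's built-in sqlite3 module; the dimension tables, the fact table grain, and the measures shown are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables and a fact table whose grain is "one sales total per item per day".
cur.execute("CREATE TABLE dim_time (time_id INTEGER PRIMARY KEY, day TEXT)")
cur.execute("CREATE TABLE dim_item (item_id INTEGER PRIMARY KEY, item_name TEXT)")
cur.execute("CREATE TABLE fact_sales (time_id INTEGER, item_id INTEGER, "
            "dollars_sold REAL, units_sold INTEGER)")   # measures: dollars_sold, units_sold

cur.executemany("INSERT INTO dim_time VALUES (?, ?)", [(1, "2024-01-01"), (2, "2024-01-02")])
cur.executemany("INSERT INTO dim_item VALUES (?, ?)", [(10, "milk"), (11, "bread")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 10, 120.0, 60), (1, 11, 80.0, 40), (2, 10, 150.0, 75)])

# Aggregate a measure across a dimension: total dollars sold per item.
for row in cur.execute("SELECT i.item_name, SUM(f.dollars_sold) "
                       "FROM fact_sales f JOIN dim_item i ON f.item_id = i.item_id "
                       "GROUP BY i.item_name"):
    print(row)
conn.close()
```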

➢A Three Tier Data Warehouse Architecture:

Tier-1:

The bottom tier is a warehouse database server that is almost always a


relational database system. Back-end tools and utilities are used to feed data
into the bottom tier from operational databases or other external sources
(such as customer profile information provided by external consultants).
These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data
warehouse. The data are extracted using application program interfaces
known as gateways. A gateway is supported by the underlying DBMS and
allows client programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and
OLE DB (Object Linking and Embedding, Database) by Microsoft, and
JDBC (Java Database Connectivity).

This tier also contains a metadata repository, which stores information about
the data warehouse and its contents.

Tier-2:

The middle tier is an OLAP server that is typically implemented using either
a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP)
model.

A ROLAP model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.

A MOLAP model is a special-purpose server that directly implements
multidimensional data and operations.

Tier-3:

The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools (e.g., trend analysis,
prediction, and so on).

➢Data Warehouse Models:

There are three data warehouse models.

1. Enterprise warehouse:

● An enterprise warehouse collects all of the information about subjects


spanning the entire organization.
● It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is
cross-functional in scope.

● It typically contains detailed data as well as summarized data, and can


range in size from a few gigabytes to hundreds of gigabytes, terabytes,
or beyond.
● An enterprise data warehouse may be implemented on traditional
mainframes, computer superservers, or parallel architecture platforms.
It requires extensive business modeling and may take years to design
and build.

2. Data mart:

● A data mart contains a subset of corporate-wide data that is of value to


a specific group of users. The scope is confined to specific selected
subjects. For example, a marketing data mart may confine its subjects
to customer, item, and sales. The data contained in data marts tend to
be summarized.
● Data marts are usually implemented on low-cost departmental servers
that are UNIX/LINUX- or Windows-based. The implementation cycle
of a data mart is more likely to be measured in weeks rather than
months or years. However, it may involve complex integration in the
long run if its design and planning were not enterprise-wide.
● Depending on the source of data, data marts can be categorized as
independent or dependent. Independent data marts are sourced from
data captured from one or more operational systems or external
information providers, or from data generated locally within a
particular department or geographic area. Dependent data marts are
sourced directly from enterprise data warehouses.

3. Virtual warehouse:

A virtual warehouse is a set of views over operational databases. For


efficient query processing, only some of the possible summary views may be
materialized.

A virtual warehouse is easy to build but requires excess capacity on


operational database servers.

➢Metadata Repository:

Metadata are data about data. When used in a data warehouse, metadata are
the data that define warehouse objects. Metadata are created for the data
names and definitions of the given warehouse. Additional metadata are
created and captured for time-stamping any extracted data, the source of the
extracted data, and missing fields that have been added by data cleaning or
integration processes.

A metadata repository should contain the following:

● A description of the structure of the data warehouse, which includes


the warehouse schema, view, dimensions, hierarchies, and derived
data definitions, as well as data mart locations and contents.
● Operational metadata, which include data lineage (history of migrated
data and the sequence of transformations applied to it), currency of
data (active, archived, or purged), and monitoring information
(warehouse usage statistics, error reports, and audit trails).
● The algorithms used for summarization, which include measure and
dimension definition algorithms, data on granularity, partitions,
subject areas, aggregation, summarization, and predefined queries and
reports.
● The mapping from the operational environment to the data warehouse,
which includes source databases and their contents, gateway
descriptions, data partitions, data extraction, cleaning, transformation
rules and defaults, data refresh and purging rules, and security (user
authorization and access control).
● Data related to system performance, which include indices and
profiles that improve data access and retrieval performance, in
addition to rules for the timing and scheduling of refresh, update, and
replication cycles.
● Business metadata, which include business terms and definitions, data
ownership information, and charging policies.
