data mining introduction

The document provides an overview of data mining, including its history, definitions, techniques, and applications. It distinguishes between data mining and data science, highlights the knowledge discovery in databases (KDD) process, and discusses the advantages and disadvantages of data mining. Additionally, it covers various types of data sources and types of data mining, emphasizing the importance of data quality and the potential for predictive and descriptive analysis.

Uploaded by

Akshatha B
Copyright
© All Rights Reserved

Fundamentals of Data Science
Unit-1: Data Mining

Akshatha B Rai
Asst Professor
Dept of Computer Science
St Philomena College Puttur
Topics Covered…….

 Data Mining History
 Data Mining Introduction
 Data Mining Definitions
 Pros & cons
 Knowledge Discovery in Databases(KDD)
 KDD Vs Data Mining
 DBMS Vs Data Mining
 Data Mining Techniques
 Problems, Issues and Challenges
 Data Mining Applications
Data Mining History

 The term "data mining" was introduced in the 1990s, but the
field grew out of a much longer history.
 Early techniques for identifying patterns in data include Bayes'
theorem (1700s) and regression analysis (1800s).
 The spread and growing power of computing have boosted data
collection, storage, and manipulation as data sets have grown in
size and complexity. Manual, hands-on data analysis has
increasingly been augmented by indirect, automatic data
processing and by other computer science discoveries such as
neural networks, clustering, genetic algorithms (1950s),
decision trees (1960s), and support vector machines (1990s).
 1989: The term "Knowledge Discovery in Databases" (KDD) is
coined by Gregory Piatetsky-Shapiro.
History continued…
 Gregory Piatetsky-Shapiro coined the term
"knowledge discovery in databases" for the first
workshop on the topic (KDD-1989), and the term
became popular in the AI and machine learning
communities. The term "data mining", however,
became more popular in the business and press
communities.
 1990s: The term "data mining" appeared in the
database community. Retail companies and the
financial community began using data mining to
analyze data and recognize trends in order to grow
their customer base and to predict fluctuations in
interest rates, stock prices, and customer demand.
Data Mining Introduction

 Data mining is the process of extracting useful
information from large sets of data. It uses
techniques from statistics, machine learning, and
database systems to identify patterns,
relationships, and trends in the data.
 Data mining is also referred to as knowledge
mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and
data dredging.
 Data mining emerged from the convergence of three
scientific disciplines: artificial intelligence,
machine learning, and statistics.
Definitions..
 Data mining, also known as knowledge discovery in databases,
is the non-trivial extraction of implicit, previously unknown,
and potentially useful information from data. This
encompasses a number of technical approaches, such as
clustering, data summarization, classification, finding
dependency networks, analyzing changes, and detecting
anomalies.

 Data mining is the process of discovering meaningful new
correlations, patterns, and trends by sifting through large
amounts of data stored in repositories, using pattern recognition
techniques as well as statistical and mathematical techniques.
Steps for Data Mining

 Problem Definition
 Data Collection
 Data Cleaning
 Exploratory Data Analysis
 Model Building
 Model Evaluation
 Interpretation and deployment
Data mining Vs Data Science
Data Mining | Data Science
Data mining is a process of extracting useful information, patterns, and trends from huge databases. | Data science refers to the process of obtaining valuable insights from structured and unstructured data by using various tools and methods.
Data mining is a technique. | Data science is a field.
Primarily used for business purposes. | Primarily used for scientific purposes.
It is involved with the process. | It emphasizes the science of the data.
Data mining aims to make data more important and usable; it means extracting only useful information. | The objective of data science is to create a dominant data product.
Data mining is a technique that is a part of KDD (Knowledge discovery in database process). | It is related to the field of study like Mechanical engineering, Cloud architecture, etc.
It primarily deals with structured data. | It deals with any kind of data like structured, semi-structured, and
Real life example…..

 Market Basket Analysis: A technique for studying
the purchases made by customers in a supermarket.
It identifies the items that a customer buys
together. For example, if a person buys bread, what
are the chances that he/she will also purchase
butter? This analysis helps companies promote
offers and deals, and it is carried out with the
help of data mining.
 Protein Folding: A technique that carefully
studies biological cells and predicts protein
interactions and functionality within them.
Applications of this research include
determining causes and possible cures for
Alzheimer's, Parkinson's, and cancer caused by
protein misfolding.
 Fraud Detection: In this age of cell phones,
data mining can be used to analyze cell phone
activity and flag suspicious usage, which helps to
detect calls made on cloned phones. Similarly,
comparing credit card purchases with a customer's
historical purchases can detect activity with
stolen cards.
Advantages of Data Mining

 Improved decision-making: Data mining can provide
valuable insights that can help organizations make
better decisions by identifying patterns and trends in
large data sets.
 Increased efficiency: Data mining can automate
repetitive and time-consuming tasks, such as data
cleaning and preparation, which can help organizations
save time and resources.
 Enhanced competitiveness: Data mining can help
organizations gain a competitive edge by uncovering
new business opportunities and identifying areas for
improvement.
Continued……

 Improved customer service: Data mining
can help organizations better understand
their customers and tailor their products and
services to meet their needs.
 Fraud detection: Data mining can be used
to identify fraudulent activities by detecting
unusual patterns and anomalies in data.
Advantages continued….

 Predictive modeling: Data mining can be used to
build models that can predict future events and
trends, which can be used to make proactive
decisions.
 New product development: Data mining can be
used to identify new product opportunities by
analyzing customer purchase patterns and
preferences.
 Risk management: Data mining can be used to
identify potential risks by analyzing data on
customer behavior, market conditions, and other
factors.
Knowledge Discovery in Databases (KDD)

KDD (Knowledge Discovery in Databases) is a process
that involves the extraction of useful, previously
unknown, and potentially valuable information from
large datasets. KDD is iterative: extracting accurate
knowledge from the data usually requires several passes
through the steps below. The following steps are
included in the KDD process:
Different Stages(steps) of KDD
KDD continued….
 Data Cleaning
Data cleaning is the removal of noisy and irrelevant data from the
collection.
1. Cleaning in case of missing values.
2. Cleaning noisy data, where noise is a random error or variance in the data.
3. Cleaning with data discrepancy detection and data
transformation tools.
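
The first two cleaning cases can be sketched in a few lines of Python. This is only a minimal illustration, assuming numeric readings in which `None` marks a missing value. The function names `fill_missing` and `smooth` are hypothetical, and mean imputation and moving-average smoothing are just two common choices among many.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth(values, window=3):
    """Reduce noise with a centered moving average (the window shrinks at the edges)."""
    half = window // 2
    result = []
    for i in range(len(values)):
        neighbours = values[max(0, i - half): i + half + 1]
        result.append(mean(neighbours))
    return result

# Made-up readings; None marks a missing value
readings = [10, None, 12, 14, None]
cleaned = fill_missing(readings)
print(cleaned)
print(smooth(cleaned))
```

Mean imputation keeps the overall average unchanged but hides how uncertain the filled values are, so in practice the choice of strategy depends on the data.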

 Data Integration
Data integration combines heterogeneous data from multiple
sources into a common source (data warehouse). It is carried out
using data migration tools, data synchronization tools,
and the ETL (Extract-Transform-Load) process.
KDD Continued…
 Data Selection
Data selection is the process in which the data
relevant to the analysis is identified and retrieved from the data
collection. Methods such as neural networks, decision trees,
naive Bayes, clustering, and regression can be used for this.

 Data Transformation
Data transformation is the process of converting
data into the form required by the mining
procedure. It is a two-step process:
1. Data mapping: assigning elements from the source base to the
destination to capture transformations.
2. Code generation: creation of the actual transformation program.
KDD Continued…
 Data Mining
Data mining is the application of techniques to
extract potentially useful patterns. It transforms task-relevant data
into patterns and decides the purpose of the model
using classification or characterization.

 Pattern Evaluation
Pattern evaluation identifies the truly
interesting patterns that represent knowledge, based on given measures. It
finds an interestingness score for each pattern and
uses summarization and visualization to make the data understandable
to the user.

 Knowledge Representation
This involves presenting the results in a way that is
Advantages of KDD

1. Improves decision-making: KDD provides valuable insights and
knowledge that can help organizations make better decisions.
2. Increased efficiency: KDD automates repetitive and time-
consuming tasks and makes the data ready for analysis, which saves
time and money.
3. Better customer service: KDD helps organizations gain a better
understanding of their customers’ needs and preferences, which can
help them provide better customer service.
4. Fraud detection: KDD can be used to detect fraudulent activities by
identifying patterns and anomalies in the data that may indicate
fraud.
5. Predictive modeling: KDD can be used to build predictive models
that can forecast future trends and patterns.
Disadvantages of KDD
1. Privacy concerns: KDD can raise privacy concerns as it involves collecting
and analyzing large amounts of data, which can include sensitive information
about individuals.
2. Complexity: KDD can be a complex process that requires specialized skills
and knowledge to implement and interpret the results.
3. Unintended consequences: KDD can lead to unintended consequences,
such as bias or discrimination, if the data or models are not properly
understood or used.
4. Data quality: The KDD process depends heavily on the quality of the data. If the
data is not accurate or consistent, the results can be misleading.
5. High cost: KDD can be an expensive process, requiring significant
investments in hardware, software, and personnel.
6. Overfitting: The KDD process can lead to overfitting, a common problem
in machine learning where a model learns the detail and noise in the training
data to the extent that it performs worse on new, unseen data.
Difference between KDD and Data Mining
Parameter | KDD | Data Mining
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data Mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques Used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Data mining focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
Difference between DBMS and Data Mining

DBMS | Data mining
It creates, stores, and maintains data in a database. | It extracts useful and previously unknown information from raw data.
The database is the organized collection of data. Most of the time, these raw data are stored in very large databases. | Data mining is analyzing data from different information sources to discover useful knowledge.
It supports query languages. | Automatic searching of data.
Can work without data mining techniques. | May not work without a database.
Basic elements: query languages, data store, and transaction mechanism. | Basic concepts: classification, regression, clustering, and association.
Types of Sources of Data in Data Mining

 1. Data stored in the database
A database together with the software that manages it is called a database
management system (DBMS). Every DBMS stores data that are related to each other
in one way or another. It also has a set of software programs that are used to manage
the data and provide easy access to it. These programs serve many
purposes, including defining the structure of the database, making sure that
the stored information remains secure and consistent, and managing
different types of data access, such as shared, distributed, and
concurrent.
A relational database has tables with distinct names and attributes,
and each table can store rows or records of large data sets. Every record stored in a
table has a unique key. An entity-relationship model is created to
represent a relational database in terms of entities and the
relationships that exist between them.
Types of Sources of Data in Data
Mining…..
 Data warehouse
A data warehouse is a single data storage location that collects data
from multiple sources and stores it under a unified schema.
Data stored in a data warehouse undergoes cleaning,
integration, loading, and refreshing. Data stored in a data warehouse is
organized in several parts.
 Transactional data
A transactional database stores records that are captured as transactions.
These transactions include flight bookings, customer purchases, clicks on a
website, and others. Every transaction record has a unique ID and
lists the items that made up the transaction.
Types of Sources of Data in Data Mining……

 Multimedia Databases
• Multimedia databases consist of audio, video, image, and text media.
• They can be stored in object-oriented databases.
• They are used to store complex information in pre-specified formats.
• Applications: digital libraries, video on demand, news on demand, music
databases, etc.

 Time-series Databases
• Time-series databases contain data such as stock exchange data and user logged activities.
• They handle arrays of numbers indexed by time, date, etc.
• They require real-time analysis.
• Examples: eXtremeDB, Graphite, InfluxDB, etc.
Types of Sources of Data in Data
Mining……

 Cloud Data:
This type of data is stored and processed in cloud computing
environments such as AWS, Azure, and GCP.

 Big Data:
This type of data is characterized by its huge volume, high velocity,
and high variety, and can be stored and processed using big data
technologies such as Hadoop and Spark.
Types of Data in Data Mining……

 Structured Data:
This type of data is organized into a specific format, such as a database
table or spreadsheet. Examples include transaction data, customer data,
and inventory data.
 Semi-Structured Data:
This type of data has some structure, but not as much as structured
data. Examples include XML and JSON files, and email messages.
 Unstructured Data:
This type of data does not have a specific format, and can include
text, images, audio, and video. Examples include social media posts,
customer reviews, and news articles.
Types of Data Mining

 1. Predictive Data Mining
As the name suggests, predictive data mining analyzes
data to help anticipate what may happen
later (in the future) in a business. Predictive data mining can
be further divided into four types:

• Classification Analysis
• Regression Analysis
• Time Series Analysis
• Prediction Analysis
Types of Data Mining

 2. Descriptive Data Mining
The main goal of descriptive data mining
tasks is to summarize or turn given data into relevant
information. Descriptive data mining tasks can
be further divided into four types:
• Clustering Analysis
• Summarization Analysis
• Association Rules Analysis
• Sequence Discovery Analysis
Difference between predictive and descriptive data mining

Descriptive data mining | Predictive data mining
Descriptive mining is usually used to provide correlation, cross-tabulation, frequency, etc. | The term 'Predictive' means to predict something, so predictive data mining is the analysis done to predict the future event or other data or trends.
It is based on the reactive approach. | It is based on the proactive approach.
It specifies the characteristics of the data in a target data set. | It executes the induction over the current and past data so that prediction can happen.
It needs data aggregation and data mining. | It needs statistics and data forecasting procedures.
It provides precise data. | It produces outcomes without
Data Mining Techniques
 Classification
 Clustering
 Association Rule
 Regression
 Anomaly Detection
 Time series Analysis
 Outlier Detection
 Artificial Neural Networks classifier
 Decision Trees
 Text Mining
1. Classification
Classification is a technique used to categorize data into
predefined classes or categories based on the features or
attributes of the data instances. It involves training a model on
labeled data and using it to predict the class labels of new, unseen
data instances.
• Decision Tree
• SVM(Support Vector Machine)
• Generalized Linear Models
• Bayesian classification
• Classification by Backpropagation
• K-NN Classifier
• Rule-Based Classification
• Frequent-Pattern Based Classification
• Rough set theory
• Fuzzy Logic
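
As an illustration of training on labeled data and predicting the class of an unseen instance, here is a minimal sketch of the K-NN classifier from the list above. The data set and the function name `knn_predict` are made up for this example, and a real project would normally use a library implementation instead.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # train is a list of (features, label) pairs
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy labeled data: (height_cm, weight_kg) -> class label (invented for illustration)
train = [
    ((150, 50), "small"), ((155, 55), "small"), ((160, 58), "small"),
    ((180, 85), "large"), ((185, 90), "large"), ((190, 95), "large"),
]

print(knn_predict(train, (158, 54)))
print(knn_predict(train, (183, 88)))
```

K-NN has no separate training step: the labeled data itself is the model, and all the work happens when a prediction is requested.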
2. Clustering
 Clustering is a technique used to group similar
data instances together based on their intrinsic
characteristics or similarities. It aims to discover
natural patterns or structures in the data without
any predefined classes or labels.
 This technique helps to recognize the differences
and similarities between data items. Clustering
resembles classification in that it groups data, but
the groups are formed from similarities in the data
itself rather than from predefined labels.
 Algorithms: K-means, hierarchical clustering,
density-based clustering
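
A minimal sketch of K-means, the first algorithm named above, may help make the idea concrete. It assumes the starting centroids are supplied by the caller (real implementations usually pick them randomly or with a seeding heuristic), and the points are invented for illustration.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

No labels are given to the algorithm; the two groups emerge only from the distances between the points.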
3. Regression:

 Regression is employed to predict numeric or continuous
values based on the relationship between input variables and
a target variable.
 It establishes a relationship between a dependent variable and
one or more independent variables.
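
For the simplest case, one independent variable, the relationship can be estimated with ordinary least squares. The sketch below is only an illustration: the function name `fit_line` and the data (hours studied versus exam score) are invented.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data that lies exactly on the line y = 5x + 40
hours = [1, 2, 3, 4, 5]
scores = [45, 50, 55, 60, 65]

slope, intercept = fit_line(hours, scores)
print(slope, intercept)
print(slope * 6 + intercept)  # predicted score for 6 hours of study
```

Here `hours` is the independent variable and `scores` is the dependent (target) variable.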
4. Association Rules:

 Association rules help to discover links between
two or more items. They find hidden patterns in the
data set.
 Association rules are if-then statements that
show the probability of interactions between
data items within large data sets in different
types of databases. Association rule mining has
several applications and is commonly used to find
sales correlations in transactional data or in medical data sets.
 For example, given a list of the grocery items that you have
been buying for the last six months, it calculates the
percentage of items being purchased together.
• Support:
This measure shows how often items A and B are purchased together,
relative to the overall dataset.
(Transactions containing both Item A and Item B) / (Transactions in the entire dataset)

• Confidence:
This measure shows how often item B is
purchased when item A is purchased as well.
(Transactions containing both Item A and Item B) / (Transactions containing Item A)

• Lift:
This measure compares the confidence with how often item B is
purchased overall. A lift above 1 means that A and B are bought
together more often than would be expected if they were independent.
(Confidence) / ((Transactions containing Item B) / (Transactions in the entire dataset))
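
The three measures can be computed directly from a list of transactions. The sketch below uses the standard definitions: support is the fraction of transactions containing the items, confidence of A -> B is support(A and B) / support(A), and lift is that confidence divided by support(B). The basket data is invented for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """How often b is bought when a is bought: support(a and b) / support(a)."""
    return support(transactions, {a, b}) / support(transactions, {a})

def lift(transactions, a, b):
    """Confidence of a -> b relative to how often b is bought overall."""
    return confidence(transactions, a, b) / support(transactions, {b})

# Made-up shopping baskets
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"butter", "bread"},
]

print(support(transactions, {"bread", "butter"}))
print(confidence(transactions, "bread", "butter"))
print(lift(transactions, "bread", "butter"))
```

In this toy data the lift for bread -> butter is above 1, which suggests the two items are bought together more often than chance alone would explain.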
5. Artificial Neural Network Classifier

 An artificial neural network (ANN), also known as a "neural network"
(NN), is a computational model inspired by biological neurons. It is
made up of a networked group of artificial neurons: a collection of
connected input/output units with a weight assigned to each connection.
 During the learning phase, the network learns by adjusting the weights
so that it can predict the class label of the input samples
correctly. Because of the links between units,
neural network learning is also known as connectionist learning.
 Neural networks require lengthy training periods, making them more
suitable for applications where this is feasible. They need a variety of
parameters, like the network topology or "structure," which are often
best determined empirically.
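
Learning by modifying weights can be shown with the smallest possible network, a single neuron (perceptron) trained on the logical AND function. This is a simplified sketch under stated assumptions: it uses the classic perceptron update rule rather than backpropagation, a learning rate of 1, and invented function names.

```python
def train_perceptron(samples, epochs=20, lr=1):
    """Learn weights for a single neuron with a step activation."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            error = target - output
            # Learning step: nudge each weight in proportion to the error
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

def predict(weights, bias, inputs):
    """Classify one input with the trained neuron."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

# Truth table of logical AND as (inputs, target) pairs
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_gate)
print(weights, bias)
print([predict(weights, bias, x) for x, _ in and_gate])
```

A single neuron can only separate classes with a straight line, which is why practical networks stack many units in layers.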
6. Outlier Detection

 A database may contain data objects that do not conform to the
overall behavior or model of the data. These data objects are
outliers, and the analysis of outlier data is called outlier mining.
 With distance measures, objects that have only a small
fraction of "near" neighbors in space are regarded as outliers.
Statistical tests that assume a distribution or probability model for
the data can also be used to identify outliers.
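
Both approaches can be sketched on one-dimensional data. The example is illustrative only: the radius, the neighbor fraction, and the z-score threshold are arbitrary assumptions, the function names are hypothetical, and the z-score test implicitly assumes roughly normally distributed data.

```python
from statistics import mean, pstdev

def distance_outliers(values, radius, min_fraction):
    """Flag values that have too small a fraction of neighbours within `radius`."""
    outliers = []
    for i, v in enumerate(values):
        near = sum(1 for j, other in enumerate(values)
                   if j != i and abs(v - other) <= radius)
        if near / (len(values) - 1) < min_fraction:
            outliers.append(v)
    return outliers

def zscore_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up measurements with one obviously unusual value
data = [10, 11, 12, 11, 10, 12, 11, 95]

print(distance_outliers(data, radius=3, min_fraction=0.5))
print(zscore_outliers(data))
```

Note that an extreme value inflates the mean and standard deviation it is tested against, which is one reason robust or distance-based methods are sometimes preferred.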
Issues and Challenges in Data
Mining
 Limited Information
 Noise or missing Data
 User Interaction and Prior Knowledge
 Data Complexity
 Size, updates and irrelevant fields.
 Data Privacy and Security
 Scalability
 Limited Information:
Sometimes attributes that are essential for knowledge discovery in the
application domain are not present in the database. This can make it
very difficult to discover significant knowledge about a given domain.

 Noise or missing Data (Data Quality)
The quality of data used in data mining is one of the most significant challenges.
The accuracy, completeness, and consistency of the data affect the accuracy of
the results obtained. The data may contain errors, omissions, duplications, or
inconsistencies, which may lead to inaccurate results. Moreover, the data may
be incomplete, meaning that some attributes or values are missing, making it
challenging to obtain a complete understanding of the data.
Data quality issues can arise for a variety of reasons, including data entry
errors, data storage issues, data integration problems, and data transmission
errors.
 User Interaction and Prior Knowledge
An analyst is usually not a KDD expert, but simply a person making use of the data by
means of the available KDD techniques. Since the KDD process is by definition interactive
and iterative, it is challenging to provide a high performance, rapid-response environment
that also assists the users in the proper selection and matching of the appropriate techniques
to achieve their goals.

 Data Complexity
Data complexity refers to the vast amounts of data generated by
various sources, such as sensors, social media, and the internet
of things (IoT). The complexity of the data may make it
challenging to process, analyze, and understand. In addition, the
data may be in different formats, making it challenging to
integrate into a single dataset.
Issues and challenges of DM
 Size, updates and irrelevant fields
Databases tend to be large and dynamic, in that their contents keep changing as
information is added, modified or removed. From the perspective of data mining, the
problem is how to ensure that the discovered rules are up-to-date and consistent with the most
current information.
 Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more data is
collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases. The
data may contain personal, sensitive, or confidential information that must be protected.
Moreover, data privacy regulations such as GDPR, CCPA, and HIPAA impose strict rules
on how data can be collected, used, and shared.
Issues and challenges of DM
 Scalability
Data mining algorithms must be scalable to handle large datasets
efficiently. As the size of the dataset increases, the time and
computational resources required to perform data mining operations
also increase. Moreover, the algorithms must be able to handle
streaming data, which is generated continuously and must be processed
in real-time.
 Ethical and Legal Considerations:
Data mining raises various ethical and legal considerations, including
consent, data ownership, intellectual property rights, and compliance
with regulations such as GDPR (General Data Protection Regulation) and
HIPAA (Health Insurance Portability and Accountability Act).
Data Mining Applications
Business Transactions:

For businesses struggling to survive in a highly competitive
world, making effective and timely use of data for competitive
decision-making is one of the most important problems to solve.
Data mining helps to analyze business transactions, identify
marketing approaches, and support decision-making. Examples:
• Direct mail targeting
• Stock trading
• Customer segmentation and retention
• Churn prediction (Churn prediction is one of the most popular
Big Data use cases in business)
Intrusion Detection

A network intrusion refers to any unauthorized activity on a
digital network. Network intrusions often involve stealing
valuable network resources. Data mining techniques play a vital
role in detecting intrusions, network attacks, and
anomalies. These techniques help in selecting and refining useful
and relevant information from large data sets and in classifying
the data relevant to an intrusion detection system, which raises
alarms when network traffic indicates an intrusion into the system. For
example:
• Detect security violations
• Misuse Detection
• Anomaly Detection
Market Basket Analysis:

 Market basket analysis is a technique for studying the
purchases made by customers in a supermarket. It identifies
patterns in the items that customers frequently purchase.
This analysis can help companies promote deals, offers, and
sales, and data mining techniques help to carry it out.
Example:
• Data mining concepts are used in sales and marketing to
provide better customer service, to improve cross-selling
opportunities, and to increase direct mail response rates.
Education:
To analyze the education sector, data mining
uses the Educational Data Mining (EDM) method. This method
generates patterns that can be used by both learners and
educators. Using EDM, we can perform educational tasks
such as:
• Predicting student admission to higher education
• Student profiling
• Predicting student performance
• Evaluating teachers' teaching performance
• Curriculum development
• Predicting student placement opportunities
Research
Data mining techniques can perform prediction, classification,
clustering, association, and grouping of data in research.
In most technical research in data mining, we
create a training model and a testing model. The train/test approach
is a strategy for measuring the accuracy of the proposed model. It is
called train/test because we split the data set into two sets: a training
data set and a testing data set. The training data set is used to build the
model, whereas the testing data set is used to evaluate it.
Examples:
• Classification of uncertain data.
• Information-based clustering.
• Decision support system
• Web Mining
• IoT (Internet of Things) and Cybersecurity
• Smart farming with IoT (Internet of Things)
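
The train/test idea described above can be sketched as follows. Everything here is an assumption made for illustration: the 25% test ratio, the fixed random seed, the toy data, and the trivial threshold "model" that stands in for a real learning algorithm.

```python
import random

def train_test_split(records, test_ratio=0.25, seed=42):
    """Shuffle a copy of the records and split it into training and testing sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose label the model predicts correctly."""
    correct = sum(1 for x, label in test_set if model(x) == label)
    return correct / len(test_set)

# Made-up data: the label is 1 when the value is at least 50
records = [(x, int(x >= 50)) for x in range(0, 100, 5)]
train, test = train_test_split(records)

# "Training": place a threshold midway between the two classes seen in the training set
low = max(x for x, y in train if y == 0)
high = min(x for x, y in train if y == 1)
threshold = (low + high) / 2

def model(x):
    """Predict 1 for values at or above the learned threshold."""
    return int(x >= threshold)

print(len(train), len(test), accuracy(model, test))
```

Because the model never sees the test records during training, its accuracy on them gives a fairer estimate of how it would behave on new data.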
Healthcare and Insurance:

A pharmaceutical company can examine its recent sales
force activity and its outcomes to improve the targeting of
high-value physicians and to figure out which marketing activities
will have the best effect in the coming months.
In the insurance sector, data mining can help to predict
which customers will buy new policies, identify behavior patterns
of risky customers, and identify fraudulent behavior of customers.
• Claims analysis, i.e., which medical procedures are claimed
together.
• Identify successful medical therapies for different illnesses.
• Characterize patient behavior to predict office visits.
Financial/Banking Sector

A credit card company can leverage its
vast warehouse of customer transaction data to
identify customers most likely to be interested in a
new credit product.
• Credit card fraud detection.
• Identify ‘Loyal’ customers.
• Extraction of information related to customers.
• Determine credit card spending by customer
groups.
Transportation:

A diversified transportation company
with a large direct sales force can apply data
mining to identify the best prospects for its
services. A large consumer merchandise
organization can apply data mining to
improve its sales process to retailers.
• Determine the distribution schedules among
outlets.
• Analyze loading patterns.
