data mining introduction
Science Unit-1
Data Mining
Akshatha B Rai
Asst Professor
Dept of Computer Science
St Philomena College Puttur
Topics Covered…….
In the 1990s, the term "Data Mining" was introduced, but data
mining is the evolution of a field with an extensive history.
Early techniques for identifying patterns in data include Bayes'
theorem (1700s) and the development of regression (1800s).
The spread and growing power of computer technology have
boosted data collection, storage, and manipulation as data sets
have grown in size and complexity. Explicit hands-on data
investigation has progressively been augmented with indirect,
automatic data processing and other computer science
discoveries such as neural networks, clustering, genetic
algorithms (1950s), decision trees (1960s), and support
vector machines (1990s).
1989 The term “Knowledge Discovery in Databases” (KDD) is
coined by Gregory Piatetsky-Shapiro.
History continued…
Gregory Piatetsky-Shapiro coined the term
"knowledge discovery in databases" for the first
workshop on the same topic (KDD-1989), and this
term became more popular in the AI and machine
learning communities. However, the term data
mining became more popular in the business and
press communities.
1990s The term “data mining” appeared in the
database community. Retail companies and the
financial community use data mining to
analyze data and recognize trends in order to increase their
customer base and to predict fluctuations in interest rates,
stock prices, and customer demand.
Data Mining Introduction
Problem Definition
Data Collection
Data Cleaning
Exploratory Data Analysis
Model Building
Model Evaluation
Interpretation and deployment
Data mining Vs Data Science
Data Mining Data Science
Data Integration
Data integration is defined as combining heterogeneous data from multiple
sources into a common store (data warehouse). Data integration is
performed using data migration tools, data synchronization tools,
and the ETL (Extract-Transform-Load) process.
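The ETL idea can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the two sources (a CSV export and a JSON feed), their field names, and the table layout are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical source 1: customer records exported as CSV text.
CSV_SOURCE = "customer_id,name,city\n1,Asha,Puttur\n2,Ravi,Mangalore\n"

# Hypothetical source 2: order records delivered as JSON text.
JSON_SOURCE = (
    '[{"cust": 1, "amount": "250.50"},'
    ' {"cust": 2, "amount": "99"},'
    ' {"cust": 1, "amount": "40"}]'
)


def extract():
    """Extract: read raw records from both heterogeneous sources."""
    customers = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    orders = json.loads(JSON_SOURCE)
    return customers, orders


def transform(customers, orders):
    """Transform: unify field names and convert text values to proper types."""
    clean_customers = [(int(c["customer_id"]), c["name"], c["city"]) for c in customers]
    clean_orders = [(int(o["cust"]), float(o["amount"])) for o in orders]
    return clean_customers, clean_orders


def load(conn, customers, orders):
    """Load: write the cleaned rows into the common store."""
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    conn.commit()


def run_etl():
    """Run extract, transform, and load in order; return the populated store."""
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    load(conn, *transform(*extract()))
    return conn


def total_spend_per_customer(conn):
    """Query the integrated data: total order amount per customer."""
    rows = conn.execute(
        "SELECT c.name, SUM(o.amount) FROM customers c "
        "JOIN orders o ON o.customer_id = c.customer_id "
        "GROUP BY c.customer_id ORDER BY c.customer_id"
    )
    return rows.fetchall()
```

Once both sources are loaded into one store, a single query such as total_spend_per_customer(run_etl()) can combine them, which is the practical benefit of integration.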
KDD Continued…
Data Selection
Data selection is defined as the process where data
relevant to the analysis is decided on and retrieved from the data
collection. For this we can use neural networks, decision trees,
Naive Bayes, clustering, and regression methods.
Data Transformation
Data transformation is defined as the process of
transforming data into the form required by the mining
procedure. Data transformation is a two-step process:
1. Data mapping: Assigning elements from the source base to the destination
to capture transformations.
2. Code generation: Creation of the actual transformation program.
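The two steps can be illustrated with a small Python sketch. It is an assumption-laden example, not a standard tool: the field names in FIELD_MAP and the helper build_transformer are hypothetical, and "code generation" is approximated by building a transformation function from the mapping rather than emitting source code.

```python
# Step 1 (data mapping): hypothetical mapping from each source field
# to a (destination field, converter) pair.
FIELD_MAP = {
    "cust_nm": ("customer_name", str.strip),
    "amt": ("amount", float),
    "dt": ("order_date", str),
}


def build_transformer(field_map):
    """Step 2 (code generation, approximated): build the transformation
    function from the mapping."""

    def transform(source_row):
        # Rename each mapped field and convert its value.
        return {
            dest: convert(source_row[src])
            for src, (dest, convert) in field_map.items()
        }

    return transform


transform = build_transformer(FIELD_MAP)

# A hypothetical source record with untidy text values.
row = {"cust_nm": "  Asha ", "amt": "250.50", "dt": "2024-01-15"}
clean = transform(row)
```

Keeping the mapping as data means a new source field can be supported by adding one entry to FIELD_MAP, without touching the transformation logic.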
Data Mining
Data mining is defined as the application of techniques to
extract potentially useful patterns. It transforms task-relevant data
into patterns and decides the purpose of the model
using classification or characterization.
Pattern Evaluation
Pattern evaluation is defined as identifying the truly
interesting patterns representing knowledge, based on given measures. It
finds the interestingness score of each pattern and
uses summarization and visualization to make the data understandable
to the user.
Knowledge Representation
This involves presenting the results in a way that is
Advantages of KDD
Output:
Patterns, associations, or insights that can be used to
improve decision-making or understanding.
Structured information, such as rules and models, that can be
used to make decisions or predictions.
Multimedia Databases
• Multimedia databases consist of audio, video, image, and text media.
• They can be stored in object-oriented databases.
• They are used to store complex information in pre-specified formats.
• Applications: digital libraries, video-on-demand, news-on-demand, music
databases, etc.
Time-series Databases
• Time-series databases contain data such as stock exchange data and logged user activity.
• They handle arrays of numbers indexed by time, date, etc.
• They often require real-time analysis.
• Examples: eXtremeDB, Graphite, InfluxDB, etc.
Types of Sources of Data in Data Mining……
Cloud Data:
This type of data is stored and processed in cloud computing
environments such as AWS, Azure, and GCP.
Big Data:
This type of data is characterized by its huge volume, high velocity,
and high variety, and can be stored and processed using big data
technologies such as Hadoop and Spark.
Types of Data in Data Mining……
Structured Data:
This type of data is organized into a specific format, such as a database
table or spreadsheet. Examples include transaction data, customer data,
and inventory data.
Semi-Structured Data:
This type of data has some structure, but not as much as structured
data. Examples include XML and JSON files, and email messages.
Unstructured Data:
This type of data does not have a specific format, and can include
text, images, audio, and video. Examples include social media posts,
customer reviews, and news articles.
Types of Data Mining
• Classification Analysis
• Regression Analysis
• Time Series Analysis
• Prediction Analysis
Types of Data Mining
• Support:
This measure captures how often items A and B are purchased
together, relative to the overall dataset.
(Transactions containing both Item A and Item B) / (All transactions in the dataset)
• Confidence:
This measure captures how often item B is
purchased when item A is purchased as well.
(Transactions containing both Item A and Item B) / (Transactions containing Item A)
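The support and confidence measures above can be computed directly from a list of transactions. The following is a minimal sketch; the function names and the sample shopping baskets are made up for illustration.

```python
def count_containing(transactions, items):
    """Count the transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def support(transactions, items):
    """Fraction of all transactions that contain all of `items`."""
    return count_containing(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    antecedent_count = count_containing(transactions, antecedent)
    if antecedent_count == 0:
        return 0.0  # rule never applies, so report zero confidence
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / antecedent_count


# Hypothetical shopping baskets.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

bread_milk_support = support(TRANSACTIONS, {"bread", "milk"})
bread_to_milk_confidence = confidence(TRANSACTIONS, {"bread"}, {"milk"})
```

In this sample, bread and milk appear together in 3 of 5 baskets, so the support is 0.6. Bread appears in 4 baskets, 3 of which also contain milk, so the confidence of "bread implies milk" is 0.75.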
5. Artificial Neural Network Classifier
Data Complexity
Data complexity refers to the vast amounts of data generated by
various sources, such as sensors, social media, and the internet
of things (IoT). The complexity of the data may make it
challenging to process, analyze, and understand. In addition, the
data may be in different formats, making it challenging to
integrate into a single dataset.
Issues and challenges of DM
Size, updates and irrelevant fields
Databases tend to be large and dynamic, in that their contents keep changing as
information is added, modified or removed. The problem with this, from the perspective of
data mining, is how to ensure that the discovered rules are up-to-date and consistent with the most
current information.
Data Privacy and Security
Data privacy and security is another significant challenge in data mining. As more data is
collected, stored, and analyzed, the risk of data breaches and cyber-attacks increases. The
data may contain personal, sensitive, or confidential information that must be protected.
Moreover, data privacy regulations such as GDPR, CCPA, and HIPAA impose strict rules
on how data can be collected, used, and shared.
Scalability
Data mining algorithms must be scalable to handle large datasets
efficiently. As the size of the dataset increases, the time and
computational resources required to perform data mining operations
also increase. Moreover, the algorithms must be able to handle
streaming data, which is generated continuously and must be processed
in real-time.
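One common way to cope with streaming data is to summarize it in a single pass with constant memory, so the cost per record does not grow with the size of the stream. The sketch below uses Welford's online algorithm for the running mean and variance; the class name RunningStats and the sample values are illustrative assumptions, not part of any specific data mining system.

```python
class RunningStats:
    """Single-pass, constant-memory mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new value from the stream into the summary."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of all values seen so far."""
        return self._m2 / self.count if self.count else 0.0


# Hypothetical stream of sensor readings, processed one value at a time.
stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(reading)
```

Because each update touches only three numbers, the summary stays current as new data arrives, without storing or rescanning earlier records.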
Ethical and Legal Considerations:
Data mining raises various ethical and legal considerations, including
consent, data ownership, intellectual property rights, and compliance
with regulations such as GDPR (General Data Protection Regulation) and
HIPAA (Health Insurance Portability and Accountability Act).
Data Mining Applications
Business Transactions: