Introduction To Data Mining
patterns in data.
• Data is everywhere: in spreadsheets, on social media platforms, in
product reviews and feedback. In the current information age it is
created at blinding speed and, when analyzed correctly, can be a
company’s most valuable asset.
• “To grow your business, or even to grow in your life, sometimes
all you need to do is analysis.”
• Data is raw information, and analysis of data is the
systematic process of interpreting and transforming that data
into meaningful insights.
• In a data-driven world, analysis involves applying
statistical, mathematical, or computational techniques to
extract patterns, trends, and correlations from datasets.
• Data analysis is the process of inspecting, cleaning,
transforming, and modeling data to discover useful
information, draw conclusions, and support decision-
making.
• It involves the application of various techniques and
tools to extract meaningful insights from raw data,
helping in understanding patterns, trends, and
relationships within a dataset.
• Data and analysis together form the backbone of evidence-
based decision-making, enabling organizations and
individuals to understand complex phenomena, predict
outcomes, and derive actionable conclusions for improved
outcomes and efficiency.
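The inspect, clean, and model steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the data values and the choice of Pearson correlation as the "model" are assumptions, not part of the original material.

```python
from statistics import mean

# Hypothetical raw data: (advertising spend, sales); None marks a missing value.
raw = [(10.0, 25.0), (20.0, 44.0), (None, 30.0), (30.0, 66.0), (40.0, 85.0)]

# Inspect and clean: drop records with a missing field.
clean = [(x, y) for x, y in raw if x is not None and y is not None]

def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Model: measure how strongly the two attributes move together.
r = pearson(clean)
print(f"records kept: {len(clean)}, correlation: {r:.3f}")
```

A correlation close to 1 would suggest a strong linear relationship worth investigating further; it does not by itself establish cause and effect.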
Why Is Data Analysis Important?
• Data analysis is crucial for informed decision-making,
revealing patterns, trends, and insights within datasets.
• It enhances strategic planning, identifies opportunities and
challenges, improves efficiency, and fosters a deeper
understanding of complex phenomena across various
industries and fields.
• Informed Decision-Making: Analysis of data provides a basis for informed
decision-making by offering insights into past performance, current trends,
and potential future outcomes.
• Business Intelligence: Analyzed data helps organizations gain a competitive
edge by identifying market trends, customer preferences, and areas for
improvement.
• Problem Solving: It aids in identifying and solving problems within a system
or process by revealing patterns or anomalies that require attention.
• Performance Evaluation: Analysis of data enables the assessment of
performance metrics, allowing organizations to measure success, identify
areas for improvement, and set realistic goals.
• Risk Management: Understanding patterns in data helps in predicting and
managing risks, allowing organizations to mitigate potential challenges.
• Optimizing Processes: Data analysis identifies inefficiencies in processes,
allowing for optimization and cost reduction.
Introduction to Data Mining
Data mining is a technology that blends traditional data
analysis methods with sophisticated algorithms for
processing large volumes of data.
It has also opened up stimulating opportunities for exploring
and analyzing new types of data and for analyzing old types
of data in new ways.
Business: Point-of-sale data collection technologies (bar code
scanners, radio frequency identification (RFID), and smart card
technology) have allowed retailers to collect up-to-the-
minute data about customer purchases at the checkout
counters of their stores.
• Retailers can utilize this information, along with other business-
critical data such as Web logs from e-commerce Web sites and
customer service records from call centers, to help them better
understand the needs of their customers and make more informed
business decisions.
• Data mining techniques can be used to support a wide range of
business intelligence applications such as customer profiling, targeted
marketing, workflow management, store layout, and fraud detection.
• It can also help retailers answer important business questions such as
“Who are the most profitable customers?” “What products can be
cross-sold or up-sold?” and “What is the revenue outlook of the
company for next year?”
• Data mining is the process of automatically discovering useful
information in large data repositories.
• Data mining techniques are deployed to search large databases in
order to find novel and useful patterns that might otherwise
remain unknown.
• Not all information discovery tasks are considered to be data
mining.
• For example, looking up individual records using a database
management system or finding particular Web pages via a query
to an Internet search engine are tasks related to the area of
information retrieval.
Data Mining and Knowledge Discovery Diagram
Data Mining and Knowledge Discovery
• Data mining is an integral part of knowledge discovery in databases
(KDD), which is the overall process of converting raw data into useful
information, as shown in the Diagram.
• This process consists of a series of transformation steps, from data
preprocessing to postprocessing of data mining results.
KDD Process in Data Mining
• In the context of computer science, “Data Mining” can be referred to as
knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging.
• Data mining, a term often used interchangeably with Knowledge Discovery in
Databases, refers to the nontrivial extraction of implicit, previously unknown,
and potentially useful information from data stored in databases.
• The purpose of data mining is to extract useful information from large datasets
and use it to make predictions or support better decision-making.
• Nowadays, data mining is used in almost all places where a large amount of
data is stored and processed.
For example: Banking sector, Market Analysis, Network Intrusion Detection.
KDD Process
• KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets.
• The KDD process is iterative: the steps below are often repeated several times
to extract accurate knowledge from the data. The following steps are included
in the KDD process:
Data Cleaning
• Data cleaning is defined as the removal of noisy and irrelevant data from the collection.
• Cleaning in the case of missing values.
• Cleaning noisy data, where noise is a random error or variance in a measured variable.
• Cleaning with data discrepancy detection and data transformation tools.
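A minimal Python sketch of two of the cleaning tasks listed above: handling missing values and limiting noise. The readings, the median fill strategy, and the plausible range of 0 to 50 are all assumptions chosen for illustration; real projects pick these based on the data.

```python
from statistics import median

# Hypothetical sensor readings; None marks a missing value, 250.0 is an implausible spike.
readings = [21.5, 22.0, None, 21.8, 250.0, 22.3, None, 21.9]

def fill_missing(values):
    """Replace missing entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def clip_outliers(values, low, high):
    """Limit values to a plausible range [low, high] chosen by the analyst."""
    return [min(max(v, low), high) for v in values]

filled = fill_missing(readings)
cleaned = clip_outliers(filled, low=0.0, high=50.0)
print(cleaned)
```

The median is used here instead of the mean because it is less affected by the spike that has not yet been clipped.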
Data Integration
• Data integration is defined as combining heterogeneous data
from multiple sources into a common source (data
warehouse). Data integration is performed using data
migration tools, data synchronization tools, and the
ETL (Extract-Transform-Load) process.
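As a rough illustration of integration, the sketch below merges two sources that describe the same customers under different field names into one common table. The sources, field names, and the `integrate` helper are hypothetical; production ETL tools do considerably more (scheduling, validation, error handling).

```python
# Two hypothetical sources describing the same customers with different field names.
crm_rows = [
    {"cust_id": 1, "name": "Asha"},
    {"cust_id": 2, "name": "Ben"},
]
sales_rows = [
    {"customer": 1, "amount": 120.0},
    {"customer": 1, "amount": 80.0},
    {"customer": 3, "amount": 45.0},
]

def integrate(crm, sales):
    """Extract both sources, transform them to one schema, and load a merged table."""
    warehouse = {}
    for row in crm:
        warehouse[row["cust_id"]] = {
            "customer_id": row["cust_id"],
            "name": row["name"],
            "total_spent": 0.0,
        }
    for row in sales:
        # Customers seen only in the sales source get a record with an unknown name.
        record = warehouse.setdefault(
            row["customer"],
            {"customer_id": row["customer"], "name": None, "total_spent": 0.0},
        )
        record["total_spent"] += row["amount"]
    return [warehouse[key] for key in sorted(warehouse)]

merged = integrate(crm_rows, sales_rows)
for record in merged:
    print(record)
```

Note that customer 3 appears only in the sales source, so its name stays unknown; such gaps are typically passed back to the cleaning step.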
Data Selection
• Data selection is defined as the process where data
relevant to the analysis is decided and retrieved from
the data collection. Methods such as neural networks,
decision trees, clustering, and regression can be used
for this.
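In its simplest form, selection means keeping only the rows and attributes relevant to the analysis. The sketch below shows that simple case only, not the model-based methods named above; the records, field names, and the `select` helper are hypothetical.

```python
# Hypothetical integrated records; the field names are illustrative.
records = [
    {"customer_id": 1, "region": "north", "age": 34, "total_spent": 200.0, "fax": None},
    {"customer_id": 2, "region": "south", "age": 51, "total_spent": 0.0, "fax": None},
    {"customer_id": 3, "region": "north", "age": 27, "total_spent": 45.0, "fax": None},
]

def select(rows, attributes, condition):
    """Keep only rows satisfying `condition`, projected onto `attributes`."""
    return [{name: row[name] for name in attributes} for row in rows if condition(row)]

# Analysis task (assumed): spending behaviour of customers in the north region.
task_relevant = select(
    records,
    attributes=["customer_id", "age", "total_spent"],
    condition=lambda row: row["region"] == "north",
)
print(task_relevant)
```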
Data Transformation
• Data transformation is defined as the process of transforming data
into the appropriate form required by the mining procedure. Data
transformation is a two-step process:
• Data mapping: Assigning elements from the source base to the
destination to capture transformations.
• Code generation: Creation of the actual transformation program.
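The two steps can be sketched as follows: a mapping that says which source field feeds which destination field, and a function that turns that mapping into a runnable transformation. The field names, the `build_transformer` helper, and the choice of min-max normalization are assumptions made for illustration.

```python
# Data mapping: each destination field names its source field and a transformation.
mapping = {
    "age_scaled": ("age", "min_max"),
    "spent_scaled": ("total_spent", "min_max"),
}

def build_transformer(mapping, rows):
    """'Code generation': turn the mapping into a callable transformation program."""
    bounds = {}
    for source, kind in mapping.values():
        if kind != "min_max":
            raise ValueError(f"unsupported transformation: {kind}")
        column = [row[source] for row in rows]
        bounds[source] = (min(column), max(column))

    def transform(row):
        out = {}
        for dest, (source, _kind) in mapping.items():
            low, high = bounds[source]
            # Min-max normalization to [0, 1]; a constant column maps to 0.0.
            out[dest] = 0.0 if high == low else (row[source] - low) / (high - low)
        return out

    return transform

rows = [
    {"age": 27, "total_spent": 45.0},
    {"age": 34, "total_spent": 200.0},
    {"age": 51, "total_spent": 0.0},
]
transform = build_transformer(mapping, rows)
transformed = [transform(row) for row in rows]
print(transformed)
```

Keeping the mapping as data, separate from the code that applies it, makes it easy to review or change the transformation without touching the program.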
Data Mining
• Data mining is defined as the application of techniques to extract
potentially useful patterns. It transforms task-relevant data into
patterns and decides the purpose of the model using classification or
characterization.
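As one small example of classification, the sketch below uses a nearest-centroid classifier: it summarizes each class by its average point and assigns a new record to the closest one. The technique, the data, and the labels are illustrative assumptions; the source does not prescribe a particular algorithm.

```python
from collections import defaultdict
from math import dist

# Hypothetical task-relevant data: (age_scaled, spent_scaled) with a class label.
training = [
    ((0.1, 0.9), "high_value"),
    ((0.2, 0.8), "high_value"),
    ((0.8, 0.1), "low_value"),
    ((0.9, 0.2), "low_value"),
]

def fit_centroids(examples):
    """Compute the mean point (centroid) of each class."""
    groups = defaultdict(list)
    for point, label in examples:
        groups[label].append(point)
    return {
        label: tuple(sum(axis) / len(points) for axis in zip(*points))
        for label, points in groups.items()
    }

def predict(centroids, point):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], point))

centroids = fit_centroids(training)
print(predict(centroids, (0.15, 0.7)))
```

The centroids themselves are a simple form of characterization: each one summarizes what a typical member of its class looks like.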
Pattern Evaluation
• Pattern evaluation is defined as identifying the truly
interesting patterns that represent knowledge, based on
given measures. It finds the interestingness score of each
pattern, and uses summarization and visualization to
make the data understandable to the user.
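Support, confidence, and lift are three widely used interestingness measures for association rules. The sketch below computes them for one rule; the transactions and the choice of these particular measures are assumptions, since the source does not name specific measures.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset, data):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in data) / len(data)

def evaluate_rule(antecedent, consequent, data):
    """Return support, confidence, and lift for the rule antecedent -> consequent."""
    both = support(antecedent | consequent, data)
    confidence = both / support(antecedent, data)
    lift = confidence / support(consequent, data)
    return both, confidence, lift

rule_support, rule_confidence, rule_lift = evaluate_rule({"bread"}, {"milk"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f} lift={rule_lift:.2f}")
```

A lift below 1, as in this toy data, suggests that buying bread does not make buying milk more likely than it already is, so the rule would score as uninteresting despite its high confidence.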
Knowledge Representation
• This involves presenting the results in a way that is
meaningful and can be used to make decisions.
Advantages of KDD
• Temperature provides a good illustration of some of the concepts that have been
described.
• First, temperature can be either an interval or a ratio attribute, depending on its
measurement scale.
• When measured on the Kelvin scale, a temperature of 2° is, in a physically
meaningful way, twice that of a temperature of 1°.
• This is not true when temperature is measured on either the Celsius or Fahrenheit
scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much
different than a temperature of 2° Fahrenheit (Celsius).
• The problem is that the zero points of the Fahrenheit and Celsius scales are, in a
physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit
temperatures is not physically meaningful.
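The point about ratios can be checked numerically. This short sketch converts 1 and 2 degrees Celsius to Kelvin (adding 273.15) and compares ratios and differences; the variable names are illustrative.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin, whose zero is absolute zero."""
    return celsius + 273.15

# On the Celsius scale the ratio 2/1 suggests "twice as hot" ...
celsius_ratio = 2.0 / 1.0
# ... but the same two temperatures in Kelvin differ by well under 1%.
kelvin_ratio = celsius_to_kelvin(2.0) / celsius_to_kelvin(1.0)

# Differences, by contrast, are the same on both scales (interval property).
celsius_diff = 2.0 - 1.0
kelvin_diff = celsius_to_kelvin(2.0) - celsius_to_kelvin(1.0)

print(celsius_ratio, round(kelvin_ratio, 4), celsius_diff, round(kelvin_diff, 4))
```

Differences survive the change of scale while ratios do not, which is why Celsius temperature is an interval attribute and Kelvin temperature is a ratio attribute.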
Describing Attributes by the Number of Values
• An independent way of distinguishing between attributes is by the number
of values they can take.
Discrete Attribute
• A discrete attribute has a finite or countably infinite set of values. Such
attributes can be categorical, such as zip codes or ID numbers, or numeric,
such as counts.
• Discrete attributes are often represented using integer variables. Binary
attributes are a special case of discrete attributes and assume only two
values, e.g., true/false, yes/no, male/female, or 0/1.
• Binary attributes are often represented as Boolean variables, or as integer
variables that only take the values 0 or 1.
Continuous Attribute
• A continuous attribute is one whose values are real numbers.
• Examples include attributes such as temperature, height, or weight. Continuous
attributes are typically represented as floating-point variables.
• Practically, real values can only be measured and represented with limited precision.
• In theory, any of the measurement scale types—nominal, ordinal, interval, and ratio—
could be combined with any of the types based on the number of attribute values—
binary, discrete, and continuous.
• However, some combinations occur only infrequently or do not make much sense.
• For instance, it is difficult to think of a realistic data set that contains a continuous
binary attribute.
• Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio
attributes are continuous.
• However, count attributes, which are discrete, are also ratio attributes.
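A rough heuristic for labeling a column by the number of values it takes might look like the sketch below. It only inspects the observed sample, so it can mislabel attributes (for example, a continuous attribute that happens to contain only whole numbers); the function name and the rules are assumptions for illustration.

```python
def describe_by_number_of_values(values):
    """Heuristically label a column as binary, discrete, or continuous."""
    distinct = set(values)
    # Non-integer floats suggest a continuous (real-valued) attribute.
    if any(isinstance(v, float) and not v.is_integer() for v in distinct):
        return "continuous"
    # At most two distinct observed values suggests a binary attribute.
    if len(distinct) <= 2:
        return "binary"
    return "discrete"

print(describe_by_number_of_values(["yes", "no", "yes"]))
print(describe_by_number_of_values([0, 3, 1, 7]))
print(describe_by_number_of_values([1.62, 1.75, 1.80]))
```

In practice the attribute's meaning, not just its observed values, decides the type: zip codes are discrete and categorical even though they look like numbers.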
Types of Data Sets
• There are many types of data sets, and as the field of data
mining develops and matures, a greater variety of data sets
becomes available for analysis.
• In this section, we describe some of the most common types.
• For convenience, we have grouped the types of data sets into
three groups: record data, graph based data, and ordered
data.
• These categories do not cover all possibilities, and other
groupings are certainly possible.
General Characteristics of Data Sets
• Dimensionality
• Sparsity
• Resolution
• Record Data
• Transaction or Market Basket Data
• The Data Matrix
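Several of the terms listed above can be seen together in one small sketch: market basket transactions are turned into a binary data matrix, whose number of columns is its dimensionality and whose share of zero entries indicates its sparsity. The transactions are hypothetical, and "fraction of zeros" is used here as one simple way to quantify sparsity.

```python
# Hypothetical market basket data: one set of purchased items per transaction.
transactions = [
    {"bread", "milk"},
    {"beer", "diapers", "milk"},
    {"bread"},
]

# One column per distinct item, in a fixed (sorted) order.
items = sorted(set().union(*transactions))

# Data matrix: one row per transaction, 1 if the item was bought, else 0.
matrix = [[1 if item in t else 0 for item in items] for t in transactions]

dimensionality = len(items)
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (len(matrix) * dimensionality)

print(items)
for row in matrix:
    print(row)
print(f"dimensionality={dimensionality}, sparsity={sparsity:.2f}")
```

With a realistic store catalogue of thousands of items, almost every entry would be zero, which is why such data is usually stored as lists of purchased items instead of a full matrix.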