Data Mining

Lecture Notes

Chapter 02: Data Mining
Data mining is one of the most useful techniques that help entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

These notes cover the main topics of data mining, such as applications, data mining vs. machine learning, data mining tools, social media data mining, data mining techniques, clustering in data mining, and challenges in data mining.

What is Data Mining?


Data mining is the process of extracting information from huge sets of data in order to identify patterns, trends, and useful insights that allow a business to make data-driven decisions.
Advantages of Data Mining:
○ The data mining technique enables organizations to obtain knowledge-based data.
○ Data mining enables organizations to make profitable adjustments in operations and production.
○ Compared with other statistical data applications, data mining is cost-efficient.
○ Data mining helps the decision-making process of an organization.
○ It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
○ It can be introduced into new systems as well as existing platforms.
○ It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.
Data Mining Applications
Challenges of Implementation in Data mining
Data Mining Architecture
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. You need a large amount of historical data for data mining to be successful. Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes, even plain text files or spreadsheets may contain information. Another primary source of data is the World Wide Web, or the internet.

Different processes:
Before passing the data to the database or data warehouse server, the data must be cleaned, integrated, and selected. Because the information comes from various sources and in different formats, it cannot be used directly for the data mining procedure, since the data may be incomplete or inaccurate. So, the data first needs to be cleaned and unified. More information than needed will be collected from various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as easy as they sound; several methods may be performed on the data as part of selection, integration, and cleaning.
Database or Data Warehouse Server:

The database or data warehouse server contains the actual data that is ready to be processed. Hence, the server is responsible for retrieving the relevant data, based on the user's data mining request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains several modules for performing data mining tasks, including association, characterization, classification, clustering, prediction, time-series analysis, etc.

In other words, the data mining engine is the core of the data mining architecture. It comprises the instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.

Pattern Evaluation Module:

The pattern evaluation module is primarily responsible for measuring how interesting a discovered pattern is, usually by comparing it against a threshold value. It collaborates with the data mining engine to focus the search on interesting patterns.
Graphical User Interface:
The graphical user interface (GUI) module communicates between the data mining system and the user.
This module helps the user to easily and efficiently use the system without knowing the complexity of the
process. This module cooperates with the data mining system when the user specifies a query or a task
and displays the results.

Knowledge Base:
The knowledge base is helpful in the entire data mining process. It may be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experiences that can be useful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
KDD - Knowledge Discovery in Databases:
The main objective of the KDD process is to extract information from data in the context of
large databases. It does this by using Data Mining algorithms to identify what is deemed
knowledge.
The KDD Process:

1. Building up an understanding of the application domain

2. Choosing and creating a data set on which discovery will be performed

3. Preprocessing and cleansing

4. Data Transformation

5. Choosing the appropriate data mining task (prediction or description)

6. Selecting the Data Mining algorithm

7. Utilizing the Data Mining algorithm

8. Evaluation

9. Using the discovered knowledge


Data Mining Implementation Process
The Cross-Industry Standard Process for Data Mining (CRISP-DM)
1. Business understanding:

It focuses on understanding the project objectives and requirements from a business point of view, then converting this knowledge into a data mining problem definition, and afterward designing a preliminary plan to accomplish the objectives.

Tasks:

○ Determine business objectives


○ Assess situation
○ Determine data mining goals
○ Produce a project plan
2. Data Understanding:

Data understanding starts with an initial data collection and proceeds with activities to become familiar with the data, to identify data quality issues, to gain first insights into the data, or to detect interesting subsets that suggest hypotheses about hidden information.

Tasks:

○ Collect initial data


○ Describe data
○ Explore data
○ Verify data quality
3. Data Preparation:

○ It usually takes more than 90 percent of the project time.
○ It covers all activities needed to construct the final data set from the initial raw data.
○ Data preparation is likely to be performed several times, and not in any prescribed order.

Tasks:

○ Select data
○ Clean data
○ Construct data
○ Integrate data
○ Format data
4. Modeling:

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Some techniques have specific requirements on the form of the data; therefore, stepping back to the data preparation phase may be necessary.

Tasks:

○ Select modeling technique


○ Generate test design
○ Build model
○ Assess model
5. Evaluation:

○ This phase evaluates the model thoroughly and reviews the steps executed to build it, to ensure that the business objectives are properly achieved.
○ The main objective of the evaluation is to determine whether some significant business issue has not been considered adequately.
○ At the end of this phase, a decision on the use of the data mining results should be reached.

Tasks:

○ Evaluate results
○ Review process
○ Determine next steps
6. Deployment:

Deployment refers to how the results of data mining need to be utilized. Deploying the results can include scoring a database, using the results as company guidelines, or interactive internet scoring.

The knowledge acquired will need to be organized and presented in a way that the client can use. Depending on the demands, the deployment phase may be as simple as generating a report or as complex as implementing a repeatable data mining process across the organization.

Tasks:

○ Plan deployment
○ Plan monitoring and maintenance
○ Produce final report
○ Review project
Data Mining Techniques

Data mining includes the utilization of refined data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates analysis and prediction.

In recent data mining projects, various major data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.
1. Classification:

This technique is used to obtain important and relevant information about data and metadata. This data mining technique helps classify data into different classes.
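
As a rough illustration of the classification technique, the short sketch below trains a decision tree classifier with scikit-learn on the bundled Iris data set (the library, data set, and parameter choices are illustrative assumptions, not part of these notes):

    # Minimal classification sketch (scikit-learn assumed installed).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)              # features and class labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    clf = DecisionTreeClassifier(max_depth=3, random_state=42)
    clf.fit(X_train, y_train)                      # learn classification rules from the training data
    print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))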

○ Classification of data mining frameworks as per the type of data sources mined
○ Classification of data mining frameworks as per the database involved
○ Classification of data mining frameworks as per the kind of knowledge discovered
○ Classification of data mining frameworks according to the data mining techniques used


2. Clustering:

Clustering analysis is a data mining technique used to identify similar data. This technique helps recognize the differences and similarities between data points. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
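
A small, hedged sketch of clustering with k-means follows; the points and the choice of scikit-learn's KMeans are purely illustrative:

    # Minimal k-means clustering sketch (numpy and scikit-learn assumed installed).
    import numpy as np
    from sklearn.cluster import KMeans

    # Invented 2-D points forming two loose groups around (1, 1) and (8, 8).
    X = np.array([[1, 2], [1, 1], [2, 1], [8, 8], [9, 8], [8, 9]])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    print(kmeans.fit_predict(X))        # cluster index assigned to each point
    print(kmeans.cluster_centers_)      # representative centroid of each cluster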

3. Regression:

Regression analysis is the data mining technique used to identify and analyze the relationship between variables in the presence of other factors. It is used to predict the value of a specific variable. Primarily, it describes the relationship between two or more variables in the given data set.
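
For example, a hedged sketch of simple linear regression with scikit-learn (the advertising/sales numbers are invented for illustration):

    # Minimal linear regression sketch (numpy and scikit-learn assumed installed).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[10], [20], [30], [40], [50]])   # invented advertising spend
    y = np.array([25, 45, 65, 85, 105])            # invented sales (exactly y = 2x + 5)

    model = LinearRegression().fit(X, y)
    print(model.coef_, model.intercept_)           # recovered slope and intercept
    print(model.predict([[60]]))                   # predicted sales for a spend of 60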

4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern in
the data set.
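
A minimal sketch of the support and confidence calculations behind association rules, written in plain Python over invented transactions:

    # Support and confidence of a candidate rule over invented market-basket data.
    transactions = [
        {"bread", "butter"},
        {"bread", "butter", "milk"},
        {"milk"},
        {"bread", "milk"},
    ]

    def support(itemset):
        """Fraction of transactions containing every item in itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    antecedent, consequent = {"bread"}, {"butter"}           # candidate rule {bread} -> {butter}
    confidence = support(antecedent | consequent) / support(antecedent)
    print(support(antecedent | consequent))                  # 0.5
    print(confidence)                                        # about 0.67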
5. Outlier Detection:

This type of data mining technique relates to the observation of data items in the data set that do not match an expected pattern or expected behavior. This technique may be used in various domains such as intrusion detection, fraud detection, etc.
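
A hedged sketch of a simple z-score based outlier check (numpy assumed installed; the transaction amounts are invented):

    # Flag values that lie far from the mean as candidate outliers.
    import numpy as np

    amounts = np.array([12.0, 15.0, 14.0, 13.0, 250.0, 16.0, 11.0])   # invented transaction amounts
    z = (amounts - amounts.mean()) / amounts.std()
    print(amounts[np.abs(z) > 2])    # the unusually large 250.0 is flagged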

6. Sequential Patterns:

Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences. In other words, this technique helps discover or recognize similar patterns in transaction data over a period of time.
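
As a rough sketch of the idea, the plain-Python snippet below counts how often a candidate subsequence occurs, in order but not necessarily contiguously, across invented customer event sequences:

    # Support of a candidate sequential pattern over invented event sequences.
    sequences = [
        ["login", "search", "add_to_cart", "checkout"],
        ["login", "add_to_cart", "search", "checkout"],
        ["login", "search", "logout"],
    ]

    def contains(sequence, pattern):
        """True if all events of pattern appear in sequence in the same order."""
        it = iter(sequence)
        return all(event in it for event in pattern)

    pattern = ["login", "search", "checkout"]
    support = sum(contains(s, pattern) for s in sequences) / len(sequences)
    print(support)    # 2 of the 3 sequences contain the pattern in order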

7. Prediction:

Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.
Data Preprocessing in Data Mining:

Data preprocessing is an important step in the data mining process. It refers to cleaning, transforming, and integrating data in order to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task. Some common steps in data preprocessing include:

Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as
missing values, outliers, and duplicates. Various techniques can be used for data cleaning, such as
imputation, removal, and transformation.
Data Integration: This involves combining data from multiple sources to create a unified dataset.
Data integration can be challenging as it requires handling data with different formats, structures,
and semantics. Techniques such as record linkage and data fusion can be used for data integration.

Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to
transform the data to have zero mean and unit variance. Discretization is used to convert continuous
data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as
feature selection and feature extraction. Feature selection involves selecting a subset
of relevant features from the dataset, while feature extraction involves transforming
the data into a lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories
or intervals. Discretization is often used in data mining and machine learning
algorithms that require categorical data. Discretization can be achieved through
techniques such as equal width binning, equal frequency binning, and clustering.
Data Normalization: This involves scaling the data to a common range, such as
between 0 and 1 or -1 and 1. Normalization is often used to handle data with
different units and scales. Common normalization techniques include min-max
normalization, z-score normalization, and decimal scaling.
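
A hedged sketch of three of the transformations described above, using scikit-learn and invented values (min-max normalization, z-score standardization, and equal-width discretization):

    # Minimal scaling and discretization sketch (numpy and scikit-learn assumed installed).
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler, KBinsDiscretizer

    ages = np.array([[18.0], [25.0], [40.0], [60.0], [72.0]])   # invented values

    print(MinMaxScaler().fit_transform(ages))     # min-max normalization to the range [0, 1]
    print(StandardScaler().fit_transform(ages))   # z-score standardization: zero mean, unit variance
    binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
    print(binner.fit_transform(ages))             # equal-width discretization into bins 0, 1, 2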
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It
involves handling of missing data, noisy data etc.

● (a) Missing Data:

This situation arises when some values are missing in the data. It can be handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.

2. Fill the missing values:
There are various ways to do this task. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value.
Removing Records with Missing Values (Listwise Deletion):

● Description: This technique involves deleting rows (or records) that contain missing values. This method is
simple but can lead to a significant loss of data, especially if many records have missing values.

Imputation with Mean/Median/Mode:

● Description: Missing values can be replaced with the mean (for numerical data), median (for skewed numerical
data), or mode (for categorical data) of the available data. This method is widely used because it is simple and
effective for small amounts of missing data.

Interpolation:

● Description: Interpolation is often used in time-series data, where missing values are estimated using
surrounding data points. Linear interpolation is the simplest method, but other methods like spline or polynomial
interpolation can also be used.
● Example: In a time-series dataset of daily temperatures, if data for one day is missing, you could estimate it by
taking the average of the temperatures from the day before and the day after.
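
The pandas sketch below, with invented daily temperatures, illustrates all three approaches side by side (listwise deletion, mean imputation, and linear interpolation); the column names are assumptions for illustration:

    # Minimal missing-value handling sketch (pandas and numpy assumed installed).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"day": [1, 2, 3, 4],
                       "temp": [21.0, np.nan, 25.0, 27.0]})

    dropped = df.dropna()                                    # listwise deletion of incomplete rows
    mean_filled = df.fillna({"temp": df["temp"].mean()})     # mean imputation
    interpolated = df.assign(temp=df["temp"].interpolate())  # linear interpolation from neighboring days
    print(dropped, mean_filled, interpolated, sep="\n\n")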
2. Data Transformation:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
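
For instance, a hedged pandas sketch of concept hierarchy generation, where the low-level attribute “city” is replaced by the higher-level attribute “country” (the cities and the lookup table are invented for illustration):

    # Minimal concept-hierarchy sketch (pandas assumed installed).
    import pandas as pd

    df = pd.DataFrame({"city": ["Lahore", "Karachi", "Mumbai", "Delhi"]})

    # Climb the hierarchy: map each city to its country using a lookup table.
    city_to_country = {"Lahore": "Pakistan", "Karachi": "Pakistan",
                       "Mumbai": "India", "Delhi": "India"}
    df["country"] = df["city"].map(city_to_country)
    print(df)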
3. Data Reduction:
Data reduction is a crucial step in the data mining process that involves reducing the
size of the dataset while preserving the important information. This is done to
improve the efficiency of data analysis and to avoid overfitting of the model. Some
common steps involved in data reduction are:

Feature Extraction: This involves transforming the data into a lower-dimensional


space while preserving the important information. Feature extraction is often used
when the original features are high-dimensional and complex.

Feature Selection: This involves selecting a subset of relevant features from the
dataset. Feature selection is often performed to remove irrelevant or redundant
features from the dataset.
Sampling: This involves selecting a subset of data points from the dataset. Sampling
is often used to reduce the size of the dataset while preserving the important
information.
Example: If you have a dataset of 1 million customer transactions, you might take a random sample of 10,000
transactions to build a predictive model, which can significantly speed up the process.

Clustering: This involves grouping similar data points together into clusters.
Clustering is often used to reduce the size of the dataset by replacing similar data
points with a representative centroid. It can be done using techniques such as
k-means, hierarchical clustering, and density-based clustering.
Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage
and transmission purposes.
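
A hedged sketch of three of these reduction steps on the bundled Iris data set, using scikit-learn and pandas (the choice of library and parameters is illustrative, not prescriptive):

    # Minimal data-reduction sketch (pandas and scikit-learn assumed installed).
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)                          # 150 rows, 4 features

    X_pca = PCA(n_components=2).fit_transform(X)               # feature extraction: project 4 -> 2 dimensions
    X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)   # feature selection: keep the 2 best features
    sample = pd.DataFrame(X).sample(n=30, random_state=0)      # sampling: keep 30 of the 150 rows
    print(X_pca.shape, X_best.shape, sample.shape)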

Data Aggregation

Description: Combines multiple data records into a summary form. This technique is useful when dealing with
large volumes of data.

Example: Instead of storing individual sales transactions, you might aggregate sales data by month and region.
This way, you have monthly sales summaries for each region, which are easier to analyze compared to individual
transaction records.
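
A small pandas sketch of this kind of aggregation, with invented transactions (the column names are assumptions for illustration):

    # Minimal aggregation sketch (pandas assumed installed).
    import pandas as pd

    sales = pd.DataFrame({
        "month":  ["Jan", "Jan", "Jan", "Feb"],
        "region": ["North", "North", "South", "North"],
        "amount": [120.0, 80.0, 200.0, 150.0],
    })

    # Replace individual transactions with monthly totals per region.
    summary = sales.groupby(["month", "region"], as_index=False)["amount"].sum()
    print(summary)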
What is Data Visualization?
Data visualization translates complex data sets into visual formats that are easier for the
human brain to comprehend. This can include a variety of visual tools such as:
● Charts: Bar charts, line charts, pie charts, etc.
● Graphs: Scatter plots, histograms, etc.
● Maps: Geographic maps, heat maps, etc.
● Dashboards: Interactive platforms that combine multiple visualizations.

The primary goal of data visualization is to make data more accessible and easier to
interpret, allowing users to identify patterns, trends, and outliers quickly. This is
particularly important in the context of big data, where the sheer volume of information
can be overwhelming without effective visualization techniques.
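
As a small, hedged illustration, the matplotlib sketch below draws a bar chart and a line chart from the same invented monthly figures:

    # Minimal visualization sketch (matplotlib assumed installed; figures invented).
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr"]
    sales = [120, 150, 90, 180]

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.bar(months, sales)                 # bar chart: compare categories
    ax1.set_title("Sales by month")
    ax2.plot(months, sales, marker="o")    # line chart: trend over time
    ax2.set_title("Sales trend")
    plt.tight_layout()
    plt.show()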
Types of Data for Visualization

Performing accurate visualization of data is very critical to market research, where both numerical and categorical data can be visualized. This helps increase the impact of insights and also helps reduce the risk of analysis paralysis. So, data visualization is categorized into the following categories:
● Numerical Data
● Categorical Data
1. Numerical Data

Numerical data is quantitative and can be measured and expressed numerically. It can be further divided into two
main types: continuous and discrete data.

a. Continuous Data

- Definition: Continuous data can take any value within a given range. It is often measured rather than counted
and can be divided into smaller increments.

- Examples:

- Height: A person’s height can be 170.5 cm, 170.55 cm, etc.

- Temperature: Measured in degrees Celsius or Fahrenheit, such as 25.6°C.

b. Discrete Data

- Definition: Discrete data consists of distinct, separate values and is often counted rather than measured.
Discrete values cannot be divided into smaller increments.

- Examples:

- Test Scores: Scores can be whole numbers, like 85, 90, or 95.
2. Categorical Data

Categorical data represents characteristics or qualities and can be divided into groups or categories. It can be
further classified into binary, nominal, and ordinal data.

a. Binary Data

- Definition: Binary data has only two categories or levels.

- Examples:

- Gender: Male or Female.

b. Nominal Data

- Definition: Nominal data consists of categories without any intrinsic ordering or ranking.

- Examples:

- Types of Fruits: Apples, oranges, bananas.

- Colors: Red, blue, green.


c. Ordinal Data

- Definition: Ordinal data consists of categories with a meaningful order or ranking but without consistent
intervals between the ranks.

- Examples:

- Education Level: Categories like High School, Bachelor’s Degree, Master’s Degree, which have a clear order.
Types of Data Visualization Techniques
Various types of visualizations cater to diverse data sets and analytical goals.
1. Bar Charts: Ideal for comparing categorical data or displaying frequencies, bar
charts offer a clear visual representation of values.
2. Line Charts: Perfect for illustrating trends over time, line charts connect data
points to reveal patterns and fluctuations.
3. Pie Charts: Efficient for displaying parts of a whole, pie charts offer a simple way
to understand proportions and percentages.
4. Scatter Plots: Showcase relationships between two variables, identifying patterns
and outliers through scattered data points.
5. Histograms: Depict the distribution of a continuous variable, providing insights into the underlying data patterns.
6. Heatmaps: Visualize complex data sets through color-coding, emphasizing variations and correlations in a matrix.
7. Box Plots: Unveil statistical summaries such as median, quartiles, and outliers, aiding in data distribution analysis.
8. Area Charts: Similar to line charts but with the area under the line filled, these charts accentuate cumulative data patterns.
9. Bubble Charts: Enhance scatter plots by introducing a third dimension through varying bubble sizes, revealing additional insights.
Tools for Visualization of Data
The following are the 10 best Data Visualization Tools
1. Tableau
2. Microsoft Power BI
3. Zoho Analytics
4. Sisense
5. IBM Cognos Analytics
6. Qlik Sense
7. Domo
8. Looker
9. Klipfolio
10. Microsoft Excel
