Data Mining UNIT-I
UNIT - I Data Mining: Data – Types of Data – Data Mining Functionalities – Interestingness Patterns –
Classification of Data Mining Systems – Data Mining Task Primitives – Integration of Data Mining
System with a Data Warehouse – Major Issues in Data Mining – Data Preprocessing.
UNIT - I
Data Mining:
• Data mining is defined as the procedure of extracting information from huge sets of data.
• It is also defined as mining knowledge from data.
(OR)
• Data Mining involves extracting useful patterns from large datasets. It combines statistics, machine learning,
and database technology to find patterns and relationships.
Example:
Market Basket Analysis: Retailers use data mining to analyze sales data and find that customers who buy
bread often also buy butter. This information helps in cross-selling products by placing them close to each
other.
Structured data is highly organized and formatted, usually stored in databases or spreadsheets. It is easy to
analyze using traditional data mining techniques.
Examples
• Sales Data: Data stored in tables with columns like date, product ID, price, quantity sold, etc. For instance, a
retail chain can mine this data to find seasonal sales patterns.
• Customer Data: Information such as name, age, address, purchase history, etc., can be mined to understand
customer preferences and segment markets.
• Unstructured data is messy and hard to understand because it doesn't follow a set pattern. It includes things
like text, pictures, and videos.
Dr. J. PAVNU SAI ASSOCIATE PROFESSOR DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SREE DATTHA
INSTITUTIONS OF ENGINEERING AND SCIENCE
(Established by Vyjayanthi Educational Society, approved by AICTE and affiliated to JNTUH)
Sheriguda (V), Ibrahimpatnam (M), Ranga Reddy Dist – 501510. www.sreedattha.ac.in/sdgi/
Examples:
• Text Data: Mining reviews, blogs, and social media posts to analyze customer sentiment about a product. For
example, companies analyze tweets to gauge public opinion about a new product launch.
• Images and Videos: Face recognition systems mine image data to identify individuals. For instance, social
media platforms like Facebook use this to tag people in photos automatically.
Semi-structured data is like a mix of organized and messy data. It has some structure, like tags that show what
each part means, but it's not as strict as completely organized data.
Examples:
XML/JSON Files: These files have tags that tell you what each part is, like a letter with parts for the sender,
receiver, and message.
Log Files: These files keep track of what's happening on a computer or program. They can be used to figure
out how people use a website or what problems are happening.
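A minimal sketch of how the tags in semi-structured data can be used, written in Python. The record and its field names (sender, receiver, message) are made up for illustration and simply echo the letter analogy above.

```python
import json

# Hypothetical JSON record; the field names are illustrative only.
raw = '{"sender": "Asha", "receiver": "Ravi", "message": "Meeting at 10 am"}'

record = json.loads(raw)          # parse the text into a dictionary
fields = sorted(record.keys())    # the tags tell us what each part means

print(fields)
print(record["message"])
```

Because every value is labelled by its tag, a program can pick out one part (here the message) without knowing the order of the fields.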
Time-series data is a sequence of data points collected at consistent time intervals. It is used to analyze
trends and patterns over time.
Examples:
Stock Prices: Daily stock prices can be mined to predict future price movements using time-series
forecasting models.
Sensor Data: Data from sensors in smart homes or industrial equipment, such as temperature readings or
energy consumption, can be mined for predictive maintenance.
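One simple way to expose a trend in time-series data is a moving average. The sketch below is only an illustration: the function name, the window size, and the temperature readings are assumptions, not part of the notes.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical daily temperature readings (degrees Celsius).
readings = [20.0, 22.0, 21.0, 25.0, 24.0, 26.0]
trend = moving_average(readings, window=3)
print(trend)
```

Each output value averages three neighbouring readings, so short-lived spikes are damped and the overall upward trend is easier to see.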
Spatial data relates to geographical or location-based information. It involves mining data that has a spatial
component, such as latitude and longitude.
Examples:
GIS Data: Geographic Information Systems (GIS) data can be mined to find patterns like hotspots of crime in
a city.
Satellite Images: Mining satellite data for environmental monitoring, like detecting deforestation patterns or
urban development.
Multimedia data includes images, audio, and video. Mining such data involves extracting useful patterns or
information.
Examples:
Video Surveillance: Analyzing video feeds to detect unusual activities in security monitoring.
Music Recommendation: Mining audio data to recommend songs based on listening habits, such as
Spotify's recommendation engine.
Web data is mined from web pages, social media, and online transactions. It is often semi-structured and
involves complex data formats.
Examples:
Web Page Content: Searching and collecting information from websites to create search engine databases
or generate insights about trending topics.
Clickstream Data: Analyzing users' click paths on a website to understand their behavior and optimize user
experience.
Graph data consists of nodes and edges, representing relationships between entities. It is particularly useful
in network analysis.
Examples:
Social Networks: Mining data from platforms like Facebook or LinkedIn to find influencer nodes or detect
communities.
Supply Chain Networks: Analyzing the connections between suppliers, manufacturers, and distributors to
optimize logistics.
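As a rough sketch of mining graph data, the Python snippet below counts the connections (degree) of each node and treats the best-connected node as an influencer candidate. The edge list is hypothetical, and degree is only one of several possible influence measures.

```python
from collections import defaultdict

# Hypothetical friendship edges in a small social network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

degree = defaultdict(int)
for u, v in edges:
    degree[u] += 1   # each undirected edge adds one connection to both ends
    degree[v] += 1

# The node with the most connections is a simple "influencer" candidate.
influencer = max(degree, key=degree.get)
print(influencer, degree[influencer])
```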
Sequence data involves ordered data that is mined to find sequential patterns. It is useful for applications like
behavior analysis.
Examples:
Customer Transactions: Analyzing sequences of purchases to understand buying patterns, such as the order
in which products are typically bought.
Biological Sequences: Mining DNA or protein sequences to find motifs that may indicate a genetic trait or
disease.
High-dimensional data contains a large number of features, which makes traditional analysis challenging. It is
often encountered in areas like genomics or image processing.
Examples:
Image Recognition: High-dimensional pixel data in images is analyzed to recognize objects or faces.
Each data type requires specific preprocessing and analysis techniques, making data mining a diverse and
adaptable field.
A Relational Database Management System (RDBMS) is a widely used system for managing structured data.
It stores data in tables, where each table is a collection of rows and columns. In an RDBMS, data is organized
based on a schema (a predefined structure), and relationships between different data tables are established
using keys (primary and foreign keys).
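The idea of tables linked by keys can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows below are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customers(id),"  # foreign key
    " amount REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 250.0), (2, 1, 100.0), (3, 2, 75.0)],
)

# Join the two tables through the key and total each customer's spending.
rows = conn.execute(
    "SELECT c.name, SUM(o.amount) FROM customers c"
    " JOIN orders o ON o.customer_id = c.id"
    " GROUP BY c.id ORDER BY c.id"
).fetchall()
print(rows)
```

The primary key identifies each customer, and the foreign key in the orders table points back to it, which is what makes the join possible.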
Example:
1. Analyzing customer data to predict the credit risk of new customers based on data about previous customers.
A data warehouse is a collection of data integrated from different sources that supports querying and decision
making. In a data warehouse, data is stored in a multidimensional structure (a data cube) in which each
dimension corresponds to an attribute.
While RDBMS focuses on efficient storage and retrieval of operational data, Data Warehouses are specialized
databases designed for analytical purposes. A data warehouse integrates data from various sources and
stores it in a way optimized for data mining and business intelligence.
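A data cube can be pictured as totals that are kept for chosen combinations of dimensions. The sketch below is a simplified illustration in Python, assuming three dimensions (region, product, quarter) and invented sales figures; a real warehouse would use an OLAP engine rather than a dictionary.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter, units_sold).
facts = [
    ("North", "Bread", "Q1", 120),
    ("North", "Milk", "Q1", 80),
    ("South", "Bread", "Q1", 60),
    ("North", "Bread", "Q2", 150),
]

def roll_up(facts, dims):
    """Sum units over the chosen dimension indexes (0=region, 1=product, 2=quarter)."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row[3]
    return dict(totals)

by_region = roll_up(facts, dims=(0,))
by_product_quarter = roll_up(facts, dims=(1, 2))
print(by_region)
print(by_product_quarter)
```

Choosing fewer dimensions gives a coarser summary (totals per region), while choosing more gives a finer one (totals per product and quarter).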
In a transactional database, each record is called a transaction (for example, a sale, a flight booking, or a
user's clicks on a web page). A transaction has a transaction ID and a list of the items that make up the
transaction, along with details such as its start time, end time, location, and other transaction data. From a
transactional database, we can mine frequent patterns.
Data mining functionalities refer to the different types of patterns or knowledge that can be discovered from
data. These functionalities provide the core objectives of a data mining task and are categorized into
descriptive and predictive functionalities.
Descriptive data mining is used to find patterns that describe the general properties of the data. It focuses on
summarizing or identifying trends and associations in datasets.
Frequent pattern mining identifies relationships or associations between items in a dataset. It seeks patterns
where items frequently occur together.
Example:
Market Basket Analysis: In a supermarket, frequent pattern mining can discover that customers who buy
bread often also buy milk. A common rule might be {Bread} → {Milk}, meaning customers who buy bread are
likely to buy milk as well.
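The strength of a rule such as {Bread} → {Milk} is usually measured by its support (how often both items appear together) and its confidence (how often milk appears when bread does). A minimal sketch in Python, using made-up baskets:

```python
# Hypothetical shopping baskets; each set is one transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Rule {bread} -> {milk}
both = count({"bread", "milk"})
rule_support = both / len(transactions)      # share of all baskets with both items
rule_confidence = both / count({"bread"})    # share of bread baskets that also have milk
print(rule_support, rule_confidence)
```

In this toy data the rule holds in 3 of 5 baskets, and in 3 of the 4 baskets that contain bread.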
Clustering is the process of grouping similar data points into clusters so that data points within a
group are more similar to each other than to those in other groups. It helps discover the inherent structure in
data.
Example:
Customer Segmentation: In marketing, customers can be grouped based on purchasing behavior into
segments like high spenders, frequent buyers, and occasional shoppers. These clusters help in targeted
marketing.
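The grouping idea can be sketched with a very small k-means routine. This is an illustration only: it works on one-dimensional data, starts from fixed (assumed) centers instead of random ones, and uses invented spending figures.

```python
def kmeans_1d(values, centers, rounds=10):
    """Very small k-means for one-dimensional data with given starting centers."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for v in values:
            # assign each value to its nearest center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 95, 100, 105, 500, 520]
centers, clusters = kmeans_1d(spend, centers=[10.0, 100.0, 400.0])
print(centers)
print(clusters)
```

The three resulting groups correspond loosely to occasional shoppers, frequent buyers, and high spenders.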
2.1.3. Summarization
Summarization techniques provide a concise representation of the data, often in the form of
summary statistics or visualizations.
Example:
Descriptive Statistics: Calculating the mean, median, standard deviation, or generating summary reports of
sales data to understand overall trends.
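These summary statistics can be computed with Python's standard statistics module. The sales figures below are hypothetical.

```python
import statistics

# Hypothetical daily sales figures.
sales = [200, 220, 250, 210, 1000, 230, 240]

summary = {
    "mean": statistics.mean(sales),
    "median": statistics.median(sales),
    "stdev": statistics.stdev(sales),   # sample standard deviation
}
print(summary)
```

Note how the single large value pulls the mean well above the median, which is why a summary usually reports both.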
Anomaly detection aims to identify rare or unusual data points that deviate significantly from the
general pattern of the dataset.
Example:
Fraud Detection: Detecting credit card transactions that are significantly larger or from unusual locations to
identify potential fraud.
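One simple way to flag unusual values is the z-score, which measures how many standard deviations a value lies from the mean. The sketch below assumes a threshold of 2 and invented transaction amounts; real fraud detection combines many more signals.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return values whose z-score magnitude exceeds `threshold`."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)   # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [40, 55, 38, 60, 45, 52, 48, 5000]
suspicious = find_outliers(amounts)
print(suspicious)
```

The function assumes the values are not all identical, since the standard deviation would then be zero.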
Predictive data mining involves predicting unknown or future values based on the patterns in the data. It uses
known information to forecast outcomes or classify data.
2.2.1 Classification
Example:
Spam Detection: Emails can be classified as "Spam" or "Not Spam" based on the presence of certain words
or patterns in the email content using classification algorithms like Decision Trees, Naive Bayes, or Support
Vector Machines.
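To make the spam example concrete, here is a small Naive Bayes sketch in Python. The training emails are invented, and the add-one (Laplace) smoothing is a common choice assumed here rather than something stated in the notes.

```python
import math
from collections import Counter

# Hypothetical labelled training emails.
training = [
    ("win free prize now", "spam"),
    ("free money offer", "spam"),
    ("meeting agenda for monday", "not spam"),
    ("project report attached", "not spam"),
]

word_counts = {"spam": Counter(), "not spam": Counter()}
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = set(word_counts["spam"]) | set(word_counts["not spam"])

def classify(text):
    """Pick the label with the highest Naive Bayes log-probability."""
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(training))   # prior
        for word in text.split():
            # add-one smoothing avoids zero probabilities for unseen words
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("free prize offer"))
print(classify("monday project meeting"))
```

The classifier learns which words are typical of each class from the labelled examples and then assigns a new email to the more probable class.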
2.2.2 Regression
Regression is used to predict a continuous value (numerical output) based on input data. It
estimates the relationships among variables.
Example:
House Price Prediction: Using regression analysis, one can predict house prices based on features like
square footage, number of rooms, and location.
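A minimal sketch of regression with a single feature, using the ordinary least squares formulas for a straight line. The house data is invented and deliberately lies on a perfect line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical houses: square footage and price (in thousands).
area = [1000, 1500, 2000, 2500]
price = [150, 200, 250, 300]

slope, intercept = fit_line(area, price)
predicted = slope * 1800 + intercept   # estimated price for an 1800 sq ft house
print(slope, intercept, predicted)
```

Real price models would include more features, such as the number of rooms and the location, which calls for multiple regression.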
2.2.3 Prediction
Prediction involves using historical data to make predictions about future outcomes. It can use
classification or regression techniques.
Example:
Sales Forecasting: Predicting future sales based on past sales data, seasonal trends, and market conditions.
Sequential pattern mining discovers regular sequences or patterns over time in datasets.
Example:
Purchase Behavior Analysis: A retail store might find that customers who buy a smartphone often purchase
a phone case and a charger within the next few weeks.
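A rough sketch of the idea: count how many customers bought item A at some point before item B, and keep the ordered pairs that occur often enough. The purchase histories and the minimum count of 2 are assumptions for illustration; full sequential pattern miners handle longer sequences and time windows.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each listed in time order.
histories = [
    ["smartphone", "phone case", "charger"],
    ["smartphone", "charger"],
    ["laptop", "mouse"],
    ["smartphone", "phone case"],
]

pair_counts = Counter()
for history in histories:
    # combinations() keeps the original order, so (a, b) means a came before b
    ordered_pairs = set(combinations(history, 2))
    pair_counts.update(ordered_pairs)   # count each pair once per customer

frequent = [pair for pair, n in pair_counts.items() if n >= 2]
print(sorted(frequent))
```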
Evolution and deviation analysis focuses on analyzing how data changes over time and identifying
deviations or trends in the evolution of the data.
Example:
Stock Market Analysis: Tracking the rise and fall of stock prices over time to detect trends or significant
deviations from expected performance.
Correlation analysis measures the statistical relationship between two or more variables.
It helps identify dependencies between variables.
Example:
Health Data: Correlation analysis can be used to determine if there's a relationship between exercise
frequency and blood pressure levels.
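A common measure for this is the Pearson correlation coefficient, which ranges from -1 (perfect negative relationship) to +1 (perfect positive relationship). The sketch below computes it from its definition; the exercise and blood pressure figures are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: exercise sessions per week and systolic blood pressure.
exercise = [0, 1, 2, 3, 4, 5]
pressure = [140, 138, 135, 130, 128, 124]

r = pearson(exercise, pressure)
print(round(r, 3))
```

A value close to -1, as in this toy data, indicates that more exercise goes together with lower blood pressure. Correlation alone does not show that one causes the other.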
Text mining involves extracting meaningful information or patterns from unstructured text
data.
Example:
Sentiment Analysis: Mining customer reviews or social media posts to determine whether the sentiment
toward a product or service is positive, negative, or neutral.
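A very simple form of sentiment analysis counts positive and negative words from a lexicon. The word lists and reviews below are made up; practical systems rely on much larger lexicons or trained models.

```python
# Tiny hand-made sentiment lexicon; real systems use far larger resources.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment(review):
    """Label a review by counting positive and negative words."""
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Great phone and I love the camera"))
print(sentiment("Terrible battery and poor screen"))
print(sentiment("It arrived on Monday"))
```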
3. Interestingness Patterns
In a data mining system, thousands, millions, or even billions of data patterns are generated every day.
However, among all these patterns, how many are truly interesting or useful to the user?
(OR)
The interestingness of patterns refers to the fact that not all patterns generated by a data mining system are
actually useful. Only the patterns that are relevant or beneficial to us should be selected.
Example:
When you visit a supermarket or go shopping in a mall like Inorbit Mall, you're surrounded by a wide variety
of stores and sections—accessories, footwear, gaming, food, western clothes, traditional clothes, and more.
However, you don’t need to visit every store, especially if you're there for something specific, say, for an urgent
need. In such a case, you won't browse through all the shops. You'll go directly to the store that has what you
need, make your purchase, and leave.
A Pattern is interesting if it is
- In reality it is not possible for a data mining system to generate all interesting patterns.
- If only interesting patterns are generated, it becomes easy and efficient for the user (time is saved).
This classification is based on the type of data the system processes. Different systems are optimized for
different kinds of data.
Example: Analyzing customer purchases in an e-commerce relational database to discover frequent buying
patterns.
Example: Market basket analysis, which mines the transactions of items bought together in a store to
recommend frequently bought item sets.
These systems handle spatial data like geographic or location-based data. They often
involve complex data structures like polygons, lines, or grids.
Example: Mining satellite images or geographic information system (GIS) data to discover patterns, such as
identifying areas prone to flooding.
These systems mine multimedia data, which includes images, videos, audio, and other
unstructured data types. They require specialized algorithms for image recognition, sound classification, and
more.
Example: Identifying trends in video content consumption on platforms like YouTube or Netflix.
These systems are designed to handle time-series data (data recorded over time) and
sequence data (ordered events), like stock prices or web logs.
Example: Predicting future stock prices based on historical price data, or mining patterns from user activity
logs on websites.
These systems mine unstructured or semi-structured text data and web data. Text mining
systems extract valuable insights from documents, emails, blogs, or social media posts, while web mining
focuses on online content and hyperlinks.
Example: Sentiment analysis of social media posts to gauge public opinion on a product or brand.
Example: Integrating customer data from a company's relational database with social media data and
analyzing both sources for marketing insights.
This classification is based on the type of patterns or knowledge the system seeks to uncover.
Example: Clustering systems that group similar customer profiles based on purchasing behaviors.
Example: Classification systems that predict whether a customer will respond to a marketing campaign
based on past responses.
Example: Decision tree classifiers or neural networks used to predict whether a loan applicant is likely to
default.
Example: Regression analysis used to predict future sales based on previous sales data.
Example: Using heat maps to identify correlations between various stock prices.
Example: Fraud detection systems used in banks to identify suspicious transactions based on historical
patterns.
Example: Predictive systems that forecast patient readmissions based on historical medical data.
These systems are optimized for retail data to analyze sales patterns,
customer behavior, and inventory management.
Example: Market basket analysis systems used by retailers to recommend complementary products to
customers.
Example: Systems that analyze phone call records to detect unusual calling patterns that could indicate
fraudulent activity.
Example: Self-tuning systems that automatically analyze network data to optimize performance without
human intervention.
These systems allow users to guide and control the mining process,
providing domain expertise to refine the results.
Example: A system where analysts can input specific queries or constraints and interactively explore patterns
in financial or medical data.
Example: A rule-based system that uses expert knowledge to discover patterns in a medical dataset for
disease diagnosis.
Example: A clustering algorithm that groups customers into segments based purely on purchase history,
without predefined rules.
Example: Real-time stock trading systems that analyze market trends as new stock prices come in.
These systems mine static, historical datasets that have already been
collected and stored.
Example: A system that mines a company’s customer database to find patterns in past purchasing behavior
for marketing purposes.
The task primitives in data mining can be categorized into the following aspects:
Structured data stored in tables with rows and columns, often involving
relationships between different tables.
Example: Customer purchase data stored in a relational database.
Example: Stock prices or weather data measured daily over several years.
Unstructured or semi-structured text data and data from web content (e.g., web
pages, social media posts).
This refers to the specific patterns or knowledge the user aims to extract
from the data. Different tasks can lead to different types of knowledge.
5.2.1. Characterization
5.2.2. Discrimination
5.2.3. Association
5.2.4. Classification
5.2.5. Prediction
5.2.6. Clustering
Groups objects into clusters based on their similarity, without using predefined categories.
Example: Grouping customers into segments based on purchasing patterns for targeted marketing.
Background knowledge refers to any pre-existing knowledge or constraints that guide the data mining
process. This knowledge helps in making the data mining process more effective and relevant.
Knowledge about the specific domain of the data, such as business rules or scientific facts.
Example: In medical data mining, knowing that certain symptoms often indicate a particular disease can
guide the mining process.
5.3.2. Constraints
Specific rules or constraints applied to the mining process, such as data ranges or filters.
Example: Mining only transactions with a minimum purchase amount above a certain threshold.
Custom preferences provided by the user regarding what patterns are interesting or useful.
Example: Focusing the mining process only on high-value customers in a sales dataset.
These measures help evaluate and rank the discovered patterns to ensure that only useful, relevant, and
novel information is presented. Not all patterns discovered are valuable, so interestingness measures help
filter out unimportant patterns.
These are based on the statistical properties of the data and patterns, such as frequency,
confidence, and correlation.
Example: In association rule mining, support and confidence values can be used to rank association rules.
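Ranking by objective measures can be sketched as a filter followed by a sort. The rules, their support and confidence values, and the two thresholds below are all assumed for illustration.

```python
# Hypothetical rules with support and confidence already computed.
rules = [
    {"rule": "bread -> milk", "support": 0.60, "confidence": 0.75},
    {"rule": "milk -> eggs", "support": 0.40, "confidence": 0.50},
    {"rule": "butter -> bread", "support": 0.40, "confidence": 1.00},
    {"rule": "eggs -> butter", "support": 0.05, "confidence": 0.90},
]

MIN_SUPPORT = 0.30       # assumed user-chosen thresholds
MIN_CONFIDENCE = 0.70

kept = [
    r for r in rules
    if r["support"] >= MIN_SUPPORT and r["confidence"] >= MIN_CONFIDENCE
]
# Rank the surviving rules: highest confidence first, then highest support.
ranked = sorted(kept, key=lambda r: (r["confidence"], r["support"]), reverse=True)
print([r["rule"] for r in ranked])
```

Rules that fall below either threshold are discarded as uninteresting, however high their other measure may be.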
Example: A surprising pattern that contradicts a user's expectations might be considered more interesting
than a well-known pattern.
5.4.3. Simplicity
Simpler patterns are often considered more interesting because they are easier to interpret.
Example: A decision tree with fewer levels is generally preferable to a complex tree that’s harder to
understand.
How the results of the data mining process are presented to the user is important for making sense of the
data. This involves deciding the form in which the mined knowledge will be displayed.
Visual representations like bar charts, pie charts, scatter plots, or line graphs help users see
patterns more clearly.
Example: A textual report summarizing the clusters discovered in a dataset and their key characteristics.
The level of user involvement in the data mining process can vary. This primitive
specifies whether the system operates autonomously or interacts with the user for input and feedback.
The system allows users to interact with the mining process, refine parameters, and explore
different results.
Example: A user can set thresholds for support and confidence when generating association rules.
The system mines the data automatically without requiring user input.
Example: An automated system that detects anomalies in real-time network traffic without user
intervention.
6.1. No Coupling:
"No coupling" means that there is no integration between the data mining system and the database
or data warehouse. The data mining system is not connected to, nor does it use any functionalities of the
database or data warehouse. Without communication with the database, the question arises—where will
the data mining system obtain the data? To mine data, you must first have access to it. In this case, the data
mining system retrieves the data from alternative storage methods, directly communicating with those
sources instead of a traditional database or data warehouse.
In tight coupling, the data mining system is embedded within the data warehouse itself. The data
mining tools use the warehouse’s database as their primary data source, and the mining algorithms are
optimized for the warehouse's architecture.
Benefits:
Example: A telecom company uses a tightly integrated system to mine call detail records for fraud
detection, leveraging real-time access to the data warehouse.
In loose coupling, the data mining system and the data warehouse are connected but operate
independently. The data mining system pulls data from the data warehouse as needed but stores its results
separately.
Benefits:
• Flexibility in choosing the data mining tools and algorithms, as they are not constrained by the warehouse's
architecture.
• Easier to upgrade or replace the data mining system without affecting the data warehouse.
This approach is a mix of tight and loose coupling. The data mining system interacts with the
data warehouse but uses intermediate data processing tools (e.g., data marts or OLAP servers) to perform
operations.
Benefits:
• Improved performance compared to loose coupling as some intermediate data processing is done within
the warehouse or in data marts.
• Flexibility to perform mining outside the warehouse if needed, but with better integration than loose
coupling.
Example: A retail chain might maintain data marts for each region, where regional managers use semi-tight-
coupled data mining tools to forecast demand and optimize stock levels.
A data mining system has many users, and each user has different needs. To satisfy all of them, the system
has to cover a wide range of knowledge types.
"Interactive" means that the user can refine a search step by step. For example, suppose you want to find the
roll numbers of all CSE students whose names start with "S" and who are female, and you only need up to 10
results. First, you would filter by department (CSE) and check how many records you have. Next, you would
apply the filter for names starting with "S". If you still have more than 10 results, you would then apply the
final filter for gender (female). Applying filters one by one helps ensure you don't overlook records. If you
applied all filters at once, you might end up with fewer results than needed. Therefore, filtering step by step
can help achieve a specific number of results more effectively.
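The step-by-step filtering described above can be sketched as follows. The student records are hypothetical, and the count is checked after each filter, as an interactive user would do.

```python
# Hypothetical student records.
students = [
    {"roll": 101, "name": "Sita", "dept": "CSE", "gender": "F"},
    {"roll": 102, "name": "Suresh", "dept": "CSE", "gender": "M"},
    {"roll": 103, "name": "Anil", "dept": "CSE", "gender": "M"},
    {"roll": 104, "name": "Swathi", "dept": "ECE", "gender": "F"},
    {"roll": 105, "name": "Sneha", "dept": "CSE", "gender": "F"},
]

filters = [
    ("department is CSE", lambda s: s["dept"] == "CSE"),
    ("name starts with S", lambda s: s["name"].startswith("S")),
    ("gender is female", lambda s: s["gender"] == "F"),
]

remaining = students
counts = []
for label, keep in filters:
    remaining = [s for s in remaining if keep(s)]
    counts.append(len(remaining))      # inspect the result size after each step
    print(label, "->", len(remaining), "records")

rolls = [s["roll"] for s in remaining]
print(rolls)
```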
Background knowledge is essential not only in data mining but in any new project or
subject you undertake.
Example: If you're preparing for a data mining exam, you first need foundational knowledge on the topic.
You might search for PDFs on Google, watch YouTube videos, or gather notes from faculty to build a solid
base. Similarly, in data mining, the domain knowledge relevant to the project must be included and utilized
in the current data mining processes to improve understanding and outcomes.
Presenting and visualizing data mining results is essential to make insights clear and
actionable. By transforming complex patterns and trends into understandable charts, graphs, and
dashboards, data mining outcomes become accessible to a wider audience, enabling effective decision-
making and deeper insights into the data.
Example: Suppose a data mining system analyzes customer purchasing patterns for a retail store. The
results could reveal that young adults are more likely to buy electronics on weekends. To present this finding,
you might use:
• Bar Charts to show the increase in electronics purchases over different days of the week.
• Heat Maps to highlight peak shopping hours and locations within the store.
These visualizations make it easier for store managers to identify key trends at a glance and adjust marketing
or inventory strategies accordingly.
Noisy or incomplete data can lead to inaccuracies in data mining results. Effective
handling involves techniques to clean and preprocess the data, such as:
• Data Imputation: Filling in missing values using methods like mean, median, or predictive models.
• Smoothing Techniques: Reducing noise by applying smoothing algorithms, such as moving averages or
binning.
These steps help improve the quality of the data, ensuring that data mining results are more accurate and
reliable.
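Both techniques can be sketched in a few lines of Python. The example below fills a missing value with the mean of the known values and smooths a list by bin means (equal-size bins of three values). The function names and the data are illustrative assumptions.

```python
def impute_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-size bins, and replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical price readings with one missing value.
prices = [4, 8, None, 21, 15, 24, 28, 34, 21]
filled = impute_mean(prices)
smoothed = smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], bin_size=3)
print(filled)
print(smoothed)
```

Mean imputation is simple but can hide real variation, so the median or a predictive model may be a better choice when the data is skewed.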
Efficiency and scalability are critical aspects of data mining algorithms, especially
when handling large datasets. An efficient algorithm processes data within a reasonable timeframe and uses
resources optimally, while a scalable algorithm maintains its performance as the data size grows. Key
techniques for achieving efficiency and scalability include:
• Optimized Data Structures: Using efficient structures like hash tables or indexed databases to speed up data
access.
• Parallel Processing: Distributing tasks across multiple processors to handle large datasets quickly.
• Sampling Methods: Analyzing a representative subset of data to reduce processing time with little loss of
accuracy.
By focusing on these aspects, data mining algorithms can effectively handle increasing data volumes while
delivering timely insights.
Example: Consider a data mining algorithm used by a social media platform to detect trending topics among
millions of daily posts. To handle this efficiently, the algorithm could:
• Use optimized data structures to index posts by keywords, hashtags, and timestamps, allowing for faster
search and retrieval.
• Implement parallel processing by splitting the data across multiple servers that analyze posts concurrently,
reducing the time required to process massive volumes.
• Apply sampling methods by analyzing a subset of posts from highly active users to quickly detect trends,
while still providing an approximate view of what's popular.
These techniques enable the algorithm to scale effectively as the number of posts grows, delivering timely
insights on trending topics across the platform.
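Two of these techniques, an index built on a hash table and sampling, can be sketched together. The posts below are invented, and for simplicity the sample is drawn at random from all posts rather than from highly active users; parallel processing is left out to keep the example short.

```python
import random
from collections import Counter, defaultdict

# Hypothetical posts; each is (post_id, text).
posts = [
    (1, "#cricket final tonight"),
    (2, "new phone launch #tech"),
    (3, "what a match #cricket"),
    (4, "#tech review of the phone"),
    (5, "#cricket highlights"),
    (6, "rainy day"),
]

# Optimized data structure: a hash-based index from hashtag to post ids.
index = defaultdict(list)
for post_id, text in posts:
    for word in text.split():
        if word.startswith("#"):
            index[word].append(post_id)

# Sampling: estimate the top hashtag from a random subset of posts.
rng = random.Random(0)            # fixed seed so the run is repeatable
sample = rng.sample(posts, 4)
sample_counts = Counter(
    word for _, text in sample for word in text.split() if word.startswith("#")
)

print(index["#cricket"])
print(sample_counts.most_common(1))
```

The index answers "which posts mention this hashtag?" without scanning every post, and the sample gives a quick estimate whose accuracy depends on how representative it is.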