Data Mining Unit-V
UNIT-V
Data Stream: A continuous and potentially infinite sequence of data instances that are generated
and arrive over time. Examples include sensor readings, tweets, stock market ticks, etc.
Stream Window: A fixed-size or sliding window used to capture a subset of the data stream for
analysis. This allows focusing on a specific portion of the data stream to perform computations and
extract patterns.
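A minimal sketch of a sliding window over a stream, using Python's collections.deque; the window size and the windowed average are arbitrary choices for illustration:

from collections import deque

WINDOW_SIZE = 5                       # illustrative fixed window size
window = deque(maxlen=WINDOW_SIZE)    # deque evicts the oldest item automatically

def process(reading):
    window.append(reading)            # slide the window forward by one element
    return sum(window) / len(window)  # e.g., a windowed average

for value in [3, 5, 4, 8, 2, 9, 7]:   # stands in for an unbounded stream
    print(process(value))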
Concept Drift: The phenomenon where the statistical properties of the data stream change over
time. This could be due to various factors such as changes in user behavior, environment, or system
dynamics.
Feature Extraction: The process of extracting relevant features or attributes from the data stream
that are suitable for analysis. This is crucial for reducing the dimensionality of the data and
capturing essential information.
Online Learning: Techniques for updating models or making predictions incrementally as new
data arrives. Online learning algorithms adapt to changes in the data stream and can handle large
volumes of data efficiently.
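As a hedged sketch of online learning, the following uses scikit-learn's SGDClassifier, whose partial_fit method updates the model one mini-batch at a time; the batches and labels here are synthetic stand-ins for arriving stream data:

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()                       # linear classifier trained by stochastic gradient descent
classes = np.array([0, 1])                    # all classes must be declared on the first call

for _ in range(100):                          # each iteration mimics a newly arrived mini-batch
    X = rng.normal(size=(32, 4))
    y = (X[:, 0] > 0).astype(int)             # synthetic labels, for illustration only
    model.partial_fit(X, y, classes=classes)  # incremental update; no retraining from scratch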
Clustering: Grouping similar data instances together in the data stream. Clustering algorithms can
identify patterns or anomalies in the stream and are useful for exploratory analysis.
Classification: Assigning data instances to predefined categories or classes based on their features.
Classification models can be trained incrementally on streaming data and used for tasks such as
spam detection, sentiment analysis, etc.
Anomaly Detection: Identifying unusual or unexpected patterns in the data stream that deviate
from normal behavior. Anomaly detection algorithms help detect potential problems or fraud in
real-time applications.
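One simple approach, sketched below under the assumption that "normal" readings cluster around a stable mean, is a running z-score test built on Welford's online mean/variance update; the 3-standard-deviation threshold and the sample values are illustrative:

class StreamingDetector:
    """Flags readings that sit far from the running mean."""
    def __init__(self, threshold=3.0):
        self.n, self.mean, self.m2, self.threshold = 0, 0.0, 0.0, threshold

    def observe(self, x):
        # score the new reading against the statistics seen *before* it
        std = (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0
        flagged = std > 0 and abs(x - self.mean) / std > self.threshold
        # Welford's incremental update: no past readings are stored
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return flagged

detector = StreamingDetector()
for value in [10, 11, 9, 10, 12, 95, 10]:
    if detector.observe(value):
        print("anomaly:", value)              # 95 is flagged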
Scalability: The ability of mining algorithms to handle large volumes of data efficiently in
real time. Scalable algorithms are essential for processing data streams with high velocity and
volume without overwhelming computational resources.
These concepts form the foundation for analyzing and extracting useful information from data
streams in various applications. Advanced techniques and algorithms are continuously developed to
address the challenges posed by streaming data and enable real-time analytics and decision-
making.
Mining Time-Series Data:
Mining time-series data involves analyzing data points collected at regular intervals over time to
identify patterns, trends, and relationships. Time-series data analysis is crucial in various domains,
including finance, weather forecasting, signal processing, and industrial process control. Here are
some basic concepts in mining time-series data:
Time Series: A sequence of data points indexed in chronological order. Each data point
corresponds to a specific time instance, making it suitable for analyzing temporal patterns.
Temporal Patterns: Patterns that occur over time, such as trends, seasonality, cycles, and irregular
fluctuations. Identifying and understanding these patterns are essential for making predictions and
decisions based on time-series data.
Trend Analysis: Determining the overall direction or tendency of a time series over a long period.
Trends can be linear, exponential, or polynomial, and trend analysis helps in understanding the
underlying growth or decline patterns.
Seasonality: Regular and predictable patterns that repeat over fixed intervals, such as daily,
weekly, or yearly cycles. Seasonality analysis helps in understanding periodic fluctuations and
adjusting forecasts accordingly.
Cycle Detection: Identifying repeating patterns or cycles in the data that are not necessarily of
fixed duration. Cycles can represent economic cycles, business cycles, or other recurring
phenomena.
Smoothing Techniques: Methods for reducing noise and highlighting underlying patterns in time-
series data. Smoothing techniques include moving averages, exponential smoothing, and Savitzky–
Golay filtering.
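A short sketch of two of these techniques, assuming pandas is available; the readings, window size, and smoothing factor are arbitrary:

import pandas as pd

series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])

sma = series.rolling(window=3).mean()         # simple moving average over 3 points
ema = series.ewm(alpha=0.3).mean()            # exponential smoothing (recent points weigh more)
print(pd.DataFrame({"raw": series, "sma": sma, "ema": ema}))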
Time-series Decomposition: Breaking down a time series into its constituent components, such as
trend, seasonality, and residual (random fluctuations). Decomposition helps in understanding the
individual contributions of different components to the overall behavior of the time series.
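A minimal decomposition sketch using the seasonal_decompose function from statsmodels (assuming that library is installed); the toy monthly series with a 12-month pattern is fabricated for illustration:

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# a toy monthly series: an upward trend plus a repeating 12-month pattern
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = [100 + 2 * i + 10 * ((i % 12) - 6) for i in range(48)]
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())           # the extracted trend component
print(result.seasonal.head(12))               # the repeating seasonal component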
Forecasting: Predicting future values of a time series based on historical data and identified
patterns. Forecasting techniques include autoregressive models (AR), moving average models
(MA), autoregressive integrated moving average models (ARIMA), and machine learning
algorithms such as neural networks.
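As a hedged example of one of these methods, the statsmodels ARIMA class can fit and forecast in a few lines; the history values and the (1, 1, 1) order below are arbitrary illustrations, not a tuned model:

from statsmodels.tsa.arima.model import ARIMA

history = [266, 146, 183, 119, 180, 169, 232, 225, 193, 123,
           337, 186, 194, 150, 210, 273, 191, 287, 226, 304]
model = ARIMA(history, order=(1, 1, 1))       # p=1 AR term, d=1 difference, q=1 MA term
fitted = model.fit()
print(fitted.forecast(steps=3))               # the next three predicted values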
Anomaly Detection: Identifying unusual or unexpected patterns in time-series data that deviate
from normal behavior. Anomalies could indicate system failures, fraud, or other significant events
requiring attention.
Feature Engineering: Selecting or creating informative features from time-series data to improve
predictive models. Features could include lagged values, moving averages, Fourier transforms, or
other transformations that capture relevant information.
Evaluation Metrics: Metrics used to assess the performance of time-series forecasting models,
such as mean absolute error (MAE), root mean squared error (RMSE), and mean absolute
percentage error (MAPE).
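All three metrics reduce to a few lines of NumPy; the actual and forecast values below are made up:

import numpy as np

actual   = np.array([100.0, 110.0, 120.0, 130.0])
forecast = np.array([102.0, 108.0, 125.0, 126.0])

mae  = np.mean(np.abs(actual - forecast))                   # average absolute error
rmse = np.sqrt(np.mean((actual - forecast) ** 2))           # penalizes large errors more
mape = np.mean(np.abs((actual - forecast) / actual)) * 100  # scale-free, in percent

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2f}%")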
Mining time-series data requires specialized techniques and algorithms tailored to handle the
unique characteristics of temporal data. Advanced methods continue to evolve to address
challenges such as nonlinearity, seasonality, and irregularities in real-world time-series datasets.
Mining Sequence Patterns in Transactional Databases:
Sequential Pattern Mining: The process of discovering frequent sequences of events or itemsets
in transactional databases. Frequent sequences are those that occur with a frequency greater than or
equal to a predefined minimum support threshold.
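A minimal pure-Python sketch of support counting for a candidate pattern; the tiny database and the helper names is_subsequence and support are invented for illustration:

def is_subsequence(pattern, sequence):
    """True if the pattern's itemsets appear, in order, within the sequence."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1                          # itemset must be contained in some later element
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    return sum(is_subsequence(pattern, seq) for seq in database)

# each customer sequence is an ordered list of itemsets (sets of items)
db = [[{"a"}, {"b", "c"}, {"d"}],
      [{"a"}, {"c"}, {"d"}],
      [{"b"}, {"d"}]]
print(support([{"a"}, {"d"}], db))            # 2: frequent if min_support <= 2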
Itemset: A collection of items that appear together in a transaction. In the context of sequence
mining, an itemset can represent a single event or a set of events occurring together within a
transaction.
Sequential Pattern: A sequence of itemsets that occurs frequently in the transactional database.
Sequential patterns represent the temporal order of events or itemsets within transactions.
Sequence Length: The number of itemsets or events in a sequential pattern. Sequence length can
vary depending on the complexity of the patterns being mined.
Apriori-Based Algorithms: Algorithms based on the Apriori principle, which exploit the
downward closure property of frequent itemsets to efficiently mine sequential patterns. Apriori-
based algorithms include GSP (Generalized Sequential Pattern) and SPADE (Sequential PAttern
Discovery using Equivalence classes).
PrefixSpan Algorithm: An algorithm specifically designed for mining sequential patterns in
transactional databases. PrefixSpan uses a recursive divide-and-conquer approach to discover
frequent sequential patterns efficiently.
Pattern Evaluation Metrics: Metrics used to evaluate the interestingness of discovered sequence
patterns, such as confidence, lift, and sequence length. These metrics help identify meaningful
patterns that can be used for decision-making or business insights.
Mining sequence patterns in transactional databases is a fundamental task in data mining and
provides valuable insights into customer behavior, process optimization, and recommendation
systems. Advanced techniques continue to be developed to handle large-scale transactional datasets
and discover complex patterns efficiently.
Spatial Data Mining:
Examples of spatial data include maps, satellite images, GPS data, and other geospatial
information.
Spatial data mining involves analyzing and discovering patterns, relationships, and trends in this
data to gain insights and make informed decisions. The use of spatial data mining has become
increasingly important in various fields, such as logistics, environmental science, urban planning,
transportation, and public health.
For instance, a transportation company can optimize its delivery routes for faster and more
efficient deliveries using spatial data mining techniques.
They can analyze their delivery data along with other spatial data, such as traffic flow, road
network, and weather patterns, to identify the most efficient routes for each delivery.
Types of Spatial Data
Different types of spatial data are used in spatial data mining. These include point data, line data,
and polygon data.
Point Data
• Point data represents a single location or a set of locations on a map. Each point is defined
by its x and y coordinates, representing its position in the geographic space.
• Point data is commonly used to represent geographic features such as cities, landmarks, or
specific locations of interest. Examples of point data in transportation include delivery
locations, bus stops, or railway stations.
Line Data
• Line data represents a linear feature, such as a road, a river, or a pipeline, on a map. Each
line is defined by a set of vertices, which represent the start and end points of the line.
• Line data is commonly used to represent transportation networks, such as roads,
highways, or railways.
Polygon Data
• Polygon data represents a closed shape or an area on a map. Each polygon is defined by a
set of vertices that connect to form a closed boundary.
• Polygon data is commonly used to represent administrative boundaries, land use, or
demographic data.
• In transportation, polygon data can be used to represent areas of interest, such as delivery
zones or traffic zones.
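The three types map directly onto geometry objects in libraries such as shapely; a short sketch, with coordinates and names made up for a transportation example:

from shapely.geometry import Point, LineString, Polygon

stop = Point(78.48, 17.38)                              # point data: a bus stop (x = lon, y = lat)
road = LineString([(78.40, 17.35), (78.50, 17.40)])     # line data: a road segment
zone = Polygon([(78.45, 17.30), (78.55, 17.30),         # polygon data: a delivery zone
                (78.55, 17.45), (78.45, 17.45)])

print(zone.contains(stop))                    # is the stop inside the delivery zone? True
print(road.intersects(zone))                  # does the road pass through the zone? True
print(road.length)                            # length in coordinate units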
Text Mining:
• Text data mining can be described as the process of extracting essential information from
natural language text.
• This includes all the data that we generate via:
o Text messages
o Text documents
o Emails
o Files
• The primary sources of data are e-commerce websites, social media platforms, published
articles, surveys, and many more.
• The larger part of the generated data is unstructured, which makes it challenging and
expensive for organizations to analyze manually.
• Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the
process of deriving high-quality information from text.
• Text mining uses natural language processing (NLP), allowing machines to understand the
human language and process it automatically.
• Natural language processing (NLP) is the ability of a computer program to understand
human language as it is spoken and written. It is a component of artificial intelligence (AI).
Why is Text Mining Important?
• Individuals and organizations generate tons of data every day. Estimates suggest that
almost 80% of existing text data is unstructured, meaning it is not organized in a
predefined way, is hard to search, and is almost impossible to manage in raw form. In
other words, it is not useful until it is processed.
• Text mining relies on a few basic preprocessing steps (a short sketch follows this list):
• Tokenizing:
- Tokenization is the process of splitting a text or sentence into smaller units (words),
which are also called tokens.
• Stemming:
- Stemming is a technique used to extract the base form of the words by removing affixes
from them. For example, the stem of the words eating, eats, eaten is eat.
• Lemmatization:
- Lemmatization considers the context and converts the word to its meaningful base form,
which is called the lemma. For instance, an aggressive stemmer may reduce 'Caring' to
'Car', whereas lemmatizing it returns 'Care'.
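A short preprocessing sketch using the NLTK library (an assumption; any NLP toolkit would do). The sample sentence and words are illustrative, and how aggressively a word is reduced depends on the stemmer chosen:

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")     # tokenizer data, first run only (newer NLTK may need "punkt_tab")
nltk.download("wordnet")   # lemmatizer dictionary, first run only

print(word_tokenize("Text mining turns raw text into insight."))
# ['Text', 'mining', 'turns', 'raw', 'text', 'into', 'insight', '.']

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["connected", "connecting", "connection"]])
# all three reduce to the stem 'connect'

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("caring", pos="v"))  # 'care' (pos='v' treats it as a verb)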
Web Mining:
• The WWW is a huge, widely distributed, global information service centre and is
therefore a rich source for data mining.
Examples:
• Web search, e.g. Google, Yahoo, Bing, Dogpile, DuckDuckGo, Ecosia, Gigablast, …
• Specialized search, e.g. Squool Tube (search for factual, educational videos) and
Elephind (search the world's historical newspaper archives).
• eCommerce, e.g. Amazon, Flipkart, eBay, Fiverr, Upwork.
• Advertising, e.g. Google AdSense
• Improving Web site design and performance
Web Content Mining:
• Web content mining can be used to extract useful data, information, knowledge from the
web page content.
• In web content mining, each web page is considered as an individual document.
• The primary task of content mining is data extraction, where structured data is extracted
from unstructured websites.
• Web content mining can be utilized to distinguish topics on the web.
For example, if a user searches for a specific book on a search engine, the user will get a list of
related suggestions.
• The technologies that are normally used in web content mining are NLP (Natural language
processing) and IR (Information Retrieval).
• Techniques in Web Content Mining:
– Classification
– Clustering
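A brief sketch of the clustering technique applied to page text, assuming scikit-learn: TF-IDF features feed a k-means clusterer (the sample pages and cluster count are made up; classification would use the same features with labeled pages and a classifier):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

pages = ["cheap flights and hotel deals",
         "book discount hotels online",
         "python tutorial for beginners",
         "learn python programming step by step"]

X = TfidfVectorizer(stop_words="english").fit_transform(pages)  # pages -> TF-IDF vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # travel pages and programming pages land in separate clusters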