
ADVANCED CONCEPTS

UNIT-V

Basic concepts in Mining data streams:


Mining data streams refers to the process of analyzing data that arrives continuously and rapidly,
typically in high volume and at a fast pace. This is common in various real-time applications such
as sensor networks, financial transactions, social media feeds, and web clickstreams. Basic
concepts in mining data streams include:

Data Stream: A continuous and potentially infinite sequence of data instances that are generated
and arrive over time. Examples include sensor readings, tweets, stock market ticks, etc.

Stream Window: A fixed-size or sliding window used to capture a subset of the data stream for
analysis. This allows focusing on a specific portion of the data stream to perform computations and
extract patterns.

Concept Drift: The phenomenon where the statistical properties of the data stream change over
time. This could be due to various factors such as changes in user behavior, environment, or system
dynamics.

Feature Extraction: The process of extracting relevant features or attributes from the data stream
that are suitable for analysis. This is crucial for reducing the dimensionality of the data and
capturing essential information.

Online Learning: Techniques for updating models or making predictions incrementally as new
data arrives. Online learning algorithms adapt to changes in the data stream and can handle large
volumes of data efficiently.

Clustering: Grouping similar data instances together in the data stream. Clustering algorithms can
identify patterns or anomalies in the stream and are useful for exploratory analysis.
Classification: Assigning data instances to predefined categories or classes based on their features.
Classification models can be trained incrementally on streaming data and used for tasks such as
spam detection, sentiment analysis, etc.

Anomaly Detection: Identifying unusual or unexpected patterns in the data stream that deviate
from normal behavior. Anomaly detection algorithms help detect potential problems or fraud in
real-time applications.

Summarization: Generating concise representations of the data stream to provide insights or monitor key metrics. Summarization techniques include sketching, sampling, and approximate counting.
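
Sampling is one of the simplest summarization techniques. The sketch below (an illustration in plain Python, not from the original notes) maintains a uniform random sample of fixed size k over a stream of unknown length, a method known as reservoir sampling:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=5)
print(sample)
```

Each item ends up in the sample with equal probability k/n, yet only k items are ever held in memory, which is exactly what a streaming setting demands.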

Scalability: The ability of mining algorithms to handle large volumes of data efficiently in real time. Scalable algorithms are essential for processing data streams with high velocity and volume without overwhelming computational resources.

fig: workflow of data stream mining

These concepts form the foundation for analyzing and extracting useful information from data
streams in various applications. Advanced techniques and algorithms are continuously developed to
address the challenges posed by streaming data and enable real-time analytics and decision-
making.
Mining Time–series data:
Mining time-series data involves analyzing data points collected at regular intervals over time to
identify patterns, trends, and relationships. Time-series data analysis is crucial in various domains,
including finance, weather forecasting, signal processing, and industrial process control. Here are
some basic concepts in mining time-series data:

Time Series: A sequence of data points indexed in chronological order. Each data point
corresponds to a specific time instance, making it suitable for analyzing temporal patterns.

Temporal Patterns: Patterns that occur over time, such as trends, seasonality, cycles, and irregular
fluctuations. Identifying and understanding these patterns are essential for making predictions and
decisions based on time-series data.

Trend Analysis: Determining the overall direction or tendency of a time series over a long period.
Trends can be linear, exponential, or polynomial, and trend analysis helps in understanding the
underlying growth or decline patterns.

Seasonality: Regular and predictable patterns that repeat over fixed intervals, such as daily,
weekly, or yearly cycles. Seasonality analysis helps in understanding periodic fluctuations and
adjusting forecasts accordingly.

Cycle Detection: Identifying repeating patterns or cycles in the data that are not necessarily of
fixed duration. Cycles can represent economic cycles, business cycles, or other recurring
phenomena.

Smoothing Techniques: Methods for reducing noise and highlighting underlying patterns in time-
series data. Smoothing techniques include moving averages, exponential smoothing, and Savitzky–
Golay filtering.
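
The two most common smoothing techniques mentioned above can be sketched in a few lines of plain Python (an illustrative sketch, not library code):

```python
def moving_average(series, window=3):
    """Trailing moving average: each output is the mean of the last `window` values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def exponential_smoothing(series, alpha=0.5):
    """Exponentially weighted smoothing: s_t = alpha*x_t + (1-alpha)*s_{t-1}."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

noisy = [10, 12, 11, 15, 14, 16, 18, 17]
print(moving_average(noisy, window=3))
print(exponential_smoothing(noisy, alpha=0.5))
```

A larger window (or smaller alpha) smooths more aggressively but reacts more slowly to genuine changes in the series.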

Time-series Decomposition: Breaking down a time series into its constituent components, such as
trend, seasonality, and residual (random fluctuations). Decomposition helps in understanding the
individual contributions of different components to the overall behavior of the time series.
Forecasting: Predicting future values of a time series based on historical data and identified
patterns. Forecasting techniques include autoregressive models (AR), moving average models
(MA), autoregressive integrated moving average models (ARIMA), and machine learning
algorithms such as neural networks.

Anomaly Detection: Identifying unusual or unexpected patterns in time-series data that deviate
from normal behavior. Anomalies could indicate system failures, fraud, or other significant events
requiring attention.

Feature Engineering: Selecting or creating informative features from time-series data to improve
predictive models. Features could include lagged values, moving averages, Fourier transforms, or
other transformations that capture relevant information.

Evaluation Metrics: Metrics used to assess the performance of time-series forecasting models,
such as mean absolute error (MAE), root mean squared error (RMSE), and mean absolute
percentage error (MAPE).
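
These three metrics are straightforward to compute. The following plain-Python sketch (illustrative; the forecast values are made up for the example) evaluates a small hypothetical forecast:

```python
def mae(actual, predicted):
    """Mean absolute error: average magnitude of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily than MAE."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (undefined when actual is 0)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100, 110, 120, 130]
predicted = [102, 108, 123, 126]
print(mae(actual, predicted))   # 2.75
print(rmse(actual, predicted))
print(mape(actual, predicted))
```

MAPE is scale-free, which makes it convenient for comparing forecasts across series measured in different units, but it breaks down when actual values are zero or near zero.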

Mining time-series data requires specialized techniques and algorithms tailored to handle the
unique characteristics of temporal data. Advanced methods continue to evolve to address
challenges such as nonlinearity, seasonality, and irregularities in real-world time-series datasets.

Mining sequence patterns in Transactional databases:


Mining sequence patterns in transactional databases involves identifying frequent sequences of
events or itemsets in a sequence of transactions. This type of analysis is common in market basket
analysis, where sequences of items purchased by customers are analyzed to discover patterns and
relationships. Here are the basic concepts involved in mining sequence patterns in transactional
databases:

Transactional Database: A database containing records of transactions, where each transaction consists of a set of items purchased or events that occurred together. Examples include retail sales databases, web clickstream logs, and user activity logs.

Sequence Database: A representation of transactional data where transactions are ordered
sequentially based on their occurrence time. Each transaction is treated as a sequence of events or
itemsets.

Sequential Pattern Mining: The process of discovering frequent sequences of events or itemsets
in transactional databases. Frequent sequences are those that occur with a frequency greater than or
equal to a predefined minimum support threshold.

Itemset: A collection of items that appear together in a transaction. In the context of sequence
mining, an itemset can represent a single event or a set of events occurring together within a
transaction.

Sequential Pattern: A sequence of itemsets that occurs frequently in the transactional database.
Sequential patterns represent the temporal order of events or itemsets within transactions.

Support: A measure of the frequency of occurrence of a sequence pattern in the transactional database. Support indicates how often a particular sequence occurs relative to the total number of transactions.
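
As a minimal illustration (simplifying each sequence element to a single item), support can be computed by checking whether a pattern occurs in order, though not necessarily contiguously, within each sequence:

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` consumes the iterator up to each match

def support(pattern, database):
    """Fraction of sequences in the database that contain the pattern."""
    hits = sum(1 for seq in database if is_subsequence(pattern, seq))
    return hits / len(database)

# Toy sequence database: each row is one customer's purchases in time order.
db = [
    ["bread", "milk", "beer"],
    ["bread", "beer"],
    ["milk", "bread", "beer"],
    ["milk", "beer"],
]
print(support(["bread", "beer"], db))  # 0.75: 3 of 4 sequences have bread before beer
```

A pattern is deemed frequent when its support meets the minimum support threshold; algorithms such as GSP and PrefixSpan avoid testing every candidate pattern exhaustively the way this naive sketch would.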

Sequence Length: The number of itemsets or events in a sequential pattern. Sequence length can
vary depending on the complexity of the patterns being mined.

Temporal Constraints: Constraints imposed on sequence patterns to capture temporal relationships between events. Examples include requiring events to occur within a certain time window or enforcing a strict ordering of events.

Apriori-Based Algorithms: Algorithms based on the Apriori principle, which exploit the
downward closure property of frequent itemsets to efficiently mine sequential patterns. Apriori-
based algorithms include GSP (Generalized Sequential Pattern) and SPADE (Sequential PAttern
Discovery using Equivalence classes).
PrefixSpan Algorithm: An algorithm specifically designed for mining sequential patterns in
transactional databases. PrefixSpan uses a recursive divide-and-conquer approach to discover
frequent sequential patterns efficiently.

Pattern Evaluation Metrics: Metrics used to evaluate the interestingness of discovered sequence
patterns, such as confidence, lift, and sequence length. These metrics help identify meaningful
patterns that can be used for decision-making or business insights.

Mining sequence patterns in transactional databases is a fundamental task in data mining and
provides valuable insights into customer behavior, process optimization, and recommendation
systems. Advanced techniques continue to be developed to handle large-scale transactional datasets
and discover complex patterns efficiently.

Spatial Data Mining:


Spatial data mining is a specialized subfield of data mining that deals with extracting knowledge
from spatial data. Spatial data refers to data that is associated with a particular location or
geography.

Examples of spatial data include maps, satellite images, GPS data, and other geospatial
information.

Spatial data mining involves analyzing and discovering patterns, relationships, and trends in this
data to gain insights and make informed decisions. The use of spatial data mining has become
increasingly important in various fields, such as logistics, environmental science, urban planning,
transportation, and public health.

For instance, a transportation company can optimize its delivery routes for faster and more
efficient deliveries using spatial data mining techniques.

They can analyze their delivery data along with other spatial data, such as traffic flow, road
network, and weather patterns, to identify the most efficient routes for each delivery.
Types of Spatial Data
Different types of spatial data are used in spatial data mining. These include point data, line data,
and polygon data.

Point Data
• Point data represents a single location or a set of locations on a map. Each point is defined
by its x and y coordinates, representing its position in the geographic space.

• Point data is commonly used to represent geographic features such as cities, landmarks, or
specific locations of interest. Examples of point data in transportation include delivery
locations, bus stops, or railway stations.
Line Data
• Line data represents a linear feature, such as a road, a river, or a pipeline, on a map. Each
line is defined by a set of vertices, which represent the start and end points of the line.
• Line data is commonly used to represent transportation networks, such as roads,
highways, or railways.
Polygon Data
• Polygon data represents a closed shape or an area on a map. Each polygon is defined by a
set of vertices that connect to form a closed boundary.
• Polygon data is commonly used to represent administrative boundaries, land use, or
demographic data.
• In transportation, polygon data can be used to represent areas of interest, such as delivery
zones or traffic zones.
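
A common low-level operation on polygon data, e.g. deciding whether a delivery location falls inside a delivery zone, is the point-in-polygon test. The following is an illustrative ray-casting sketch in plain Python (coordinates and the zone are hypothetical):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: a point is inside if a horizontal ray from it
    crosses the polygon boundary an odd number of times."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                          # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                               # crossing lies to the right
                inside = not inside
    return inside

zone = [(0, 0), (4, 0), (4, 4), (0, 4)]   # a square delivery zone
print(point_in_polygon(2, 2, zone))  # True
print(point_in_polygon(5, 2, zone))  # False
```

Real GIS libraries add handling for points exactly on the boundary and for geographic (latitude/longitude) coordinates, but the parity idea is the same.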

Applications of Spatial Data Mining


The following are some of the applications of spatial data mining:
Urban Planning
• Spatial Data Mining is used by urban planners to analyze and improve urban dynamics. It
can be used to enhance urban growth, improve transportation systems, and refine decisions
about land.
Public Health
• Spatial Data Mining plays an important role in public health research. It is used to develop
strategies to identify diseases, track the spread of infections, and optimize healthcare
resources.
Transportation
• Spatial Data Mining can be used to identify traffic patterns, prevent congestion, manage the
transportation network, and optimize transportation routes.
Environmental Management
• Spatial Data Mining also contributes to environmental management by detecting changes in
the environment, identifying the land at risk, conserving water and biodiversity, and
monitoring natural resources.
Crime Analysis
• Spatial Data Mining can be used to identify crime hotspots, understand crime patterns,
and develop strategies to prevent crime and thereby improve public safety.

Multimedia Data Mining:


• Multimedia data mining discovers interesting patterns from multimedia databases that store
and manage large collections of multimedia objects.
• The Multimedia data includes the following:
– image data,
– video data,
– audio data,
– sequence data,
– hypertext data containing text.
• Multimedia data mining has a number of uses in today’s society. An example of this would
be the use of traffic camera footage to analyze traffic flow.
• Multimedia data mining can be defined as a process that finds patterns in various types of
data, including images, audio, video, and animation.
• Multimedia data mining is classified into two broad categories: static and dynamic media.
Text mining
• Text mining, also referred to as text data mining, is used to find meaningful
information in unstructured text drawn from various sources.
Image mining
• Image mining systems can discover meaningful information or image patterns from a huge
collection of images.
Video mining
• Video mining aims to discover interesting patterns from large amounts of video data.
• Video combines several types of multimedia data, such as image, text, and audio.
• It is widely used in applications such as entertainment, medicine, education, and sports.
Audio mining
• Audio mining is the technique in which audio signals are automatically analyzed and
searched. This technique is generally implemented in automatic speech recognition.
Applications of Multimedia Mining:
• Digital Library
• Traffic Video Sequences
• Medical Analysis
• Media Making and Broadcasting
• Surveillance system
Process of Multimedia Data Mining:

Architecture for Multimedia Data Mining:


Two main families of multimedia retrieval systems support similarity search in multimedia data:
• Description-based retrieval system creates indices and object retrieval based on image
descriptions, such as keywords, captions, size, and creation time.
• Content-based retrieval system supports image content retrieval, for example, color
histogram, texture, shape, objects, and wavelet transform.
Models for Multimedia Mining
The data mining models / techniques that are applied to multimedia data are
• classification,
• clustering,
• association rule mining

Text Mining:
• Text data mining can be described as the process of extracting essential data from
natural language text.
• All the data that we generate via:
o Text messages
o Text documents
o Emails
o Files

• The primary sources of data are e-commerce websites, social media platforms, published
articles, surveys, and many more.
• The larger part of the generated data is unstructured, which makes it challenging and
expensive for organizations to analyze manually.
• Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the
process of deriving high-quality information from text.
• Text mining uses natural language processing (NLP), allowing machines to understand the
human language and process it automatically.
• Natural language processing (NLP) is the ability of a computer program to understand
human language as it is spoken and written. It is a component of artificial intelligence (AI).
Why is Text Mining Important?
• Individuals and organizations generate tons of data every day. Estimates suggest that almost 80%
of existing text data is unstructured, meaning it's not organized in a predefined way,
it's hard to search, and it's almost impossible to manage. In other words, it's just not useful until it is mined.

How Does Text Mining Work?


• Text mining helps to analyze large amounts of raw data and find relevant insights.
Combined with machine learning, it can create text analysis models that learn to classify or
extract specific information based on previous training.
• The first step in text mining is collecting or gathering the data.
• Data can be internal (interactions through chats, emails, surveys, spreadsheets, databases,
etc) or external (information from social media, review sites, news outlets, and any other
websites).
• The second step is preparing (preprocessing) your data. Text mining systems use
several NLP techniques ― like segmentation, tokenization, lemmatization, stemming, and
stop-word removal ― to build the inputs of your machine learning model.

The steps to perform preprocessing of data:


• Segmentation:
- Break the entire document/article into its component sentences using terminal
punctuation such as full stops, question marks, and exclamation marks.

• Tokenizing:
- Tokenization is a process of splitting a text / sentence into smaller units (words) which are
also called tokens.
• Stemming:
- Stemming is a technique used to extract the base form of the words by removing affixes
from them. For example, the stem of the words eating, eats, eaten is eat.
• Lemmatization:
- Lemmatization considers the context and converts the word to its meaningful base form,
called the lemma. For instance, stemming the word 'Caring' might return 'Car', whereas
lemmatizing 'Caring' returns 'Care'.

• Filtering (or) Removing Stop Words:


- This is the process of removing non-essential words. Words such as 'was', 'in', 'is',
'and', and 'the' are called stop words and can be removed.
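
The preprocessing steps above can be sketched in plain Python. This is a toy illustration: the stemmer is a crude suffix-stripper standing in for a real algorithm such as Porter's, and the stop-word list is abbreviated:

```python
import re

STOP_WORDS = {"was", "in", "is", "and", "the", "a", "of", "to"}

def segment(text):
    """Split a document into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def stem(token):
    """Crude suffix-stripping stemmer (a toy stand-in for e.g. Porter's algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Segment, tokenize, remove stop words, then stem the remaining tokens."""
    return [stem(tok)
            for sentence in segment(text)
            for tok in tokenize(sentence)
            if tok not in STOP_WORDS]

print(preprocess("The dogs were eating. Running is healthy!"))
```

Note how crude stemming can produce non-words ('runn' from 'running'); this is exactly the weakness that lemmatization, which maps to real dictionary forms, addresses.
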
• The third step is feature extraction from the preprocessed data.
• The mapping from textual data to real-valued vectors is called feature extraction.
• One commonly used technique for extracting features from textual data is calculating
the frequency of words/tokens in the document/corpus.

Bag of Words (BOW):


• One of the simplest techniques to represent text in numerical
format (vectors) is Bag of Words (BOW).
• In BOW, we make a list of unique words in the text corpus called vocabulary. Then we can
represent each sentence or document as a vector, with each word represented as 1 for
presence and 0 for absence.
• The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-
length vectors by counting how many times each word appears. This process is often
referred to as vectorization.
• Let’s understand this with an example. Suppose we wanted to vectorize the following:
Document 1 : the cat sat
Document 2: the cat sat in the hat
Document 3: the cat with the hat
Step 1: Determine the Vocabulary
- We first define our vocabulary, which is the set of all unique words found in our document
set.
- The Words are :
the, cat, sat, in, hat, with
Step 2: Count
- Count how many times each vocabulary word appears in each document:

                the  cat  sat  in  hat  with
  Document 1     1    1    1    0    0    0
  Document 2     2    1    1    1    1    0
  Document 3     2    1    0    0    1    1

Step 3: Vector Representation


Now we have length-6 vectors for each document.
• the cat sat: [1, 1, 1, 0, 0, 0]
• the cat sat in the hat: [2, 1, 1, 1, 1, 0]
• the cat with the hat: [2, 1, 0, 0, 1, 1]
- Notice that we lose contextual information, e.g. where in the document the word
appeared, when we use BOW. It’s like a literal bag-of-words: it only tells you what words occur in
the document, not where they occurred.
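
The worked example above can be reproduced in a few lines of plain Python (an illustrative sketch; real systems use optimized vectorizers):

```python
def bag_of_words(documents):
    """Build a vocabulary, then count occurrences of each vocabulary word per document."""
    vocab = []
    for doc in documents:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)          # preserve first-seen order
    vectors = [[doc.split().count(word) for word in vocab] for doc in documents]
    return vocab, vectors

docs = ["the cat sat", "the cat sat in the hat", "the cat with the hat"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['the', 'cat', 'sat', 'in', 'hat', 'with']
print(vectors)  # [[1, 1, 1, 0, 0, 0], [2, 1, 1, 1, 1, 0], [2, 1, 0, 0, 1, 1]]
```

The output matches the length-6 vectors derived above; any new document is vectorized against the same fixed vocabulary, with unseen words simply ignored.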

Other feature extraction techniques include:
- Term Frequency–Inverse Document Frequency (TF-IDF)
- Word2Vec (W2V)
Web Mining:
• Web mining is a data mining technique to extract knowledge from web data.
• Web data includes :
o web documents
o hyperlinks between documents
o usage logs of web sites

• The WWW is a huge, widely distributed, global information service centre and is
therefore a rich source for data mining.

World Wide Web:


 There are about 1.5 billion websites.
 But less than 400 million are active.
 Grows at about 1 million pages a day
 By the time you finish this class, thousands of new sites will spawn.
 Lots of duplication (70-80%)
What is the most visited site in the world?
Go ahead — Google it!

Diverse types of data


– Text
– Images
– Audio & Video
• Web mining is the application of data mining techniques to discover useful information
from the World Wide Web.
• It uses automated methods to extract both structured and unstructured data from web
pages, server logs and link structures.

Examples:
• Web search, e.g. Google, Yahoo, Bing, Dogpile, DuckDuckGo, Ecosia, Gigablast, …
• Specialized search, e.g. Squool Tube (search for factual, educational videos), Elephind
(search the world's historical newspaper archives).
• eCommerce, e.g. Amazon, Flipkart, eBay, Fiverr, Upwork.
• Advertising, e.g. Google Adsense
• Improving Web site design and performance
Web Content Mining:
• Web content mining can be used to extract useful data, information, and knowledge from
web page content.
• In web content mining, each web page is considered as an individual document.
• The primary task of content mining is data extraction, where structured data is extracted
from unstructured websites.
• Web content mining can be utilized to distinguish topics on the web.
For example, if a user searches for a specific book on a search engine, the user gets a
list of related suggestions.
• The technologies that are normally used in web content mining are NLP (Natural language
processing) and IR (Information Retrieval).
• Techniques in Web Content Mining :
– Classification
– Clustering

Web Usage Mining:


• Web usage mining is the application of identifying or discovering interesting usage
patterns from large data sets. These patterns help you understand user behavior.
• Web usage mining is used for mining the web log records (access information of web
pages) and helps to discover the user access patterns of web pages.
• The web server registers a web log entry for every web page access.
• The main source of data here is Web Server and Application Server.
• Log files are created when a user/customer interacts with a web page.
• Techniques in Web Usage Mining :
• Association Rules
• Classification
• Clustering
Advantage:
• This technology has enabled e-commerce to do personalized marketing, which eventually
results in higher trade volumes.
Disadvantage:
• This technology when used on data of personal nature might cause concerns. The most
criticized ethical issue involving web usage mining is the invasion of privacy.
Web Structure Mining:
• Web structure mining is the application of discovering structure information from the
web. The structure of the web graph consists of web pages as nodes, and hyperlinks as
edges connecting related pages.
• Structure mining basically shows the structured summary of a particular website. It
identifies relationships between web pages linked by information or direct link connections.
• Techniques in Web Structure Mining :
– Association Rules
– Classification
Example:
• The most important application in this regard is the Google search engine, which estimates
the ranking of its outcomes primarily with the PageRank algorithm.
• The rank of a page is decided by the number and quality of links pointing to the target node.
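
The PageRank idea can be sketched in plain Python (an illustrative toy; real implementations handle billions of pages, dangling links at scale, and convergence testing):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iterative PageRank over a dict mapping each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}                  # start with uniform rank
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}   # random-jump contribution
        for page, outlinks in links.items():
            if not outlinks:                          # dangling page: spread rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / n
            else:
                for target in outlinks:               # share rank across outgoing links
                    new[target] += damping * rank[page] / len(outlinks)
        rank = new
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}       # hypothetical three-page web graph
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # C, since it attracts links from both A and B
```

Note that C outranks B even though both are small pages: rank depends on who links to you and how highly ranked those linkers are, which is exactly the "number and quality of links" criterion described above.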
