Data Mining Notes UNIT V
Data mining has made broad and significant progress since its early beginnings in
the 1980s.
Today, data mining is used in a vast array of areas, and numerous commercial data
mining systems are available.
Most banks and financial institutions offer a wide variety of banking services (such
as checking and savings accounts for business or individual customers), credit (such
as business, mortgage, and automobile loans), and investment services (such as
mutual funds).
Some also offer insurance services and stock investment services.
Multimedia mining is a subfield of data mining that is used to find interesting information
and implicit knowledge in multimedia databases. Mining in multimedia is also referred to as
automatic annotation or annotation mining. Mining multimedia data involves two or more
data types, such as text and video, or text, video, and audio.
Multimedia data mining is an interdisciplinary field that integrates image processing and
understanding, computer vision, data mining, and pattern recognition. Multimedia data
mining discovers interesting patterns from multimedia databases that store and manage
large collections of multimedia objects, including image data, video data, audio data,
sequence data and hypertext data containing text, text markups, and linkages. Issues in
multimedia data mining include content-based retrieval and similarity search, generalization
and multidimensional analysis. Multimedia data cubes contain additional dimensions and
measures for multimedia information.
The framework that manages different types of multimedia data stored, delivered, and
utilized in different ways is known as a multimedia database management system. There
are three classes of multimedia databases: static, dynamic, and dimensional media. The
content of a multimedia database management system is as follows:
o Media data: The actual data representing an object.
o Media format data: Information about the format of the media data after it goes
through the acquisition, processing, and encoding phases, such as sampling rate,
resolution, and encoding scheme.
o Media keyword data: Keyword descriptions relating to the generation of the data, also
known as content descriptive data. Examples: date, time, and place of recording.
o Media feature data: Content-dependent data such as the distribution of colours,
kinds of texture, and different shapes present in the data.
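As a rough illustration of media feature data, the sketch below computes a colour distribution for an image. It assumes the image is already available as a list of (R, G, B) tuples with values 0-255; the function name `colour_histogram`, the bin count, and the sample pixels are hypothetical and not part of these notes.

```python
from collections import Counter

def colour_histogram(pixels, bins_per_channel=4):
    """Return the normalized distribution of quantized RGB colours.

    Assumes bins_per_channel divides 256 evenly (e.g. 2, 4, 8, 16).
    """
    if not pixels:
        return {}
    step = 256 // bins_per_channel
    # Map every pixel to a coarse colour bucket and count the buckets.
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bucket: n / total for bucket, n in counts.items()}

# Hypothetical 2x2 image: three reddish pixels and one blue pixel.
image = [(250, 10, 10), (240, 20, 5), (255, 0, 0), (0, 0, 255)]
hist = colour_histogram(image)
print(hist)
```

A histogram like this is one possible feature vector; texture and shape features would need other techniques.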
Multimedia databases are applied in the following areas:
1. Text Mining
Text is the most common medium for the exchange of information. Text mining
evaluates a huge amount of natural language text and detects patterns to find useful
information. Text mining, also referred to as text data mining, is used to find meaningful
information in unstructured texts from various sources.
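A minimal sketch of one very simple text mining step, counting the most frequent terms in a document, is shown here. The stop-word list, the function name `top_terms`, and the sample sentence are illustrative assumptions; real systems use far larger stop-word lists and more careful tokenization.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def top_terms(text, k=3):
    """Return the k most frequent non-stop-word terms with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(k)

doc = "Text mining finds patterns in text. Mining text is useful."
print(top_terms(doc, 2))
```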
2. Image Mining
Image mining systems can discover meaningful information or image patterns in a huge
collection of images. Image mining determines how the low-level pixel representation
contained in a raw image or image sequence can be processed to recognize high-level spatial
objects and relationships. It draws on digital image processing, image understanding,
databases, AI, etc.
3. Video Mining
Video mining is used to find interesting patterns in large amounts of video data;
video data is multimedia data that combines text, images, metadata, visuals, and audio. It is
commonly used in security and surveillance, entertainment, medicine, sports, and
education programs. The processing steps are indexing, automatic segmentation, content-based
retrieval, classification, and detecting triggers.
4. Audio Mining
There are different kinds of applications of multimedia data mining, some of which are as
follows:
o Digital Library: A collection of digital data is stored and maintained in a digital
library, for which it is essential to convert different digital data formats into text, images,
video, audio, etc.
o Traffic Video Sequences: To determine important but previously unidentified
knowledge from the traffic video sequences, detailed analysis and mining are to be
performed based on vehicle identification, traffic flow, and queue temporal relations
of the vehicle at an intersection. This provides an economic approach for regular
traffic monitoring processes.
o Medical Analysis: Multimedia mining is widely used in the medical field,
particularly for analyzing medical images. Various data mining techniques are used
for image classification. Examples include automatic 3D delineation of highly aggressive
brain tumours and automatic localization and identification of vertebrae in 3D CT scans,
MRI scans, ECG, and X-ray.
o Customer Perception: It covers details about customers' opinions of products or
services, customer complaints, customer preferences, and the level of customer
satisfaction with products or services, which are collected together. Many companies
have call centres that receive telephone calls from customers. The audio data from
these calls serves for topic detection, resource assignment, and evaluation of the
quality of services.
o Media Making and Broadcasting: Radio stations and TV channels are operated by
broadcasting companies, and multimedia mining can be applied to monitor their
content, search for more efficient approaches, and improve their quality.
o Surveillance system: It consists of collecting, analyzing, and summarizing audio, video,
or audiovisual information about specific areas such as government organizations,
multinational companies, shopping malls, banks, forests, agricultural areas, and
highways. The main use of this technology is in the field of security; hence it can be
used by the military, police, and private companies that provide security
services.
Process of Multimedia Data Mining
The image below shows the present architecture, which includes the types of the
multimedia mining process. Data collection is the initial stage of the learning system.
Preprocessing extracts significant features from the raw data. It includes data cleaning,
transformation, normalization, feature extraction, etc. Learning can be direct if informative
types can be recognized at the preprocessing stage. The complete process depends heavily
on the nature of the raw data and the problem domain. The product of preprocessing is the
training set. A learning model must be selected for the given training set to learn from it
and make the multimedia model more stable.
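Two of the preprocessing steps named above, cleaning and normalization, can be sketched as follows. This is a minimal example under stated assumptions: missing readings are represented as `None`, min-max scaling is used as the normalization (one of several common choices), and the function names and sample values are hypothetical.

```python
def clean(values):
    """Data cleaning: drop missing readings (represented here as None)."""
    return [v for v in values if v is not None]

def min_max_normalize(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [10, None, 20, 30]          # hypothetical raw feature column
training_column = min_max_normalize(clean(raw))
print(training_column)
```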
Converting unstructured data to structured data: Data that resides in a fixed field within a
record or file is called structured data, and such data is stored in sequential form.
Structured data is easily entered, stored, queried, and analyzed. Unstructured data is a
bitstream, for example the pixel representation of an image, audio, or video, or the character
representation of text. These files may have an internal structure, but they are still
considered "unstructured" because their data does not fit neatly into a database. For
example, images and videos of different objects have some similarities - each represents an
interpretation of a building without a clear structure.
Current data mining tools operate on structured data, which resides in large
relational databases, while data in multimedia databases is semi-structured or
unstructured. Hence, the semi-structured or unstructured multimedia data is converted
into structured data, and then the current data mining tools are used to extract
knowledge. The sequence or time element differs between unstructured and
structured data mining. The architecture for converting unstructured data to structured
data, which is used for extracting information from the unstructured database, is
shown in the above image. Data mining tools are then applied to the stored structured
databases.
The multimedia mining architecture is given in the image below. The architecture has several
components. The important components are Input, Multimedia Content, Spatiotemporal
Segmentation, Feature Extraction, Finding Similar Patterns, and Evaluation of Results.
1. The input stage comprises a multimedia database used to find the patterns and
perform the data mining.
2. Multimedia Content is the data selection stage that requires the user to select the
databases, subset of fields, or data for data mining.
3. Spatio-temporal segmentation deals with moving objects in the image sequences
of videos, and it is useful for object segmentation.
4. Feature extraction is the preprocessing step that involves integrating data from
various sources and making choices about how to characterize or code certain data
fields to serve as inputs to the pattern-finding stage. Such representation
choices are required because certain fields may contain data at various levels and
are not considered in the pattern-finding stage. In multimedia data mining (MDM), the
preprocessing stage is significant because of the unstructured nature of multimedia records.
5. Finding similar patterns is the heart of the whole data mining process. The
hidden patterns and trends in the data are uncovered in this stage.
Approaches used in this stage include association, classification,
clustering, regression, time-series analysis, and visualization.
6. Evaluation of results is the stage used to assess the mining results, and it
is important for determining whether a prior stage must be revisited. This
stage consists of reporting and using the extracted knowledge to produce new
actions, products, services, or marketing strategies.
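To make the feature extraction and similar-pattern stages more concrete, here is a small sketch of a similarity search over already-extracted feature vectors. Cosine similarity is used as one common similarity measure; the notes do not prescribe a measure, and the file names and feature values below are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity with a zero vector is treated as 0 here
    return dot / (norm_a * norm_b)

def most_similar(query, database):
    """Return the name of the stored object whose features best match the query."""
    return max(database, key=lambda name: cosine_similarity(query, database[name]))

# Hypothetical feature vectors (e.g. red, green, blue proportions).
features = {
    "sunset.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.8, 0.1],
    "ocean.jpg": [0.0, 0.2, 0.8],
}
best = most_similar([0.8, 0.2, 0.0], features)
print(best)
```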
Over the last few years, the World Wide Web has become a significant source of
information and simultaneously a popular platform for business. Web mining can be defined as
the method of using data mining techniques and algorithms to extract useful information
directly from the web, such as web documents and services, hyperlinks, web content, and
server logs. The World Wide Web contains a large amount of data that provides a rich
source for data mining. The objective of web mining is to look for patterns in web data by
collecting and examining data in order to gain insights.
Web mining can broadly be seen as the application of adapted data mining techniques to the
web, whereas data mining is defined as the application of algorithms to discover
patterns in mostly structured data, embedded in a knowledge discovery process. Web
mining has the distinctive property of dealing with a variety of data types. The web has
multiple aspects that yield different approaches to the mining process: web pages
consist of text, web pages are linked via hyperlinks, and user activity can be monitored via
web server logs. These three features lead to the differentiation between three areas:
web content mining, web structure mining, and web usage mining.
Web content mining is used to extract useful data, information, and knowledge from
web page content. In web content mining, each web page is considered an individual
document. One can take advantage of the semi-structured nature of web pages,
as HTML provides information that concerns not only the layout but also the logical structure.
The primary task of content mining is data extraction, where structured data is extracted
from unstructured websites. The objective is to facilitate data aggregation over various
websites by using the extracted structured data. Web content mining can also be used to
distinguish topics on the web. For example, when a user searches for a specific topic on a
search engine, the user gets a list of suggestions.
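The data extraction task can be sketched with Python's standard `html.parser` module, which turns a page into a small structured record. The class name `TitleAndLinkExtractor`, the record layout, and the sample page are assumptions made for this example only.

```python
from html.parser import HTMLParser

class TitleAndLinkExtractor(HTMLParser):
    """Collects the page title and all hyperlink targets from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title>...</title> is kept.
        if self._in_title:
            self.title += data

# Hypothetical page content.
page = ("<html><head><title>Shop</title></head><body>"
        "<a href='/p/1'>Item 1</a><a href='/p/2'>Item 2</a></body></html>")
extractor = TitleAndLinkExtractor()
extractor.feed(page)
record = {"title": extractor.title, "links": extractor.links}
print(record)
```

Records of this shape from many sites could then be aggregated in an ordinary table.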
Web structure mining is used to discover the link structure of hyperlinks. It
identifies how data is connected, either through links between web pages or through a
direct link network. In web structure mining, one considers the web as a directed graph,
with the web pages being the vertices and the hyperlinks the edges that connect them.
The most important application in this regard
is the Google search engine, which estimates the ranking of its results primarily with the
PageRank algorithm. It characterizes a page as highly relevant when it is frequently
linked to by other highly relevant pages. Structure and content mining methodologies are
usually combined. For example, web structure mining can help organizations
examine the network of links between two commercial sites.
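The idea behind PageRank can be sketched as a simple power iteration on a directed graph. This is a simplified textbook-style version, not Google's production algorithm; the damping factor 0.85, the iteration count, and the four-page graph are illustrative assumptions, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, out_links in graph.items():
            if out_links:
                # A page divides its damped rank equally among its out-links.
                share = damping * rank[page] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical web of four pages.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))
```

In this small graph page C is linked to by three other pages, so it ends up with the highest score.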
Web usage mining is used to extract useful data, information, and knowledge from web log
records, and assists in recognizing user access patterns for web pages. In mining the
usage of web resources, one considers the records of visitors' requests to a
website, which are often collected as web server logs. While the content and structure of the
collection of web pages follow the intentions of the authors of the pages, the individual
requests demonstrate how consumers actually use these pages. Web usage mining may disclose
relationships that were not intended by the creator of the pages.
Some of the methods to identify and analyze the web usage patterns are given below:
I. Session and visitor analysis:
A report is created after this analysis, which contains the details of frequently visited
web pages and of common entry and exit pages.
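A minimal sketch of such an analysis follows. It assumes the server log has already been parsed into (visitor_id, timestamp, page) tuples and, as a simplification, treats all requests from one visitor as a single session; the function name and the log entries are hypothetical.

```python
from collections import Counter, defaultdict

def entry_exit_pages(requests):
    """Summarize entry pages, exit pages, and page visit counts."""
    sessions = defaultdict(list)
    # Order each visitor's requests by time to reconstruct the visit path.
    for visitor, _timestamp, page in sorted(requests, key=lambda r: (r[0], r[1])):
        sessions[visitor].append(page)
    entries = Counter(pages[0] for pages in sessions.values())
    exits = Counter(pages[-1] for pages in sessions.values())
    visits = Counter(page for pages in sessions.values() for page in pages)
    return entries, exits, visits

# Hypothetical parsed log entries.
log = [
    ("u1", 1, "/home"), ("u1", 2, "/products"), ("u1", 3, "/checkout"),
    ("u2", 1, "/home"), ("u2", 5, "/products"),
    ("u3", 2, "/products"),
]
entries, exits, visits = entry_exit_pages(log)
print(entries, exits, visits)
```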
The web poses great challenges for resource and knowledge discovery, based on
the following observations:
Web pages do not have a unifying structure. They are far more complex than
traditional text documents. The web can be seen as a digital library holding an enormous
number of documents, but these documents are not organized in any specific order.
The data on the internet is updated quickly. Examples include news, weather, shopping,
financial news, sports, and so on.
The user community on the web is expanding quickly. These users have different interests,
backgrounds, and usage purposes. There are over a hundred million workstations
connected to the internet, and the number is still increasing rapidly.
o Relevancy of data:
It is considered that a particular person is generally interested in only a small portion of
the web, while the rest of the web contains data that is not of interest to the
user and may lead to unwanted results.
The size of the web is tremendous and rapidly increasing. It appears that the web is too
huge for data warehousing and data mining.
Mining the Web's Link Structures to Recognize Authoritative Web Pages:
The web consists of pages as well as hyperlinks pointing from one page to another.
When the creator of a web page adds a hyperlink pointing to another web page, this can be
considered the creator's endorsement of the other page. The collective endorsement of a
given page by various creators on the web may indicate the importance of the page and
may naturally lead to the discovery of authoritative web pages. Web linkage data
provides rich information about the relevance, quality, and structure of the web's content,
and thus is a rich source for web mining.
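One well-known way to turn hyperlinks into authority scores is the hubs-and-authorities (HITS) scheme. The notes do not name a specific algorithm, so the sketch below is an illustrative choice; the page names and the iteration count are assumptions.

```python
import math

def hits(graph, iterations=20):
    """graph maps each page to the list of pages it links to.

    Returns (hub, authority) score dictionaries, each normalized to unit length.
    """
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for page, targets in graph.items():
            for t in targets:
                auth[t] += hub[page]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link structure: three pages pointing at X and Y.
links = {"P1": ["X", "Y"], "P2": ["X"], "P3": ["X", "Y"]}
hub, auth = hits(links)
print(max(auth, key=auth.get))
```

Here X is linked to by all three pages and receives the highest authority score.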
Web mining has extensive applications because of the many uses of the web. A list of
some applications of web mining is given below.
Data mining is primarily used today by companies with a strong consumer focus, such as
retail, financial, communication, and marketing organizations, to “drill down” into their
transactional data and determine pricing, customer preferences, and product
positioning, and their impact on sales, customer satisfaction, and corporate profits. With
data mining, a retailer can use point-of-sale records of customer purchases to develop
products and promotions that appeal to specific customer segments.
Here is the list of 14 other important areas where data mining is widely used:
Future Healthcare
o Data mining holds great potential to improve health systems. It uses data and
analytics to identify best practices that improve care and reduce costs. Researchers
use data mining approaches like multi-dimensional databases, machine learning,
soft computing, data visualization and statistics. Mining can be used to predict the
volume of patients in every category. Processes are developed that make sure that
the patients receive appropriate care at the right place and at the right time. Data
mining can also help healthcare insurers to detect fraud and abuse.
Market Basket Analysis
o Market basket analysis is a modelling technique based on the theory that if you buy
a certain group of items, you are more likely to buy another group of items. This
technique may allow the retailer to understand the purchase behaviour of a buyer.
This information may help the retailer to know the buyer’s needs and change the
store’s layout accordingly. Using differential analysis, results can be compared between
different stores and between customers in different demographic groups.
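The "if you buy X you are more likely to buy Y" idea is usually quantified with support and confidence. The following is a minimal sketch with made-up baskets; the function names are chosen for this example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical point-of-sale baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

In this data, bread and milk appear together in half of the baskets, and two of the three bread baskets also contain milk.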
Education
o An emerging field called Educational Data Mining (EDM) is concerned with
developing methods that discover knowledge from data originating in
educational environments. The goals of EDM are identified as predicting students’
future learning behaviour, studying the effects of educational support, and
advancing scientific knowledge about learning. Data mining can be used by an
institution to make accurate decisions and also to predict students' results.
With the results, the institution can focus on what to teach and how to teach.
The learning patterns of students can be captured and used to develop techniques to
teach them.
Manufacturing Engineering
o Knowledge is the best asset a manufacturing enterprise can possess. Data mining
tools can be very useful for discovering patterns in complex manufacturing processes.
Data mining can be used in system-level design to extract the relationships
between product architecture, product portfolio, and customer needs data. It can
also be used to predict the product development time span, cost, and dependencies
among other tasks.
CRM
Fraud Detection
o Billions of dollars have been lost to fraud. Traditional methods of
fraud detection are time-consuming and complex. Data mining aids in providing
meaningful patterns and turning data into information. Any information that is valid
and useful is knowledge. A perfect fraud detection system should protect the
information of all users. A supervised method includes the collection of sample
records. These records are classified as fraudulent or non-fraudulent. A model is built
using this data, and the algorithm is then used to identify whether a new record is
fraudulent or not.
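The supervised approach can be sketched with a deliberately simple model: a nearest-centroid classifier trained on labelled sample records. The features (amount in thousands, transactions per hour), the sample values, and the choice of classifier are all assumptions for illustration; practical systems use richer features and stronger models.

```python
def train_centroids(records, labels):
    """Build the model: the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in zip(records, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest to the record."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# Hypothetical labelled sample: [amount in thousands, transactions per hour].
records = [[0.05, 1], [0.08, 2], [0.03, 1], [4.0, 9], [5.5, 12]]
labels = ["legit", "legit", "legit", "fraud", "fraud"]
model = train_centroids(records, labels)
print(classify(model, [5.0, 10]))
print(classify(model, [0.06, 1]))
```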
Intrusion Detection
o Any action that compromises the integrity and confidentiality of a resource is an
intrusion. The defensive measures to avoid an intrusion include user
authentication, avoiding programming errors, and information protection. Data mining
can help improve intrusion detection by adding a level of focus to anomaly
detection. It helps an analyst to distinguish unusual activity from common everyday
network activity. Data mining also helps extract the data that is most relevant to the
problem.
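Anomaly detection can be illustrated with a very small statistical sketch: flag observations that lie far from the baseline of everyday activity. The z-score rule, the threshold of 3 standard deviations, and the login counts are illustrative assumptions; the baseline needs at least two values for `statistics.stdev` to work.

```python
import statistics

def find_anomalies(baseline, observations, threshold=3.0):
    """Return observations more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A constant baseline: anything different from it counts as unusual.
        return [x for x in observations if x != mean]
    return [x for x in observations if abs(x - mean) / stdev > threshold]

# Hypothetical logins per hour during normal operation.
normal_logins_per_hour = [4, 5, 6, 5, 4, 6, 5, 5]
flagged = find_anomalies(normal_logins_per_hour, [5, 7, 40])
print(flagged)
```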
Lie Detection
o Apprehending a criminal can be easy, whereas bringing out the truth from them is difficult.
Law enforcement can use mining techniques to investigate crimes and monitor the
communications of suspected terrorists. This field also includes text mining. This
process seeks to find meaningful patterns in data, which is usually unstructured text.
The data samples collected from previous investigations are compared, and a model
for lie detection is created. With this model, processes can be created according to
the necessity.
Customer Segmentation
o Traditional market research may help us to segment customers, but data mining
goes deeper and increases market effectiveness. Data mining aids in grouping
customers into distinct segments, so that offerings can be tailored to each
segment's needs. Marketing is always about retaining customers. Data mining makes it
possible to find a segment of customers based on vulnerability, so that the business can
make them special offers and enhance satisfaction.
Financial Banking
Corporate Surveillance
Research Analysis
o History shows that we have witnessed revolutionary changes in research. Data
mining is helpful in data cleaning, data preprocessing, and the integration of databases.
Researchers can find similar data in the database that might bring about a
change in the research. Co-occurring sequences and the
correlation between activities can be identified. Data visualisation and visual data
mining provide us with a clear view of the data.
Criminal Investigation
Bio Informatics
o Data Mining approaches seem ideally suited for Bioinformatics, since it is data-rich.
Mining biological data helps to extract useful knowledge from massive datasets
gathered in biology, and in other related life sciences areas such as medicine and
neuroscience. Applications of data mining to bioinformatics include gene finding,
protein function inference, disease diagnosis, disease prognosis, disease treatment
optimization, protein and gene interaction network reconstruction, data cleansing,
and protein sub-cellular location prediction.