SEMESTER - I (CBCS)
ADVANCED
DATABASE TECHNIQUES
SUBJECT CODE : MCA13
© UNIVERSITY OF MUMBAI
Prof. Suhas Pednekar
Vice Chancellor
University of Mumbai, Mumbai.
Prof. Ravindra D. Kulkarni
Pro Vice-Chancellor,
University of Mumbai.
Prof. Prakash Mahanwar
Director,
IDOL, University of Mumbai.
Unit I
1. Database Management System
2. Distributed Database System
3. Structured Data Types in DBMS
Unit II
4. Dimensional Modelling
5. Data Warehouse
6. OLAP in the Data Warehouse
Unit III
7. Data Mining and Preprocessing: Introduction to Data Mining
8. Data Mining and Preprocessing: Data Preprocessing
9. Data Mining and Preprocessing: Data Reduction
Unit IV
10. Association Rules
Unit V
11. Classification – I
12. Classification – II
Unit VI
13. Advanced Database Management System
14. Web Mining
15. Text Mining
16. Information Retrieval
*****
Syllabus
Unit structure
1.1 Introduction
1.2 Necessity of learning DBMS
1.3 Applications of DBMS
1.4 Types of databases
1.5 Performance Measurement of Databases
1.6 Goals of parallel databases
1.7 Techniques of query Evaluation
1.8 Optimization of Parallel Query
1.9 Goals of Query optimization
1.10 Approaches of Query Optimization
1.11 Virtualization on Multicore Processor
1.12 References
1.13 MOOCS
1.14 Quiz
1.15 Video Links
1.1 INTRODUCTION
1
Real-world entity: A DBMS uses real-world entities to design its
architecture. Example: a university database where students are represented
as an entity and their roll_number as an attribute.
Relation-based tables: It allows entities and the relations among them
to form tables, so the architecture can be understood by looking at
the table names.
Isolation of data and application: A database system is entirely
different from its data; the database is an active entity, while the data
it holds is passive.
Less redundancy: It follows the rules of normalization and splits a
relation when any of its attributes has redundant values.
Consistency: It provides greater consistency than earlier forms of
data-storage applications.
Query Language: It is equipped with a query language, which makes
retrieving and manipulating data far more efficient than was possible
in the earlier file-processing systems.
1) Centralized Database:
Data is stored at a single, centralized database system, and users access the
stored data from varied locations through various applications.
For example, a university can maintain a central database holding the
record of each student in the university.
Example: University_of_Mumbai
Advantages:
o Decreased risk in data management
o Data consistency
o Enables organizations to establish data standards
o Less costly
Disadvantages:
o Large size increases the response time for fetching data
o Complex to update
o Server failure can lead to loss of the entire data
2) Distributed Database:
In a distributed database, data is distributed among different database
systems of an organization that are connected via communication links,
helping end-users access the data easily. Examples: Oracle, Apache
Cassandra, HBase, Ignite, etc.
Homogeneous DDB: Executes on the same operating system, using the
same application processes and the same hardware devices.
Heterogeneous DDB: Executes on different operating systems, with
different application procedures and different hardware devices.
Advantages of Distributed Database:
o Modular development
o Server failure will not affect the entire data set.
3) Relational Database:
It is based on the relational data model, which stores data in the form of rows
and columns forming a table, and uses SQL for storing, maintaining and
manipulating the data. The relational model was invented by E. F. Codd in 1970.
Each table carries a key that makes the data unique from the
others. Examples: Oracle, Sybase, MySQL, etc.
Properties of Relational Database
Four commonly known properties of a relational model are Atomicity,
Consistency, Isolation and Durability (ACID):
Atomicity: It ensures that a data operation completes either with success
or with failure, following the 'all or nothing' strategy. Example: a
transaction will either be committed or rolled back.
Consistency: Any operation over the data should leave it consistent in terms of
its values both before and after the operation. Example: the account balance
before and after a transaction should remain conserved.
Isolation: Data remains isolated even when numerous concurrent users
are accessing it at the same time. Example: when multiple transactions
are processed at the same instant, the effects of one transaction should not be
visible to the other transactions.
Durability: It ensures permanent data storage once the specified
operation completes and a commit is issued.
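To make atomicity and consistency concrete, the following sketch (the accounts table and its columns are assumed, not taken from the text) shows a funds transfer that either commits both updates or rolls both back:

UPDATE accounts SET balance = balance - 500 WHERE account_no = 'A101';
UPDATE accounts SET balance = balance + 500 WHERE account_no = 'B202';
-- If both updates succeed, make the change permanent:
COMMIT;
-- If anything fails before the commit, undo both updates instead:
-- ROLLBACK;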
4) NoSQL Database:
NoSQL databases are used for storing a wide range of data sets in ways other
than relational tables. Example: MongoDB.
Based on the demand, NoSQL databases are further classified into key-value
stores, document databases, column-family stores and graph databases.
Advantages:
Good productivity in the application development
Better option to handle large data sets
Highly scalable
Quicker access using the key field/value
5) Cloud Database:
Data is stored in a virtual environment and executed over a cloud
computing platform. Examples of such platforms include:
Amazon Web Services
Microsoft Azure
Google Cloud SQL, etc.
6) Object-oriented Databases:
It uses the object-based data model approach for storing data where the
data is represented and stored as objects.
Examples: Realm, ObjectBox
7) Hierarchical Databases:
Data is stored in the form of parent-child relationships and organized
in a tree-like structure. Records are connected through links, where each
child record has only one parent, whereas a parent record can have
numerous child records.
Examples: IBM Information Management System (IMS) and RDM
Mobile
9) Personal Database:
Designed for a single user, where the data is collected and stored on the
user's system.
Examples: DSRao_database
Advantages:
Simple.
Less storage space.
Advantages (of parallel databases):
Multiprocessing
Executing parallel queries.
Multiprocessor architecture:
It has the following alternatives:
Shared memory architecture
Shared disk architecture
Shared nothing architecture
Advantages of shared memory architecture:
Simple to implement
Effective communication among the processors
Less communication overhead
Disadvantages of shared memory architecture:
Limited degree of parallelism
Adding processors slows down the existing processors.
Cache-coherency needs to be maintained
Bandwidth issues
Figure 5: Shared Disk Architecture
Advantages:
Fault tolerance is achieved
Interconnection to the memory is not a bottleneck
Supports large number of processors
Disadvantages:
Limited scalability
Inter-processor communication is slow
Applications:
Digital Equipment Corporation(DEC).
Advantages of shared nothing architecture:
Flexible to add any number of processors
Data requests can be forwarded via the interconnection network
Disadvantages of shared nothing architecture:
Data partitioning is required
Cost of communication is higher
Applications of shared nothing architecture:
Teradata
Oracle nCUBE
The Grace and Gamma research prototypes
Tandem and etc.
Advantages:
Improved performance
High availability
Proper resource utilization
Highly Reliable
Disadvantages:
High cost
Numerous Resources
Complexity in managing the systems
Speedup = Original time / Parallel time
where:
Original time = time required to execute the task using a single processor
Parallel time = time required to execute the same task using 'n' processors
Scale-up = Volume Parallel / Volume Original
where:
Volume Parallel = volume of work executed in a given amount of time using 'n' processors
Volume Original = volume of work executed in the same amount of time using a single processor
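As an illustrative figure (not from the original text): if a task takes 100 seconds on a single processor and 25 seconds on four processors, the speedup is 100 / 25 = 4, which equals the number of processors, i.e., linear speedup.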
Example:
If 20 users are using a CPU at 100% utilization and more users have to be
added, it becomes difficult for the single processor to handle the additional
users; instead, a new processor can be added to serve the users in
parallel and provide 200% of the original capacity.
1.7 TECHNIQUES OF QUERY EVALUATION
1. Horizontal partitioning
2. Vertical partitioning
3. De-normalization
Horizontal partitioning: Tables are split horizontally by rows, so each partition holds a subset of the rows with all columns. Vertical partitioning, in contrast, splits tables by columns, and de-normalization combines related tables to reduce joins at query time.
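A minimal sketch of horizontal (range) partitioning in Oracle-style SQL, using assumed table and column names, might look like this:

CREATE TABLE sales (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION sales_2020 VALUES LESS THAN (TO_DATE('01-01-2021', 'DD-MM-YYYY')),
  PARTITION sales_2021 VALUES LESS THAN (TO_DATE('01-01-2022', 'DD-MM-YYYY'))
);

Each partition stores only the rows whose sale_date falls in its range, so queries that filter on sale_date can be evaluated against a single partition.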
REFERENCES
8. DBMS Tutorial. https://www.javatpoint.com/dbms-tutorial (Last accessed on 18.07.2021)
9. DBMS. https://searchsqlserver.techtarget.com/definition/database-management-system (Last accessed on 18.07.2021)
MOOC List
QUIZ
14. The database management system can be considered as the collection
of ______ that enables us to create and maintain the database.
15. _____ refers collection of the information stored in a database at a
specific time
16. The term "ODBC" stands for_____
17. The architecture of a database can be viewed as the ________
18. The Database Management Query language is generally designed for
the _____
19. _______ is the collection of the interrelated data and set of the
program to access them.
20. A database is the complex type of the _______
21. An advantage of the database management approach is _______
22. ____________ is the disadvantage of the file processing system
23. Redundancy means __________
24. Concurrent access means ______________
25. ___________ refer to the correctness and completeness of the data in a
database
26. Either all of its operations are executed or none is called _________
27. When data is processed, organized, structured or presented in a given
context so as to make it useful, it is called ____________
28. ____________is an information repository which stores data.
29. ___________ level deals with physical storage of data.
30. The process of hiding irrelevant details from user is called
____________
31. Example of Naive User is ___________.
32. A user who writes software using tools such as Java, .Net, PHP etc. is
_________________
VIDEO LINKS
1. https://www.youtube.com/watch?v=T7AxM7Vqvaw
2. https://www.youtube.com/watch?v=6Iu45VZGQDk
3. https://www.youtube.com/watch?v=wjfeGxqAQOY&list=PLrjkTql3jnm-CLxHftqLgkrZbM8fUt0vn
4. https://www.youtube.com/watch?v=ZaaSa1TtqXY
5. https://www.youtube.com/watch?v=lDpB9zF8LBw
6. https://www.youtube.com/watch?v=fSWAkJz_huQ
7. https://www.youtube.com/watch?v=cMUQznvYZ6w
8. https://www.youtube.com/watch?v=mqprM5YUdpk
9. https://www.youtube.com/watch?v=3EJlovevfcA&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=2
10. https://www.youtube.com/watch?v=ZtVw2iuFI2w&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=3
11. https://www.youtube.com/watch?v=VyvTabQHevw&list=PLxCzCOWd7aiFAN6I8CuViBuCdJgiOkT2Y&index=4
*****
2
DISTRIBUTED DATABASE SYSTEM
Unit Structure
2.1 Types of distributed databases
2.2 Distributed DBMS (DDBMS) Architectures
2.3 Architectural Models
2.4 Design alternatives
2.5 Design alternatives
2.6 Fragmentation
References
MOOCS
Quiz
Video Links
Figure 2: Homogeneous distributed database
Types of Heterogeneous Distributed Databases
1. Federated
2. Un-federated
Advantages:
Organizational Structure
Shareability and Local Autonomy
Improved Availability
Improved Reliability
Improved Performance
Economics
Modular Growth
Disadvantages:
Complexity
Cost
Security
Integrity Control More Difficult
Lack of Standards
Lack of Experience
Database Design More Complex
2.2 DISTRIBUTED DBMS (DDBMS) ARCHITECTURES:
Figure 5: Multiple server multiple client
Fully Replicated:
In a fully replicated layout, a copy of every database table is stored at each
site, so queries execute quickly with negligible communication cost. On the
other hand, the massive redundancy in data incurs an enormous cost during
update operations, so this layout is appropriate for systems where a large
number of queries have to be handled with a small number of database
updates.
Partially Replicated:
Copies of tables are stored at different sites, with the distribution done
according to the frequency of access. This takes into account the fact that
the frequency of access to the tables varies considerably from site to site;
the number of copies of a table depends on how frequently the access
queries execute and on which sites generate them.
Fragmented:
In this layout, a table is split into two or more pieces known as fragments
or partitions, with each fragment stored at a different site, providing increased
parallelism and better disaster recovery. The various fragmentation techniques
are as follows:
Vertical fragmentation
Horizontal fragmentation
Hybrid fragmentation
Mixed Distribution:
This layout is a combination of fragmentation and partial replication.
Tables are initially fragmented in either horizontal or vertical form, and the
fragments are then replicated across the different sites according to the
frequency with which they are accessed.
Disadvantages:
Increased Storage Requirements
Increased Cost and Complexity of Data Updating
Undesirable Application – Database coupling
2.6 FRAGMENTATION
Fragmentation is the process of dividing a table into a set of smaller tables
called fragments. Fragmentation is classified into three types:
horizontal, vertical, and hybrid (a combination of horizontal and vertical).
Horizontal fragmentation can further be classified into two strategies:
primary horizontal fragmentation and derived horizontal fragmentation.
Disadvantages of Fragmentation:
Requirement of data from varied sites results in low access speed.
Recursive fragmentations need expensive techniques.
Lack of back-up copies renders the database ineffective.
STUDENT
Regd_No Stu_Name Course Address Semester Fees Marks
Now, the fees details are maintained in the accounts section. In this case,
the designer will fragment the database as follows −
CREATE TABLE Stu_Fees AS
SELECT Regd_No,Fees
FROM STUDENT;
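To complete the vertical fragmentation (a sketch using the column names above; the fragment name Stu_Details is assumed, not from the original text), the remaining columns are kept in a second fragment that shares the key, and the original relation can be reconstructed with a join:

CREATE TABLE Stu_Details AS
SELECT Regd_No, Stu_Name, Course, Address, Semester, Marks
FROM STUDENT;

-- Reconstructing the original STUDENT relation from the two fragments:
SELECT d.Regd_No, d.Stu_Name, d.Course, d.Address, d.Semester, f.Fees, d.Marks
FROM Stu_Details d JOIN Stu_Fees f ON d.Regd_No = f.Regd_No;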
Figure 10: Vertical fragmentation
REFERENCES
1. David Bell and Jane Grimson. Distributed Database Systems (1st. ed.).
Addison-Wesley Longman Publishing Co., Inc., USA. 1992.
2. Özsu MT, Valduriez P. Principles of distributed database systems.
Englewood Cliffs: Prentice Hall; 1999 Feb.
3. Dye C. Oracle distributed systems. O'Reilly & Associates, Inc.; 1999
Apr 1.
4. Ozsu MT, Valduriez P. Distributed Databases: Principles and
Systems.1999.
5. Tuples S. Database Internals. 2002.
6. Özsu, M. Tamer. Distributed Database Systems. 2002.
7. Silberschatz A, Korth HF, Sudarshan S. Database system concepts.
New York: McGraw-Hill; 1997 Apr.
8. Özsu MT, Valduriez P. Distributed and parallel database systems.
ACM Computing Surveys (CSUR). 1996 Mar 1;28(1):125-8.
9. Van Alstyne MW, Brynjolfsson E, Madnick SE. Ownership principles
for distributed database design. 1992.
10. Valduriez P, Jimenez-Peris R, Özsu MT. Distributed Database
Systems: The Case for NewSQL. InTransactions on Large-Scale Data-
and Knowledge-Centered Systems XLVIII 2021 (pp. 1-15). Springer,
Berlin, Heidelberg.
11. Domaschka J, Hauser CB, Erb B. Reliability and availability
properties of distributed database systems. In2014 IEEE 18th
International Enterprise Distributed Object Computing Conference
2014 Sep 1 (pp. 226-233). IEEE.
12. Distributed databases. https://www.db-book.com/db4/slide-dir/ch19-2.pdf (Last accessed on 18.07.2021)
13. Distributed database management systems.
https://cs.uwaterloo.ca/~tozsu/courses/cs856/F02/lecture-1-ho.pdf.
(Last accessed on 18.07.2021)
MOOCS
Video links
1. Distributed databases.
https://www.youtube.com/watch?v=QyR4TIbEJjo
2. DBMS - Distributed Database System.
https://www.youtube.com/watch?v=aUyqZxn12sY
3. Introduction to Distributed Databases.
https://www.youtube.com/watch?v=0_m5gPfzEYQ
4. Introduction to Distributed Database System.
https://www.youtube.com/watch?v=RKmK_vKZsq8&list=PLduM7bkxBdOdjbMXkTRdsSlWQKR43nSmd
5. Distributed Databases. https://www.youtube.com/watch?v=J-sj3GUrq9k
6. Centralised vs Distributed Databases.
https://www.youtube.com/watch?v=QjvjeQquon8
7. Learn System design : Distributed datastores.
https://www.youtube.com/watch?v=l9JSK9OBzA4
8. Architecture of Distributed database systems.
https://www.youtube.com/watch?v=vuApQk27Jus
9. Distributed Database Introduction.
https://www.youtube.com/watch?v=Q1RIpXS7lPc
*****
3
STRUCTURED DATA TYPES IN DBMS
Unit Structure
3.1 Structured data types
3.2 Operations on structured data
3.3 Nested relations
3.4 Structured types in Oracle
3.5 Database objects in DBMS
3.6 Object datatype categories
3.7 Object tables
3.8 Object Identifiers
3.9 REFs
3.10 Collection types
References
MOOCS
Quiz
Video Links
A structured data type is a user-defined data type whose elements are not
atomic; they are divisible and can be used separately or as a single unit, as
needed. The major advantage of using objects is the ability to define new
data types (Abstract Data Types).
Example:
CREATE TYPE phone AS
(Country_code NUMBER(4),
STD_Code NUMBER(5),
Phone_Number NUMBER(10));
Let us consider a 'Stud-Dept' schema where a table 'Student' is created with
five columns, namely Student_No (a system-generated column),
Student_Name (name of the student), Student_Address (address of the
student, a structured-type column of type 'Student_ADDRESS-T'),
ProjNo (project number of the student) and StuImage (image of the
student).
Suppose there is a need to find the names of students, along with their
images, who are living in 'Namdevwada' of 'Nizamabad'.
i. Operations on Arrays:
A row type is a collection of field values in which each field is accessed by the
traditional dot notation. Example: address-t.city specifies the attribute 'city' of
the type address-t.
SELECT S.Student_No, S.Student_Name, S.StuImage
FROM Student S
WHERE S.Student_Address.area = 'Namdevwada'
AND S.Student_Address.city = 'Nizamabad';
Attributes having complex types like setof(base), bagof(base), etc. are
known as 'nested relations'. 'Unnesting' is the process of transforming a
nested relation into a 1NF relation. Let us consider the 'Stud-Dept' schema
wherein, for each student, we store the following information in the Student
table:
1. Student_No
2. Student_Name
3. Student_Address
4. Projno
The domains of some of the information stored for a Student are non-
atomic; Projno, for example, holds the projects worked on by the
Student, and a Student may have a set of projects to be worked on.
This new type can be used to define an attribute in any TABLE or TYPE
as follows:
CREATE TABLE Student
(Student_name VARCHAR2(20),
Addr ADDRESS,
Phone NUMBER(10));
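The table definition above assumes that an abstract ADDRESS type has already been created; its definition is not part of the extracted text. A minimal sketch, with assumed attribute names, might be:

CREATE TYPE ADDRESS AS OBJECT (
  street  VARCHAR2(30),
  area    VARCHAR2(20),
  city    VARCHAR2(20),
  pincode NUMBER(6)
);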
This table Student will consist of 3 columns, wherein the first and the
third are of the regular datatypes VARCHAR2 and NUMBER
respectively, and the second is of the abstract type ADDRESS. The
table Student will look as follows:
Advantages:
1. Adopted by machine learning algorithms
2. Adopted by business users
3. Increased access to other tools
Disadvantages:
1. Limited use
2. Limited storage
Examples:
Common examples of machine-generated structured data are weblog
statistics and point of sale data, such as barcodes and quantity.
Advantages:
1. Freedom of the native format
2. Faster accumulation rates
3. Data lake storage
Disadvantages:
1. Requires data science expertise
2. Specialized tools
Examples:
It lends itself well to determining how effective a marketing campaign is,
or to uncovering potential buying trends through social media and review
websites.
A good example of semi-structured data vs. structured data might be a tab-
delimited document containing customer data versus a database
incorporating CRM tables. On the other side of the coin, semi-structured data
has more hierarchy than unstructured data; the tab-delimited file is more
specific than a list of remarks from a customer's Instagram.
Output:
DESCRIBE dept;
Name Null? Type
DeptNo Number(2)
DName Varchar2(20)
Location Varchar2(20)
Example :
CREATE VIEW dsrao
AS SELECT Student_id ID_NUMBER, last_name Last_Name,
salary*12 Annual_Salary
FROM Student
WHERE department_id = 111;
Output :
SELECT *
FROM dsrao;
Example :
CREATE SEQUENCE dept_deptid_seq
INCREMENT BY 10
START WITH 120
MAXVALUE 9999
NOCACHE
NOCYCLE;
4. Index – Indexes are used in the database to speed up the retrieval of rows
with the aid of a pointer. Indexes can be created
explicitly or automatically; in the absence of an index on a column, a
full table scan is required. Indexes are logically and physically independent of
the table they index. They can be created or dropped at any time and
have no effect on the base tables or other indexes.
Syntax :
CREATE INDEX index
ON table (column[, column]...);
Example :
CREATE INDEX emp_last_name_idx
ON employees(last_name);
Syntax :
CREATE [PUBLIC] SYNONYM synonym FOR object;
Example :
CREATE SYNONYM d_sum FOR dept_sum_vu;
Multimedia Datatypes:
Much of the efficiency of database systems arises from their optimized
management of fundamental data types like numbers, dates, and
characters. Facilities exist for comparing values, determining their
distributions, constructing efficient indexes, and performing other
optimizations. Text, video, sound, graphics, and spatial data are examples
of vital business entities that do not fit neatly into these basic types.
Oracle Enterprise Edition supports modeling and implementation of these
complex data types, commonly known as multimedia datatypes.
3.6 OBJECT DATATYPE CATEGORIES
Object Types:
Object types are abstractions of real-world entities. An object type is a schema
object with three kinds of components, namely a name, attributes and
methods. A structured data unit that matches the template is termed
an object.
For example, you can define a relational table to keep track of your
contacts:
CREATE TABLE contacts (
contact external_person,
contact_date DATE );
The contacts table is a relational table with an object type defining
one of its columns. Objects that occupy columns of relational tables are
called column objects.
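The external_person object type used above is defined earlier in the original text (not reproduced here); a minimal sketch consistent with the sample values shown later ("John Smith", "1-800-555-1212") might be:

CREATE TYPE external_person AS OBJECT (
  name  VARCHAR2(30),
  phone VARCHAR2(20)
);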
Types of Methods:
Methods of an object type model the behavior of objects and are broadly
categorised into member, static and comparison methods.
Example: if x and y are PL/SQL variables that hold purchase order objects
and w and z are variables that hold numbers, the following two statements
can leave w and z with distinct values:
w := x.get_value();
z := y.get_value();
id 1000376
contact external_student ("John Smith","1-800-555-1212")
lineitems NULL
Comparison Methods:
Oracle has facilities for comparing two data items and determining which
is greater. Oracle provides two ways to define an order relationship among
objects of a given object type: map methods and order methods.
Order methods are more general and are used to compare two objects of a
given object type. An order method returns -1 if the first is smaller, 0 if they
are equal, and 1 if the first is greater.
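A minimal sketch of an order method (the money type and its compare function are illustrative, not from the original text):

CREATE TYPE money AS OBJECT (
  amount NUMBER,
  ORDER MEMBER FUNCTION compare (other money) RETURN INTEGER
);
/
CREATE TYPE BODY money AS
  ORDER MEMBER FUNCTION compare (other money) RETURN INTEGER IS
  BEGIN
    -- Negative, zero or positive depending on how SELF compares to other
    RETURN SIGN(amount - other.amount);
  END;
END;
/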
An object table is a special kind of table that holds objects and provides a
relational view of the attributes of those objects.
For example, the following statement defines an object table for objects of
the external_student type defined earlier:
CREATE TABLE external_student_table OF external_student;
Example:
3.9 REFs
Scoped REFs:
In declaring a column type, collection element, or object type attribute to
be a REF, you can constrain it to contain only references to a specified
object table. Such a REF is called a scoped REF. Scoped REFs require less
storage space and permit more efficient access than unscoped REFs.
Dangling REFs:
It is possible for the object identified by a REF to become
unavailable, through either deletion of the object or a change in privileges.
Such a REF is referred to as dangling.
Dereference REFs:
Accessing the object referred to by a REF is called dereferencing the REF.
Dereferencing a dangling REF results in a null object.
Oracle provides implicit dereferencing of REFs. For example, consider
the following:
CREATE TYPE person AS OBJECT (
name VARCHAR2(30),
manager REF person );
If x represents an object of type PERSON, then the expression:
x.manager.name
represents a string containing the name attribute of the person object referred
to by the manager attribute of x. The previous expression is a shortened
form of:
Obtain REFs:
You can obtain a REF to a row object by selecting the object from its
object table and applying the REF operator. For example, you can obtain
a REF to the purchase order with identification number 1000376 as
follows:
DECLARE OrderRef REF purchase_order;
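The rest of the example is not present in the extracted text; a minimal completion (the table name purchase_order_table is assumed) might be:

BEGIN
  SELECT REF(po) INTO OrderRef
  FROM purchase_order_table po
  WHERE po.id = 1000376;
END;
/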
Array types and table types are schema objects. The corresponding data
units are referred to as VARRAYs and nested tables.
Collection types have constructor methods. The name of the constructor
method is the name of the type, and its argument is a comma-separated
list of the new collection's elements. The constructor method is a
function that returns the new collection as its value.
VARRAYs:
An array is an ordered set of data elements. Each element has an index,
which is a number corresponding to the element's position in the array.
The number of elements in an array is the size of the array. Oracle permits
arrays to be of variable size, which is why they are called VARRAYs. You
have to specify a maximum size when you declare the array type.
Creating an array type does not allocate space. It defines a datatype, which
you can use as:
The datatype of a column of a relational table
An object type attribute
A PL/SQL variable, parameter, or function return type.
A VARRAY is normally stored in line; that is, in the same tablespace as the
other data in its row. If it is sufficiently large, however, Oracle stores it as
a BLOB.
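A minimal sketch of declaring and using a VARRAY type (the phone_list type and the dept_contacts table are illustrative, not from the original text):

CREATE TYPE phone_list AS VARRAY(10) OF VARCHAR2(20);

CREATE TABLE dept_contacts (
  dept_no NUMBER(2),
  phones  phone_list
);

-- The constructor phone_list(...) builds the collection value:
INSERT INTO dept_contacts VALUES (10, phone_list('040-2345678', '040-8765432'));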
A table type definition does not allocate space. It defines a type, which
you can use as:
The datatype of a column of a relational table
An object type attribute
A PL/SQL variable, parameter, or function return type
The rows of a nested table are stored out of line, in a storage table associated with
the enclosing relational or object table. For example, the following
declaration defines an object table for the object type purchase_order:
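The declaration itself does not appear in the extracted text; a minimal sketch, assuming purchase_order has a nested-table attribute named lineitems, might be:

CREATE TABLE purchase_order_table OF purchase_order
  NESTED TABLE lineitems STORE AS lineitems_table;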
The typical use would be to define instantiable subtypes for such a non-instantiable type, as
follows:
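The example itself is missing from the extracted text; a minimal sketch of a non-instantiable supertype with an instantiable subtype (the shape and circle types are illustrative; a type body implementing area would also be needed before circle objects can be created) might be:

CREATE TYPE shape AS OBJECT (
  name VARCHAR2(30),
  NOT INSTANTIABLE MEMBER FUNCTION area RETURN NUMBER
) NOT INSTANTIABLE NOT FINAL;
/
CREATE TYPE circle UNDER shape (
  radius NUMBER,
  OVERRIDING MEMBER FUNCTION area RETURN NUMBER
);
/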
A subtype of a NOT INSTANTIABLE type can override any of the non-
instantiable methods of the supertype and provide concrete
implementations. If there are any non-instantiable methods remaining, the
subtype must also necessarily be declared NOT INSTANTIABLE.
Explain the main difference between an object-oriented database and an
object-relational database.
1. What is an OID?
2. What are the strategies for obtaining a legitimate OID?
3. Which association maintains an OID registry?
REFERENCES
MOOCS
5. Oracle SQL: An Introduction to the most popular database. Udemy.
https://www.udemy.com/course/oracle-sql-an-introduction-to-the-most-
popular-
database/?ranMID=39197&ranEAID=JVFxdTr9V80&ranSiteID=JVFx
dTr9V80-
gPUoUHGA.bk7GEc2CHkc5g&LSNPUBID=JVFxdTr9V80&utm_sou
rce=aff-campaign&utm_medium=udemyads
6. Oracle SQL Developer: Mastering its Features + Tips & Tricks.
Udemy. https://www.udemy.com/course/oracle-sql-developer-tips-and-
tricks/?LSNPUBID=JVFxdTr9V80&ranEAID=JVFxdTr9V80&ranMI
D=39197&ranSiteID=JVFxdTr9V80-
BJvtSlb2eHT3z05lbG2Tow&utm_medium=udemyads&utm_source=af
f-campaign
7. Oracle Database 12c Fundamentals. Pluralsight.
https://www.pluralsight.com/courses/oracle-database-12c-
fundamentals?clickid=Wrd1mUSpBxyLWCdRlKxBMx0uUkBTkN3Jq
S-
kwM0&irgwc=1&mpid=1193463&aid=7010a000001xAKZAA2&utm
_medium=digital_affiliate&utm_campaign=1193463&utm_source=im
pactradius
8. Step by Step Practical Oracle SQL with real life exercises. Udemy.
https://www.udemy.com/course/oracle-and-sql-step-by-step-
learning/?LSNPUBID=JVFxdTr9V80&ranEAID=JVFxdTr9V80&ran
MID=39197&ranSiteID=JVFxdTr9V80-
Qclzu0fxxjk7S80GoaFVfw&utm_medium=udemyads&utm_source=af
f-campaign
QUIZ
Video Links
*****
UNIT 2
4
DIMENSIONAL MODELLING
Unit Structure
4.0 Objectives
4.1 Dimensional Modelling
4.1.1 Objectives of Dimensional Modelling
4.1.2 Advantages of Dimensional Modelling
4.1.3 Disadvantages of Dimensional Modelling
4.2 Elements of Dimensional Data Model
4.3 Steps of Dimensional Modelling
4.3.1 Fact Table
4.3.2 Dimension Tables
4.4 Benefits of Dimensional Modelling
4.5 Dimensional Models
4.6 Types of Data Warehouse Schema:
4.6.1 Star Schema
4.6.2 Snowflake Schema
4.6.3 Galaxy Schema
4.6.4 Star Cluster Schema
4.7 Star Schema Vs Snowflake Schema: Key Differences
4.8 Summary
4.0 OBJECTIVES
This chapter will enable the readers to understand the following concepts:
Meaning of Dimensional Modelling including its objectives,
advantages, and disadvantages
The steps in Dimensional Modelling
Understanding of Fact Tables and Dimension Tables
Benefits of Dimensional Modelling
Understanding of different schemas – Star, Snowflake, Galaxy and
Star Cluster schema
Key differences between the Star Schema and the Snowflake Schema
Dimensional and relational models each have their own way of storing data,
which offers specific advantages.
Fact:
Facts are business measurements. Facts are normally but not always
numeric values that could be aggregated. e.g., number of products sold
per quarter.
Facts are the measurements/metrics or facts from your business
process. For a Sales business process, a measurement would be
quarterly sales number
Dimension:
Dimensions are called contexts. Dimensions are business descriptors
that specify the facts, for example, product name, brand, quarter, etc.
Dimension provides the context surrounding a business process event.
In simple terms, they give who, what, where of a fact. In the Sales
business process, for the fact quarterly sales number, dimensions would
be
Who – Customer Names
Where – Location
What – Product Name
Attributes:
The Attributes are the various characteristics of the dimension in
dimensional data modelling.
Fact Table:
A fact table is a primary table in dimension modelling.
The model should describe the Why, How much, When/Where/Who and
What of your business process
To describe the business process, you can use plain text or use basic
Business Process Modelling Notation (BPMN) or Unified Modelling
Language (UML).
For example, the CEO at an MNC wants to find the sales for specific
products in different locations on a daily basis. So, the grain is "product
sale information by location by the day."
For example, the CEO at an MNC wants to find the sales for specific
products in different locations on a daily basis.
Dimensions: Product, Location and Time
Attributes: For Product: Product key (Foreign Key), Name, Type,
Specifications
Hierarchies: For Location: Country, State, City, Street Address, Name
Step 5) Build Schema:
In this step, you implement the Dimension Model. A schema is nothing
but the database structure (arrangement of tables). There are two popular
schemas
STAR SCHEMA
SNOWFLAKE SCHEMA
For example, a city and state can view a store summary in a fact table.
Item summary can be viewed by brand, color, etc. Customer information
can be viewed by name and address.
For example:
Location ID   Product Code   Customer ID   Units Sold
              172321         22345623      2
82            212121         31211324      1
58            434543         10034213      3
In this example, the Customer ID column in the fact table is a foreign key that
joins with the dimension table. By following the links, we can see that row 2 of
the fact table records the fact that customer 31211324, Gaurav, bought one item
at Location 82.
4.4 BENEFITS OF DIMENSIONAL MODELLING
Multidimensional Schema is especially designed to model data
warehouse systems. The schemas are designed to address the unique needs
of very large databases designed for the analytical purpose (OLAP).
For example, consider the data of a shop for items sold per quarter in the
city of Delhi. The data is shown in the table. In this 2D representation, the
sales for Delhi are shown for the time dimension (organized in quarters)
and the item dimension (classified according to the types of an item sold).
The fact or measure displayed is rupees_sold (in thousands).
Now, suppose we want to view the sales data with a third dimension; for
example, the data according to time and item, as well as the
location, is considered for the cities Chennai, Kolkata, Mumbai, and Delhi.
These 3D data are shown in the table. The 3D data of the table are
represented as a series of 2D tables.
4.6 TYPES OF DATA WAREHOUSE SCHEMA
In the following Star Schema example, the fact table is at the center which
contains keys to every dimension table like Dealer_ID, Model ID,
Date_ID, Product_ID, Branch_ID & other attributes like Units sold and
revenue.
The dimension tables are joined to the fact table using a foreign key
The dimension tables are not joined to each other
The fact table contains keys and measures
The Star schema is easy to understand and provides optimal disk usage.
The dimension tables are not normalized. For instance, in the above
figure, Country_ID does not have a Country lookup table, as an OLTP
design would have.
The schema is widely supported by BI Tools
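As a minimal sketch (table and column names assumed, not taken from the figure), a star schema can be declared with the fact table referencing each dimension table through a foreign key:

CREATE TABLE dim_product (
  product_id   NUMBER PRIMARY KEY,
  product_name VARCHAR2(50),
  product_type VARCHAR2(30)
);

CREATE TABLE dim_dealer (
  dealer_id   NUMBER PRIMARY KEY,
  dealer_name VARCHAR2(50),
  country_id  NUMBER          -- kept denormalized: no separate Country lookup table
);

CREATE TABLE fact_sales (
  date_id    NUMBER,
  product_id NUMBER REFERENCES dim_product(product_id),
  dealer_id  NUMBER REFERENCES dim_dealer(dealer_id),
  units_sold NUMBER,
  revenue    NUMBER
);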
4.6.3 Galaxy Schema:
A Galaxy Schema contains two fact tables that share dimension tables
between them. It is also called a Fact Constellation Schema. The schema is
viewed as a collection of stars, hence the name Galaxy Schema.
As you can see in the above example, there are two fact tables:
1. Expense
2. Revenue
Figure 6 - Star Cluster Schema
4.8 SUMMARY
*****
5
DATA WAREHOUSE
Unit Structure
5.0 Objectives
5.1 Introduction to Data Warehouse
5.2 Evolution of Data Warehouse
5.3 Benefits of Data Warehouse
5.4 Data Warehouse Architecture
5.4.1 Basic Single-Tier Architecture
5.4.2 Two-Tier Architecture
5.4.3 Three-Tier Architecture
5.5 Properties of Data Warehouse Architectures
5.6 ETL Process in Data Warehouse
5.7 Cloud-based ETL Tools vs. Open Source ETL Tools
5.8 ETL and OLAP Data Warehouses
5.8.1 The Technical Aspects of ETL
5.9 Data Warehouse Design Approaches
5.9.1 Bill Inmon – Top-down Data Warehouse Design Approach
5.9.2 Ralph Kimball – Bottom-up Data Warehouse Design
Approach
5.10 Data Mart
5.10.1 Reasons for creating a data mart
5.11 Types of Data Marts
5.11.1 Dependent Data Marts
5.11.2 Independent Data Marts
5.11.3 Hybrid Data Marts
5.12 Characteristics of Data Mart
5.13 Summary
5.14 References for further reading
5.0 OBJECTIVES
This chapter will make the readers understand the following concepts:
Meaning of data warehouse
Concept behind Data Warehouse
History and Evolution of Data Warehouse
Different types of Data Warehouse Architectures
Properties of data warehouse
Concept of Data Staging
ETL process
Design approaches to Data Warehouse
Data Marts and their types
As organizations grow, they usually have multiple data sources that store
different kinds of information. However, for reporting purposes, the
organization needs to have a single view of the data from these different
sources. This is where the role of a Data Warehouse comes in. A Data
Warehouse helps to connect and analyse data that is stored in various
heterogeneous sources. The process by which this data is collected,
processed, loaded, and analysed to derive business insights is called Data
Warehousing.
The data that is present within various sources in the organization can
provide meaningful insights to the business users if analysed in a proper
way and can assist in making data as a strategic tool leading to
improvement of processes. Most of the databases that are attached to the
source systems are transactional in nature. This means that these
databases are used typically for storing transactional data and running
operational reports on it. The data is not organized in a way where it can
provide strategic insights. A data warehouse is designed for generating
insights from the data and hence, helps to convert data into meaningful
information that can make a difference.
Data from various operational source systems is loaded onto the Data
Warehouse, which is therefore a central repository of data from various
sources that can provide cross-functional intelligence based on historic
data. Since the Data Warehouse is separated from the operational
databases, it removes the dependency of working with transactional data
for intelligent business decisions.
While the primary function of the Data Warehouse is to store data for
running intelligent analytics on the same, it can also be used as a central
repository where historic data from various sources is stored.
In most cases, the data in these disparate systems is stored in different
ways and hence cannot be taken as it is and loaded onto the data
warehouse. Also, the purpose for which a data warehouse is built is
different from the one for which the source system was built. In the case
of our insurance company above, the policy system was built to store
information with regards to the policies that are held by a customer. The
CRM system would have been designed to store the customer information
and the claims system was built to store information related to all the
claims made by the customers over the years. For us to be able to
determine which customers could potentially make fraudulent claims, we
need to be able to cross reference information from all these source
systems and then make intelligent decisions based on the historic data.
Hence, the data has to come from various sources and has to be stored in a
way that makes it easy for the organization to run business intelligence
tools over it. There is a specific process to extract data from various source
systems, transform this into the format that can be uploaded onto the data
warehouse and then load the data onto the data warehouse. This process of
extraction, transformation and loading of data is explained in detail
subsequently in the chapter.
Besides the process of ensuring availability of the data in the right format
on the data warehouse, it is also important to have the right business
intelligence tools in place to be able to mine data and then make intelligent
predictions based on this data. This is done with the help of business
intelligence and data visualization tools that enable converting data into
meaningful information and then display this information in a way that is
easy for the end users to understand.
The kind of analysis that is done on the data can vary from high level
aggregated dashboards that provide a cockpit view to a more detailed
analysis that can provide as much drill down of information as possible.
Hence, it is important to ensure that the design of the data warehouse takes
into consideration the various uses of the data and the amount of
granularity that is needed for making business decisions.
Most times, the kind of analysis that is done using the data that is stored in
the data warehouse is time-related. This could mean trends around sales
numbers, inventory holding, profit from products or specific segments of
customers, etc. These trends can then be utilized to forecast the future with
the use of predictive tools and algorithms. The Data Warehouse provides
the basic infrastructure and data that is needed by such tools to be able to
help the end-users in their quest for information.
Data Marts:
Data marts are like a mini data warehouse consisting of data that is more
homogenous in nature rather than a varied and heterogeneous nature of a
data warehouse. Data marts are typically built for use within a
department or business unit rather than at the overall organizational
level. It could aggregate data from various systems within the same
department or business unit. Hence, data marts are typically smaller in size
than data warehouses.
Data Lakes:
A concept that has emerged more recently is the concept of data lakes that
store data in a raw format as opposed to the more structured format of a
data warehouse. Typically, a data lake will not need much
transformation of data before loading onto the data lake. It is generally
used to store bulk data like social media feeds, clicks, etc. One of the
reasons as to why the data is not usually transformed before loading onto a
data lake is because it is not usually known what kind of analysis would be
carried out on the data. More often than not, a data scientist would be
required to make sense of the data and to derive meaningful information
by applying various models on the data.
Data Warehouse: Meant for use at the organizational level, across business
units. Size: more than a Data Mart but less than a Data Lake.
Data Lake: Meant for advanced and predictive analytics. Size: greater than a
Data Mart and a Data Warehouse.
As the information systems within the organizations grew more and more
complex and evolved over time, the systems started to develop and handle
more and more amount of information. The need for an ability to analyze
the data coming out from the various systems became more evident over
time.
Later, in the late 1980s, IBM researchers developed the Business Data
Warehouse. Bill Inmon is considered the father of the data warehouse. He
has written about a variety of topics on the building, usage, and maintenance
of the warehouse and the Corporate Information Factory.
This is made possible since the data has been extracted, transformed, and
then loaded onto the data warehouse platform from various cross-
functional and cross-departmental source systems. Information that
provides such an integrated view of the data is extremely useful for the
senior management in making decisions at the organizational level.
A Data warehouse platform can take care of all such issues since the data
is already loaded and can be queried upon as desired. Thereby saving
precious time and effort for the organizational users.
4. Return on investment:
Building a data warehouse is usually an upfront cost for the organization.
However, the return that it provides in terms of information and the ability
to make right decisions at the right time provides a return on investment
that is usually manyfold with respect to the amount that has been invested
upfront. In the long run, a data warehouse helps the organization in
multiple ways to generate new revenue and save costs.
6. Competitive edge:
A data warehouse is able to provide the top management within the
organization a capability to make business decisions that are based on data
that cuts across the organizational silos. It is therefore more reliable, and the
decisions that are based on such data are able to provide a competitive
edge to the organization vis-à-vis their competition.
As can be seen that the online transaction processing systems are usually
updated regularly based on the data and transactions that happen daily on
that system. In contrast, an online analytical processing system or the data
warehouse is usually updated through an ETL process that extracts the
data from the source systems on a regular basis, transforms the data into a
format that will be required for the data warehouse and then loads the data
onto the data warehouse as per the pre-defined processes.
It may be noticed that the data in the data warehouse is typically not real
time data and there is usually a delay in moving the data from these source
systems to the data warehouse. However, this is something that most
businesses are fine with as long as they get an integrated view of data from
across different functions of the organization and as long as the data is
automatically uploaded on the data warehouse for generation of these
insights on demand.
Figure 8 - Basic Data Warehouse Architecture
The reconciled layer is between the source data and data warehouse. It
creates a standard reference model for the whole enterprise. And, at the
same time, it separates the problem of data extraction and integration from
the data warehouse. This layer is also directly used to perform better
operational tasks e.g. producing daily reports or generating data flows
periodically to benefit from cleaning and integration.
It is a process in which an ETL tool extracts the data from various data
source systems, transforms it in the staging area and then finally, loads it
into the Data Warehouse system.
Transformation:
The second step of the ETL process is transformation. In this step, a set
of rules or functions is applied to the extracted data to convert it into a
single standard format. All the data from multiple source systems is
normalized and converted to a single system format, improving data
quality and compliance. ETL yields transformed data through these
methods:
Filtering – loading only certain attributes into the data warehouse.
Cleaning – filling up the NULL values with some default values,
mapping U.S.A, United States and America into USA, etc.
Joining – joining multiple attributes into one.
Splitting – splitting a single attribute into multiple attributes.
Sorting – sorting tuples on the basis of some attribute (generally a key
attribute).
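A minimal sketch of such transformations in SQL (the staging_customers table and its columns are assumed, not from the original text); it combines cleaning, mapping and filtering in one statement:

SELECT customer_id,
       UPPER(TRIM(customer_name)) AS customer_name,
       CASE
         WHEN country IN ('U.S.A', 'United States', 'America') THEN 'USA'
         ELSE country
       END AS country,                        -- cleaning: map spellings to one value
       NVL(phone, 'UNKNOWN') AS phone         -- cleaning: default for NULLs
FROM staging_customers
WHERE is_active = 'Y';                        -- filtering: load only active customers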
Loading:
The third and final step of the ETL process is loading. In this step, the
transformed data is finally loaded into the data warehouse. Sometimes the
data is updated by loading into the data warehouse very frequently and
sometimes it is done after longer but regular intervals. The rate and period
of loading solely depends on the requirements and varies from system to
system.
ETL process can also use the pipelining concept i.e. as soon as some data
is extracted, it can be transformed and during that period some new data
can be extracted. And while the transformed data is being loaded into the
data warehouse, the already extracted data can be transformed.
Finally, data that has been extracted to a staging area and transformed is
loaded into your data warehouse. Depending upon your business needs,
data can be loaded in batches or all at once. The exact nature of the
loading will depend upon the data source, ETL tools, and various other
factors.
The block diagram of the pipelining of ETL process is shown below:
Open source ETL tools come in a variety of shapes and sizes. There are
ETL frameworks and libraries that you can use to build ETL pipelines in
Python. There are tools and frameworks you can leverage for GO and
Hadoop. Really, there is an open-source ETL tool out there for almost any
unique ETL need.
Data engineers have been using ETL for over two decades to integrate
diverse types of data into online analytical processing (OLAP) data
warehouses. The reason for doing this is simple: to make data analysis
easier.
These methodologies are a result of research from Bill Inmon and Ralph
Kimball.
In the top-down approach, the data warehouse is designed first and then
data marts are built on top of the data warehouse.
Basically, the Kimball model reverses the Inmon model: data marts are
directly loaded with the data from the source systems, and the ETL
process is then used to load the data warehouse.
Figure 15 - Bottom up Approach
A data mart is the access layer of a data warehouse that is used to provide
users with data. Data warehouses typically house enterprise-wide data,
and information stored in a data mart usually belongs to a specific
department or team.
The key objective for data marts is to provide the business user with the
data that is most relevant, in the shortest possible amount of time. This
allows users to develop and follow a project, without needing to wait
long periods for queries to complete. Data marts are designed to meet the
demands of a specific group and have a comparatively narrow subject
area. Data marts may contain millions of records and require gigabytes
of storage.
5.13 SUMMARY
Reference books:
1. Ponniah, Paulraj, Data warehousing fundamentals: a comprehensive
guide for IT professionals, John Wiley & Sons, 2004.
2. Dunham, Margaret H, Data mining: Introductory and advanced topics,
Pearson Education India, 2006.
3. Gupta, Gopal K, Introduction to data mining with case studies, PHI
Learning Pvt. Ltd., 2014.
4. Han, Jiawei, Jian Pei, and Micheline Kamber, Data mining: concepts
and techniques, Second Edition, Elsevier, Morgan Kaufmann, 2011.
5. Ramakrishnan, Raghu, and Johannes Gehrke,
Database management systems, Vol. 3, McGraw-Hill, 2003
6. Elmasri, Ramez, and Shamkant B. Navathe, Fundamentals of Database
Systems, Pearson Education, 2008, (2015)
7. Silberschatz, Abraham, Henry F. Korth, and S. Sudarshan,
Database system concepts, Vol. 5, McGraw-Hill, 1997.
Web References:
1. https://www.guru99.com/data-mining-vs-datawarehouse.html
2. https://www.tutorialspoint.com/dwh/dwh_overview
3. https://www.geeksforgeeks.org/
4. https://blog.eduonix.com/internet-of-things/web-mining-text-mining-
depth-mining-guide
*****
6
OLAP IN THE DATA WAREHOUSE
Unit Structure
6.0 Objectives
6.1 What is OLAP
6.2 OLAP Cube
6.3 Basic analytical operations of OLAP
6.3.1 Roll-up
6.3.2 Drill-down
6.3.3 Slice
6.3.4 Pivot
6.4 Characteristics of OLAP Systems
6.5 Benefits of OLAP
6.5.1 Motivations for using OLAP
6.6 Types of OLAP Models
6.6.1 Relational OLAP
6.6.2 Multidimensional OLAP (MOLAP) Server
6.6.3 Hybrid OLAP (HOLAP) Server
6.6.4 Other Types
6.7 Difference between ROLAP, MOLAP, and HOLAP
6.8 Difference Between ROLAP and MOLAP
6.9 Summary
6.0 OBJECTIVES
This chapter will enable the readers to understand the following concepts:
An overview of what OLAP is
Meaning of OLAP cubes
The basic analytical operations of OLAP including Roll-up, Drill-down, Slice
& Dice and Pivot
Characteristics of an OLAP Systems
Types of OLAP systems that consist of Relational OLAP, Multi-dimensional
OLAP and Hybrid OLAP
Other types of OLAP systems
Advantages and disadvantages of each of the OLAP systems
Differences between the three major OLAP systems
OLAP databases are divided into one or more cubes. The cubes are
designed in such a way that creating and viewing reports become easy.
OLAP stands for Online Analytical Processing.
6.3.1 Roll-up:
Roll-up is also known as "consolidation" or "aggregation." The Roll-up
operation can be performed in 2 ways
Reducing dimensions
Climbing up concept hierarchy. Concept hierarchy is a system of
grouping things based on their order or level.
Consider the following diagram
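In SQL terms, a roll-up along a location hierarchy can be expressed with GROUP BY ROLLUP; the sketch below assumes sales_fact and location_dim tables that are not part of the original text:

SELECT d.country, d.state, SUM(f.revenue) AS revenue
FROM sales_fact f
JOIN location_dim d ON f.location_id = d.location_id
GROUP BY ROLLUP (d.country, d.state);   -- aggregates at state, country and grand-total levels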
6.3.2 Drill-down:
In drill-down data is fragmented into smaller parts. It is the opposite of the
rollup process. It can be done via
Moving down the concept hierarchy
Increasing a dimension
6.3.3 Slice:
Here, one dimension is selected, and a new sub-cube is created. The following
diagram explains how the slice operation is performed:
Figure 21 - Example of Slice
Dice:
This operation is similar to a slice. The difference in dice is you select 2 or
more dimensions that result in the creation of a sub-cube.
Figure 22 - Example of Dice
6.3.4 Pivot:
In Pivot, you rotate the data axes to provide a substitute presentation of data.
In the following example, the pivot is based on item types.
6.4 CHARACTERISTICS OF OLAP SYSTEMS
Fast:
It means the system is targeted to deliver most responses to the user within
about five seconds, with the simplest analyses taking no more than one
second and very few taking more than 20 seconds.
Analysis:
It means the system can cope with any business logic and statistical
analysis that is relevant for the application and the user, while keeping it
easy enough for the target user. Although some pre-programming may be
needed, the user should be able to define new ad hoc calculations as part
of the analysis and report on the data in any desired way without having
to program; this excludes products (like Oracle Discoverer) that do not
allow adequate end-user-oriented calculation flexibility.
Share:
It means the system implements all the security requirements for
confidentiality and, if multiple write access is needed, concurrent update
locking at an appropriate level. Not all applications need users to write
data back, but for the increasing number that do, the system should be
able to manage multiple updates in a timely, secure manner.
Multidimensional:
This is the basic requirement. OLAP system must provide a
multidimensional conceptual view of the data, including full support for
hierarchies, as this is certainly the most logical method to analyze business
and organizations.
Information:
The system should be able to hold all the data needed by the applications.
Data sparsity should be handled in an efficient manner.
3. Accessibility: OLAP acts as a mediator between data warehouses and
front-end. The OLAP operations should be sitting between data
sources (e.g., data warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data
sources.
5. Uniform reporting performance: Increasing the number of
dimensions or database size should not significantly degrade the
reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing
values so that aggregates are computed correctly.
7. OLAP system should ignore all missing values and compute correct
aggregate values.
8. OLAP facilitate interactive query and complex analysis for the users.
9. OLAP allows users to drill down for greater details or roll up for
aggregations of metrics along a single business dimension or across
multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and
comparisons.
11. OLAP presents results in several meaningful ways, including charts
and graphs.
Aggregating, grouping, and joining data are the most difficult types of
queries for a relational database to process. The magic behind OLAP
derives from its ability to pre-calculate and pre-aggregate data. Otherwise,
end users would be spending most of their time waiting for query results
to be returned by the database. However, it is also what causes OLAP-
based solutions to be extremely rigid and IT-intensive.
middleware to provide missing pieces.ROLAP servers contain
optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.ROLAP technology
tends to have higher scalability than MOLAP technology.
ROLAP systems work primarily from the data that resides in a relational
database, where the base data and dimension tables are stored as relational
tables. This model permits the multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational
database to give the appearance of traditional OLAP's slicing and dicing
functionality. In essence, each method of slicing and dicing is equivalent
to adding a "WHERE" clause in the SQL statement.
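For instance, a slice on the time dimension of a star schema maps to a WHERE condition on the joined dimension table (a sketch with assumed table and column names):

SELECT p.product_name, SUM(f.revenue) AS total_revenue
FROM sales_fact f
JOIN date_dim d    ON f.date_id = d.date_id
JOIN product_dim p ON f.product_id = p.product_id
WHERE d.quarter = 'Q1'                 -- the "slice": restrict to one member of the time dimension
GROUP BY p.product_name;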
ROLAP stands for Relational Online Analytical Processing. ROLAP
stores data in columns and rows (also known as relational tables) and
retrieves the information on demand through user submitted queries. A
ROLAP database can be accessed through complex SQL queries to
calculate information. ROLAP can handle large data volumes, but the
larger the data, the slower the processing times.
Because queries are made on-demand, ROLAP does not require the
storage and pre-computation of information. However, the disadvantages of
ROLAP implementations are the potential performance constraints and
scalability limitations that result from large and inefficient join operations
between large tables. Examples of popular ROLAP products include Meta
cube by Stanford Technology Group, Red Brick Warehouse by Red Brick
Systems, and AXSYS Suite by Information Advantage.
Relational OLAP Architecture:
ROLAP Architecture includes the following components:
Database server
ROLAP server
Front-end tool
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP
technology segment in the market. This method allows multiple
multidimensional views of two-dimensional relational tables to be created,
avoiding the need to structure records around the desired view.
Advantages:
Can handle large amounts of information: The data size limitation of
ROLAP technology depends on the data size of the underlying
RDBMS. So, ROLAP itself does not restrict the data amount.
The RDBMS already comes with a lot of features, so ROLAP technologies
(which work on top of the RDBMS) can take advantage of these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a
SQL query (or multiple SQL queries) against the relational database, the
query time can be prolonged if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies upon
developing SQL statements to query the relational database, and SQL
statements do not suit all analytical needs.
MOLAP Architecture:
MOLAP Architecture includes the following components
Database server.
MOLAP server.
Front-end tool.
Figure 26 - MOLAP Architecture
An example would be the creation of sales data measured by several
dimensions (e.g., product and sales region) to be stored and maintained in
a persistent structure. This structure would be provided to reduce the
application overhead of performing calculations and building aggregation
during initialization. These structures can be automatically refreshed at
predetermined intervals established by an administrator.
Advantages:
Excellent Performance: A MOLAP cube is built for fast information
retrieval and is optimal for slicing and dicing operations.
Can perform complex calculations: All calculations have been pre-
generated when the cube is created. Hence, complex calculations are
not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all
calculations are performed when the cube is built, it is not possible to
contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally
proprietary and does not already exist in the organization. Therefore, to
adopt MOLAP technology, chances are other investments in human
and capital resources are needed.
Advantages of HOLAP:
HOLAP provides the benefits of both MOLAP and ROLAP.
It provides fast access at all levels of aggregation.
HOLAP balances the disk space requirement, as it only stores the
aggregate information on the OLAP server and the detail record
remains in the relational database. So no duplicate copy of the detail
record is maintained.
Disadvantages of HOLAP:
HOLAP architecture is very complicated because it supports both
MOLAP and ROLAP servers.
Figure 28 - Difference between ROLAP, MOLAP and HOLAP
ROLAP vs MOLAP:
o ROLAP stands for Relational Online Analytical Processing, whereas
MOLAP stands for Multidimensional Online Analytical Processing.
o ROLAP is usually used when the data warehouse contains relational
data, whereas MOLAP is used when it contains relational as well as
non-relational data.
o ROLAP contains an analytical server, whereas MOLAP contains the
MDDB server.
o ROLAP creates a multidimensional view of data dynamically, whereas
MOLAP contains prefabricated data cubes.
o ROLAP is very easy to implement, whereas MOLAP is difficult to
implement.
o ROLAP has a high response time, whereas MOLAP has a lower
response time due to its prefabricated cubes.
6.9 SUMMARY
The Roll-up operation can be performed by Reducing dimensions or
climbing up concept hierarchy
The Drill down operation can be performed by moving down the
concept hierarchy or Increasing a dimension
Slice is when one dimension is selected, and a new sub-cube is created.
Dice is where two or more dimensions are selected as a new sub-cube
is created
In Pivot, you rotate the data axes to provide a substitute presentation of
data.
FASMI characteristics of OLAP methods – Fast Analysis of Shared
Multidimensional Information
OLAP helps in understanding and improving sales. It also helps in
understanding and improving the cost of doing business
Three major types of OLAP models are Relational OLAP, Multi-
dimensional OLAP and Hybrid OLAP
Relational OLAP systems are intermediate servers which stand in
between a relational back-end server and user frontend tools. They use
a relational or extended-relational DBMS to save and handle warehouse
data, and OLAP middleware to provide missing pieces
MOLAP structure primarily reads the precompiled data. MOLAP
structure has limited capabilities to dynamically create aggregations or
to evaluate results which have not been pre-calculated and stored.
HOLAP incorporates the best features of MOLAP and ROLAP into a
single architecture. HOLAP systems save more substantial quantities of
detailed data in the relational tables while the aggregations are stored in
the pre-calculated cubes.
*****
Module III
7
DATA MINING AND PREPROCESSING
INTRODUCTION TO DATA MINING
Unit Structure
7.0 Objectives
7.1 Introduction
7.2 Data Mining Applications
7.3 Knowledge Discovery In Data (Kdd) Process
7.4 Architecture Of Data Mining System / Components Of Data
Mining System
7.5 Issues And Challenges In Data Mining
7.6 Summary
7.7 Exercises
7.8 References
7.0 OBJECTIVES
7.1 INTRODUCTION
We say that today is the age of Big Data. The sheer volume of data being
generated today is exploding, and the rate of data creation is mind-boggling.
Mobile phones, social media, imaging technologies used for medical
diagnosis, and non-traditional IT devices like RFID readers and GPS
navigation systems are among the fastest growing sources of data. Keeping
up with this huge influx of data is difficult, but what is more challenging is
analysing these vast amounts of generated data to identify meaningful
patterns and extract useful information. Data in its original form is crude
and unrefined, so it must be broken down and analysed to have some value.
Data Mining, then, is finding insightful information which is hidden in the
data.
Data Mining, (sometimes also known as Knowledge Discovery in Data
(KDD)), is an automatic or semi-automatic ‗mining‘ process used for
extracting useful data from a large set of raw data. It analyses large
amount of scattered information to find meaningful constructs from it and
turns it into knowledge. It checks for anomalies or irregularities in the
data, identifies patterns or correlations among millions of records and then
converts it into knowledge about future trends and predictions. It covers a
wide variety of domains and techniques including Database Technology,
Multivariate Statistics, Engineering and Economics (provides methods for
Pattern recognition and predictive modelling), ML (Machine Learning),
Artificial Intelligence, Information Science, Neural Networks, Data
Visualization many more.
Data Mining and big data are used almost everywhere. Data Mining is
increasingly used by companies with a solid consumer focus, such as in
retail sales, advertising and marketing, financial institutions, bioinformatics
etc. Almost all commercial companies use data mining and big data to gain
insights into their customers, processes, staff and products. Many companies
use mining to offer customers a better user experience, as well as to
cross-sell, increase sales and customize their products.
Financial institutions use data mining and analysis to predict stock
markets, determine the risk of lending money, and learn how to attract
new clients for their services.
Credit card companies monitor the spending habits of the customer and
can easily recognize duplicitous purchases with a certain degree of
accuracy using rules derived by processing billions of transactions.
development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in
parallel. The results are then merged. The high cost of some data
mining processes promotes the need for incremental data mining
algorithms.
All these issues and challenges stimulate and motivate the researchers and
experts to investigate this data mining field deeper.
7.6 SUMMARY
7.7 EXERCISES
7.8 REFERENCES
*****
8
DATA MINING AND PREPROCESSING
DATA PREPROCESSING
Unit Structure
8.0 Objectives
8.1 Introduction To Data Preprocessing
8.2 Introduction To Data Cleaning
8.3 Data Integration
8.4 Data Transformation
8.5 Summary
8.6 Exercises
8.7 References
8.0 OBJECTIVES
Data Preprocessing is one of the most vital steps in the data mining
process. Data Preprocessing involves Data Cleaning, Data Integration,
Data Transformation, Data Reduction etc. These techniques are useful
for removing noisy data and preparing quality data, which yields more
efficient analysis results. When applied before the data mining process,
these techniques improve both the quality of the patterns mined and the
time needed for the actual mining process. Note that these techniques are
not mutually exclusive.
Before using the data for analysis, we need it to be organized and sanitized
properly. In the following section we will learn the various techniques
which are used to get a clean dataset.
Data Cleaning can be implemented in many ways by adjusting the
missing values, identifying and removing outliers, removing
redundant rows or deleting irrelevant records.
Techniques from 4 to 6 bias the data; there is a high chance that the
filled-in value is incorrect. However, technique 6 is used heavily as it uses
a large percentage of the information from the available data to predict
missing values. With inference-based tools or decision tree induction
there is a greater probability that the relationship between the missing
value and the other attributes is preserved.
8.2.2 Noisy Data:
Noisy data is data with a large amount of additional meaningless
information in it. In general, noise is a random error or variance which
may be caused by faulty data-gathering equipment, technology limitations,
resource limitations and data entry problems. Due to noise, algorithms
often miss data patterns. Noisy data can be handled by the following
methods –
1. Binning: In this method, the data values are sorted in order and then
grouped into 'bins' or buckets. Each value in a particular bin is then
smoothed using its neighbourhood, i.e., its surrounding values. The
binning method is said to perform local smoothing as it looks at
surrounding values to smooth the values of the attribute.
Smoothing by bin means – Here, all the values of a bin are replaced by
the mean of the values from that bin.
Mean of 4, 8, 9, 15 = 9
Mean of 21, 21, 24, 25 = 23
Mean of 26, 28, 28, 34 = 29
Therefore, this results in the following bins –
Bin1: 9, 9, 9, 9
Bin2: 23, 23, 23, 23
Bin3: 29, 29, 29, 29
Smoothing by bin medians – Here, all the values of a bin are replaced by
the median of the values from that bin.
Median of 4, 8, 9, 15 = 8.5
Median of 21, 21, 24, 25 = 22.5
Median of 26, 28, 28, 34 = 28
Therefore, this results in the following bins –
Bin1: 8.5, 8.5, 8.5, 8.5
Bin2: 22.5, 22.5, 22.5, 22.5
Bin3: 28, 28, 28, 28
Smoothing by bin boundaries – Here, each value in a bin is replaced by
the closest boundary value (the minimum or maximum) of that bin.
Therefore, this results in the following bins –
Bin1: 4, 4, 4, 15
Bin2: 21, 21, 25, 25
Bin3: 26, 26, 26, 34
Alternatively, bins may be equal-width or of equal-depth. Equal-width
binning divides the range into N intervals of equal size. Here, outliers may
dominate the result. Equal-depth binning divides the range into N
intervals, each containing approximately same number of records. Here
skewed data is also handled well.
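The smoothing techniques above can be reproduced with a short Python sketch (equal-depth bins of four values, followed by smoothing by means, medians and boundaries; the data values are the ones from the example).

```python
import statistics

data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 28, 34])
bin_depth = 4  # equal-depth (equal-frequency) binning: 4 values per bin
bins = [data[i:i + bin_depth] for i in range(0, len(data), bin_depth)]

smoothed_by_mean = [[round(statistics.mean(b))] * len(b) for b in bins]
smoothed_by_median = [[statistics.median(b)] * len(b) for b in bins]
# Smoothing by boundaries: each value is replaced by the closer of the bin's min/max.
smoothed_by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                      for b in bins]

print(smoothed_by_mean)    # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smoothed_by_median)  # [[8.5, ...], [22.5, ...], [28, 28, 28, 28]]
print(smoothed_by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```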
8.3 DATA INTEGRATION
Data comes from diverse sources which we need to integrate. Data varies
in size, type, format and structure, ranging from databases, spreadsheets,
Excel files to text documents. Data Integration technique combines
data from diverse sources into a coherent data store and provides a
unified view of that data.
While performing the data integration you have to deal with several issues.
Major issues faced during Data Integration are listed below –
Note that Correlation does not imply causation. It means that if there is a
correlation between two attributes that does not necessarily imply that one
is the cause of the other.
The expected frequency for each cell of the contingency table is computed as
E = (MR × MC) / n
where E is the expected value, MR is the row marginal for the cell you are
calculating an expected value for, MC is the column marginal and n is the
sample size.
Below is the 2×2 contingency table, with two rows for the age categories
and two columns for voting behaviour; the observed counts are shown with
the expected counts (computed using the formula above) in parentheses:
Age group 1: 24 (33)   31 (22)   – row total 55
Age group 2: 36 (27)    9 (18)   – row total 45
Column totals: 60 and 40; sample size n = 100.
Now, as we have the observed values and the expected values, we can easily
calculate chi-square.
χ² = (24−33)²/33 + (31−22)²/22 + (36−27)²/27 + (9−18)²/18
= 2.45 + 3.68 + 3 + 4.5 = 13.63
Now, we need to use the chi-square distribution table and find the critical
value at the intersection of the degrees of freedom (df =1) and the level of
significance which is 0.01. Our critical value is 6.63 which is smaller than
χ2value 13.63. Therefore, we can conclude that voter age and voter turnout
are related. However, we cannot determine how much they are related
using this test.
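The same χ² computation can be reproduced in a few lines of Python (a sketch working from the observed counts above; libraries such as scipy provide equivalent functions, but none are needed here).

```python
# Chi-square test of independence for the 2x2 voting example above.
observed = [[24, 31],
            [36, 9]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: (row marginal * column marginal) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for i in range(2) for j in range(2))

print(expected)    # [[33.0, 22.0], [27.0, 18.0]]
print(chi_square)  # ~13.64 > 6.63 (critical value at df=1, alpha=0.01)
```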
Detection and resolution of data value conflicts: During data fusion
process we need to pay attention to the units of measurement of the
datasets. This may be due to differences in representation, scaling or
encoding. For instance, if we are to study prices of fuel in different
parts of the world then some datasets may contain price per gallon and
others may contain price per litre. An attribute in one system may be
recorded at a lower level of abstraction than the same attribute in
another. Having different levels of aggregation is similar to having
different types of measurements. For instance, one dataset may contain
data per week whereas the other dataset may contain data per work-
week. These types of errors are easy to detect and fix.
The diversity of the data sources poses a real challenge in the data
integration process. Intelligent and vigilant integration of data will
definitely lead to correct insights and speedy data mining process.
The next task after cleansing and integration is transforming your data so
it takes a suitable form for data mining. When data is homogeneous and
well-structured, it is easier to analyze and look for patterns.
1. Min-Max Normalization – This method transforms the original data
linearly. Suppose minF and maxF are the minimum and maximum
values of an attribute F. This method maps a value v of F to v' in the
range [new_minF, new_maxF] using the following formula –
v' = ((v − minF) / (maxF − minF)) × (new_maxF − new_minF) + new_minF
2. Z-Score Normalization – Here, a value v of attribute F is normalized
based on the mean and standard deviation of F –
z = (v − mean(F)) / std(F)
3. Decimal Scaling – It normalizes the values of an attribute by moving
their decimal points. The number of positions by which the decimal
point is moved is determined by the maximum absolute value of the
attribute. The decimal scaling formula is –
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
For instance, suppose the values of an attribute vary from −99 to 99. The
maximum absolute value is 99. To normalize, we divide the numbers by
100 (i.e., j = 2), so the values become 0.98, 0.97 and so on.
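To make the three methods concrete, here is a small Python sketch (the sample values are assumed for illustration).

```python
import math
import statistics

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    mean, std = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def decimal_scaling_normalize(values):
    # j is the smallest integer such that max(|v| / 10**j) < 1
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / 10 ** j for v in values]

data = [-99, -45, 0, 12, 98]            # assumed sample attribute values
print(min_max_normalize(data))          # mapped into [0, 1]
print(z_score_normalize(data))          # mean 0, standard deviation 1
print(decimal_scaling_normalize(data))  # [-0.99, -0.45, 0.0, 0.12, 0.98]
```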
Attribute Construction – Here, new attributes are created from an
existing set of attributes. Attribute construction can discover missing
information about relationships between data attributes which can be
important for knowledge discovery.
8.5 SUMMARY
Data Cleaning deals with missing values in the dataset, sanitizes the data
by removing the noise, identifies the outliers and remedies the
inconsistencies.
Data Transformation methods transform the data into the forms required
for mining.
8.6 EXERCISES
1. Mostly in real-world data, tuples have missing values. What are the
various methods to deal with missing values?
2. What are the various issues to be considered w.r.t data integration?
3. What are the normalization methods? Also explain the value ranges for
each of them.
4. Suppose a group of 12 sales price records has been sorted as follows –
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
Partition them into 3 bins by using the following methods –
a) Equal-width partitioning
b) Equal-frequency partitioning
c) Clustering
8.7 REFERENCES
Han, Jiawei, Jian Pei, and Micheline Kamber, Data mining: concepts
and techniques, Second Edition, Elsevier, Morgan Kaufmann, 2011.
*****
MODULE III
9
DATA MINING AND PREPROCESSING
Data Reduction
Unit Structure
9.0 Objectives
9.1 Introduction To Data Reduction
9.2 Introduction To Data Discretization And Concept Hierarchy
Generation
9.3 Data Discretization And Concept Hierarchy Generation For
Numerical Data
9.4 Concept Hierarchy Generation For Categorical Data
9.5 Summary
9.6 Exercises
9.7 References
9.0 OBJECTIVES
In real world we usually deal with big data. So, it takes a long time for
analyzing and mining this big data. In some case it may not be practically
feasible to analyze such a huge amount of data. Data reduction method
results in a simplified and condensed description of the original data
that is much smaller in size/quantity but retains the quality of the
original data. The strategy of data reduction decreases the sheer volume
of data but retains the integrity of the data. Analysis and mining on such
smaller data is practically feasible and mostly results in same analytical
outcomes.
Data Reduction methods are given below –
1. Data Cube Aggregation – Here, data is grouped in a more manageable
format. Mostly it is a summary of the data. Aggregation operations are
applied to the data in the construction of a data cube.
Mining on a reduced data set may also make the discovered pattern easier
to understand. The question arises about how to select the attributes to be
removed. Statistical significance tests are used so that such attributes can
be selected.
For example, retain all wavelet coefficients larger than some particular
threshold and the remaining coefficients are set to 0. The general
procedure for applying a discrete wavelet transform uses a hierarchical
pyramid algorithm that halves the data in each iteration, resulting in fast
computational speed.
Wavelet transforms are well suited for data cube, sparse data or data
which is highly skewed. Wavelet transform is often used in image
compression, computer vision, analysis of time-series data and data
cleansing.
Principal Components Analysis (PCA) – It is a statistical process
which transforms the observations of the correlated features into a set
of linearly uncorrelated features with the help of orthogonal
transformation. These new transformed features are called Principal
Components.
The basic process of PCA is as follows –
a) For each attribute to fall within the same range, the input data is
normalized. So, attributes with large domains do not dominate
attributes with smaller domains.
b) PCA computes k orthonormal vectors that provide a base for the
normalized input data. These unit vectors are called as Principal
Components. The input data is a linear combination of the principal
components.
c) Principal components are sorted in order of decreasing strength. The
principal components essentially serve as a new set of axes for the data,
by providing important information about variance. Observe in the
below figure the direction in which the data varies the most actually
falls along the red line. This is the direction with the most variation in
the data. So, it‘s the first principal component (PC1). The direction
along which the data varies the most out of all directions that are
uncorrelated with the first direction is shown using the blue line. That‘s
the second principal component (PC2).
PCA can also be used for finding hidden patterns if data has high
dimensions. Some fields where PCA is used is AI, Computer vision and
image compression.
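A minimal NumPy sketch of these steps (normalize, compute orthonormal components from the covariance matrix, sort by variance, project); the small two-attribute data set is made up for illustration.

```python
import numpy as np

# Assumed toy data: two correlated attributes, one row per record.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# Step (a): normalize so that no attribute dominates (here, mean-centering).
Xc = X - X.mean(axis=0)

# Steps (b)-(c): orthonormal components from the covariance matrix,
# sorted in order of decreasing variance (eigenvalue).
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]          # PC1 is the first column
explained = eigenvalues[order] / eigenvalues.sum()

# Project the data onto the first principal component (2 dimensions -> 1).
X_reduced = Xc @ components[:, :1]

print(explained)        # e.g. ~[0.97, 0.03]: PC1 captures most of the variance
print(X_reduced.shape)  # (8, 1)
```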
Numerosity Reduction – These techniques replace the original data volume
with alternative, smaller forms of data representation. They may be
parametric, where a model is fitted to the data and only the model
parameters are stored, or non-parametric, which store reduced
representations of the data. Non-parametric methods include histograms,
clustering, sampling and data cube aggregation.
Parametric Methods:
a) Regression: Regression is of two types – Simple Linear regression and
Multiple Linear regression. Regression model is Simple Linear
regression when there is only single independent attribute, however, if
there are multiple independent attributes then the model is Multiple
Linear Regression. In Simple Linear regression the data are modelled to
fit a straight line.
Non-Parametric Methods:
a) Histograms: It is popular method of data reduction. Here, data is
represented in terms of frequency. It uses binning to approximate data
distribution. A Histogram for an attribute partitions the data distribution
of the attribute into disjoint subsets or buckets. The buckets are
displayed on a horizontal axis while the height and area represent the
average frequency of the values depicted by the bucket. Many a times
buckets represent continuous ranges for the given attribute.
There are several partitioning rules for buckets –
o Equal-width – Here, the width of each bucket range is same.
o Equal-frequency – Here, the bucket is created in such a way so that
the number of contiguous data samples in each bucket are roughly the
same.
o V-optimal – It is based on the concept of minimizing a quantity called
as weighted variance. This method of partitioning does a better job of
estimating the bucket contents. V-optimal Histogram attempts to have
the smallest variance possible among the buckets.
o Max-Diff – Here, we consider the difference between each pair of
adjacent values. A bucket boundary is established between each pair for
pairs having the β -1 largest differences, where β is the user specified
number of buckets.
Sampling – Sampling allows a large data set to be represented by a much
smaller random sample (or subset) of the data.
There are many methods using which we can sample a large data set D
containing N tuples –
o Simple Random Sample Without Replacement of size s (SRSWOR) –
Here, s tuples are drawn from the N tuples of dataset D such that
s < N. The probability of drawing any tuple from the dataset D is 1/N,
which means all the tuples have an equal chance of being selected in
the sample.
o Simple Random Sample With Replacement of size s (SRSWR) – It is
similar to SRSWOR, but each tuple drawn from dataset D is recorded
and then placed back into the dataset so that it can be drawn again.
9.2 INTRODUCTION TO DATA DISCRETIZATION
AND CONCEPT HIERARCHY GENERATION
Before Discretization
Age 10,11,13,14,17,19,30,31,32,38,40,42,70,72,73,75
After Discretization:
Discretization can be performed recursively on an attribute to provide
a hierarchical or multi resolution partitioning of the attribute values
known as Concept Hierarchy.
There may be more than one concept hierarchy for a given attribute or
dimension, based on different user viewpoints.
Entropy-based Discretization: Entropy is one of the most commonly
used discretization measures. We always want to make meaningful
splits or partitions in our continuous data. Entropy-based discretization
helps to split the data at points where we will gain the most insight
when we give it to our data mining systems. Entropy describes how
consistently a potential split will match up with a classifier. Lower
entropy is better, and a zero entropy value is the best.
Income<=50000 Income>50000
Age<25 4 6
The above data will result in a high entropy value, almost closer to 1.
Based on the above data we cannot be sure that if a person is below 25
years of age, then he will have income greater than 50000. Because data
indicates that only 6/10 make more than 50000 and the rest makes below
it. Now, let‘s change our data values.
Income<=50000 Income>50000
Age<25 9 1
Now, this data will give a lower entropy value as it provides us more
information on relation between age and income.
So, if you observe entropy value moved from 0.971 to 0.469. We would
have 0 entropy value if we had 10 in one category and 0 in the other
category.
Entropy-based discretization performs the following algorithm –
1. Calculate Entropy value for your data.
2. For each potential split in your data
o Calculate Entropy in each potential bin
o Find the net entropy for your split
o Calculate entropy gain
3. Select the split with highest entropy gain
4. Recursively perform the partition on each split until a termination
criterion is met. Terminate once you have reached a specified number
of bins or terminate once the entropy gain falls below a certain limit.
We want to perform splits which improve the insights we get from our
data. So, we want to perform splits that maximize the insights we get from
our data. Entropy gain measures that. So, we need to find and maximize
entropy gain to perform splits.
Our net information across the two bins is the sum, over the bins, of each
bin's size ratio multiplied by that bin's entropy:
InfoA(D) = (|D1|/|D|) × Entropy(D1) + (|D2|/|D|) × Entropy(D2)
and the entropy gain of a split is Gain = Entropy(D) − InfoA(D).
We will discretize the above given data by first calculating entropy of the
data set.
Now, we will iterate through and see which splits give us the maximum
entropy gain. To find a split, we average two neighbouring values in the
list.
Split 1: 4.5
4 and 5 are the neighbouring values in the list. Suppose we split at (4+5)/2
= 4.5
Now we get 2 bins as follows
Now, we need to calculate entropy for each bin and find the information
gain of this split.
Entropy (D ≤ 4.5) = −(p1 log2 p1 + p2 log2 p2) = 0 + 0 = 0
Entropy (D > 4.5) = −(p1 log2 p1 + p2 log2 p2) = 0.311 + 0.5 = 0.811
Net entropy is InfoA(Dnew) = (|D≤4.5|/|D|)(0) + (|D>4.5|/|D|)(0.811) = 0.6488
Entropy gain is Gain(Dnew) = 0.971 − 0.6488 = 0.322
Split 2: 6.5
Entropy (D ≤ 6.5) = 1
Entropy (D > 6.5) = 0.917
Net entropy is InfoA(Dnew) = 0.944
Entropy gain is Gain(Dnew) = 0.971 − 0.944 = 0.027
This is less gain than we had in the earlier split (0.322), so our best split is
still at 4.5. Let's check the next split, Split 3, at 10.
Split 4: 13.5 – This split will also result in lower entropy gain.
Conclusion – So, now after calculating entropy gains for various splits,
we conclude that the best split is Split3. So, we will partition the data at
10.
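For concreteness, the sketch below implements the gain computation in Python. The original data table is not reproduced in this extract, so the values 4, 5, 8, 12, 15 and the class labels are assumed; they are chosen so that the candidate splits (4.5, 6.5, 10, 13.5) and the entropy values 0.971, 0.811 and 0.322 match the worked example above.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def split_gain(values, labels, threshold):
    """Information gain of splitting the (value, label) pairs at the given threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    info = (len(left) / len(labels)) * entropy(left) \
         + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - info

# Assumed data reproducing the split points of the worked example above.
values = [4, 5, 8, 12, 15]
labels = ["A", "B", "A", "B", "B"]
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]   # 4.5, 6.5, 10.0, 13.5
for t in candidates:
    print(t, round(split_gain(values, labels, t), 3))
best = max(candidates, key=lambda t: split_gain(values, labels, t))
print("best split:", best)   # 10.0
```

With these assumed labels the gain at the split point 10 works out to about 0.42, higher than the 0.322 obtained at 4.5, which is consistent with the conclusion above that Split 3 is the best split.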
The entropy and information gain measures are also used for decision tree
induction.
Interval Merging by χ2 Analysis – Chi merge is a simple algorithm
which uses χ2 (chi-square) statistic to discretize numeric attributes.
It is a supervised bottom-up data discretization technique. Here, we
find the best neighbouring intervals and merge them to form larger
intervals. This process is recursive in nature. The basic idea is that for
accurate discretization, the relative class frequencies should be fairly
consistent within an interval. If two adjacent intervals have a very
similar distribution of classes, then the intervals can be merged.
Otherwise, they should remain separate. It treats intervals as discrete
categories. Initially, in the ChiMerge method each distinct value of a
numerical attribute A is considered to be one interval. A χ² test is
performed for every pair of adjacent intervals. Adjacent intervals with the
least χ² values are merged together, since low χ² values for a pair indicate
similar class distributions. This merging process proceeds recursively until
a predefined stopping criterion is met, such as a significance level, a
maximum number of intervals, or a maximum inconsistency.
The 3-4-5 rule is as follows –
o If an interval covers 3, 6, 7 or 9 distinct values at most significant digit,
then create 3 intervals. Here, there can be 3 equal-width intervals for
3,6,9; and 3 intervals in the grouping of 2-3-2 each for 7.
o If it covers 2,4 or 8 distinct values at most significant digit, then create
4 sub-intervals of equal-width.
o If it covers 1,5 or 10 distinct values at the most significant digit, then
partition the range into 5 equal-width intervals.
For instance, breaking up annual salaries into ranges like 50000 – 100000
is often more desirable than ranges like 51263 – 98765.
Now, we round the Low and High values at the most significant digit
(MSD): rounding Low down at the MSD gives Low = -1000000, and
rounding High up at the MSD gives High = 2000000. So here the range is
2000000 – (-1000000) = 3000000. Considering only the MSD, this interval
covers 3 distinct values.
Now, as the interval covers 3 distinct values at MSD, we will divide this
interval into 3 equal-width size intervals.
Interval 1: (-1000000 to 0]
Interval 2: (0 to 1000000]
Interval 3: (1000000 to 2000000]
Methods for generation of concept hierarchies for categorical data are as
follows –
Specification of a partial ordering of attributes explicitly at the
schema level by users or experts – A user can easily define a concept
hierarchy by specifying ordering of the attributes at schema level.
For instance, dimension ‗location‘ may contain a group of attributes
like street, city, state and country. A hierarchy can be defined by
specifying the total ordering among these attributes at schema level
such as
Street < City < State < Country
Specification of a portion of a hierarchy by explicit data grouping –
We can easily specify explicit groupings for a small portion of
intermediate-level data.
For instance, after specifying that state and country form a hierarchy at
schema level, a user can define some intermediate levels manually such
as
{Jaisalmer, Jaipur, Udaipur} < Rajasthan
Specification of a set of attributes, but not of their partial ordering
– A user can specify a set of attributes forming a concept hierarchy, but
may not explicitly state their partial ordering. The system can then try
to automatically generate the attribute ordering so as to construct a
meaningful concept hierarchy. A concept hierarchy can be
automatically generated based on the number of distinct values per
attribute in the given attribute set. The attribute with the most distinct
values is placed at the lowest level in the hierarchy. The lower the
number of distinct values an attribute has, the higher it is placed in the
hierarchy. You can observe in the below given diagram ‗street‘ is
placed at the lowest level as it has largest number of distinct values.
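A short sketch of this automatic ordering (the toy location records are assumed): attributes are ranked by their number of distinct values, and the attribute with the most distinct values ends up at the lowest level.

```python
# Assumed toy 'location' records: (street, city, state, country)
records = [
    ("MG Road", "Mumbai", "Maharashtra", "India"),
    ("FC Road", "Pune", "Maharashtra", "India"),
    ("Linking Road", "Mumbai", "Maharashtra", "India"),
    ("Brigade Road", "Bengaluru", "Karnataka", "India"),
]
attributes = ["street", "city", "state", "country"]

distinct_counts = {attr: len({rec[i] for rec in records})
                   for i, attr in enumerate(attributes)}

# Fewest distinct values at the top (highest level), most distinct at the bottom.
hierarchy = sorted(attributes, key=lambda a: distinct_counts[a])
print(distinct_counts)                  # {'street': 4, 'city': 3, 'state': 2, 'country': 1}
print(" < ".join(reversed(hierarchy)))  # street < city < state < country
```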
Specification of only a partial set of attributes – Sometimes a user may
include only a small subset of the relevant attributes in the hierarchy
specification. For instance, a user may include only street and city in
the hierarchy specification of dimension 'location'. To handle such
partially specified hierarchies, it is suggested to embed data semantics in
the database schema so that attributes with tight semantic connections
can be pinned together.
9.5 SUMMARY
9.6 EXERCISES
1. Which method according to you is the best method for data reduction?
2. Write a short note on data cube aggregation.
3. What is a concept hierarchy in data mining?
4. How is a concept hierarchy generated for numeric data?
5. How is a concept hierarchy generated for categorical data?
9.7 REFERENCES
*****
UNIT IV
10
ASSOCIATION RULES
Unit structure
10.0 Objectives
10.1 Introduction
10.2 Association Rule Mining
10.3 Support and Confidence
10.3.1 Support
10.3.2 Confidence
10.3.3 Lift
10.4 Frequent Pattern Mining
10.4.1 Market Basket Analysis
10.4.2 Medical Diagnosis
10.4.3 Census Data
10.4.4 Protein Sequence
10.5 Market Basket Analysis
10.5.1 Implementation of MBA
10.6 Apriori Algorithm
10.6.1 Apriori Property
10.6.2 Steps in Apriori
10.6.3 Example of Apriori
10.6.4 Apriori Pseudo Code
10.6.5 Advantages and Disadvantages
10.6.6 Method to Improve Apriori Efficiency
10.6.7 Applications of Apriori
10.7 Associative Classification- Rule Mining
10.7.1 Typical Associative Classification Methods
10.7.2 Rules for Support and confidence in Associative
Classification
10.8 Conclusion
10.9 Summary
10.10 References
10.0 OBJECTIVES
In this chapter we will describe a class of unsupervised learning models
that can be used when the dataset of interest does not include a target
attribute. These are methods that derive association rules, the aim of which
is to identify regular patterns and recurrences within a large set of
transactions. They are fairly simple and intuitive and are frequently used
to investigate sales transactions in market basket analysis and navigation
paths within websites.
Association rule mining represents a data mining technique and its goal is
to find interesting association or correlation relationships among a large
set of data items. With massive amounts of data continuously being
collected and stored in databases, many companies are becoming
interested in mining association rules from their databases to increase their
profits.
The main objective of data mining is to find new, previously unknown and
unexpected information in the available database, information which is
useful and helps in decision making. There are a number of techniques
used in data mining to identify frequent patterns and mine rules, including
cluster analysis, anomaly detection, association rule mining etc. In this
chapter we provide an overview of association rule research.
10.1 INTRODUCTION
Association rule mining has applications in many areas such as market
basket analysis, medical diagnosis, census data, fraud detection on the
web and DNA data analysis.
We can use Association Rules in any dataset where features take only two
values i.e., 0/1. Some examples are listed below:
Market Basket Analysis is a popular application of Association Rules.
People who visit webpage X are likely to visit webpage Y.
People who are in the age group [30,40] and have income [>$100k] are
likely to own a home.
Association rules are usually required to satisfy a user-specified minimum
support and a user-specified minimum confidence at the same time.
Association rule generation is usually split up into two separate steps:
A minimum support threshold is applied to find all frequent itemsets in
a database.
A minimum confidence constraint is applied to these frequent itemsets
in order to form rules.
While the second step is straightforward, the first step needs more
attention.
A set of transactions process aims to find the rules that enable us to predict
the occurrence of a specific item based on the occurrence of other items in
the transaction.
Example:
"If a customer buys bread, he is 70% likely to also buy milk."
In the above association rule, bread is the antecedent and milk is the
consequent. These types of relationships where we can find out some
association or relation between two items is known as single cardinality. It
is all about creating rules, and if the number of items increases, then
cardinality also increases accordingly. So, to measure the associations
between thousands of data items, there are several metrics. If the above
rule is the result of a thorough analysis of some data sets, it can be used
not only to improve customer service but also to improve the company's
revenue.
Association rules are created by thoroughly analyzing data and looking for
frequent if/then patterns. Then, depending on the following two
parameters, the important relationships are observed:
10.3.1 Support:
Support indicates how frequently the if/then relationship appears in the
database. It is the frequency of an itemset, i.e., how frequently it appears
in the dataset, and is defined as the fraction of the transactions T that
contain the itemset X. For an itemset X and a set of transactions T, it can
be written as:
Support(X) = (Number of transactions in T containing X) / (Total number of transactions in T)
10.3.2 Confidence:
Confidence tells about the number of times these relationships have been
found to be true; it indicates how often the rule holds, i.e., how often the
items X and Y occur together in the dataset given that the occurrence of X
is already known. It is the ratio of the transactions that contain both X and
Y to the number of transactions that contain X:
Confidence(X → Y) = Support(X ∪ Y) / Support(X)
10.3.3 Lift:
Lift measures the strength of a rule relative to chance and is defined by the
following formula:
Lift(X → Y) = Confidence(X → Y) / Support(Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
A lift greater than 1 indicates that X and Y occur together more often than
would be expected if they were independent.
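These three measures can be computed directly from a list of transactions, as in the Python sketch below (the toy transactions are assumed).

```python
# Assumed toy transactions for illustrating support, confidence and lift.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "sugar"},
    {"bread", "milk"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

X, Y = {"bread"}, {"milk"}
print(support(X | Y))    # 0.6  -> {bread, milk} appears in 3 of 5 transactions
print(confidence(X, Y))  # 0.75 -> 3 of the 4 bread transactions also contain milk
print(lift(X, Y))        # ~0.94 -> below 1, so the rule adds little beyond chance
```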
Frequent Pattern Mining is an analytical process that finds frequent
patterns, associations, or causal structures from data sets found in various
kinds of databases such as relational databases, transactional databases,
and other data repositories. Frequent pattern: a pattern (a set of items,
subsequences, substructures, etc.) that occurs frequently in a data set.
For instance, a set of items such as pen and ink that often appears together
in a set of transactions is called a frequent (recurrent) itemset. Purchasing
a personal computer, later a digital camera, and then a hard disk, if these
events repeatedly occur in that order in a shopping history database, is a
(frequent) sequential pattern. If a substructure occurs regularly in a graph
database, it is called a (frequent) structural pattern.
Given a set of transactions, we can find rules that will predict the
occurrence of an item based on the occurrences of other items in the
transaction.
Before we start defining the rule, let us first see the basic definitions.
Let‘s look at some areas where Association Rule Mining has helped
quite a lot:
To understand the value of this applied technique, let‘s consider two
business use cases.
Assume that, there are large number of items like Tea, Coffee, Milk,
Sugar. Among these, the customer buys the subset of items as per the
requirement and market gets the information of items which customer has
purchased together. So, the market uses this information to put the items
on different positions (or locations).
Thus, if an itemset is not a frequent (recurrent) itemset, it will not be used
to create larger itemsets. The Apriori procedure is the most frequently
used algorithm among the association rule algorithms applied in the
analysis phase. The problems with the Apriori algorithm are that it scans
the database repeatedly to check for frequent itemsets, and that it also
generates many candidate itemsets that turn out to be infrequent.
Strong associations have been observed among the purchased itemset
groups with regard to the purchase behaviour of the customers of the retail
store. The customers' shopping information was analyzed by using
association rule mining with the Apriori algorithm. As a result of the
analysis, strong and useful association rules were determined between the
product groups with regard to understanding what kind of purchase
behaviour customers exhibit within a certain shopping visit, both in-
category and across different product categories, for the specialty store.
Potato Chips as consequent => Can be used to determine what should
be done to boost its sales.
Bagels in the antecedent => Can be used to see which products would
be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used
to see what products should be sold with Bagels to promote sale of
Potato chips!
The Apriori algorithm was the first algorithm proposed for frequent
itemset mining. It uses prior (a priori) knowledge of frequent itemset
properties. A minimum threshold is set based on expert advice or user
understanding.
The algorithm proceeds iteratively, level by level, until the most frequent
itemset is achieved. A minimum support threshold is given in the problem
or it is assumed by the user.
assumed by the user.
Step 1: In the first iteration of the algorithm, each item is taken as a 1-
itemsets candidate. The algorithm will count the occurrences of
each item.
Step 2: Let there be some minimum support, min_sup ( eg 2). The set of
1 – itemsets whose occurrence is satisfying the min sup are
determined. Only those candidates which count more than or
equal to min_sup, are taken ahead for the next iteration and the
others are pruned.
Step 3: Next, the frequent 2-itemsets with min_sup are discovered. For
this, in the join step the candidate 2-itemsets are generated by
joining the frequent 1-itemsets with themselves.
Step 4: The candidate 2-itemsets are pruned using the min_sup threshold
value. Now the table will have only the 2-itemsets with min_sup.
Step 5: The next iteration will form 3-itemsets using the join and prune
steps. This iteration follows the antimonotone property, whereby
the 2-itemset subsets of each candidate 3-itemset must themselves
satisfy min_sup. If all 2-itemset subsets are frequent then the
superset may be frequent; otherwise it is pruned.
Step 6: Next step will follow making 4-itemset by joining 3-itemset with
itself and pruning if its subset does not meet the min_sup criteria.
The algorithm is stopped when the most frequent itemset is
achieved.
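The join and prune loop described in these steps can be sketched in Python as follows (an illustrative, unoptimized implementation; the example transactions are assumed and do not reproduce TABLE-1, which is not included here).

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    def count(candidates):
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_sup}    # prune step

    items = {item for t in transactions for item in t}
    frequent = count(frozenset([i]) for i in items)                 # frequent 1-itemsets
    all_frequent, k = dict(frequent), 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets,
        # keeping only candidates whose (k-1)-subsets are all frequent (antimonotone).
        candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = count(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [frozenset(t) for t in
                [{"I1", "I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I2"},
                 {"I2", "I3", "I4"}, {"I1", "I2", "I3"}, {"I2", "I3"}]]
print(apriori(transactions, min_sup=3))
```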
TABLE-1
Solution:
Support threshold=50% => 0.5*6= 3 => min_sup=3
2. Prune Step: TABLE -2 shows that I5 item does not meet min_sup=3,
thus it is deleted, only I1, I2, I3, I4 meet min_sup count.
Table-3
Item Count
I1 4
I2 5
I3 4
I4 4
3. Join Step: Form 2-itemset. From TABLE-1 find out the occurrences of
2-itemset.
Table-4
Item Count
I1,I2 4
I1,I3 3
I1,I4 2
I2,I3 4
I2,I4 3
I3,I4 2
4. Prune Step: TABLE-4 shows that the itemsets {I1, I4} and {I3, I4} do
not meet min_sup, thus they are deleted.
Table-5
Item Count
I1,I2 4
I1,I3 3
I2,I3 4
I2,I4 3
5. Join and Prune Step: Form 3-itemset. From the TABLE- 1 find out
occurrences of 3-itemset. From TABLE-5, find out the 2-itemset subsets
which support min_sup.
We can see for itemset {I1, I2, I3} subsets, {I1, I2}, {I1, I3}, {I2, I3} are
occurring in TABLE-5 thus {I1, I2, I3} is frequent.
For itemset {I1, I2, I4}, the subsets are {I1, I2}, {I1, I4} and {I2, I4};
{I1, I4} is not frequent, as it does not occur in TABLE-5, thus {I1, I2, I4}
is not frequent, hence it is deleted.
Table-6
Item
I1,I2,I3
I1,I2,I4
I1,I3,I4
I2,I3,I4
This shows that all the above association rules are strong if minimum
confidence threshold is 60%.
Advantages
1. Easy to understand algorithm
2. Join and Prune steps are easy to implement on large itemsets in large
databases
Disadvantages
1. It requires high computation if the itemsets are very large and the
minimum support is kept very low.
2. The entire database needs to be scanned.
rule sorting, rule pruning, classifier building and class allocation for test
cases.
Associative Classification:
Association rules are generated and analyzed for use in classification
Search for strong associations between frequent patterns (conjunctions
of attribute-value pairs) and class labels
Classification: Based on evaluating a set of rules of the form
p1 ∧ p2 ∧ … ∧ pl → "Aclass = C" (confidence, support)
It explores highly confident associations among multiple attributes and
may overcome some constraints introduced by decision-tree induction,
which considers only one attribute at a time
10.9 SUMMARY
In this chapter we have presented the concepts needed for dealing with
association rules and recalled previous efforts concerning association rule
mining. We presented an efficient algorithm for identifying association
rules of interest and introduced the main ideas of mining association rules:
support, confidence and frequent itemsets, market basket analysis, the
Apriori algorithm and associative classification.
10.10 REFERENCES
*****
Unit V
11
CLASSIFICATION – I
Unit structure
11.0 Objectives
11.1 Introduction
11.2 Classification
11.2.1 Training and Testing
11.2.2 Categories of Classification
11.2.3 Associated Tools and Languages
11.2.4 Advantages and Disadvantages
11.3 Classification of Data Mining
11.3.1 Two Important Steps
a. Model Construction
b. Model Usage
11.3.2 Classification Methods
11.4 Statistical-based Algorithms
11.4.1 Regression
11.4.2 Terminologies Related to the Regression Analysis
11.4.3 Use of Regression Analysis
11.4.4 Types of Regression
a. Linear Regression
b. Logistic Regression
c. Polynomial Regression
d. Support Vector Regression
e. Decision Tree Regression
f. Random Forest Regression
g. Ridge Regression
h. Lasso Regression
11.5 Naïve Bayesian Classification
11.5.1 Working of Naïve Bayes
11.5.2 Advantages and Disadvantages
11.5.3 Types of Naïve Bayes Model
11.6 Distance-based algorithm
11.6.1 K Nearest Neighbor
11.6.2 Working of KNN Algorithm
11.6.3 Advantages and Disadvantages
11.7 Conclusion
11.8 Summary
11.9 References
11.0 OBJECTIVES
11.1 INTRODUCTION
In this chapter, we will review the major classification methods:
classification trees, Bayesian methods, neural networks, logistic regression
and support vector machines; statistical-based algorithms such as
regression and Naïve Bayesian classification; distance-based algorithms
such as K Nearest Neighbour; and decision tree-based algorithms such as
ID3, C4.5 and CART.
11.2 CLASSIFICATION
Classification is a data analysis task, i.e. the process of finding a model
that describes and distinguishes data classes and concepts. Classification
is the problem of identifying to which of a set of categories
(subpopulations), a new observation belongs to, on the basis of a training
set of data containing observations and whose categories membership is
known.
The same is the case with data: a model should be trained on it in order to
get the most accurate results.
There are certain data types associated with data mining that tell us the
format of the data (whether it is in text or numerical format).
Nominal: Values that have no meaningful order.
Example: One needs to choose some material, but of different colors. So,
the color might be Yellow, Green, Black or Red.
Different Colors: Red, Green, Black, Yellow
Ordinal: Values that must have some meaningful order.
Example: Suppose there are grade sheets of few students which might
contain different grades as per their performance such as A, B, C, D
Grades: A, B, C, D
Continuous: May have an infinite number of values, it is in float
type
Example: Measuring the weight of few Students in a sequence or
orderly manner i.e. 50, 51, 52, 53
Weight: 50, 51, 52, 53
Discrete: Finite number of values.
Example: Marks of a Student in a few subjects: 65, 70, 75, 80, 90
Marks: 65, 70, 75, 80, 90
Suppose there are few students and the Result of them are as follows :
3. Generative: It models the distribution of individual classes and tries
to learn the model that generates the data behind the scenes by
estimating assumptions and distributions of the model. Used to predict
the unseen data.
It seems that in Class A (i.e., in 25% of the data), 20 out of 25 emails are
spam and the rest are not. And in Class B (i.e., in 75% of the data), 70 out
of 75 emails are not spam and the rest are spam.
So, if an email contains the word "cheap", what is the probability of it
being spam? (= 80%)
Real–Life Examples:
Market Basket Analysis:
It is a modeling technique that has been associated with frequent
transactions of buying some combination of items.
Example: Amazon and many other Retailers use this technique.
While viewing some products, certain suggestions for the
commodities are shown that some people have bought in the past.
Weather Forecasting:
Changing Patterns in weather conditions needs to be observed based
on parameters such as temperature, humidity, wind direction. This
keen observation also requires the use of previous records in order to
predict it accurately.
Disadvantages:
Privacy: When data is collected and shared, there are chances that a
company may give information about its customers to other vendors or
use this information for its own profit.
11.3 CLASSIFICATION OF DATA MINING SYSTEMS
b. Model usage
The constructed model is used to perform classification of unknown
objects.
A class label of test sample is compared with the resultant class label.
The accuracy of the model is determined by calculating the percentage of
test set samples that are correctly classified by the constructed model.
The test sample data and the training data sample are always different.
11.3.2 Classification methods:
Classification is one of the most commonly used techniques for classifying
large sets of data. This method of data analysis includes supervised
learning algorithms adapted to the quality of the data. The objective is to
learn the relation which links a variable of interest, of qualitative type, to
the other observed variables, possibly for the purpose of prediction. The
algorithm that performs the classification is the classifier, while the
observations are the instances. The classification method uses algorithms
such as decision trees to obtain useful information.
Companies use this approach to learn about the behaviour and preferences
of their customers. With classification, you can distinguish between data
that is useful to your goal and data that is not relevant.
The study of classification in statistics is vast, and there are several types
of classification algorithms you can use depending on the dataset you‘re
working with. Below are the most common algorithms in Data Mining.
a) Statistical-based algorithms- Regression
b) Naïve Bayesian classification
c) Distance-based algorithm- K Nearest Neighbour
d) Decision Tree-based algorithms -ID3
e) C4.5
f) CART
Work on classification has progressed in two main phases. The first phase,
rooted in the statistical community, focused on classical statistical models.
The second, "modern" phase concentrates on more flexible classes of
models, many of which attempt to estimate the joint distribution of the
features within each class, which can in turn provide a classification rule.
11.4.1 Regression:
Regression analysis is a statistical method to model the relationship
between a dependent (target) and independent (predictor) variables with
one or more independent variables. More specifically, Regression analysis
helps us to understand how the value of the dependent variable is changing
corresponding to an independent variable when other independent
variables are held fixed. It predicts continuous/real values such
as temperature, age, salary, price, etc.
In regression, we plot a graph between the variables which best fits the
given data points; using this plot, the machine learning model can make
predictions about the data. In simple words, regression shows a line or
curve that fits the data points on the target-predictor graph in such a way
that the vertical distance between the data points and the regression line is
minimized. The distance between the data points and the line tells whether
the model has captured a strong relationship or not.
a. Linear Regression:
It is one of the most widely known modeling techniques, as it is amongst
the first elite regression analysis methods picked up by people at the time
of learning predictive modeling. Here, the dependent variable is
continuous and the independent variables are most often continuous or
discrete, with a linear regression line.
In multiple linear regression there is more than one independent variable,
while in simple linear regression there is only one independent variable.
Thus, linear regression is best used only when there is a linear relationship
between the independent variable(s) and the dependent variable.
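For concreteness, a simple least-squares line can be fitted in a few lines of NumPy (a sketch; the sample points are made up).

```python
import numpy as np

# Assumed sample data: one independent variable x and a dependent variable y.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
predictions = slope * x + intercept

print(round(slope, 2), round(intercept, 2))   # ~1.94 and ~0.3
print(predictions)
```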
b. LogisticRegression:
Logistic regression is commonly used to determine the probability of
event=Success and event=Failure. Whenever the dependent variable is
binary like 0/1, True/False, Yes/No logistic regression is used. Thus, it can
be said that logistic regression is used to analyze either the close-ended
questions in a survey or the questions demanding numeric response in a
survey.
Please note, logistic regression does not need a linear relationship between
a dependent and an independent variable just like linear regression. The
logistic regression applies a non-linear log transformation for predicting
the odds‘ ratio; therefore, it easily handles various types of relationships
between a dependent and an independent variable.
c. Polynomial Regression:
Polynomial regression is commonly used to analyze the curvilinear data
and this happens when the power of an independent variable is more than
1. In this regression analysis method, the best fit line is never a ‗straight-
line‘ but always a ‗curve line‘ fitting into the data points.
162
Please note, polynomial regression is better to be used when few of the
variables have exponents and few do not have any. Additionally, it can
model non-linearly separable data offering the liberty to choose the exact
exponent for each variable and that too with full control over the modeling
features available.
represents the final decision or result. A decision tree is constructed
starting from the root node/parent node (dataset), which splits into left and
right child nodes (subsets of dataset). These child nodes are further
divided into their children node, and themselves become the parent node
of those nodes.
g. Ridge Regression:
Ridge regression is one of the most robust versions of linear regression in
which a small amount of bias is introduced so that we can get better long
term predictions.
h. Lasso Regression:
Lasso regression is another regularization technique to reduce the
complexity of the model. It is similar to the Ridge Regression except that
penalty term contains only the absolute weights instead of a square of
weights. Since it takes absolute values, hence, it can shrink the slope to 0,
whereas Ridge Regression can only shrink it near to 0.
Bayes‘ Theorem
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge.
It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = (P(B|A) × P(A)) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the
observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that
the hypothesis is true.
P(A) is Prior probability: Probability of the hypothesis before observing
the evidence.
P(B) is Marginal probability: Probability of the evidence.
11.5.1 Working of Naive Bayes Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Weather Yes No
Overcast 5 0
Rainy 2 2
Sunny 3 2
Total 10 4
Likelihood table weather condition:
Weather No Yes
Overcast 0 5 5/14= 0.35
Rainy 2 2 4/14=0.29
Sunny 2 3 5/14=0.35
All 4/14=0.29 10/14=0.71
Applying Bayes' theorem:
P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny) = (3/10 × 10/14) / (5/14) = 0.60
P(No | Sunny) = P(Sunny | No) × P(No) / P(Sunny) = (2/4 × 4/14) / (5/14) = 0.40
Since P(Yes | Sunny) > P(No | Sunny), the player can play on a sunny day.
Applications of Naïve Bayes Classifier:
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier
is an eager learner.
It is used in Text classification such as Spam filtering and Sentiment
analysis.
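The frequency and likelihood computations above can be scripted as a short Python sketch (single attribute, no Laplace smoothing; it reproduces the weather example).

```python
from collections import Counter

# The weather dataset from the example above: (Outlook, Play)
data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

class_counts = Counter(label for _, label in data)                  # counts for Yes / No
cond_counts = Counter((outlook, label) for outlook, label in data)  # counts for P(outlook | label)

def posterior(outlook, label):
    prior = class_counts[label] / len(data)
    likelihood = cond_counts[(outlook, label)] / class_counts[label]
    return likelihood * prior   # proportional to P(label | outlook); P(outlook) cancels out

for label in ("Yes", "No"):
    print(label, round(posterior("Sunny", label) / (5 / 14), 2))
# Yes 0.6, No 0.4 -> P(Yes|Sunny) > P(No|Sunny), so the player can play
```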
11.6.2 Working of KNN Algorithm:
K-nearest neighbors (KNN) algorithm uses ‗feature similarity‘ to predict
the values of new datapoints which further means that the new data point
will be assigned a value based on how closely it matches the points in the
training set. We can understand its working with the help of following
steps −
Step 1 : For implementing any algorithm, we need dataset. So during the
first step of KNN, we must load the training as well as test data.
Step 2: Next, we need to choose the value of K i.e. the nearest data
points. K can be any integer.
Step 3: For each point in the test data do the following −
3.1 − Calculate the distance between test data and each row of
training data with the help of any of the method namely:
Euclidean, Manhattan or Hamming distance. The most
commonly used method to calculate distance is Euclidean.
3.2 − Now, based on the distance value, sort them in ascending
order.
3.3 − Next, it will choose the top K rows from the sorted array.
3.4 − Now, it will assign a class to the test point based on most
frequent class of these rows.
Step 4 – End
Example:
The following is an example to understand the concept of K and working
of KNN algorithm
Now, we need to classify a new data point, shown with a black dot (at point 60,60),
into the blue or red class. We are assuming K = 3, i.e., it will find the three nearest
data points, as shown in the next diagram.
We can see in the above diagram the three nearest neighbours of the data point
with the black dot. Among those three, two of them lie in the Red class, hence the
black dot will also be assigned to the Red class.
In the case of KNN, the algorithm does not literally "compare" the new unclassified
data point with all the others; rather, it performs a mathematical calculation to
measure the distance between the data points in order to make the classification.
Calculating distance:
Calculating the distance between two points (your new sample and each record in
your dataset) is very simple. As said before, there are several ways to get this value;
here we will use the Euclidean distance.
Using this formula, you check the distance between one point and another point in
your dataset, one by one for the whole dataset; the smaller the result of this
calculation, the more similar the two data points are.
To make it simple,
Let's use the example from the previous worksheet, but now with one
unclassified data point; that's the information we want to discover.
We have 5 data points (lines) in this example, and each sample (data/line)
has its attributes (characteristics). Let's imagine that all these are images:
each line would be an image and each column would be one of the image's
pixels.
Let's take the first line, which is the data point we want to classify, and
measure the Euclidean distance to line 2.
1 — Subtraction
Let‘s subtract each attribute (column) from row 1 with the attributes from
row 2, example:
(1–2) = -1
2 — Exponentiation:
After subtracting column 1 of row 2 from column 1 of row 1, we square
the result, so the resulting numbers are always positive, for example:
(1–2)² = (-1)² = 1
3 — Sum
After doing step 2 for all of row 1's columns and row 2's columns, we sum
all these results. Using the columns from the spreadsheet example, we get
the following result:
4 — Square root:
After performing step 3, we take the square root of the sum of squared
differences. In step 3 the result was 8, so we take the square root of 8:
√8 = 2.83, or 8^(1/2) = 2.83
Now you have the Euclidean distance from line 1 to line 2 – it was not so
difficult; you could do it on paper!
Now you only need to do this for all the dataset's lines, from line 1 to every
other line. When you do this, you will have the Euclidean distance from
line 1 to all the other lines; you then sort these distances to get the k (e.g.,
k = 3) smallest ones and check which class appears most often among
them. The class that appears the most times is the class you will use to
classify line 1 (which was not classified before).
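The whole procedure (Euclidean distance to every stored sample, sort, take the k nearest, majority vote) fits in a few lines of Python, sketched below with made-up training data.

```python
import math
from collections import Counter

# Assumed training data: (attribute vector, class label)
training = [((1, 1), "Red"), ((2, 1), "Red"), ((4, 5), "Blue"),
            ((5, 4), "Blue"), ((3, 2), "Red")]

def euclidean(a, b):
    # subtract, square, sum, then take the square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(sample, k=3):
    distances = sorted((euclidean(sample, x), label) for x, label in training)
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]   # majority class among the k nearest

print(knn_classify((2, 2), k=3))   # 'Red' -> the three nearest neighbours are all Red
```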
Advantages:
It is a very simple algorithm to understand and interpret.
It is very useful for non-linear data because the algorithm makes no
assumption about the underlying data.
It is a versatile algorithm, as we can use it for classification as well as
regression.
It has relatively high accuracy, although there are better supervised
learning models than KNN.
Disadvantages:
It is a computationally somewhat expensive algorithm because it stores
all the training data.
It requires high memory storage compared to other supervised
learning algorithms.
Prediction is slow when the number of training examples N is large.
It is very sensitive to the scale of the data as well as to irrelevant features.
11.7 CONCLUSION
No single data mining technique is best for every application; there are
definite differences in the types of problems that are conducive to each
technique.
11.8 SUMMARY
The objectives listed above will have been achieved if readers gain a good
understanding of data mining and are able to develop data mining
applications. There is no doubt that data mining can be a very powerful
technology and methodology for generating information from raw data to
address business and other problems. This usefulness, however, will not
be realised unless knowledge of data mining is put to good use.
11.9 REFERENCES
*****
12
CLASSIFICATION – II
Unit Structure
12.0 Objectives
12.1 Introduction
12.2 Decision Tree
12.2.1 Decision Tree Terminologies
12.2.2 Decision Tree Algorithm
12.2.3 Decision Tree Example
12.2.4 Attribute Selection Measures
a. Information Gain
b. Gini Index
c. Gain Ratio
12.2.5 Overfitting in Decision Trees
a. Pruning Decision Trees
b. Random Forest
12.2.6 Linear vs Tree-based Models
12.2.7 Advantages and Disadvantages
12.3 Iterative Dichotomiser 3 (ID3)
12.3.1 History of ID3
12.3.2 Algorithm of ID3
12.3.3 Advantages and Disadvantages
12.4 C4.5
12.4.1 Algorithm of C4.5
12.4.2 Pseudocode of C4.5
12.4.3 Advantages
12.5 CART (Classification and Regression Tree)
12.5.1 Classification Tree
12.5.2 Regression Tree
12.5.3 Difference between Classification and Regression Trees
12.5.4 Advantages of Classification and Regression Trees
12.5.5 Limitations of Classification and Regression Trees
12.6 Conclusion
12.7 Summary
12.8 References
12.0 OBJECTIVES
After going through this lesson, you will be able to learn the following things:
1. Learn about the decision tree algorithm for classification problems.
2. Decision tree-based classification algorithms serve as the fundamental
step in application of the decision tree method, which is a predictive
modeling technique for classification of data.
3. This chapter provides a broad overview of decision tree-based
algorithms that are among the most commonly used methods for
constructing classifiers.
4. You will also learn the various decision tree methods like ID3, C4.5,
CART etc. in detail.
12.1 INTRODUCTION
In this chapter, you will learn about tree-based algorithms, which are considered
to be among the best and most widely used supervised classification methods.
Tree-based algorithms give predictive models high accuracy, stability and ease
of interpretation. Unlike linear models, they map non-linear relationships quite
well, and they are adaptable to solving almost any kind of problem at hand.
A decision tree is a structure that includes a root node, branches, and leaf
nodes. Every internal node represents a test on an attribute, each branch
represents the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
Decision trees mimic the way humans think while making a decision, so they
are easy to understand; the logic behind a decision tree can be easily followed
because it has a tree-like structure.
Using a decision tree, we can visualize the decisions, which makes them easy to
understand; this is why it is a popular data mining classification technique.
The goal of using a decision tree is to create a training model that can be used to
predict the class or value of the target variable by learning simple decision rules
inferred from prior data (training data). In decision trees, to predict a class label
for a record we start from the root of the tree. We compare the values of the root
attribute with the record's attribute. On
the basis of comparison, we follow the branch corresponding to that value
and jump to the next node.
For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further. It continues this process until it reaches a
leaf node of the tree. The complete process can be better understood using the
algorithm below:
Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
Step-3: Divide S into subsets that contain the possible values for the best
attribute.
Step-4: Generate the decision tree node which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is
reached where you cannot further classify the nodes; the final
node is then called a leaf node.
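A hedged illustration of these steps using scikit-learn is shown below; it is a
sketch only, and the small job-offer table (salary, distance, cab facility) is
invented to match the example that follows. Using criterion="entropy" makes
information gain the attribute selection measure.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# invented toy data: 1 = offer accepted, 0 = declined
data = pd.DataFrame({
    "salary_lakhs": [4, 9, 10, 12, 11, 5],
    "distance_km":  [5, 30, 8, 25, 6, 10],
    "cab_facility": [0, 1, 0, 0, 1, 0],
    "accepted":     [0, 1, 1, 0, 1, 0],
})
X, y = data.drop(columns="accepted"), data["accepted"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # the learned if/else rules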
12.2.3 Example:
Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or not. To solve this problem, the
decision tree starts with the root node (the Salary attribute, chosen by
ASM). The root node splits further into the next decision node (distance
from the office) and one leaf node based on the corresponding labels. The
next decision node further splits into one decision node (cab facility) and
one leaf node. Finally, the decision node splits into two leaf nodes
(Accepted offer and Declined offer). Consider the below diagram:
a. Information Gain:
This method is the main method used to build decision trees. It reduces
the information that is required to classify the tuples and reduces the
number of tests that are needed to classify a given tuple. The attribute
with the highest information gain is selected.
Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build
the decision tree.
A decision tree algorithm always tries to maximize the value of
information gain, and the node/attribute having the highest information
gain is split first. It can be calculated using the formulas below:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
S = total number of samples
P(yes) = probability of yes
P(no) = probability of no
b. Gini Index:
Gini index is a measure of impurity or purity used while creating a
decision tree in the CART (Classification and Regression Tree)
algorithm.
An attribute with the low Gini index should be preferred as compared
to the high Gini index.
It only creates binary splits, and the CART algorithm uses the Gini
index to create binary splits.
Gini index can be calculated using the below formula:
Gini Index = 1 − Σj (Pj)²
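A one-function sketch of this formula (assuming the class counts of a node
are known) is:

def gini_index(class_counts):
    # Gini = 1 - sum_j (P_j)^2, where P_j is the proportion of class j
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini_index([9, 5]))   # a node with 9 and 5 samples -> about 0.459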
c. Gain Ratio
Information gain is biased towards choosing attributes with a large
number of values as root nodes. It means it prefers the attribute with a
large number of distinct values.
C4.5, an improvement of ID3, uses Gain ratio which is a modification
of Information gain that reduces its bias and is usually the best option.
Gain ratio overcomes the problem with information gain by taking into
account the number of branches that would result before making the
split.
It corrects information gain by taking the intrinsic information of a split
into account.
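As a brief, hedged illustration, the sketch below divides an information gain
value by the split information (intrinsic information) of the attribute, which is
how C4.5's gain ratio is commonly defined; the subset sizes and the gain value
0.152 are assumed numbers.

import math

def split_info(subset_sizes):
    # intrinsic information of a split into subsets of the given sizes
    total = sum(subset_sizes)
    return -sum((s / total) * math.log2(s / total) for s in subset_sizes if s)

def gain_ratio(info_gain, subset_sizes):
    si = split_info(subset_sizes)
    return info_gain / si if si else 0.0

# an attribute splitting 14 samples into two subsets of 7 each
print(round(gain_ratio(0.152, [7, 7]), 3))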
12.2.7 Advantages and Disadvantages:
12.3 ITERATIVE DICHOTOMISER 3 (ID3)
ID3 is a precursor to the C4.5 algorithm. ID3 follows the Occam's razor
principle and attempts to create the smallest possible decision tree. It uses
a top-down greedy approach to build a decision tree. In simple words, the
top-down approach means that we start building the tree from the top, and
the greedy approach means that at each iteration we select the best feature
at the present moment to create a node.
12.3.2 ID3 Algorithm:
Calculate the entropy of every attribute using the data set
Split the set into subsets using the attribute for which entropy is
minimum (or, equivalently, information gain is maximum)
Make a decision tree node containing that attribute
Recurse on subsets using remaining attributes
Entropy:
In order to define information gain precisely, we need to discuss
entropy first.
A formula to calculate the homogeneity of a sample.
A completely homogeneous sample has entropy of 0 (leaf node).
An equally divided sample has entropy of 1.
Example
If S is a collection of 14 examples with 9 YES and 5 NO examples
Then,
Entropy(S) = - (9/14) Log2 (9/14) - (5/14) Log2 (5/14) = 0.940
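The sketch below reproduces this computation and shows how ID3 would use
it to score an attribute; it is an illustration only, and the example split (6 yes/1 no
versus 3 yes/4 no) is an assumed one.

import math

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

print(round(entropy(9, 5), 3))   # -> 0.94, matching the example above

def information_gain(parent_counts, child_counts):
    # child_counts: one (pos, neg) pair per value of the attribute
    total = sum(parent_counts)
    weighted = sum((p + n) / total * entropy(p, n) for p, n in child_counts)
    return entropy(*parent_counts) - weighted

# gain of a hypothetical attribute splitting the 9/5 set into two subsets
print(round(information_gain((9, 5), [(6, 1), (3, 4)]), 3))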
Disadvantage of ID3:
Data may be over-fitted or overclassified, if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Classifying continuous data may be computationally expensive, as
many trees must be generated to see where to break the continuum.
12.4 C4.5
This algorithm uses either information gain or gain ratio to decide upon
the classifying attribute. It is a direct improvement on the ID3 algorithm,
as it can handle both continuous and missing attribute values. C4.5 is
given a set of data representing things that are already classified. The
decision trees generated with the help of the C4.5 algorithm can then be
used for classification of the dataset, which is the main reason why C4.5
is also known as a statistical classifier.
The C4.5 algorithm forms a decision tree by computing the gain of each
attribute, where the attribute with the biggest gain is used as the initial
node, or root node. The steps of the C4.5 algorithm in building a decision
tree are as follows:
Select the attribute with the largest gain value as the root.
Create a branch for each value.
Split the cases according to the branches.
Repeat the process for each branch until all cases in the branches have
the same class.
*Note: Entropy and Gain formula given in__________.
Data set 1
Decision tree generated by C4.5 on the data set
12.4.2 Pseudocode:
Check for the base cases.
For each attribute a, find the normalised information gain ratio from
splitting on a.
Let a_best be the attribute with the highest normalised information
gain.
Create a decision node that splits on a_best.
Recurse on the sublists obtained by splitting on a_best, and add those
nodes as children of the decision node.
Further, it is important to know that C4.5 is not the best algorithm in all
cases, but it is very useful in some situations.
12.5 CART (CLASSIFICATION AND REGRESSION TREE)
12.5.1 Classification Tree:
A classification tree is used when the response variable is categorical; in
such cases there are multiple values for the categorical dependent
variable. Here is what a classic classification tree looks like.
Classification Trees
12.5.2 Regression Tree:
A regression tree is used when the response variable is continuous, for
example when predicting the price of a home. The prediction will depend
on both continuous factors like square footage as well as categorical
factors like the style of home, the area in which the property is located,
and so on.
Regression Trees
12.5.3 Difference Between Classification and Regression Trees:
Decision trees are easily understood, and there are several classification
and regression tree tutorials that make things even simpler. However, it is
important to understand that there are some fundamental differences
between classification and regression trees.
When to use Classification and Regression Trees:
Classification trees are used when the dataset needs to be split into classes
that belong to the response variable. In many cases, the classes are simply
Yes or No; in other words, there are just two classes and they are mutually
exclusive. In some cases, there may be more than two classes, in which
case a variant of the classification tree algorithm is used.
Regression trees, on the other hand, are used when the response variable is
continuous. For instance, if the response variable is something like the
price of a property or the temperature of the day, a regression tree is used.
In other words, regression trees are used for prediction-type problems
while classification trees are used for classification-type problems.
How Classification and Regression Trees Work:
A classification tree splits the dataset based on the homogeneity of data.
Say, for instance, there are two variables; income and age; which
determine whether or not a consumer will buy a particular kind of phone.
If the training data shows that 95% of people who are older than 30 bought
the phone, the data gets split there and age becomes a top node in the tree.
This split makes the data ―95% pure‖. Measures of impurity like entropy
or Gini index are used to quantify the homogeneity of the data when it
comes to classification trees.
In a regression tree, a regression model is fit to the target variable using
each of the independent variables. After this, the data is split at several
points for each independent variable.
At each such point, the error between the predicted values and actual
values is squared to get ―A Sum of Squared Errors‖ (SSE). The SSE is
compared across the variables and the variable or point which has the
lowest SSE is chosen as the split point. This process is continued
recursively.
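The following sketch shows a regression tree on invented house-price data
with scikit-learn (assumed to be available, version 1.0 or later for the
"squared_error" criterion), which corresponds to the SSE-based splitting
described above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3000, 200)                        # synthetic square footage
price = 50_000 + 120 * sqft + rng.normal(0, 20_000, 200)  # synthetic prices

model = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
model.fit(sqft.reshape(-1, 1), price)
print(model.predict([[1500.0]]))   # predicted price for a 1500 sq ft home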
CART Working
12.5.4 Advantages of Classification and Regression Trees:
The purpose of the analysis conducted by any classification or regression
tree is to create a set of if-else conditions that allow for the accurate
prediction or classification of a case.
12.5.5 Limitations of Classification and Regression Trees:
There are many classification and regression tree examples where the use
of a decision tree has not led to the optimal result. Here are some of the
limitations of classification and regression trees.
i. Overfitting:
Overfitting occurs when the tree takes into account a lot of noise that
exists in the data and comes up with an inaccurate result.
12.6 CONCLUSION
12.7 SUMMARY
In this chapter, we have discussed one of the most common decision tree
algorithms, ID3. Decision trees can use nominal attributes, whereas many
common machine learning algorithms cannot; however, numeric attributes
must be transformed to nominal ones for ID3. Its evolved version, C4.5,
can additionally handle continuous (numeric) attribute values. The CART
methodology is one of the oldest and most fundamental algorithms. All of
these are excellent for data mining tasks because they require very little
data pre-processing. Decision tree models are easy to understand and
implement, which gives them a strong advantage when compared to other
analytical models.
12.8 REFERENCES
*****
MODULE VI
13
ADVANCED DATABASE
MANAGEMENT SYSTEM
Unit Structure
13.1 What is Clustering?
13.2 Requirements of Clustering
13.3 Clustering Vs Classification
13.4 Types of Clusters
13.5 Distinctions between Sets of Clusters
13.6 What is Cluster Analysis?
13.7 Applications of Cluster Analysis
13.8 What kind of classification is not considered a cluster analysis?
13.9 General Algorithmic Issues
13.10 Clustering Methods
13.11 Clustering Algorithm Applications
13.12 Summary
13.13 Reference for further reading
13.14 Model Questions
Suppose you are the head of a rental store and wish to understand the
preferences of your customers in order to scale up your business. Is it
possible for you to look at the details of each customer and devise a
unique business strategy for each one of them? What you can do instead is
cluster all of your customers into, say, 10 groups based on their purchasing
habits and use a separate strategy for the customers in each of these 10
groups. This is what is called clustering.
Points to Remember:
A cluster of data objects can be treated as one group.
While doing cluster analysis, first the set of data is partitioned into
groups based on data similarity and then labels are assigned to the
groups.
The main advantage of clustering is that it is adaptable to changes and
helps single out useful features that distinguish different groups.
This section explains the requirements for clustering as a data mining
tool, as well as aspects that can be used for comparing clustering methods.
The following are typical requirements of clustering in data mining.
Scalability:
Many clustering algorithms work well on small data sets containing fewer
than several hundred data objects; however, a large database may contain
millions or even billions of objects, particularly in Web search scenarios.
Clustering on only a sample of a given large data set may lead to biased
results. Therefore, highly scalable clustering algorithms are needed.
Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or
Manhattan distance measures (Chapter 2). Algorithms based on such
distance measures tend to find spherical clusters with similar size and
density. However, a cluster could be of any shape. Consider sensors, for
example, which are often deployed for environment surveillance. Cluster
analysis on sensor readings can detect interesting phenomena. We may
want to use clustering to find the frontier of a running forest fire, which is
often not spherical. It is important to develop algorithms that can detect
clusters of arbitrary shape.
What is Clustering?:
Basically, clustering involves grouping data with respect to their
similarities. It is primarily concerned with distance measures and
clustering algorithms which calculate the difference between data and
divide them systematically.
For instance, students with similar learning styles are grouped together
and are taught separately from those with differing learning approaches.
In data mining, clustering is most commonly referred to as an
"unsupervised learning technique", as the grouping is based on a natural or
inherent characteristic. It is applied in several scientific fields such as
information technology, biology, criminology, and medicine.
Characteristics of Clustering:
No Exact Definition:
Clustering has no precise definition that is why there are various clustering
algorithms or cluster models. Roughly speaking, the two kinds of
clustering are hard and soft. Hard clustering is concerned with labeling an
object as simply belonging to a cluster or not. In contrast, soft clustering
or fuzzy clustering specifies the degree as to how something belongs to a
certain group.
Difficult to be Evaluated:
The validation or assessment of results from clustering analysis is often
difficult to ascertain due to its inherent inexactness.
Unsupervised:
As it is an unsupervised learning strategy, the analysis is merely based on
current features; thus, no stringent regulation is needed.
What is Classification?:
Classification entails assigning labels to existing situations or classes;
hence the term "classification". For example, students exhibiting certain
learning characteristics are classified as visual learners. Classification is
also known as a "supervised learning technique", wherein machines learn
from already labeled or classified data. It is highly applicable in pattern
recognition, statistics, and biometrics.
Characteristics of Classification:
Utilizes a ―Classifier‖:
To analyze data, a classifier is a defined algorithm that concretely maps
information to a specific class. For example, a classification algorithm
would train a model to identify whether a certain cell is malignant or
benign.
Supervised:
Classification is a supervised learning technique, as it assigns previously
determined identities based on comparable features. It deduces a function
from a labeled training set.
Supervision:
The main difference is that clustering is unsupervised and is considered as
―self-learning‖ whereas classification is supervised as it depends on
predefined labels.
Labeling:
Clustering works with unlabeled data as it does not need training. On the
other hand, classification deals with both unlabeled and labeled data in its
processes.
Goal:
Clustering groups objects with the aim to narrow down relations as well as
learn novel information from hidden patterns while classification seeks to
determine which explicit group a certain object belongs to.
Specifics:
While classification does not specify what needs to be learned, clustering
specifies the required improvement as it points out the differences by
considering the similarities between data.
Phases:
Generally, clustering only consists of a single phase (grouping) while
classification has two stages, training (model learns from training data set)
and testing (target class is predicted).
Boundary Conditions:
Determining the boundary conditions is highly important in the
classification process as compared to clustering. For instance, knowing the
percentage range of ―low‖ as compared to ―moderate‖ and ―high‖ is
needed in establishing the classification.
Prediction:
As compared to clustering, classification is more involved with prediction,
as it particularly aims to identify target classes. For instance, this may be
applied in facial key-point detection, which can be used in predicting
whether a certain witness is lying or not.
Complexity:
Since classification consists of more stages, deals with prediction, and
involves degrees or levels, its nature is more complicated compared to
clustering, which is mainly concerned with grouping similar attributes.
Hard Clustering:
In hard clustering, each data point either belongs to a cluster completely or
not. For example, in the above example each customer is put into one
group out of the 10 groups.
Soft Clustering:
In soft clustering, instead of putting each data point into a separate cluster,
a probability or likelihood of that data point belonging to each cluster is
assigned. For example, in the above scenario each customer is assigned a
probability of being in any of the 10 clusters of the retail store.
These clusters need not be globular but can have any shape.
Sometimes a threshold is used to specify that all the objects in a cluster
must be sufficiently close to one another. This definition of a cluster is
satisfied only when the data contains natural clusters.
If the data is numerical, the prototype of the cluster is often a centroid i.e.,
the average of all the points in the cluster.
If the data has categorical attributes, the prototype of the cluster is often
a medoid i.e., the most representative point of the cluster.
"Center-based" clusters can also be referred to as prototype-based clusters.
These clusters tend to be globular.
K-Means and K-Medoids are examples of prototype-based clustering
algorithms.
Two objects are connected only if they are within a specified distance
of each other.
Each point in a cluster is closer to at least one point in the same cluster
than to any point in a different cluster.
Useful when clusters are irregular and intertwined.
A clique (a set of nodes in a graph that are all connected to one another) is
another type of graph-based cluster.
Agglomerative hierarchical clustering is closely related to the graph-based
clustering technique.
Density based cluster definition: a cluster is a dense region of objects
surrounded by a region of low density.
Interval-Scaled Variables:
Interval-scaled variables are continuous measurements on a roughly linear
scale. Typical examples include weight and height, latitude and longitude
coordinates (e.g., when clustering houses), and weather temperature. The
measurement unit used can affect the clustering analysis. For example,
changing measurement units from meters to inches for height, or from
kilograms to pounds for weight, may lead to a very different clustering
structure.
To help avoid dependence on the choice of measurement units, the data
should be standardized. Standardizing measurements attempts to give all
variables an equal weight.
Binary Variables:
A binary variable is a variable that can take only 2 values. For example, a
gender variable generally takes the two values male and female.
Contingency Table for Binary Data
Ordinal Variables:
An ordinal variable can be discrete or continuous. Here the order is
important, e.g., rank. It can be treated like an interval-scaled variable by
replacing x_if by its rank r_if in {1, ..., M_f}, and by mapping the range of
each variable onto [0, 1], replacing the i-th object in the f-th variable by
z_if = (r_if − 1) / (M_f − 1).
Then compute the dissimilarity using methods for interval-scaled
variables.
Ratio-Scaled Variables:
A ratio-scaled variable is a positive measurement on a nonlinear scale,
approximately at an exponential scale, such as Ae^(Bt) or Ae^(−Bt).
Methods:
First, treat them like interval-scaled variables — not a good choice!
(why?)
Or apply a logarithmic transformation, i.e. y = log(x).
Or treat them as continuous ordinal data and treat their rank as
interval-scaled.
Data Matrix:
This represents n objects, such as persons, with p variables (also called
measurements or attributes), such as age, height, weight, gender, race and
so on. The structure is in the form of a relational table, or n-by-p matrix (n
objects x p variables).
The Data Matrix is often called a two-mode matrix since the rows and
columns of this represent the different entities.
Dissimilarity Matrix:
This stores a collection of proximities that are available for all pairs of n
objects. It is often represented by an n-by-n table, where d(i,j) is the
measured difference or dissimilarity between objects i and j. In general,
d(i,j) is a non-negative number that is close to 0 when objects i and j are
highly similar or "near" each other, and becomes larger the more they
differ. Note that d(i,j) = d(j,i) and d(i,i) = 0.
This is also called a one-mode matrix, since the rows and columns of it
represent the same entity.
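As a small illustration (the five 2-dimensional objects below are made-up
values), SciPy can build such a dissimilarity matrix directly:

import numpy as np
from scipy.spatial.distance import pdist, squareform

objects = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0], [5.0, 5.0]])
D = squareform(pdist(objects, metric="euclidean"))  # n-by-n, symmetric, d(i,i) = 0
print(np.round(D, 2))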
Graph Partitioning:
The type of classification where areas are not the same and are only
classified based on mutual synergy and relevance is not cluster analysis.
Results of a query:
In this type of classification, the groups are created based on the
specification given from external sources. It is not counted as a Cluster
Analysis.
Simple Segmentation:
Division of names into separate groups of registration based on the last
name does not qualify as Cluster Analysis.
Supervised Classification:
Classification that uses label information is not considered cluster
analysis, because cluster analysis groups objects based only on patterns in
the data itself.
Assessment of Results
How Many Clusters?
Data Preparation
Proximity Measures
Handling Outliers
Assessment of Results:
The data mining clustering process starts with the assessment of whether
any cluster tendency has a place at all, and correspondingly includes,
appropriate attribute selection, and in many cases feature construction. It
finishes with the validation and evaluation of the resulting clustering
system. The clustering system can be assessed by an expert, or by a
particular automated procedure. Traditionally, the first type of assessment
relates to two issues:
Cluster interpretability,
Cluster visualization.
Interpretability depends on the technique used.
Data Preparation:
Irrelevant attributes make chances of a successful clustering futile,
because they negatively affect proximity measures and eliminate
clustering tendency. Therefore, sound exploratory data analysis (EDA) is
essential.
Proximity Measures:
Both hierarchical and partitioning methods use different distances and
similarity measures
Handling Outliers:
Applications that derive their data from measurements have an associated
amount of noise, which can be viewed as outliers. Alternately, outliers can
be viewed as legitimate records having abnormal behavior. In general,
clustering techniques do not distinguish between the two: neither noise nor
abnormalities fit into clusters. Correspondingly, the preferable way to deal
with outliers in partitioning the data is to keep one extra set of outliers, so
as not to pollute factual clusters. There are multiple ways of how
descriptive learning handles outliers. If a summarization or data
preprocessing phase is present, it usually takes care of outliers.
13.10 CLUSTERING METHODS
Hierarchical Method
Partitioning Method
Density-based Method
Grid-Based Method
Model-Based Method
(Diagram: examples of clustering methods, including the K-Means,
Partitioning, Expectation Maximization and Density-Based methods.)
Hierarchical Methods:
A hierarchical clustering method works by grouping data into a tree of
clusters. Hierarchical clustering begins by treating every data point as a
separate cluster. Then it repeatedly executes the following steps:
Identify the two clusters which are closest together, and
Merge these two most comparable clusters.
Continue these steps until all the clusters are merged together.
Agglomerative and Divisive Clustering
Agglomerative — bottom-up approach. Start with many small clusters
and merge them together to create bigger clusters.
Divisive — top-down approach. Start with a single cluster, then break
it up into smaller clusters.
Pros:
No apriori information about the number of clusters required.
Easy to implement and gives best result in some cases.
Cons:
The algorithm can never undo what was done previously.
Time complexity of at least O(n² log n) is required, where n is the
number of data points.
Based on the type of distance matrix chosen for merging, different
algorithms can suffer from one or more of the following:
Sensitivity to noise and outliers
Breaking large clusters
Difficulty handling different sized clusters and convex shapes
No objective function is directly minimized
Sometimes it is difficult to identify the correct number of clusters from
the dendrogram.
Algorithm for Agglomerative Hierarchical Clustering:
1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters
(calculate the proximity matrix).
3. Merge the clusters which are highly similar or close to each other.
4. Recalculate the proximity matrix for each cluster.
5. Repeat steps 3 and 4 until only a single cluster remains.
Step-1:
Consider every data point as an individual cluster; suppose we start
with the clusters [(A), (B), (C), (D), (E), (F)].
Step-2:
In the second step, comparable clusters are merged together to form a
single cluster. Let's say cluster (B) and cluster (C) are very similar to
each other, so we merge them in this step, and similarly with clusters
(D) and (E); we are left with the clusters
[(A), (BC), (DE), (F)].
Step-3:
Recalculate the proximity according to the algorithm and merge the
two nearest clusters ([(DE), (F)]) together to form new clusters:
[(A), (BC), (DEF)].
Step-4:
Repeating the same process, the clusters DEF and BC are comparable
and are merged together to form a new cluster. We are now left with
the clusters [(A), (BCDEF)].
Step-5:
At last, the two remaining clusters are merged together to form a single
cluster [(ABCDEF)].
There are some methods which are used to calculate the similarity between
two clusters:
Distance between two closest points in two clusters.
Distance between two farthest points in two clusters.
The average distance between all points in the two clusters.
Distance between centroids of two clusters.
Linkage Criteria:
Similar to gradient descent, certain parameters can be tweaked to get drastically
different results.
Single Linkage:
The distance between two clusters is the shortest distance between two points in
each cluster
Complete Linkage:
The distance between two clusters is the longest distance between two points in
each cluster
Average Linkage:
The distance between clusters is the average distance between each point in one
cluster to every point in other cluster
Ward Linkage:
The distance between clusters is the sum of squared differences within all clusters
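The sketch below runs agglomerative clustering with each of these linkage
criteria using SciPy; the six sample points are invented, and cutting the tree
into three clusters is an arbitrary choice made only for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 9], [9.2, 8.8]])
for method in ("single", "complete", "average", "ward"):
    Z = linkage(points, method=method)               # the merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(method, labels)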
Distance Metric:
The method used to calculate the distance between data points will affect the end
result.
Euclidean Distance:
The shortest distance between two points. For example, if x = (a,b) and y = (c,d),
the Euclidean distance between x and y is √((a−c)² + (b−d)²)
Manhattan Distance:
Imagine you were in the downtown center of a big city and you wanted to get
from point A to point B. You wouldn‘t be able to cut across buildings, rather
you‘d have to make your way by walking along the various streets. For example,
if x=(a,b) and y=(c,d), the Manhattan distance between x and y is |a−c|+|b−d|
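Written as code, the two formulas above look like this (a tiny sketch for
2-dimensional points only):

def euclidean(x, y):
    (a, b), (c, d) = x, y
    return ((a - c) ** 2 + (b - d) ** 2) ** 0.5

def manhattan(x, y):
    (a, b), (c, d) = x, y
    return abs(a - c) + abs(b - d)

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7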
Divisive:
Divisive hierarchical clustering is precisely the opposite of agglomerative
hierarchical clustering. The divisive clustering algorithm is a top-down
clustering approach in which all the data points are initially taken as a single
cluster, and in every iteration the data points that are not comparable are
separated from the cluster. In the end, we are left with N clusters.
1st image: all the data points belong to one cluster; 2nd image: 1 cluster is
separated from the previous single cluster; 3rd image: a further cluster is
separated from the previous set of clusters.
In the above sample dataset, it is observed that there are 3 clusters that are far
separated from each other, so we stop after obtaining 3 clusters. If the
separation is continued to obtain further clusters, the result below is obtained.
Sample dataset separated into 4 clusters
How to choose which cluster to split?
Check the sum of squared errors of each cluster and choose the one with
the largest value. In the below 2-dimension dataset, currently, the data
points are separated into 2 clusters, for further separating it to form the 3rd
cluster find the sum of squared errors (SSE) for each of the points in a red
cluster and blue cluster.
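A short sketch of this rule is given below; the red and blue point sets are
invented so that one cluster is clearly more spread out than the other.

import numpy as np

clusters = {
    "red":  np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0]]),   # spread out -> high SSE
    "blue": np.array([[5.0, 5.0], [5.1, 5.2], [4.9, 5.1]]),   # tight -> low SSE
}

def sse(points):
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

scores = {name: sse(pts) for name, pts in clusters.items()}
print(scores)
print("split next:", max(scores, key=scores.get))   # the cluster with the largest SSE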
Here are the two approaches that are used to improve the quality of
hierarchical clustering −
Perform careful analysis of object linkages at each hierarchical
partitioning.
Integrate hierarchical agglomeration by first using a hierarchical
agglomerative algorithm to group objects into micro-clusters, and then
performing macro-clustering on the micro-clusters.
5. Hierarchical clustering is comparatively easier to read and understand,
whereas the clusters produced by non-hierarchical clustering are more
difficult to read and understand.
6. Hierarchical clustering is relatively unstable compared to non-hierarchical
clustering, which is a relatively stable technique.
Partitioning Method:
Suppose we are given a database of 'n' objects; the partitioning method
constructs 'k' partitions of the data. Each partition will represent a cluster
and k ≤ n. This means that it classifies the data into k groups, which
satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember:
For a given number of partitions (say k), the partitioning method will
create an initial partitioning.
Then it uses the iterative relocation technique to improve the
partitioning by moving objects from one group to another.
K-means clustering allows us to cluster the data into different groups and
provides a convenient way to discover the categories of groups in an
unlabeled dataset on its own, without the need for any training. It is a
centroid-based algorithm, where each cluster is associated with a centroid.
The main aim of this algorithm is to minimize the sum of distances
between the data points and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into
k clusters, and repeats the process until it cannot find better clusters. The
value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
Determines the best value for K center points or centroids by an iterative
process.
Assigns each data point to its closest k-center. Those data points which are
near to the particular k-center, create a cluster.
213
Hence each cluster has datapoints with some commonalities, and it is
away from other clusters.
o Now assign each data point of the scatter plot to its closest K-point or
centroid. Compute this by calculating the distance between the two
points and drawing a median line between both centroids.
From the above image, it is clear that points on the left side of the line are
nearer to the K1 or blue centroid, and points to the right of the line are
closer to the yellow centroid. Colour them blue and yellow for clear
visualization.
o Next we find new centroids. To choose the new centroids, we compute
the centre of gravity of the points in each cluster and obtain the new
centroids as below:
o Next, reassign each data point to the new centroid. For this, repeat the
same process of finding a median line. The median will be like the
below image:
From the above image, one yellow point is on the left side of the line, and
two blue points are to the right of the line. So, these three points will be
assigned to the new centroids.
o As new centroids are formed, again draw the median line and reassign
the data points. So, the image will be:
o As per the above image, there are no dissimilar data points on either
side of the line, which means the model has converged. Consider the
below image:
As the model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:
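The assign/recompute loop described above can be sketched in a few lines of
NumPy; the data points and the random initial centroids are invented, and a
fixed number of iterations is used for simplicity instead of a convergence test.

import numpy as np

def kmeans(X, k=2, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # random starting centroids
    for _ in range(iters):
        # assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the centre of gravity of its points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[1, 1], [1.2, 0.8], [0.9, 1.1], [8, 8], [8.2, 7.9], [7.8, 8.1]])
print(kmeans(X, k=2))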
How to choose the value of "K number of clusters" in K-means
Clustering?
The performance of the K-means clustering algorithm depends upon the
highly efficient clusters that it forms, but choosing the optimal number of
clusters is a big task. There are several ways to find the optimal number of
clusters; here the discussion is on the most appropriate method to find the
number of clusters, or value of K.
Elbow Method:
The elbow method is one of the most popular ways to find the optimal
number of clusters. This method uses the concept of the WCSS value.
WCSS stands for Within-Cluster Sum of Squares, which measures the
total variation within a cluster. The value of WCSS (for 3 clusters) is
calculated as:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)²
+ Σ(Pi in Cluster3) distance(Pi, C3)²
To find the optimal value of clusters, the elbow method follows the steps
below:
It executes K-means clustering on a given dataset for different K
values (ranging from 1 to 10).
For each value of K, it calculates the WCSS value.
It plots a curve between the calculated WCSS values and the number
of clusters K.
The sharp point of bend, where the plot looks like an arm, is
considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this technique
is known as the elbow method. The graph for the elbow method looks like
the image below:
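A minimal sketch of the elbow method with scikit-learn is shown below; the
data are synthetic blobs generated purely for illustration, and KMeans.inertia_
is the library's name for the WCSS value.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)        # within-cluster sum of squares for this K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()                          # the 'elbow' in this curve suggests the best K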
Applications:
K-means algorithm is very popular and used in a variety of applications
such as market segmentation, document clustering, image segmentation
and image compression, etc. The goal usually when we undergo a cluster
analysis is either:
Get a meaningful intuition of the structure of the data we‘re dealing
with.
Cluster-then-predict, where different models will be built for different
subgroups if there is wide variation in the behaviour of the different
subgroups. An example is clustering patients into different subgroups
and building a model for each subgroup to predict the probability of the
risk of having a heart attack.
Algorithm:
1. Initialize: select k random points out of the n data points as the
medoids.
2. Associate each data point to the closest medoid by using any common
distance metric methods.
3. While the cost decreases:
For each medoid m and for each data point o which is not a medoid:
1. Swap m and o, associate each data point to the closest medoid, and
recompute the cost.
2. If the total cost is more than that in the previous step, undo the swap.
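A compact sketch of this swap loop is shown below. It is an illustration only:
the point list follows the commonly used ten-point example associated with this
algorithm (the table of data points referred to in the text appears to have been
lost), and Manhattan distance is used as the dissimilarity measure.

import numpy as np

def total_cost(X, medoid_idx):
    # sum of Manhattan distances from each point to its nearest medoid
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def pam(X, k=2, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))   # step 1: random medoids
    improved = True
    while improved:                                         # step 3: while the cost decreases
        improved = False
        for mi in range(k):
            for o in range(len(X)):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = o                           # swap a medoid with a non-medoid
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids, improved = candidate, True     # keep the cheaper configuration
    return medoids

X = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
              [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])
print(pam(X, k=2))   # indices of the final medoids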
If a graph is drawn using the above data points, we obtain the following:
Step 1:
Each point is assigned to that cluster whose dissimilarity is less. So, the
points 1, 2, 5 go to cluster C1 and 0, 3, 6, 7, 8 go to cluster C2.
As the swap cost is not less than zero, we undo the swap. Hence (3,
4) and (7, 4) are the final medoids. The clustering would be in the
following way
Advantages:
1. It is simple to understand and easy to implement.
2. K-Medoid Algorithm is fast and converges in a fixed number of
steps.
3. PAM is less sensitive to outliers than other partitioning algorithms.
Disadvantages:
1. The main disadvantage of K-Medoid algorithms is that it is not
suitable for clustering non-spherical (arbitrary shaped) groups of
objects. This is because it relies on minimizing the distances
between the non-medoid objects and the medoid (the cluster centre)
– briefly, it uses compactness as clustering criteria instead of
connectivity.
2. It may obtain different results for different runs on the same dataset
because the first k medoids are chosen randomly.
Partitional:
Data points are divided into a finite number of partitions (non-overlapping
subsets), i.e., each data point is assigned to exactly one subset.
Hierarchical:
Data points are placed into a set of nested clusters organized into a
hierarchical tree, i.e., the tree expresses a continuum of similarities and
clusterings.
Density-based Method:
This method is based on the notion of density. The basic idea is to
continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold, i.e., for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a
minimum number of points.
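DBSCAN is a well-known algorithm built on exactly this density notion. A
minimal sketch with scikit-learn on synthetic data (generated only for
illustration) is given below; eps is the neighbourhood radius and min_samples
is the minimum number of points required inside it.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(np.unique(labels))   # cluster ids; -1 would mark noise points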
Grid-based Method:
In this, the objects together form a grid. The object space is quantized into
finite number of cells that form a grid structure.
Advantages:
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the
quantized space.
Model-based methods:
In this method, a model is hypothesized for each cluster to find the best fit
of data for a given model. This method locates the clusters by clustering
the density function. It reflects spatial distribution of the data points.
This method also provides a way to automatically determine the number
of clusters based on standard statistics, taking outlier or noise into account.
It therefore yields robust clustering methods.
Constraint-based Method:
In this method, the clustering is performed by the incorporation of user or
application-oriented constraints. A constraint refers to the user expectation
or the properties of desired clustering results. Constraints provide us with
an interactive way of communication with the clustering process.
Constraints can be specified by the user or the application requirement.
13.11 CLUSTERING ALGORITHM APPLICATIONS
Recommendation engines:
The recommendation engine is a widely used method for providing
automated, personalized suggestions about products, services and
information, and collaborative filtering is one of the most famous
recommendation techniques. In this method, the clustering algorithm
provides an idea of like-minded users. Computation on data provided by
several users is leveraged for improving the performance of collaborative
filtering methods, and this can be implemented for rendering
recommendations in diverse applications.
Even when dealing with extensive data, clustering is suitable as a first step
for narrowing the choice of relevant neighbours in collaborative filtering
algorithms, which also enhances the performance of complex
recommendation engines. Essentially, each cluster is assigned specific
preferences on the basis of the choices of the customers who belong to that
cluster. Then, within each cluster, customers receive recommendations
estimated at the cluster level.
For example, once the groups are created, you can conduct a test campaign
on each group by sending marketing copy and according to response, you
can send more target messages (consisting information about products and
services) to them in future.
Under the customer segmentation application, various clusters of
customers are made with respect to their particular attributes. On the basis
of user-based analysis, a company can identify potential customers for
their products or services.
Identifying fake news:
What the problem is: Fake news is being created and spread at a rapid rate
due to technology innovations such as social media. The issue gained
particular attention during the 2016 US presidential campaign, during
which the term Fake News was referenced an unprecedented number of
times.
How clustering works: In a paper recently published by two computer
science students at the University of California, Riverside, they are using
clustering algorithms to identify fake news based on the content. The way
that the algorithm works is by taking in the content of the fake news
article, the corpus, examining the words used and then clustering them.
These clusters are what helps the algorithm determine which pieces are
genuine and which are fake news. Certain words are found more
commonly in sensationalized, click-bait articles. When you see a high
percentage of specific terms in an article, it gives a higher probability of
the material being fake news.
Spam filter:
You know the junk folder in your email inbox? It is the place where
emails that have been identified as spam by the algorithm end up. Many
machine learning courses, such as Andrew Ng's famed Coursera course,
use the spam filter as an example of unsupervised learning and clustering.
What the problem is: Spam emails are at best an annoying part of
modern day marketing techniques, and at worst, an example of people
phishing for your personal data. To avoid getting these emails in your
main inbox, email companies use algorithms. The purpose of these
algorithms is to flag an email as spam correctly or not.
These groups can then be classified to identify which are spam. Including
clustering in the classification process improves the accuracy of the filter
to 97%. This is excellent news for people who want to be sure they are not
missing out on their favourite newsletters and offers.
Classifying network traffic:
What the problem is: As more and more services begin to use APIs on
your application, or as your website grows, it is important that you know
where the traffic is coming from. For example, you want to be able to
block harmful traffic and double down on areas driving growth. However,
it is hard to know which is which when it comes to classifying the traffic.
How clustering works: by clustering the traffic sources, you are able to
grow your site and plan capacity effectively.
What is the problem: You need to look into fraudulent driving activity.
The challenge is how do you identify what is true and which is false?
How clustering works: By analysing the GPS logs, the algorithm is able
to group similar behaviors. Based on the characteristics of the groups you
are then able to classify them into those that are real and which are
fraudulent.
Document analysis:
There are many different reasons why you would want to run an analysis
on a document. In this scenario, you want to be able to organize the
documents quickly and efficiently.
What the problem is: Imagine you are limited in time and need to
organize the information held in documents quickly. To be able to
complete this task you need to: understand the theme of the text, compare
it with other documents and classify it.
How clustering works: Hierarchical clustering has been used to solve this
problem. The algorithm is able to look at the text and group it into
different themes. Using this technique, you can cluster and organize
similar documents quickly using the characteristics identified in the
paragraph.
What is the problem: Who should you have in your team? Which players
are going to perform best for your team and allow you to beat the
competition? The challenge at the start of the season is that there is very
little if any data available to help you identify the winning players.
13.12 SUMMARY
12. What is the difference between hierarchical clustering and non hierarchical
clustering?
13. Discuss in detail the various types of data that are considered in
cluster analysis.
14. Given two objects represented by the tuples (22,1,42,10) and (20,0,36,8)
Compute the Manhattan distance between the two objects.
https://www.Javatpoint.com
https://www.Geeksforgeeks.org
https://tutorialspoint.com
https://datafloq.com/read/7-innovative-uses-of-clustering-algorithms/6224
https://en.wikipedia.org/wiki/Cluster_analysis
*****
14
WEB MINING
Unit Structure
14.0 Objectives
14.1 Introduction
14.2 An Overview
14.2.1 What is Web mining?
14.2.2 Applications of web mining
14.2.3 Types of techniques of mining
14.2.4 Difference Between Web Content, Web Structure, and Web
Usage Mining
14.2.5 Comparison between data mining and web mining
14.3 Future Trends
14.3.1 To Adopt a Framework
14.3.2 A Systematic Approach
14.3.3 Emerging Standards to Address the Redundancy and Quality
of Web-Based Content
14.3.4 Development of Tools and Standards
14.3.5 Use of Intelligent Software
14.4 Web Personalization
14.4.1 Personalization Process
14.4.2 Data Acquisition
14.4.3 Data Analysis
14.5 Tools and Standards
14.6 Trends and challenges in personalization
14.7 Let us Sum Up
14.8 List of References
14.9 Bibliography
14.10 Chapter End Exercises
14.0 OBJECTIVES
14.1 INTRODUCTION
Web mining is pushing the World Wide Web toward a more valuable
environment in which users can rapidly and effectively discover the
information they need. It incorporates the discovery and analysis of data,
documents, and multimedia from the World Wide Web. Web mining uses
document content, hyperlink structure, and usage statistics to help users
meet their information needs. The Web itself and Web search engines
contain relationship information about documents. Content mining is the
first area of Web mining: discovering keywords and the relationship
between a Web page's content and a query's content. Hyperlinks provide
information about other documents on the Web thought to be important to
another document. These links add depth to the documents, providing the
multi-dimensionality that characterizes the Web. Mining this link structure
is the second area of Web mining. Finally, there are relationships to other
documents on the Web that are identified by previous searches. These
relationships are recorded in logs of searches and accesses.
Mining these logs is the third area of Web mining. Understanding the
user is also a significant part of Web mining. Analysis of the user's past
sessions, preferred display of information, and expressed preferences may
influence the Web pages returned in response to a query.
Web mining is interdisciplinary in nature, spanning fields such as
information retrieval, natural language processing, information extraction,
AI, databases, data mining, data warehousing, user interface design, and
visualization. Strategies for mining the Web have practical application in
m-commerce, online business, e-government, e-learning, distance
learning, virtual organizations, knowledge management and digital
libraries.
14.2 AN OVERVIEW
The only difference between data warehousing and Web warehousing is
that in the latter, the underlying database is the entire World Wide Web.
As a readily accessible resource, the Web is a huge data warehouse that
contains volatile information which is gathered and extracted into
something valuable for use within the organization. Using traditional data
processing methodologies and techniques (Tech Reference, 2003), Web
mining is the process of extracting data from the Web and sorting it into
identifiable patterns and relationships.
(Surviving table row, Method: machine learning and statistical
techniques (including NLP); proprietary algorithms; association rules.)
They are:
(1) to capture, classify, and share both data and information;
(2) to focus on the collaborative efforts among individuals and
communities with an emphasis on learning and training; and
(3) to focus on the knowledge and expertise used in the everyday work
environment.
utilization. The implication is that if software products are developed to
support these activities, the products may be proprietary in nature. The
American National Standards Institute (ANSI) is one of the bodies that
propose U.S. standards for acceptance by the International Standards
Organization (ISO). Every standards organization is made up of
volunteers who develop, evaluate, and submit a standard for formal
approval. There can be competing standards under development, and
some must be reconciled. This could affect the development work and the
collaborative efforts needed to build effective taxonomies and ontologies.
It implies a systematic methodology for capturing explicit and implicit
knowledge within a well-developed infrastructure.
The Web has become a gigantic store of information and continues to
grow dramatically under no control, while the human capacity to find,
read and understand content remains constant. Providing people with
access to information is not the issue; the issue is that people with varying
needs and preferences navigate through enormous Web structures, missing
the target of their enquiry. Web personalization is one of the most
promising approaches for alleviating this information overload, providing
tailored Web experiences. This part explores the different faces of
personalization, traces back its roots and follows its progress. It describes
the modules typically comprising a personalization process, shows its
close relation to Web mining, describes the technical issues that arise,
recommends solutions whenever possible, and discusses the effectiveness
of personalization and the related concerns. Furthermore, the part presents
current trends in the field, suggesting directions that may lead to new
scientific results.
Data derived from further processing the observed usage regularities
(measurements of the frequency of selecting an option/link/service,
production of suggestions/recommendations based on situation-action
correlations, or variations of this approach, for example recording action
sequences).
These entries are in most cases removed from the log data, as they do not
reveal genuine usage information. In any case, the final decision on the
best way to handle them depends on the particular application. After
cleaning, log entries are usually parsed into data fields for easier
manipulation. Apart from removing entries from the log data, data
preparation generally also includes enriching the usage data by adding the
missing clicks to the user clickstream. The reason for this task is client
and proxy caching, which causes many requests not to be recorded in the
server logs and to be served by cached page views instead. The process of
restoring the complete click stream is called path completion, and it is the
last step in pre-processing usage data.
Missing page view requests can be detected when the referrer page for a
page view is not part of the previous page view. The only sound way to
obtain the complete user path is to use either a software agent or a
modified browser on the client side. In all remaining cases the available
solutions (using, for example, apart from the referrer field, information
about the link structure of the site) are heuristic in nature and cannot
guarantee accuracy. Apart from the path completion problem, there
remains a set of other technical obstacles that must be overcome during
data preparation and pre-processing.
More specifically, an important such issue is user identification. Various
strategies are deployed for user identification, and the general assessment
is that the more accurate a strategy is, the higher the privacy invasion
issue it faces. Assuming that every IP address/agent pair identifies a
unique user does not always hold, as many users may use the same
computer to access the Web and the same user may access the Web from
different computers. An embedded session ID requires dynamic sites, and
while it distinguishes different users coming from the same IP/agent, it
fails to identify the same user across different IPs. Cookies and software
agents accomplish both objectives but are usually not well accepted (or
are even rejected and disabled) by most users.
Registration also provides reliable identification, but not all users will go
through such a procedure or remember logins and passwords. On the
other hand, modified browsers may provide exact records of user
behaviour even across Websites, but they are not a viable solution in the
majority of cases, as they require installation and only a limited number
of users will install and use them. Last but not least, there arises the issue
of session identification. Trivial solutions tackle this by setting a
minimum time threshold and assuming that subsequent requests from the
same user exceeding it belong to different sessions (or by using a
maximum threshold for concluding a session).
Pattern Discovery:
Pattern discovery aims to detect interesting patterns in the pre-processed
Web usage data by deploying statistical and data mining methods. These
methods usually involve (Eirinaki and Vazirgiannis, 2003):
• Clustering:
A technique used for grouping items that have similar characteristics. In our case the items may be either users (who exhibit similar online behaviour) or pages (which are used by users in similar ways); a minimal clustering sketch follows this list.
• Classification:
A process that learns to assign data items to one of several predefined classes. The classes usually represent different user profiles, and classification is performed using selected features with high discriminative ability with respect to the set of classes describing each profile.
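As an illustration of clustering user sessions, the sketch below groups made-up binary page-visit vectors with k-means; it assumes scikit-learn and NumPy are available, and the page set and session data are invented for the example.

from sklearn.cluster import KMeans
import numpy as np

# Each row is one user session as a binary page-visit vector over a small,
# made-up page set: [home, courses, fees, research, contact].
sessions = np.array([
    [1, 1, 1, 0, 0],   # prospective-student-like behaviour
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 0],   # researcher-like behaviour
    [0, 0, 0, 1, 1],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sessions)
print(kmeans.labels_)          # e.g. [0 0 1 1]: two groups of similar sessions
print(kmeans.cluster_centers_) # centroids summarise the typical pages of each group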
This approach is indeed superior to other, more conventional techniques (such as collaborative or content-based filtering) in terms of both scalability and reliance on objective input data (rather than, for example, subjective user ratings). However, usage-based personalization can also be problematic when little usage data is available for certain objects, or when the site content changes regularly. Mobasher et al. (2000a) argue that, for more effective personalization, both the usage and the content attributes of a site should be integrated into the data analysis phase and used consistently as the basis of all personalization decisions.
From the above it is clear that personalizing the Web experience for users by addressing their exact needs and preferences is a challenging task for the Web industry. Web-based applications (e.g., portals, e-commerce sites, e-learning environments, and so forth) can improve their performance by using attractive new tools such as dynamic recommendations based on individual characteristics and recorded navigational history. The question that arises, however, is how this can actually be accomplished. Both the Web industry and researchers from diverse scientific areas have focused on various aspects of the topic. The research approaches, as well as the commercial tools that deliver personalized Web experiences based on business rules, Website content and structure, and the user behaviour recorded in Web log files, are numerous.
The system recommends pages from clusters that closely match the current session. For personalizing a site according to the requirements of each user, Spiliopoulou (2000) describes a process based on discovering and analysing user navigational patterns. By mining these patterns we can gain insight into a Website's usage and optimality with respect to its current user population. Usage patterns extracted from Web data have been applied to a wide range of applications. WebSIFT (Cooley et al., 1997, 1999b, 2000) is a Web site information filter system that combines usage, content, and structure information about a Website. The information filter automatically identifies the discovered patterns that have a high degree of subjective interestingness.
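A hedged sketch of such cluster-based recommendation: given illustrative cluster centroids (loosely continuing the clustering example above), it picks the centroid closest to the current session by cosine similarity and suggests unvisited pages with high centroid weight. All names and numbers are assumptions made for illustration.

import numpy as np

def recommend(current_session, centroids, page_names, top_k=2):
    """Pick the centroid closest to the current session (cosine similarity),
    then suggest its highest-weighted pages that the session has not visited yet."""
    sims = [
        np.dot(current_session, c) / (np.linalg.norm(current_session) * np.linalg.norm(c) + 1e-9)
        for c in centroids
    ]
    best = centroids[int(np.argmax(sims))]
    candidates = [(w, p) for w, p, seen in zip(best, page_names, current_session) if not seen]
    return [p for _, p in sorted(candidates, reverse=True)[:top_k]]

pages = ["home", "courses", "fees", "research", "contact"]
# Illustrative centroids in the spirit of the clustering sketch above.
centroids = np.array([[1.0, 1.0, 0.5, 0.0, 0.5],
                      [0.5, 0.0, 0.0, 1.0, 0.5]])
print(recommend(np.array([1, 1, 0, 0, 0]), centroids, pages))  # e.g. ['fees', 'contact']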
Site visitors must be convinced that any collected information will remain private and secure. P3P enables Websites to express their privacy practices in a standard format that can be retrieved automatically and interpreted easily by user agents. P3P user agents allow users to be informed of site practices (in both machine- and human-readable formats) and to automate decision-making based on these practices where appropriate. In this way, users need not read the privacy policies at every site they visit. However, while P3P provides a standard mechanism for describing privacy practices, it does not ensure that Websites actually follow them.
The Open Profiling Standard (OPS) is a standard proposed by Netscape that enables Web personalization. It allows users to keep profile records on their hard drives, which can be accessed by authorized Web servers. Users have access to these records and can control the information that is presented; such records can replace cookies and manual online registration. The OPS has been examined by the W3C, and its key ideas have been incorporated into P3P. Customer Profile Exchange (CPEX) is an open standard for facilitating the privacy-enabled exchange of customer information across disparate enterprise applications and systems. It integrates online and offline customer data in an XML-based data model for use within various enterprise applications both on and off the Web, resulting in a networked, customer-focused environment. The CPEX working group intends to develop an open-source reference implementation and developer guidelines to speed adoption of the standard among vendors.
2. Web Mining and Social Networking: Techniques and Applications by Guandong Xu, Yanchun Zhang, and Lin Li
3. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications) by Bing Liu
4. Mining the Web by Soumen Chakrabarti
14.9 BIBLIOGRAPHY
1. Wang Bin and Liu Zhijing, "Web mining research," Proceedings Fifth
International Conference on Computational Intelligence and
Multimedia Applications. ICCIMA 2003, 2003, pp. 84-89, doi:
10.1109/ICCIMA.2003.1238105.
2. G. Dileep Kumar and Manohar Gosul, "Web Mining Research and Future Directions", Advances in Network Security and Applications, Volume 196, 2011, ISBN: 978-3-642-22539-0.
3. Lorentzen, D. G., "Webometrics benefitting from web mining? An investigation of methods and applications of two research fields", Scientometrics 99, 409–445 (2014).
4. Jeong, D. H., Hwang, M., Kim, J., Song, S. K., Jung, H., Peters, C., Pietras, N., Kim, D. W., "Information Service Quality Evaluation Model from the User's Perspective", The 2nd International Semantic Technology (JIST) Conference, Nara, Japan, 2012.
5. Helena Ahonen, Oskari Heinonen, Mika Klemettinen, A. Inkeri Verkamo, "Applying Data Mining Techniques in Text Analysis", Report C-1997-23, Department of Computer Science, University of Helsinki, 1997.
6. Web Mining: Applications and Techniques by Anthony Scime, State University of New York College at Brockport, USA.
7. Web Mining and Social Networking: Techniques and Applications by Guandong Xu, Yanchun Zhang, and Lin Li.
8. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications) by Bing Liu.
9. Mining the Web by Soumen Chakrabarti
6. Explain Web structure mining.
7. What is Web Usage Mining?
8. Compare data mining and Web mining.
9. List the differences between Web content mining, Web structure mining, and Web usage mining.
10. Explain Web Personalization.
*****
15
TEXT MINING
Unit Structure
15.0 Objectives
15.1 Introduction
15.2 An Overview
15.2.1 What is Text mining?
15.2.2 Text Mining Techniques
15.2.3 Information Retrieval Basics
15.2.4 Text Databases and Information Retrieval
15.3 Basic methods of Text Retrieval
15.3.1 Text Retrieval Methods
15.3.2 Boolean Retrieval Model
15.3.3 Vector Space Model
15.4 Usage of Text Mining
15.5 Areas of text mining in data mining
15.5.1 Information Extraction
15.5.2 Natural Language Processing
15.5.3 Data Mining and IR
15.6 Text Mining Process
15.7 Text Mining Approaches in Data Mining
15.7.1 Keyword-based Association Analysis
15.7.2 Document Classification Analysis
15.8 Numericizing text
15.8.1 Stemming algorithms
15.8.2 Support for different languages
15.8.3 Exclude certain characters
15.8.4 Include lists, exclude lists (stop-words)
15.9 What is Natural Language Processing (NLP)?
15.9.1 Machine Learning and Natural Language Processing
15.10 Big Data and the Limitations of Keyword Search
15.11 Ontologies, Vocabularies and Custom Dictionaries
15.12 Enterprise-Level Natural Language Processing
15.13 Analytical Tools
15.14 Scalability
15.15 Issues in Text Mining Field
15.16 Let us Sum Up
15.17 List of References
15.18 Bibliography
15.19 Chapter End Exercises
15.0 OBJECTIVES
15.1 INTRODUCTION
Text mining is a variation on a field called data mining, which tries to find interesting patterns in large data sets. Text databases are growing rapidly because of the increasing amount of information available in electronic form, such as electronic publications, various kinds of electronic records, email, and the World Wide Web. Nowadays most of the information in government, industry, business, and other institutions is stored electronically, in the form of text databases.
15.2 AN OVERVIEW
Information put away in most content data sets are semi organized
information in that they are neither totally unstructured nor totally
organized. For instance, a report might contain a couple of organized
fields, like title, creators, distribution date, and class, etc, yet in addition
contain some generally unstructured content parts, like theoretical and
substance. There has been a lot of studies on the demonstrating and
execution of semi organized information in late data set examination. In
addition, data recovery strategies, for example, text ordering techniques,
have been created to deal with unstructured reports. Customary data
recovery strategies become insufficient for the undeniably tremendous
measures of text information. Ordinarily, just a little part of the numerous
accessible archives will be applicable to a given individual client.
All text mining approaches use information retrieval mechanisms; indeed, the distinction between information retrieval methods and text mining is blurred. In the next section information retrieval fundamentals are discussed. Several sophisticated extensions to basic information retrieval developed in the legal field are then described, followed by examples of information extraction, text summarization, text classification and text clustering in law. Legal information retrieval involves searching both structured and unstructured content. For structured information, the semantics can be clearly determined and can be described with simple and clear concepts. This category of information contains, for example, identification data of the texts, data for version management, and the function and role of particular components. Such data are often added to the documents in the form of metadata (i.e., data that describe other data). Unstructured information typically occurs in natural-language texts or in other formats such as audio and video, and generally has a complex semantics.
The index terms selected concern single words and multi-word phrases and are intended to reflect the content of the text. She asserts that a typical process of selecting natural-language index terms that reflect a text's content consists of the following steps (a minimal sketch of some of these steps follows the list):
1. Lexical analysis – the text is parsed, and individual words are recognized.
2. The removal of stopwords – a text retrieval system usually associates a stop list with a set of documents. A stop list is a set of words that are considered irrelevant (such as a, the, for) or at least insignificant for the given query.
3. The optional reduction of the remaining words to their stem form – a group of different words may share the same word stem. The text retrieval system needs to recognize groups of words that differ from one another only by small syntactic variations and to use a single word from each group, e.g. use only breach instead of breaches, breach, breached. There are several methods for stemming, many of which rely on linguistic knowledge of the collection's language.
4. The optional formation of phrases as index terms – methods of phrase recognition use the statistics of word co-occurrences or rely on semantic knowledge of the collection's language.
5. The optional replacement of words, word stems or phrases by their thesaurus class terms – a thesaurus replaces the individual words or phrases of a text by more uniform concepts.
6. The computation of the importance indicator or term weight of each remaining word or word stem, thesaurus class term or phrase term.
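The sketch below illustrates steps 1, 2, 3 and 6 of this process in Python; the tiny stop list and the crude suffix-stripping rule are simplifications standing in for a real stop list and a proper stemmer such as Porter's.

import re
from collections import Counter

STOP_WORDS = {"a", "the", "for", "of", "and", "in", "to", "is"}   # tiny illustrative stop list

def crude_stem(word):
    """Very rough suffix stripping; real systems use e.g. the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    # 1. Lexical analysis: split the text into individual words.
    words = re.findall(r"[a-z]+", text.lower())
    # 2. Stop-word removal: drop words irrelevant for retrieval.
    words = [w for w in words if w not in STOP_WORDS]
    # 3. Optional reduction of the remaining words to a stem form.
    stems = [crude_stem(w) for w in words]
    # 6. Term weighting: here simply the term frequency of each stem.
    return Counter(stems)

print(index_terms("The breach breached the breaches of the database"))
# Counter({'breach': 3, 'database': 1})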
Information retrieval (IR):
• A field developed in parallel with database systems.
• Information is organized into (a large number of) documents.
• IR deals with the problem of locating relevant documents with respect to a user's input or preferences.
• IR systems and DBMSs address different problems: typical DBMS issues are updates, transaction management and complex objects.
• Exact-match querying alone is generally not suitable for satisfying an information need; it is useful only in very specific domains where users have considerable expertise.
15.3.3 Vector Space Model:
A record and a question are addressed as vectors in high dimensional
space comparing to every one of the keywords. Pertinence is estimated
with a suitable closeness measure characterized over the vector space.
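A minimal sketch of the vector space model, representing the query and documents as term-frequency vectors and ranking documents by cosine similarity; the documents and query below are invented for illustration.

import math
from collections import Counter

def cosine(q, d):
    """Cosine similarity between two bag-of-terms vectors (as Counters)."""
    common = set(q) & set(d)
    dot = sum(q[t] * d[t] for t in common)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": Counter({"database": 3, "query": 2, "index": 1}),
    "d2": Counter({"mining": 2, "text": 2, "database": 1}),
}
query = Counter({"database": 1, "query": 1})

ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)   # ['d1', 'd2'] -- d1 matches the query more closely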
Issues:
when a particular model is incorporated as a component of a larger system.
15.5.3 Data Mining and IR:
Data mining refers to the extraction of useful knowledge and hidden patterns from large data sets. Data mining tools can predict behaviours and future trends, allowing organizations to make better data-driven decisions, and they can be used to resolve many business problems that have traditionally been too time-consuming to address.
Information retrieval deals with retrieving useful data from the data stored in our systems. As an analogy, the search facilities found on websites, such as e-commerce sites or any other sites, can be viewed as part of information retrieval.
The text mining market has experienced exponential growth and adoption over the last few years and is expected to see further significant growth and adoption in the coming years. One of the primary reasons behind the adoption of text mining is stronger competition in the business market, with many organizations seeking value-added solutions in order to compete with other organizations. With increasing competition in business and changing customer perspectives, organizations are making huge investments to find a solution capable of analysing customer and competitor data to improve competitiveness.
The primary sources of data are e-commerce websites, social media platforms, published articles, surveys, and many more. The larger part of the generated data is unstructured, which makes it challenging and expensive for organizations to analyse with human effort alone. This challenge, combined with the exponential growth in data generation, has led to the development of analytical tools. They are not only able to handle large volumes of text data but also help in decision-making. Text mining software empowers a user to draw useful information from a vast set of available data sources.
Text Pre-processing:
Pre-processing is a significant task and a critical step in Text Mining, Natural Language Processing (NLP), and Information Retrieval (IR). In the field of text mining, data pre-processing is used for extracting useful information and knowledge from unstructured text data. IR involves choosing which documents in a collection should be retrieved to satisfy the user's need.
Feature selection:
Feature selection is a significant part of data mining. It can be defined as the process of reducing the inputs to be processed or of identifying the essential information sources. It is also called "variable selection".
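One simple way to realize such variable selection is to keep only terms whose document frequency falls inside a chosen band; the thresholds in the sketch below are arbitrary illustrations, not prescribed values.

from collections import Counter

def select_features(tokenized_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that occur in at least `min_df` documents but not in
    almost every document; both thresholds are arbitrary illustrations."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                      # document frequency, not term frequency
    return {
        t for t, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    }

docs = [
    ["database", "query", "index"],
    ["database", "mining", "text"],
    ["database", "query", "optimizer"],
]
print(select_features(docs))   # {'query'} -- 'database' is too common, the rest too rare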
Data Mining:
Now, in this step, the text mining workflow merges with the traditional process: classic data mining techniques are applied to the structured database produced by the preceding steps.
Evaluate:
Afterward, the results are evaluated. Once the result has been evaluated, it is discarded and the process ends.
Applications:
The main applications are explained below:
Risk Management:
Risk management is a systematic and logical procedure for analysing, identifying, treating, and monitoring the risks involved in any action or process of an organization. Inadequate risk analysis is usually a leading cause of failure; this is especially true in financial organizations, where the adoption of risk management software based on text mining technology can effectively improve the ability to reduce risk. It enables the administration of millions of sources and petabytes of text documents, makes it possible to link the information, and helps in accessing the appropriate information at the right time.
Customer Care Service:
Text mining methods, particularly NLP, also support customer care, drawing on text-based data from various sources such as customer feedback, surveys, customer calls, and so on. The primary objective of text analysis here is to reduce the response time of the organization and to help address customer complaints quickly and efficiently.
Business Intelligence:
Organizations and business firms have started to use text mining techniques as a significant part of their business intelligence. Besides providing deep insights into customer behaviour and trends, text mining techniques also help organizations analyse the strengths and weaknesses of their rivals, giving them a competitive advantage in the market.
The following text mining approaches are used in data mining.
15.8 NUMERICIZING TEXT
Today's natural language processing systems can analyse unlimited amounts of text-based data without fatigue and in a consistent, unbiased manner. They can understand concepts within complex contexts and interpret ambiguities of language to extract key facts and relationships or to provide summaries. Given the huge quantity of unstructured data created every day, from electronic health records (EHRs) to social media posts, this kind of automation has become essential to analysing text-based data efficiently.
Search engines, text analytics tools and natural language processing solutions become even more powerful when deployed with domain-specific ontologies. Ontologies enable the real meaning of the text to be understood, even when it is expressed in different ways (for example, Tylenol versus acetaminophen). NLP techniques extend the power of ontologies, for instance by allowing the matching of terms with different spellings (oestrogen or estrogen) and by taking context into account ("SCT" can refer to the gene "Secretin" or to the "Step Climbing Test").
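A minimal sketch of this kind of ontology-backed normalization; the synonym table and the context-dependent expansion of "SCT" are tiny, made-up stand-ins for a real ontology or custom dictionary.

# Tiny, made-up synonym table standing in for a real ontology such as MeSH.
SYNONYMS = {
    "tylenol": "acetaminophen",
    "paracetamol": "acetaminophen",
    "oestrogen": "estrogen",
}

# Context-dependent expansion of an ambiguous abbreviation (illustrative only).
ABBREVIATIONS = {
    "sct": {"gene": "secretin", "exercise": "step climbing test"},
}

def normalize(term, context=None):
    term = term.lower()
    if term in SYNONYMS:
        return SYNONYMS[term]
    if term in ABBREVIATIONS and context in ABBREVIATIONS[term]:
        return ABBREVIATIONS[term][context]
    return term

print(normalize("Tylenol"))                 # acetaminophen
print(normalize("SCT", context="gene"))     # secretin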
• The ability to identify, tag and search in specific document sections (zones), for instance focusing a query so as to remove noise from a paper's reference section.
• Linguistic processing to identify the meaningful units within text, such as sentences and noun and verb groups, together with the relationships between them.
• Semantic tools that identify concepts within the text, such as drugs and diseases, and normalize them to concepts from standard ontologies. In addition to core life-science and healthcare ontologies such as MedDRA and MeSH, the ability to add their own dictionaries is a requirement for many organizations.
• Pattern recognition to find and identify categories of information that are not easily defined with a dictionary approach. These include dates, numerical information, biomedical terms (for example concentration, volume, dosage, energy) and gene/protein mutations.
• The ability to process tables embedded within the text, whether formatted using HTML or XML or given as free text.
15.14 SCALABILITY
Numerous issues occur during the text mining process and affect the efficiency and effectiveness of decision-making. Complications can arise at the intermediate stage of text mining: in the pre-processing stage, various rules and guidelines are defined to standardize the text and make the text mining process efficient. Before pattern analysis can be applied to the documents, the unstructured data must be converted into an intermediate form, and at this stage the mining process has its own complications.
It takes a great deal of effort and time to develop and deploy modules for every field separately. Words with the same spelling can carry different meanings, for instance fly (the insect) and fly (the verb); text mining tools treat both as similar even though one is a verb and the other is a noun. Handling syntactic rules according to the nature and context of the text is still an open issue in the field of text mining.
In general, ranking is preferred and more essential, since relevance is a matter of degree; even if we can select the right documents, it is still desirable to rank them. Therefore, most existing research in information retrieval has assumed that the goal is to develop a good ranking function. We will cover various approaches to ranking documents later; these are also called retrieval models.
All retrieval systems have some common components. One of them is the tokenizer, which maps a text to a stream of tokens/terms. This relates to the more general problem of representing text in the system in some form so that a query can be matched with a document. The dominant technique for text representation is to represent a text as a "bag of terms"; tokenization is concerned with determining those terms.
A related question is whether one should do stemming, and the answer depends strongly on the specific application.
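A small sketch of a tokenizer with stemming as an optional switch, reflecting that design choice; the naive plural-stripping rule is only an illustration of where a real stemmer would sit.

import re

def tokenize(text, stem=False):
    """Map raw text to a stream of tokens; stemming is optional because
    whether it helps depends on the application."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if stem:
        tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens

print(tokenize("Retrieval systems rank documents"))
# ['retrieval', 'systems', 'rank', 'documents']
print(tokenize("Retrieval systems rank documents", stem=True))
# ['retrieval', 'system', 'rank', 'document']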
15.18 BIBLIOGRAPHY
*****
16
INFORMATION RETRIEVAL
Unit Structure
16.0 Objectives
16.1 Introduction
16.2 An Overview
16.2.1 What is Information Retrieval?
16.2.2 What is an IR Model?
16.2.3 Components of Information Retrieval/ IR Model
16.3 Difference Between Information Retrieval and Data Retrieval
16.4 User Interaction with Information Retrieval System
16.5 Past, Present, and Future of Information Retrieval
16.5.1 IR on the Web
16.5.2 Why is IR difficult?
16.6 Functional Overview
16.6.1 Item Normalization
16.6.2 Selective Dissemination (Distribution, Spreading) of
Information
16.6.3 Document Database Search
16.6.4 Multimedia Database Search
16.7 Application areas within IR
16.7.1 Cross language retrieval
16.7.2 Speech/broadcast retrieval
16.7.3 Text Categorization
16.7.4 Text Summarization
16.7.5 Structured document element retrieval
16.8 Web Information Retrieval Models
16.8.1 Vector Model
16.8.2 Vector space model
16.8.3 Probabilistic Model
16.9 Let us Sum Up
16.10 List of References
16.11 Bibliography
16.12 Chapter End Exercises
16.0 OBJECTIVES
Knowledge about Information Retrieval
Various IR models
Components of IR
User Interaction
Application Area
16.1 INTRODUCTION
An IR system can represent, store, organize, and provide access to information items. A set of keywords is required in order to search; keywords are what people search for in search engines, and they summarize the description of the information need.
16.2 AN OVERVIEW
The system searches over billions of documents stored on millions of computers. A spam filter, with manual or automatic means of classifying mail, is provided by email programs so that messages can be placed directly into particular folders. Information retrieval takes place, for example, whenever a user enters a query into the system.
belong to a vocabulary V. An IR model determines the query-document matching function according to four main approaches (see Figure 1).
Figure 1: IR Model
Acquisition:
In this step, documents and other objects are selected from various Web resources that consist of text-based documents. The required data is gathered by Web crawlers and stored in the database.
Representation:
It consists of indexing that uses free-text terms, a controlled vocabulary, and both manual and automatic techniques. Example: abstracting (which involves summarizing) and bibliographic description, which includes the author, title, sources, data, and metadata.
Query:
An IR process begins when a user enters a query into the system. Queries are formal statements of information needs, for example, search strings in Web search engines. In information retrieval, a query does not uniquely identify a single object in the collection; instead, several objects may match the query, perhaps with different degrees of relevance.
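The sketch below illustrates this partial matching with a toy inverted index: several documents can match a query, each with a different score. The documents and the scoring rule (a count of matched query terms) are assumptions made for illustration.

from collections import defaultdict

# A toy collection; in practice these would be crawled documents stored in the database.
docs = {
    1: "information retrieval finds relevant documents",
    2: "database systems manage structured data",
    3: "web search engines use information retrieval models",
}

# Build an inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Return documents ranked by how many query terms they contain;
    several documents may match, each to a different degree."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("information retrieval models"))   # [(3, 3), (1, 2)]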
16.3 DIFFERENCE BETWEEN INFORMATION
RETRIEVAL AND DATA RETRIEVAL
The User Task:
The information need must first be converted into a query by the user. In an information retrieval system, a set of words conveys the semantics of the information that is required, whereas in a data retrieval system a query expression conveys the constraints that the objects must satisfy. Example: a user may set out to look for something but end up browsing through something else; this means that the user is browsing rather than searching. The figure above shows the interaction of the user through the different tasks.
1. Early Developments:
As the need for large amounts of information increased, it became necessary to build data structures that allow faster access. The index is the data structure used for faster retrieval of information. For centuries, documents were classified manually into hierarchies.
Multimedia documents
Great variation of document quality
Multilingual problem
The processing tokens and their representation are used to define the searchable content derived from the complete received text. Normalizing the data takes the different external formats of input data and translates them into formats acceptable to the system. A system may have a single format for all items or may allow multiple formats. One example of normalization is the translation of foreign languages into Unicode. Every language has a different internal binary encoding for its characters; one standard encoding that covers English, French, Spanish, and so on is ISO Latin.
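A minimal sketch of this kind of normalization, decoding inputs that arrive in different declared encodings into Unicode and storing them as UTF-8; the sample items and encodings are invented.

# Hypothetical inputs arriving in different national encodings.
raw_items = [
    ("latin-1 report", "Café métier".encode("latin-1")),
    ("utf-16 report",  "データベース".encode("utf-16")),
]

def normalize_item(raw_bytes, declared_encoding):
    """Translate an external encoding into the system's single internal
    format (Unicode strings, stored as UTF-8)."""
    text = raw_bytes.decode(declared_encoding)
    return text.encode("utf-8")

print(normalize_item(raw_items[0][1], "latin-1").decode("utf-8"))   # Café métier
print(normalize_item(raw_items[1][1], "utf-16").decode("utf-8"))    # データベース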
To help users, especially professional indexers, in creating indexes, the system provides a process called Automatic File Build (AFB). Multimedia adds an extra dimension to the normalization process: as well as the textual input, the multimedia input also has to be normalized, and there are many options for the standards applied during normalization. If the input is video, the appropriate digital standards will be MPEG-2, MPEG-1, AVI or Real Media. MPEG (Moving Picture Experts Group) standards are the most universal standards for higher-quality video, whereas Real Media is the most common standard for the lower-quality video used on the Internet. Audio standards are typically WAV or Real Media (Real Audio). Images vary from JPEG to BMP.
The next process is to parse the item into logical sub-divisions that have meaning to the user. This process, called "zoning", is visible to the user and is used to increase the precision of a search and to optimize the display. A typical item is subdivided into zones, which may overlap and can be hierarchical, such as Title, Author, Abstract, Main Text, Conclusion, and References. The zoning information is passed to the processing-token identification operation and stored, allowing searches to be restricted to a specific zone. For example, if the user is interested in articles discussing "Einstein", the search should not include the Bibliography, which could contain references to articles written by "Einstein". Systems determine words by dividing the input symbols into three classes (a minimal classification sketch follows this list):
1) Valid word symbols
2) Inter-word symbols
3) Special processing symbols.
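A hedged sketch of such symbol classification; the actual membership of each class is system-specific, so the three character sets below are only illustrative assumptions.

import string

VALID_WORD_SYMBOLS = set(string.ascii_letters + string.digits)
INTER_WORD_SYMBOLS = set(" \t\n,;")
SPECIAL_SYMBOLS = set(".-/")          # e.g. periods in "U.S.", hyphens, dates

def classify(ch):
    """Assign each input symbol to one of the three classes described above."""
    if ch in VALID_WORD_SYMBOLS:
        return "word"
    if ch in INTER_WORD_SYMBOLS:
        return "inter-word"
    if ch in SPECIAL_SYMBOLS:
        return "special"
    return "other"

sample = "U.S. data, 2021-03"
print([(ch, classify(ch)) for ch in sample])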
16.6.2 Selective Dissemination (Distribution, Spreading) of
Information:
The Selective Dissemination of Information (Mail) process provides the capability to dynamically compare newly received items in the information system against standing statements of interest of users, and to deliver an item to those users whose statement of interest matches its contents. The Mail process is composed of the search process, the user statements of interest (profiles) and the user mail files. As each item is received, it is processed against every user's profile. A profile contains a typically broad search statement along with a list of user mail files that will receive the document if the search statement in the profile is satisfied. Selective Dissemination of Information has not yet been applied to multimedia sources.
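A minimal sketch of profile matching for selective dissemination; the profiles, the mail-file names and the match threshold are invented for illustration.

# Hypothetical standing profiles: each user states terms of interest and a mail file.
profiles = {
    "prof_a": {"terms": {"data", "mining"},      "mail_file": "prof_a_inbox"},
    "prof_b": {"terms": {"multimedia", "video"}, "mail_file": "prof_b_inbox"},
}

def disseminate(item_text, profiles, threshold=2):
    """Deliver a newly received item to every user whose profile terms
    match its content at least `threshold` times (an arbitrary cut-off)."""
    tokens = item_text.lower().split()
    deliveries = []
    for user, profile in profiles.items():
        hits = sum(1 for t in tokens if t in profile["terms"])
        if hits >= threshold:
            deliveries.append(profile["mail_file"])
    return deliveries

item = "New survey on data mining for web usage data"
print(disseminate(item, profiles))   # ['prof_a_inbox']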
Each user can have one or more Private Index files, leading to a very large number of files. Each Private Index file references only a small subset of the total number of items in the Document Database. Public Index files are maintained by professional library services staff and typically index every item in the Document Database; there are only a few Public Index files. These files have access lists (i.e., lists of users and their privileges) that determine who may search or retrieve data. Private Index files typically have very restricted access lists.
Such techniques are greatly needed to handle the ever-growing amount of text data available online, both to help discover relevant information more effectively and to consume relevant information faster.
Advantages:
– Its term-weighting scheme improves retrieval performance.
– Its partial-matching strategy allows retrieval of documents that approximate the query conditions.
– Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
Disadvantage:
– The assumption of mutual independence between index terms.
14. Give brief notes on the user relevance feedback method and how it is used in query expansion.
15. What is the use of Link analysis?
16.11 BIBLIOGRAPHY
1. Sanjib Kumar Sahu, D. P. Mahapatra, R. C. Balabantaray,
"Analytical study on intelligent information retrieval system using
semantic network", Computing Communication and Automation
(ICCCA) 2016 International Conference on, pp. 704-710, 2016.
2. Federico Bergenti, Enrico Franchi, Agostino Poggi, Collaboration
and the Semantic Web, pp. 83, 2012.
3. Introduction to information retrieval - Book by Christopher D.
Manning, Hinrich Schütze, and Prabhakar Raghavan
4. Information Retrieval: Implementing and Evaluating Search Engines
- Book by Charles L. A. Clarke, Gordon Cormack, and Stefan
Büttcher
5. Introduction to Modern Information Retrieval - Book by Gobinda G.
Chowdhury
6. Stefan Buettcher, Charles L. A. Clarke, Gordon V. Cormack,
Information Retrieval: Implementing and Evaluating Search
Engines, The MIT Press, 2010.
7. Ophir Frieder, "Information Retrieval: Algorithms and Heuristics" (The Information Retrieval Series), 2nd Edition, Springer, 2004.
8. Manu Konchady, "Building Search Applications: Lucene, LingPipe and Gate", First Edition, Mustru Publishing, 2008.
*****