
DEPARTMENT OF COMPUTER SCIENCE &

ENGINEERING

UNIT-1 BUILD DATA WAREHOUSE AND EXPLORE WEKA


A. Build a Data Warehouse/Data Mart (using open source tools like the Pentaho Data Integration tool and Pentaho Business Analytics, or other data warehouse tools like Microsoft SSIS, Informatica, Business Objects, etc.)

(i). Identify source tables and populate sample data.

Create the Data Warehouse

We create three dimension tables and one fact table in the data warehouse: DimDate, DimCustomer, DimVan and FactHire. We populate the three dimension tables but leave the fact table empty; a sketch for loading it follows the script.

The following script is used to create the dimension and fact tables and to populate the dimensions.

Create data warehouse

create database TopHireDW

go

use TopHireDW

go

Create Date Dimension

if exists (select * from sys.tables where name = 'DimDate')

drop table DimDate

go

create table DimDate


( DateKey int not null primary key,

[Year] varchar(7), [Month] varchar(7), [Date] date, DateString varchar(10))

go

Populate Date Dimension

truncate table DimDate

go

declare @i int, @Date date, @StartDate date, @EndDate date, @DateKey int,@DateString
varchar(10), @Year varchar(4), @Month varchar(7), @Date1 varchar(20)

set @StartDate = '2006-01-01'

set @EndDate = '2016-12-31'

set @Date = @StartDate

insert into DimDate (DateKey, [Year], [Month], [Date], DateString)

values (0, 'Unknown', 'Unknown', '0001-01-01', 'Unknown') --The unknown row

while @Date <= @EndDate


begin

set @DateString = convert(varchar(10), @Date, 20)

set @DateKey = convert(int, replace(@DateString,'-',''))

set @Year = left(@DateString,4)

set @Month = left(@DateString, 7)

insert into DimDate (DateKey, [Year], [Month], [Date], DateString)

values (@DateKey, @Year, @Month, @Date, @DateString)


set @Date = dateadd(d, 1, @Date)

end

go

select * from DimDate

Create Customer dimension

if exists (select * from sys.tables where name = 'DimCustomer')

drop table DimCustomer

go

create table DimCustomer

( CustomerKey int not null identity(1,1) primary key,CustomerId varchar(20) not null,
CustomerName varchar(30), DateOfBirth date, Town varchar(50), TelephoneNo varchar(30),
DrivingLicenceNo varchar(30), Occupation varchar(30) )
go

insert into DimCustomer (CustomerId, CustomerName, DateOfBirth, Town, TelephoneNo,


DrivingLicenceNo, Occupation)

select * from HireBase.dbo.Customer

select * from DimCustomer

Create Van dimension

if exists (select * from sys.tables where name = 'DimVan')

drop table DimVan


go

create table DimVan


( VanKey int not null identity(1,1) primary key, RegNo varchar(10) not null, Make varchar(30),
Model varchar(30), [Year] varchar(4), Colour varchar(20), CC int, Class varchar(10) )
go

insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)

select * from HireBase.dbo.Van

go

select * from DimVan

Create Hirefact table

if exists (select * from sys.tables where name = 'FactHire')

drop table FactHire

go

create table FactHire

( SnapshotDateKey int not null, --Daily periodic snapshot fact table

HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys

HireId varchar(10) not null, --Degenerate Dimension

NoOfDays int, VanHire money, SatNavHire money,

Insurance money, DamageWaiver money, TotalBill money )

go


select * from FactHire
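The fact table is left empty by the script above. Below is a minimal T-SQL sketch for loading it, assuming the operational database HireBase contains a Hire transaction table with the listed columns (that table and its column names are an assumption for illustration, not part of the script above):

-- Hypothetical load: look up surrogate keys in the dimensions, then insert the measures.
-- Assumes HireBase.dbo.Hire(HireId, HireDate, CustomerId, RegNo, NoOfDays,
-- VanHire, SatNavHire, Insurance, DamageWaiver, TotalBill).
insert into FactHire (SnapshotDateKey, HireDateKey, CustomerKey, VanKey, HireId,
    NoOfDays, VanHire, SatNavHire, Insurance, DamageWaiver, TotalBill)
select convert(int, replace(convert(varchar(10), getdate(), 20), '-', '')),  -- today's DateKey
       convert(int, replace(convert(varchar(10), h.HireDate, 20), '-', '')), -- hire DateKey
       c.CustomerKey, v.VanKey, h.HireId,
       h.NoOfDays, h.VanHire, h.SatNavHire, h.Insurance, h.DamageWaiver, h.TotalBill
from HireBase.dbo.Hire h
join DimCustomer c on c.CustomerId = h.CustomerId
join DimVan v on v.RegNo = h.RegNo

go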

(ii). Design multi-dimensional data models, namely Star, Snowflake and Fact Constellation schemas, for any one enterprise (e.g. Banking, Insurance, Finance, Healthcare, Manufacturing, Automobile, etc.)

Schema Definition

Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.

Star Schema

• Each dimension in a star schema is represented with only one dimension table.

• This dimension table contains the set of attributes.

• The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.

• There is a fact table at the center. It contains the keys to each of the four dimensions.

• The fact table also contains the attributes, namely dollars sold and units sold.
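In DMQL, this star schema can be sketched with the cube and dimension primitives introduced above (the attribute lists follow the description; adjust them to the actual diagram):

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)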


Fig: Star schema

Snowflake Schema
● Some dimension tables in the Snowflake schema are normalized.
● The normalization splits up the data into additional tables.
● Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
● Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
● The supplier_key is linked to the supplier dimension table, which contains the attributes supplier_key and supplier_type.

Fig: Snowflake Schema


Fact Constellation Schema

● A fact constellation has multiple fact tables. It is also known as galaxy schema.

• The following diagram shows two fact tables, namely sales and shipping.

• The sales fact table is the same as that in the star schema.

• The shipping fact table has five dimension keys, namely item_key, time_key, shipper_key, from_location and to_location.

• The shipping fact table also contains two measures, namely dollars sold and units sold.

• It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Fig: Fact constellation schema

(iii). Write ETL scripts and implement using data warehouse tools

ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers
a process of how the data are loaded from the source system to the data warehouse.
Extraction–transformation–loading (ETL) tools are pieces of software responsible for the
extraction of data from several sources, its cleansing, customization, reformatting, integration,
and insertion into a data warehouse.


Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is complex and time consuming, and it consumes most of a data warehouse project's implementation effort, cost, and resources.

Building a data warehouse requires focusing closely on understanding three main areas:

1. Source Area- The source area has standard models, such as the entity relationship diagram.

2. Destination Area- The destination area has standard models, such as the star schema.

3. Mapping Area- The mapping area, however, has no standard model to date.

Abbreviations

● ETL-extraction–transformation–loading

● DW-data warehouse

● DM- data mart

● OLAP- on-line analytical processing

● DS-data sources

● ODS- operational data store

● DSA- data staging area

● DBMS- database management system

● OLTP-on-line transaction processing

● CDC-change data capture

● SCD-slowly changing dimension

● FCME- first-class modeling elements

● EMD- entity mapping diagram


ETL Process:

Extract

The Extract step covers the data extraction from the source system and makes the data accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system with as few resources as possible. The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.

Transform

The transform step applies a set of rules to transform the data from the source to the
target. This includes converting any measured data to the same dimension (i.e. conformed
dimension) using the same units so that they can later be joined. The transformation step also
requires joining data from several sources, generating aggregates, generating surrogate keys,
sorting, deriving new calculated values, and applying advanced validation rules.

Load

During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is often a database. In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them again only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
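As a small T-SQL illustration of this load pattern (a sketch; FactHire stands in for any large target table, and the middle comment is a placeholder for the actual bulk load):

-- Disable constraint checking before a large load, re-validate afterwards.
alter table FactHire nocheck constraint all
go
-- ... perform the bulk insert / INSERT...SELECT load here ...
alter table FactHire with check check constraint all
go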


(iv). Perform various OLAP operations such as slice, dice, roll-up, drill-down and pivot.


Ans: OLAP OPERATIONS

An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to get insight into the information through fast, consistent, and interactive access to it.

OLAP operations in multidimensional data.

Here is the list of OLAP operations:

● Roll-up
● Drill-down
● Slice and dice
● Pivot (rotate)

Roll-up

The following diagram illustrates how roll-up works.


● Roll-up is performed by climbing up a concept hierarchy for the dimension location.


● Initially the concept hierarchy was "street < city < province < country".
● On rolling up, the data is aggregated by ascending the location hierarchy from the level of city
to the level of country.
● The data is grouped into countries rather than cities.
● When roll-up is performed, one or more dimensions from the data cube are removed.
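In SQL terms, roll-up corresponds to aggregating at a coarser GROUP BY level. A sketch over a hypothetical sales table with columns city, country, quarter, [month], item and dollars_sold (this table is an illustration, not part of the warehouse built above):

-- Roll up location from city to country
select country, quarter, sum(dollars_sold) as dollars_sold
from sales
group by country, quarter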

Drill-down

Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:

● By stepping down a concept hierarchy for a dimension

● By introducing a new dimension.


The following diagram illustrates how drill-down works:


● Drill-down is performed by stepping down a concept hierarchy for the dimension time.
● Initially the concept hierarchy was "day < month < quarter < year."
● On drilling down, the time dimension is descended from the level of quarter to the level of
month.
● When drill-down is performed, one or more dimensions from the data cube are added.
● It navigates the data from less detailed data to highly detailed data.
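A matching SQL sketch for drill-down on the same hypothetical sales table, descending the time hierarchy from quarter to month:

-- Drill down time from quarter to month
select city, [month], sum(dollars_sold) as dollars_sold
from sales
group by city, [month]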
Slice

The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Consider the following diagram that shows how slice works.


● Here Slice is performed for the dimension "time" using the criterion time = "Q1".
● It forms a new sub-cube by selecting a single value along one dimension.
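On the hypothetical sales table used earlier, a slice is a selection on a single dimension value:

-- Slice on time: quarter = 'Q1'
select city, item, sum(dollars_sold) as dollars_sold
from sales
where quarter = 'Q1'
group by city, item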


Dice

Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the
following diagram that shows the dice operation.

The dice operation on the cube based on the following selection criteria involves three dimensions.


● (location = "Toronto" or "Vancouver")

● (time = "Q1" or "Q2")

● (item =" Mobile" or "Modem")


Pivot

The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an
alternative presentation of data. Consider the following diagram that shows the pivot operation.
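Pivoting can be sketched in SQL with conditional aggregation on the same hypothetical sales table, rotating quarter values from rows into columns:

-- Rotate quarters into columns
select item,
       sum(case when quarter = 'Q1' then dollars_sold else 0 end) as Q1,
       sum(case when quarter = 'Q2' then dollars_sold else 0 end) as Q2
from sales
group by item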


A.(v). Explore visualization features of the tool for analysis like identifying
trends etc.
Ans: Visualization Features:

WEKA's visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-D) and pairs of attributes (2-D), and rotate 3-D visualizations (Xgobi-style). WEKA has a 'Jitter' option to deal with nominal attributes and to detect 'hidden' data points.

● Access to visualization from the classifier, cluster and attribute selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and, depending on the scheme, further options for visualizing errors, clusters, trees etc.


To open Visualization screen, click ‘Visualize’ tab.

Select a square that corresponds to the attributes you would like to visualize. For example, let's choose 'outlook' for the X axis and 'play' for the Y axis. Click anywhere inside the square that corresponds to 'play' on the left and 'outlook' at the top.


Changing the View:

In the visualization window, beneath the X-axis selector there is a drop-down list,

‘Colour’, for choosing the color scheme. This allows you to choose the color of points based on the
attribute selected. Below the plot area, there is a legend that describes what values the colors
correspond to. In your example, red represents ‘no’, while blue represents ‘yes’. For better visibility
you should change the color of label ‘yes’. Left-click on ‘yes’ in the ‘Class colour’ box and select
lighter color from the color palette.

To the right of the plot area there are series of horizontal strips. Each strip represents an
attribute, and the dots within it show the distribution values of the attribute. You can choose

what axes are used in the main graph by clicking on these strips (left-click changes X-axis, right-click
changes Y-axis).

The software sets X - axis to ‘Outlook’ attribute and Y - axis to ‘Play’. The instances are spread out
in the plot area and concentration points are not visible. Keep sliding ‘Jitter’, a random displacement
given to all points in the plot, to the right, until you can spot concentration points.

The results are shown below, with 'Colour' changed to 'temperature'. Besides 'outlook' and 'play', this allows you to see the 'temperature' corresponding to the 'outlook'. This helps explain the result: if you see 'outlook' = 'sunny' and 'play' = 'no', the 'temperature' may show that it was too hot to play. Change 'Colour' to 'windy' and you can see that if it is windy, you do not want to play as well.

Selecting Instances

Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the 'UserClassifier', which lets you build your own classifier by interactively selecting instances. Below the Y axis there is a drop-down list that allows you to choose a selection method. A group of points on the graph can be selected in four ways [2]:


1. Select Instance. Click on an individual data point. It brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes is shown.

2. Rectangle. You can select several points by dragging a rectangle around them.


3. Polygon. You can select several points by building a free-form polygon. Left-click on the graph to
add vertices to the polygon and right-click to complete it.


4. Polyline. To distinguish the points on one side from the ones on the other, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.



B. Explore WEKA Data Mining/Machine Learning Toolkit

(i) Downloading and/or installation of WEKA data mining toolkit

The operating system used here is Ubuntu.

Step 1: Go to the Applications Menu -> Ubuntu Software Center -> search for Weka and click the Install button.

Fig: Ubuntu Software Center


Fig: Weka software availability search in Ubuntu Software Center


The Weka installation process asks for administrator authentication.

Fig: Authentication window in the installation process

Weka is installed in a few minutes.

Fig: Weka installation status confirmation window

Step 2: Launching the WEKA GUI tool.

The following navigation launches the Weka tool:

Applications Menu -> Science -> weka


Fig: Weka GUI Launching window

(ii). Understand the features of the WEKA toolkit, such as the Explorer, Knowledge Flow interface, Experimenter, and command-line interface.

Introduction:

Weka is a workbench that contains a collection of visualization tools and algorithms for data
analysis and predictive modeling, together with graphical user interfaces for easy access to these
functions.

Advantages of Weka include:

Free availability under the GNU General Public License.

Portability, since it is fully implemented in the Java programming language and thus runs
on almost any modern computing platform.

A comprehensive collection of data preprocessing and modeling techniques.


Ease of use due to its graphical user interfaces.

Description:

Open the program. Once the program has been loaded on the user's machine, it is opened by navigating to the program's start option, which depends on the user's operating system. Figure 1.1 is an example of the initial opening screen on a computer.


There are four options available on this initial screen:

Fig: Weka GUI Applications

1. Explorer - the graphical interface used to conduct experimentation on raw data

After clicking the Explorer button, the Weka Explorer interface appears. Inside the Weka Explorer window there are six tabs:

1. Preprocess- used to choose the data file to be used by the application.

● Open File- allows for the user to select files residing on the local machine or recorded
medium
● Open URL- provides a mechanism to locate a file or data source from a different location
specified by the user
● Open Database- allows the user to retrieve files or data from a database source provided by the user


Fig: Preprocessor Environment

2. Classify- used to test and train different learning schemes on the preprocessed data file under experimentation. There are several options to be selected inside the Classify tab. The Test options give the user the choice of four different test modes on the data set:

1. Use training set

2. Supplied test set

3. Cross validation

4. Percentage split


Fig: Classification of data and its options in Weka

3. Cluster- used to apply different tools that identify clusters within the data file. The Cluster tab
opens the process that is used to identify commonalties or clusters of occurrences within the data
set and produce information for the user to analyze.

Fig: Cluster analysis options using the training data set in the Weka tool

4. Association- used to apply different rules to the data file that identify association within the data.
The associate tab opens a window to select the options for associations within the data set.


Fig: Choosing Apriori data set

5. Select attributes- used to apply different rules to reveal changes based on a selected attribute's inclusion in or exclusion from the experiment

Fig: Selection of Attributes in weka tool

6. Visualize- used to see what the various manipulations produced on the data set in a 2-D format, as scatter plot and bar graph output.


Fig: Visualization in Weka tool

2. Experimenter - this option allows users to conduct different experimental variations on data sets
and perform statistical manipulation. The Weka Experiment Environment enables the user to
create, run, modify, and analyze experiments in a more convenient manner than is possible when
processing the schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyze the results to determine if one of the
schemes is (statistically) better than the other schemes. Results destination: ARFF file, CSV file,
JDBC database.

Experiment type: Cross-validation (default), Train/Test Percentage Split (data randomized).

Iteration control: Number of repetitions, Data sets first/Algorithms first.

Algorithms: filters

3. Knowledge Flow - basically the same functionality as the Explorer, with drag-and-drop functionality. The advantage of this option is that it supports incremental learning from previous results.


4. Simple CLI - provides users without a graphic interface option the ability to execute commands
from a terminal window.
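For example, typing the following into the Simple CLI trains and evaluates J48 on an ARFF file (the dataset path is an assumption; point -t at any ARFF file on your machine):

java weka.classifiers.trees.J48 -t ./data/weather.arff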

(iv). Study the arff file format

An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. In WEKA, each data entry is an instance of the Java class weka.core.Instance, and WEKA loads datasets from ARFF files. The Attribute-Relation File Format has two sections:

1) The Header section defines relation (dataset) name, attribute name, and type.

2) The Data section lists the data instances.


Fig: ARFF file

The figure above shows an ARFF file from the German credit data. Lines beginning with a % sign are comments. There are three basic keywords:

"@relation" in the Header section, followed by the relation name.

"@attribute" in the Header section, followed by attribute names and their types (or sizes).

"@data" in the Data section, followed by the list of data instances.

The external representation of an Instances object consists of:

− A header, describing the attribute types

− A data section: a comma-separated list of data
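A minimal ARFF sketch in this format (the attribute declarations and first rows of the standard Weka weather dataset):

% comment lines start with a % sign
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes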

(v). Explore the available datasets in weka tool.

Click the "Open file..." button to open a data set and double-click on the "data" directory. Weka provides a number of small common machine learning datasets that you can use to practice on.


Fig: Sample data sets provided by Weka tool

(vi). Load a data set (e.g. the weather data set or the credit data set).

(vii). Load a data set and observe the following:


a. List the Attribute names and their types

Title: German Credit data

Attribute Name                              Type of Attribute
Status of existing checking account         qualitative
Duration in month                           numerical
Credit history                              qualitative
Purpose                                     qualitative
Credit amount                               numerical
Savings account/bonds                       qualitative
Present employment since                    qualitative


b. The number of records in the data set is 1000.

c. The class attribute is the last attribute, 'class', whose values are good and bad.

d. Summary

Correctly Classified Instances 855 85.5 %

Incorrectly Classified Instances 145 14.5 %

Total Number of Instances 1000

UNIT - 2

Perform data pre-processing tasks and demonstrate performing association rule mining on data sets

A. Explore various options available in Weka for preprocessing data and apply them (like the Discretize filter, Resample filter, etc.) on each dataset.

The navigation flow for preprocessing a data set without any filter in the Weka tool is as follows:

The WEKA GUI launcher --> Explorer --> Preprocess --> Open file (browse the system for the data set) --> select ALL attributes.


A histogram representing all attributes appears in the visualize area.

Fig: Preprocessing the data without any filter.


The navigation flow for preprocessing a data set with the Discretize filter in the Weka tool is as follows:

The WEKA GUI launcher --> Explorer --> Preprocess --> Open file (browse the system for the data set) --> Choose --> weka --> filters --> supervised --> attribute --> Discretize --> ALL attributes.

A histogram representing credit history appears in the visualize area.


Fig: preprocess with discretization

The navigation flow for preprocessing a data set with the Resample filter in the Weka tool is as follows:

The WEKA GUI launcher --> Explorer --> Preprocess --> Open file (browse the system for the data set) --> Choose --> weka --> filters --> supervised --> instance --> Resample --> select Credit History.

A histogram representing credit history appears in the visualize area.
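The same two filters through the Weka API (a sketch; supervised filters require the class index to be set, and the file name is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;
import weka.filters.supervised.instance.Resample;

public class FilterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("german_credit.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);           // supervised filters need a class

        Discretize disc = new Discretize();                     // supervised discretization
        disc.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, disc);

        Resample res = new Resample();                          // random subsample
        res.setSampleSizePercent(50.0);                         // keep 50% of the instances
        res.setInputFormat(data);
        Instances sampled = Filter.useFilter(data, res);

        System.out.println(discretized.numAttributes() + " attributes, "
                + sampled.numInstances() + " sampled instances");
    }
}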


Fig: Preprocess of data with Resample

B. Load a data set into Weka and run the Apriori algorithm with different support and confidence values. Study the rules generated.

The navigation flow for applying the Apriori algorithm on a data set in the Weka tool is as follows:


The WEKA GUI launcher --> Explorer --> Preprocess --> Open file (browse the system for the data set) --> Associate --> Choose --> weka --> associations --> Apriori --> Start.

The best rules found appear in the Associator output area.
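A matching Java sketch (the dataset name is an assumption; Apriori needs nominal attributes, so discretize numeric data first as in section A above):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("contact-lenses.arff"); // assumed nominal dataset
        Apriori apriori = new Apriori();
        // -C: minimum confidence, -M: lower bound on minimum support, -N: rules to find
        apriori.setOptions(new String[]{"-C", "0.9", "-M", "0.1", "-N", "10"});
        apriori.buildAssociations(data);
        System.out.println(apriori); // the best rules found
    }
}

Lowering -C or -M and re-running shows how weaker confidence/support thresholds admit more (and weaker) rules.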

Fig: Apriori association


Fig: Apriori Association continuation

UNIT-3

Demonstrate performing classification on data sets


A. Load a dataset into Weka and run the J48 classification algorithm. Study the classifier output. Compute entropy values and the Kappa statistic.

The navigation flow for classifying a data set with the J48 classifier in the Weka tool is as follows:

The WEKA GUI launcher --> Explorer --> Preprocess --> Open file (browse the system for the data set) --> Classify --> Choose --> weka --> classifiers --> trees --> J48.
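A minimal Java sketch of the same run via the Weka API (the file name german_credit.arff is an assumption; any ARFF with a nominal class works):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("german_credit.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);           // last attribute is the class
        J48 tree = new J48();                                   // C4.5 decision tree
        tree.buildClassifier(data);                             // train on the full data
        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);                         // test on the training set
        System.out.println(tree);                               // the pruned tree
        System.out.println(eval.toSummaryString());             // accuracy, Kappa, error measures
        System.out.println(eval.toMatrixString());              // confusion matrix
    }
}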


Fig: Classification of a dataset using classifier J48

B. Extract if-then rules from the decision tree generated by the classifier, observe the confusion matrix, and derive accuracy, F-measure, TP rate, FP rate, precision and recall values. Apply the cross-validation strategy with different fold values and compare the accuracy results.
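Continuing from the J48 sketch above, the same measures can be read from the Evaluation API, and cross-validation is one extra call (a hedged fragment; class index 0 denotes the first class value):

// Per-class measures from the Evaluation object built earlier
System.out.println(eval.toClassDetailsString()); // TP rate, FP rate, precision, recall, F-measure
System.out.println("Accuracy:  " + eval.pctCorrect() + " %");
System.out.println("Precision: " + eval.precision(0));
System.out.println("Recall:    " + eval.recall(0));
System.out.println("F-measure: " + eval.fMeasure(0));

// 10-fold cross-validation for comparison
weka.classifiers.Evaluation cv = new weka.classifiers.Evaluation(data);
cv.crossValidateModel(tree, data, 10, new java.util.Random(1));
System.out.println("10-fold CV accuracy: " + cv.pctCorrect() + " %");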

Fig: Classification of a dataset using classifier J48 continuation.


C. Load a dataset into Weka and run the Naive Bayes classification algorithm. Study the classifier output.

The navigation flow for classifying a data set with the Naive Bayes classifier in the Weka tool is as follows: The WEKA GUI launcher --> Explorer --> Preprocess --> Open file (browse the system for the data set) --> Classify --> Choose --> weka --> classifiers --> bayes --> NaiveBayes.
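Only the classifier class changes relative to the J48 sketch above (a hedged fragment):

import weka.classifiers.bayes.NaiveBayes;

NaiveBayes nb = new NaiveBayes(); // replaces J48 in the earlier sketch
nb.buildClassifier(data);         // the same Evaluation calls apply unchanged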

Fig: classification using Naive-bayes classifier


Fig: classification using Naive-bayes classifier continuation..


Fig: classification using Naive-bayes classifier continuation..

D. Plot ROC Curves
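The threshold/ROC data behind these plots can also be computed from the Evaluation object built in the earlier sketches (a hedged fragment using the Weka 3.7+ API; class index 0 picks which class value the curve is for, an assumption to adjust):

import weka.classifiers.evaluation.ThresholdCurve;
import weka.core.Instances;

ThresholdCurve tc = new ThresholdCurve();
Instances curve = tc.getCurve(eval.predictions(), 0); // one point per threshold
double auc = ThresholdCurve.getROCArea(curve);        // area under the ROC curve
System.out.println("AUC = " + auc);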


Fig: Classifier run on german credit dataset


Fig : ROC curve for german credit dataset

Visualize threshold curves


Fig: No credits or All paid


Fig: All paid


Fig: Existing paid


Fig: Delayed previously

Fig: Critical or Other existing credit

E. Compare the classification results of the ID3, J48, Naïve Bayes and k-NN classifiers for each dataset, deduce which classifier performs best and worst for each dataset, and justify.

Ans: Steps to run the ID3 and J48 classification algorithms in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.

6. Select and Click on data option button.


7. Choose iris data set and open file.


8. Click on classify tab and Choose J48 algorithm and select use training set test option
9. Click on start button.
10. Click on classify tab and Choose ID3 algorithm and select use training set test option.
11. Click on start button.
12. Click on classify tab and Choose Naïve-bayes algorithm and select use training set test
option.
13. Click on start button.
14. Click on classify tab and Choose k-nearest neighbor and select use training set test option.
15. Click on start button.
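A sketch comparing several of these classifiers on the training set via the Weka API (the iris.arff path is an assumption; ID3, weka.classifiers.trees.Id3, ships with older Weka releases or as a package, so it is omitted here but can be added the same way if installed):

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // assumed path
        data.setClassIndex(data.numAttributes() - 1);
        Classifier[] models = { new J48(), new NaiveBayes(), new IBk(1) };
        for (Classifier model : models) {
            model.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(model, data); // evaluate on the training set, as below
            System.out.println(model.getClass().getSimpleName()
                    + ": " + eval.pctCorrect() + " % correct");
        }
    }
}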

J48:

=== Run information ===

Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2

Relation: iris

Instances: 150

Attributes: 5

sepallength

sepalwidth


petallength

petalwidth

class

Test mode:evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree

------------------

petalwidth <= 0.6: Iris-setosa (50.0)

petalwidth > 0.6

| petalwidth <= 1.7

| | petallength <= 4.9: Iris-versicolor (48.0/1.0)

| | petallength > 4.9

| | | petalwidth <= 1.5: Iris-virginica (3.0)

| | | petalwidth > 1.5: Iris-versicolor (3.0/1.0)

| petalwidth > 1.7: Iris-virginica (46.0/1.0)


Number of Leaves : 5

Size of the tree : 9

Time taken to build model: 0 seconds

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances 147 98 %

Incorrectly Classified Instances 3 2 %

Kappa statistic 0.97

Mean absolute error 0.0233

Root mean squared error 0.108

Relative absolute error 5.2482 %

Root relative squared error 22.9089 %

Total Number of Instances 150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0 1 1 1 1 Iris-setosa

0.98 0.02 0.961 0.98 0.97 0.99 Iris-versicolor


0.96 0.01 0.98 0.96 0.97 0.99 Iris-virginica

Weighted Avg. 0.98 0.01 0.98 0.98 0.98 0.993

=== Confusion Matrix ===

a b c <-- classified as

50 0 0 | a = Iris-setosa

0 49 1 | b = Iris-versicolor

1 2 48 | c = Iris-virginica

Naïve Bayes:

=== Run information ===

Scheme:weka.classifiers.bayes.NaiveBayes

Relation: iris

Instances: 150

Attributes: 5


sepallength

sepalwidth

petallength

petalwidth

class

Test mode:evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

Class

Attribute Iris-setosa Iris-versicolor Iris-virginica

(0.33) (0.33) (0.33)

===============================================================

sepallength


mean 4.9913 5.9379 6.5795

std. dev. 0.355 0.5042 0.6353

weight sum 50 50 50

precision 0.1059 0.1059 0.1059

sepalwidth

mean 3.4015 2.7687 2.9629

std. dev. 0.3925 0.3038 0.3088

weight sum 50 50 50

precision 0.1091 0.1091 0.1091

petallength

mean 1.4694 4.2452 5.5516

std. dev. 0.1782 0.4712 0.5529

weight sum 50 50 50

precision 0.1405 0.1405 0.1405

petalwidth

mean 0.2743 1.3097 2.0343

std. dev. 0.1096 0.1915 0.2646

weight sum 50 50 50

precision 0.1143 0.1143 0.1143

Time taken to build model: 0 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 144 96 %

Incorrectly Classified Instances 6 4 %

Kappa statistic 0.94

Mean absolute error 0.0324

Root mean squared error 0.1495

Relative absolute error 7.2883 %

Root relative squared error 31.7089 %

Total Number of Instances 150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0 1 1 1 1 Iris-setosa

0.96 0.04 0.923 0.96 0.941 0.993 Iris-versicolor

0.92 0.02 0.958 0.92 0.939 0.993 Iris-virginica

Weighted Avg. 0.96 0.02 0.96 0.96 0.96 0.995

=== Confusion Matrix ===

a b c <-- classified as

50 0 0 | a = Iris-setosa

0 48 2 | b = Iris-versicolor


1 4 46 | c = Iris-virginica

K-Nearest Neighbor (IBK):

=== Run information ===

Scheme:weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A
\"weka.core.EuclideanDistance -R first-last\""

Relation: iris

Instances: 150

Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class

Test mode:evaluate on training data

=== Classifier model (full training set) ===


IB1 instance-based classifier

using 1 nearest neighbour(s) for classification

Time taken to build model: 0 seconds

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances 150 100 %

Incorrectly Classified Instances 0 0 %

Kappa statistic 1

Mean absolute error 0.0085

Root mean squared error 0.0091

Relative absolute error 1.9219 %

Root relative squared error 1.9335 %

Total Number of Instances 150

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0 1 1 1 1 Iris-setosa

1 0 1 1 1 1 Iris-versicolor


1 0 1 1 1 1 Iris-virginica

Weighted Avg. 1 0 1 1 1 1

=== Confusion Matrix ===

  a  b  c   <-- classified as
 50  0  0 |  a = Iris-setosa
  0 50  0 |  b = Iris-versicolor
  0  0 50 |  c = Iris-virginica


Unit – 4

DEMONSTRATE PERFORMING CLUSTERING ON DATA SETS


CLUSTERING TAB
A. Load each dataset into Weka and run simple k-means clustering algorithm with different
values of k(number of desired clusters). Study the clusters formed. Observe the sum of
squared errors and centroids, and derive insights

Ans: Steps to run the k-means clustering algorithm in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.
3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.
6. Select and Click on data option button.
7. Choose iris data set and open file.
8. Click on the Cluster tab, choose SimpleKMeans, and select the 'use training set' test option.
9. Click on start button.
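The same run via the Weka API (a sketch; the iris.arff path is assumed, and no class index is set because Weka clusterers reject data with a class attribute assigned):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // no class index set
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2); // try different values of k and compare the error
        km.setSeed(10);       // matches -S 10 in the run below
        km.buildClusterer(data);
        System.out.println(km); // centroids and within-cluster sum of squared errors
    }
}

Rerunning with setNumClusters(3), setNumClusters(4), and so on shows how the within-cluster sum of squared errors falls as k grows; the 'elbow' of that curve is a common choice for k.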

Output:

=== Run information ===

Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500

-S 10

Relation: iris

Instances: 150

Attributes: 5


sepallength

sepalwidth

petallength

petalwidth

class

Test mode:evaluate on training data

=== Model and evaluation on training set ===

kMeans

======

Number of iterations: 7

Within cluster sum of squared errors: 62.1436882815797


Missing values globally replaced with mean/mode


Cluster centroids:

Cluster#

Attribute Full Data 0 1

(150) (100) (50)

==================================================================

sepallength 5.8433 6.262 5.006

sepalwidth 3.054 2.872 3.418

petallength 3.7587 4.906 1.464

petalwidth 1.1987 1.676 0.244

class Iris-setosa Iris-versicolor Iris-setosa

Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

0 100 ( 67%)

1 50 ( 33%)


B. Explore other clustering techniques available in Weka.

Ans: Besides SimpleKMeans, WEKA provides several other clusterers under weka.clusterers, including EM (expectation maximization), HierarchicalClusterer, Cobweb, FarthestFirst and MakeDensityBasedClusterer; further methods (e.g. DBSCAN) are available as packages.


C. Explore visualization features of weka to visualize the clusters. Derive interesting insights
and explain.

Ans: Visualize Features

WEKA's visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-D) and pairs of attributes (2-D), and rotate 3-D visualizations (Xgobi-style). WEKA has a 'Jitter' option to deal with nominal attributes and to detect 'hidden' data points.

Access to visualization from the classifier, cluster and attribute selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and, depending on the scheme, further options for visualizing errors, clusters, trees etc.

To open the Visualization screen, click the 'Visualize' tab.


Select a square that corresponds to the attributes you would like to visualize. For example, let's choose 'outlook' for the X axis and 'play' for the Y axis. Click anywhere inside the square that corresponds to 'play' on the left and 'outlook' at the top.

Changing the View:

In the visualization window, beneath the X-axis selector there is a drop-down list, 'Colour', for choosing the color scheme. This allows you to choose the color of points based on the attribute selected. Below the plot area, there is a legend that describes what values the colors correspond to. In our example, red represents 'no', while blue represents 'yes'. For better visibility you should change the color of the label 'yes'. Left-click on 'yes' in the 'Class colour' box and select a lighter color from the color palette.


Selecting Instances

Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the 'UserClassifier', which lets you build your own classifier by interactively selecting instances. Below the Y axis there is a drop-down list that allows you to choose a selection method. A group of points on the graph can be selected in four ways [2]:

1. Select Instance. Click on an individual data point. It brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes is shown.

2. Rectangle. You can select several points by dragging a rectangle around them.


3. Polygon. You can select several points by building a free-form polygon. Left-click on the graph
to add vertices to the polygon and right-click to complete it.


4. Polyline. To distinguish the points on one side from the ones on the other, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.

Unit-V

Demonstrate performing regression on data sets.


A. Load each dataset into Weka and build a Linear Regression model using the training set option. Interpret the regression model and derive patterns and conclusions from the regression results.

Ans: Steps to run the Linear Regression algorithm in WEKA

1. Open WEKA Tool.


2. Click on WEKA Explorer.


3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.

6. Select and Click on data option button.

7. Choose labor data set and open file.


8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the LinearRegression leaf and select the 'use training set' test option.
10. Click on the start button.
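The equivalent run through the Weka API (a sketch; labor.arff is an assumed path, and the class is set to the numeric 'duration' attribute, matching the model printed in the output below):

import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LinearRegressionDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labor.arff"); // assumed path
        data.setClassIndex(0);                          // 'duration' as the numeric target
        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);                       // fits the coefficients
        System.out.println(lr);                         // prints the regression equation
    }
}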

Output:

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation: labor-neg-data

Instances: 57

Attributes: 17

duration

wage-increase-first-year


wage-increase-second-year

wage-increase-third-year

cost-of-living-adjustment

working-hours

pension

standby-pay

shift-differential

education-allowance

statutory-holidays

vacation

longterm-disability-assistance

contribution-to-dental-plan


bereavement-assistance

contribution-to-health-plan

class

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model


B. Use the cross-validation and percentage split options and repeat running the Linear Regression model. Observe the results and derive meaningful conclusions.

Ans: Steps to run the Linear Regression algorithm in WEKA

1. Open WEKA Tool.

2. Click on WEKA Explorer.


3. Click on Preprocessing tab button.
4. Click on open file button.
5. Choose WEKA folder in C drive.

6. Select and Click on data option button.

7. Choose labor data set and open file.


8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the LinearRegression leaf and select the cross-validation test option.
10. Click on the start button.
11. Select the LinearRegression leaf and select the percentage split test option.
12. Click on the start button.
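Both evaluation modes through the Weka API, continuing from the lr and data objects in the previous sketch (a hedged fragment):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.core.Instances;

// 10-fold cross-validation
Evaluation cv = new Evaluation(data);
cv.crossValidateModel(lr, data, 10, new Random(1));
System.out.println(cv.toSummaryString());

// 66% percentage split: randomize, train on the first 66%, test on the rest
Instances rand = new Instances(data);
rand.randomize(new Random(1));
int trainSize = (int) Math.round(rand.numInstances() * 0.66);
Instances train = new Instances(rand, 0, trainSize);
Instances test  = new Instances(rand, trainSize, rand.numInstances() - trainSize);
lr.buildClassifier(train);
Evaluation split = new Evaluation(train);
split.evaluateModel(lr, test);
System.out.println(split.toSummaryString());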


Output: cross-validation

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation: labor-neg-data

Instances: 57

Attributes: 17

duration

wage-increase-first-year


wage-increase-second-year

wage-increase-third-year

cost-of-living-adjustment

working-hours

pension

standby-pay

shift-differential

education-allowance

statutory-holidays

vacation

longterm-disability-assistance

contribution-to-dental-plan


bereavement-assistance

contribution-to-health-plan

class

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Linear Regression Model

duration =

0.4689 * cost-of-living-adjustment=tc,tcf +

0.6523 * pension=none,empl_contr +

1.0321 * bereavement-assistance=yes +

0.3904 * contribution-to-health-plan=full +


0.2765

Time taken to build model: 0.02 seconds

=== Cross-validation ===

=== Summary ===

Correlation coefficient 0.1967

Mean absolute error 0.6499

Root mean squared error 0.777

Relative absolute error 111.6598 %

Root relative squared error 108.8152 %

Total Number of Instances 56

Ignored Class Unknown Instances 1


Output: percentage split

=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8

Relation: labor-neg-data

Instances: 57

Attributes: 17

duration

wage-increase-first-year


wage-increase-second-year

wage-increase-third-year

cost-of-living-adjustment

working-hours

pension

standby-pay

shift-differential

education-allowance

statutory-holidays

vacation

longterm-disability-assistance

contribution-to-dental-plan


bereavement-assistance

contribution-to-health-plan

class

Test mode: split 66.0% train, remainder test

=== Classifier model (full training set) ===

Linear Regression Model

duration =

0.4689 * cost-of-living-adjustment=tc,tcf +

0.6523 * pension=none,empl_contr +

1.0321 * bereavement-assistance=yes +

0.3904 * contribution-to-health-plan=full +

0.2765

Time taken to build model: 0.02 seconds

=== Evaluation on test split ===

=== Summary ===

Correlation coefficient 0.243

Mean absolute error 0.783


Root mean squared error 0.9496

Relative absolute error 106.8823 %

Root relative squared error 114.13 %

Total Number of Instances 19

C. Explore the Simple Linear Regression technique, which only looks at one variable.

Ans: Steps to run the Simple Linear Regression algorithm in WEKA

1. Open WEKA Tool.

2. Click on WEKA Explorer.

3. Click on Preprocessing tab button.

4. Click on open file button.

5. Choose WEKA folder in C drive.

6. Select and Click on data option button.

7. Choose labor data set and open file.


8. Click on the Classify tab, click the Choose button, then expand the functions branch.
9. Select the SimpleLinearRegression leaf and select the cross-validation test option.
10. Click on the start button.


Data Mining Lab


CREDIT RISK ASSESSMENT

● The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. We have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible; interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans; too many bad loans could lead to the collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.

● Credit risk is an investor's risk of loss arising from a borrower who does not make
payments as promised. Such an event is called a default. Other terms for credit risk are
default risk and counterparty risk.
● Credit risk is most simply defined as the potential that a bank borrower or counterparty
will fail to meet its obligations in accordance with agreed terms.
● The goal of credit risk management is to maximize a bank's risk-adjusted rate of return
by maintaining credit risk exposure within acceptable parameters.
● Banks need to manage the credit risk inherent in the entire portfolio as well as the risk in
individual credits or transactions.
● Banks should also consider the relationships between credit risk and other risks.
● The effective management of credit risk is a critical component of a comprehensive
approach to risk management and essential to the long-term success of any banking
organization.
● A good credit assessment means you should be able to qualify, within the limits of your
income, for most loans.

The following tasks use the German Credit Data.

Week 1
1. List all the categorical (or nominal) attributes and the real-valued attributes
separately.
From the German Credit Assessment case study given to us, the following attributes are found to be applicable for credit-risk assessment. They include categorical or nominal attributes (which take true/false and similar discrete values) and real-valued attributes:
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors


11. residence_since
12. property
13. installment plans
14. housing
15. existing credits
16. job
17. num_dependents
18. telephone
19. foreign worker

Week 2
2. What attributes do you think might be crucial in making the credit assessment? Come
up with some simple rules in plain English using your selected attributes.

In my view, the following attributes may be crucial in making the credit risk assessment.
1. Credit_history
2. Employment
3. Property_magnitude
4. Job
5. Duration
6. Credit_amount
7. Installment
8. Existing credit

Based on the above attributes, we can make a decision whether to give credit or not.
checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good

checking_status = no checking AND existing_credits <= 1 AND other_payment_plans = none AND purpose = radio/tv: good

checking_status = no checking AND foreign_worker = yes AND employment = 4<=X<7: good

foreign_worker = no AND personal_status = male single: good

checking_status = no checking AND purpose = used car AND other_payment_plans = none: good

duration <= 15 AND other_parties = guarantor: good


duration <= 11 AND credit_history = critical/other existing credit: good

checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = car: good

checking_status = no checking AND property_magnitude = real estate AND other_payment_plans = none AND age > 23: good

savings_status = >=1000 AND property_magnitude = real estate: good

savings_status = 500<=X<1000 AND employment = >=7: good

credit_history = no credits/all paid AND housing = rent: bad

savings_status = no known savings AND checking_status = 0<=X<200 AND existing_credits > 1: good

checking_status = >=200 AND num_dependents <= 1 AND property_magnitude = life insurance: good

installment_commitment <= 2 AND other_parties = co applicant AND existing_credits > 1: bad

installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits > 1 AND residence_since > 1: good

installment_commitment <= 2 AND credit_history = delayed previously AND existing_credits <= 1: good

duration > 30 AND savings_status = 100<=X<500: bad

credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad

duration > 30 AND savings_status = no known savings AND num_dependents > 1: good

duration > 30 AND credit_history = delayed previously: bad

duration > 42 AND savings_status = <100 AND residence_since > 1: bad


Week 3
3. One type of model that you can create is a Decision Tree - train a Decision Tree using
the complete dataset as the training data. Report the model obtained after training.

A decision tree is a flowchart-like tree structure where each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label.
Decision trees can be easily converted into classification rules.
Examples of decision tree algorithms are ID3, C4.5 and CART.

J48 pruned tree

● Using WEKA Tool, we can generate a decision tree by selecting the “classify tab”.
● In classify tab select choose option where a list of different decision trees are available.
From that list select J48.
● Now under test option, select training data test option.
● The resulting window in WEKA is as follows:
● To generate the decision tree, right click on the result list and select visualize tree option
by which the decision tree will be generated


● The obtained decision tree for credit risk assessment is too large to fit on the screen.

The decision tree below is unclear due to the large number of attributes.

Week 4

4. Suppose you use your above model trained on the complete dataset, and classify
credit good/bad for each of the examples in the dataset. What % of examples can you
classify correctly? (This is also called testing on the training set) Why do you think you
cannot get 100 % training accuracy?

In the above model we trained on the complete dataset and classified credit as good/bad for each of the examples in the dataset.

For example:

IF purpose = vacation THEN credit = bad;

ELSE IF purpose = business THEN credit = good;

In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly, and the remaining 14.5% of the examples were classified incorrectly. We cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes are also analyzed and trained on. This affects the accuracy, and hence we cannot reach 100% training accuracy.

Week 5
5. Is testing on the training set as you did above a good idea? Why not?

It is a bad idea to put all the data into the training set: there is then no independent data left to test whether the classification is correct.

According to the usual rule of thumb, for maximum accuracy we should take 2/3 of the dataset as the training set and the remaining 1/3 as the test set. In the model above we instead took the complete dataset as the training set, which yields only 85.5% accuracy, because unnecessary attributes that play no crucial role in credit risk assessment are also analyzed and trained on. This increases complexity and ultimately lowers accuracy. If part of the dataset is used as the training set and the remainder as the test set, the results are more reliable and computation time is lower. This is why we prefer not to take the complete dataset as the training set. 'Use training set' result for the German Credit Data:

Correctly Classified Instances      855      85.5 %
Incorrectly Classified Instances    145      14.5 %
Kappa statistic                       0.6251
Mean absolute error                   0.2312
Root mean squared error               0.34
Relative absolute error              55.0377 %
Root relative squared error          74.2015 %
Total Number of Instances          1000

Week 6
6. One approach for solving the problem encountered in the previous question is to use cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?

Cross-validation:

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds" D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set in order to obtain the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.

1. Select the Classify tab and the J48 decision tree; under Test options select the Cross-validation radio button and set the number of folds to 10.
2. The number of folds indicates the number of partitions of the dataset.
3. A Kappa statistic close to 1 indicates close to 100% accuracy, with all errors zeroed out; in reality, however, no training set gives 100% accuracy.
4. Cross-validation result at folds = 10 for the German credit data:

Correctly Classified Instances      705      70.5 %
Incorrectly Classified Instances    295      29.5 %
Kappa statistic                       0.2467
Mean absolute error                   0.3467
Root mean squared error               0.4796
Relative absolute error              82.5233 %
Root relative squared error         104.6565 %
Total Number of Instances          1000

Here there are 1000 instances with 100 instances per partition.
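The 10-fold run can be reproduced programmatically as sketched below (same file-name and classpath assumptions as earlier; the random seed 1 matches WEKA's GUI default):

import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CrossValidateJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("credit-g.arff"));
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation: J48 is trained 10 times, each time
        // tested on the one fold that was held out of training.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println("Kappa: " + eval.kappa());
    }
}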
Cross-validation result at folds = 20 for the German credit data:

Correctly Classified Instances      698      69.8 %
Incorrectly Classified Instances    302      30.2 %
Kappa statistic                       0.2264
Mean absolute error                   0.3571
Root mean squared error               0.4883
Relative absolute error              85.0006 %
Root relative squared error         106.5538 %
Total Number of Instances          1000
Cross-validation result at folds = 50 for the German credit data:

Correctly Classified Instances      710      71   %
Incorrectly Classified Instances    290      29   %
Kappa statistic                       0.2587
Mean absolute error                   0.3444
Root mean squared error               0.4771
Relative absolute error              81.959  %
Root relative squared error         104.1164 %
Total Number of Instances          1000

Cross-validation result at folds = 100 for the German credit data:

Correctly Classified Instances      710      71   %
Incorrectly Classified Instances    290      29   %
Kappa statistic                       0.2587
Mean absolute error                   0.3444
Root mean squared error               0.477
Relative absolute error              81.959  %
Root relative squared error         104.1164 %
Total Number of Instances          1000

The percentage-split option does not allow 100%; it allows only up to 99.9%.

● Percentage split result at 50%:

● Correctly Classified Instances 362 72.4 %
● Incorrectly Classified Instances 138 27.6 %
● Kappa statistic 0.2725
● Mean absolute error 0.3225
● Root mean squared error 0.4764
● Relative absolute error 76.3523 %
● Root relative squared error 106.4373 %
● Total Number of Instances 500

● Percentage split result at 99.9%:

● Correctly Classified Instances 0 0 %
● Incorrectly Classified Instances 1 100 %
● Kappa statistic 0
● Mean absolute error 0.6667
● Root mean squared error 0.6667
● Relative absolute error 221.7054 %
● Root relative squared error 221.7054 %
● Total Number of Instances 1

Week 7
7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.

Removing these attributes increases the accuracy, because the two attributes "foreign worker" and "personal status" are not very important for training and analysis. By removing them, the training time is reduced to some extent, and the accuracy therefore increases. The decision tree created from the full dataset is very large compared to the decision tree trained now; this is the main difference between the two decision trees.

After foreign worker is removed, the accuracy is increased to 85.9%

If we also remove the 9th attribute, the accuracy further increases to 86.6%, which shows that these two attributes are not significant for training.
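The removal experiment can also be scripted with WEKA's Remove filter; a sketch follows (attribute indices are 1-based, as in the GUI; the file name is assumed):

import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class BiasCheck {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("credit-g.arff"));

        // Drop attribute 9 (personal_status) and 20 (foreign_worker).
        Remove remove = new Remove();
        remove.setAttributeIndices("9,20");
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}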

Cross-validation after removing the 9th attribute.

Percentage split after removing the 9th attribute.

After removing the 20th attribute, the cross-validation is as shown above.

After removing the 20th attribute, the percentage split is as shown above.

Week 8
8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes back before you start selecting the ones you want.)

Select attributes 2, 3, 5, 7, 10, 17 and 21, click Invert to highlight all remaining attributes, and remove them.

Here the accuracy decreases.

Select random combinations of attributes and then check the accuracy.

After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the left-over attributes and visualize them.

After removing these 14 attributes, the accuracy decreases to 76.4%; hence we can further try random combinations of attributes to increase the accuracy.
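A sketch of the subset experiment with the WEKA Java API; setInvertSelection turns the Remove filter into "keep only these", mirroring the Invert button in the GUI (file name assumed as before):

import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class AttributeSubset {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("credit-g.arff"));

        // Keep only attributes 2,3,5,7,10,17 and the class attribute 21.
        Remove keep = new Remove();
        keep.setAttributeIndices("2,3,5,7,10,17,21");
        keep.setInvertSelection(true);    // remove everything NOT listed
        keep.setInputFormat(data);
        Instances subset = Filter.useFilter(data, keep);
        subset.setClassIndex(subset.numAttributes() - 1);

        Evaluation eval = new Evaluation(subset);
        eval.crossValidateModel(new J48(), subset, 10, new Random(1));
        System.out.println(eval.pctCorrect() + " % correct");
    }
}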


Week 9
9. Sometimes the cost of rejecting an applicant who actually has good credit (case 1) might be higher than accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say, cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA.

Train your Decision Tree again and report the Decision Tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?

In problem 6 we used equal costs when training the decision tree. Here we consider two cases with different costs: cost 5 in case 1 and cost 2 in case 2. When we assign these costs and train the decision tree again, we observe that the resulting tree is almost identical to the one obtained in problem 6; what changes are the cost figures:

                 Case 1 (cost 5)   Case 2 (cost 2)
Total Cost            3820              1705
Average Cost             3.82              1.705

We do not find this cost factor in problem 6, since equal costs were used there. This is the major difference between the results of problem 6 and problem 9.

The cost matrices used here are:

Case 1:   5 1
          1 5

Case 2:   2 1
          1 2
1. Select the Classify tab. Under Test options click More options, tick Cost-sensitive evaluation and click Set. Set the number of classes to 2, click Resize, and the cost matrix appears. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
2. The confusion matrix is then generated, and you can find out the difference between the good and bad classes.
3. Check whether the accuracy changes.
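The cost matrix can also be applied in code through CostSensitiveClassifier; a sketch for case 1 follows (which row of the matrix corresponds to the "good" class depends on the class order in the ARFF file, so verify it on your copy of the data):

import java.io.FileReader;
import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CostSensitiveJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("credit-g.arff"));
        data.setClassIndex(data.numAttributes() - 1);

        // 2x2 cost matrix: misclassifying an actually-good applicant
        // as bad costs 5, the reverse mistake costs 1.
        CostMatrix costs = new CostMatrix(2);
        costs.setElement(0, 1, 5.0);
        costs.setElement(1, 0, 1.0);

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);

        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(csc, data, 10, new Random(1));
        System.out.println("Total cost:   " + eval.totalCost());
        System.out.println("Average cost: " + eval.avgCost());
    }
}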

Week 10

10. Do you think it is a good idea to prefer simple decision trees instead of having long
complex decision trees? How does the complexity of a Decision Tree relate to the bias of the
model?

When we consider long, complex decision trees, the tree contains many unnecessary attributes, which increases the bias of the model. Because of this, the accuracy of the model can also be affected.
This problem can be reduced by using a simple decision tree: the attributes are fewer, the bias of the model decreases, and the result is therefore more accurate.

So it is a good idea to prefer simple decision trees instead of long complex trees.

1. Open any existing ARFF file, e.g. labour.arff.
2. In the Preprocess tab, click All to select all the attributes.
3. Go to the Classify tab and run the J48 algorithm with the "Use training set" option.
4. To generate the decision tree, right-click on the result list entry and select the Visualize tree option, by which the decision tree is displayed.

5. Right-click on the J48 algorithm to open the Generic Object Editor window.
6. In this window, set the unpruned option to True.
7. Then press OK and Start. We find that the tree becomes more complex if it is not pruned.

Visualize tree


8. The tree has become more complex.
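The size difference can also be measured directly; below is a sketch using J48's tree-size measures (the file name labour.arff follows the step above and is assumed to be available):

import java.io.FileReader;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class PrunedVsUnpruned {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("labour.arff"));
        data.setClassIndex(data.numAttributes() - 1);

        J48 pruned = new J48();         // default: pruning enabled
        pruned.buildClassifier(data);

        J48 unpruned = new J48();
        unpruned.setUnpruned(true);     // same as unpruned = True in the editor
        unpruned.buildClassifier(data);

        // The unpruned tree is expected to have more nodes and leaves.
        System.out.println("Pruned tree size:   " + pruned.measureTreeSize());
        System.out.println("Unpruned tree size: " + unpruned.measureTreeSize());
    }
}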

Week 11

11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced Error Pruning - explain this idea briefly. Try reduced error pruning for training your Decision Trees using cross-validation (you can do this in WEKA) and report the Decision Tree you obtain. Also, report your accuracy using the pruned model. Does your accuracy increase?

Reduced-error pruning:

The idea of using a separate pruning set for pruning, which is applicable to decision trees as well as rule sets, is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual tests. However, this method is much slower.

Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from the other classes if it were the only rule in the theory, operating under the closed-world assumption. Suppose the rule gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether. The instances that the rule does not cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and N = T - P is the total number of negative instances. Thus the rule has an overall success ratio of

    [p + (N - n)] / T,

and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.

1. Right-click on the J48 algorithm to open the Generic Object Editor window.
2. In this window, set the reducedErrorPruning option to True. (Leave unpruned as False; WEKA does not allow unpruned and reduced-error pruning to be selected at the same time.)
3. Then press OK and Start.
4. We find that the accuracy increases when the reduced-error pruning option is selected.
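The same setting in code, as a sketch (credit-g.arff assumed, as before):

import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class ReducedErrorPruningJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("credit-g.arff"));
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        // Hold part of the training data out as a separate pruning set.
        tree.setReducedErrorPruning(true);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}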

Reduced error pruning is set to True

Week 12
12. (Extra Credit): How can you convert a Decision Tree into "if-then-else" rules? Make up your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist classifiers that output the model in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.

In WEKA, rules.PART is one of the classifiers that converts decision trees into "IF-THEN-ELSE" rules. Converting decision trees into "IF-THEN-ELSE" rules using the rules.PART classifier gives the following PART decision list (on the weather data):

● outlook = overcast: yes (4.0)
● windy = TRUE: no (4.0/1.0)
● outlook = sunny: no (3.0/1.0)
● : yes (3.0)
● Number of Rules: 4

Yes, sometimes just one attribute can be good enough to make the decision. In this dataset (weather), the single attribute for making the decision is "outlook":

outlook:
● sunny -> no
● overcast -> yes
● rainy -> yes
(10/14 instances correct)

With respect to time, the OneR classifier ranks first, J48 second and PART third:

            J48    PART   OneR
TIME (sec)  0.12   0.14   0.04
RANK        II     III    I

But if you consider accuracy, the J48 classifier ranks first, PART second and OneR last:

              J48    PART   OneR
ACCURACY (%)  70.5   70.2   66.8
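Both rule learners can be trained in a few lines; a sketch on the weather data follows (file name weather.nominal.arff, as used in the procedure below):

import java.io.FileReader;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.core.Instances;

public class RuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("weather.nominal.arff"));
        data.setClassIndex(data.numAttributes() - 1);

        PART part = new PART();      // rules built from partial C4.5 trees
        part.buildClassifier(data);
        System.out.println(part);    // prints the PART decision list

        OneR oneR = new OneR();      // a single rule on a single attribute
        oneR.buildClassifier(data);
        System.out.println(oneR);    // e.g. the rule on "outlook"
    }
}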

1. Open the existing file weather.nominal.arff.
2. Select All.
3. Go to the Classify tab.
4. Click Start.

Here the accuracy is 100%.

The tree corresponds to the following "if-then-else" rules:

If outlook = overcast then
    play = yes

If outlook = sunny and humidity = high then
    play = no
else
    play = yes

If outlook = rainy and windy = true then
    play = no
else
    play = yes

To extract the rules:

1. Go to Choose, then click on Rules and select PART.

2. Click Start.

3. Proceed similarly for the OneR algorithm.

If outlook = overcast then
    play = yes

If outlook = sunny and humidity = high then
    play = no

If outlook = sunny and humidity = low then
    play = yes

Additional Programs
1. Perform cluster analysis on German credit data set using partition clustering algorithm
2. Perform cluster analysis on German credit data set using EM clustering algorithm
Additional program-1
Aim: Perform cluster analysis on German credit data set using partition clustering algorithm
Recommended Hardware / Software Requirements:
• Hardware Requirements: Intel Based desktop PC with minimum of 166 MHZ or faster
processor with at least 64 MB RAM and 100 MB free disk space.
• Weka
Pseudo code
In pseudo code, the general algorithm for k-means clustering algorithm is:
1. Place K points into the space represented by the objects that are being clustered. These
points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of
the objects into groups from which the metric to be minimized can be calculated.
Procedure: In the WEKA GUI Explorer, select the Cluster tab and choose SimpleKMeans. Under Cluster mode select "Use training set". Click Start.
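The same clustering run can be scripted with the WEKA Java API; a sketch follows, with the class attribute removed first to match the Remove-R21 filter shown in the run information below (file name assumed as before):

import java.io.FileReader;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansCredit {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("credit-g.arff"));

        // Clustering ignores the class label, so drop attribute 21 first.
        Remove remove = new Remove();
        remove.setAttributeIndices("21");
        remove.setInputFormat(data);
        Instances input = Filter.useFilter(data, remove);

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(3);      // -N 3, as in the run information
        kMeans.setSeed(10);            // -S 10
        kMeans.buildClusterer(input);
        System.out.println(kMeans);    // centroids and cluster sizes
    }
}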
Output: cluster analysis on k-means clustering algorithm
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R
first-last" -I 500 -O -S 10
Relation: german_credit-weka.filters.unsupervised.attribute.Remove-R21
Instances: 1000
Attributes: 20
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since

property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 8
Within cluster sum of squared errors: 5145.269062855846
Missing values globally replaced with mean/mode
Cluster centroids:
                               Cluster#
Attribute             Full Data          0          1          2
                         (1000)      (484)      (190)      (326)
=================================================================
checking_status no checking no checking <0 0<=X<200
duration 20.903 20.7314 26.0526 18.1564
credit_history existing paid existing paid existing paid existing paid
purpose radio/tv new car used car radio/tv
credit_amount 3271.258 3293.1281 4844.6474 2321.7822
savings_status <100 <100 <100 <100
employment 1<=X<4 1<=X<4 >=7 >=7
installment_commitment 2.973 2.8822 3.0579 3.0583
personal_status male single male single male single male single
other_parties none none none none
residence_since 2.845 2.4483 3.5211 3.0399
property_magnitude car car no known property real estate
age 35.546 33.155 41.0526 35.8865
other_payment_plans none none none none
housing own own for free own
existing_credits 1.407 1.3967 1.4474 1.3988

job skilled skilled skilled skilled


num_dependents 1.155 1.155 1.2474 1.1012
own_telephone none none yes none
foreign_worker yes yes yes yes
Time taken to build model (full training data) : 0.07 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 484 ( 48%)
1 190 ( 19%)
2 326 ( 33%)

Additional program-2
Aim: Perform cluster analysis on the German credit data set using the EM clustering algorithm
Recommended Hardware / Software Requirements:
• Hardware Requirements: Intel based desktop PC with a minimum of 166 MHZ or faster processor with at least 64 MB RAM and 100 MB free disk space.
• Weka
Pseudo code:
In pseudo code, the general EM clustering algorithm is:
1. Initialize the cluster parameters (for example, randomly).
2. E-step: for each instance, compute the probability of its membership in each cluster given the current parameters.
3. M-step: re-estimate the cluster parameters so as to maximize the likelihood, using these membership probabilities as weights.
4. Repeat steps 2 and 3 until the log likelihood converges.
Procedure: In the WEKA GUI Explorer, select the Cluster tab and choose EM. Under Cluster mode select "Use training set". Click Start.
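A sketch of the same EM run via the Java API (the options mirror the -I 100 -N -1 scheme shown in the run information below; file name assumed as before):

import java.io.FileReader;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class EMCredit {
    public static void main(String[] args) throws Exception {
        Instances data = new Instances(new FileReader("credit-g.arff"));

        // Drop the class attribute (21) before clustering.
        Remove remove = new Remove();
        remove.setAttributeIndices("21");
        remove.setInputFormat(data);
        Instances input = Filter.useFilter(data, remove);

        EM em = new EM();
        em.setMaxIterations(100);   // -I 100
        em.setNumClusters(-1);      // -N -1: choose k by cross-validation
        em.buildClusterer(input);
        System.out.println(em);     // distributions, priors, log likelihood
    }
}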
Output:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: german_credit-weka.filters.unsupervised.attribute.Remove-R21
Instances: 1000
Attributes: 20
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since
property_magnitude

age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
Test mode: evaluate on training data
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross validation: 4
Cluster
Attribute 0 1 2 3
(0.26) (0.26) (0.2) (0.29)
=========================================================================
checking_status
<0 100.8097 58.5666 51.7958 66.8279
0<=X<200 69.3481 63.6477 34.9535 105.0507
>=200 17.6736 20.0978 11.9012 17.3274
no checking 73.012 119.2995 101.8966 103.7918
[total] 260.8434 261.6116 200.5471 292.9978
duration
mean 17.7484 14.3572 23.4112 27.8358
std. dev. 8.0841 7.1757 12.1018 14.1317
credit_history
no credits/all paid 10.1705 6.0326 8.4795 19.3174
all paid 17.9296 11.0899 9.6553 14.3252
existing paid 175.3951 142.1934 53.3962 163.0153
delayed previously 10.1938 18.0432 24.9273 38.8357
critical/other existing credit 48.1544 85.2526 105.0888 58.5041
[total] 261.8434 262.6116 201.5471 293.9978
purpose
new car 57.7025 76.7946 47.734 55.7689
used car 14.504 7.9487 40.7163 43.831
furniture/equipment 95.3943 25.2704 24.1583 40.1769
radio/tv 53.3828 106.3023 48.3866 75.9283
domestic appliance 7.9495 3.4917 1.161 3.3979

repairs 5.5771 9.5832 6.9408 3.8988


education 9.921 10.7236 11.9789 21.3766
vacation 1 1 1 1
retraining 4.7356 4.1209 2.311 1.8324
business 16.6708 22.302 19.5059 42.5213
other 1.0059 1.0743 3.6542 10.2656
[total] 267.8434 268.6116 207.5471 299.9978
credit_amount
mean 2288.8498 1812.2911 3638.3737 5195.2049
std. dev. 1342.8531 995.7303 2694.223 3683.9507
savings_status
<100 170.6648 165.5967 96.2641 174.4744
100<=X<500 26.3033 25.4915 18.3092 36.8959
500<=X<1000 15.6275 21.5273 15.5765 14.2688
>=1000 12.2318 18.448 12.513 8.8072
no known savings 37.0161 31.5481 58.8844 59.5515
[total] 261.8434 262.6116 201.5471 293.9978
employment
unemployed 14.0219 3.1801 16.0683 32.7298
<1 90.51 34.2062 8.4379 42.846
1<=X<4 84.9242 128.879 27.7645 101.4323
4<=X<7 50.6437 42.1897 31.3087 53.858
>=7 21.7437 54.1567 117.9679 63.1317
[total] 261.8434 262.6116 201.5471 293.9978
installment_commitment
mean 2.8557 3.0212 3.312 2.8038
std. dev. 1.1596 1.1124 0.9515 1.1363
personal_status
male div/sep 15.737 9.9518 4.6205 23.6907
female div/dep/mar 151.4625 48.4321 18.2787 95.8267
male single 67.3068 159.5075 172.5861 152.5996
male mar/wid 26.3371 43.7203 5.0618 20.8808
female single 1 1 1 1
[total] 261.8434 262.6116 201.5471 293.9978
other_parties
none 235.863 218.7895 186.4245 269.923
co applicant 12.5526 10.6977 6.9588 14.7909
guarantor 11.4278 31.1244 6.1638 7.2839
[total] 259.8434 260.6116 199.5471 291.9978
residence_since

mean 2.6862 2.5399 3.5434 2.7831


std. dev. 1.1732 1.0186 0.7654 1.1061
property_magnitude
real estate 69.0217 148.9943 30.8391 37.1449
life insurance 81.2718 54.4192 41.9034 58.4056
car 95.7773 51.1875 60.6462 128.389
no known property 14.7725 7.0107 67.1584 69.0583
[total] 260.8434 261.6116 200.5471 292.9978
age
mean 27.7345 36.1057 43.8079 36.3705
std. dev. 5.7953 10.3158 11.3129 11.5738
other_payment_plans
bank 34.4988 32.0758 33.984 42.4414
stores 10.9742 12.5287 10.4947 17.0024
none 214.3704 216.0071 155.0685 232.554
[total] 259.8434 260.6116 199.5471 291.9978
housing
rent 85.8549 31.7206 15.9015 49.523
own 168.499 226.2291 124.0089 198.2629
for free 5.4895 2.6619 59.6367 44.2118
[total] 259.8434 260.6116 199.5471 291.9978
existing_credits
mean 1.213 1.4137 1.7961 1.3088
std. dev. 0.4142 0.5377 0.7406 0.4734
job
unemp/unskilled non res 11.7711 2.5192 6.8364 4.8733
unskilled resident 52.9713 105.4029 24.5489 21.0769
skilled 188.0096 147.8359 128.9987 169.1558
high qualif/self emp/mgmt 8.0914 5.8537 40.1631 97.8918
[total] 260.8434 261.6116 200.5471 292.9978
num_dependents
mean 1 1.2978 1.3983 1
std. dev. 0.3621 0.4573 0.4895 0.3621
own_telephone
none 219.2961 215.7304 81.1575 83.816
yes 39.5473 43.8813 117.3896 207.1818
[total] 258.8434 259.6116 198.5471 290.9978
foreign_worker
yes 248.5954 234.0215 197.4796 286.9034

no 10.248 25.5901 1.0675 4.0944


[total] 258.8434 259.6116 198.5471 290.9978
Time taken to build model (full training data) : 22.43 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 279 ( 28%)
1 279 ( 28%)
2 194 ( 19%)
3 248 ( 25%)
Log likelihood: -33.06046
