DWDM LAB Final Manual
We create three dimension tables (DimDate, DimCustomer and DimVan) and one fact table (FactHire) in the data warehouse. We populate the three dimension tables but leave the fact table empty.
The following script is used to create and populate the dimension and fact tables.
use TopHireDW
go

-- DimDate: variables used by the loop that populates the date dimension
declare @i int, @Date date, @StartDate date, @EndDate date, @DateKey int,
        @DateString varchar(10), @Year varchar(4), @Month varchar(7), @Date1 varchar(20)
-- (loop body omitted in this excerpt)
end
go

-- DimCustomer
create table DimCustomer
( CustomerKey int not null identity(1,1) primary key, CustomerId varchar(20) not null,
  CustomerName varchar(30), DateOfBirth date, Town varchar(50), TelephoneNo varchar(30),
  DrivingLicenceNo varchar(30), Occupation varchar(30) )
go

-- DimVan
insert into DimVan (RegNo, Make, Model, [Year], Colour, CC, Class)
-- (values omitted in this excerpt)
go

-- FactHire (only the dimension-key columns are shown in this excerpt)
HireDateKey int not null, CustomerKey int not null, VanKey int not null, --Dimension Keys
go
Schema Definition
A multidimensional schema is defined using the Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining data warehouses and data marts.
Star Schema
• The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
• There is a fact table at the center. It contains the keys to each of four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.
Snowflake Schema
● Some dimension tables in the snowflake schema are normalized.
● The normalization splits up the data into additional tables.
● Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
● Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
● The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Fact Constellation Schema
● A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars sold and units sold.
• It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.
(iii). Write ETL scripts and implement using data warehouse tools
ETL comes from data warehousing and stands for Extract-Transform-Load. ETL covers the process by which data are loaded from the source system into the data warehouse.
Extraction–transformation–loading (ETL) tools are pieces of software responsible for the
extraction of data from several sources, its cleansing, customization, reformatting, integration,
and insertion into a data warehouse.
Building the ETL process is potentially one of the biggest tasks of building a warehouse; it is complex and time consuming, and it consumes most of a data warehouse project's implementation effort, cost, and resources.
Building a data warehouse requires focusing closely on understanding three main areas:
1. Source area - The source area has standard models, such as the entity-relationship diagram.
2. Destination area - The destination area has standard models, such as the star schema.
3. Mapping area - The mapping area, however, does not yet have a standard model.
Abbreviations
● ETL-extraction–transformation–loading
● DW-data warehouse
● DS-data sources
ETL Process:
Extract
The Extract step covers the data extraction from the source system and makes it
accessible for further processing. The main objective of the extract step is to retrieve all the
required data from the source system with as little resources as possible. The extract step should
be designed in a way that it does not negatively affect the source system in terms or performance,
response time or any kind of locking.
Transform
The transform step applies a set of rules to transform the data from the source to the
target. This includes converting any measured data to the same dimension (i.e. conformed
dimension) using the same units so that they can later be joined. The transformation step also
requires joining data from several sources, generating aggregates, generating surrogate keys,
sorting, deriving new calculated values, and applying advanced validation rules.
Load
During the load step, it is necessary to ensure that the load is performed correctly and using as few resources as possible. The target of the load process is often a database. To make the load efficient, it is helpful to disable any constraints and indexes before the load and to re-enable them only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
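As an illustration of a single ETL step, the following Java/JDBC sketch extracts customer rows from a hypothetical operational source, applies a toy transformation, and loads them into a warehouse dimension table. The connection URLs, credentials and table/column names are assumptions for illustration only, not the manual's TopHireDW script.

import java.sql.*;

public class SimpleEtl {
    public static void main(String[] args) throws SQLException {
        // Hypothetical source and warehouse connections
        try (Connection src = DriverManager.getConnection(
                     "jdbc:sqlserver://source;databaseName=Sales", "user", "pwd");
             Connection dw = DriverManager.getConnection(
                     "jdbc:sqlserver://dwserver;databaseName=TopHireDW", "user", "pwd")) {

            try (Statement st = src.createStatement();
                 // Extract: read customers from the operational source
                 ResultSet rs = st.executeQuery(
                         "SELECT CustomerId, CustomerName, Town FROM Customer");
                 // Load: insert into the warehouse dimension table
                 PreparedStatement ins = dw.prepareStatement(
                         "INSERT INTO DimCustomer (CustomerId, CustomerName, Town) VALUES (?, ?, ?)")) {
                while (rs.next()) {
                    // Transform: a toy conformance rule (trim and upper-case the town name)
                    String town = rs.getString("Town");
                    ins.setString(1, rs.getString("CustomerId"));
                    ins.setString(2, rs.getString("CustomerName"));
                    ins.setString(3, town == null ? null : town.trim().toUpperCase());
                    ins.addBatch();
                }
                ins.executeBatch(); // batching keeps the number of round trips small
            }
        }
    }
}

In a real warehouse load, constraints and indexes on the target table would be disabled before this step and re-enabled afterwards, as described above.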
Online Analytical Processing (OLAP) servers are based on the multidimensional data model. They allow managers and analysts to gain insight into information through fast, consistent, and interactive access to it. OLAP servers support the following operations:
● Roll-up
● Drill-down
● Slice and dice
● Pivot (rotate)
Roll-up
Roll-up performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction. It navigates from more detailed data to less detailed data.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways:
● Drill-down is performed by stepping down a concept hierarchy for the dimension time.
● Initially the concept hierarchy was "day < month < quarter < year."
● On drilling down, the time dimension is descended from the level of quarter to the level of
month.
● When drill-down is performed, one or more dimensions from the data cube are added.
● It navigates the data from less detailed data to highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
Consider the following diagram that shows how slice works.
● Here slice is performed for the dimension "time" using the criterion time = "Q1".
● It forms a new sub-cube by selecting a single dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the
following diagram that shows the dice operation.
The dice operation on the cube is based on selection criteria that involve three dimensions.
Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an
alternative presentation of data. Consider the following diagram that shows the pivot operation.
A.(v). Explore visualization features of the tool for analysis like identifying
trends etc.
Ans: Visualization Features:
WEKA's visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-D) and pairs of attributes (2-D), and rotate 3-D visualizations (Xgobi-style). WEKA has a "Jitter" option to deal with nominal attributes and to detect "hidden" data points.
● Access to visualization from the classifier, cluster and attribute selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and, depending on the scheme, further options for visualizing errors, clusters, trees, etc.
Select a square that corresponds to the attributes you would like to visualize. For example, let's choose 'outlook' for the X-axis and 'play' for the Y-axis. Click anywhere inside the square that corresponds to 'play' on the left and 'outlook' at the top.
In the visualization window, beneath the X-axis selector there is a drop-down list,
‘Colour’, for choosing the color scheme. This allows you to choose the color of points based on the
attribute selected. Below the plot area, there is a legend that describes what values the colors
correspond to. In this example, red represents 'no' while blue represents 'yes'. For better visibility you can change the color of the label 'yes': left-click on 'yes' in the 'Class colour' box and select a lighter color from the color palette.
To the right of the plot area there is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. You can choose
what axes are used in the main graph by clicking on these strips (left-click changes X-axis, right-click
changes Y-axis).
The software sets X - axis to ‘Outlook’ attribute and Y - axis to ‘Play’. The instances are spread out
in the plot area and concentration points are not visible. Keep sliding ‘Jitter’, a random displacement
given to all points in the plot, to the right, until you can spot concentration points.
The results are shown below, but in this screen we changed 'Colour' to 'temperature'. Besides 'outlook' and 'play', this lets you see the 'temperature' corresponding to each 'outlook' value. It affects the interpretation: if you see 'outlook' = 'sunny' and 'play' = 'no', you also need to look at the 'temperature', because if it is too hot, you do not want to play. Change 'Colour' to 'windy' and you can see that if it is windy, you do not want to play either.
Selecting Instances
Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the 'UserClassifier', which lets you build your own classifier by interactively selecting instances. Below the Y-axis there is a drop-down list that allows you to choose a selection method. A group of points on the graph can be selected in four ways [2]:
1. Select Instance. Click on an individual data point. It brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes will be shown.
3. Polygon. You can select several points by building a free-form polygon. Left-click on the graph to
add vertices to the polygon and right-click to complete it.
4. Polyline. To distinguish the points on one side from the ones on the other, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.
(v). Explore Visualization features of the tool for analysis like identifying trends etc.
Step 1: Go to the Applications menu -> Ubuntu Software Center -> search for Weka and click the Install button.
Once installed, WEKA can be launched from Applications menu -> Science -> weka.
(ii). Understand the features of the WEKA toolkit such as the Explorer, Knowledge Flow interface, Experimenter, and command-line interface.
Introduction:
Weka is a workbench that contains a collection of visualization tools and algorithms for data
analysis and predictive modeling, together with graphical user interfaces for easy access to these
functions.
It is portable, since it is fully implemented in the Java programming language and thus runs on almost any modern computing platform.
Description:
Open the program. Once the program has been loaded on the user's machine, it is opened by navigating to the programs start option, which depends on the user's operating system. Figure 1.1 is an example of the initial opening screen on a computer.
After clicking the Explorer button, the WEKA Explorer interface appears. Inside the WEKA Explorer window there are six tabs:
1. Preprocess- used to choose and modify the data being acted upon. There are three ways to load a data set:
● Open File- allows the user to select files residing on the local machine or recorded medium
● Open URL- provides a mechanism to locate a file or data source from a different location specified by the user
● Open Database- allows the user to retrieve files or data from a database source provided by the user
2. Classify- used to test and train different learning schemes on the preprocessed data file under experimentation. Again there are several options to be selected inside the Classify tab. The test options give the user the choice of four different test mode scenarios on the data set: use training set, supplied test set, cross-validation, and percentage split.
3. Cluster- used to apply different tools that identify clusters within the data file. The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences within the data set and produce information for the user to analyze.
Fig: Cluster analysis options using the training data set in the Weka tool
4. Association- used to apply different rules to the data file that identify association within the data.
The associate tab opens a window to select the options for associations within the data set.
5. Select attributes- used to apply different rules to reveal changes based on a selected attribute's inclusion in or exclusion from the experiment.
6. Visualize- used to see what the various manipulations produced on the data set in a 2-D format, in scatter plot and bar graph output.
2. Experimenter - this option allows users to conduct different experimental variations on data sets
and perform statistical manipulation. The Weka Experiment Environment enables the user to
create, run, modify, and analyze experiments in a more convenient manner than is possible when
processing the schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyze the results to determine if one of the
schemes is (statistically) better than the other schemes. Results destination: ARFF file, CSV file,
JDBC database.
Algorithms: filters
3. Knowledge Flow -basically the same functionality as Explorer with drag and drop functionality.
The advantage of this option is that it supports incremental learning from previous results
4. Simple CLI - provides users without a graphic interface option the ability to execute commands
from a terminal window.
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software. In WEKA, each data entry is an instance of the Java class weka.core.Instance, and each instance consists of a number of attribute values. WEKA loads data sets from ARFF files. An ARFF file has two sections:
1) The Header section defines the relation (dataset) name and the attribute names and types.
2) The Data section lists the actual instances; each line holds one instance, with attribute values separated by commas.
The figure above is from the German credit data and shows an ARFF file. Lines beginning with a % sign are comments. There are three basic keywords:
● "@relation" in the Header section, followed by the relation (dataset) name.
● "@attribute" in the Header section, followed by the attribute names and their types (or sizes).
● "@data", which marks the start of the Data section.
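For illustration, a minimal ARFF file in the style of the weather data set looks like the following; the data rows shown here are illustrative values only.

% comment lines start with a percent sign
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes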
Click the "Open file..." button to open a data set and double-click on the "data" directory. Weka provides a number of small common machine learning datasets that you can use to practice on.
(vi). Load a data set (e.g. the weather data set or the credit data set)
UNIT - 2
The navigation flow for preprocessing a data set without any filter in the Weka tool is as follows:
The WEKA GUI launcher --> Explorer --> Preprocess --> Open file (browse the system for the data set) --> select ALL attributes.
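The same preprocessing step can also be run through Weka's Java API; the sketch below (the ARFF file name is a placeholder for a local copy) loads a data set and prints the attribute summary shown by the Preprocess panel.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadDataset {
    public static void main(String[] args) throws Exception {
        // Load the data set (path is a placeholder)
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1); // last attribute is the class
        System.out.println(data.toSummaryString());   // per-attribute summary
    }
}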
The navigation flow for preprocessing a data set with the Discretize filter in the Weka tool is as follows:
Open file (browse the system for the data set) --> Choose --> weka --> filters --> supervised --> attribute --> Discretize --> ALL attributes.
The navigation flow for preprocessing a data set with the Resample filter in the Weka tool is as follows:
Open file (browse the system for the data set) --> Choose --> weka --> filters --> supervised --> instance --> Resample --> select Credit History.
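Both filter flows above can also be applied from the Java API; a minimal sketch (the dataset path and the 50% sample size are assumptions):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;
import weka.filters.supervised.instance.Resample;

public class ApplyFilters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1); // supervised filters need a class attribute

        // Supervised discretization of the numeric attributes
        Discretize disc = new Discretize();
        disc.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, disc);

        // Resample: draw a random 50% subsample of the instances
        Resample res = new Resample();
        res.setSampleSizePercent(50.0);
        res.setInputFormat(data);
        Instances sample = Filter.useFilter(data, res);

        System.out.println("Discretized: " + discretized.numInstances() + " instances");
        System.out.println("Resampled:   " + sample.numInstances() + " instances");
    }
}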
B. Load a data set into Weka and run the Apriori algorithm with different support and confidence values. Study the rules generated.
The navigation flow for applying the Apriori algorithm on a data set in the Weka tool is as follows:
Open file (browse the system for the data set) --> Choose --> weka --> associations --> Apriori --> Start.
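A minimal sketch of running Apriori from Java with several support and confidence values (a fully nominal data set such as the weather data is assumed, since Apriori requires nominal attributes):

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RunApriori {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff"); // all attributes nominal

        double[] supports = {0.1, 0.2, 0.4};
        double[] confidences = {0.5, 0.7, 0.9};
        for (double sup : supports) {
            for (double conf : confidences) {
                Apriori apriori = new Apriori();
                apriori.setLowerBoundMinSupport(sup); // minimum support
                apriori.setMinMetric(conf);           // minimum confidence
                apriori.setNumRules(10);
                apriori.buildAssociations(data);
                System.out.println("support=" + sup + ", confidence=" + conf);
                System.out.println(apriori);          // prints the generated rules
            }
        }
    }
}

Comparing the printed rule lists shows how higher support and confidence thresholds reduce the number of rules generated.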
UNIT-3
The navigation flow for classifying a data set with the J48 classifier in the Weka tool is as follows:
Open file (browse the system for the data set) --> Choose --> weka --> trees --> J48.
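The same J48 run can be scripted; a minimal sketch (the dataset path is a placeholder) using the default -C 0.25 -M 2 options:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f); // -C 0.25
        tree.setMinNumObj(2);            // -M 2
        tree.buildClassifier(data);
        System.out.println(tree);        // textual tree; use the GUI result list to visualize it
    }
}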
B. Extract if-then rules from the decision tree generated by the classifier, observe the confusion matrix and derive the accuracy, F-measure, TP rate, FP rate, precision and recall values. Apply the cross-validation strategy with different fold values and compare the accuracy results.
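A sketch of how the confusion matrix and the requested metrics can be obtained with 10-fold cross-validation (the dataset path and the class-index choice are assumptions):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class EvaluateJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toMatrixString());               // confusion matrix
        System.out.println("Accuracy : " + eval.pctCorrect() + " %");
        System.out.println("Precision: " + eval.precision(0));   // metrics for class index 0
        System.out.println("Recall   : " + eval.recall(0));
        System.out.println("F-measure: " + eval.fMeasure(0));
        System.out.println("TP rate  : " + eval.truePositiveRate(0));
        System.out.println("FP rate  : " + eval.falsePositiveRate(0));
    }
}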
C. Load a dataset into Weka and run the Naive Bayes classification algorithm. Study the classifier output.
The navigation flow for classifying a data set with the Naive Bayes classifier in the Weka tool is as follows:
The WEKA GUI launcher --> Explorer --> Classify --> Open file (browse the system for the data set) --> Choose --> weka --> bayes --> NaiveBayes.
E. Compare the classification results of the ID3, J48, Naïve Bayes and k-NN classifiers for each dataset, deduce which classifier performs best and which performs worst for each dataset, and justify.
Ans: Steps to run the ID3 and J48 classification algorithms in WEKA
J48:
Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
------------------
Number of Leaves : 5
=== Detailed Accuracy By Class === (TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area)
1 0 1 1 1 1 Iris-setosa
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 49 1 | b = Iris-versicolor
1 2 48 | c = Iris-virginica
Naïve-bayes:
Scheme:weka.classifiers.bayes.NaiveBayes
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
=== Classifier model (per-class statistics; excerpt) ===
For each attribute (sepallength, sepalwidth, petallength, petalwidth) the output lists the per-class mean, standard deviation, weight sum and precision; the weight sum is 50 for each of the three classes.
=== Detailed Accuracy By Class === (TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area)
1 0 1 1 1 1 Iris-setosa
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 48 2 | b = Iris-versicolor
1 4 46 | c = Iris-virginica
Scheme:weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A
\"weka.core.EuclideanDistance -R first-last\""
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
Kappa statistic 1
=== Detailed Accuracy By Class === (TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area)
1 0 1 1 1 1 Iris-setosa
1 0 1 1 1 1 Iris-versicolor
1 0 1 1 1 1 Iris-virginica
Weighted Avg. 1 0 1 1 1 1
=== Confusion Matrix ===
a b c <-- classified as
50 0 0 | a = Iris-setosa
0 50 0 | b = Iris-versicolor
0 0 50 | c = Iris-virginica
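The comparison asked for in part E can also be scripted; a minimal sketch using 10-fold cross-validation on the iris data (k-NN is IBk with k = 1; ID3 is omitted here because it needs all-nominal attributes, so the data would have to be discretized first):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new J48(), new NaiveBayes(), new IBk(1) };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-12s accuracy = %.2f %%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}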
Unit – 4
Output:
Scheme: weka.clusterers.SimpleKMeans ... -S 10
Relation: iris
Instances: 150
Attributes: 5
sepallength
sepalwidth
petallength
petalwidth
class
kMeans
======
Number of iterations: 7
Cluster centroids:
(the centroid table is omitted in this excerpt)

Clustered Instances
1 100 ( 67%)
2 50 ( 33%)
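The clustering run above can be reproduced through the API; a minimal sketch (iris data assumed, the class attribute removed before clustering, two clusters and seed 10 as in the output above):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterIris {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");

        // Drop the class attribute before clustering, as the GUI does when it is ignored
        Remove rm = new Remove();
        rm.setAttributeIndices("last");
        rm.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, rm);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2); // two clusters, as reported above
        km.setSeed(10);       // -S 10
        km.buildClusterer(noClass);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(noClass);
        System.out.println(eval.clusterResultsToString()); // centroids and cluster sizes
    }
}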
C. Explore visualization features of weka to visualize the clusters. Derive interesting insights
and explain.
WEKA's visualization allows you to visualize a 2-D plot of the current working relation. Visualization is very useful in practice; it helps to determine the difficulty of the learning problem. WEKA can visualize single attributes (1-D) and pairs of attributes (2-D), and rotate 3-D visualizations (Xgobi-style). WEKA has a "Jitter" option to deal with nominal attributes and to detect "hidden" data points.
Access to visualization from the classifier, cluster and attribute selection panels is available from a popup menu. Click the right mouse button over an entry in the result list to bring up the menu. You will be presented with options for viewing or saving the text output and, depending on the scheme, further options for visualizing errors, clusters, trees, etc.
Select a square that corresponds to the attributes you would like to visualize. For example, let's choose 'outlook' for the X-axis and 'play' for the Y-axis. Click anywhere inside the square that corresponds to 'play' on the left and 'outlook' at the top.
In the visualization window, beneath the X-axis selector there is a drop-down list, 'Colour', for choosing the color scheme. This allows you to choose the color of points based on the attribute selected. Below the plot area, there is a legend that describes what values the colors correspond to. In this example, red represents 'no' while blue represents 'yes'. For better visibility you can change the color of the label 'yes': left-click on 'yes' in the 'Class colour' box and select a lighter color from the color palette.
Selecting Instances
Sometimes it is helpful to select a subset of the data using the visualization tool. A special case is the 'UserClassifier', which lets you build your own classifier by interactively selecting instances. Below the Y-axis there is a drop-down list that allows you to choose a selection method. A group of points on the graph can be selected in four ways [2]:
1. Select Instance. Click on an individual data point. It brings up a window listing the attributes of the point. If more than one point appears at the same location, more than one set of attributes will be shown.
3. Polygon. You can select several points by building a free-form polygon. Left-click on the graph to add vertices to the polygon and right-click to complete it.
4. Polyline. To distinguish the points on one side from the ones on the other, you can build a polyline. Left-click on the graph to add vertices to the polyline and right-click to finish.
Viva Questions
Unit-V
Output:
Relation: labor-neg-data
Instances: 57
Attributes: 17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
B. Use the cross-validation and percentage split options and repeat running the Linear Regression model. Observe the results and derive meaningful inferences.
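Both evaluation modes can also be scripted; a minimal sketch (labor.arff is assumed, with the numeric 'duration' attribute as the class and a 66% split):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RegressionModes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("labor.arff");
        data.setClassIndex(0); // 'duration' is the first (numeric) attribute

        // 10-fold cross-validation
        Evaluation cv = new Evaluation(data);
        cv.crossValidateModel(new LinearRegression(), data, 10, new Random(1));
        System.out.println("CV correlation coefficient: " + cv.correlationCoefficient());

        // 66% / 34% percentage split
        Instances shuffled = new Instances(data);
        shuffled.randomize(new Random(1));
        int trainSize = (int) Math.round(shuffled.numInstances() * 0.66);
        Instances train = new Instances(shuffled, 0, trainSize);
        Instances test = new Instances(shuffled, trainSize, shuffled.numInstances() - trainSize);

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(train);
        Evaluation split = new Evaluation(train);
        split.evaluateModel(lr, test);
        System.out.println("Split correlation coefficient: " + split.correlationCoefficient());
    }
}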
Output: cross-validation
Relation: labor-neg-data
Instances: 57
Attributes: 17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
duration =
0.4689 * cost-of-living-adjustment=tc,tcf +
0.6523 * pension=none,empl_contr +
1.0321 * bereavement-assistance=yes +
0.3904 * contribution-to-health-plan=full +
0.2765
Output: percentage split
Relation: labor-neg-data
Instances: 57
Attributes: 17
duration
wage-increase-first-year
wage-increase-second-year
wage-increase-third-year
cost-of-living-adjustment
working-hours
pension
standby-pay
shift-differential
education-allowance
statutory-holidays
vacation
longterm-disability-assistance
contribution-to-dental-plan
bereavement-assistance
contribution-to-health-plan
class
duration =
0.4689 * cost-of-living-adjustment=tc,tcf +
0.6523 * pension=none,empl_contr +
1.0321 * bereavement-assistance=yes +
0.3904 * contribution-to-health-plan=full +
0.2765
● The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. We have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible. Interest on these loans is the bank's profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.
● Credit risk is an investor's risk of loss arising from a borrower who does not make
payments as promised. Such an event is called a default. Other terms for credit risk are
default risk and counterparty risk.
● Credit risk is most simply defined as the potential that a bank borrower or counterparty
will fail to meet its obligations in accordance with agreed terms.
● The goal of credit risk management is to maximize a bank's risk-adjusted rate of return
by maintaining credit risk exposure within acceptable parameters.
● Banks need to manage the credit risk inherent in the entire portfolio as well as the risk in
individual credits or transactions.
● Banks should also consider the relationships between credit risk and other risks.
● The effective management of credit risk is a critical component of a comprehensive
approach to risk management and essential to the long-term success of any banking
organization.
● A good credit assessment means you should be able to qualify, within the limits of your
income, for most loans.
Week 1
1. List all the categorical (or nominal) attributes and the real-valued attributes
separately.
From the German Credit Assessment case study given to us, the following attributes are found to be applicable for credit-risk assessment. They include both categorical (nominal) attributes (which take values such as true/false) and real-valued attributes:
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. installment plans
14. housing
15. existing credits
16. job
17. num_dependents
18. telephone
19. foreign worker
Week 2
2. What attributes do you think might be crucial in making the credit assessment? Come
up with some simple rules in plain English using your selected attributes.
In my view, the following attributes may be crucial in making the credit risk assessment:
1. Credit_history
2. Employment
3. Property_magnitude
4. Job
5. Duration
6. Credit_amount
7. Installment
8. Existing credit
Based on the above attributes, we can make a decision whether to give credit or not. For example:
checking_status = no checking AND other_payment_plans = none AND credit_history = critical/other existing credit: good
credit_history = all paid AND other_parties = none AND other_payment_plans = bank: bad
duration > 30 AND savings_status = no known savings AND num_dependents > 1: good
Week 3
3. One type of model that you can create is a Decision Tree - train a Decision Tree using
the complete dataset as the training data. Report the model obtained after training.
A decision tree is a flowchart-like tree structure where each internal (non-leaf) node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label.
Decision trees can be easily converted into classification rules.
Examples of decision tree algorithms are ID3, C4.5 and CART.
● Using WEKA Tool, we can generate a decision tree by selecting the “classify tab”.
● In the Classify tab select the Choose option, where a list of different decision trees is available. From that list select J48.
● Now under the test options, select the 'Use training set' option.
● The resulting window in WEKA is as follows:
● To generate the decision tree, right-click on the result list and select the visualize tree option, by which the decision tree will be generated.
● The decision tree obtained for credit risk assessment is too large to fit on the screen.
Week 4
4. Suppose you use your above model trained on the complete dataset, and classify
credit good/bad for each of the examples in the dataset. What % of examples can you
classify correctly? (This is also called testing on the training set) Why do you think you
cannot get 100 % training accuracy?
In the above model we trained on the complete dataset and classified credit as good/bad for each of the examples in the dataset.
For example:
In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly and the remaining 14.5% of the examples were classified incorrectly. We cannot get 100% training accuracy because, out of the 20 attributes, some unnecessary attributes are also analyzed and trained. This affects the accuracy, and hence we cannot get 100% training accuracy.
Week 5
5. Is testing on the training set as you did above a good idea? Why not?
It is a bad idea if we take all of the data as the training set: there is then no independent data left to test whether the classification is correct or not.
As a rule of thumb, for reliable accuracy estimates we should take about 2/3 of the dataset as the training set and the remaining 1/3 as the test set. But in the above model we have taken the complete dataset as the training set, which results in only 85.5% accuracy. Unnecessary attributes that do not play a crucial role in credit risk assessment are also analyzed and trained, which increases complexity and lowers accuracy. If part of the dataset is used as a training set and the remainder as a test set, the results are more reliable and the computation time is lower. This is why we prefer not to take the complete dataset as the training set. Use Training Set result for the German Credit Data table:
Week 6
6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does your accuracy increase or decrease? Why?
Cross-validation:
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the model. That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set to obtain the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.
1. Select classify tab and J48 decision tree and in the test option select cross validation
radio button and the number of folds as 10.
2. Number of folds indicates number of partition with the set of attributes.
3. A Kappa statistic nearing 1 indicates 100% accuracy, in which case all the errors would be zeroed out; but in reality there is no such training set that gives 100% accuracy.
4. Cross Validation Result at folds: 10 for the table German Credit Data:
Kappa statistic: 0.2467
Here there are 1000 instances, with 100 instances per partition.
Cross-validation result at folds = 20 for the German Credit Data table:
106.5538 %
104.1164 %
Percentage split does not allow 100%; it allows only up to 99.9%.
Week 7
7. Check to see if the data shows a bias against "foreign workers" (attribute 20), or "personal-status" (attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case which you have already done. To remove an attribute you can use the Preprocess tab in WEKA's GUI Explorer. Did removing these attributes have any significant effect? Discuss.
The accuracy increases because the two attributes "foreign workers" and "personal status" are not very important for training and analysis. By removing them, the training time is reduced to some extent and the accuracy increases. The decision tree created from the full dataset is very large compared to the decision tree we have trained now; this is the main difference between the two decision trees.
If we remove the 9th attribute, the accuracy increases further to 86.6%, which shows that these two attributes are not significant for training.
Week 8
8. Another question might be, do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute, naturally). Try out some combinations. (You had removed two attributes in problem 7; remember to reload the ARFF data file to get all the attributes back before you start selecting the ones you want.)
Select attributes 2, 3, 5, 7, 10, 17 and 21, and click Invert to remove the remaining attributes. After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the remaining attributes and visualize them.
After we remove the 14 attributes, the accuracy decreases to 76.4%; hence we can try further combinations of attributes to increase the accuracy.
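The attribute handling in problems 7 and 8 can also be done with the Remove filter from code; a minimal sketch (attribute numbers as in the questions, dataset path assumed):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class SelectAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");

        // Problem 7: drop attributes 9 (personal status) and 20 (foreign worker)
        Remove drop = new Remove();
        drop.setAttributeIndices("9,20");
        drop.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, drop);

        // Problem 8: keep only attributes 2,3,5,7,10,17 and the class attribute 21
        Remove keep = new Remove();
        keep.setAttributeIndices("2,3,5,7,10,17,21");
        keep.setInvertSelection(true); // invert = keep the listed attributes
        keep.setInputFormat(data);
        Instances subset = Filter.useFilter(data, keep);

        System.out.println(reduced.numAttributes() + " attributes after problem 7, "
                + subset.numAttributes() + " after problem 8");
    }
}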
Week 9
9. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher than accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by using a cost matrix in WEKA. Train your decision tree again and report the decision tree and cross-validation results. Are they significantly different from the results obtained in problem 6 (using equal cost)?
In problem 6, we used equal costs and trained the decision tree. Here, we consider the two cases with different costs: let us take cost 5 in case 1 and cost 2 in case 2. When we give such costs and train the decision tree again, we observe that the result is almost equal to that of the decision tree obtained in problem 6 (Case 1: cost 5, Case 2: cost 2).
We do not have this cost factor in problem 6, as there we used equal costs. This is the major difference between the results of problem 6 and problem 9.
Case 1 cost matrix:
5 1
1 5
Case 2 cost matrix:
2 1
1 2
1. Select the Classify tab. Select More Options from the test options. Tick Cost-sensitive evaluation and go to Set. Set classes to 2. Click on Resize and the cost matrix appears. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
2. The confusion matrix will then be generated, and you can find out the difference between the good and bad classes.
3. Check whether the accuracy changes or not.
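The same cost-sensitive evaluation can be sketched through the API. The CostMatrix.setCell call and the exact placement of the costs are assumptions about recent Weka versions, so verify them against your installation:

import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveEval {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 2x2 cost matrix: misclassifying a good applicant costs 5, a bad one costs 1
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 5.0); // true good, predicted bad (assumed row/column order)
        costs.setCell(1, 0, 1.0); // true bad, predicted good

        Evaluation eval = new Evaluation(data, costs);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toMatrixString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}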
Week 10
10. Do you think it is a good idea to prefer simple decision trees instead of having long
complex decision trees? How does the complexity of a Decision Tree relate to the bias of the
model?
When we consider long, complex decision trees, we will have many unnecessary attributes in the tree, which increases the bias of the model. Because of this, the accuracy of the model can also be affected.
This problem can be reduced by considering a simple decision tree. The attributes will be fewer, which decreases the bias of the model, and so the result will be more accurate.
So it is a good idea to prefer simple decision trees instead of long, complex trees.
4. To generate the decision tree, right click on the result list and select visualize tree
option, by which the decision tree will be generated.
Visualize tree
Week 11
11. You can make your decision trees simpler by pruning the nodes. One approach is to use reduced-error pruning. Explain this idea briefly. Try reduced-error pruning for training your decision trees using cross-validation (you can do this in WEKA) and report the decision tree you obtain. Also report your accuracy using the pruned model. Does your accuracy increase?
Reduced-error pruning:
The idea of using a separate pruning set for pruning (which is applicable to decision trees as well as rule sets) is called reduced-error pruning. The variant described previously prunes a rule immediately after it has been grown and is called incremental reduced-error pruning. Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual tests.
However, this method is much slower. Of course, there are many different ways to assess the worth of a rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the predicted class from other classes if it were the only rule in the theory, operating under the closed-world assumption. If it gets p instances right out of the t instances that it covers, and there are P instances of this class out of a total of T instances altogether, then it gets p positive instances right. The instances that it does not cover include N - n negative ones, where
n = t - p
is the number of negative instances that the rule covers and
N = T - P
is the total number of negative instances. Thus the rule has an overall success ratio of
[p + (N - n)] / T,
and this quantity, evaluated on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.
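In WEKA, J48's reduced-error pruning can be switched on directly; a minimal sketch (German credit data assumed, evaluated with 10-fold cross-validation):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReducedErrorPruning {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 pruned = new J48();
        pruned.setReducedErrorPruning(true); // hold out part of the training data for pruning
        pruned.setNumFolds(3);               // one fold for pruning, the rest for growing

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(pruned, data, 10, new Random(1));
        System.out.println("Accuracy with reduced-error pruning: " + eval.pctCorrect() + " %");
    }
}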
Week 12
12. (Extra credit): How can you convert a decision tree into "if-then-else" rules? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules; one such classifier in WEKA is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough in making the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART and OneR.
In WEKA, rules.PART is one of the classifiers that converts decision trees into "IF-THEN-ELSE" rules. Converting decision trees into "IF-THEN-ELSE" rules using the rules.PART classifier:
PART decision list
● outlook = overcast: yes (4.0)
● windy = TRUE: no (4.0/1.0)
● outlook = sunny: no (3.0/1.0)
: yes (3.0)
● Number of Rules : 4
Yes, sometimes just one attribute can be good enough in making the decision.
In this dataset (Weather), Single attribute for making the decision is “outlook”
outlook:
● sunny -> no
● overcast -> yes
● rainy -> yes
(10/14 instances correct)
With respect to time, the OneR classifier ranks highest, J48 is in second place and PART is in third place.

             J48    PART   OneR
TIME (sec)   0.12   0.14   0.04
RANK         II     III    I

But if you consider accuracy, the J48 classifier ranks highest, PART gets second place and OneR gets last place.
The same model written as nested if-then-else rules (following the PART decision list above):
if outlook = overcast then
    play = yes
else if windy = TRUE then
    play = no
else if outlook = sunny then
    play = no
else
    play = yes
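The PART, OneR and J48 models compared above can be trained in a few lines of Java; a minimal sketch on the nominal weather data (file name assumed):

import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleLearners {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);

        PART part = new PART();
        part.buildClassifier(data);
        System.out.println(part); // decision list, as shown above

        OneR oner = new OneR();
        oner.buildClassifier(data);
        System.out.println(oner); // single-attribute rule (outlook)

        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree); // the tree the rules are compared against
    }
}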
Additional Programs
1. Perform cluster analysis on German credit data set using partition clustering algorithm
2. Perform cluster analysis on German credit data set using EM clustering algorithm
Additional program-1
Aim: Perform cluster analysis on German credit data set using partition clustering algorithm
Recommended Hardware / Software Requirements:
• Hardware Requirements: Intel Based desktop PC with minimum of 166 MHZ or faster
processor with at least 64 MB RAM and 100 MB free disk space.
• Weka
Pseudo code
In pseudo code, the general k-means clustering algorithm is:
1. Place K points into the space represented by the objects that are being clustered. These
points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of
the objects into groups from which the metric to be minimized can be calculated.
Procedure: In the Weka GUI Explorer, select the Cluster tab, click Choose and select SimpleKMeans. Under the cluster mode options select 'Use training set'. Click on Start.
Output: cluster analysis on k-means clustering algorithm
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 3 -A "weka.core.EuclideanDistance -R
first-last" -I 500 -O -S 10
Relation: german_credit-weka.filters.unsupervised.attribute.Remove-R21
Instances: 1000
Attributes: 20
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since
property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
Test mode: evaluate on training data
=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 8
Within cluster sum of squared errors: 5145.269062855846
Missing values globally replaced with mean/mode
Cluster centroids:
                                        Cluster#
Attribute                 Full Data      0               1                   2
                          (1000)         (484)           (190)               (326)
=====================================================================================
checking_status           no checking    no checking     <0                  0<=X<200
duration                  20.903         20.7314         26.0526             18.1564
credit_history            existing paid  existing paid   existing paid       existing paid
purpose                   radio/tv       new car         used car            radio/tv
credit_amount             3271.258       3293.1281       4844.6474           2321.7822
savings_status            <100           <100            <100                <100
employment                1<=X<4         1<=X<4          >=7                 >=7
installment_commitment    2.973          2.8822          3.0579              3.0583
personal_status           male single    male single     male single         male single
other_parties             none           none            none                none
residence_since           2.845          2.4483          3.5211              3.0399
property_magnitude        car            car             no known property   real estate
age                       35.546         33.155          41.0526             35.8865
other_payment_plans       none           none            none                none
housing                   own            own             for free            own
existing_credits          1.407          1.3967          1.4474              1.3988
Additional program-2
Aim: Perform cluster analysis on the German credit data set using the EM clustering algorithm
Recommended Hardware / Software Requirements:
• Hardware Requirements: Intel Based desktop PC with minimum of 166 MHZ or faster
processor with at least 64 MB RAM and 100 MB free disk space.
• Weka
Pseudo code:
In pseudo code, the general EM clustering algorithm is:
1. Choose initial estimates for the cluster parameters (means, covariances and mixing probabilities).
2. Expectation (E) step: for each instance, compute the probability of membership in each cluster using the current parameter estimates.
3. Maximization (M) step: re-estimate the cluster parameters from the membership probabilities computed in the E step.
4. Repeat steps 2 and 3 until the log-likelihood converges.
Procedure: In the Weka GUI Explorer, select the Cluster tab, click Choose and select EM. Under the cluster mode options select 'Use training set'. Click on Start.
Output:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -M 1.0E-6 -S 100
Relation: german_credit-weka.filters.unsupervised.attribute.Remove-R21
Instances: 1000
Attributes: 20
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment_commitment
personal_status
other_parties
residence_since
property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign_worker
Test mode: evaluate on training data
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross validation: 4
                                 Cluster
Attribute                    0          1          2          3
                          (0.26)     (0.26)      (0.2)     (0.29)
=================================================================
checking_status
<0 100.8097 58.5666 51.7958 66.8279
0<=X<200 69.3481 63.6477 34.9535 105.0507
>=200 17.6736 20.0978 11.9012 17.3274
no checking 73.012 119.2995 101.8966 103.7918
[total] 260.8434 261.6116 200.5471 292.9978
duration
mean 17.7484 14.3572 23.4112 27.8358
std. dev. 8.0841 7.1757 12.1018 14.1317
credit_history
no credits/all paid 10.1705 6.0326 8.4795 19.3174
all paid 17.9296 11.0899 9.6553 14.3252
existing paid 175.3951 142.1934 53.3962 163.0153
delayed previously 10.1938 18.0432 24.9273 38.8357
critical/other existing credit 48.1544 85.2526 105.0888 58.5041
[total] 261.8434 262.6116 201.5471 293.9978
purpose
new car 57.7025 76.7946 47.734 55.7689
used car 14.504 7.9487 40.7163 43.831
furniture/equipment 95.3943 25.2704 24.1583 40.1769
radio/tv 53.3828 106.3023 48.3866 75.9283
domestic appliance 7.9495 3.4917 1.161 3.3979