Data Mining Lab-Weka Edited
By:
Mekuriaw Melkamu
Tizalegn Tilaye
1. INTRODUCTION TO DATA WAREHOUSE
EXTRACTION
In this first step of the ETL process, data is extracted from the source systems and made accessible for further processing. All the needed data is retrieved without negatively affecting the source system's performance, response time or locking. The extraction step usually involves a cleaning phase in which data quality is ensured through data unification. The rules of unification should include making identifiers unique (for example, gender categories), converting phone numbers and zip codes into a standard form, and validating that address fields are in the proper format.
TRANSFORMATION
This step applies a set of rules to transform the source data into consistent dimensions so that the same units of measurement can be used. The transformation step also joins data from a variety of sources, generates aggregates and surrogate keys, and applies validation and derived values.
LOADING
The loading phase is a two-part process: constraints and indexes are disabled before the load starts and then re-enabled once the load is completed. In this step, the target of the load process is usually a database.
1.4. SETTING UP A DATA WAREHOUSE
The main purpose of a data warehouse is to organize large amounts of stable data to be easily
retrieved and analyzed. So when setting up, care must be taken to ensure the data is rapidly
accessible and easily analyzed. One way of designing this system is with the use of dimensional
modeling, which allows large volumes of data to be efficiently queried and examined. Since much
of the data in warehouses is stable, that is, unchanging, there is hardly a need for repetitive backup methods. Also, once new data is loaded it can be updated and backed up right away, in some cases by way of the data preparation database, so that it becomes available for easy access. There are
four categories of data warehousing tools; these are extraction, table management, query
management and data integrity tools. All these tools can be used in the setup and maintenance of
the best technology to manage and store the huge amounts of data a company collects, analyzes
and reviews.
COMPANY ANALYSIS
The first step in setting up the company's data warehouse is to evaluate the firm's objectives. For example, a growing company might set the objective of engaging customers and building rapport. By examining what the company needs to do to achieve these objectives, what will need to be tracked, which key performance indicators to note, and how the company's activities can be evaluated numerically, the company can determine where it needs to start.
EXISTING SYSTEM ANALYSIS
By asking customers and various stakeholders pointed questions, Business Intelligence leaders can gather information about which of the performance measures currently in place are or are not effective. Reports can be collected from various departments in the company, and it may even be possible to collect analytical and summary reports from analysts and supervisors.
INFORMATION MODELING OF CORE BUSINESS PROCESSES
An information model is conceptual; it allows one to form ideas of which business processes need to interrelate and how to link them. Since the data warehouse is a collection of
correlating structures, creating a concept of what indicators need to be linked together to create
top performance levels is a vital step in the information modeling stage. A simple way to design
this model is to gather key performance indicators into fact tables and relate them to dimensions
such as customers, salespeople, products and such.
DESIGN AND TRACK
Once all those concepts are set in place, the next critical step is to move data into the warehouse
structure and track where it comes from and what it relates to. In this phase of design, it is crucial
to plan how to link data in the separate databases so that the information can be connected as it
is loaded into the data warehouse tables. The ETL process can be quite complex and may require specialized programs with sophisticated algorithms, so the right tools have to be chosen at the right, most cost-effective price for the job. Because the data is to be tracked over time, it will need to be available for a very long period. The grain (the atomic make-up) of the data will differ over time, but the system should be set up so that the differing granularity remains consistent throughout the single data structure.
Rules that have both high confidence and support are called strong rules.
Some competing alternative approaches can generate useful rules even with low support values.
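For reference, the standard definitions of these two measures for a rule A => B over a set of transactions are:
support(A => B) = (number of transactions containing both A and B) / (total number of transactions)
confidence(A => B) = (number of transactions containing both A and B) / (number of transactions containing A)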
STEPS:
Click the Open file... button in the Preprocess tab and select apriori.csv.
Go to the Associate tab – choose Apriori and click the Start button.
OUTPUT:
The above screenshot shows the association rules that were generated when the Apriori algorithm is applied to the given dataset.
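The same mining run can also be reproduced with the Weka Java API. The sketch below is a minimal outline, not part of the lab steps; the file name apriori.arff and the parameter values (10 rules, 0.1 minimum support, 0.9 minimum confidence) are illustrative assumptions, and the dataset is assumed to contain only nominal attributes.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load the (nominal) dataset.
        Instances data = new DataSource("apriori.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);               // report the 10 best rules
        apriori.setLowerBoundMinSupport(0.1);  // assumed minimum support
        apriori.setMinMetric(0.9);             // assumed minimum confidence

        apriori.buildAssociations(data);       // mine the association rules
        System.out.println(apriori);           // print the generated rules
    }
}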
3. LAB SESSION 2: FP-GROWTH ALGORITHM
AIM:
This experiment illustrates the use of the FP-Growth associator in Weka. The sample data set used in this
experiment is apriori.arff. This document assumes that appropriate data preprocessing has been
performed.
3.1. INTRODUCTION
Apriori: uses a generate-and-test approach – generates candidate itemsets and tests if they
are frequent.
– Generation of candidate itemsets is expensive (in both space and time)
– Support counting is expensive
• Subset checking (computationally expensive)
• Multiple Database scans (I/O)
FP-Growth: allows frequent itemset discovery without candidate itemset generation. Two-step approach:
– Step 1: Build a compact data structure called the FP-tree.
Built using 2 passes over the dataset.
– Step 2: Extract frequent itemsets directly from the FP-tree.
PROCEDURE:
1. Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized.
2. Clicking on the associate tab will bring up the interface for association rule algorithm.
3. We will use FP-Growth algorithm.
4. In order to change the parameters for the run (e.g. support, confidence, etc.) we click on the text box immediately to the right of the Choose button.
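For comparison, a minimal sketch of the same run through the Weka Java API is shown below; the file name aprioritest.arff and the option values are illustrative assumptions, and FPGrowth expects binary/nominal (market-basket style) attributes.

import weka.associations.FPGrowth;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FPGrowthDemo {
    public static void main(String[] args) throws Exception {
        // Load the market-basket style dataset (Transaction field already removed).
        Instances data = new DataSource("aprioritest.arff").getDataSet();

        FPGrowth fp = new FPGrowth();
        fp.setNumRulesToFind(10);          // report the 10 best rules
        fp.setLowerBoundMinSupport(0.1);   // assumed minimum support
        fp.setMinMetric(0.9);              // assumed minimum confidence

        fp.buildAssociations(data);        // build the FP-tree and extract the rules
        System.out.println(fp);            // print the generated rules
    }
}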
STEPS:
Remove the Transaction field and save the file as aprioritest.arff.
OUTPUT:
4. LAB SESSION 3: K-MEANS CLUSTERING
AIM:
This experiment illustrates the use of simple k-means clustering with the Weka explorer. The sample
data set used for this example is based on the vote.arff data set. This document assumes that
appropriate pre-processing has been performed.
4.3. HOW THE K-MEANS CLUSTERING ALGORITHM WORKS
K-means partitions the instances into k clusters: it starts from k initial cluster centres, assigns each instance to its nearest centre, recomputes each centre as the mean of the instances assigned to it, and repeats the assignment and update steps until the cluster memberships no longer change.
PROCEDURE:
1. Run the Weka explorer and load the data file vote.arff in preprocessing interface.
2. In order to perform clustering select the ‘cluster’ tab in the explorer and click on the choose
button. This step results in a dropdown list of available clustering algorithms.
3. In this case we select ‘simple k-means’.
4. Next click on the text box to the right of the Choose button to get the popup window shown in the screenshots. In this window we enter six as the number of clusters and leave the value of the seed as it is. The seed value is used in generating a random number which is used for making the internal assignments of instances to clusters.
5. Once the options have been specified, we run the clustering algorithm. In the 'Cluster mode' panel we make sure that the 'Use training set' option is selected, and then we click the 'Start' button. This process and the resulting window are shown in the following screenshots.
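The same clustering run can also be scripted with the Weka API. The sketch below mirrors the steps above (vote.arff, six clusters, seed 10, evaluation on the training set) and is only an illustrative outline.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("vote.arff").getDataSet();

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(6);   // six clusters, as entered in the popup window
        km.setSeed(10);         // seed for the random initial centre selection
        km.buildClusterer(data);

        // "Use training set" mode: evaluate the clusterer on the same data.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}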
STEPS: (Using Weka Explorer)
1. Open the Weka tool and click Explorer.
Choose the Cluster tab – click the Choose button – choose SimpleKMeans.
OUTPUT:
Go to the Visualize tab and click any one of the plot boxes to visualize it.
In the plot window, move the Jitter slider towards the end to view the results of the clustering.
STEPS: (Using Weka KnowledgeFlow)
1. Open the Weka tool and click KnowledgeFlow.
Click DataSources in the left-side window, choose ArffLoader, and place it in the right-side window.
Click Run button.
Right-click the TextViewer and choose the Show results option.
OUTPUT:
5. LAB SESSION 4: HIERARCHICAL CLUSTERING
AIM:
This experiment illustrates the use of hierarchical clustering with the Weka explorer. The sample
data set used for this example is based on the vote.arff data set. This document assumes that
appropriate pre-processing has been performed.
PROCEDURE:
1. Open the data file in Weka Explorer. It is presumed that the required data fields have been discretized. Clicking on the cluster tab will bring up the interface for the clustering algorithm.
2. We will use hierarchical clustering algorithm.
3. Visualization of the graph
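A minimal programmatic counterpart is sketched below; the file name vote.arff matches this lab, while the choice of two clusters is an illustrative assumption.

import weka.clusterers.HierarchicalClusterer;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class HierarchicalDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("vote.arff").getDataSet();

        HierarchicalClusterer hc = new HierarchicalClusterer();
        hc.setNumClusters(2);            // assumed number of clusters
        hc.buildClusterer(data);

        System.out.println(hc);          // textual description of the clustering
        System.out.println(hc.graph());  // tree description used by Weka's tree visualizer
    }
}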
STEPS:
The following screenshots show the clusters that were generated when the hierarchical clustering algorithm is applied to the given dataset.
1. Open the Weka tool and choose Explorer.
2. Click Open file... in the Preprocess tab and choose vote.arff.
OUTPUT:
3. Visualize the tree by right-clicking the result in the result list and choosing the Visualize Tree option.
6. LAB SESSION 5: BAYESIAN CLASSIFICATION
AIM:
This experiment illustrates the use of Bayesian classifier with Weka explorer. The sample data set
used for this example is based on the weather.nominal.arff data set. This document assumes
that appropriate pre-processing has been performed.
6.1. BAYESIAN CLASSIFICATION
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
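Bayes' theorem gives the posterior probability of a hypothesis H (for example, class membership) given observed data X:
P(H|X) = P(X|H) P(H) / P(X)
where P(X|H) is the likelihood of the data under the hypothesis, P(H) is the prior probability of the hypothesis, and P(X) is the probability of the data.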
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership
probabilities
Foundation: Based on Bayes' theorem.
Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable
performance with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability
that a hypothesis is correct — prior knowledge can be combined with observed data.
Standard: Even when Bayesian methods are computationally intractable, they can provide
a standard of optimal decision making against which other methods can be measured
PROCEDURE:
1. Open the data file in Weka Explorer. It is presumed that the required data fields have been
discretized.
2. Next we select the "Classify" tab and click the Choose button to select NaiveBayes as the classifier.
3. Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button. In this example, we accept the default values.
4. We select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
5. We now click Start to generate the model. The text description of the model as well as the evaluation statistics will appear in the right panel when the model construction is complete.
6. Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting the parameters for the classification).
7. Weka also lets us visualize the results graphically (for example, the classifier errors) from the result list.
8. We will use our model to classify the new instances.
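As a programmatic counterpart to the procedure above, the following minimal sketch runs NaiveBayes with 10-fold cross-validation through the Weka API; the file name weather.nominal.arff matches this lab, while the random seed of 1 is an assumption.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // "play" is the last attribute

        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));  // 10-fold cross-validation

        System.out.println(eval.toSummaryString());  // accuracy and error statistics
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}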
Click Choose button in Preprocess tab.
Click the RemovePercentage -P 50.0 text box above and change the percentage to 60.0 – click OK.
Click Apply and then Save – type the file name weather.nominaltest.arff.
Click Choose button in Preprocess tab.
Change InvertSelection to True – click OK.
Go to the Classify tab in Weka Explorer and click the Choose button – select NaiveBayes.
Click Start.
OUTPUT:
Apply NaiveBayes classification using a supplied test set (weather.nominaltest.arff).
Now click Supplied test set – Set button – click Open file... – choose weather.nominaltest.arff.
Create the following flow, click the Run button, then right-click the TextViewer and select Show results.
Draw an ArffLoader and select the filename.
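The split-and-evaluate idea behind the RemovePercentage steps above can also be expressed with the Weka API. The sketch below is illustrative only: the 60/40 split, the file name weather.nominal.arff, and which slice is used for training versus testing are assumptions rather than the exact settings of this lab.

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.instance.RemovePercentage;

public class TrainTestSplitDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Remove 60% of the instances; the remaining 40% is used here for training.
        RemovePercentage keepTrain = new RemovePercentage();
        keepTrain.setPercentage(60.0);
        keepTrain.setInputFormat(data);
        Instances train = Filter.useFilter(data, keepTrain);

        // Invert the selection to keep the complementary 60% as the test set.
        RemovePercentage keepTest = new RemovePercentage();
        keepTest.setPercentage(60.0);
        keepTest.setInvertSelection(true);
        keepTest.setInputFormat(data);
        Instances test = Filter.useFilter(data, keepTest);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(nb, test);               // evaluate on the supplied test set
        System.out.println(eval.toSummaryString());
    }
}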
OUTPUT:
7. LAB SESSION 6: DECISION TREE
AIM:
This experiment illustrates the use of the J48 classifier in Weka. The sample data set used in this experiment is the weather dataset, available in ARFF format. This document assumes that appropriate data preprocessing has been performed.
7.1. DECISION TREE
A decision tree is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each
leaf node holds a class label. The topmost node in the tree is the root node. The following
decision tree is for the concept buy computer that indicates whether a customer at a company is
likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf
node represents a class.
Tree Pruning
Tree pruning is performed in order to remove anomalies in the training data due to noise or
outliers. The pruned trees are smaller and less complex.
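A minimal API-level sketch of building a pruned J48 tree and evaluating it is given below; the file name weather.nominal.arff, the 10-fold cross-validation, and the seed of 1 are illustrative assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setUnpruned(false);   // keep J48's default pruning enabled
        tree.buildClassifier(data);
        System.out.println(tree);  // text form of the decision tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}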
STEPS: (Using Weka Explorer)
1. Open the Weka tool and choose Explorer.
Click the Start button.
OUTPUT :
Decision Tree
STEPS: (For Knowledge Flow)
Create the following flow, click the Run button, then right-click the TextViewer and select Show results.
Right-click the GraphViewer and click Show Plots.
OUTPUT:
Decision Tree
8. LAB SESSION 7: SUPPORT VECTOR MACHINE
AIM:
This experiment illustrates the use of Support vector classifier in weka. The sample data set used
in this experiment is vote dataset available in arff format. This document assumes that appropriate
data preprocessing has been performed.
Algorithm
Define an optimal hyperplane: maximize the margin.
Extend the above definition for non-linearly separable problems: have a penalty term for misclassifications.
Map data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that data is mapped implicitly to this space.
PROCEDURE:
1. We begin the experiment by loading the data (vote.arff) into weka.
2. Next we select the classify tab and click choose function button to select the Support vector
machine (SMO).
3. Now we specify the various parameters. These can be specified by clicking in the text box to the right of the Choose button.
4. Under the "Test options" panel, we select 10-fold cross-validation as our evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a reasonable idea of the accuracy of the generated model.
5. We now click Start to generate the model. The text description of the model as well as the evaluation statistics will appear in the right panel when the model construction is complete.
6. Note that the classification accuracy of the model is about 69%. This indicates that more work may be needed (either in preprocessing or in selecting the parameters for the classification).
7. The run information of the support vector classifier will be displayed with the correctly and
incorrectly classified instances.
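The corresponding API-level sketch is shown below; the file name vote.arff matches this lab, and the default SMO settings, 10-fold cross-validation, and seed of 1 are assumptions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SmoDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("vote.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // "Class" is the last attribute

        SMO smo = new SMO();                            // default polynomial kernel
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(smo, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());     // correctly/incorrectly classified instances
    }
}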
STEPS:
Open Weka tool – click Explorer in Weka GUI Chooser.
Go to the Classify tab – click the Choose button – select the SMO option.
OUTPUT:
9. LAB SESSION 8: APPLICATIONS OF CLASSIFICATION FOR WEB
MINING
AIM:
To analyze an application using the Weka tool.
9.1 WEB MINING
Use of data mining techniques to automatically discover interesting and potentially useful
information from Web documents and services.
Web mining may be divided into three categories.
Web content mining
Web structure mining
Web usage mining
Web mining is the application of data mining techniques to extract knowledge from web data, i.e.
web content, web structure, and web usage data.
9.1.1. WEB CONTENT MINING
Web content mining is the process of extracting useful information from the contents of web
documents. Content data is the collection of facts a web page is designed to contain. It may
consist of text, images, audio, video, or structured records such as lists and tables. Application of
text mining to web content has been the most widely researched. Issues addressed in text mining
include topic discovery and tracking, extracting association patterns, clustering of web
documents and classification of web pages. Research activities on this topic have drawn heavily
on techniques developed in other disciplines such as Information Retrieval (IR) and Natural
Language Processing (NLP). While there exists a significant body of work in extracting knowledge
from images in the fields of image processing and computer vision, the application of these
techniques to web content mining has been limited.
9.1.2. WEB STRUCTURE MINING
Hyperlinks
A hyperlink is a structural unit that connects a location in a web page to a different location,
either within the same web page or on a different web page. A hyperlink that connects to a
different part of the same page is called an intra-document hyperlink, and a hyperlink that
connects two different pages is called an inter-document hyperlink. There has been a significant body of work on hyperlink analysis; published surveys provide an up-to-date overview.
Document structure
In addition, the content within a Web page can also be organized in a tree-structured format,
based on the various HTML and XML tags within the page. Mining efforts here have focused on
automatically extracting document object model (DOM) structures out of documents.
9.1.3. WEB USAGE MINING
Web usage mining is the application of data mining techniques to discover interesting usage patterns from web usage data, in order to understand and better serve the needs of web-based applications. Usage data captures the identity or origin of web users along with their browsing behavior at a web site. Web usage mining itself can be classified further depending on the kind of usage data considered.
10. LAB SESSION 9: CASE STUDY ON TEXT MINING
AIM:
To perform text mining using the Weka tool.
What is text mining?
Data mining in text: find something useful and surprising from a text collection.
Text mining vs. information retrieval.
Data mining vs. database queries.
Information Retrieval
Information retrieval deals with the retrieval of information from a large number of text-based documents. Some of the issues addressed by database systems are usually not present in information retrieval systems, because the two handle different kinds of data. Examples of information retrieval systems include:
Online Library catalogue system.
Online Document Management Systems.
Web Search Systems etc.
The main problem in an information retrieval system is to locate relevant documents in a
document collection based on a user's query. This kind of user's query consists of some keywords
describing an information need.
In such search problems, the user takes an initiative to pull relevant information out from a
collection. This is appropriate when the user has ad-hoc information need, i.e., a short-term need.
But if the user has a long-term information need, then the retrieval system can also take an initiative
to push any newly arrived information item to the user.
This kind of access to information is called Information Filtering, and the corresponding systems are known as Filtering Systems or Recommender Systems.
Let the set of documents relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant and retrieved can be denoted as {Relevant} ∩ {Retrieved}. This can be shown in the form of a Venn diagram as follows.
There are three fundamental measures for assessing the quality of text retrieval:
Precision
Recall
F-score
PRECISION
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision
can be defined as
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
RECALL
Recall is the percentage of documents that are relevant to the query and were in fact retrieved.
Recall is defined as
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-SCORE
The F-score is a commonly used trade-off measure: an information retrieval system often needs to trade precision for recall, or vice versa. The F-score is defined as the harmonic mean of recall and precision:
F-score = 2 × (precision × recall) / (precision + recall)
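As a quick worked example with assumed numbers: if 100 documents are relevant to a query, 80 documents are retrieved, and 60 of the retrieved documents are relevant, then precision = 60/80 = 0.75, recall = 60/100 = 0.60, and F-score = 2 × (0.75 × 0.60) / (0.75 + 0.60) ≈ 0.67.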
SAMPLE EXERCISE
Spam.arff
Open the file spam.arff
Choose StringToWordVector
Click Start button.
Choose Edit -> Select Attribute as class
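The preprocessing performed by these steps can also be scripted with the Weka API. The sketch below is a minimal outline; it assumes spam.arff contains a free-text string attribute plus a nominal class label stored as the last attribute, which may differ from the actual file.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class TextMiningDemo {
    public static void main(String[] args) throws Exception {
        Instances raw = new DataSource("spam.arff").getDataSet();
        raw.setClassIndex(raw.numAttributes() - 1);   // assumed: class label is the last attribute

        // Turn the string attribute into word-presence attributes
        // (word counts and TF-IDF weighting can be enabled via the filter's options).
        StringToWordVector filter = new StringToWordVector();
        filter.setInputFormat(raw);
        Instances vectorized = Filter.useFilter(raw, filter);

        System.out.println("Attributes after filtering: " + vectorized.numAttributes());
    }
}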
OUTPUT: