
AJAY KUMAR GARG ENGINEERING COLLEGE
27th KM STONE, DELHI-HAPUR BYPASS ROAD, P.O. ADHYATMIK NAGAR, GHAZIABAD-201009

DATA WAREHOUSING AND DATA MINING LAB
SUBJECT CODE (KCS-751A)

B.TECH
(IV YEAR – VII SEM)
(2024-25)

LAB RECORD

Submitted by: Submitted To:


Name: RUSHIL GAUTAM MS. SHIVA TYAGI
Roll no. 2100271530067 AP, CSE DEPTT.
Year: 4th, VII SEM
Section: CSE(AIML)-2
DATA WAREHOUSING AND DATA MINING LAB – INDEX

S.No   Name of the Experiment                          Pg No   Date   Signature
1      Installation of WEKA Tool
2      Creating new ARFF File
3      Data Processing Techniques on Data Set
4      Data cube construction – OLAP operations
5      Implementation of Apriori algorithm
6      Implementation of FP-Growth algorithm
7      Implementation of Decision Tree Induction
8      Calculating Information gain measures
9      Classification of data using Bayesian approach
10     Implementation of K-means Algorithm
AJAY KUMAR GARG ENGINEERING COLLEGE, GHAZIABAD
Department of CSE (AIML)

Roll No. - 2100271530067    Name - RUSHIL GAUTAM
Subject - Data Warehousing & Data Mining Lab    Semester - 7
Year - 4th    Batch - B

Experiment 1: Installation of WEKA Tool

Aim: A. Investigate the application interfaces of the Weka tool.

Introduction
Weka (pronounced to rhyme with Mecca) is a workbench that contains a
collection of visualization tools and algorithms for data analysis and predictive
modeling, together with graphical user interfaces for easy access to these
functions. The original non-Java version of Weka was a Tcl/Tk front-end to
(mostly third-party) modeling algorithms implemented in other programming
languages, plus data preprocessing utilities in C, and a Makefile-based system for
running machine learning experiments. This original version was primarily
designed as a tool for analyzing data from agricultural domains, but the more
recent fully Java-based version (Weka 3), for which development started in 1997,
is now used in many different application areas, in particular for educational
purposes and research. Advantages of Weka include:

▪ Free availability under the GNU General Public License.


▪ Portability, since it is fully implemented in the Java programming
language and thus runs on almost any modern computing platform
▪ A comprehensive collection of data preprocessing and modeling techniques
▪ Ease of use due to its graphical user interfaces

Description:
Open the program. Once the program has been loaded on the user's machine it is
opened by navigating to the program's start option; this will depend on the
user's operating system. Figure 1.1 is an example of the initial opening screen on
a computer. There are four options available on this initial screen:

Signature of The Faculty
Fig: 1.1 Weka GUI

1. Explorer - the graphical interface used to conduct experimentation on raw
data. After clicking the Explorer button the Weka Explorer interface appears.


Fig: 1.2 Pre-processor



Inside the Weka Explorer window there are six tabs:

1. Preprocess - used to choose the data file to be used by the application.
Open File - allows the user to select files residing on the local machine or
recorded medium.
Open URL - provides a mechanism to locate a file or data source from a
different location specified by the user.
Open Database - allows the user to retrieve files or data from a database source
provided by the user.
2. Classify - used to test and train different learning schemes on the preprocessed
data file under experimentation.

Fig: 1.3 Choosing ZeroR from Classify

Again, there are several options to be selected inside the Classify tab. Test
options give the user the choice of using four different test mode scenarios on the
data set:
1. Use training set

2. Supplied test set


3. Cross validation


4. Percentage split

3. Cluster - used to apply different tools that identify clusters within the data file.
The Cluster tab opens the process that is used to identify commonalities or
clusters of occurrences within the data set for the user to analyze.

4. Select attributes - used to apply different rules to reveal changes based
on selected attributes' inclusion or exclusion from the experiment.

5. Visualize - used to see what the various manipulations produced on the data
set in a 2D format, in scatter plot and bar graph output.

2. Experimenter - this option allows users to conduct different experimental
variations on data sets and perform statistical manipulation. The Weka
Experiment Environment enables the user to create, run, modify, and analyze
experiments in a more convenient manner than is possible when processing the
schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyze the results to
determine if one of the schemes is (statistically) better than the other schemes.

Fig: 1.6 Weka experiment


Results destination: ARFF file, CSV file, JDBC database.
Experiment type: Cross-validation (default), Train/Test Percentage Split (data randomized).
Iteration control: Number of repetitions, Data sets first/Algorithms first.
Algorithms: filters

3. Knowledge Flow - basically the same functionality as Explorer with drag-and-drop
functionality. The advantage of this option is that it supports incremental
learning from previous results.
4. Simple CLI - provides users without a graphical interface option the ability to
execute commands from a terminal window.

Aim: B. Explore the default datasets in the Weka tool.

Click the "Open file…" button to open a data set and double click on the
"data" directory. Weka provides a number of small common machine
learning datasets that you can use to practice on. Select the "iris.arff" file to
load the Iris dataset.

Fig: 1.7 Different Data Sets in weka


Experiment 2: Creating new ARFF file


An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list
of instances sharing a set of attributes. ARFF files were developed by the Machine
Learning Project at the Department of Computer Science of The University of Waikato
for use with the Weka machine learning software. In WEKA, each data entry is an
instance of the Java class weka.core.Instance, and each instance consists of a number
of attribute values. For loading datasets, WEKA can load ARFF files. The
Attribute-Relation File Format has two sections:

1. The Header section defines the relation (dataset) name, attribute names, and types.


2. The Data section lists the data instances.

The figure above, from the textbook, shows an ARFF file for the weather
data. Lines beginning with a % sign are comments. There are three basic
keywords:


The external representation of an Instances class consists of:

▪ A header: describes the attribute types
▪ Data section: comma-separated list of data
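As a sketch of the two sections described above, the following Python snippet builds a minimal ARFF file for the (nominal) weather data as a string and splits it into its header and data parts. The file contents follow the standard Weka weather example, shortened to four instances for brevity:

```python
# A minimal ARFF file built as a string. @relation names the dataset,
# @attribute lines define the header, and @data starts the instance list.
arff_text = """% Weather dataset (comments start with %)
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
"""

# Drop comments and blank lines, then separate header from data.
lines = [l for l in arff_text.splitlines() if l and not l.startswith('%')]
header = [l for l in lines if l.startswith('@') and l != '@data']
data = lines[lines.index('@data') + 1:]

print(len(header))  # 1 @relation line + 5 @attribute lines
print(len(data))    # 4 comma-separated instances
```

Saving `arff_text` to a file with a `.arff` extension produces a file that WEKA's "Open file…" dialog can load directly.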


Experiment 3: Data Processing Techniques on Data Set

To search through all possible combinations of attributes in the data and find
which subset of attributes works best for prediction, set the attribute
evaluator to 'CfsSubsetEval' and the search method to 'BestFirst'. The
evaluator determines what method is used to assign a worth to each subset of
attributes. The search method determines what style of search to perform. The
options that you can set for selection in the 'Attribute Selection Mode' are shown
in fig no: 3.2:

1. Use full training set. The worth of the attribute subset is determined
using the full set of training data.

2. Cross-validation. The worth of the attribute subset is determined by a process
of cross-validation. The 'Fold' and 'Seed' fields set the number of folds to use
and the random seed used when shuffling the data.

Fig: 3.1 Choosing Cross validation
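As an illustration of what the 'Fold' and 'Seed' fields control, the following Python sketch (not WEKA's own code) shuffles the indices of a toy dataset with a fixed seed and cuts them into equal folds:

```python
import random

def kfold_indices(n, folds, seed):
    """Shuffle indices 0..n-1 with the given seed and cut them into folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # the seed makes the shuffle repeatable
    size = n // folds
    return [idx[i * size:(i + 1) * size] for i in range(folds)]

folds = kfold_indices(10, folds=5, seed=1)
print([len(f) for f in folds])  # five folds of two indices each
```

In each round of cross-validation one fold is held out for evaluation and the rest are used for training, so every instance is evaluated exactly once.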


Specify which attribute to treat as the class in the drop-down box below the test options.
Once all the test options are set, you can start the attribute selection process by clicking
on the 'Start' button. When it is finished, the results of selection are shown on the right
part of the window and an entry is added to the 'Result list'.

2. Visualizing Results:-

Fig: 3.2 Data Visualization


WEKA's visualization allows you to visualize a 2-D plot of the current working
relation. Visualization is very useful in practice; it helps to determine the difficulty of
the learning problem. WEKA can visualize single attributes (1-d) and pairs of attributes
(2-d), and rotate 3-d visualizations (Xgobi-style). WEKA has a "Jitter" option to deal
with nominal attributes and to detect "hidden" data points.


Fig 3.3: Preprocessing with jitter


Fig: 3.3 Data visualization


Aim: B. Pre-process a given dataset by handling missing values.

Process: Replacing Missing Attribute Values by the Attribute Mean. This method is
used for data sets with numerical attributes. An example of such a data set is
presented in fig no: 3.4.

Fig: 3.4 Missing values


In this method, every missing attribute value for a numerical attribute is replaced
by the arithmetic mean of the known attribute values. In the figure, the mean of the
known attribute values for Temperature is 99.2, hence all missing attribute values for
Temperature should be replaced by 99.2. The table with missing attribute values
replaced by the mean is presented in the figure. For the symbolic attributes Headache
and Nausea, missing attribute values were replaced using the most common value;
in WEKA this is done with the ReplaceMissingValues filter.
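A minimal Python sketch of this imputation rule (mean for numeric attributes, most common value for symbolic ones); the column values here are illustrative, not the textbook's exact table:

```python
from statistics import mean, mode

def impute(values):
    """Replace None entries with the mean (numeric) or mode (symbolic)."""
    known = [v for v in values if v is not None]
    fill = mean(known) if isinstance(known[0], (int, float)) else mode(known)
    return [fill if v is None else v for v in values]

temperature = [99.2, None, 99.6, 98.8, None]   # numeric attribute
headache = ['yes', 'no', None, 'yes', 'yes']   # symbolic attribute

print(impute(temperature))  # Nones replaced by the mean of 99.2, 99.6, 98.8
print(impute(headache))     # None replaced by the most common value 'yes'
```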


Fig: 3.6 Replaced values


Experiment 4: Data cube construction – OLAP operations

An OLAP cube is a term that typically refers to a multi-dimensional array of data.
OLAP is an acronym for online analytical processing,[1] which is a computer-based
technique of analyzing data to look for insights. The term cube here refers
to a multi-dimensional dataset, which is also sometimes called a hypercube if the
number of dimensions is greater than 3.

Operations:
1. Slice is the act of picking a rectangular subset of a cube by choosing a single
value for one of its dimensions, creating a new cube with one fewer dimension.[4]
The picture shows a slicing operation: the sales figures of all sales regions
and all product categories of the company in the years 2005 and 2006 are
"sliced" out of the data cube.

2. Dice: The dice operation produces a subcube by allowing the analyst to
pick specific values of multiple dimensions.[5] The picture shows a dicing
operation: the new cube shows the sales figures of a limited number of
product categories; the time and region dimensions cover the same range as
before.

3. Drill Down/Up allows the user to navigate among levels of data ranging
from the most summarized (up) to the most detailed (down).[4] The picture
shows a drill-down operation: the analyst moves from the summary category
"Outdoor-Schutzausrüstung" to see the sales figures for the individual
products.

4. Roll-up: A roll-up involves summarizing the data along a dimension. The
summarization rule might be computing totals along a hierarchy or applying a
set of formulas such as "profit = sales - expenses".

5. Pivot allows an analyst to rotate the cube in space to see its various faces.
For example, cities could be arranged vertically and products horizontally
while viewing data for a particular quarter. Pivoting could replace products
with time periods to see data across time for a single product.
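The operations above can be sketched on a tiny cube stored as a Python dict keyed by (region, product, year); the sales figures are made up purely for illustration:

```python
from collections import defaultdict

# Toy cube: (region, product, year) -> sales
cube = {
    ('North', 'Toys', 2005): 10, ('North', 'Toys', 2006): 12,
    ('North', 'Food', 2005): 7,  ('South', 'Toys', 2005): 5,
    ('South', 'Food', 2005): 9,  ('South', 'Food', 2006): 11,
}

# Slice: fix one dimension (year = 2005) -> a cube with one fewer dimension.
slice_2005 = {(r, p): v for (r, p, y), v in cube.items() if y == 2005}

# Dice: pick specific values on multiple dimensions at once.
dice = {k: v for k, v in cube.items()
        if k[0] == 'South' and k[2] in (2005, 2006)}

# Roll-up: summarize along the product dimension (totals per region/year).
rollup = defaultdict(int)
for (r, p, y), v in cube.items():
    rollup[(r, y)] += v

print(slice_2005)
print(dict(rollup))
```

A real OLAP server does the same bookkeeping over pre-aggregated storage, but the set semantics of slice, dice, and roll-up are exactly these.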




Experiment 5: Implementation of Apriori Algorithm

Description:
The Apriori algorithm is an influential algorithm for mining frequent item sets for
Boolean association rules. It uses a "bottom-up" approach, where frequent subsets
are extended one item at a time (a step known as candidate generation), and groups
of candidates are tested against the data.

Problem:

TID ITEMS
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5

To find frequent item sets for the above transactions with a minimum support of 2
and a confidence measure of 70% (i.e., 0.7).

Procedure:
Step 1:
Count the number of transactions in which each item occurs.

ITEM   NO. OF TRANSACTIONS
1      2
2      3
3      3
4      1
5      3


Step 2:
Eliminate all those items whose transaction count is less than the minimum
support (2 in this case).

ITEM   NO. OF TRANSACTIONS
1      2
2      3
3      3
5      3

These are the single items that are bought frequently. Now let's say we want to
find pairs of items that are bought frequently. We continue from the above
table (the table in step 2).


Step 3:
We start making pairs from the first item, like 1,2; 1,3; 1,5, and then from the
second item, like 2,3; 2,5. We do not form 2,1 because we already did 1,2 when
making pairs with 1, and buying 1 and 2 together is the same as buying 2 and 1
together. After making all the pairs we get:

ITEM PAIRS
1,2
1,3
1,5
2,3
2,5
3,5

Step 4:

Now, we count how many times each pair is bought together.

ITEM PAIRS   NO. OF TRANSACTIONS
1,2          1
1,3          2
1,5          1
2,3          2
2,5          3
3,5          2


Step 5:
Again remove all item pairs having a number of transactions less than 2.

ITEM PAIRS   NO. OF TRANSACTIONS
1,3          2
2,3          2
2,5          3
3,5          2

These pairs of items are bought frequently together. Now, let's say we want to
find a set of three items that are bought together. We use the above table (of step
5) and make a set of three items.

Step 6:
To make the set of three items we need one more rule (termed self-join): from the
item pairs in the above table, we find two pairs with the same first numeral, so we
get (2,3) and (2,5), which gives (2,3,5). Then we find how many times (2,3,5) is
bought together in the original table and we get the following:

ITEM SET   NO. OF TRANSACTIONS
(2,3,5)    2

Thus, the set of three items that are bought together from this data is (2, 3, 5).

Confidence:
We can take our frequent item set knowledge even further, by finding
association rules using the frequent item sets. In simple words, we know (2, 3, 5)
are bought together frequently, but what is the association between them? To do
this, we create a list of all subsets of the frequently bought items ((2, 3, 5) in our
case); we get the following subsets:


▪ {2}
▪ {3}
▪ {5}
▪ {2,3}
▪ {3,5}
▪ {2,5}

Now, we find associations among all the subsets.

{2}=>{3,5}: (If '2' is bought, what's the probability that '3' and '5' would be
bought in the same transaction?)
Confidence = P(2∩3∩5)/P(2) = 2/3 = 67%
{3}=>{2,5} = P(2∩3∩5)/P(3) = 2/3 = 67%
{5}=>{2,3} = P(2∩3∩5)/P(5) = 2/3 = 67%
{2,3}=>{5} = P(2∩3∩5)/P(2∩3) = 2/2 = 100%
{3,5}=>{2} = P(2∩3∩5)/P(3∩5) = 2/2 = 100%
{2,5}=>{3} = P(2∩3∩5)/P(2∩5) = 2/3 = 67%
Also, considering the remaining 2-item sets, we get the following associations:
{1}=>{3} = P(1∩3)/P(1) = 2/2 = 100%
{3}=>{1} = P(1∩3)/P(3) = 2/3 = 67%
{2}=>{3} = P(2∩3)/P(2) = 2/3 = 67%
{3}=>{2} = P(2∩3)/P(3) = 2/3 = 67%
{2}=>{5} = P(2∩5)/P(2) = 3/3 = 100%
{5}=>{2} = P(2∩5)/P(5) = 3/3 = 100%
{3}=>{5} = P(3∩5)/P(3) = 2/3 = 67%
{5}=>{3} = P(3∩5)/P(5) = 2/3 = 67%
Eliminate all those having confidence less than 70%. Hence, the rules are:
{2,3}=>{5}, {3,5}=>{2}, {1}=>{3}, {2}=>{5}, {5}=>{2}.
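The steps above can be sketched in Python; this is a minimal brute-force Apriori pass over the four transactions, not WEKA's implementation:

```python
from itertools import combinations

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_support, min_conf = 2, 0.7

def support(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level-wise search: keep 1-, 2-, and 3-itemsets meeting minimum support.
items = {i for t in transactions for i in t}
frequent = {}
for k in (1, 2, 3):
    for cand in combinations(sorted(items), k):
        s = support(set(cand))
        if s >= min_support:
            frequent[cand] = s

# Rules X => Y with confidence = support(X ∪ Y) / support(X) >= 70%.
rules = []
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            conf = frequent[itemset] / support(set(lhs))
            if conf >= min_conf:
                rules.append((set(lhs), set(itemset) - set(lhs)))

print(frequent)
print(rules)  # the same 5 rules derived by hand above
```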

⮚ Now these manual results should be checked against the rules generated in WEKA.


So first create a CSV file for the above problem; the CSV file will look like
the rows and columns in the above figure. This file is written in an Excel sheet.

Procedure for running the rules in Weka:

Step 1:
Open the Weka Explorer, open the file, and then select all the item sets. The figure
gives a better understanding of how to do that.


Step 2:
Now select the Associate tab and then choose the Apriori algorithm, setting
the minimum support and confidence as shown in the figure.


Step 3:
Now run the Apriori algorithm with the set values of minimum support and
confidence. After running, Weka generates the association rules and the respective
confidence with minimum support as shown in the figure.
The above CSV file has generated 5 rules as shown in the figure.

Conclusion:
As we have seen, the rules generated by us manually and by Weka are matching;
hence the rules generated are 5.


Experiment 6: Implementation of FP-Growth Algorithm

Aim: To generate association rules using the FP-Growth Algorithm

PROBLEM:
Find all frequent item sets in the following dataset using the FP-growth
algorithm, with minimum support = 2 and confidence = 70%.

TID ITEMS
100 1,3,4
200 2,3,5
300 1,2,3,5
400 2,5

Solution:
As in the Apriori algorithm, find the frequency of occurrence of each item in the
dataset and then prioritize the items in descending order of frequency of occurrence.
Eliminating the items with a count less than the minimum support and assigning the
priorities, we obtain the following table.

ITEM   NO. OF TRANSACTIONS   PRIORITY
1      2                     4
2      3                     1
3      3                     2
5      3                     3

Re-arranging the original table, we obtain


TID   ITEMS
100   3,1
200   2,3,5
300   2,3,5,1
400   2,5
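The re-ordering step above can be sketched in Python: infrequent items are dropped and the rest of each transaction is sorted by descending frequency before the tree is built (a sketch of the preprocessing only, not a full FP-tree implementation):

```python
from collections import Counter

transactions = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
min_support = 2

# Count item frequencies and keep only items meeting the minimum support.
counts = Counter(i for t in transactions for i in t)
frequent = {i: c for i, c in counts.items() if c >= min_support}

# Re-order each transaction: drop infrequent items, sort the rest by
# descending frequency (ties broken by item value for determinism).
ordered = [
    sorted((i for i in t if i in frequent),
           key=lambda i: (-frequent[i], i))
    for t in transactions
]
print(ordered)  # matches the re-arranged table above
```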

Construction of the tree:
Note that all FP-trees have a 'null' node as the root. So, draw the root node
first, attach the items of row 1 one by one, and write their occurrence counts
next to them. The tree is further expanded by adding nodes according to the
prefix paths formed and by incrementing the occurrence counts every time the
items occur again, and hence the tree is built.

Prefixes (conditional pattern bases):

▪ 1 -> {3}:1, {2,3,5}:1
▪ 5 -> {2,3}:2, {2}:1
▪ 3 -> {2}:2

Frequent item sets:

▪ 1 -> 3:2 /* 2 and 5 are eliminated because their counts are less than the
minimum support, and the count of 3 is obtained by adding the occurrences
in both instances */
▪ Similarly, 5 -> 2,3:2; 2:3; 3:2
▪ 3 -> 2:2

Therefore, the frequent item sets are {3,1}, {2,3,5}, {2,5}, {2,3}, {3,5}.
The tree is constructed as below:



Generating the association rules from the tree and calculating the confidence
measures, we get:

▪ {3}=>{1} = 2/3 = 67%
▪ {1}=>{3} = 2/2 = 100%
▪ {2}=>{3,5} = 2/3 = 67%
▪ {2,5}=>{3} = 2/3 = 67%
▪ {3,5}=>{2} = 2/2 = 100%
▪ {2,3}=>{5} = 2/2 = 100%
▪ {3}=>{2,5} = 2/3 = 67%
▪ {5}=>{2,3} = 2/3 = 67%
▪ {2}=>{5} = 3/3 = 100%
▪ {5}=>{2} = 3/3 = 100%
▪ {2}=>{3} = 2/3 = 67%
▪ {3}=>{2} = 2/3 = 67%

Thus, eliminating all the rules having confidence less than 70%, we obtain the
following conclusions:
{1}=>{3}, {3,5}=>{2}, {2,3}=>{5}, {2}=>{5}, {5}=>{2}.

As we see, there are 5 rules generated manually, and these are to be checked
against the results in WEKA. In order to check the results in the tool we need to
follow a procedure similar to Apriori.


So first create a CSV file for the above problem; the CSV file will look like
the rows and columns in the above figure. This file is written in an Excel sheet.

Procedure for running the rules in Weka:

Step 1:
Open the Weka Explorer, open the file, and then select all the item sets. The figure
gives a better understanding of how to do that.


Step 2:
Now select the Associate tab and then choose the FP-Growth algorithm, setting
the minimum support and confidence as shown in the figure.


Step 3:
Now run the FP-Growth algorithm with the set values of minimum support and
confidence. After running, Weka generates the association rules and the respective
confidence with minimum support as shown in the figure.

The above CSV file has generated 5 rules as shown in the figure.

Conclusion:
As we have seen, the rules generated by us manually and by Weka are matching;
hence the rules generated are 5.

