JANSONS INSTITUTE OF TECHNOLOGY
Karumathampatti, Coimbatore
Laboratory Record
Name :
Register No. :
Branch : B.E. – Computer Science and Engineering
Semester : V
Academic Year : 2023-2024
INDEX
EX.NO:1 DATE:
Introduction:
Weka (pronounced to rhyme with Mecca) is a workbench that contains a collection of
visualization tools and algorithms for data analysis and predictive modeling, together with
graphical user interfaces for easy access to these functions. The original non-Java version of
Weka was a Tcl/Tk front-end to (mostly third-party) modeling algorithms implemented in other
programming languages, plus data preprocessing utilities in C and a Makefile-based system for
running machine learning experiments. This original version was primarily designed as a tool for
analyzing data from agricultural domains, but the more recent fully Java-based version (Weka 3),
for which development started in 1997, is now used in many different application areas, in
particular for educational purposes and research. Advantages of Weka include:
▪ Free availability under the GNU General Public License
▪ Portability, since it is fully implemented in the Java programming language and thus runs
on almost any modern computing platform
▪ A comprehensive collection of data preprocessing and modeling techniques
▪ Ease of use due to its graphical user interfaces
Description:
Open the program. Once the program has been loaded on the user's machine, it is opened by
navigating to the program's start option, which depends on the user's operating system.
Figure 1.1 is an example of the initial opening screen on a computer.
There are four options available on this initial screen.
Fig: 1.1 Weka GUI
1. Explorer - the graphical interface used to conduct experimentation on raw data. After
clicking the Explorer button, the Weka Explorer interface appears.
Inside the Weka Explorer window there are six tabs:
1. Preprocess - used to choose the data file to be used by the application.
Open File - allows the user to select files residing on the local machine or on recorded media.
Open URL - provides a mechanism to locate a file or data source from a different location
specified by the user.
Open Database - allows the user to retrieve files or data from a database source provided by the user.
2. Classify - used to test and train different learning schemes on the preprocessed data file under
experimentation.
Again, there are several options to be selected inside the Classify tab. The test options give the
user the choice of four different test-mode scenarios on the data set:
1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split
3. Cluster - used to apply different tools that identify clusters within the data file.
The Cluster tab opens the process that is used to identify commonalities or clusters of occurrences
within the data set and produce information for the user to analyze.
4. Associate - used to apply different rules to the data file that identify associations within the
data. The Associate tab opens a window to select the options for associations within the dataset.
5. Select attributes - used to apply different rules to reveal changes based on a selected attribute's
inclusion in or exclusion from the experiment.
6. Visualize - used to see what the various manipulations produced on the data set in a 2D format,
as scatter plot and bar graph output.
2. Experimenter - this option allows users to conduct different experimental variations on data
sets and perform statistical manipulation. The Weka Experiment Environment enables the user to
create, run, modify, and analyze experiments in a more convenient manner than is possible when
processing the schemes individually. For example, the user can create an experiment that runs
several schemes against a series of datasets and then analyze the results to determine if one of the
schemes is (statistically) better than the other schemes.
3. Knowledge Flow - provides basically the same functionality as the Explorer, with drag-and-drop
functionality. The advantage of this option is that it supports incremental learning from
previous results.
4. Simple CLI - provides users without a graphical interface the ability to
execute commands from a terminal window.
b. Explore the default datasets in the Weka tool.
Click the "Open file…" button to open a data set and double click on the "data"
directory. Weka provides a number of small common machine learning datasets that
you can use to practice on. Select the "iris.arff" file to load the Iris dataset.
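The same dataset can also be inspected outside the GUI. Below is a minimal Python sketch using SciPy and pandas (this is an illustration only, assuming both libraries are installed and iris.arff has been copied from Weka's data directory into the working directory):

from scipy.io import arff
import pandas as pd

# Load the Iris ARFF file shipped with Weka (copied into the working directory)
data, meta = arff.loadarff("iris.arff")
df = pd.DataFrame(data)

print(meta)        # attribute names and types declared in the ARFF header
print(df.head())   # first few instances of the dataset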
Result:
Thus the data exploration and integration with WEKA was studied.
EX.NO:2 DATE:
AIM:
To use the Weka tool for data validation. Students will learn how to load datasets into
Weka, explore data characteristics, handle missing values, detect outliers, and ensure data
quality through various preprocessing techniques.
Algorithm:
Load Dataset Algorithm:
● Open Weka.
● Navigate to the "Explorer" tab.
● Click on the "Open file" button and select a dataset (e.g., from the Weka
datasets or upload a CSV file).
Missing Values Handling Algorithm:
● Explore the dataset using the "Preprocess" panel in the "Explorer" tab.
● Identify missing values using the "Edit" option or utilize filters for missing
value identification.
● Choose an imputation strategy (e.g., mean, median) and apply it using Weka's
preprocessing tools.
Outlier Detection Algorithm:
● Visualize potential outliers using the "Visualize" panel in the "Explorer" tab.
● Utilize scatter plots or box plots to identify outliers.
● Apply filters or transformations to handle outliers using Weka's preprocessing
capabilities.
Data Quality Assurance Algorithm:
● Evaluate data quality using summary statistics and visualizations.
● Apply filters for data quality assurance, such as removing instances with
inconsistent data.
● Examine the impact of filters on data consistency, accuracy, and reliability.
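The same steps can be sketched outside the Weka GUI. The following is a minimal pandas sketch (the file name data.csv and the numeric column "age" are assumptions for this illustration, not part of the original record) that imputes missing values with the mean and flags outliers using the IQR rule:

import pandas as pd

# Load a hypothetical dataset (placeholder file name)
df = pd.read_csv("data.csv")

# Missing-value handling: impute numeric columns with the column mean
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Outlier detection: flag values outside 1.5 * IQR for the assumed 'age' column
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]
print("Potential outliers:", len(outliers))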
Procedure:
Introduction to Weka:
● Install Weka on your system following the provided guidelines.
● Launch Weka and explore the user interface.
Loading Datasets:
● Load a sample dataset into Weka.
● Examine the dataset details using the "Explorer" tab.
Handling Missing Values:
● Identify and visualize missing values.
● Apply imputation techniques to handle missing values.
Outlier Detection:
● Visualize potential outliers.
● Apply outlier handling techniques using Weka's filters.
Output:
Result:
Thus, the program for applying the Weka tool for data validation is implemented successfully
and the output is verified.
EX.NO:3 DATE:
AIM:
To guide the process of planning the architecture for a real-time application in data
warehousing.
Procedure:
Code:
Output:
java -cp weka.jar weka.filters.unsupervised.attribute.ReplaceMissingValues -i example_dataset.arff -o example_dataset_no_missing.arff
● Flask app running, ready to serve machine learning models.
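As an illustration of the output line above, a minimal Flask sketch is given below. The route name /predict and the placeholder prediction logic are assumptions for this example, not the record's own application:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Placeholder logic; a real app would load a trained model and call it here
    features = request.get_json()
    return jsonify({"received": features, "prediction": "not yet implemented"})

if __name__ == "__main__":
    app.run(debug=True)  # Flask app running, ready to serve machine learning models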
Result:
Thus the program has been completed successfully and the output is verified.
EX.NO:4 DATE:
Aim:
Create a schema definition query for a simple database that stores information about
books, including their title, author, publication year, and genre.
Algorithm:
Step 1: Identify the entities: In this case, you have a "Book" entity.
Step 2: Define attributes: Determine the attributes each book should have, such as title,
author, publication year, and genre.
Step 3: Determine data types: Choose appropriate data types for each attribute
(e.g., VARCHAR for title and author, INTEGER for publication year).
Step 4: Set primary key: Decide on a primary key for the table (e.g., book_id).
Step 5: Establish relationships (if applicable): If there are multiple tables, define
relationships between them.
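A minimal sketch of the resulting schema definition is given below, written here with Python's built-in sqlite3 module. The table and column names follow the algorithm above; the database file name books.db is a placeholder, and the exact SQL dialect may differ in your DBMS:

import sqlite3

conn = sqlite3.connect("books.db")  # hypothetical database file
conn.execute("""
    CREATE TABLE IF NOT EXISTS Books (
        book_id INTEGER PRIMARY KEY,
        title VARCHAR(255) NOT NULL,
        author VARCHAR(255) NOT NULL,
        publication_year INTEGER,
        genre VARCHAR(100)
    )
""")
conn.commit()
conn.close()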
Program:
# generate_arff.py
import random

genres = ["Fiction", "Dystopian", "Romance", "Mystery", "Fantasy", "Science Fiction"]

def random_genre():
    # Helper used by the ARFF-generation script to pick a random genre for a book record
    return random.choice(genres)
Sample output:
ARFF file generated successfully: books_dataset.arff
Result:
Once you execute the query, the result would be the creation of a table in your database
named "Books" with the defined structure. You can then use this table to store
information about books in your database.
EX.NO:5 DATE:
Aim:
To design a data warehouse for a real-time application.
Algorithm:
Program:
import pandas as pd  # imported by the original program; unused in this fragment

# Placeholder rows standing in for the warehouse aggregation query output
# (illustrative sample data only)
query_result = [{"product": "Laptop", "customer": "Alice", "SUM(sales)": 1200},
                {"product": "Phone", "customer": "Bob", "SUM(sales)": 800}]

# Displaying the sample output
print("Product\t\tCustomer\tTotal Sales")
print("-----------------------------------")
for row in query_result:
    print(f"{row['product']}\t\t{row['customer']}\t\t{row['SUM(sales)']}")
Sample output:
Step 1:
Step 2:
Step 3:
Step 4:
Result:
Thus the data warehouse for a real-time application is designed successfully.
EX.NO:6 DATE:
AIM:
To design Star, Snowflake, and Fact Constellation schemas for sales enterprise data using the DBDesigner tool.
Procedure:
A schema is a logical description of the entire database. It includes the name and description
of records of all record types, including all associated data items and aggregates. Much like
a database, a data warehouse also requires a schema to be maintained. A database uses the relational
model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema.
Star Schema
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.
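A minimal sketch of such a star schema is given below, written with Python's sqlite3 module. The table and column names are illustrative only; the four dimensions are assumed here to be time, item, branch, and location, which is not stated explicitly above:

import sqlite3

conn = sqlite3.connect("sales_star.db")  # hypothetical database file
conn.executescript("""
    -- Dimension tables (assumed: time, item, branch, location)
    CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
    CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
    CREATE TABLE dim_branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, country TEXT);

    -- Central fact table holding the keys of the four dimensions plus the measures
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES dim_time(time_key),
        item_key INTEGER REFERENCES dim_item(item_key),
        branch_key INTEGER REFERENCES dim_branch(branch_key),
        location_key INTEGER REFERENCES dim_location(location_key),
        dollars_sold REAL,
        units_sold INTEGER
    );
""")
conn.close()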
Snowflake Schema
• Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier_key.
• The supplier_key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as a galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
• The sales fact table is the same as that in the star schema.
• The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
• The shipping fact table also contains two measures, namely dollars sold and units
sold.
• It is also possible to share dimension tables between fact tables. For example, the time,
item, and location dimension tables are shared between the sales and shipping fact
tables.
Result:
Star, Snowflake, and Fact Constellation schemas for sales enterprise data have been
designed using the DBDesigner tool.
EX.NO:7 DATE:
AIM:
Develop a comprehensive case study utilizing OLAP technology to analyze and showcase
its effectiveness in data management and decision-making processes.
Procedure:
1. OLAP Operations:
a. Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP operations on multidimensional data.
b. Here is the list of OLAP operations:
c. Roll-up (Drill-up)
d. Drill-down
e. Slice and dice
f. Pivot (rotate)
2. Roll-up (Drill-up):
a. Roll-up performs aggregation on a data cube in any of the following ways: by climbing up a concept hierarchy for a dimension, or by dimension reduction.
b. In this example, roll-up is performed by climbing up the concept hierarchy for the dimension location.
c. Initially the concept hierarchy was "street < city < province < country".
d. On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country, so the data is grouped into countries rather than cities.
e. When roll-up is performed, one or more dimensions are removed from the data cube.
3. Drill-down:
a. Drill-down is the reverse operation of roll-up. It is performed in either of the following ways: by stepping down a concept hierarchy for a dimension, or by introducing a new dimension.
b. In this example, drill-down is performed by stepping down the concept hierarchy for the dimension time.
c. Initially the concept hierarchy was "day < month < quarter < year".
d. On drilling down, the time dimension is descended from the level of quarter to the level of month.
e. When drill-down is performed, one or more dimensions are added to the data cube.
f. It navigates the data from less detailed data to highly detailed data.
4. Slice:
a. The slice operation selects one particular dimension from a given cube and provides a new sub-cube.
5. Dice:
a. Dice selects two or more dimensions from a given cube and provides a new sub-cube.
6. Pivot (rotate):
a. The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. (A pandas sketch of these operations is given after this procedure.)
Now, we practically implement all these OLAP operations using Microsoft Excel.
Procedure for OLAP Operations:
1. Open Microsoft Excel, go to the Data tab at the top and click on "Existing Connections".
2. The Existing Connections window will open; there, the "Browse for more" option should be clicked to import a file with the .cub extension for performing OLAP operations. As a sample, the music.cub file is used.
3. As shown in the above window, select "PivotTable Report" and click "OK".
4. We now have all the music.cub data for analyzing the different OLAP operations. First, the
drill-down operation is performed as shown below.
5. Next, we perform the roll-up (drill-up) operation. In the above window, the month January is
selected, and the Drill-up option is automatically enabled at the top. Clicking the Drill-up
option displays the window shown below.
6. The next OLAP operation, slicing, is performed by inserting a slicer as shown in the top
navigation options. While inserting slicers for the slicing operation, we select two dimensions
(e.g., CategoryName and Year) with only one measure (e.g., Sum of Sales). After inserting a
slicer and adding a filter (CategoryName: AVANT ROCK and BIG BAND; Year: 2009 and 2010), we
get the table shown below.
7. The dicing operation is similar to the slicing operation. Here we select three dimensions
(CategoryName, Year, RegionCode) and two measures (Sum of Quantity, Sum of Sales) through the
"Insert Slicer" option, and then add a filter for CategoryName, Year, and RegionCode as
shown below.
8. Finally, the pivot (rotate) OLAP operation is performed by swapping the rows (Order Date -
Year) and columns (Values - Sum of Quantity and Sum of Sales) through the navigation pane at the
bottom right, as shown below.
After swapping (rotating), we get the result represented below, with a pie chart for the
Classical category and year-wise data.
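The following is a minimal pandas sketch of the same conceptual operations on an illustrative sales table. The column names and data values are assumptions for this example and are not taken from music.cub:

import pandas as pd

# Illustrative sales data (placeholder values)
sales = pd.DataFrame({
    "country": ["India", "India", "USA", "USA"],
    "city": ["Delhi", "Mumbai", "Chicago", "Chicago"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "item": ["Mobile", "Modem", "Mobile", "Modem"],
    "sales": [100, 150, 200, 120],
})

# Roll-up: climb the location hierarchy from city to country
rollup = sales.groupby("country")["sales"].sum()

# Drill-down: descend the time hierarchy from quarter to month
drilldown = sales.groupby(["quarter", "month"])["sales"].sum()

# Slice: select a single value of one dimension
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select values on two or more dimensions
dice = sales[(sales["country"] == "India") & (sales["item"] == "Mobile")]

# Pivot (rotate): swap the axes of the presentation
pivot = sales.pivot_table(values="sales", index="item", columns="country", aggfunc="sum")

print(rollup, drilldown, slice_q1, dice, pivot, sep="\n\n")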
Result:
Thus the case study using OLAP has been completed successfully.
EX.NO:8 DATE:
AIM:
The primary objective of this lab practical is to provide students with a comprehensive
understanding of Online Transaction Processing (OLTP) through a practical case study. By
the end of this exercise, students should be adept at designing OLTP databases, managing
transactions, handling concurrency, ensuring data integrity, and optimizing performance.
Procedure:
In this scenario, we will focus on the implementation of OLTP in the context of a small-scale
e-commerce platform. The platform needs to manage customer information, product
inventory, and order processing efficiently. Students will go through the process of setting up
the database, simulating transactions, addressing concurrency issues, exploring isolation
levels, maintaining data integrity, and optimizing system performance.
Steps:
1. Database Setup:
2. Transaction Simulation:
Insert orders into the Orders table, ensuring each order involves multiple products and links
to the respective customers.
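A minimal sqlite3 sketch of such a transaction is given below. The table names Orders and OrderItems and their columns are assumptions for this illustration, since the record's own SQL scripts appear only as screenshots:

import sqlite3

conn = sqlite3.connect("shop.db")  # hypothetical OLTP database
conn.executescript("""
    CREATE TABLE IF NOT EXISTS Orders     (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE IF NOT EXISTS OrderItems (order_id INTEGER, product_id INTEGER, quantity INTEGER);
""")

try:
    with conn:  # one transaction: commits on success, rolls back on error
        cur = conn.execute("INSERT INTO Orders (customer_id) VALUES (?)", (1,))
        order_id = cur.lastrowid
        # One order linking to multiple products
        conn.executemany(
            "INSERT INTO OrderItems (order_id, product_id, quantity) VALUES (?, ?, ?)",
            [(order_id, 101, 2), (order_id, 205, 1)],
        )
except sqlite3.Error as exc:
    print("Transaction rolled back:", exc)
finally:
    conn.close()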
3. Concurrency Control:
4. Isolation Levels:
Document the advantages and disadvantages of each isolation level.
5. Data Integrity:
6. Performance Testing:
After measuring the initial system performance, students are expected to optimize the
database schema or queries to enhance efficiency. This could involve indexing,
denormalization, or query optimization strategies. For instance:
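As one hypothetical instance of such an optimization (the index, table, and column names below are assumptions for illustration, not the record's own example), an index can be added on the column used when joining order items to orders:

import sqlite3

conn = sqlite3.connect("shop.db")  # hypothetical OLTP database from the earlier step
# Index the foreign key used when looking up the items of an order
conn.execute("CREATE INDEX IF NOT EXISTS idx_orderitems_order_id ON OrderItems (order_id)")
conn.commit()
conn.close()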
Subsequently, students should rerun the performance tests, comparing the results before and
after optimization.
7. Documentation:
● Performance Testing: Present the methodology for performance testing, including
the SQL scripts used to simulate a large number of transactions. Include any
optimizations made to enhance system performance.
● Challenges Faced: Acknowledge any challenges encountered during the lab and
describe how they were addressed. This could include difficulties in implementing
certain features, managing concurrency, or optimizing performance.
● Conclusion: Summarize the key findings and lessons learned from the lab. Reflect on
the significance of OLTP concepts in the context of the case study.
Result:
Thus the case study using OLTP has been completed successfully.
EX.NO: 9 DATE:
AIM:
To apply the Naive Bayes classification for testing the given dataset.
Algorithm:
Example: predict whether a customer will buy a computer or not.
● Customers are described by two attributes: age and income.
● X is a 35-year-old customer with an income of 40K.
● H is the hypothesis that the customer will buy a computer.
● P(H|X) reflects the probability that customer X will buy a computer given that we know the
customer's age and income.
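A minimal sketch of this idea with scikit-learn's Gaussian Naive Bayes is given below. The training data is invented purely for illustration; the record's actual input data follows in the screenshots:

from sklearn.naive_bayes import GaussianNB

# Illustrative training data: [age, income in thousands] -> buys_computer (1 = yes, 0 = no)
X_train = [[25, 30], [35, 40], [45, 80], [20, 20], [50, 90], [30, 35]]
y_train = [0, 1, 1, 0, 1, 1]

model = GaussianNB()
model.fit(X_train, y_train)

# X: a 35-year-old customer with an income of 40K
x_new = [[35, 40]]
print("Predicted class:", model.predict(x_new)[0])
print("P(H|X) for [no, yes]:", model.predict_proba(x_new)[0])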
Input Data:
Output Data:
Result:
Thus the Naive Bayes classification for testing the given dataset is implemented.
EX.NO:10 (a) DATE:
Aim:
To convert a text file to ARFF (Attribute-Relation File Format) using the Weka 3.8.2 tool.
Objectives:
Most of the data that we have collected from public forums is in text format, which cannot
be read directly by the Weka tool. Since Weka (a data mining tool) mainly works with data in
ARFF format, we have to convert the text file into an ARFF file.
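A minimal sketch of such a conversion in Python is given below. The file names input.txt and output.arff are placeholders, and the sketch assumes a comma-separated text file with a header row; real data would need its attribute types declared appropriately instead of treating every column as a string:

# text_to_arff.py: convert a comma-separated text file into an ARFF file
with open("input.txt") as src:          # hypothetical input file with a header row
    header = src.readline().strip().split(",")
    rows = [line.strip() for line in src if line.strip()]

with open("output.arff", "w") as dst:
    dst.write("@relation converted_text\n\n")
    for name in header:
        dst.write(f"@attribute {name} string\n")  # every column declared as string here
    dst.write("\n@data\n")
    for row in rows:
        # quote each value so spaces inside fields stay intact
        dst.write(",".join("'" + value + "'" for value in row.split(",")) + "\n")

print("Converted input.txt to output.arff")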
Algorithm:
Output:
Data ARFF File:
Result:
Thus, conversion of a text file to ARFF (Attribute-Relation File Format) using the Weka 3.8.2 tool
is implemented.
EX.NO:10 (b) DATE:
Aim:
To convert an ARFF (Attribute-Relation File Format) file into a text file.
Objectives:
Since the data in the Weka tool is in ARFF file format, we have to convert the
ARFF file to text format for further processing.
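A minimal sketch of the reverse conversion in Python is given below. The file names input.arff and output.txt are placeholders; the header declarations are skipped and the data rows are written out as plain comma-separated text:

# arff_to_text.py: write the data section of an ARFF file as plain text
with open("input.arff") as src:         # hypothetical input file
    lines = [line.strip() for line in src]

# Everything after the @data marker is the actual data
data_start = next(i for i, line in enumerate(lines) if line.lower().startswith("@data")) + 1
rows = [line for line in lines[data_start:] if line and not line.startswith("%")]

with open("output.txt", "w") as dst:
    dst.write("\n".join(rows) + "\n")

print(f"Wrote {len(rows)} data rows to output.txt")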
Algorithm:
Data Text File:
Result:
Thus, conversion of an ARFF (Attribute-Relation File Format) file into a text file is implemented.