EX.NO:1
DATE: EXPLORE DATA AND PERFORM INTEGRATION WITH WEKA
AIM:
To explore the data and perform integration with WEKA.
PROCEDURE:
To install WEKA on your machine, visit WEKA's official website and download the installation file. WEKA supports installation on Windows, Mac OS X and Linux. Follow the instructions on the download page to install WEKA for your OS.
The WEKA GUI Chooser application will start and you will see the following screen.
The GUI Chooser application allows you to run five different types of applications
as listed here:
Explorer
Experimenter
Knowledge Flow
Workbench
Simple CLI
Cluster Tab
Under the Cluster tab, there are several clustering algorithms provided, such as SimpleKMeans, FilteredClusterer, HierarchicalClusterer, and so on.
Associate Tab
Under the Associate tab, you would find Apriori, FilteredAssociator and FPGrowth.
Select Attributes Tab
Select Attributes allows you to perform feature selection based on several algorithms such as ClassifierSubsetEval, PrincipalComponents, etc.
Visualize Tab
Lastly, the Visualize option allows you to visualize your processed data for analysis. As you
noticed, WEKA provides several ready-to-use algorithms for testing and building your
machine learning applications. To use WEKA effectively, you must have a sound knowledge
of these algorithms, how they work, which one to choose under what circumstances, what to
look for in their processed output, and so on. In short, you must have a solid foundation in
machine learning to use WEKA effectively in building your apps.
Loading Data
Arff Format
An Arff file contains two sections - header and data.
The header describes the attribute types.
The data section contains a comma separated list of data.
As an example of the Arff format, the weather data file loaded from the WEKA sample databases is shown below:
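A short excerpt in this format (based on the standard weather sample data distributed with WEKA; only the first few data rows are shown, for illustration):
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes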
From this file, you can infer the following points:
The @relation tag defines the name of the database.
The @attribute tag defines the attributes.
The @data tag starts the list of data rows, each containing the comma separated fields.
The attributes can take nominal values, as in the case of outlook shown here:
@attribute outlook {sunny, overcast, rainy}
The attributes can take real values, as in this case:
@attribute temperature real
You can also set a Target or a Class variable called play, as shown here:
@attribute play {yes, no}
The Target assumes two nominal values, yes or no.
Understanding Data
Let us first look at the highlighted Current relation sub window. It shows the name of the
database that is currently loaded. You can infer two points from this sub window:
There are 14 instances - the number of rows in the table.
The table contains 5 attributes - the fields, which are discussed in the upcoming
sections.
On the left side, notice the Attributes sub window that displays the various fields in
the database.
The weather database contains five fields - outlook, temperature, humidity, windy and play. When you select an attribute from this list by clicking on it, further details on the attribute itself are displayed on the right-hand side.
Let us select the temperature attribute first. When you click on it, you would see the
following screen:
To remove attributes, select them and click on the Remove button at the bottom. The selected attributes will be removed from the database. After you fully preprocess the data, you can save it for model building.
Next, you will learn to preprocess the data by applying filters on this data.
Data Integration
Suppose you have two datasets, as shown below, and need to merge them together.
RESULT:
Thus, the WEKA software was installed, and data exploration and integration were performed successfully.
EX.NO:2
DATE: APPLY WEKA TOOL FOR DATA VALIDATION
AIM:
To apply the WEKA tool for data validation.
PROCEDURE:
Data validation is the process of verifying and validating data that is collected
before it is used. Any type of data handling task, whether it is gathering data, analyzing
it, or structuring it for presentation, must include data validation to ensure accurate
results.
1. Data Sampling
Click on Choose (certain sample datasets do not allow this operation; the breast-cancer dataset is used for this experiment).
Filters -> supervised -> instance -> Resample
Click on the name of the algorithm to change its parameters.
Change biasToUniformClass to obtain a biased sample. If you set it to 1, the resulting dataset will have an equal number of instances for each class, e.g. breast-cancer: 20 positive, 20 negative.
Change noReplacement accordingly.
Change sampleSizePercent accordingly.
2. Removing Duplicates
Filters -> unsupervised -> instance -> RemoveDuplicates
3. Data Reduction (PCA)
Load the iris dataset.
Filters -> unsupervised -> attribute -> PrincipalComponents
The original iris dataset has 5 columns (4 data + 1 class). Let us reduce that to 3 columns (2 data + 1 class).
4. Data Transformation
Normalization
Load the iris dataset.
Filters -> unsupervised -> attribute -> Normalize
Normalization is important when you do not know the distribution of the data beforehand.
Scale is the length of the number line and translation is the lower bound.
Ex: scale 2 and translation -1 => range -1 to 1; scale 4 and translation -2 => range -2 to 2.
This filter gets applied to all numeric columns; you cannot selectively normalize.
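As a worked example (assuming the Normalize filter maps a value x in a column with minimum min and maximum max to (x - min)/(max - min) * scale + translation, and taking the iris sepal length column with minimum 4.3 and maximum 7.9): with scale 2 and translation -1, the value 5.8 maps to (5.8 - 4.3)/(7.9 - 4.3) * 2 - 1 ≈ -0.17.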
Standardization
Load iris dataset.
Standardization is used when the dataset is known to follow a Gaussian (bell curve) distribution.
Filters -> unsupervised -> attribute -> Standardize
This filter gets applied to all numeric columns; you cannot selectively standardize.
Discretization
Load diabetes dataset.
Discretization comes in handy when using decision trees.
Suppose you need to change the weight column to two values such as low and high.
Set attributeIndices to column number 6.
Set bins to 2 (low/high).
When you set useEqualFrequency to true, there will be an equal number of high and low entries in the final column.
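For instance (assuming bins is 2 and useEqualFrequency is true on the 768-instance diabetes data), the cut point is placed near the column's median, so roughly 384 instances fall into each of the low and high bins; with useEqualFrequency set to false, the column's range is instead split into two equal-width intervals.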
RESULT:
Thus, the WEKA tool was applied for data validation and the data was validated successfully.
EX.NO:3
DATE: PLAN THE ARCHITECTURE FOR REAL TIME APPLICATION
AIM:
To plan the architecture for real time application.
PROCEDURE:
DESIGN STEPS:
RESULT:
Thus, the architecture for a real time application was planned successfully and the output was verified.
EX.NO:4
DATE: QUERY FOR SCHEMA DEFINITION
AIM:
To write queries for Star, Snowflake and Galaxy schema definitions.
PROCEDURE:
STAR SCHEMA
In a star schema, a central sales fact table is connected to a set of denormalized dimension tables, one per dimension.
SNOWFLAKE SCHEMA
In a snowflake schema, some dimension tables are normalized into additional tables (for example, supplier is split out of item and city is split out of location).
FACT CONSTELLATION (GALAXY) SCHEMA
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
The sales fact table is the same as that in the star schema.
The shipping fact table contains two measures, namely dollars cost and units shipped.
SYNTAX:
Cube Definition:
define cube <cube_name> [<dimension_list>]: <measure_list>
Dimension Definition:
define dimension <dimension_name> as (<attribute_or_dimension_list>)
SAMPLE PROGRAM:
Star Schema:
define cube sales star [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake Schema:
define cube sales snowflake [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city (city key, city, province or state, country))
Fact Constellation (Galaxy) Schema:
define cube sales [time, item, branch, location]:
dollars sold = sum(sales in dollars), units sold = count(*)
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
define cube shipping [time, item, shipper, from location, to location]:
dollars cost = sum(cost in dollars), units shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type)
define dimension from location as location in cube sales
define dimension to location as location in cube sales
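For comparison, one possible relational (MySQL-style) realization of the star schema is sketched below; the table and column names are illustrative only and are not part of the cube definition syntax above.
-- Dimension tables of the sales star schema
CREATE TABLE time_dim (
time_key INT PRIMARY KEY,
day INT, day_of_week VARCHAR(10), month INT, quarter INT, year INT
);
CREATE TABLE item_dim (
item_key INT PRIMARY KEY,
item_name VARCHAR(100), brand VARCHAR(50), type VARCHAR(50), supplier_type VARCHAR(50)
);
CREATE TABLE branch_dim (
branch_key INT PRIMARY KEY,
branch_name VARCHAR(100), branch_type VARCHAR(50)
);
CREATE TABLE location_dim (
location_key INT PRIMARY KEY,
street VARCHAR(100), city VARCHAR(50), province_or_state VARCHAR(50), country VARCHAR(50)
);
-- Central fact table with the two measures, keyed by the four dimensions
CREATE TABLE sales_fact (
time_key INT, item_key INT, branch_key INT, location_key INT,
dollars_sold DECIMAL(10,2), units_sold INT,
FOREIGN KEY (time_key) REFERENCES time_dim(time_key),
FOREIGN KEY (item_key) REFERENCES item_dim(item_key),
FOREIGN KEY (branch_key) REFERENCES branch_dim(branch_key),
FOREIGN KEY (location_key) REFERENCES location_dim(location_key)
);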
OUTPUT:
STAR SCHEMA
SNOWFLAKE SCHEMA
FACT CONSTELLATION SCHEMA
RESULT:
Thus, the query for star, Snowflake and Galaxy schema was written Successfully.
EX.NO:5
DATE: DESIGN DATA WAREHOUSE FOR REAL TIME APPLICATIONS
AIM:
To design a data warehouse for real time applications
PROCEDURE:
Dropping Tables
Since decision-making is concerned with the trends related to students' history, behavior, and academic performance, the tables "assets" and "item" are not needed; therefore, they are discarded and excluded from the data warehouse.
DROP TABLE assets;
DROP TABLE item;
Merging Tables
Based on the design assumptions, the three tables "department", "section", and "course" do not constitute separately important parameters for extracting relevant patterns and discovering knowledge. Therefore, they are merged into the "transcript_fact_table" table.
SELECT co_name FROM course, section, transcript
WHERE tr_id = n AND tr_semester_year = se_semester_year AND tr_se_num = se_num AND se_code = co_code;
ALTER TABLE transcript_fact_table ADD co_course TEXT;
DROP TABLE department;
DROP TABLE section;
DROP TABLE course;
Furthermore, table "Activities" is merged with table "RegistrationActivities" and a new table called "RegisteredActivities" is produced.
SELECT act_name FROM activities, registrationActivities
WHERE reg_act_id = act_id;
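A minimal sketch of materializing this merge as the new table (assuming MySQL-style CREATE TABLE ... AS SELECT; following the pattern above, the source tables could then be dropped):
-- Build the merged table from the join of the two source tables
CREATE TABLE registeredActivities AS
SELECT r.*, a.act_name
FROM registrationActivities r, activities a
WHERE r.reg_act_id = a.act_id;
DROP TABLE activities;
DROP TABLE registrationActivities;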
New Columns
During transformation, new columns can be added. In fact, tr_courseDifficulty is added to table "transcript_fact_table" in order to increase the degree of knowledge and information.
ALTER TABLE transcript_fact_table ADD tr_courseDifficulty TEXT;
Moreover, a Boolean column called re_paidOnDueDate is added to table "receipt".
ALTER TABLE receipt ADD re_paidOnDueDate BOOLEAN;
Removing Columns
Unnecessary columns can also be removed during the transformation process. Below is a list of columns that were discarded during the transformation process from tables "Account", "Student", "Receipt" and "Activities":
ALTER TABLE Receipt DROP COLUMN re_dueDate, DROP COLUMN re_dateOfPayment;
ALTER TABLE Activities DROP COLUMN ac_supervisor;
ALTER TABLE Student DROP COLUMN st_phone, DROP COLUMN st_email;
Conceptual Schema – The Snowflake Schema
The proposed data warehouse is a snowflake type design with one central fact table and seven dimensions.
OUTPUT:
RESULT:
Thus, the data warehouse for real time applications was designed successfully.
EX NO: 06 ANALYZE DIMENSIONAL MODELING USING MYSQL
DATE:
AIM:
To analyze dimensional modeling using MySQL.
Algorithm:
1. Review the Concept:
Review the concept of dimensional modeling, which involves creating dimension tables and a fact table to organize and structure data in a way that supports efficient querying and reporting.
2. Examine the Dimension Tables:
dim_time includes time-related information such as date, year, quarter, month, and day.
3. Examine the Fact Table:
Examine the fact table (fact_sales) and understand how it relates to dimension tables through foreign key references.
The fact table includes sales-related information such as sales ID, time ID, product ID, location ID, amount, and quantity.
Foreign key relationships link the fact table to the corresponding records in dimension tables.
4. Review the Sample Data:
Observe the sample data inserted into dimension tables and the fact table.
Understand how the data in dimension tables is related to the fact table through foreign keys.
Verify that the sample sales data in the fact table corresponds to existing records in dimension tables (sample INSERT statements are sketched after the program below).
5. Evaluate Data Types:
Review the data types chosen for columns (e.g., DECIMAL for amount, INT for identifiers)
to ensure they are appropriate for the data they store.
6. Check Foreign Key Relationships:
Ensure that foreign key relationships are correctly established to maintain data integrity.
Confirm that each foreign key in the fact table corresponds to a primary key in its respective dimension table.
7. Consider Indexing:
Evaluate whether indexes are applied appropriately, especially on columns used for joins
and filtering, to enhance query performance.
8. Verify Constraints:
Check for constraints like primary keys, auto-incrementing values, and other constraints
to maintain data consistency and integrity.
10. Documentation:
Document the dimensional model, including relationships, data types, and constraints, for future reference.
Program:
-- Dimension tables: time, product and location attributes for the sales data
CREATE TABLE dim_time (
time_id INT PRIMARY KEY,
date DATE,
year INT,
quarter INT,
month INT,
day INT
);
CREATE TABLE dim_product (
product_id INT PRIMARY KEY,
product_name VARCHAR(255),
category VARCHAR(50),
subcategory VARCHAR(50)
);
CREATE TABLE dim_location (
location_id INT PRIMARY KEY,
country VARCHAR(100),
state VARCHAR(100),
city VARCHAR(100)
);
-- Fact table: one row per sale, linked to the dimensions by foreign keys
CREATE TABLE fact_sales (
sales_id INT AUTO_INCREMENT PRIMARY KEY,
time_id INT,
product_id INT,
location_id INT,
amount DECIMAL(10,2),
quantity INT,
FOREIGN KEY (time_id) REFERENCES dim_time(time_id),
FOREIGN KEY (product_id) REFERENCES dim_product(product_id),
FOREIGN KEY (location_id) REFERENCES dim_location(location_id)
);
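The algorithm above also refers to sample data; a small illustrative set of rows (the values below are assumed purely for demonstration) could be loaded as follows:
-- Sample dimension rows
INSERT INTO dim_time (time_id, date, year, quarter, month, day)
VALUES (1, '2024-01-15', 2024, 1, 1, 15);
INSERT INTO dim_product (product_id, product_name, category, subcategory)
VALUES (1, 'Laptop', 'Electronics', 'Computers');
INSERT INTO dim_location (location_id, country, state, city)
VALUES (1, 'India', 'Tamil Nadu', 'Chennai');
-- Sample fact row referencing the dimension rows above
INSERT INTO fact_sales (time_id, product_id, location_id, amount, quantity)
VALUES (1, 1, 1, 55000.00, 1);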
RESULT:
Thus, the program was written and executed successfully for analyzing dimensional modeling.
EX.NO:7 CASE STUDY USING OLAP
DATE:
AIM:
To study a case scenario for using OLAP (Online Analytical Processing) in a data warehousing environment.
Solution:
1. Data Integration:
Extract data from various sources and load it into a centralized data warehouse.
Transform and cleanse the data to ensure consistency and quality.
2. Dimensional Modeling:
Design a star schema or snowflake schema to organize the data into fact tables (e.g.,
sales transactions) and dimension tables (e.g., store, product, time, customer).
3. OLAP Cube Creation:
Build OLAP cubes based on the dimensional model to provide multi-dimensional views of the data. Dimensions such as time, product, store, and customer can be sliced and diced for analysis (a sample roll-up query is sketched at the end of these solution steps).
4. Analysis:
- Sales Performance Analysis:
Analyze total sales revenue, units sold, and average transaction value by store, region,
product category, and time period.
- Customer Behavior Analysis:
Explore customer segmentation based on demographics, purchase frequency, and purchase amount. Identify high-value customers and their purchasing patterns.
- Product Trend Analysis:
Identify top-selling products, analyze product performance over time, and detect
seasonality or trends in sales.
- Cross-Selling Analysis:
Analyze associations between products frequently purchased together to optimize product placement and marketing strategies.
5. Visualization:
Create interactive dashboards and reports using OLAP cube data to present insights to
business users. Visualization tools like Tableau, Power BI, or custom-built dashboards
can be used.
6. Decision Making:
Use insights gained from analysis to make data-driven decisions such as inventory
management, marketing campaigns, and product assortment planning.
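As an illustration of the analyses listed above, a MySQL-style roll-up query is sketched below; it assumes the fact_sales and dimension tables defined in the dimensional modeling experiment, and a dedicated OLAP server would normally answer such queries from the cube itself.
-- Total revenue and units sold by year, quarter and product category,
-- with subtotals per year and a grand total via ROLLUP
SELECT t.year, t.quarter, p.category,
SUM(f.amount) AS total_revenue,
SUM(f.quantity) AS units_sold
FROM fact_sales f
JOIN dim_time t ON f.time_id = t.time_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY t.year, t.quarter, p.category WITH ROLLUP;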
Benefits:
1. Improved Decision Making: Provides timely and relevant insights to stakeholders for
making informed decisions.
2. Enhanced Operational Efficiency: Optimizes inventory management, marketing strategies, and resource allocation based on data-driven insights.
3. Competitive Advantage: Enables the company to stay ahead of competitors by
understanding customer preferences and market trends.
4. Scalability: The OLAP solution can scale to handle large volumes of data and
accommodate evolving business needs.
Conclusion:
By leveraging OLAP technology within a data warehousing environment, the retail company can gain deeper insights into its sales data, customer behavior, and product performance, ultimately driving business growth and profitability.
EX NO: 8 CASE STUDY USING OLTP
DATE:
AIM:
To study a case scenario for using OLTP (Online Transaction Processing) in a data warehousing environment.
Background:
A retail company operates an e-commerce platform where customers can purchase
products online. They need to manage a high volume of transactions efficiently while
ensuring data integrity and real-time processing.
Objective:
To manage a high volume of online transactions efficiently while ensuring data integrity and real-time processing.
Solution:
1. Database Design:
- Design a relational database schema optimized for transactional processing. Tables
include entities such as customers, orders, products, inventory, and payments.
Implement normalization to minimize data redundancy and maintain data integrity.
2. Online Order Management:
- Generate order IDs and track order status (e.g., pending, shipped, delivered); a transactional SQL sketch covering steps 2-4 is given after these solution steps.
3. Inventory Management:
4. Payment Processing:
5. Customer Management:
- Maintain customer profiles with information such as contact details, shipping addresses, and order history.
- Enable customers to update their profiles and track order statuses.
6. Performance Optimization:
- Optimize database performance for handling concurrent transactions and high throughput.
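A minimal MySQL-style sketch of a single order processed as one transaction, tying together order management, inventory and payment; the table and column names (orders, order_items, inventory, payments, etc.) are assumed here for illustration only.
-- Process one order atomically: record the order, reserve stock, record payment
START TRANSACTION;
INSERT INTO orders (customer_id, order_date, status)
VALUES (101, NOW(), 'pending');
SET @order_id = LAST_INSERT_ID();
INSERT INTO order_items (order_id, product_id, quantity, unit_price)
VALUES (@order_id, 1, 2, 499.00);
UPDATE inventory SET stock = stock - 2
WHERE product_id = 1 AND stock >= 2;
INSERT INTO payments (order_id, amount, payment_status)
VALUES (@order_id, 998.00, 'authorized');
COMMIT;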
Benefits:
1. Real-Time Processing:
Enables the company to process online orders, manage inventory, and handle payments in real-time, providing a seamless shopping experience for customers.
2. Data Integrity:
Ensures data consistency and accuracy by enforcing constraints and validations within
the OLTP system.
3. Efficient Order Fulfillment:
5. Insight Generation:
Captures transactional data that can be used for business intelligence and analytics
purposes, such as identifying sales trends, customer preferences, and market opportunities.
Conclusion:
By implementing an OLTP system for its e-commerce platform, the retail company can process orders, manage inventory, and handle payments in real time while maintaining data integrity, and the captured transactional data can later feed the data warehouse for analysis.
EX NO: 9 IMPLEMENTATION OF WAREHOUSE TESTING
DATE:
AIM:
To implement warehouse testing, which involves several steps to ensure the efficiency and accuracy of warehouse operations.
PROCEDURE:
1. Define Objectives:
Determine the specific objectives and goals of the warehouse testing. This could include ensuring inventory accuracy, optimizing picking and packing processes, improving order fulfillment times, etc.
2. Define Test Scenarios:
Define a set of test scenarios that cover different aspects of warehouse operations, such as receiving goods, put-away, picking, packing, shipping, and inventory counts. These scenarios should be based on real-world situations and should cover both normal and edge cases.
3. Establish Success Criteria:
Establish criteria for evaluating the success of each test scenario. This could include accuracy rates, time taken to complete tasks, error rates, etc.
4. Prepare Test Data:
Gather or generate the necessary test data to simulate real-world warehouse operations. This could include product data, inventory levels, customer orders, shipping information, etc.
5. Allocate Resources:
Assign the personnel and equipment necessary to conduct the warehouse testing. This may involve coordinating with warehouse staff, IT personnel, and any external vendors or consultants as needed.
6. Execute the Tests:
Conduct the warehouse testing by executing the predefined test scenarios using the allocated resources. Ensure that each scenario is executed according to the defined criteria, and record the results of each test.
7. Analyze the Results:
Analyze the results of the warehouse testing to identify any areas of improvement or areas where issues were encountered. Determine the root causes of any issues and prioritize them based on their impact on warehouse operations.
8. Implement Remedial Actions:
Take corrective actions to address any issues or deficiencies identified during the testing process. This could involve updating procedures, modifying system configurations, providing additional training to warehouse staff, etc.
9. Document Findings:
Document the findings of the warehouse testing process, including test results, corrective actions taken, and any recommendations for future improvements. This documentation will serve as a reference for future testing cycles and continuous improvement efforts.
10. Iterate and Refine:
Continuously iterate and refine the warehouse testing process based on feedback and results from previous testing cycles. Make adjustments as needed to improve the efficiency and effectiveness of warehouse operations.
PROGRAM:
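As one possible illustration of the inventory-accuracy test scenario described above, a MySQL-style check is sketched below; the inventory and physical_count tables and their columns are assumed for demonstration only.
-- Compare recorded stock against physically counted stock and flag mismatches
SELECT i.product_id,
i.stock AS recorded_stock,
c.counted_qty AS counted_stock,
c.counted_qty - i.stock AS variance
FROM inventory i
JOIN physical_count c ON c.product_id = i.product_id
WHERE c.counted_qty <> i.stock;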
RESULT:
Thus, the program was written and executed successfully for the implementation of warehouse testing.