Data Mining - Assignment

FACULTY OF COMPUTING AND INFORMATION MANAGEMENT

BACHELOR OF SCIENCE IN INFORMATION SECURITY AND FORENSIC STUDIES


UNIT: BISF-3107

BY: KINYA SHARON KAGENI RegNo – 20/00208


Email: [email protected]
LECTURER: MR MBOGO NJOROGE
MAY-AUG 2022

This assignment is submitted in partial fulfilment of the requirements for the award of Bachelor of Science in Information Security and Forensic Studies at KCA University.
Question 1

a) Explain the terms data warehouse, data mining, and data mart (6 marks)

A data warehouse is a single, complete, and consistent store of data obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.

Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern-recognition technologies together with statistical and mathematical techniques.

Data mart – a logical subset of the complete data warehouse, often viewed as a restriction of the data warehouse to a single business process, or to a group of related business processes, targeted toward a particular business group.

b) Explain the entire data warehouse life cycle (7 marks)

The data warehouse life cycle, also known as the information life cycle, is the period during which data is stored in the system: from the moment data is captured to the moment it is exported, it passes through a series of stages. The cycle begins at the point where data values enter the system and are first detected.

Steps in the development of data warehouses

 The first step is to determine your business objectives.
 The second step is to collect and analyze information.
 The third step is to identify your core business processes.
 The fourth step is to construct a conceptual data model.
 The fifth step is to locate data sources and plan data transformations.
 The sixth step is to set a tracking duration.
 The seventh step is to implement the plan.

1) Requirement Gathering

 Done by business analysts, the onsite technical lead, and the client.

 In this phase, a business analyst prepares the Business Requirement Specification (BRS) document.

 About 80% of requirement collection takes place at the client's site, and collecting the requirements takes 3-4 months.

2) Analysis

 After collecting the requirements, the data modeler starts identifying the dimensions, facts, and aggregations the requirements call for.
 An ETL lead and a BA create the ETL specification document, which describes how each target table is to be populated from the source.
3) System Requirement Specification (SRS)

 After the onsite knowledge transfer, an offshore team prepares the SRS.
 The SRS document covers the software, hardware, and operating-system requirements.

4) Data Modeling

 The process of designing the database so that it fulfils the user requirements.
 A data modeler is responsible for creating data marts using one of the following kinds of schema (a small illustration follows this list):
 Star schema
 Snowflake schema
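A minimal sketch of a star schema, using hypothetical pandas DataFrames for an invented sales mart (all table and column names are illustrative, not from the assignment):

```python
import pandas as pd

# Dimension tables: descriptive attributes, one row per member.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Laptop", "Phone"],
    "category": ["Computing", "Mobile"],
})
dim_store = pd.DataFrame({
    "store_id": [10, 20],
    "city": ["Nairobi", "Mombasa"],
})

# Fact table: foreign keys into each dimension plus numeric measures.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id": [10, 10, 20],
    "units_sold": [3, 5, 2],
    "revenue": [3000.0, 2500.0, 2000.0],
})

# Joining the fact table to its dimensions recovers the full business view.
full = fact_sales.merge(dim_product, on="product_id").merge(dim_store, on="store_id")
print(full[["product_name", "city", "revenue"]])
```

In a snowflake schema, a column such as category would itself be normalized out into a further table referenced from the product dimension.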

5) ETL Development

 Designing ETL applications to fulfil the specification documents prepared in the analysis phase.

6) ETL Code Review

Code review is done by the developer. The following activities take place:

 Check the naming standards.
 Check the business logic.
 Check the mapping of source to target.

7) Peer Review

The code is reviewed by a fellow team member.

 Validation of the code, but not of the data.

8) ETL Testing

The following tests are carried out for each ETL application (a unit-test sketch follows this list):

 Unit testing
 Business functionality testing
 Performance testing
 User acceptance testing
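As an illustration of the unit-testing level, a hedged sketch: clean_amount is a hypothetical transformation an ETL application might apply, and the test validates it in isolation (all names are invented for the example):

```python
# Hypothetical ETL transformation: normalize a raw currency field.
def clean_amount(raw: str) -> float:
    """Strip a currency label and thousands separators, return a float."""
    return float(raw.replace("KES", "").replace(",", "").strip())

# Unit test: checks the transformation logic alone, with no real source or
# target systems involved (runnable with pytest, or directly as a script).
def test_clean_amount():
    assert clean_amount("KES 1,250.50") == 1250.50
    assert clean_amount("300") == 300.0

if __name__ == "__main__":
    test_clean_amount()
    print("unit tests passed")
```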

9) Report Development

 Design the reports to fulfil the report requirement templates / Report Data Workbook (RDW).

10) Deployment

 The process of migrating the ETL code and reports to a pre-production environment for stabilization.
 Also known as the pilot or stabilization phase.

11) Production Environment / Go-Live

 The active, working environment.

c) With aid of appropriate diagrams describe the three types of OLAP architecture (7
marks)

The three types are:

i. MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
ii. ROLAP stands for Relational OLAP, an application based on relational DBMSs.
iii. HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.
i. Multi-dimensional OLAP (MOLAP)

MOLAP uses specialized data structures and multi-dimensional database management systems (MDDBMSs) to organize, navigate, and analyze data.

One of the significant distinctions of MOLAP from ROLAP is that the data is summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data is structured into proprietary formats according to the client's reporting requirements, with the calculations pre-generated on the cubes.

MOLAP Architecture

MOLAP architecture includes the following components:

 Database server.
 MOLAP server.
 Front-end tool.

The MOLAP engine in the application layer collects data from the databases in the data layer. It then loads data cubes into the multi-dimensional databases. When the user makes a query, data moves in a proprietary format from the MDDBs to the client desktop in the presentation layer. This enables users to view data in multiple dimensions.
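A toy illustration of MOLAP's pre-generated calculations: aggregates are computed once into a cube-like structure keyed by dimension members, so a query becomes a simple lookup (the data and key layout are invented for the example):

```python
from collections import defaultdict

# Base fact rows: (product, city, quarter, revenue).
facts = [
    ("Laptop", "Nairobi", "Q1", 3000.0),
    ("Phone",  "Nairobi", "Q1", 2500.0),
    ("Laptop", "Mombasa", "Q2", 2000.0),
]

# Pre-aggregate revenue for every (product, quarter) cell -- this stands in
# for the MOLAP "cube", built once at load time.
cube = defaultdict(float)
for product, city, quarter, revenue in facts:
    cube[(product, quarter)] += revenue

# A user query is now a constant-time lookup, not a scan of the base data.
print(cube[("Laptop", "Q1")])  # 3000.0
```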

ii. Relational OLAP (ROLAP) Server

In this type of analytical processing, data storage is done in a relational database. In this
database, the arrangement of data is made in rows and columns. Data is presented to end-users in
a multi-dimensional form.

ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.

ROLAP technology tends to have higher scalability than MOLAP technology.

ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.

This technique relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each method of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
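A small sketch of that equivalence: each slice or dice becomes a predicate appended to a SQL WHERE clause. The function below only builds the SQL text from a filter dictionary; the table and column names are hypothetical, and a real system would use parameterized queries rather than string interpolation:

```python
def rolap_query(table: str, measures: str, filters: dict) -> str:
    """Translate a slice/dice specification into a SQL statement.

    Each (dimension, value) pair becomes one WHERE predicate: a single
    pair is a slice, two or more pairs are a dice.
    """
    where = " AND ".join(f"{dim} = '{val}'" for dim, val in filters.items())
    return f"SELECT {measures} FROM {table} WHERE {where};"

# Slice: fix one dimension.
print(rolap_query("sales_fact", "SUM(revenue)", {"quarter": "Q1"}))
# Dice: fix two or more dimensions.
print(rolap_query("sales_fact", "SUM(revenue)", {"quarter": "Q1", "city": "Nairobi"}))
```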

Relational OLAP Architecture

ROLAP architecture includes the following components:

 Database server.
 ROLAP server.
 Front-end tool.
When a user makes a (complex) query, the ROLAP server fetches data from the RDBMS server. The ROLAP engine then creates data cubes dynamically, and the user views the data from a multi-dimensional point of view.

Unlike MOLAP, where the multi-dimensional view is static, ROLAP provides a dynamic multi-dimensional view. This explains why it is slower than MOLAP.

iii. Hybrid OLAP (HOLAP) Server

This type of analytical processing solves the limitations of MOLAP and ROLAP and combines
their attributes. Data in the database is divided into two parts: specialized storage and relational
storage. Integrating these two aspects addresses issues relating to performance and scalability.
HOLAP stores huge volumes of data in a relational database and keeps aggregations in a
MOLAP server.

The HOLAP model consists of a server that can support both ROLAP and MOLAP. Its architecture is complex and requires frequent maintenance. Queries made in the HOLAP model involve both the multi-dimensional database and the relational database. The front-end tool presents data either directly from the database management system or through the intermediate MOLAP layer.

Question 2

a) State five data mining techniques (5 marks)

• Statistical Data Analysis
• Cluster Analysis
• Decision Trees and Decision Rules
• Association Rules
• Artificial Neural Networks
• Genetic Algorithms
• Fuzzy Sets and Fuzzy Logic
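As a concrete illustration of one of these techniques, a minimal cluster-analysis sketch using scikit-learn's k-means (the data points are invented):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two visually separate groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Cluster analysis: partition the points into 2 groups.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster id per point, e.g. [0 0 0 1 1 1]
print(model.cluster_centers_)  # the two learned centroids
```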

b) Highlight the four analytical operations that can be performed on data cubes (4mks)

Roll-up performs aggregation on the data by moving up the dimensional hierarchy or by dimensional reduction, e.g. 4-D sales data to 3-D sales data.

Drill-down is the reverse of roll-up and involves revealing the detailed data that forms the aggregated data. Drill-down can be performed by moving down the dimensional hierarchy or by dimensional introduction, e.g. 3-D sales data to 4-D sales data.

Slice and dice – the ability to look at data from different viewpoints. The slice operation performs a selection on one dimension of the data, whereas dice uses two or more dimensions.

Pivot – the ability to rotate the data to provide an alternative view of the same data, e.g. sales revenue displayed with location (city) on the x-axis against time (quarter) on the y-axis can be rotated so that time (quarter) is on the x-axis and location (city) on the y-axis.
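These operations can be demonstrated on a small pandas dataset (the sales figures are invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Nairobi", "Nairobi", "Mombasa", "Mombasa"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "product": ["Laptop", "Laptop", "Phone", "Phone"],
    "revenue": [3000.0, 3500.0, 2000.0, 2200.0],
})

# Roll-up: aggregate away the product dimension (3-D -> 2-D); drill-down
# would be the reverse, re-introducing the product dimension.
rollup = sales.groupby(["city", "quarter"], as_index=False)["revenue"].sum()

# Slice: select on a single dimension.
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Nairobi")]

# Pivot: rotate so quarters become columns against city rows.
pivot = rollup.pivot(index="city", columns="quarter", values="revenue")
print(pivot)
```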

c) With aid of a diagram describe the core components of the Data Warehouse
Architecture

Typically, a data warehouse consists of four components:

i. Data sources
ii. Data staging and processing, ETL (Extract, Transform, and Load)
iii. Data warehouse
iv. Data marts
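A compact sketch of how these components interact, using Python's standard library with an in-memory SQLite database standing in for the warehouse (source records and table names are invented):

```python
import sqlite3

# 1) Data source: raw records as they might arrive from an operational system.
source_rows = [("2022-05-01", "Laptop", "3,000"), ("2022-05-02", "Phone", "2,500")]

# 2) Staging / ETL: transform raw strings into clean, typed values.
staged = [(day, product, float(amount.replace(",", "")))
          for day, product, amount in source_rows]

# 3) Data warehouse: load the cleaned rows into the integrated store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, product TEXT, revenue REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)

# 4) Data mart: a subject-specific subset carved out for one business group.
mart = warehouse.execute(
    "SELECT product, SUM(revenue) FROM sales GROUP BY product").fetchall()
print(mart)
```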

Question 3
a) Explain the concept of knowledge discovery in databases and state the major steps in a
knowledge discovery process (6 marks)

Data Mining – Knowledge Discovery in Databases (KDD)

Data mining, also known as knowledge discovery in databases, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

KDD is an iterative process in which evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed in order to get different and more appropriate results.

Preprocessing of databases consists of data cleaning and data integration.

Reasons why we need data mining

The volume of information we have to handle is increasing every day: business transactions, scientific data, sensor data, pictures, videos, etc. We therefore need systems capable of extracting the essence of the information available and of automatically generating reports, views, or summaries of the data for better decision-making.

Why data mining is used in business

Data mining is used in business to support better managerial decisions by:

 Automatically summarizing data.
 Extracting the essence of the information stored.
 Discovering patterns in raw data.

Knowledge discovery from data consists of the following steps:

Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection (a brief cleaning sketch follows this list). It involves:

 Cleaning in the case of missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with data discrepancy detection and data transformation tools.
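A small pandas sketch of the first two cleaning tasks, missing values and noise, on invented sensor readings:

```python
import pandas as pd

raw = pd.DataFrame({"reading": [10.0, None, 11.0, 250.0, 10.5]})

# Missing values: fill with the column median.
raw["reading"] = raw["reading"].fillna(raw["reading"].median())

# Noisy data: clip an obvious outlier to a plausible range (here, 0-100,
# an assumed valid range for this hypothetical sensor).
raw["reading"] = raw["reading"].clip(lower=0, upper=100)
print(raw)
```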

Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common store (the data warehouse). It can be carried out using:

 Data migration tools.
 Data synchronization tools.
 The ETL (Extract-Transform-Load) process.

Data Selection: Data selection is defined as the process in which the data relevant to the analysis is decided upon and retrieved from the data collection. Techniques used include:

 Neural networks.
 Decision trees.
 Naive Bayes.
 Clustering and regression.
Data Transformation: Data transformation is defined as the process of transforming data into the form required by the mining procedure. It is a two-step process (a brief sketch follows this list):

 Data mapping: assigning elements from the source base to the destination to capture the transformations.
 Code generation: creation of the actual transformation program.
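A sketch of the two steps: a mapping table pairs hypothetical source fields with target fields and transformations, and "code generation" is reduced here to simply applying that mapping to a record (all field names are invented):

```python
# Data mapping: source field -> (target field, transformation to apply).
mapping = {
    "cust_nm": ("customer_name", str.strip),
    "amt_kes": ("amount", float),
}

def transform(source_record: dict) -> dict:
    """Apply the mapping to one record (stands in for the generated program)."""
    return {target: fn(source_record[src])
            for src, (target, fn) in mapping.items()}

print(transform({"cust_nm": "  Sharon ", "amt_kes": "2500"}))
# -> {'customer_name': 'Sharon', 'amount': 2500.0}
```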

Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns. It:

 Transforms task-relevant data into patterns.
 Decides the purpose of the model, using classification or characterization.

Pattern Evaluation: Pattern evaluation is defined as identifying the interesting patterns representing knowledge, based on given measures (a scoring sketch follows this list). It:

 Finds an interestingness score for each pattern.
 Uses summarization and visualization to make the data understandable to the user.
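As one illustration of an interestingness score, a sketch computing the support and confidence of an association rule over a list of market-basket transactions (the baskets are invented):

```python
def support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / n, (both / ante if ante else 0.0)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
sup, conf = support_confidence(baskets, {"bread"}, {"milk"})
print(f"support={sup:.2f}, confidence={conf:.2f}")  # support=0.50, confidence=0.67
```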

Knowledge Representation: Knowledge representation is defined as the technique of using visualization tools to present the data mining results. It can:

 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, and characterization rules.

b) Explain four Schemes of Data Mining Classification (7 marks)

Classification according to the kinds of databases mined: A data mining system can be
classified according to the kinds of databases mined. Database systems can be classified
according to different criteria (such as data models, or the types of data or applications involved),
each of which may require its own data mining technique.

Classification according to the kinds of knowledge mined: Data mining systems can be
categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive
data mining system usually provides multiple and integrated data mining functionalities.

Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (for example, autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (for example, database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, etc.).

Classification according to the applications adapted: Data mining systems can also be
categorized according to the applications they adapt. For example, data mining systems may be
tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.
Different applications often require the integration of application-specific methods. Therefore, a
generic, all-purpose data mining system may not fit domain-specific mining tasks.
