Data Mining - Assignment

FACULTY OF COMPUTING AND INFORMATION MANAGEMENT

BACHELOR OF SCIENCE IN INFORMATION SECURITY AND FORENSIC STUDIES


UNIT: BISF-3107

BY: KINYA SHARON KAGENI RegNo – 20/00208


Email: [email protected]
LECTURER: MR MBOGO NJOROGE
MAY-AUG 2022

This assignment is submitted in partial fulfilment of the requirements for the award of Bachelor of Science in Information Security and Forensic Studies at KCA University.
Question 1

a) Explain the terms data warehouse, data mining, and data mart (6 marks)

A data warehouse is a single, complete, and consistent store of data obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context.

Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern-recognition technologies together with statistical and mathematical techniques.

Data mart – a logical subset of the complete data warehouse, often viewed as a restriction of the data warehouse to a single business process, or to a group of related business processes, targeted toward a particular business group.

b) Explain the entire data warehouse life cycle (7 marks)

The data warehouse life cycle, also known as the information life cycle, is the period during which data is stored in the system: from the moment data is captured to the moment it is exported, it passes through a series of stages. The cycle begins at the point where data values enter the system and are first detected.

Steps in the development of data warehouses

 The first step is to determine your business objectives.
 The second step is to collect and analyze information.
 The third step is to identify your core business processes.
 The fourth step is to construct a conceptual data model.
 The fifth step is to locate data sources and plan data transformations.
 The sixth step is to set a tracking duration.
 The seventh step is to implement the plan.

1) Requirement Gathering

 Done by business analysts, the onsite technical lead, and the client.

 In this phase, a business analyst prepares the Business Requirement Specification (BRS) document.

 About 80% of requirement collection takes place at the client's site, and collecting the requirements takes 3-4 months.

2) Analysis

 After collecting the requirements, the data modeler starts identifying the dimensions, facts, and aggregations the requirements call for.
 An ETL lead and a BA create the ETL specification document, which describes how each target table is to be populated from the source.
3) System Requirement Specification (SRS)

 After the onsite knowledge transfer, an offshore team prepares the SRS.
 The SRS document covers the software, hardware, and operating-system requirements.

4) Data Modeling

 The process of designing the database so that it fulfils the user requirements.
 A data modeler is responsible for creating data marts using one of the following kinds of schema (a small illustration follows this list):
 Star schema
 Snowflake schema
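A minimal sketch of a star schema, using hypothetical pandas DataFrames for an invented sales mart (all table and column names are illustrative, not from the assignment):

```python
import pandas as pd

# Dimension tables: descriptive attributes, one row per member.
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "product_name": ["Laptop", "Phone"],
    "category": ["Computing", "Mobile"],
})
dim_store = pd.DataFrame({
    "store_id": [10, 20],
    "city": ["Nairobi", "Mombasa"],
})

# Fact table: foreign keys into each dimension plus numeric measures.
fact_sales = pd.DataFrame({
    "product_id": [1, 2, 1],
    "store_id": [10, 10, 20],
    "units_sold": [3, 5, 2],
    "revenue": [3000.0, 2500.0, 2000.0],
})

# Joining the fact table to its dimensions recovers the full business view.
full = fact_sales.merge(dim_product, on="product_id").merge(dim_store, on="store_id")
print(full[["product_name", "city", "revenue"]])
```

In a snowflake schema, a column such as category would itself be normalized out into a further table referenced from the product dimension.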

5) ETL Development

 Designing ETL applications to fulfil the specification documents prepared in the analysis phase.

6) ETL Code Review

Code review is done by the developer. The following activities take place:

 Check the naming standards.
 Check the business logic.
 Check the mapping of source to target.

7) Peer Review

The code is reviewed by a fellow team member.

 Validation of the code, but not of the data.

8) ETL Testing

The following tests are carried out for each ETL application (a unit-test sketch follows this list):

 Unit testing
 Business functionality testing
 Performance testing
 User acceptance testing
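As an illustration of the unit-testing level, a hedged sketch: clean_amount is a hypothetical transformation an ETL application might apply, and the test validates it in isolation (all names are invented for the example):

```python
# Hypothetical ETL transformation: normalize a raw currency field.
def clean_amount(raw: str) -> float:
    """Strip a currency label and thousands separators, return a float."""
    return float(raw.replace("KES", "").replace(",", "").strip())

# Unit test: checks the transformation logic alone, with no real source or
# target systems involved (runnable with pytest, or directly as a script).
def test_clean_amount():
    assert clean_amount("KES 1,250.50") == 1250.50
    assert clean_amount("300") == 300.0

if __name__ == "__main__":
    test_clean_amount()
    print("unit tests passed")
```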

9) Report Development

 Design the reports to fulfil the report requirement templates / Report Data Workbook (RDW).

10) Deployment

 The process of migrating the ETL code and reports to a pre-production environment for stabilization.
 Also known as the pilot or stabilization phase.

11) Production Environment / Go-Live

 The active, working environment.

c) With aid of appropriate diagrams describe the three types of OLAP architecture (7
marks)

The three types are:

i. MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
ii. ROLAP stands for Relational OLAP, an application based on relational DBMSs.
iii. HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.
i. Multi-dimensional OLAP (MOLAP)

MOLAP uses specialized data structures and multi-dimensional database management systems (MDDBMSs) to organize, navigate, and analyze data.

One of the significant distinctions of MOLAP from ROLAP is that the data is summarized and stored in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data is structured into proprietary formats according to the client's reporting requirements, with the calculations pre-generated on the cubes.

MOLAP Architecture

MOLAP architecture includes the following components:

 Database server.
 MOLAP server.
 Front-end tool.

The MOLAP engine in the application layer collects data from the databases in the data layer. It then loads data cubes into the multi-dimensional databases. When the user makes a query, data moves in a proprietary format from the MDDBs to the client desktop in the presentation layer. This enables users to view data in multiple dimensions.
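A toy illustration of MOLAP's pre-generated calculations: aggregates are computed once into a cube-like structure keyed by dimension members, so a query becomes a simple lookup (the data and key layout are invented for the example):

```python
from collections import defaultdict

# Base fact rows: (product, city, quarter, revenue).
facts = [
    ("Laptop", "Nairobi", "Q1", 3000.0),
    ("Phone",  "Nairobi", "Q1", 2500.0),
    ("Laptop", "Mombasa", "Q2", 2000.0),
]

# Pre-aggregate revenue for every (product, quarter) cell -- this stands in
# for the MOLAP "cube", built once at load time.
cube = defaultdict(float)
for product, city, quarter, revenue in facts:
    cube[(product, quarter)] += revenue

# A user query is now a constant-time lookup, not a scan of the base data.
print(cube[("Laptop", "Q1")])  # 3000.0
```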

ii. Relational OLAP (ROLAP) Server

In this type of analytical processing, data storage is done in a relational database. In this
database, the arrangement of data is made in rows and columns. Data is presented to end-users in
a multi-dimensional form.

ROLAP servers contain optimization for each DBMS back end, implementation of aggregation
navigation logic, and additional tools and services.

ROLAP technology tends to have higher scalability than MOLAP technology.

ROLAP systems work primarily from the data that resides in a relational database, where the
base data and dimension tables are stored as relational tables. This model permits the
multidimensional analysis of data.

This technique relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each method of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
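A small sketch of that equivalence: each slice or dice becomes a predicate appended to a SQL WHERE clause. The function below only builds the SQL text from a filter dictionary; the table and column names are hypothetical, and a real system would use parameterized queries rather than string interpolation:

```python
def rolap_query(table: str, measures: str, filters: dict) -> str:
    """Translate a slice/dice specification into a SQL statement.

    Each (dimension, value) pair becomes one WHERE predicate: a single
    pair is a slice, two or more pairs are a dice.
    """
    where = " AND ".join(f"{dim} = '{val}'" for dim, val in filters.items())
    return f"SELECT {measures} FROM {table} WHERE {where};"

# Slice: fix one dimension.
print(rolap_query("sales_fact", "SUM(revenue)", {"quarter": "Q1"}))
# Dice: fix two or more dimensions.
print(rolap_query("sales_fact", "SUM(revenue)", {"quarter": "Q1", "city": "Nairobi"}))
```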

Relational OLAP Architecture

ROLAP architecture includes the following components:

 Database server.
 ROLAP server.
 Front-end tool.
When a user makes a (complex) query, the ROLAP server fetches data from the RDBMS server. The ROLAP engine then creates data cubes dynamically, and the user views the data from a multi-dimensional point of view.

Unlike MOLAP, where the multi-dimensional view is static, ROLAP provides a dynamic multi-dimensional view. This explains why it is slower than MOLAP.

iii. Hybrid OLAP (HOLAP) Server

This type of analytical processing solves the limitations of MOLAP and ROLAP and combines
their attributes. Data in the database is divided into two parts: specialized storage and relational
storage. Integrating these two aspects addresses issues relating to performance and scalability.
HOLAP stores huge volumes of data in a relational database and keeps aggregations in a
MOLAP server.

The HOLAP model consists of a server that can support both ROLAP and MOLAP. Its architecture is complex and requires frequent maintenance. Queries made in the HOLAP model involve both the multi-dimensional database and the relational database. The front-end tool presents data either directly from the database management system or through the intermediate MOLAP layer.

Question 2

a) State five data mining techniques (5 marks)

• Statistical Data Analysis
• Cluster Analysis
• Decision Trees and Decision Rules
• Association Rules
• Artificial Neural Networks
• Genetic Algorithms
• Fuzzy Sets and Fuzzy Logic
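As a concrete illustration of one of these techniques, a minimal cluster-analysis sketch using scikit-learn's k-means (the data points are invented):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two visually separate groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

# Cluster analysis: partition the points into 2 groups.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster id per point, e.g. [0 0 0 1 1 1]
print(model.cluster_centers_)  # the two learned centroids
```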

b) Highlight the four analytical operations that can be performed on data cubes (4mks)

Roll-up performs aggregation on the data by moving up the dimensional hierarchy or by dimensional reduction, e.g. 4-D sales data to 3-D sales data.

Drill-down is the reverse of roll-up and involves revealing the detailed data that forms the aggregated data. Drill-down can be performed by moving down the dimensional hierarchy or by dimensional introduction, e.g. 3-D sales data to 4-D sales data.

Slice and dice – the ability to look at data from different viewpoints. The slice operation performs a selection on one dimension of the data, whereas dice uses two or more dimensions.

Pivot – the ability to rotate the data to provide an alternative view of the same data, e.g. sales revenue displayed with location (city) on the x-axis against time (quarter) on the y-axis can be rotated so that time (quarter) is on the x-axis and location (city) on the y-axis.
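These operations can be demonstrated on a small pandas dataset (the sales figures are invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Nairobi", "Nairobi", "Mombasa", "Mombasa"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "product": ["Laptop", "Laptop", "Phone", "Phone"],
    "revenue": [3000.0, 3500.0, 2000.0, 2200.0],
})

# Roll-up: aggregate away the product dimension (3-D -> 2-D); drill-down
# would be the reverse, re-introducing the product dimension.
rollup = sales.groupby(["city", "quarter"], as_index=False)["revenue"].sum()

# Slice: select on a single dimension.
q1_slice = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions.
dice = sales[(sales["quarter"] == "Q1") & (sales["city"] == "Nairobi")]

# Pivot: rotate so quarters become columns against city rows.
pivot = rollup.pivot(index="city", columns="quarter", values="revenue")
print(pivot)
```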

c) With aid of a diagram describe the core components of the Data Warehouse
Architecture

Typically, a data warehouse consists of four components:

i. Data sources
ii. Data staging and processing, ETL (Extract, Transform, and Load)
iii. Data warehouse
iv. Data marts
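A compact sketch of how these components interact, using Python's standard library with an in-memory SQLite database standing in for the warehouse (source records and table names are invented):

```python
import sqlite3

# 1) Data source: raw records as they might arrive from an operational system.
source_rows = [("2022-05-01", "Laptop", "3,000"), ("2022-05-02", "Phone", "2,500")]

# 2) Staging / ETL: transform raw strings into clean, typed values.
staged = [(day, product, float(amount.replace(",", "")))
          for day, product, amount in source_rows]

# 3) Data warehouse: load the cleaned rows into the integrated store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (day TEXT, product TEXT, revenue REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged)

# 4) Data mart: a subject-specific subset carved out for one business group.
mart = warehouse.execute(
    "SELECT product, SUM(revenue) FROM sales GROUP BY product").fetchall()
print(mart)
```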

Question 3
a) Explain the concept of knowledge discovery in databases and state the major steps in a
knowledge discovery process (6 marks)

Data Mining – Knowledge Discovery in Databases (KDD)

Data mining, also known as knowledge discovery in databases, refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

KDD is an iterative process in which evaluation measures can be enhanced, mining can be refined, and new data can be integrated and transformed in order to get different and more appropriate results.

Preprocessing of databases consists of data cleaning and data integration.

Reasons why we need data mining

The volume of information we have to handle is increasing every day: business transactions, scientific data, sensor data, pictures, videos, etc. We therefore need systems capable of extracting the essence of the information available and of automatically generating reports, views, or summaries of the data for better decision-making.

Why data mining is used in business

Data mining is used in business to support better managerial decisions by:

 Automatically summarizing data.
 Extracting the essence of the information stored.
 Discovering patterns in raw data.

Knowledge discovery from data consists of the following steps:

Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection (a brief cleaning sketch follows this list). It involves:

 Cleaning in the case of missing values.
 Cleaning noisy data, where noise is a random or variance error.
 Cleaning with data discrepancy detection and data transformation tools.
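A small pandas sketch of the first two cleaning tasks, missing values and noise, on invented sensor readings:

```python
import pandas as pd

raw = pd.DataFrame({"reading": [10.0, None, 11.0, 250.0, 10.5]})

# Missing values: fill with the column median.
raw["reading"] = raw["reading"].fillna(raw["reading"].median())

# Noisy data: clip an obvious outlier to a plausible range (here, 0-100,
# an assumed valid range for this hypothetical sensor).
raw["reading"] = raw["reading"].clip(lower=0, upper=100)
print(raw)
```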

Data Integration: Data integration is defined as combining heterogeneous data from multiple sources into a common store (the data warehouse). It can be carried out using:

 Data migration tools.
 Data synchronization tools.
 The ETL (Extract-Transform-Load) process.

Data Selection: Data selection is defined as the process in which the data relevant to the analysis is decided upon and retrieved from the data collection. Techniques used include:

 Neural networks.
 Decision trees.
 Naive Bayes.
 Clustering and regression.
Data Transformation: Data transformation is defined as the process of transforming data into the form required by the mining procedure. It is a two-step process (a brief sketch follows this list):

 Data mapping: assigning elements from the source base to the destination to capture the transformations.
 Code generation: creation of the actual transformation program.
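A sketch of the two steps: a mapping table pairs hypothetical source fields with target fields and transformations, and "code generation" is reduced here to simply applying that mapping to a record (all field names are invented):

```python
# Data mapping: source field -> (target field, transformation to apply).
mapping = {
    "cust_nm": ("customer_name", str.strip),
    "amt_kes": ("amount", float),
}

def transform(source_record: dict) -> dict:
    """Apply the mapping to one record (stands in for the generated program)."""
    return {target: fn(source_record[src])
            for src, (target, fn) in mapping.items()}

print(transform({"cust_nm": "  Sharon ", "amt_kes": "2500"}))
# -> {'customer_name': 'Sharon', 'amount': 2500.0}
```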

Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns. It:

 Transforms task-relevant data into patterns.
 Decides the purpose of the model, using classification or characterization.

Pattern Evaluation: Pattern evaluation is defined as identifying the interesting patterns representing knowledge, based on given measures (a scoring sketch follows this list). It:

 Finds an interestingness score for each pattern.
 Uses summarization and visualization to make the data understandable to the user.
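As one illustration of an interestingness score, a sketch computing the support and confidence of an association rule over a list of market-basket transactions (the baskets are invented):

```python
def support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / n, (both / ante if ante else 0.0)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"milk", "butter"}, {"bread", "milk", "butter"}]
sup, conf = support_confidence(baskets, {"bread"}, {"milk"})
print(f"support={sup:.2f}, confidence={conf:.2f}")  # support=0.50, confidence=0.67
```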

Knowledge Representation: Knowledge representation is defined as the technique of using visualization tools to present the data mining results. It can:

 Generate reports.
 Generate tables.
 Generate discriminant rules, classification rules, and characterization rules.

b) Explain four Schemes of Data Mining Classification (7 marks)

Classification according to the kinds of databases mined: A data mining system can be
classified according to the kinds of databases mined. Database systems can be classified
according to different criteria (such as data models, or the types of data or applications involved),
each of which may require its own data mining technique.

Classification according to the kinds of knowledge mined: Data mining systems can be
categorized according to the kinds of knowledge they mine, that is, based on data mining
functionalities, such as characterization, discrimination, association and correlation analysis,
classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive
data mining system usually provides multiple and integrated data mining functionalities.

Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (for example, autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (for example, database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, etc.).

Classification according to the applications adapted: Data mining systems can also be
categorized according to the applications they adapt. For example, data mining systems may be
tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on.
Different applications often require the integration of application-specific methods. Therefore, a
generic, all-purpose data mining system may not fit domain-specific mining tasks.
