NAME: KISALA MICHEAL
INDEX NO. 2019-FEB-BIT-B224739-WKD
COURSE UNIT: DATA MINING AND WAREHOUSING
LECTURER: DR GOLOBA
STUDENT NO. 1800103221
QUESTION ONE
1) In the context of Uganda's Ministry of Economic Planning, discuss what is meant by the
following terms when describing the characteristics of data in a data warehouse.
A data warehouse is "a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision-making process".
Subject Oriented: Data that gives information about a particular subject instead of
about a company's ongoing operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Non-volatile: Data is stable in a data warehouse. More data is added, but data is never removed.
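The time-variant and non-volatile properties above can be sketched with a tiny append-only store; the subject name, figures, and dates below are illustrative only, not real ministry data:

```python
from datetime import date

# A tiny append-only fact store: every row carries a time period
# (time-variant) and rows are only ever added, never updated or
# deleted (non-volatile).
warehouse = []

def load_fact(subject, value, period):
    warehouse.append({"subject": subject, "value": value, "period": period})

load_fact("gdp_estimate", 104.2, date(2020, 12, 31))
load_fact("gdp_estimate", 109.8, date(2021, 12, 31))

# Both historical snapshots remain queryable side by side.
history = [r["value"] for r in warehouse if r["subject"] == "gdp_estimate"]
print(history)  # [104.2, 109.8]
```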
ii) If you were to develop a data warehouse for Uganda's Ministry of Economic Planning,
explain to management how online transaction processing (OLTP) systems would differ from
online analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as
follows.
1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology professionals.
An OLAP system is market-oriented and is used for data analysis by knowledge workers,
including managers, executives, and analysts.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier for use in informed decision
making.
3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model
and an application oriented database design. An OLAP system typically adopts either a
star or snowflake model and a subject-oriented database design.
4. View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations. In
contrast, an OLAP
system often spans multiple versions of a database schema. OLAP systems also deal
with information that originates from different organizations, integrating information from
many data stores. Because of their huge volume, OLAP data are stored on multiple storage
media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms.
However, accesses to OLAP systems are mostly read-only operations, although many of them may be
complex queries.
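The access-pattern contrast can be sketched on one toy table: an OLTP-style lookup touches a single current row, while an OLAP-style query aggregates across history. The rows and column names below are illustrative only:

```python
# Illustrative sales records; in a real system the OLTP and OLAP
# workloads would run against separate, differently designed stores.
sales = [
    {"year": 2020, "region": "Central", "amount": 120},
    {"year": 2020, "region": "Western", "amount": 80},
    {"year": 2021, "region": "Central", "amount": 150},
]

# OLTP-style access: short and atomic, touches one row.
def lookup(year, region):
    return next(r for r in sales if r["year"] == year and r["region"] == region)

# OLAP-style access: read-only aggregation across many rows.
def total_by_year(year):
    return sum(r["amount"] for r in sales if r["year"] == year)

print(lookup(2021, "Central")["amount"])  # 150
print(total_by_year(2020))                # 200
```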
iii) Discuss the three main tasks which will be associated with the administration and
management of Uganda's Ministry of Economic Planning data warehouse.
QUESTION TWO
1) Data warehouse architecture consists of many components. Explain the role of each
component in case you were designing a data warehouse for the Ministry of Health.
1. A database, data warehouse, or other information repository, which consists of a
set of databases, data warehouses, spreadsheets, or other kinds of information
repositories containing the relevant Ministry of Health information.
2. A database or data warehouse server which fetches the relevant data based on
users’ data mining requests.
3. A knowledge base that contains the domain knowledge used to guide the search
or to evaluate the interestingness of resulting patterns. For example,
the knowledge base may contain metadata which describes data from
multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for tasks
such as characterization, association, classification, cluster analysis, and
evolution and deviation analysis.
5. A pattern evaluation module that works in tandem with the data
mining modules by employing interestingness measures to help focus the
search toward interesting patterns.
6. A graphical user interface that allows the user to interact with the system.
ii) Describe the processes which will be associated with data extraction, cleansing, and
transformation when designing a data warehouse for the Ministry of Health.
EXTRACT
Some of the data elements in the operational database can reasonably be expected to be
useful in decision making, but others are of less value for that purpose. For this
reason, it is necessary to extract the relevant data from the operational database before
bringing it into the data warehouse. Many commercial tools are available to help with the
extraction process; Data Junction is one such commercial product. The user of one of these
tools typically has an easy-to-use windowed interface by which to specify the following:
(i) Which files and tables are to be accessed in the source database?
(ii) Which fields are to be extracted from them? This is often done internally by an
SQL SELECT statement.
(iii) What are those fields to be called in the resulting database?
(iv) What is the target machine and database format of the output?
(v) On what schedule should the extraction process be repeated?
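Points (i) to (iii) above can be sketched as a small extraction driven by a declarative specification. The SQLite database, table, and column names below are hypothetical, invented purely for illustration:

```python
import sqlite3

# Extraction spec: which table (i), which fields (ii), and what the
# fields are to be called in the resulting database (iii).
spec = {
    "source_table": "patients",
    "fields": ["patient_id", "district", "visit_date"],
    "rename": {"visit_date": "admission_date"},
}

# A throwaway in-memory operational database standing in for the source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id, district, visit_date, notes)")
conn.execute("INSERT INTO patients VALUES (1, 'Kampala', '2021-03-01', 'x')")

cols = ", ".join(spec["fields"])
rows = conn.execute(f"SELECT {cols} FROM {spec['source_table']}").fetchall()
target_names = [spec["rename"].get(f, f) for f in spec["fields"]]
extracted = [dict(zip(target_names, r)) for r in rows]
print(extracted)  # [{'patient_id': 1, 'district': 'Kampala', 'admission_date': '2021-03-01'}]
```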
TRANSFORM
The operational databases can be based on any set of priorities, which keep changing
with the requirements. Therefore, those who develop a data warehouse based on these
databases are typically faced with inconsistency among their data sources. The transformation
process deals with rectifying any such inconsistencies.
One of the most common transformation issues is attribute naming inconsistency. It is
common for a given data element to be referred to by different names in different
databases; Employee Name may be EMP_NAME in one database and ENAME in another.
Thus one set of data names is picked and used consistently in the data warehouse. Once all
the data elements have the right names, they must be converted to common formats. The
conversion may encompass the following:
(i) Characters must be converted from ASCII to EBCDIC or vice versa.
(ii) Mixed text may be converted to all uppercase for consistency.
(iii) Numerical data must be converted into a common format.
(iv) Data formats have to be standardized.
(v) Measurements may have to be converted (e.g., Rs to $).
(vi) Coded data (Male/Female, M/F) must be converted into a common format.
All these transformation activities can be automated, and many commercial products are available
to perform the tasks. Data MAPPER from Applied Database Technologies is one such
comprehensive tool.
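The naming and format conversions above can be sketched in a few lines. The source field names EMP_NAME and ENAME come from the text; the other field names, mappings, and sample values are illustrative assumptions:

```python
# Map inconsistent source names onto one warehouse name, and map
# coded values (M/F) onto a common format.
NAME_MAP = {"EMP_NAME": "employee_name", "ENAME": "employee_name",
            "SEX": "gender", "GENDER": "gender"}
CODE_MAP = {"M": "Male", "F": "Female", "Male": "Male", "Female": "Female"}

def transform(record):
    out = {}
    for field, value in record.items():
        name = NAME_MAP.get(field, field.lower())
        if name == "gender":
            value = CODE_MAP[value]          # coded data to a common format
        elif isinstance(value, str):
            value = value.upper()            # mixed text to uppercase
        out[name] = value
    return out

row_a = transform({"EMP_NAME": "Kisala", "SEX": "M"})
row_b = transform({"ENAME": "Achen", "GENDER": "Female"})
print(row_a)  # {'employee_name': 'KISALA', 'gender': 'Male'}
print(row_b)  # {'employee_name': 'ACHEN', 'gender': 'Female'}
```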
CLEANSING
Information quality is the key consideration in determining the value of the information. The
developer of the data warehouse is not usually in a position to change the quality of its
underlying historic data, though a data warehousing project can put a spotlight on data quality
issues and lead to improvements for the future. It is, therefore, usually necessary to go
through the data entered into the data warehouse and make it as error free as possible. This
process is known as Data Cleansing.
Data cleansing must deal with many types of possible errors. These include missing data and
incorrect data at one source, and inconsistent data and conflicting data when two or more sources are
involved. There are several algorithms followed to clean the data, which will be discussed in
the coming lecture notes.
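A minimal cleansing pass over records merged from two sources might drop superseded duplicates, flag missing values, and resolve conflicts by preferring the newer source. All record contents and the preference rule below are illustrative assumptions:

```python
# Two hypothetical source extracts; source_a is assumed to be newer.
source_a = [{"id": 1, "district": "Gulu"}, {"id": 2, "district": None}]
source_b = [{"id": 1, "district": "Gulu Town"}, {"id": 2, "district": None}]

def cleanse(primary, secondary):
    merged, problems = {}, []
    for rec in secondary + primary:   # primary (newer) overwrites conflicts
        merged[rec["id"]] = rec
    for rec in merged.values():       # flag missing data for follow-up
        if rec["district"] is None:
            problems.append(("missing district", rec["id"]))
    return list(merged.values()), problems

clean, issues = cleanse(source_a, source_b)
print(len(clean))   # 2
print(issues)       # [('missing district', 2)]
```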
iii) Describe real-time and near-real-time data warehouses in the context of the Ministry of Health.
Real-time data is data that’s collected, processed, and analyzed on a continual basis. It’s
information that’s available for use immediately after being generated. Near real-time data is a
snapshot of historical data, so teams are left viewing a situation as it existed in the recent past
rather than as it is now. Batched data is even slower and may be days old by the time it’s ready for
use.
There’s no industry-standard definition of how much time needs to elapse before real-time data
transitions to near real-time data. But as a general rule, real-time data is measured in seconds,
whereas near real-time data may be days old by the time the BI team works through their queue to
provide a report. And with spreadsheet extracts, the data may be even older by the time the
decision-maker receives it. Powered by a modern cloud data platform, with its centralized data
stores and ability to provide nearly unlimited computing power, there is no need to settle for
batched or even near-real-time data when you need to conduct analyses.
iv) Discuss how Nkumba University data marts would differ from data warehouses and identify
the main reasons for implementing data marts for Nkumba University.
Size: a data mart is typically less than 100 GB; a data warehouse is typically larger than 100 GB
and often a terabyte or more.
Range: a data mart is limited to a single focus for one line of business; a data warehouse is
typically enterprise-wide and ranges across multiple areas.
Sources: a data mart includes data from just a few sources; a data warehouse stores data from
multiple sources.
The main reasons for implementing data marts for Nkumba University would be to give each
faculty or department fast, focused access to its own data, to keep costs and build time low
compared with a full enterprise warehouse, and to let end users query a smaller, simpler data set.
QUESTION FIVE
(i) The most common applications of data mining include:
Mobile Service Providers.
Retail Sector.
Artificial Intelligence.
Ecommerce.
Science and Engineering.
Crime Prevention.
Research.
Farming.
(ii) Data mining is the process of extracting valid, previously unknown, comprehensible, and
actionable information from large databases and using it to make crucial business decisions.
There are four main operations associated with data mining techniques which include:
• Predictive modeling
• Database segmentation
• Link analysis
• Deviation detection.
Techniques are specific implementations of the data mining operations. However, each operation
has its own strengths and weaknesses. With this in mind, data mining tools sometimes offer a
choice of operations to implement a technique.
Predictive Modeling
It is designed on a similar pattern to the human learning experience of using observations to form a
model of the important characteristics of some task, and it corresponds to the real world. It is
developed using a supervised learning approach, which has two phases: training and testing.
The training phase is based on a large sample of historical data called a training set, while testing
involves trying out the model on new, previously unseen data to determine its accuracy and
performance characteristics.
It is commonly used in customer retention management, credit approval, cross-selling, and direct
marketing. There are two techniques associated with predictive modeling. These are:
• Classification
• Value prediction
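The two phases described above can be sketched with a toy threshold classifier; the records, the split into training and testing sets, and the learning rule are all illustrative assumptions:

```python
# Hypothetical historical records: (years_rented, bought 0/1).
history = [(1, 0), (2, 0), (3, 1), (4, 1), (2, 0), (5, 1)]

train, test = history[:4], history[4:]

# Training phase: learn a cutoff from the buyers in the training set.
buyers = [x for x, y in train if y == 1]
threshold = sum(buyers) / len(buyers) - 0.5

def predict(years):
    return 1 if years >= threshold else 0

# Testing phase: measure accuracy on previously unseen records.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)  # 1.0
```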
Classification
Classification is used to classify records into a finite set of possible class values. There are
two specializations of classification: tree induction and neural induction. Consider an example of
classification using tree induction.
In this example, we are interested in predicting whether a customer who is currently renting
property is likely to be interested in buying property. A predictive model has determined that only
two variables are of interest: the length of time the customer has rented property and the age of the
customer. The model predicts that those customers who have rented for more than two years and
are over 25 years old are the most likely to be interested in buying property. An example of
classification using neural induction is shown in the figure.
A neural network contains collections of connected nodes with input, output, and processing at
each node. Between the visible input and output layers may be a number of hidden processing
layers. Each processing unit (circle) in one layer is connected to each processing unit in the next
layer by a weighted value, expressing the strength of the relationship. This approach is an attempt
to copy the way the human brain works in recognizing patterns by arithmetically combining all the
variables associated with a given data point.
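The tree-induction example above reduces to two successive tests; a minimal sketch of that decision tree, with the cutoffs taken directly from the text:

```python
def likely_to_buy(years_rented, age):
    # Root split: has the customer rented for more than two years?
    if years_rented > 2:
        # Second split: is the customer over 25 years old?
        if age > 25:
            return True
    return False

print(likely_to_buy(3, 30))  # True
print(likely_to_buy(3, 22))  # False
print(likely_to_buy(1, 40))  # False
```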
Value prediction
It uses the traditional statistical techniques of linear regression and nonlinear regression. These
techniques are easy to use and understand. Linear regression attempts to fit a straight line through a
plot of the data, such that the line is the best representation of the average of all observations at that
point in the plot. The problem with linear regression is that the technique only works well with
linear data and is sensitive to those data values which do not conform to the expected norm.
Although nonlinear regression avoids the main problems of linear regression, it is still not flexible
enough to handle all possible shapes of the data plot. This is where the traditional statistical
analysis methods and data mining methods begin to diverge. Applications of value prediction
include credit card fraud detection and target mailing list identification.
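Value prediction by linear regression fits a line through the observations by least squares; the (x, y) points below are made up for illustration:

```python
# Illustrative observations; y is roughly 2x with a little noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.1, 8.0]

# Ordinary least-squares fit of y = slope * x + intercept.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

def predict(x):
    return slope * x + intercept

print(round(predict(5.0), 1))  # 10.0
```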
Database Segmentation
A segment is a group of similar records that share a number of properties. The aim of database
segmentation is to partition a database into an unknown number of segments, or clusters.
This approach uses unsupervised learning to discover homogeneous sub-populations in a database
to improve the accuracy of the profiles. Applications of database segmentation include customer
profiling, direct marketing, and cross-selling
As shown in the figure, using database segmentation we identify the clusters that correspond to
legal tender and to forgeries. Note that there are two clusters of forgeries, which is attributed to at
least two gangs of forgers working on falsifying the banknotes.
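Unsupervised segmentation can be sketched with a tiny one-dimensional k-means that splits records into two clusters without any labels; the measurements and the crude initialization are illustrative assumptions:

```python
# Hypothetical one-dimensional measurements forming two groups.
values = [1.0, 1.2, 0.9, 8.0, 8.3, 7.8]
centers = [values[0], values[3]]  # crude initial guesses

for _ in range(10):
    # Assign each value to its nearest center, then recompute centers.
    clusters = [[], []]
    for v in values:
        nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
        clusters[nearest].append(v)
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(clusters[0]))  # [0.9, 1.0, 1.2]
print(sorted(clusters[1]))  # [7.8, 8.0, 8.3]
```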
Link Analysis
Link analysis aims to establish links, called associations, between individual records, or sets of
records, in a database. There are three specializations of link analysis. These are:
• Associations discovery
• Sequential pattern discovery
• Similar time sequence discovery.
Associations discovery finds items that imply the presence of other items in the same event.
Association rules are used to define associations. For example: 'when a customer rents
property for more than two years and is more than 25 years old, in 40% of cases the customer will
buy a property. This association happens in 35% of all customers who rent properties.'
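The two percentages in such a rule are its confidence and support, which can be computed directly; the customer records below are fabricated so the arithmetic is checkable:

```python
# Hypothetical customer records for the rule
# "rents > 2 years AND age > 25  =>  buys a property".
customers = [
    {"long_rent": True,  "over_25": True,  "buys": True},
    {"long_rent": True,  "over_25": True,  "buys": False},
    {"long_rent": False, "over_25": True,  "buys": False},
    {"long_rent": True,  "over_25": False, "buys": False},
]

antecedent = [c for c in customers if c["long_rent"] and c["over_25"]]
both = [c for c in antecedent if c["buys"]]

support = len(both) / len(customers)      # rule holds in this share of all records
confidence = len(both) / len(antecedent)  # rule holds when the antecedent does

print(support)     # 0.25
print(confidence)  # 0.5
```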
Sequential pattern discovery finds patterns between events such that the presence of one set of
items is followed by another set of items in a database of events over a period of time. For example,
this approach can be used to understand long-term customer buying behavior.
Similar time sequence discovery is used in the discovery of links between two sets of data that are
time-dependent. For example, within three months of buying property, new home owners will
purchase goods such as cookers, freezers, and washing machines.
Applications of link analysis include product affinity analysis, direct marketing, and stock price
movement.
Deviation Detection
Deviation detection is a relatively new technique in terms of commercially available data mining
tools. However, deviation detection is often a source of true discovery because it identifies outliers,
which express deviation from some previously known expectation and norm. This operation can
be performed using statistics and visualization techniques.
Applications of deviation detection include fraud detection in the use of credit cards and insurance
claims, quality control, and defects tracing.
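A simple statistical form of deviation detection flags values far from the mean; the claim amounts and the two-standard-deviation cutoff below are illustrative assumptions:

```python
import statistics

# Hypothetical insurance claim amounts; one is suspiciously large.
claims = [100, 110, 95, 105, 98, 102, 500]

mean = statistics.mean(claims)
sd = statistics.stdev(claims)

# Flag claims more than two standard deviations from the mean.
outliers = [c for c in claims if abs(c - mean) > 2 * sd]
print(outliers)  # [500]
```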
(iv) Data mining benefits include:
It helps companies gather reliable information.
It's an efficient, cost-effective solution compared to other data applications.
It helps businesses make profitable production and operational adjustments.
Data mining uses both new and legacy systems.
It helps businesses make informed decisions.