Data Mining and Warehousing Insights

Kisala Micheal is a student with index number 2019-FEB-BIT-B224739-WKD taking the course DATA MINING AND WAREHOUSING under lecturer Dr. Goloba. The document discusses terms related to data warehouses including subject oriented, integrated, time-variant and non-volatile. It also discusses the differences between online transaction processing (OLTP) systems and online analytical processing (OLAP) systems. Finally, it describes processes for data extraction, cleansing, and transformation when designing a data warehouse for the ministry of health.



NAME: KISALA MICHEAL

INDEX NO. 2019-FEB-BIT-B224739-WKD

COURSE UNIT: DATA MINING AND WAREHOUSING

LECTURER: DR GOLOBA

STUDENT NO. 1800103221

QUESTION ONE
i) In the context of Uganda's ministry of economic planning, discuss what is meant by the
following terms when describing the characteristics of data in a data warehouse.
A data warehouse is "a subject-oriented, integrated, time-variant and non-volatile
collection of data in support of management's decision-making process".

Subject-oriented: Data that gives information about a particular subject rather than
about a company's ongoing operations.

Integrated: Data that is gathered into the data warehouse from a variety of sources and
merged into a coherent whole.

Time-variant: All data in the data warehouse is identified with a particular time period.

Non-volatile: Data in a data warehouse is stable; more data is added, but data is never removed.

ii) If you were to develop a data warehouse for Uganda's ministry of economic planning,
explain to management how online transaction processing (OLTP) systems would differ from
online analytical processing (OLAP) systems.
The major distinguishing features between OLTP and OLAP are summarized as
follows.

1. Users and system orientation: An OLTP system is customer-oriented and is used for
transaction and query processing by clerks, clients, and information technology professionals.
An OLAP system is market-oriented and is used for data analysis by knowledge workers,
including managers, executives, and analysts.

2. Data contents: An OLTP system manages current data that, typically, are too detailed to be
easily used for decision making. An OLAP system manages large amounts of historical data,
provides facilities for summarization and aggregation, and stores and manages information at
different levels of granularity. These features make the data easier for use in informed decision
making.

3. Database design: An OLTP system usually adopts an entity-relationship (ER) data model
and an application oriented database design. An OLAP system typically adopts either a
star or snowflake model and a subject-oriented database design.

4. View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema. OLAP systems
also deal with information that originates from different organizations, integrating
information from many data stores. Because of their huge volume, OLAP data are stored on
multiple storage media.

5. Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions. Such a system requires concurrency control and recovery mechanisms.
However, accesses to OLAP systems are mostly read-only operations although many could be
complex queries.

iii) Discuss the three main tasks which will be associated with the administration and
management of Uganda's ministry of economic planning data warehouse.

QUESTION TWO

i) Data warehouse architecture consists of many components. Explain the role of each
component in case you were designing a data warehouse for the ministry of health.
1. A database, data warehouse, or other information repository, which consists of
the set of databases, data warehouses, spreadsheets, or other kinds of
information repositories containing the relevant health information.
2. A database or data warehouse server, which fetches the relevant data based on
users' data mining requests.
3. A knowledge base that contains the domain knowledge used to guide the search
or to evaluate the interestingness of resulting patterns. For example,
the knowledge base may contain metadata which describes data from
multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for tasks
such as characterization, association, classification, cluster analysis, and
evolution and deviation analysis.
5. A pattern evaluation module that works in tandem with the data
mining modules by employing interestingness measures to help focus the
search towards interesting patterns.
6. A graphical user interface that allows the user to interact with the system,
specify mining tasks, and view the results.

ii) Describe the processes which will be associated with data extraction, cleansing, and
transformation when designing a data warehouse for the ministry of health.

EXTRACT
Some of the data elements in the operational database can reasonably be expected to be
useful in decision making, but others are of less value for that purpose. For this
reason, it is necessary to extract the relevant data from the operational database before
bringing it into the data warehouse. Many commercial tools are available to help with the
extraction process; Data Junction is one such commercial product. The user of one of these
tools typically has an easy-to-use windowed interface by which to specify the following:

(i) Which files and tables are to be accessed in the source database?
(ii) Which fields are to be extracted from them? This is often done internally by
an SQL SELECT statement.
(iii) What are those fields to be called in the resulting database?
(iv) What is the target machine and database format of the output?
(v) On what schedule should the extraction process be repeated?
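Steps (i)–(iii) above can be sketched as a minimal, hand-rolled extractor using SQLite. This is an illustration only, not a commercial tool: the `patients` table, its columns, and the `ENAME` → `PATIENT_NAME` rename are hypothetical examples.

```python
import sqlite3

def extract(conn, table, fields, renames=None):
    """Pull only the decision-relevant fields from an operational table,
    optionally renaming them for the resulting warehouse database."""
    renames = renames or {}
    cols = ", ".join(f"{f} AS {renames[f]}" if f in renames else f for f in fields)
    cursor = conn.execute(f"SELECT {cols} FROM {table}")
    header = [d[0] for d in cursor.description]  # names after any AS renaming
    return header, cursor.fetchall()

# Demo against an in-memory operational database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, ENAME TEXT, ward TEXT, notes TEXT)")
conn.execute("INSERT INTO patients VALUES (1, 'Okello', 'A', 'internal memo')")

# Extract only the two fields of decision-making value, renaming one of them.
header, rows = extract(conn, "patients", ["id", "ENAME"], {"ENAME": "PATIENT_NAME"})
```

Step (v), the schedule, would sit outside this function, e.g. in a cron job that reruns the extraction nightly.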
TRANSFORM

The operational databases developed can be based on any set of priorities, which keep
changing with the requirements. Therefore, those who develop a data warehouse based on these
databases are typically faced with inconsistency among their data sources. The transformation
process deals with rectifying such inconsistencies.

One of the most common transformation issues is attribute naming inconsistency. It is
common for a given data element to be referred to by different data names in different
databases: Employee Name may be EMP_NAME in one database and ENAME in another.
Thus one set of data names is picked and used consistently in the data warehouse. Once all
the data elements have the right names, they must be converted to common formats. The
conversion may encompass the following:

(i) Characters must be converted from ASCII to EBCDIC or vice versa.
(ii) Mixed text may be converted to all uppercase for consistency.
(iii) Numerical data must be converted into a common format.
(iv) Data formats have to be standardized.
(v) Measurements may have to be converted (e.g., Rs/$).
(vi) Coded data (Male/Female, M/F) must be converted into a common format.
All these transformation activities can be automated, and many commercial products are
available to perform the tasks. Data MAPPER from Applied Database Technologies is one such
comprehensive tool.
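To make the conversions concrete, here is a sketch of renaming plus conversions (ii), (v), and (vi) in plain Python. The name map, gender codes, and fixed exchange rate are illustrative assumptions, not values from the notes.

```python
# Illustrative lookup tables (assumptions, not real warehouse metadata).
NAME_MAP = {"EMP_NAME": "EmployeeName", "ENAME": "EmployeeName"}
GENDER_CODES = {"male": "M", "m": "M", "female": "F", "f": "F"}
RS_PER_USD = 80.0  # assumed fixed rate, for illustration only

def transform(record):
    """Apply attribute renaming and common-format conversions to one record."""
    out = {}
    for attr, value in record.items():
        attr = NAME_MAP.get(attr, attr)            # consistent attribute names
        if attr == "EmployeeName":
            value = value.upper()                  # (ii) mixed text -> uppercase
        if attr == "Gender":
            value = GENDER_CODES[value.lower()]    # (vi) coded data -> common format
        if attr == "SalaryRs":                     # (v) Rs -> $ conversion
            attr, value = "SalaryUSD", round(value / RS_PER_USD, 2)
        out[attr] = value
    return out

row = transform({"ENAME": "Nakato", "Gender": "female", "SalaryRs": 160000})
```

A real tool such as Data MAPPER would drive the same kind of logic from user-specified mapping rules rather than hard-coded tables.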

CLEANSING
Information quality is the key consideration in determining the value of information. The
developer of the data warehouse is not usually in a position to change the quality of its
underlying historic data, though a data warehousing project can put a spotlight on data
quality issues and lead to improvements for the future. It is, therefore, usually necessary
to go through the data entered into the data warehouse and make it as error-free as possible.
This process is known as data cleansing.
Data cleansing must deal with many types of possible errors. These include missing data and
incorrect data at one source, and inconsistent data and conflicting data when two or more
sources are involved. There are several algorithms followed to clean the data, which will be
discussed in the coming lecture notes.
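As a minimal illustration of the idea (the field names and conflict-resolution rule are hypothetical, not from the notes), a cleansing pass can discard records with missing required fields and resolve conflicts between sources by keeping the most recently updated copy:

```python
def cleanse(records, required=("patient_id", "diagnosis")):
    """Drop records missing required fields; when two sources disagree on
    the same patient_id, keep the copy with the latest update stamp."""
    latest = {}
    for rec in records:
        if any(rec.get(f) in (None, "") for f in required):
            continue                                  # missing data: discard
        key = rec["patient_id"]
        if key not in latest or rec["updated"] > latest[key]["updated"]:
            latest[key] = rec                         # conflicting data: newest wins
    return list(latest.values())

clean = cleanse([
    {"patient_id": 1, "diagnosis": "malaria", "updated": 2},
    {"patient_id": 1, "diagnosis": "typhoid", "updated": 1},  # older, conflicting
    {"patient_id": 2, "diagnosis": "", "updated": 3},         # missing value
])
```

Real cleansing algorithms are far richer than this (fuzzy matching, reference data checks), but they follow the same pattern of detect-then-resolve.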

iii) Describe the real-time and near-real-time data warehouse in the context of the ministry of health

Real-time data is data that’s collected, processed, and analyzed on a continual basis. It’s
information that’s available for use immediately after being generated. Near real-time data is a
snapshot of historical data, so teams are left viewing a situation as it existed in the recent past
rather than as it is now. Batched data is even slower and may be days old by the time it’s ready for
use.

There’s no industry-standard definition of how much time needs to elapse before real-time data
transitions to near-real-time data. But as a general rule, real-time data is measured in seconds,
whereas near-real-time data may be days old by the time the BI team works through their queue to
provide a report. And with spreadsheet extracts, the data may be even older by the time the
decision-maker receives it. Powered by a modern cloud data platform, with its centralized data
stores and ability to provide nearly unlimited computing power, there’s no need to settle for
batched or even near-real-time data when you need to conduct analyses.

iv) Discuss how Nkumba University data marts would differ from data warehouses and identify
the main reasons for implementing data marts for Nkumba University.

Size: a data mart is typically less than 100 GB; a data warehouse is typically larger than 100 GB
and often a terabyte or more.

Range: a data mart is limited to a single focus for one line of business; a data warehouse is
typically enterprise-wide and ranges across multiple areas.

Sources: a data mart includes data from just a few sources; a data warehouse stores data from
multiple sources.

QUESTION FIVE

(i) Data Mining Examples: Most Common Applications of Data Mining 2021

 Mobile Service Providers.
 Retail Sector.
 Artificial Intelligence.
 Ecommerce.
 Science and Engineering.
 Crime Prevention.
 Research.
 Farming.

(ii) Data mining is the process of extracting valid, previously unknown, comprehensible, and
actionable information from large databases and using it to make crucial business decisions.

There are four main operations associated with data mining techniques which include:

• Predictive modeling

• Database segmentation

• Link analysis

• Deviation detection.

Techniques are specific implementations of the data mining operations. However, each operation
has its own strengths and weaknesses. With this in mind, data mining tools sometimes offer a
choice of operations to implement a technique.

Predictive Modeling

Predictive modeling is designed on a pattern similar to the human learning experience of using
observations to form a model of the important characteristics of some task; the model
corresponds to the ‘real world’. It is developed using a supervised learning approach, which
has two phases: training and testing. The training phase is based on a large sample of
historical data called a training set, while testing involves trying out the model on new,
previously unseen data to determine its accuracy and performance characteristics.

It is commonly used in customer retention management, credit approval, cross-selling, and direct
marketing. There are two techniques associated with predictive modeling. These are:

• Classification

• Value prediction

Classification
Classification is used to classify records into a finite set of possible class values. There are
two specializations of classification: tree induction and neural induction. An example of
classification using tree induction follows.

In this example, we are interested in predicting whether a customer who is currently renting
property is likely to be interested in buying property. A predictive model has determined that only
two variables are of interest: the length of time the customer has rented property and the age of
the customer. The model predicts that those customers who have rented for more than two years and
are over 25 years old are the most likely to be interested in buying property. An example of
classification using neural induction is shown in Figure.
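The induced tree in this example reduces to just two tests, so it can be written down directly as a hand-coded classifier. This is a sketch of the stated rule only, not an actual tree-induction algorithm:

```python
def predict_buyer(years_rented, age):
    """Decision tree from the example: the only informative splits are
    'rented more than 2 years' and then 'older than 25'."""
    if years_rented > 2:
        if age > 25:
            return "likely buyer"
        return "unlikely buyer"   # long-term renter, but 25 or younger
    return "unlikely buyer"       # short-term renter, age irrelevant

prediction = predict_buyer(years_rented=3, age=30)
```

In practice, an induction algorithm (e.g. one based on information gain) would discover these splits automatically from the training set.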

A neural network contains collections of connected nodes with input, output, and processing at
each node. Between the visible input and output layers may be a number of hidden processing
layers. Each processing unit (circle) in one layer is connected to each processing unit in the next
layer by a weighted value, expressing the strength of the relationship. This approach is an attempt
to copy the way the human brain works in recognizing patterns by arithmetically combining all the
variables associated with a given data point.

Value prediction

It uses the traditional statistical techniques of linear regression and nonlinear regression. These
techniques are easy to use and understand. Linear regression attempts to fit a straight line through a
plot of the data, such that the line is the best representation of the average of all observations at that
point in the plot. The problem with linear regression is that the technique only works well with
linear data and is sensitive to those data values which do not conform to the expected norm.
Although nonlinear regression avoids the main problems of linear regression, it is still not flexible
enough to handle all possible shapes of the data plot. This is where the traditional statistical
analysis methods and data mining methods begin to diverge. Applications of value prediction
include credit card fraud detection and target mailing list identification.
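The straight-line fit that value prediction relies on follows directly from the ordinary least-squares formulas. A small illustrative example (the data points are invented, and not tied to any particular tool):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x: choose the line that
    minimizes the sum of squared vertical distances to the points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))   # slope from covariance/variance
    a = my - b * mx                          # intercept through the means
    return a, b

# Perfectly linear toy data: y = 2x, so the fit recovers a = 0, b = 2.
a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
```

The sensitivity to non-conforming values mentioned above shows up here directly: a single extreme y value shifts both sums and drags the fitted line toward the outlier.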

Database Segmentation

A segment is a group of similar records that share a number of properties. The aim of database
segmentation is to partition a database into an unknown number of segments, or clusters.

This approach uses unsupervised learning to discover homogeneous sub-populations in a database
and to improve the accuracy of the profiles. Applications of database segmentation include
customer profiling, direct marketing, and cross-selling.

As shown in the figure, using database segmentation we identify the clusters that correspond to
legal tender and forgeries. Note that there are two clusters of forgeries, which is attributed to
at least two gangs of forgers working on falsifying the banknotes.
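The unsupervised clustering behind database segmentation can be sketched with a toy one-dimensional k-means. The banknote measurements below are made-up values for illustration, not data from the figure:

```python
def kmeans(points, k, iters=20):
    """Minimal 1-D k-means: repeatedly assign each point to its nearest
    centroid, then move each centroid to the mean of its cluster."""
    centroids = points[:k]  # naive initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# A single measurement per note: genuine notes near 141, forgeries near 139.
groups = kmeans([141.0, 141.2, 140.9, 139.1, 139.3], k=2)
```

Note that k (the number of segments) must be guessed here; discovering a good number of clusters is itself part of the segmentation problem.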

Link Analysis
Link analysis aims to establish links, called associations, between individual records, or sets
of records, in a database. There are three specializations of link analysis. These are:

• Associations discovery

• Sequential pattern discovery

• Similar time sequence discovery.

Associations discovery finds items that imply the presence of other items in the same event.
Association rules are used to define associations. For example: ‘when a customer rents
property for more than two years and is more than 25 years old, in 40% of cases the customer
will buy a property. This association happens in 35% of all customers who rent properties’.
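The two percentages in such a rule are its confidence and support, and both can be computed directly by counting. The transaction data below is invented for illustration:

```python
def rule_stats(transactions, antecedent, consequent):
    """Support = fraction of all transactions containing antecedent AND
    consequent; confidence = fraction of antecedent transactions that
    also contain the consequent."""
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    return both / len(transactions), both / ante

# Each transaction is the set of attributes observed for one customer.
data = [
    {"rents>2y", "age>25", "buys"},
    {"rents>2y", "age>25"},
    {"rents>2y", "buys"},
    {"age>25"},
]
support, confidence = rule_stats(data, {"rents>2y", "age>25"}, {"buys"})
```

Association-rule algorithms such as Apriori search for all rules whose support and confidence exceed user-chosen thresholds.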

Sequential pattern discovery finds patterns between events such that the presence of one set of
items is followed by another set of items in a database of events over a period of time. For
example, this approach can be used to understand long-term customer buying behavior.

Similar time sequence discovery is used in the discovery of links between two sets of data that
are time-dependent. For example, within three months of buying property, new home owners will
purchase goods such as cookers, freezers, and washing machines.

Applications of link analysis include product affinity analysis, direct marketing, and stock price
movement.

Deviation Detection

Deviation detection is a relatively new technique in terms of commercially available data mining
tools. However, deviation detection is often a source of true discovery because it identifies
outliers, which express deviation from some previously known expectation and norm. This operation
can be performed using statistics and visualization techniques.

Applications of deviation detection include fraud detection in the use of credit cards and insurance
claims, quality control, and defects tracing.
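A simple statistical form of deviation detection flags values whose z-score exceeds a threshold. The claim amounts below are invented for illustration:

```python
from statistics import mean, stdev

def outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations
    from the mean of the sample."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Daily insurance claim amounts with one suspicious spike.
flagged = outliers([100, 102, 98, 101, 99, 100, 500])
```

Production systems use more robust measures (the outlier itself inflates the mean and standard deviation here), but the detect-the-unexpected principle is the same.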

iv)

Data mining benefits include:

 It helps companies gather reliable information.
 It's an efficient, cost-effective solution compared to other data applications.
 It helps businesses make profitable production and operational adjustments.
 Data mining uses both new and legacy systems.
 It helps businesses make informed decisions.
