Unit 2: 5 Marks (Data Science)
1) What is a data warehouse? Explain its characteristics. (4/6)
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and
usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction
processing. It includes historical data derived from transaction data from single or multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for decision-
makers for data modelling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's
decisions."
Characteristics of Data Warehouse:
Subject-Oriented
A data warehouse targets the modelling and analysis of data for decision-makers. Therefore, data warehouses
typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales,
rather than the organization's ongoing global operations. This is done by excluding data that are not useful with
respect to the subject and including all data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online transaction
records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming
conventions, attribute types, etc., among the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12
months, or even older periods from a data warehouse. This contrasts with a transaction system, where often only
the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from the source operational RDBMS.
Operational updates of the data do not occur in the data warehouse, i.e., update, insert, and delete operations are not
performed. It usually requires only two procedures for data access: the initial loading of data and read access to the data.
Therefore, the DW does not require transaction processing, recovery, or concurrency-control capabilities, which allows for a
substantial speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it should not
change.
2) Explain any four/six differences between an operational database system and a data
warehouse. (4/6)
3) Explain the following multidimensional database schemas with an example. (3 marks each)
i.) Star schema
A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a login. A dimension
contains reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a multidimensional data model. The star schema is the
most common data warehouse schema. It is known as a star schema because the entity-relationship diagram of the
schema resembles a star, with points diverging from a central table. The center of the schema consists of a large fact
table, and the points of the star are the dimension tables.
The star schema is highly suitable for data warehouse database design because of the following features:
o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed or extended easily throughout the development cycle and
as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
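A minimal sketch of a star schema using pandas DataFrames (the table and column names here are illustrative, not taken from any particular warehouse): a central fact table holds the measures plus foreign keys, and each dimension table is joined to it on its key.

```python
import pandas as pd

# Illustrative dimension tables (reference data about the facts)
dim_item = pd.DataFrame({"item_key": [1, 2],
                         "item_name": ["TV", "Phone"],
                         "brand": ["BrandA", "BrandB"]})
dim_time = pd.DataFrame({"time_key": [10, 11],
                         "quarter": ["Q1", "Q2"],
                         "year": [2024, 2024]})

# Central fact table: measures plus foreign keys into the dimensions
fact_sales = pd.DataFrame({"time_key": [10, 10, 11],
                           "item_key": [1, 2, 1],
                           "units_sold": [5, 3, 7],
                           "dollars_sold": [2500, 900, 3500]})

# A "star join": the fact table merged with each dimension on its key
star = (fact_sales
        .merge(dim_item, on="item_key")
        .merge(dim_time, on="time_key"))
print(star.groupby(["year", "quarter", "brand"])["dollars_sold"].sum())
```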
OLAP Operations:
Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up. Drill-down is
like zooming in on the data cube. It navigates from less detailed data to more detailed data. Drill-down
can be performed either by stepping down a concept hierarchy for a dimension or by adding additional
dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy
defined as day, month, quarter, and year. Drill-down occurs by descending the time hierarchy from
the level of quarter to the more detailed level of month.
Because drill-down adds more detail to the given data, it can also be performed by adding a new
dimension to a cube. For example, a drill-down on the central cube of the figure can occur by introducing
an additional dimension, such as customer group.
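As a rough illustration (the sales table below is made up for this sketch), drill-down along the time hierarchy can be mimicked in pandas by grouping at a finer level of the hierarchy:

```python
import pandas as pd

sales = pd.DataFrame({"year":    [2024, 2024, 2024, 2024],
                      "quarter": ["Q1", "Q1", "Q2", "Q2"],
                      "month":   ["Jan", "Feb", "Apr", "May"],
                      "amount":  [100, 120, 90, 110]})

# Summary at the quarter level (the starting view)
by_quarter = sales.groupby(["year", "quarter"])["amount"].sum()

# Drill-down: step down the time hierarchy from quarter to month
by_month = sales.groupby(["year", "quarter", "month"])["amount"].sum()
print(by_quarter, by_month, sep="\n\n")
```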
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension.
For example, a slice operation is executed when the user wants a selection on one dimension of a three-
dimensional cube, resulting in a two-dimensional slice. The slice operation thus performs a selection on one
dimension of the given cube, resulting in a subcube.
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data
axes in view to provide an alternative presentation of the data. It may involve swapping the rows and
columns, or moving one of the row dimensions into the column dimensions.
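A hedged sketch of slice, dice, and pivot on a small, made-up three-dimensional sales table (pandas stands in for a real OLAP engine here):

```python
import pandas as pd

cube = pd.DataFrame({"time":     ["Q1", "Q1", "Q2", "Q2"],
                     "item":     ["TV", "Phone", "TV", "Phone"],
                     "location": ["Pune", "Delhi", "Pune", "Delhi"],
                     "sales":    [100, 80, 120, 90]})

# Slice: select a single value on one dimension (time = "Q1")
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on two or more dimensions, yielding a subcube
dice = cube[cube["time"].isin(["Q1", "Q2"]) & (cube["item"] == "TV")]

# Pivot (rotate): move the time dimension from rows into columns
pivot = cube.pivot_table(index="item", columns="time",
                         values="sales", aggfunc="sum")
print(pivot)
```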
6) Explain different views regarding the business analysis framework design of a data
warehouse. (4)
Four different views regarding a data warehouse design must be considered: the top-down
view, the data source view, the data warehouse view, and the business query view.
The top-down view: allows the selection of the relevant information necessary for the
data warehouse. This information matches current and future business needs.
The data source view: exposes the information being captured, stored, and managed
by operational systems. This information may be documented at various levels
of detail and accuracy, from individual data source tables to integrated data source
tables. Data sources are often modeled by traditional data modeling techniques, such
as the entity-relationship model or CASE (computer-aided software engineering)
tools.
The data warehouse view: includes fact tables and dimension tables. It represents the
information that is stored inside the data warehouse, including precalculated totals
and counts, as well as information regarding the source, date, and time of origin,
added to provide historical context.
Finally, the business query view: is the data perspective in the data warehouse from
the end-user’s viewpoint.
9) Explain the different data warehouse models from the architecture point of view. (4)
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
Enterprise warehouse: An enterprise warehouse collects all of the information about
subjects spanning the entire organization. It provides corporate-wide data integration,
usually from one or more operational systems or external information
providers, and is cross-functional in scope. It typically contains detailed data as
well as summarized data, and can range in size from a few gigabytes to hundreds
of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented
on traditional mainframes, computer superservers, or parallel architecture
platforms. It requires extensive business modeling and may take years to design
and build.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects. For example,
a marketing data mart may confine its subjects to customer, item, and sales. The
data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based. The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or years. However, it may
involve complex integration in the long run if its design and planning were not
enterprise-wide.
Depending on the source of data, data marts can be categorized as independent
or dependent. Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data generated
locally within a particular department or geographic area. Dependent data marts are
sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess capacity on
operational database servers.
10) Explain the different back-end tools and utilities included in data warehouse. (4)
The back-end tools of a data warehouse are pieces of software responsible for the extraction of data
from several sources, their cleansing, customization, and insertion into a data warehouse. They are
known under the general term extraction, transformation and loading (ETL) tools.
In all the phases of an ETL process (extraction and exportation, transformation and cleaning, and
loading), individual issues arise and, along with the problems and constraints that concern the overall
ETL process, make its lifecycle a very complex task.
Phase I: Extraction and Transportation
During the ETL process, a first task that must be performed is the extraction of the relevant
information that has to be propagated further to the warehouse.
Phase II: Transformation and Cleaning
It is possible to determine typical tasks that take place during the transformation and cleaning phase
of an ETL process.
or
Data Warehouse Back-End Tools and Utilities
Data extraction:
o get data from multiple, heterogeneous, and external sources
Data cleaning:
o detect errors in the data and rectify them when possible
Data transformation:
o convert data from legacy or host format to warehouse format
Load:
o sort, summarize, consolidate, compute views, check integrity, and build indices and
partitions
Refresh
o propagate the updates from the data sources to the warehouse
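A minimal ETL sketch in Python, showing the extract, clean, transform, and load steps in order (the inline CSV data, column names, and the SQLite target file are assumptions for illustration only):

```python
import io
import sqlite3
import pandas as pd

# Extraction: get data from a source (a small inline CSV stands in for a real feed)
raw_csv = io.StringIO(
    "date,item,amount\n"
    "2024-01-05,TV,2500\n"
    "2024-01-05,TV,2500\n"          # duplicate record
    "2024-01-09,Phone,-100\n"       # obviously erroneous amount
    "2024-02-11,Phone,900\n")
raw = pd.read_csv(raw_csv)

# Cleaning: detect errors and rectify or drop them
clean = raw.drop_duplicates()
clean = clean[clean["amount"] >= 0].copy()

# Transformation: convert to the warehouse format (derive a month attribute)
clean["date"] = pd.to_datetime(clean["date"])
clean["month"] = clean["date"].dt.to_period("M").astype(str)

# Loading: summarize and write into the warehouse (here a local SQLite file)
summary = clean.groupby(["month", "item"], as_index=False)["amount"].sum()
with sqlite3.connect("warehouse.db") as conn:
    summary.to_sql("monthly_sales", conn, if_exists="replace", index=False)
```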
14) Explain (any four) the different ways of handling missing values. (4/6)
Different ways of handling missing values in a dataset:
1. Ignore the tuple (Listwise Deletion):
This approach involves removing entire rows (tuples) from the dataset if they
contain any missing values.
It ensures that the analysis is performed only on complete cases, but it may lead
to a loss of valuable data.
2. Fill in the missing value manually:
In this method, missing values are identified, and the analyst or domain expert
manually fills in the missing values based on their knowledge or intuition.
While this approach allows for personalized handling of missing values, it can
be time-consuming and subjective.
3. Use a global constant to fill in the missing value:
Missing values are replaced with a predetermined global constant, such as 0 or -
1.
This approach is simple and quick but may not reflect the true nature of the
data, potentially introducing bias.
4. Use a measure of central tendency (e.g., mean or median) to fill in the missing value:
Missing values in numerical attributes are replaced with the mean or median of
the available values in the same attribute.
This method preserves the overall distribution of the data but may be sensitive
to outliers.
5. Use the attribute mean or median for all samples belonging to the same class:
If the dataset contains categorical variables, missing values in numerical
attributes can be replaced with the mean or median of the respective attribute
values within the same class or category.
This approach accounts for potential differences in attribute distributions across
different classes.
6. Use the most probable value to fill in the missing value:
Missing values are replaced with the most probable value based on the observed
data and possibly other attributes.
This method can be implemented using techniques such as regression, k-nearest
neighbors, or decision trees to predict missing values.
It can capture complex relationships in the data but requires additional
computational resources.
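A hedged pandas/scikit-learn sketch of several of the approaches above, on a made-up table with a numeric attribute income and a class label:

```python
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"cls":    ["A", "A", "B", "B", "B"],
                   "age":    [25, 27, 40, 42, 41],
                   "income": [50.0, None, 30.0, None, 40.0]})

dropped  = df.dropna()                               # 1. ignore the tuple
constant = df["income"].fillna(-1)                   # 3. fill with a global constant
by_mean  = df["income"].fillna(df["income"].mean())  # 4. fill with the attribute mean
by_class = df["income"].fillna(                      # 5. fill with the class-wise mean
    df.groupby("cls")["income"].transform("mean"))

# 6. most probable value, here approximated by k-nearest-neighbour imputation
imputed = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])
print(by_class, imputed, sep="\n\n")
```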
15) Define noise and explain different data smoothing techniques. (6)
Noise is a random error or variance in a measured variable. Noise can arise from various
sources, including measurement errors, data collection artifacts, or inherent variability in the
phenomenon being studied.
Different data smoothing techniques:
1. Binning:
Binning is a data smoothing technique that involves dividing the data into
intervals, or bins, and replacing the values within each bin with a representative
value, such as the mean, median, or mode of the values in that bin.
This technique is often used to reduce the effects of noise and variability in
continuous data by grouping similar values together.
Binning can help simplify the data, make it more interpretable, and reveal
trends or patterns that may not be apparent in the raw data (a short sketch of smoothing by bin means follows this list).
However, binning may also lead to information loss and loss of granularity,
especially if the number of bins is chosen arbitrarily or if important details are
obscured by bin boundaries.
2. Regression:
Regression analysis is a statistical technique used to model the relationship
between one or more independent variables (predictors) and a dependent
variable (outcome) based on observed data.
In the context of data smoothing, regression can be used to fit a curve or surface
to the data, allowing for the estimation of values between observed data points.
Regression models can help identify trends, patterns, or underlying
relationships in the data, making it easier to interpret and make predictions.
Common regression techniques include linear regression, polynomial
regression, logistic regression, and more advanced methods such as ridge
regression or lasso regression.
Regression can effectively smooth noisy data, but it may also introduce bias if
the model assumptions are violated or if overfitting occurs.
3. Outlier Analysis:
Outlier analysis is a data smoothing technique focused on identifying and
handling outliers, which are data points that deviate significantly from the rest
of the data.
Outliers can distort statistical analyses and modeling results, leading to biased
conclusions or inaccurate predictions.
Outlier analysis involves techniques such as visual inspection, statistical tests,
or machine learning algorithms to detect outliers based on their distance,
density, or deviation from the expected distribution.
Once outliers are identified, they can be removed, transformed, or treated
separately to minimize their impact on the analysis.
Outlier analysis is essential for improving the robustness and accuracy of data
analysis and modeling by reducing the influence of extreme values that may not
represent the underlying phenomenon accurately.
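As referenced in the binning item above, a rough pandas sketch of smoothing by bin means (the price values are made up for illustration; equal-frequency bins are used):

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition into 3 equal-frequency bins, then replace each value
# with the mean of its bin (smoothing by bin means)
bins = pd.qcut(prices, q=3)
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed}))
```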
17) What is data reduction? Explain the strategies for data reduction. (6)
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the original
data. That is, mining on the reduced data set should be more efficient yet produce the
same (or almost the same) analytical results.
Data reduction strategies include dimensionality reduction, numerosity reduction, and
data compression.
Dimensionality reduction is the process of reducing the number of random variables
or attributes under consideration. Dimensionality reduction methods include wavelet transforms (Section
3.4.2) and principal components analysis (Section 3.4.3), which
transform or project the original data onto a smaller space. Attribute subset selection is a
method of dimensionality reduction in which irrelevant, weakly relevant, or redundant
attributes or dimensions are detected and removed (Section 3.4.4).
Numerosity reduction techniques replace the original data volume by alternative,
smaller forms of data representation. These techniques may be parametric or nonparametric.
For parametric methods, a model is used to estimate the data, so that
typically only the data parameters need to be stored, instead of the actual data. (Outliers
may also be stored.) Regression and log-linear models (Section 3.4.5) are examples.
Nonparametric methods for storing reduced representations of the data include histograms
(Section 3.4.6), clustering (Section 3.4.7), sampling (Section 3.4.8), and data
cube aggregation (Section 3.4.9).
In data compression, transformations are applied so as to obtain a reduced or “compressed”
representation of the original data. If the original data can be reconstructed
from the compressed data without any information loss, the data reduction is called
lossless. If, instead, we can reconstruct only an approximation of the original data, then
the data reduction is called lossy. There are several lossless algorithms for string compression;
however, they typically allow only limited data manipulation. Dimensionality
reduction and numerosity reduction techniques can also be considered forms of data
compression.
18) Write a note on Wavelet Transform and Principal Component Analysis. (6)
Wavelet Transform in Data Science:
Wavelet Transform is a mathematical technique used for analyzing signals and images at
different scales or resolutions. It decomposes a signal into a set of wavelet functions, known as
wavelets, which are scaled and translated versions of a base wavelet function.
Wavelet Transform is a powerful technique in data science with various applications:
1. Signal Processing: In data science, signals often contain valuable information at different
scales. Wavelet Transform allows data scientists to analyze signals effectively by
decomposing them into different frequency components, enabling the identification of
patterns, trends, and anomalies.
2. Time Series Analysis: Time series data frequently exhibit complex patterns and trends that
can be challenging to analyze. Wavelet Transform provides a way to decompose time
series data into different frequency components, helping to identify periodicities, trends,
and irregularities.
3. Image Processing: Images in data science often require processing for analysis,
classification, or feature extraction. Wavelet Transform is used for tasks such as image
compression, denoising, edge detection, and feature extraction, offering advantages over
traditional techniques by capturing both spatial and frequency information simultaneously.
4. Feature Extraction: In machine learning and pattern recognition tasks, feature extraction
plays a crucial role in representing data effectively for modeling. Wavelet Transform can
be used to extract relevant features from signals or images, reducing dimensionality while
preserving important information for classification or regression tasks.
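A minimal sketch of wavelet-based smoothing/denoising, assuming the PyWavelets package (pywt) is installed; the wavelet choice and threshold value are arbitrary illustration choices:

```python
import numpy as np
import pywt

# A noisy test signal: a sine wave plus random noise
t = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(256)

# Multilevel discrete wavelet decomposition (approximation + detail coefficients)
coeffs = pywt.wavedec(signal, "db4", level=3)

# Shrink small detail coefficients (assumed to be mostly noise), keep the approximation
coeffs[1:] = [pywt.threshold(c, value=0.4, mode="soft") for c in coeffs[1:]]

# Reconstruct a smoothed version of the signal
denoised = pywt.waverec(coeffs, "db4")
```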
Principal Component Analysis (PCA) in Data Science:
Principal Component Analysis (PCA) is a statistical technique used for dimensionality
reduction and data compression. It transforms the original features of a dataset into a new set of
orthogonal variables, called principal components, which capture the maximum variance in the
data.
PCA is a fundamental technique in data science with widespread applications:
1. Dimensionality Reduction: In data science, datasets often contain high-dimensional
features, which can lead to issues such as overfitting, increased computational complexity,
and difficulty in interpretation. PCA addresses these challenges by transforming high-
dimensional data into a lower-dimensional space while retaining most of the variance,
thereby reducing redundancy and improving model performance.
2. Data Visualization: Visualizing high-dimensional data is challenging, but PCA can help by
projecting data onto a lower-dimensional space while preserving as much variance as
possible. This allows data scientists to visualize complex datasets in two or three
dimensions, facilitating exploration, interpretation, and communication of insights.
3. Data Compression: PCA can be used for data compression by representing data using a
smaller number of principal components, which are linear combinations of the original
features. This reduces storage requirements and computational complexity while
minimizing information loss, making it useful for handling large datasets efficiently.
4. Noise Reduction: PCA can also be employed for noise reduction by removing principal
components with low variance, which are assumed to correspond to noise or irrelevant
information. This helps to improve the signal-to-noise ratio and enhance the quality of the
data for subsequent analysis or modeling tasks.
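A minimal scikit-learn sketch of PCA for dimensionality reduction, here applied to the built-in Iris data and standardizing first, since PCA is sensitive to feature scale:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                      # 150 samples x 4 features

# Standardize, then project onto the first two principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)      # variance captured by each component
```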
21) Write a note on different methods used for the generation of concept hierarchies for
categorical data. (4/6)
1. Schema Hierarchy: A schema hierarchy is a type of concept hierarchy used to organize the
schema of a database in a logical and meaningful way, grouping similar objects together. It can be used to
organize different kinds of schema objects, such as tables, attributes, and relationships. This is useful in
data warehousing, where data from multiple sources needs to be integrated into a single database.
2. Set-Grouping Hierarchy: Set-Grouping Hierarchy is a type of concept hierarchy that is based on
set theory, where each set in the hierarchy is defined in terms of its membership in other sets. Set-
grouping hierarchies can be used for data cleaning, data pre-processing, and data integration. This
type of hierarchy can be used to identify and remove outliers, noise, or inconsistencies from the
data and to integrate data from multiple sources (a short sketch follows this list).
3. Operation-Derived Hierarchy: An Operation-Derived Hierarchy is a type of concept hierarchy
that is used to organize data by applying a series of operations or transformations to the data. The
operations are applied in a top-down fashion, with each level of the hierarchy representing a more
general or abstract view of the data than the level below it. This type of hierarchy is typically used
in data mining tasks such as clustering and dimensionality reduction. The operations applied can
be mathematical or statistical operations such as aggregation or normalization.
4. Rule-based Hierarchy: A rule-based hierarchy is a type of concept hierarchy used to
organize data by applying a set of rules or conditions to the data. This type of hierarchy is useful in
data mining tasks such as classification, decision-making, and data exploration. It allows the
assignment of a class label or decision to each data point based on its characteristics and helps identify
patterns and relationships between different attributes of the data.
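A rough sketch of how a simple set-grouping style concept hierarchy for a categorical attribute might be represented and used to generalize data; the city-to-region mapping below is made up for illustration:

```python
import pandas as pd

# Explicit set-grouping hierarchy: each city belongs to a higher-level region
city_to_region = {"Mumbai": "West", "Pune": "West",
                  "Delhi": "North", "Chennai": "South"}

sales = pd.DataFrame({"city":  ["Mumbai", "Delhi", "Chennai", "Pune"],
                      "sales": [10, 7, 5, 8]})

# Climb one level of the hierarchy (city -> region), then aggregate
sales["region"] = sales["city"].map(city_to_region)
print(sales.groupby("region")["sales"].sum())
```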
22) Write a note on data cube aggregation.
Data cubes store multidimensional aggregated information. For example, Figure 3.11 shows a data cube
for multidimensional analysis of sales data with respect to annual sales per item type
for each AllElectronics branch. Each cell holds an aggregate data value, corresponding
to the data point in multidimensional space. (For readability, only some cell values are
shown.) Concept hierarchies may exist for each attribute, allowing the analysis of data
at multiple abstraction levels. For example, a hierarchy for branch could allow branches
to be grouped into regions, based on their address. Data cubes provide fast access to
precomputed, summarized data, thereby benefiting online analytical processing as well
as data mining.
The cube created at the lowest abstraction level is referred to as the base cuboid. The
base cuboid should correspond to an individual entity of interest such as sales or customer.
In other words, the lowest level should be usable, or useful for the analysis. A cube
at the highest level of abstraction is the apex cuboid. For the sales data in Figure 3.11,
the apex cuboid would give one total—the total sales for all three years, for all item
types, and for all branches. Data cubes created for varying levels of abstraction are often
referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids. Each
higher abstraction level further reduces the resulting data size. When replying to data
mining requests, the smallest available cuboid relevant to the given task should be used.
23) Explain heuristic methods of attribute subset selection with an example and explain its
techniques. (6)
Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions). The goal of attribute subset selection is to find
a minimum set of attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using all attributes.
Mining on a reduced set of attributes has an additional benefit: It reduces the number
of attributes appearing in the discovered patterns, helping to make the patterns easier to
understand.
Basic heuristic methods of attribute subset selection include the techniques that
follow, some of which are illustrated in Figure 3.6.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as
the reduced set. The best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of the remaining original
attributes is added to the set (a sketch of forward selection using scikit-learn follows this list).
2. Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each
step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart-like
structure where each internal (nonleaf) node denotes a test on an attribute, each
branch corresponds to an outcome of the test, and each external (leaf) node denotes a
class prediction. At each node, the algorithm chooses the “best” attribute to partition
the data into individual classes.
When decision tree induction is used for attribute subset selection, a tree is constructed
from the given data. All attributes that do not appear in the tree are assumed
to be irrelevant. The set of attributes appearing in the tree form the reduced subset
of attributes.
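A hedged sketch of stepwise forward selection using scikit-learn's SequentialFeatureSelector (available in scikit-learn 0.24+); the estimator and the number of features to keep are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Start from an empty set and greedily add the attribute that helps most at each step
selector = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                     n_features_to_select=2,
                                     direction="forward")
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the selected attributes
```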
24) What is sampling? Explain the ways in which data can be sampled for data reduction.
Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random data sample (or subset). Suppose that a large
data set, D, contains N tuples. Let’s look at the most common ways that we could sample
D for data reduction, as illustrated in Figure 3.9.
Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from
D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a
tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it
may be drawn again.
Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be
obtained, where s < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be
considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting
in a cluster sample of the tuples. Other clustering criteria conveying rich semantics can also be explored. For example,
in a spatial database, we may choose to define clusters geographically based on how closely different areas are located.
Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. For
example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age
group. In this way, the age group having the smallest number of customers
will be sure to be represented.
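A hedged pandas sketch of SRSWOR, SRSWR, and stratified sampling on a made-up customer table (the column names and group sizes are illustrative):

```python
import pandas as pd

D = pd.DataFrame({"customer_id": range(1, 101),
                  "age_group": ["young"] * 60 + ["middle"] * 30 + ["senior"] * 10})

s = 10
srswor = D.sample(n=s, replace=False, random_state=0)   # without replacement
srswr  = D.sample(n=s, replace=True,  random_state=0)   # with replacement

# Stratified sample: an SRS drawn within each age_group stratum,
# so even the smallest group is represented
stratified = D.groupby("age_group").sample(frac=0.1, random_state=0)
print(stratified["age_group"].value_counts())
```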