Unit 2 – 5-Mark Questions (Data Science) – BCA

1) Define and explain data warehouse. (4/6)
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and
usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing.
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction
processing. It includes historical data derived from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for decision-
makers for data modelling and analysis.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's
decisions."
Characteristics of Data Warehouse:
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses
typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales,
rather than the organization's ongoing global operations. This is done by excluding data that are not useful for
the subject and including all data needed by the users to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online transaction
records. It requires performing data cleaning and integration during data warehousing to ensure consistency in naming
conventions, attributes types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12
months, or even older periods from a data warehouse. This contrasts with a transactional system, where often only
the most current data are kept.
Non-Volatile
The data warehouse is a physically separate data store, populated with data transformed from the source operational RDBMS.
Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not
performed. It usually requires only two procedures for data access: initial loading of data and access to data.
Therefore, the DW does not require transaction processing, recovery, and concurrency capabilities, which allows for
substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not
change.

2) Explain any four/six differences between operational database system and data
warehouse. (4/6)
3) Explain the following Multidimensional database schema with example. (3 marks each)
i.) Star schema
A star schema is the elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a login. A dimension
contains reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a multidimensional data model. The star schema is the
simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of the schema
resembles a star, with points diverging from a central table. The center of the schema consists of a large fact table,
and the points of the star are the dimension tables.
The star schema is well suited for data warehouse database design because of the following features:
o It creates a de-normalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the development cycle, and
as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
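
As an illustration (not part of the original notes), the following minimal pandas sketch builds a tiny star schema: a central sales fact table joined to hypothetical item and date dimension tables for a typical aggregate query. All table and column names are assumptions.

```python
import pandas as pd

# Dimension tables (hypothetical data)
dim_item = pd.DataFrame({
    "item_key": [1, 2],
    "item_name": ["Laptop", "Phone"],
    "category": ["Computers", "Mobiles"],
})
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "month": ["Jan", "Jan"],
    "year": [2024, 2024],
})

# Central fact table: one row per sale, with measures plus foreign keys to the dimensions
fact_sales = pd.DataFrame({
    "date_key": [20240101, 20240101, 20240102],
    "item_key": [1, 2, 1],
    "units_sold": [3, 5, 2],
    "dollars_sold": [3000, 2500, 2000],
})

# A typical star-schema query: join facts to dimensions, then aggregate
report = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_date, on="date_key")
          .groupby(["category", "month"], as_index=False)[["units_sold", "dollars_sold"]]
          .sum())
print(report)
```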

ii.) Snowflake schema


A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more
dimension tables do not connect directly to the fact table but must join through other dimension tables."
The snowflake schema is an expansion of the star schema in which each point of the star explodes into more
points. It is called a snowflake schema because its diagram resembles a snowflake. Snowflaking is a method of
normalizing the dimension tables in a star schema. When we normalize all the dimension tables entirely, the
resultant structure resembles a snowflake with the fact table in the middle.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact
surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out
into a snowflake pattern.
The snowflake schema consists of one fact table linked to many dimension tables, which can be linked
to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally
normalized to the third normal form. Each dimension table represents exactly one level in a hierarchy.
Advantage of Snowflake Schema
1) The primary advantage of the snowflake schema is the improvement in query performance due to minimized
disk storage requirements and joins against smaller lookup tables.
2) It provides greater scalability in the interrelationship between dimension levels and components.
3) No redundancy, so it is easier to maintain.
Disadvantage of Snowflake Schema
1) The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the
increased number of lookup tables.
2) Queries are more complex and hence harder to understand.
3) More tables mean more joins, and therefore longer query execution time.
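
For contrast with the star schema sketch above, this hypothetical pandas example "snowflakes" the item dimension by normalizing its category attribute into a separate table, so a query has to join through an extra dimension table. Names and data are invented for illustration.

```python
import pandas as pd

# Normalized (snowflaked) dimensions: item references a separate category table
dim_category = pd.DataFrame({"category_key": [10, 20],
                             "category_name": ["Computers", "Mobiles"]})
dim_item = pd.DataFrame({"item_key": [1, 2],
                         "item_name": ["Laptop", "Phone"],
                         "category_key": [10, 20]})
fact_sales = pd.DataFrame({"item_key": [1, 2, 1],
                           "dollars_sold": [3000, 2500, 2000]})

# The fact table joins to item, which in turn joins to category (the extra hop)
sales_by_category = (fact_sales
                     .merge(dim_item, on="item_key")
                     .merge(dim_category, on="category_key")
                     .groupby("category_name")["dollars_sold"].sum())
print(sales_by_category)
```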

iii.) Fact constellation schema


A fact constellation means two or more fact tables sharing one or more dimensions. It is also called a galaxy
schema. A fact constellation schema describes a logical structure of a data warehouse or data mart. A fact
constellation schema can be designed with a collection of de-normalized fact, shared, and conformed dimension tables.
The fact constellation schema is a sophisticated database design in which it is difficult to summarize information.
It can be implemented between aggregate fact tables, or a complex fact table can be decomposed into independent,
simpler fact tables. The primary disadvantage of the fact constellation schema is that it is a more challenging
design, because many variants for specific kinds of aggregation must be considered and selected.

4) Define Measure. Explain different categories of measures. (4)


A data cube measure is a numeric function that can be evaluated at each point in the data cube space. A measure
value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs
defining the given point.
Measures can be organized into three categories, based on the kind of aggregate functions used: distributive,
algebraic, and holistic.
Distributive:
An aggregate function is distributive if it can be computed in a distributed manner, as follows. Suppose the data are
partitioned into n sets. The function is applied to each partition, resulting in n aggregate values. If the result
obtained by applying the function to the n aggregate values is the same as the result obtained by applying the
function to the entire data set (without partitioning), the function can be computed in a distributed manner.
For example, count() for a data cube can be calculated by partitioning the cube into a group of sub-cubes,
computing count() for each sub-cube, and then adding the counts to get the total. So we can conclude that
count() is a distributive aggregate function.
Any measure is said to be distributive if it is obtained by applying a distributive aggregate function.
Examples: sum(), count(), min()
Algebraic:
An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a
bounded positive integer), each of which is obtained by applying a distributive aggregate function.
Consider the average function, avg(). It is computed from sum() and count(); both are distributive aggregate
functions, and their ratio is an algebraic function. Similarly, min_N() and max_N() (the N smallest and N largest
values) are algebraic. Any measure obtained by applying an algebraic aggregate function is an algebraic measure.
Examples: avg(), max_N(), min_N(), center_of_mass()
Holistic:
An aggregate function is holistic if there is no constant bound on the storage size needed to describe a
sub-aggregate; that is, it cannot be computed by an algebraic function with a bounded number of arguments.
For example, median(), rank(), and mode() are holistic measures. Any measure obtained by applying a holistic
aggregate function is a holistic measure. Most cube applications that work with large amounts of
data demand efficient computation of distributive and algebraic measures.
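
A minimal numpy sketch (illustrative only, with invented data) of the three categories: count() combined from partitions (distributive), avg() derived from sum() and count() (algebraic), and median(), which generally cannot be combined from per-partition medians (holistic).

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=1_000)          # hypothetical measure values
parts = np.array_split(data, 4)                  # partition the data into 4 "sub-cubes"

# Distributive: count() of the whole equals the sum of per-partition counts
assert sum(len(p) for p in parts) == len(data)

# Algebraic: avg() is derived from two distributive aggregates, sum() and count()
total = sum(p.sum() for p in parts)
count = sum(len(p) for p in parts)
assert np.isclose(total / count, data.mean())

# Holistic: median() cannot, in general, be combined from per-partition medians
partition_medians = [np.median(p) for p in parts]
print(np.median(partition_medians), np.median(data))  # usually not equal
```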

5) Explain any four/six OLAP operations. (4/6)


(you can use assignment answers also)
Roll-Up
The roll-up operation (also known as drill-up or the aggregation operation) performs aggregation on a data cube,
either by climbing up a concept hierarchy for a dimension or by dimension reduction. Roll-up is like zooming out
on the data cube. The figure shows the result of a roll-up operation performed on the dimension location. The
hierarchy for location is defined as street < city < province or state < country. The roll-up operation aggregates
the data by ascending the location hierarchy from the level of city to the level of country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the cube.
For example, consider a sales data cube having two dimensions, location and time. Roll-up may be
performed by removing the time dimension, resulting in an aggregation of total sales by location
rather than by location and time.

Drill-Down
The drill-down operation (also called roll-down) is the reverse operation of roll-up. Drill-down is
like zooming-in on the data cube. It navigates from less detailed record to more detailed data. Drill-down
can be performed by either stepping down a concept hierarchy for a dimension or adding additional
dimensions.
The figure shows a drill-down operation performed on the dimension time by stepping down a concept hierarchy
defined as day < month < quarter < year. Drill-down occurs by descending the time hierarchy from
the level of quarter to the more detailed level of month.
Because a drill-down adds more details to the given data, it can also be performed by adding a new
dimension to a cube. For example, a drill-down on the central cubes of the figure can occur by introducing
an additional dimension, such as a customer group.
Slice
A slice is a subset of the cube corresponding to a single value for one dimension.
For example, a slice operation is performed when the user wants a selection on one dimension of a three-
dimensional cube, resulting in a two-dimensional slice. So the slice operation performs a selection on one
dimension of the given cube, thus resulting in a subcube.
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.

Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data
axes in view to provide an alternative presentation of the data. It may involve swapping the rows and
columns, or moving one of the row dimensions into the column dimensions.
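
The following pandas sketch (hypothetical data, illustration only) mimics these OLAP operations on a small sales table: roll-up via groupby to a higher level of the location hierarchy, slice and dice via selections, and pivot via pivot_table; drill-down would simply be the reverse of the roll-up, i.e., returning to the city level.

```python
import pandas as pd

# A toy sales "cube" held as a flat table (hypothetical data)
sales = pd.DataFrame({
    "country": ["India", "India", "USA", "USA"],
    "city":    ["Delhi", "Mumbai", "NYC", "LA"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["Phone", "Laptop", "Phone", "Laptop"],
    "sales":   [100, 150, 200, 250],
})

# Roll-up: aggregate city-level data up to the country level
rollup = sales.groupby(["country", "quarter"])["sales"].sum()

# Slice: select a single value on one dimension (quarter = Q1)
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions
dice = sales[(sales["quarter"] == "Q1") & (sales["country"] == "India")]

# Pivot: rotate the axes to obtain an alternative presentation
pivot = sales.pivot_table(index="country", columns="quarter",
                          values="sales", aggfunc="sum")
print(rollup, slice_q1, dice, pivot, sep="\n\n")
```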

6) Explain different views regarding the business analysis framework design of a data
warehouse. (4)
Four different views regarding a data warehouse design must be considered: the top-down view, the data source
view, the data warehouse view, and the business query view.
The top-down view: allows the selection of the relevant information necessary for the data warehouse. This
information matches current and future business needs.
The data source view: exposes the information being captured, stored, and managed by operational systems. This
information may be documented at various levels of detail and accuracy, from individual data source tables to
integrated data source tables. Data sources are often modeled by traditional data modeling techniques, such as the
entity-relationship model or CASE (computer-aided software engineering) tools.
The data warehouse view: includes fact tables and dimension tables. It represents the information that is stored
inside the data warehouse, including precalculated totals and counts, as well as information regarding the source,
date, and time of origin, added to provide historical context.
Finally, the business query view: is the data perspective in the data warehouse from the end-user's viewpoint.

7) Explain the steps in data warehouse design process. (4)


In general, the warehouse design process consists of the following steps:
1. Choose a business process to model (e.g., orders, invoices, shipments, inventory,
account administration, sales, or the general ledger). If the business process is organizational
and involves multiple complex object collections, a data warehouse model
should be followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be chosen.
2. Choose the business process grain, which is the fundamental, atomic level of data
to be represented in the fact table for this process (e.g., individual transactions,
individual daily snapshots, and so on).
3. Choose the dimensions that will apply to each fact table record. Typical dimensions
are time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold.
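
A small illustrative sketch, assuming a hypothetical sales process, of how the four design decisions might be recorded; the names used are not from the original notes.

```python
# A hypothetical record of the four warehouse design decisions for a "sales"
# process, captured as a simple Python specification (illustration only).
warehouse_design = {
    "business_process": "sales",                           # step 1: process to model
    "grain": "one row per individual sales transaction",   # step 2: fact table grain
    "dimensions": ["time", "item", "customer", "store"],   # step 3: dimensions
    "measures": ["dollars_sold", "units_sold"],            # step 4: additive measures
}

for step, choice in warehouse_design.items():
    print(f"{step}: {choice}")
```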
8) Explain the three-tier architecture of Data warehouse with neat diagram. (6)
1. The bottom tier is a warehouse database server that is almost always a relational
database system. Back-end tools and utilities are used to feed data into the bottom
tier from operational databases or other external sources (e.g., customer profile
information provided by external consultants). These tools and utilities perform data
extraction, cleaning, and transformation (e.g., to merge similar data from different
sources into a unified format), as well as load and refresh functions to update the
data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying DBMS
and allows client programs to generate SQL code to be executed at a server. Examples
of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by
Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about
the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a
relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations
on multidimensional data to standard relational operations); or (2) a multidimensional
OLAP (MOLAP) model (i.e., a special-purpose server that directly
implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools,
analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).

9) Explain the different data warehouse models from the architecture point of view. (4)
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
Enterprise warehouse: An enterprise warehouse collects all of the information about
subjects spanning the entire organization. It provides corporate-wide data integration,
usually from one or more operational systems or external information
providers, and is cross-functional in scope. It typically contains detailed data as
well as summarized data, and can range in size from a few gigabytes to hundreds
of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented
on traditional mainframes, computer superservers, or parallel architecture
platforms. It requires extensive business modeling and may take years to design
and build.
Data mart: A data mart contains a subset of corporate-wide data that is of value to a
specific group of users. The scope is confined to specific selected subjects. For example,
a marketing data mart may confine its subjects to customer, item, and sales. The
data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are
Unix/Linux or Windows based. The implementation cycle of a data mart is more
likely to be measured in weeks rather than months or years. However, it may
involve complex integration in the long run if its design and planning were not
enterprise-wide.
Depending on the source of data, data marts can be categorized as independent
or dependent. Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data generated
locally within a particular department or geographic area. Dependent data marts are
sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may be
materialized. A virtual warehouse is easy to build but requires excess capacity on
operational database servers.
10) Explain the different back-end tools and utilities included in data warehouse. (4)
The back-end tools of a data warehouse are pieces of software responsible for the extraction of data
from several sources, their cleansing, customization, and insertion into a data warehouse. They are
known under the general term extraction, transformation and loading (ETL) tools.
In all the phases of an ETL process (extraction and exportation, transformation and cleaning, and
loading), individual issues arise and, along with the problems and constraints that concern the overall
ETL process, make its lifecycle a very complex task.
Phase I: Extraction and Transportation
During the ETL process, a first task that must be performed is the extraction of the relevant
information that has to be propagated further to the warehouse.
Phase II: Transformation and Cleaning
It is possible to determine typical tasks that take place during the transformation and cleaning phase
of an ETL process.
or
Data Warehouse Back-End Tools and Utilities
• Data extraction:
o get data from multiple, heterogeneous, and external sources
• Data cleaning:
o detect errors in the data and rectify them when possible
• Data transformation:
o convert data from legacy or host format to warehouse format
• Load:
o sort, summarize, consolidate, compute views, check integrity, and build indices and
partitions
• Refresh:
o propagate the updates from the data sources to the warehouse
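
A minimal ETL sketch under stated assumptions: pandas for extraction, cleaning, and transformation, and the standard-library sqlite3 module as a stand-in warehouse database. The source data, table name, and file name are hypothetical.

```python
import sqlite3
import pandas as pd

# Extract: read data from a (hypothetical) operational source;
# this in-memory frame stands in for something like pd.read_csv("orders.csv")
source = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount": ["100", "200", "200", None],
})

# Clean: remove duplicate records and drop a row with a missing value
cleaned = source.drop_duplicates().dropna(subset=["amount"]).copy()

# Transform: convert from the source (string) format to the warehouse (numeric) format
cleaned["amount"] = cleaned["amount"].astype(float)

# Load: write the result into the warehouse database (sqlite3 used as a stand-in)
with sqlite3.connect("warehouse.db") as conn:
    cleaned.to_sql("fact_orders", conn, if_exists="replace", index=False)

# Refresh would periodically re-run this pipeline to propagate source updates.
```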

11) What is metadata? Explain metadata repository. (6)


Metadata are data about data, i.e., documentation about the information that is required by the users. In
data warehousing, metadata are one of the essential aspects: when used in a data warehouse, metadata are the data
that define warehouse objects. Metadata are created for the data names
and definitions of the given warehouse. Additional metadata are created and captured
for timestamping any extracted data, the source of the extracted data, and missing fields
that have been added by data cleaning or integration processes.
A metadata repository should contain the following:
*A description of the data warehouse structure, which includes the warehouse schema,
view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.
*Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails).
*The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
*Mapping from the operational environment to the data warehouse, which includes
source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging
rules, and security (user authorization and access control).
*Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and
scheduling of refresh, update, and replication cycles.
*Business metadata, which include business terms and definitions, data ownership
information, and charging policies.
12) Explain different types of OLAP servers. (4/6)
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a relational
back-end server and client front-end tools. They use a relational or extended-relational DBMS to store and
manage warehouse data, and OLAP middleware to support missing pieces. ROLAP servers include
optimization for each DBMS back end, implementation of aggregation navigation logic, and additional
tools and services. ROLAP technology tends to have greater scalability than MOLAP technology. The DSS
server of Microstrategy, for example, adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional
data views through array-based multidimensional storage engines. They map multidimensional
views directly to data cube array structures. The advantage of using a data cube is that it allows fast indexing
to precomputed summarized data. Notice that with multidimensional data stores, the storage utilization may
be low if the data set is sparse. In such cases, sparse matrix compression techniques should be
explored. Many MOLAP servers adopt a two-level storage representation to handle dense
and sparse data sets: Denser subcubes are identified and stored as array structures, whereas sparse subcubes
employ compression technology for efficient storage utilization.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP technology,
benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a
HOLAP server may allow large volumes of detailed data to be stored in a relational database, while
aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000 supports a hybrid OLAP
server.
13) Explain the different forms of data preprocessing. (4)
The major steps involved in data preprocessing are data cleaning, data integration, data reduction, and data
transformation.
1. Data Cleaning:
 Handling Missing Values: Techniques such as mean imputation, median imputation,
mode imputation, or advanced methods like predictive modeling can be used to fill in
missing values.
 Removing Duplicates: Identifying and removing identical records to ensure data
integrity.
 Error Correction: Correcting inaccuracies such as typos, inconsistencies, or invalid
values.
 Outlier Detection and Treatment: Identifying and handling outliers that can skew
analysis results, through methods like trimming, winsorization, or replacing outliers.
2. Data Integration:
 Schema Integration: Resolving differences in data structure and format across multiple
sources.
 Entity Resolution: Identifying and resolving inconsistencies in how entities are
represented across datasets.
 Data Consolidation: Combining data from different sources while ensuring consistency
and integrity.
3. Data Reduction:
 Dimensionality Reduction: Techniques like Principal Component Analysis (PCA),
Singular Value Decomposition (SVD), or feature selection algorithms to reduce the
number of variables while preserving as much information as possible.
 Sampling: Selecting a representative subset of the data for analysis, such as random
sampling, stratified sampling, or systematic sampling.
 Aggregation: Summarizing data by grouping similar instances together, reducing the
overall size while preserving key information.
4. Data Transformation:
 Normalization: Scaling numerical data to a standard range, such as [0, 1] or [-1, 1], to
ensure that features contribute equally to the analysis.
 Encoding Categorical Variables: Converting categorical data into numerical form
suitable for machine learning algorithms, through techniques like one-hot encoding,
label encoding, or target encoding.
 Feature Engineering: Creating new features or transforming existing ones to enhance
the predictive power of the data, by methods like binning, polynomial features, or
interaction features.
 Discretization: Binning continuous data into discrete intervals, either equal-width
binning or equal-frequency binning, to simplify the data and improve model
performance.
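
A short pandas sketch (invented data, illustrative only) touching each of the four forms of preprocessing listed above: cleaning, integration, reduction, and transformation.

```python
import pandas as pd

df = pd.DataFrame({                      # hypothetical raw data
    "customer": ["A", "A", "B", "C"],
    "age": [25, 25, None, 40],
    "city": ["Delhi", "Delhi", "NYC", "LA"],
    "spend": [100.0, 100.0, 250.0, 400.0],
})

# 1. Data cleaning: remove duplicates, fill a missing value with the mean
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].mean())

# 2. Data integration: merge with a second (hypothetical) source on a shared key
regions = pd.DataFrame({"city": ["Delhi", "NYC", "LA"],
                        "region": ["Asia", "North America", "North America"]})
df = df.merge(regions, on="city")

# 3. Data reduction: keep a sampled subset of the rows
reduced = df.sample(frac=0.75, random_state=42)

# 4. Data transformation: min-max normalize spend to [0, 1]; one-hot encode city
df["spend_norm"] = (df["spend"] - df["spend"].min()) / (df["spend"].max() - df["spend"].min())
df = pd.get_dummies(df, columns=["city"])
print(df)
```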

14) Explain (any four) the different ways of handling missing values. (4/6)
different ways of handling missing values in a dataset:
1. Ignore the tuple (Listwise Deletion):
 This approach involves removing entire rows (tuples) from the dataset if they
contain any missing values.
 It ensures that the analysis is performed only on complete cases, but it may lead
to a loss of valuable data.
2. Fill in the missing value manually:
 In this method, missing values are identified, and the analyst or domain expert
manually fills in the missing values based on their knowledge or intuition.
 While this approach allows for personalized handling of missing values, it can
be time-consuming and subjective.
3. Use a global constant to fill in the missing value:
 Missing values are replaced with a predetermined global constant, such as 0 or -
1.
 This approach is simple and quick but may not reflect the true nature of the
data, potentially introducing bias.
4. Use a measure of central tendency (e.g., mean or median) to fill in the missing value:
 Missing values in numerical attributes are replaced with the mean or median of
the available values in the same attribute.
 This method preserves the overall distribution of the data but may be sensitive
to outliers.
5. Use the attribute mean or median for all samples belonging to the same class:
 If the dataset contains categorical variables, missing values in numerical
attributes can be replaced with the mean or median of the respective attribute
values within the same class or category.
 This approach accounts for potential differences in attribute distributions across
different classes.
6. Use the most probable value to fill in the missing value:
 Missing values are replaced with the most probable value based on the observed
data and possibly other attributes.
 This method can be implemented using techniques such as regression, k-nearest
neighbors, or decision trees to predict missing values.
 It can capture complex relationships in the data but requires additional
computational resources.
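
A pandas sketch, with invented data, of several of the strategies listed above; the numbering in the comments mirrors the list, and the scikit-learn KNNImputer mentioned for strategy 6 is one common option, not something prescribed by these notes.

```python
import pandas as pd

df = pd.DataFrame({                       # hypothetical data with missing income values
    "class": ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 30.0, None, 40.0],
})

# 1. Ignore the tuple (listwise deletion)
dropped = df.dropna(subset=["income"])

# 3. Fill with a global constant
constant_filled = df.fillna({"income": -1})

# 4. Fill with a measure of central tendency (overall mean)
mean_filled = df.assign(income=df["income"].fillna(df["income"].mean()))

# 5. Fill with the mean of the same class
class_mean_filled = df.assign(
    income=df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean())))

# 6. Fill with the most probable value, e.g. a k-NN based imputer:
# from sklearn.impute import KNNImputer   # optional, requires scikit-learn
print(dropped, constant_filled, mean_filled, class_mean_filled, sep="\n\n")
```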

15) Define noise and explain different data smoothing techniques. (6)
Noise is a random error or variance in a measured variable. Noise can arise from various
sources, including measurement errors, data collection artifacts, or inherent variability in the
phenomenon being studied.
The different data smoothing techniques are as follows:
1. Binning:
 Binning is a data smoothing technique that involves dividing the data into
intervals, or bins, and replacing the values within each bin with a representative
value, such as the mean, median, or mode of the values in that bin.
 This technique is often used to reduce the effects of noise and variability in
continuous data by grouping similar values together.
 Binning can help simplify the data, make it more interpretable, and identify
trends or patterns that may not be apparent in raw data.
 However, binning may also lead to information loss and loss of granularity,
especially if the number of bins is chosen arbitrarily or if important details are
obscured by bin boundaries.
2. Regression:
 Regression analysis is a statistical technique used to model the relationship
between one or more independent variables (predictors) and a dependent
variable (outcome) based on observed data.
 In the context of data smoothing, regression can be used to fit a curve or surface
to the data, allowing for the estimation of values between observed data points.
 Regression models can help identify trends, patterns, or underlying
relationships in the data, making it easier to interpret and make predictions.
 Common regression techniques include linear regression, polynomial
regression, logistic regression, and more advanced methods such as ridge
regression or lasso regression.
 Regression can effectively smooth noisy data, but it may also introduce bias if
the model assumptions are violated or if overfitting occurs.
3. Outlier Analysis:
 Outlier analysis is a data smoothing technique focused on identifying and
handling outliers, which are data points that deviate significantly from the rest
of the data.
 Outliers can distort statistical analyses and modeling results, leading to biased
conclusions or inaccurate predictions.
 Outlier analysis involves techniques such as visual inspection, statistical tests,
or machine learning algorithms to detect outliers based on their distance,
density, or deviation from the expected distribution.
 Once outliers are identified, they can be removed, transformed, or treated
separately to minimize their impact on the analysis.
 Outlier analysis is essential for improving the robustness and accuracy of data
analysis and modeling by reducing the influence of extreme values that may not
represent the underlying phenomenon accurately.
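
A minimal sketch of the binning-based smoothing described in point 1 above, using the classic sorted price list 4, 8, 15, 21, 21, 24, 25, 28, 34 partitioned into equal-frequency bins and smoothed by bin means; the data and bin count are illustrative choices.

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])   # hypothetical noisy values

# Equal-frequency binning into 3 bins, then smoothing by bin means
bins = pd.qcut(prices, q=3, labels=False)
smoothed_by_mean = prices.groupby(bins).transform("mean")

# (Smoothing by bin boundaries would instead replace each value by the nearer bin edge.)
print(pd.DataFrame({"price": prices, "bin": bins, "smoothed": smoothed_by_mean}))
```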

16) Explain the different steps involved in data transformation. (4)


The different steps involved in data transformation are as follows:
1. Data Discovery
You begin by identifying and understanding the source format of your data. A data profiling
tool is helpful at this stage. Or, if you’re using an ELT workflow, then the data will be
extracted from its original sources and loaded into the target data warehouse.
2. Data Mapping
Once in the data warehouse, it’s time for data exploration, or data mapping. During this
phase, you get to see how the data looks and identify if any information is missing. Data
mapping sets the action plan for the data and can end up being time-consuming without the
help of automation software.
3. Data Transformation
During the data transformation stage, the main work takes place. At this step, you’re aware
of how the data is structured as well as how it needs to be structured.
This step consists of two main efforts, namely:
 Generating Code: At this step, the code generation is performed so that you can
transform data into the required format.
 Executing Code: Once the code is ready, the work really begins. This is the process of
editing the format from the source system into the right format for the target source.
Transformation can be either light or heavy. Light transformation includes renaming tables
and fields, casting fields correctly, and creating uniformity. Heavy transformation involves
adding business logic, data aggregation, and the like.
4. Review: Data Testing
Once data is modeled and ready to go, you can test it out in action by ensuring column
values fall into the expected range, checking model relations line up, etc.
5. Data Documentation
When the testing has proven the data to be in good standing, you can expose results to end
users. Making data transformations usable and impactful requires documentation.
Documentation covers and outlines the purpose of the data model and transformation in the
first place, as well as defines key metrics and business logic that has been applied.
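
A small pandas sketch (hypothetical field names) of a "light" transformation from step 3 (renaming fields, casting types, creating uniformity), followed by a simple range check as in step 4.

```python
import pandas as pd

raw = pd.DataFrame({                     # hypothetical extract loaded into the warehouse
    "CustID": ["001", "002"],
    "order_amt": ["10.50", "20.00"],
    "Country": ["india", "INDIA"],
})

# Step 3 (light transformation): rename fields, cast types, create uniformity
transformed = (raw
               .rename(columns={"CustID": "customer_id", "order_amt": "order_amount"})
               .assign(order_amount=lambda d: d["order_amount"].astype(float),
                       Country=lambda d: d["Country"].str.title()))

# Step 4 (review / data testing): check that values fall in the expected range
assert (transformed["order_amount"] > 0).all()
print(transformed)
```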

17) What is data reduction? Explain the strategies for data reduction. (6)
Data reduction techniques can be applied to obtain a reduced representation of the
data set that is much smaller in volume, yet closely maintains the integrity of the original
data. That is, mining on the reduced data set should be more efficient yet produce the
same (or almost the same) analytical results.
Data reduction strategies include dimensionality reduction, numerosity reduction, and
data compression.
Dimensionality reduction is the process of reducing the number of random variables
or attributes under consideration. Dimensionality reduction methods include wavelet transforms
and principal components analysis, which transform or project the original data onto a smaller space.
Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant,
or redundant attributes or dimensions are detected and removed.
Numerosity reduction techniques replace the original data volume by alternative,
smaller forms of data representation. These techniques may be parametric or nonparametric.
For parametric methods, a model is used to estimate the data, so that
typically only the data parameters need to be stored, instead of the actual data. (Outliers
may also be stored.) Regression and log-linear models are examples.
Nonparametric methods for storing reduced representations of the data include histograms,
clustering, sampling, and data cube aggregation.
In data compression, transformations are applied so as to obtain a reduced or “compressed”
representation of the original data. If the original data can be reconstructed
from the compressed data without any information loss, the data reduction is called
lossless. If, instead, we can reconstruct only an approximation of the original data, then
the data reduction is called lossy. There are several lossless algorithms for string compression;
however, they typically allow only limited data manipulation. Dimensionality
reduction and numerosity reduction techniques can also be considered forms of data
compression.
18) Write a note on Wavelet Transform and Principal Component Analysis. (6)
Wavelet Transform in Data Science:
Wavelet Transform is a mathematical technique used for analyzing signals and images at
different scales or resolutions. It decomposes a signal into a set of wavelet functions, known as
wavelets, which are scaled and translated versions of a base wavelet function.
Wavelet Transform is a powerful technique in data science with various applications:
1. Signal Processing: In data science, signals often contain valuable information at different
scales. Wavelet Transform allows data scientists to analyze signals effectively by
decomposing them into different frequency components, enabling the identification of
patterns, trends, and anomalies.
2. Time Series Analysis: Time series data frequently exhibit complex patterns and trends that
can be challenging to analyze. Wavelet Transform provides a way to decompose time
series data into different frequency components, helping to identify periodicities, trends,
and irregularities.
3. Image Processing: Images in data science often require processing for analysis,
classification, or feature extraction. Wavelet Transform is used for tasks such as image
compression, denoising, edge detection, and feature extraction, offering advantages over
traditional techniques by capturing both spatial and frequency information simultaneously.
4. Feature Extraction: In machine learning and pattern recognition tasks, feature extraction
plays a crucial role in representing data effectively for modeling. Wavelet Transform can
be used to extract relevant features from signals or images, reducing dimensionality while
preserving important information for classification or regression tasks.
Principal Component Analysis (PCA) in Data Science:
Principal Component Analysis (PCA) is a statistical technique used for dimensionality
reduction and data compression. It transforms the original features of a dataset into a new set of
orthogonal variables, called principal components, which capture the maximum variance in the
data.
PCA is a fundamental technique in data science with widespread applications:
1. Dimensionality Reduction: In data science, datasets often contain high-dimensional
features, which can lead to issues such as overfitting, increased computational complexity,
and difficulty in interpretation. PCA addresses these challenges by transforming high-
dimensional data into a lower-dimensional space while retaining most of the variance,
thereby reducing redundancy and improving model performance.
2. Data Visualization: Visualizing high-dimensional data is challenging, but PCA can help by
projecting data onto a lower-dimensional space while preserving as much variance as
possible. This allows data scientists to visualize complex datasets in two or three
dimensions, facilitating exploration, interpretation, and communication of insights.
3. Data Compression: PCA can be used for data compression by representing data using a
smaller number of principal components, which are linear combinations of the original
features. This reduces storage requirements and computational complexity while
minimizing information loss, making it useful for handling large datasets efficiently.
4. Noise Reduction: PCA can also be employed for noise reduction by removing principal
components with low variance, which are assumed to correspond to noise or irrelevant
information. This helps to improve the signal-to-noise ratio and enhance the quality of the
data for subsequent analysis or modeling tasks.
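
An illustrative scikit-learn PCA sketch on synthetic correlated data (not from the notes); the wavelet-transform step is only mentioned in a comment, assuming the optional PyWavelets package.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples with 5 correlated features built from 2 latent factors
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))])

# Project onto the first 2 principal components (dimensionality reduction)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("variance explained:", pca.explained_variance_ratio_.sum())

# A wavelet decomposition of a 1-D signal could similarly be computed with the
# PyWavelets package, e.g. pywt.dwt(signal, "db1"), if it is installed.
```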

19) Explain any two numerosity reduction techniques. (4)


1. Sampling:
 Sampling involves selecting a subset of data instances from the original dataset to
form a representative sample. This technique is particularly useful for large datasets
where analyzing the entire dataset may be computationally expensive or impractical.
 There are various sampling methods, including:
 Simple Random Sampling: Each data instance has an equal chance of being
selected, and the selection is made independently of other instances.
 Stratified Sampling: The dataset is divided into strata based on certain
attributes, and random samples are selected from each stratum to ensure that
the sample represents the population's diversity.
 Systematic Sampling: Data instances are selected at regular intervals from an
ordered list of the dataset.
 Cluster Sampling: The dataset is divided into clusters, and a random sample
of clusters is selected. Then, data instances are sampled from the selected
clusters.
 Sampling allows for faster analysis, reduces computational resources, and can
provide insights into the dataset's overall characteristics without analyzing every
data instance. However, the representativeness of the sample is crucial, and biased
sampling can lead to inaccurate conclusions.
2. Data Clustering:
 Data clustering groups similar data instances into clusters based on certain similarity
or distance measures. This technique reduces numerosity by replacing similar
instances with representative cluster centroids or prototypes.
 Clustering algorithms, such as K-means, hierarchical clustering, and DBSCAN,
partition the dataset into clusters based on the inherent structure of the data.
 Once clusters are formed, representative prototypes, such as cluster centroids or
medoids, can be used to summarize each cluster's characteristics.
 Data clustering is useful for exploratory analysis, pattern recognition, and
summarization of large datasets. It helps identify meaningful groups within the data,
reduces redundancy, and provides insights into the dataset's underlying structure.
 However, the choice of clustering algorithm and parameters, as well as the
interpretation of the clusters, require careful consideration to ensure meaningful
results.
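
A sketch, on synthetic data, of the two numerosity reduction ideas above: a 10% simple random sample with pandas, and replacement of the data by k-means cluster centroids with scikit-learn. Sizes and parameters are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 2)), columns=["x", "y"])  # hypothetical data

# Numerosity reduction by sampling: keep a 10% simple random sample
sample = df.sample(frac=0.10, random_state=42)

# Numerosity reduction by clustering: replace the data with 5 cluster centroids
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(df)
centroids = pd.DataFrame(kmeans.cluster_centers_, columns=["x", "y"])

print(len(df), "rows ->", len(sample), "sampled rows /", len(centroids), "centroids")
```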

20) Explain any three methods for data discretization. (6)


1. Clustering:
 Clustering techniques like K-means or hierarchical clustering group similar data points together
based on a chosen similarity measure.
 While clustering itself doesn't discretize data directly, it can indirectly lead to data discretization
by identifying clusters of similar data points.
 Once clusters are identified, each cluster can be treated as a discrete group, effectively
discretizing the data based on similarities between data points within each cluster.
 This approach can be useful for segmenting data into distinct groups for further analysis or
modeling.
2. Decision Tree Analysis:
 Decision trees are predictive models that recursively split the data into subsets based on the
values of input features.
 While decision trees are primarily used for prediction or classification tasks, they inherently
perform a form of data discretization.
 At each node of the decision tree, the algorithm chooses a feature and a threshold to split the
data into two or more subsets.
 This process effectively discretizes the continuous feature space into intervals defined by the
decision tree splits.
 Decision tree analysis can thus indirectly contribute to data discretization by partitioning the data
into discrete regions based on feature values.
3. Correlation Analysis:
 Correlation analysis measures the strength and direction of the relationship between two
variables.
 While correlation analysis doesn't directly discretize data, it can provide insights into which
features are most strongly associated with each other.
 Strongly correlated features may be candidates for aggregation or feature engineering, which can
indirectly lead to data discretization.
 For example, if two highly correlated features are identified, they may be combined into a single
feature or used to define discrete categories based on certain thresholds.
 Correlation analysis can inform decisions about how to transform or discretize features to
improve model performance or interpretability.
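
An illustrative sketch of discretization on hypothetical ages: equal-width and equal-frequency binning with pandas, plus a clustering-based discretization using scikit-learn's KBinsDiscretizer with the k-means strategy (one common tool choice, not prescribed by the notes).

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = pd.Series([18, 22, 25, 27, 31, 35, 41, 45, 52, 60])   # hypothetical values

# Equal-width and equal-frequency binning
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Clustering-based discretization (k-means strategy)
kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
cluster_bins = kbins.fit_transform(ages.to_numpy().reshape(-1, 1)).ravel()

print(pd.DataFrame({"age": ages, "width_bin": equal_width,
                    "freq_bin": equal_freq, "kmeans_bin": cluster_bins}))
```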

21) Write a note on different methods used for the generation of concept hierarchies for
categorical data. (4/6)
1. Schema Hierarchy: Schema Hierarchy is a type of concept hierarchy that is used to organize the
schema of a database in a logical and meaningful way, grouping similar objects together. A
schema hierarchy can be used to organize different types of data, such as tables, attributes, and
relationships, in a logical and meaningful way. This can be useful in data warehousing, where data
from multiple sources needs to be integrated into a single database.
2. Set-Grouping Hierarchy: Set-Grouping Hierarchy is a type of concept hierarchy that is based on
set theory, where each set in the hierarchy is defined in terms of its membership in other sets. Set-
grouping hierarchy can be used for data cleaning, data pre-processing and data integration. This
type of hierarchy can be used to identify and remove outliers, noise, or inconsistencies from the
data and to integrate data from multiple sources.
3. Operation-Derived Hierarchy: An Operation-Derived Hierarchy is a type of concept hierarchy
that is used to organize data by applying a series of operations or transformations to the data. The
operations are applied in a top-down fashion, with each level of the hierarchy representing a more
general or abstract view of the data than the level below it. This type of hierarchy is typically used
in data mining tasks such as clustering and dimensionality reduction. The operations applied can
be mathematical or statistical operations such as aggregation and normalization.
4. Rule-based Hierarchy: A rule-based hierarchy is a type of concept hierarchy that is used to
organize data by applying a set of rules or conditions to the data. This type of hierarchy is useful in
data mining tasks such as classification, decision-making, and data exploration. It allows the
assignment of a class label or decision to each data point based on its characteristics and identifies
patterns and relationships between different attributes of the data.

22) Explain Data cube aggregation? (4)


Imagine that you have collected the data for your analysis. These data consist of the
AllElectronics sales per quarter, for the years 2008 to 2010. You are, however, interested
in the annual sales (total per year), rather than the total per quarter. Thus, the data can
be aggregated so that the resulting data summarize the total sales per year instead of per
quarter. This aggregation is illustrated in Figure 3.10. The resulting data set is smaller in
volume, without loss of information necessary for the analysis task. Data cubes store
multidimensional aggregated information. For example, Figure 3.11 shows a data cube
for multidimensional analysis of sales data with respect to annual sales per item type
for each AllElectronics branch. Each cell holds an aggregate data value, corresponding
to the data point in multidimensional space. (For readability, only some cell values are
shown.) Concept hierarchies may exist for each attribute, allowing the analysis of data
at multiple abstraction levels. For example, a hierarchy for branch could allow branches
to be grouped into regions, based on their address. Data cubes provide fast access to
precomputed, summarized data, thereby benefiting online analytical processing as well
as data mining.
The cube created at the lowest abstraction level is referred to as the base cuboid. The
base cuboid should correspond to an individual entity of interest such as sales or customer.
In other words, the lowest level should be usable, or useful for the analysis. A cube
at the highest level of abstraction is the apex cuboid. For the sales data in Figure 3.11,
the apex cuboid would give one total—the total sales for all three years, for all item
types, and for all branches. Data cubes created for varying levels of abstraction are often
referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids. Each
higher abstraction level further reduces the resulting data size. When replying to data
mining requests, the smallest available cuboid relevant to the given task should be used.
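
A small pandas sketch (invented figures, mirroring the AllElectronics example) that aggregates quarterly sales to annual totals and then to the single apex-cuboid grand total.

```python
import pandas as pd

quarterly = pd.DataFrame({               # hypothetical quarterly sales, 2008-2010
    "year": [2008] * 4 + [2009] * 4 + [2010] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales": [224, 408, 350, 586, 230, 410, 360, 590, 240, 420, 370, 600],
})

# Aggregate away the quarter dimension: total sales per year (a smaller cuboid)
annual = quarterly.groupby("year", as_index=False)["sales"].sum()

# The apex cuboid: one grand total over all years
apex = quarterly["sales"].sum()
print(annual, apex, sep="\n")
```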
23) Explain heuristic methods of attribute subset selection with an example and explain its
techniques. (6)
Attribute subset selection reduces the data set size by removing irrelevant or
redundant attributes (or dimensions). The goal of attribute subset selection is to find
a minimum set of attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained using all attributes.
Mining on a reduced set of attributes has an additional benefit: It reduces the number
of attributes appearing in the discovered patterns, helping to make the patterns easier to
understand.
Basic heuristic methods of attribute subset selection include the techniques that
follow, some of which are illustrated in Figure 3.6.

1. Stepwise forward selection: The procedure starts with an empty set of attributes as
the reduced set. The best of the original attributes is determined and added to the
reduced set. At each subsequent iteration or step, the best of the remaining original
attributes is added to the set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes.
At each step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each
step, the procedure selects the best attribute and removes the worst from among the
remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart-like
structure where each internal (nonleaf) node denotes a test on an attribute, each
branch corresponds to an outcome of the test, and each external (leaf) node denotes a
class prediction. At each node, the algorithm chooses the "best" attribute to partition
the data into individual classes.
When decision tree induction is used for attribute subset selection, a tree is constructed
from the given data. All attributes that do not appear in the tree are assumed
to be irrelevant. The set of attributes appearing in the tree form the reduced subset
of attributes.
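
An illustrative scikit-learn sketch of stepwise forward selection using SequentialFeatureSelector on the built-in iris data; the estimator and the number of attributes to keep are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Stepwise forward selection: start from an empty set and greedily add the best
# remaining attribute at each step until 2 attributes are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=2, direction="forward")
selector.fit(X, y)

print("selected attribute indices:", selector.get_support(indices=True))
# direction="backward" would give stepwise backward elimination instead.
```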
24) What is sampling? Explain the ways used to sample for data reduction?
Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random data sample (or subset). Suppose that a large
data set, D, contains N tuples. Let’s look at the most common ways that we could sample
D for data reduction, as illustrated in Figure 3.9.
Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from
D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a
tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it
may be drawn again.
Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS of s clusters can be
obtained, where s < M. For example, tuples in a database are usually retrieved a page at a time, so that each page can be
considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting
in a cluster sample of the tuples. Other clustering criteria conveying rich semantics can also be explored. For example,
in a spatial database, we may choose to define clusters geographically based on how closely different areas are located.
Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by
obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. For
example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age
group. In this way, the age group having the smallest number of customers
will be sure to be represented.
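
A pandas sketch, on synthetic customer data, of three of the sampling methods above: SRSWOR, SRSWR, and a stratified sample that draws an SRS within each age-group stratum.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
D = pd.DataFrame({                              # hypothetical customer data
    "age_group": rng.choice(["20-30", "31-40", "41-50"], size=300, p=[0.6, 0.3, 0.1]),
    "spend": rng.integers(10, 500, size=300),
})

# SRSWOR: simple random sample of s tuples without replacement
srswor = D.sample(n=30, replace=False, random_state=42)

# SRSWR: simple random sample with replacement (a tuple may be drawn again)
srswr = D.sample(n=30, replace=True, random_state=42)

# Stratified sample: an SRS within each age-group stratum (10% per stratum),
# so even the smallest group is represented.
stratified = D.groupby("age_group", group_keys=False).sample(frac=0.10, random_state=42)

print(len(srswor), len(srswr), stratified["age_group"].value_counts(), sep="\n")
```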
