Data Warehouse Interview Questions:
What is a Data Warehouse?
A data warehouse is a centralized repository that stores large volumes of structured and
semi-structured data from multiple sources. It is similar to a relational database but is
designed specifically for query and analysis rather than transaction processing, making it a
key component in business intelligence and analytics.
Data warehouses facilitate the use of BI tools, enabling users to create reports, dashboards,
and visualizations for better decision-making.
What is the basic difference between a Data Warehouse and an Operational Database?
Data Warehouse:
1. Contains historical information, which helps in analysing business metrics.
2. A DW is mainly used to read data.
3. End users are business analysts/data analysts.
4. Used for analytical processing and reporting (OLAP).
Operational Database:
1. Contains current information that is required to run the business.
2. Database is mainly used to write the data.
3. End users are operational team members.
4. Used for transactional processing (OLTP).
What is Data Warehousing?
Data Warehousing is the act of organising and storing data in a way that makes its retrieval
efficient and insightful.
It is also called the process of transforming data into information.
What is OLAP?
OLAP (Online Analytical Processing) is a flexible way to perform complex analysis of
multidimensional data.
Data present in a data warehouse is accessed by running OLAP queries, whereas operational
databases are queried through OLTP operations.
OLAP activities are performed by converting the multidimensional data in a data warehouse
into an OLAP cube.
What is OLTP? How is OLAP different from OLTP?
OLTP (Online Transaction Processing) is a class of systems designed to manage and execute
day-to-day transactional operations efficiently. It is commonly used in database
environments that handle high volumes of short, real-time transactions, such as sales,
orders, payments, or other business activities.
Examples:
1. Banking systems for processing payments or withdrawals.
2. Airline reservation systems for booking tickets.
Any system that is absolutely critical for running the business can be categorised as OLTP.
Whereas any system that is used for analysing how the business is running can be
categorised as OLAP.
Characteristics of OLTP:
- Transactional Systems: OLTP systems are designed to handle frequent, small
transactions like inserting, updating, or deleting records.
- High Volume: OLTP systems process a large number of transactions per second,
ensuring fast, real-time updates.
- Data Integrity: OLTP systems prioritize maintaining the accuracy and integrity of data
across multiple users and transactions. This is often achieved through ACID
properties (Atomicity, Consistency, Isolation, Durability).
- Normalized Database: OLTP systems use normalized databases to minimize
redundancy, ensuring data consistency and quick updates.
The key differences between OLTP and OLAP:
- Purpose: OLTP is designed for day-to-day transaction processing, while OLAP is used
for complex queries and data analysis.
- Data Structure: OLTP systems use highly normalized databases (often in 3rd normal
form) to ensure data consistency and minimize redundancy. OLAP systems use
denormalized structures (e.g., star or snowflake schemas) to optimize query
performance.
- Data Volume: OLTP deals with smaller volumes of data from frequent, short
transactions. OLAP handles large volumes of aggregated, historical data for analytical
purposes.
- Query Type: OLTP queries are simple and involve operations like insert, update, and
delete. OLAP queries are complex, involving aggregations, multidimensional analysis,
and often long-running queries.
- Response Time: OLTP systems are optimized for fast response times (milliseconds) to
handle a high volume of transactions. OLAP queries may take longer (seconds to
minutes) as they involve complex computations.
- Users: OLTP supports a large number of concurrent users, typically performing
transactions. OLAP is designed for fewer users, usually analysts or decision-makers,
who perform in-depth data analysis.
- Examples: OLTP systems are used in banking, e-commerce, and retail for managing
day-to-day transactions. OLAP systems are used in business intelligence, financial
reporting, and data warehousing for decision support and trend analysis.
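The contrast above can be sketched with Python's built-in sqlite3 module standing in for both kinds of system; the table and column names here are purely illustrative, not from any real schema.

```python
import sqlite3

# In-memory database with a hypothetical orders table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# OLTP-style work: many small writes, each a short transaction.
with conn:
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("North", 120.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("South", 80.0))
    conn.execute("INSERT INTO orders (region, amount) VALUES (?, ?)", ("North", 50.0))

# OLAP-style work: a read-heavy aggregation across many rows.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('North', 170.0), ('South', 80.0)]
```

In a real deployment the writes would hit a normalized OLTP database and the GROUP BY would run against a denormalized warehouse; here one toy database plays both roles.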
What is a Dimension Table?
- A Dimension Table is a table in a data warehouse that contains descriptive attributes
(or fields) that describe the objects in a fact table. It helps provide context for the
measures or facts in the fact table by offering a more detailed description of the
entities involved in the data. Dimension tables are used in Online Analytical
Processing (OLAP)
What is a Fact Table?
- A Fact Table is the central table in a dimensional model that stores the quantitative
measures of a business process (e.g., sales amount, quantity sold).
- Facts are analysed by summing, averaging, or otherwise aggregating them across the
related dimensions.
- A Fact table contains 2 kinds of columns – dimension keys (foreign keys to dimension
tables) and measures.
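A minimal star-schema sketch, again using sqlite3 with made-up table and column names: the dimension table holds descriptive attributes, the fact table holds foreign keys plus measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table: descriptive attributes keyed by a surrogate key.
conn.execute("""CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name TEXT, category TEXT)""")
# Fact table: foreign keys to dimensions plus numeric measures.
conn.execute("""CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    sale_date TEXT,
    quantity INTEGER,
    revenue REAL)""")
with conn:
    conn.execute("INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware')")
    conn.execute("INSERT INTO fact_sales VALUES (1, '2024-01-05', 3, 30.0)")
    conn.execute("INSERT INTO fact_sales VALUES (1, '2024-01-06', 1, 10.0)")

# Join fact to dimension, aggregating the measure by a descriptive attribute.
row = conn.execute("""
    SELECT d.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product d USING (product_key)
    GROUP BY d.category""").fetchone()
print(row)  # ('Hardware', 40.0)
```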
What is the level of Granularity of a fact table?
- The level of granularity of a fact table refers to the level of detail or specificity at
which the data in the fact table is stored. It defines what each record or row in the
fact table represents. Granularity is a critical design decision in a data warehouse
because it determines the scope of the data that can be analysed and impacts both
the size of the fact table and the detail of the queries that can be run against it.
- The depth of detail in the data is known as granularity.
- A Fact table is usually designed at a fine (detailed) level of granularity, so that data
can be aggregated upward as needed.
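A small illustration of why grain matters, using invented sales tuples: rolling a daily grain up to a monthly grain shrinks the table, but day-level questions can no longer be answered from it.

```python
from collections import defaultdict

# Hypothetical sales rows: the finest available grain is one row per
# (day, store, product) combination.
daily_grain = [
    ("2024-01-01", "S1", "P1", 100),
    ("2024-01-02", "S1", "P1", 150),
    ("2024-01-15", "S1", "P1", 200),
]

# A coarser grain (one row per month/store/product) is smaller, but the
# daily detail is irreversibly lost.
monthly_grain = defaultdict(int)
for day, store, product, amount in daily_grain:
    month = day[:7]  # "YYYY-MM"
    monthly_grain[(month, store, product)] += amount

print(dict(monthly_grain))  # {('2024-01', 'S1', 'P1'): 450}
```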
What is the difference between Additive, Semi–additive and Non–additive facts?
- Additive Facts:
- Definition: Additive facts are facts that can be summed (or aggregated) across all
dimensions in the fact table. This is the most common type of fact in data
warehouses, as it can be used in a wide variety of analyses.
- Aggregation: These facts can be aggregated (using SUM, AVG, MIN, MAX, etc.) across
any dimension (e.g., time, location, product, customer).
- Semi-Additive Facts:
- Definition: Semi-additive facts can only be aggregated across some dimensions but
not all. A common limitation is with the time dimension, where aggregation (like
summing) doesn't make sense.
- Aggregation: These facts can be summed or aggregated across some dimensions but
not others (especially time). For example, you can sum semi-additive facts across
product or store, but not over time.
- Non-Additive Facts:
- Definition: Non-additive facts are facts that cannot be summed or aggregated
meaningfully across any dimension. They typically involve metrics like ratios or
percentages, where summing across dimensions would lead to incorrect results.
- Aggregation: These facts do not allow for any meaningful aggregation. You may need
special calculations (like averages or weighted averages) to analyse them across
dimensions.
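The three fact types above can be shown with a toy bank-account example (all numbers invented): deposits are additive, the balance is semi-additive over time, and a ratio built from them is non-additive.

```python
# Illustrative daily snapshots for one bank account (hypothetical numbers).
snapshots = [
    {"date": "2024-01-01", "deposits": 100.0, "balance": 100.0},
    {"date": "2024-01-02", "deposits": 50.0,  "balance": 150.0},
    {"date": "2024-01-03", "deposits": 0.0,   "balance": 150.0},
]

# Additive fact: deposits can be summed across time (and any other dimension).
total_deposits = sum(s["deposits"] for s in snapshots)

# Semi-additive fact: balances must NOT be summed over time; take the
# period-end (or average) balance instead.
end_balance = snapshots[-1]["balance"]

# Non-additive fact: a ratio such as deposits/balance cannot be aggregated
# by summing per-row ratios; recompute it from its additive parts.
overall_ratio = total_deposits / end_balance

print(total_deposits, end_balance, overall_ratio)  # 150.0 150.0 1.0
```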
What are Conformed dimensions and Conformed facts?
- Conformed Dimensions:
A conformed dimension is a dimension that is shared across multiple fact tables or
data marts within a data warehouse. It is used consistently across different areas of
the business, allowing for unified reporting and analysis across these different fact
tables. The data in a conformed dimension is consistent, meaning the same
definition, structure, and values are used across the warehouse, ensuring that
different parts of the organization are using the same information when analyzing
data.
- Conformed Facts:
A conformed fact refers to facts (measures) that are used consistently across
multiple fact tables or data marts, with the same meaning and calculation method.
This ensures that metrics like sales revenue or profit are calculated the same way
across different reports or analyses, promoting accuracy and comparability in
reporting.
What are Aggregate tables?
- Aggregate tables are specialized tables in a data warehouse that store pre-
calculated, summarized data. They are created to improve query performance by
reducing the amount of data that needs to be processed for certain types of queries,
especially when large volumes of detailed transactional data are involved. Instead of
querying the detailed fact tables, users can query the aggregate tables for faster
response times, especially when working with reports that don't require granular-
level detail.
- This table reduces the load in the database server and increases the performance of
the query.
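A sketch of how an aggregate table is built, assuming an invented `fact_sales` detail table: the summary is pre-computed once with GROUP BY, and later reports query the small summary instead of scanning the detail rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, store TEXT, amount REAL)")
with conn:
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?, ?)",
        [("2024-01-01", "S1", 10.0), ("2024-01-01", "S1", 20.0),
         ("2024-01-02", "S1", 5.0)],
    )
    # Pre-compute a daily summary so reports need not scan the detail rows.
    conn.execute("""
        CREATE TABLE agg_daily_sales AS
        SELECT sale_date, store, SUM(amount) AS total_amount
        FROM fact_sales GROUP BY sale_date, store""")

rows = conn.execute(
    "SELECT sale_date, total_amount FROM agg_daily_sales ORDER BY sale_date"
).fetchall()
print(rows)  # [('2024-01-01', 30.0), ('2024-01-02', 5.0)]
```

In production the aggregate would be refreshed on a schedule (or maintained as a materialized view) so it stays in step with the detail table.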
What is Summary information?
- Summary Information is the area in the Data Warehouse where pre-defined
aggregations are kept.
- Can be stored in the form of tables or can be kept in the reporting layer such as
Tableau, Business Objects.
What is ETL?
- ETL stands for Extract - Transform - Load.
- It is the process of using the software to extract the desired data from various
sources, then transform that data by using rules and lookup tables to meet your
requirement and then loading it into a target data warehouse.
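A minimal ETL sketch in plain Python, with a CSV string standing in for the source system and sqlite3 standing in for the target warehouse; the column names and the validation rule are assumptions for illustration.

```python
import csv, io, sqlite3

# Extract: read raw records from a source (a CSV string stands in for a file).
raw = "id,amount\n1, 10.5 \n2,not_a_number\n3,4.0\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim whitespace, coerce types, drop rows that fail validation.
clean = []
for r in rows:
    try:
        clean.append((int(r["id"]), float(r["amount"].strip())))
    except ValueError:
        continue  # reject bad records (a real pipeline would log them)

# Load: write the cleaned rows into the target warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
with conn:
    conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)

result = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(result)  # (2, 14.5)
```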
What are tools available for ETL?
- Informatica PowerCenter
- Talend Studio
- DataStage
- Oracle Warehouse Builder
- Ab Initio
- Data Junction
- SQL Server Integration Services (SSIS)
- SAP Data Services
- Data Migrator (IBI)
- IBM Infosphere Information Server
- Elixir Repertoire for Data ETL
- SAS Data Management
What is Metadata?
- Metadata is data about data.
- Metadata in a DWH defines the source data, i.e. flat files, relational databases and
other objects.
- Metadata is used to define which tables are sources and targets, and which
transformations (the business logic) are applied to produce the final output.
What is Data Mining?
How is it different from data warehousing?
- Data mining is the process of analysing data across different dimensions and summarising
it into useful information. Data is searched, retrieved and analysed from a data
warehouse (or other data store) to answer business questions.
- Data Warehousing is about storing analytical data in a structure suitable for data
mining. This analytical data is extracted from operational systems, usually on a daily
basis.
List the types of OLAP servers?
- MOLAP (Multidimensional OLAP): MOLAP stores data in a multidimensional cube
format, which allows for fast data retrieval. It pre-aggregates and pre-calculates data
in the form of cubes, enabling efficient querying and analysis.
- ROLAP (Relational OLAP): ROLAP stores data in relational databases (RDBMS) and
dynamically generates SQL queries to retrieve data. It doesn't use pre-aggregated
data, so queries take longer than in MOLAP but provide greater flexibility.
- HOLAP (Hybrid OLAP): HOLAP combines the features of both MOLAP and ROLAP. It
uses pre-calculated data cubes for quick access (like MOLAP) and dynamically
queries relational databases for detailed data (like ROLAP), providing a balance
between performance and scalability.
Which one is faster, Multidimensional OLAP or Relational OLAP?
MOLAP is generally faster than ROLAP, because MOLAP data is pre-aggregated and stored in an
optimized multidimensional cube (often held in memory), so queries read pre-computed results
instead of generating SQL against relational tables.
What are the operations that can be performed by an OLAP cube? Explain each.
- Drill-down: The drill-down operation allows users to navigate from summarized
(high-level) data to more detailed (lower-level) data. It increases the level of
granularity in the cube by moving deeper into the hierarchy.
- Roll-up: The drill-up (or roll-up) operation is the opposite of drill-down. It
summarizes detailed data into higher levels of aggregation by reducing granularity.
- Slice: The slice operation selects a specific subset of data from the OLAP cube by
fixing a value for one of the dimensions. It effectively reduces the dimensionality of
the data.
- Dice: The dice operation selects a sub-cube by choosing specific ranges of values
from multiple dimensions. It's similar to the slice operation but applies to multiple
dimensions simultaneously.
- Pivot (Rotate): The pivot operation changes the dimensional orientation of the data
to provide a different view of it. It involves rotating the cube to see data from
different perspectives by swapping rows and columns.
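These operations can be imitated on a tiny cube modelled as a Python dict keyed by (year, region, product); the data and dimension names are invented for illustration, and pivot is omitted since it only reorients the display.

```python
from collections import defaultdict

# A tiny cube as (year, region, product) -> sales (illustrative data).
cube = {
    (2023, "North", "P1"): 100, (2023, "South", "P1"): 80,
    (2024, "North", "P1"): 120, (2024, "North", "P2"): 60,
}

# Roll-up: aggregate away the product dimension (reduce granularity).
rollup = defaultdict(int)
for (year, region, product), sales in cube.items():
    rollup[(year, region)] += sales

# Slice: fix one dimension (year = 2024), dropping it from the result.
slice_2024 = {(r, p): s for (y, r, p), s in cube.items() if y == 2024}

# Dice: keep a sub-cube over chosen values of several dimensions at once.
dice = {k: s for k, s in cube.items()
        if k[0] in (2023, 2024) and k[1] == "North"}

print(dict(rollup))  # {(2023, 'North'): 100, (2023, 'South'): 80, (2024, 'North'): 180}
print(slice_2024)    # {('North', 'P1'): 120, ('North', 'P2'): 60}
print(len(dice))     # 3
```

Drill-down is the inverse of the roll-up shown: it would go back from the (year, region) totals to the full (year, region, product) detail.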
What is normalization? What is the benefit of Normalisation?
- Normalization is a database design process that organizes data in a way that reduces
redundancy and dependency by dividing large tables into smaller, related tables. This
process helps in establishing relationships between the tables through the use of
foreign keys. The main objective of normalization is to eliminate data anomalies and
ensure data integrity.
- Benefits of Normalization:
1. Reduced Data Redundancy: Normalization minimizes the amount of
duplicated data within the database, leading to efficient storage.
2. Improved Data Integrity: By enforcing rules for data relationships and
dependencies, normalization helps maintain data accuracy and consistency.
3. Efficient Data Retrieval: Normalized databases can result in faster queries, as
the data is structured logically and can be easily accessed through
relationships.
4. Simplified Database Maintenance: With reduced redundancy and clear
relationships, database maintenance tasks (like updates and deletions)
become more manageable.
5. Flexibility for Changes: Normalized structures allow for easier modification
of the database schema. Adding new data or adjusting relationships can be
done without significant changes to the overall design.
6. Enhanced Security: By separating sensitive data into distinct tables and
controlling access to those tables, normalization can enhance data security.
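A small normalization sketch with invented customer/order tables: instead of repeating the customer's city on every order row, the city lives once in a customers table, so an address change is a single-row update with no risk of inconsistency.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized design: customers and orders split into two tables,
# linked by a foreign key, so the city is stored exactly once.
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name TEXT, city TEXT);
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount REAL);
    INSERT INTO customers VALUES (1, 'Alice', 'Pune');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0);
""")

# A city change is now a single-row update, not an update to every order.
with conn:
    conn.execute("UPDATE customers SET city = 'Mumbai' WHERE customer_id = 1")

row = conn.execute("""
    SELECT c.city, SUM(o.amount)
    FROM orders o JOIN customers c USING (customer_id)
    GROUP BY c.city""").fetchone()
print(row)  # ('Mumbai', 65.0)
```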