Data Engineering - Session 02
• Session 01 – Theory
• Introduction to Enterprise Data, Data Engineering,
Modern Data Applications & Patterns, Data
Frameworks, Components & Best Practices
• Session 02 – Theory & Lab Demos
• Introduction to Data Stores: SQL, NoSQL, File Systems,
Data Lakes, Data Warehouses, Data Mesh, Cloud Data
Products, Lab Demos of select data stores
• Session 03 – Theory & Lab Demos
• Data Architecture Layers, Data Pipelines,
Transformation, Orchestration, Data Aggregation vs
Federation, Lab Demos of select Data Pipeline
Products
• Session 04 – Theory & In-Class Design
• Data Governance: Data Catalogs, Data Quality,
Lineage, Provenance, Data Security, Regulatory
Compliance, Real-World Application Data Design
• Tutorials
Introduction to Data Modeling
• In a retail scenario, instead of just defining Customer and Product entities,
a semantic data model would include relationships such as:
• "A Customer purchases a Product."
• "A Product belongs to a Category."
• "A Customer has a Loyalty Status."
• This allows for more nuanced queries, such as finding all customers who have
purchased products in a particular category with a certain loyalty status.
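The retail relationships above can be sketched in plain Python. This is a minimal illustration, not a modeling tool: the customer names, products, and the `customers_by_category_and_status` helper are hypothetical.

```python
# Hypothetical sample data; the relationship names mirror the bullets above.
purchases = [            # "A Customer purchases a Product."
    ("alice", "laptop"),
    ("bob", "novel"),
    ("carol", "headphones"),
]
belongs_to = {           # "A Product belongs to a Category."
    "laptop": "Electronics",
    "headphones": "Electronics",
    "novel": "Books",
}
loyalty_status = {       # "A Customer has a Loyalty Status."
    "alice": "Gold",
    "bob": "Gold",
    "carol": "Silver",
}


def customers_by_category_and_status(category, status):
    """Customers who bought a product in `category` and hold `status`."""
    return sorted({
        customer
        for customer, product in purchases
        if belongs_to[product] == category
        and loyalty_status[customer] == status
    })


print(customers_by_category_and_status("Electronics", "Gold"))
```

Because the relationships are stored explicitly, the query only has to follow them: purchase, then category, then loyalty status.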
• An Ontology is a formal, explicit specification of a shared conceptualization of a domain. It
represents knowledge as a set of concepts within a domain and the relationships between
those concepts. It uses a more structured and formal language (e.g., OWL - Web Ontology
Language) to define classes (concepts), properties (relationships), and instances (data
values). Ontologies are used to represent complex domains and enable systems to reason
about the data, making them foundational for the Semantic Web and Knowledge Graphs.
• Purpose
• Knowledge Representation Model: Ontology falls under knowledge representation
and is considered a form of semantic modeling that captures domain knowledge,
rules, and relationships.
• Formal Model: Unlike semantic data models, ontologies are formalized using logic-based
languages such as OWL and RDF (Resource Description Framework) and queried with SPARQL,
making them machine-readable and interpretable.
• Example:
• In an e-commerce ontology, you might define:
• Classes: Customer, Product, Order, Payment Method
• Properties: purchases, belongs_to_category, uses_payment_method
• Instances: John Doe (instance of Customer), "Smartphone" (instance of Product)
• This allows you to infer new knowledge, such as finding all customers who have purchased
electronics using a credit card, thanks to the relationships defined in the ontology.
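The e-commerce example can be approximated with subject-predicate-object triples in plain Python. This is only a sketch of the idea: a real ontology would be written in OWL/RDF and queried with SPARQL. The individuals other than John Doe and "Smartphone", the `is_a` predicate, and the helper functions are hypothetical, and `uses_payment_method` is attached directly to the customer as a simplification.

```python
# Hypothetical triples (subject, predicate, object) for an e-commerce ontology.
# Simplification: uses_payment_method is attached to the customer, not an Order.
triples = {
    ("john_doe", "is_a", "Customer"),
    ("jane_roe", "is_a", "Customer"),
    ("smartphone", "is_a", "Product"),
    ("novel", "is_a", "Product"),
    ("john_doe", "purchases", "smartphone"),
    ("jane_roe", "purchases", "novel"),
    ("smartphone", "belongs_to_category", "Electronics"),
    ("novel", "belongs_to_category", "Books"),
    ("john_doe", "uses_payment_method", "credit_card"),
    ("jane_roe", "uses_payment_method", "credit_card"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def customers_buying(category, payment_method):
    """Follow purchases -> belongs_to_category, then filter by payment method."""
    result = set()
    for s, p, o in triples:
        if p == "is_a" and o == "Customer":
            categories = {
                c
                for product in objects(s, "purchases")
                for c in objects(product, "belongs_to_category")
            }
            pays_with = objects(s, "uses_payment_method")
            if category in categories and payment_method in pays_with:
                result.add(s)
    return sorted(result)


print(customers_buying("Electronics", "credit_card"))
```

No triple states "John Doe bought electronics"; that fact is derived by chaining `purchases` and `belongs_to_category`.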
A quick example for Ontology Modeling
Artifact 03
Data Storage
A Deep Dive
Recap of Concepts
CAP Theorem
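The CAP Theorem, also known as Brewer's Theorem, is a fundamental principle in distributed database
systems. It states that a distributed database can guarantee at most two of three properties at the same time:
Consistency (C), Availability (A), and Partition Tolerance (P). Because network partitions cannot be ruled
out in practice, the real trade-off during a partition is between consistency and availability.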
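A toy simulation can make the trade-off concrete. This is a sketch under simplifying assumptions, not how any real database is implemented: the `Replica` class, the `write` function, and the "CP"/"AP" mode labels are hypothetical.

```python
class Replica:
    """One node holding a single value; hypothetical, for illustration only."""

    def __init__(self):
        self.value = "v0"


def write(replicas, partitioned, mode, new_value):
    """Write via the first replica; `mode` is "CP" or "AP".

    Returns True if the write was accepted.
    """
    if not partitioned:
        for replica in replicas:      # healthy network: update every copy
            replica.value = new_value
        return True
    if mode == "CP":
        return False                  # stay consistent by refusing the write
    replicas[0].value = new_value     # AP: accept locally, copies now diverge
    return True


cp = [Replica(), Replica()]
ap = [Replica(), Replica()]
print(write(cp, True, "CP", "v1"), [r.value for r in cp])
print(write(ap, True, "AP", "v1"), [r.value for r in ap])
```

During the partition, the CP system gives up availability (the write is rejected), while the AP system gives up consistency (the two replicas disagree).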
Key Concepts
Example 1: E-commerce Product Catalog
In an e-commerce application, the product catalog can vary greatly in structure. Some
products might have different attributes like sizes, colors, or specifications. With
MongoDB, you can store each product as a document in a collection.
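As a rough illustration, two product documents with different shapes might look like the following. The field names and values are assumptions made for this sketch, and plain Python dictionaries stand in for MongoDB documents.

```python
import json

# Hypothetical product documents; field names are illustrative assumptions.
products = [
    {"_id": 1, "name": "T-Shirt", "category": "Clothing",
     "sizes": ["S", "M", "L"], "colors": ["red", "blue"]},
    {"_id": 2, "name": "Smartphone", "category": "Electronics",
     "specifications": {"storage_gb": 128, "screen_inches": 6.1}},
]

# Each document carries its own set of fields; no shared table schema is needed.
for product in products:
    print(product["name"], sorted(product.keys()))

# Documents serialize naturally to JSON, close to how they would be stored.
print(json.dumps(products[1], indent=2))
```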
Querying Product Catalog in MongoDB
1. Find Products in a Specific Category
Objective: Find all products in the "Electronics" category.
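The category lookup boils down to an equality filter document, `{"category": "Electronics"}`. The sketch below applies that filter to in-memory dictionaries so it runs without a database; the sample catalog and the local `find` helper are hypothetical and handle only top-level equality.

```python
# Hypothetical catalog; in MongoDB these would be documents in a collection.
catalog = [
    {"name": "Smartphone", "category": "Electronics", "price": 699},
    {"name": "T-Shirt", "category": "Clothing", "sizes": ["S", "M"]},
    {"name": "Laptop", "category": "Electronics", "price": 1200},
]

# Equality filter document, the same shape MongoDB's find() accepts.
query = {"category": "Electronics"}


def find(documents, filter_doc):
    """Return documents whose fields equal every key/value in filter_doc.

    Only simple top-level equality is handled; operators such as $gt are not.
    """
    return [
        doc for doc in documents
        if all(key in doc and doc[key] == value
               for key, value in filter_doc.items())
    ]


for doc in find(catalog, query):
    print(doc["name"])
```

With a running MongoDB server and the PyMongo driver, the same filter would be passed to `collection.find({"category": "Electronics"})`.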
• In a social media application, user profiles can have diverse attributes. MongoDB’s
flexible schema allows for varying data types and structures.
NoSQL: Columnar Store
Columnar stores, or column-family databases, are databases that store data by columns rather than by rows, making them
highly efficient for analytical and read-heavy workloads. Unlike traditional row-based storage, where data is stored and read
row by row, columnar stores allow you to read only the necessary columns, leading to faster query performance,
especially for aggregation and analytics operations.
Key Features
• Column-based Storage: Data is stored column by column, enabling high compression rates and efficient data retrieval
for analytical queries.
• High Performance for Aggregation: Since only the relevant columns are read during a query, operations like
SUM, AVG, COUNT, and other aggregations are much faster.
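The difference between the two layouts can be sketched in a few lines of Python. The sales records and the `run_length_encode` helper are hypothetical; run-length encoding is used here only as one simple example of why values grouped by column compress well.

```python
# Hypothetical sales records, first in row layout, then in column layout.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]

# Column layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Row store: every full record is visited to sum one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column store: only the "amount" column is read.
total_from_columns = sum(columns["amount"])


def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


print(total_from_rows, total_from_columns)
print(run_length_encode(sorted(columns["region"])))
```

Both sums agree, but the columnar version never touches `order_id` or `region`, which is the source of the I/O savings on wide tables.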
IMP: Columnar stores like ClickHouse, Amazon Redshift, and Google BigQuery are essential for
modern data analytics, offering high performance, scalability, and efficient storage.
ClickHouse, in particular, stands out for its real-time analytics capabilities, making it an excellent
choice for log analysis, time-series data, and BI reporting. Its columnar architecture ensures that only
the necessary data is read, enabling rapid query performance and efficient data processing.
Columnar Store Use Cases
1. Data Warehousing and Business Intelligence (BI):
• Columnar databases excel in aggregating and analyzing large volumes of historical data, making them ideal for data warehousing and BI applications.
• Example: Analyzing sales data to generate monthly or quarterly reports.
2. Real-time Analytics:
• Suitable for processing and analyzing streaming data in real time, enabling insights into live data.
• Example: Monitoring website traffic or user activity in real time.
3. Time-Series Data Analysis:
• Efficiently handles time-series data where large amounts of data are collected over time, such as financial market data or IoT sensor readings.
• Example: Tracking stock prices, energy consumption, or weather patterns over time.
4. Big Data Processing:
• Ideal for big data applications where you need to process large datasets across distributed systems quickly.
• Example: Processing large volumes of social media data to derive insights into user behavior.
5. ETL (Extract, Transform, Load) Operations:
• Columnar databases are efficient for ETL workflows, where data needs to be extracted, transformed, and loaded into a data warehouse for further analysis.
• Example: Transforming and loading customer data from multiple sources into a central data warehouse.
ClickHouse: A Deep Dive
ClickHouse is an open-source, high-performance columnar database management system designed for real-
time analytics. Developed by Yandex, it’s optimized for handling large volumes of data quickly, making it one of
the fastest columnar stores available.
Features
• High-Performance Queries: ClickHouse is optimized for fast analytical queries, capable of processing billions of rows per second.
• Columnar Storage: Stores data by columns, enabling efficient compression and retrieval, particularly for read-heavy analytics workloads.
• Distributed and Scalable: Supports distributed deployment, allowing you to scale horizontally across multiple servers.
• SQL Support: Offers a SQL-like querying language, making it accessible for users familiar with SQL.
• Data Compression: Uses advanced compression techniques, reducing storage costs and improving I/O performance.
Example: Real-Time Web Traffic Analytics
A company wants to analyze real-time web traffic data to monitor website activity, such as page
views, unique visitors, and session duration.
Table Structure (Simplified Example)
Insert some data
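A runnable approximation of this scenario is sketched below. Python's built-in `sqlite3` module is used purely as a local stand-in so the example needs no server; it is not ClickHouse and is not columnar. The `page_views` table, its column names, and the sample events are assumptions made for illustration.

```python
import sqlite3

# sqlite3 is only a local stand-in; it is not ClickHouse and is row-based.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE page_views (
        event_time TEXT,
        visitor_id TEXT,
        page_url TEXT,
        session_duration_sec INTEGER
    )
    """
)

# Hypothetical sample events.
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01 10:00:00", "u1", "/home", 30),
        ("2024-01-01 10:00:05", "u2", "/home", 45),
        ("2024-01-01 10:01:00", "u1", "/pricing", 60),
    ],
)

# Page views, unique visitors, and average session duration per page.
report = conn.execute(
    """
    SELECT page_url,
           COUNT(*) AS page_views,
           COUNT(DISTINCT visitor_id) AS unique_visitors,
           AVG(session_duration_sec) AS avg_duration
    FROM page_views
    GROUP BY page_url
    ORDER BY page_url
    """
).fetchall()

for row in report:
    print(row)
```

In ClickHouse itself, the table definition would additionally name a table engine (for example MergeTree) together with an ORDER BY key, and the aggregation would read only the columns it references.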
NoSQL: Key-Value Store
• Simplicity: The data model is straightforward, consisting of keys and their corresponding
values, which makes it easy to use and implement.
• High Performance: Optimized for quick read and write operations due to their simple data
structure.
• Scalability: Can scale horizontally by distributing data across multiple nodes, making them
highly scalable and capable of handling large amounts of data.
• Flexibility: Values can store various data types, allowing for different data structures.
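The properties listed above can be sketched with a toy in-memory store. The `ShardedKeyValueStore` class and the key names are hypothetical, and the simple modulo placement is an assumption chosen for brevity; production systems typically use schemes such as consistent hashing so that adding a node moves fewer keys.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across in-memory 'nodes'."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash decides which node owns the key.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


store = ShardedKeyValueStore(node_count=3)
store.put("user:42:name", "Ada")                 # string value
store.put("user:42:cart", ["book", "lamp"])      # list value
store.put("user:42:prefs", {"theme": "dark"})    # nested mapping value

print(store.get("user:42:name"))
print(store.get("user:42:cart"))
print(store.get("missing", "n/a"))
```

Every operation is a single lookup by key, which is why reads and writes stay fast, and the hash-based placement is what lets the data spread across nodes.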
Session Management
• Ideal for storing user session data in web applications, allowing quick access and updates.
• Example: Storing user login sessions in Redis for an e-commerce site.
Real-time Analytics
• Suitable for applications requiring high-speed data access and updates, like tracking user activities in real time.
• Example: Using Amazon DynamoDB to track user actions in a gaming application.
Leaderboards and Gaming
• Excellent for maintaining real-time leaderboards in online games.
• Example: Using Redis to maintain a live ranking of players based on their scores.
IoT Data Storage
• Can efficiently handle data generated by IoT devices, such as sensor readings.
• Example: Using DynamoDB to store sensor data from connected devices in a smart home system.
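The leaderboard use case can be sketched without a server. The `Leaderboard` class below is a hypothetical in-memory stand-in for a Redis sorted set, and the player names and scores are made up for illustration.

```python
class Leaderboard:
    """Toy in-memory leaderboard; a stand-in for a Redis sorted set."""

    def __init__(self):
        self.scores = {}

    def add_score(self, player, points):
        # Increment the player's score, creating the entry if needed.
        self.scores[player] = self.scores.get(player, 0) + points

    def top(self, n):
        # Highest score first; ties broken by player name for stable output.
        ranked = sorted(self.scores.items(),
                        key=lambda item: (-item[1], item[0]))
        return ranked[:n]


board = Leaderboard()
board.add_score("ana", 50)
board.add_score("raj", 70)
board.add_score("ana", 30)
board.add_score("liu", 60)

print(board.top(2))
```

In Redis, the sorted-set commands ZINCRBY and ZREVRANGE play roughly the roles of `add_score` and `top`, with the ranking maintained by the server rather than re-sorted on each read.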
Q&A