Data Engineering - Session 02

Course Curriculum

• Session 01 – Theory
• Introduction to Enterprise Data, Data Engineering,
Modern Data Applications & Patterns, Data
Frameworks, Components & Best Practices
• Session 02 – Theory & Lab Demos
• Introduction to Data stores: SQL, NoSQL, File Systems,
Data Lakes, Data Warehouses, Data Mesh, Cloud Data
Products, Lab Demos of select data stores
• Session 03 – Theory & Lab Demos
• Data Architecture Layers, Data Pipelines,
Transformation, Orchestration, Data Aggregation vs
Federation, Lab Demos of select Data Pipeline
Products
• Session 04 – Theory & In-Class Design
• Data Governance: Data Catalogs, Data Quality,
Lineage, Provenance, Data Security, Regulatory
Compliance, Real-World Application Data Design
• Tutorials
Introduction to Data Modeling

Data Modeling is the process of creating a visual representation or blueprint of how data will be stored, organized, and managed in a database or data system. It involves defining data elements, structures, and relationships, allowing organizations to understand and organize their data effectively for various applications, such as data storage, retrieval, and analytics.
Traditional Data Modeling Methods

Conceptual Data Model: Provides a high-level view of the data and focuses on defining the entities, their attributes, and relationships. It answers the "what" without worrying about how data is stored or processed. Used during the initial design phase to outline the scope and structure of the data.
Ex: An e-commerce data model might include entities like Customer, Product, Order, and Payment, with relationships such as "Customer places an Order" and "Order contains Product."

Logical Data Model: A more detailed version of the conceptual model that defines how data should be organized, including attributes, data types, and relationships, without considering the actual database implementation. Provides a detailed understanding of the data requirements and is often used as a bridge between conceptual and physical data models.
Ex: For a "Customer" entity, the logical model might specify attributes such as Customer_ID, Name, Email, and Phone_Number, defining relationships and cardinality (e.g., one customer can place multiple orders).

Physical Data Model: Represents the actual implementation of the database, including table structures, columns, data types, indexes, primary keys, foreign keys, and storage details. Guides database administrators and engineers in building the actual database structure.
Ex: In a relational database, the "Customer" table might be implemented with columns such as customer_id (integer, primary key), name (varchar), email (varchar), and phone_number (varchar), with details on indexing and constraints.
A quick example for Conceptual-Logical-Physical Modeling
• Artifact 02
Advanced Data Modeling Techniques
• Semantic Data Modeling: meaning representation
• Ontology-based Modeling: knowledge representation
Semantic Data Modeling
• The Semantic Data Model is a high-level data model that defines the meaning (semantics) of data and the relationships between different data elements. It goes beyond traditional data models by incorporating the context and meaning of the data, ensuring that data relationships and dependencies are explicitly defined and understood. It is often represented using entity-relationship diagrams (ERDs), enriched with more semantic details about the data.
• Purpose
• It operates at a conceptual level, focusing on the meaning of data rather than its physical storage.
• Contextual Relationships: Data is modeled with an emphasis on relationships, hierarchies, and constraints, offering a more expressive way of understanding data compared to purely structural models.
• Example:
• In a retail scenario, instead of just defining Customer and Product entities, a semantic data model would include relationships such as:
• "A Customer purchases a Product."
• "A Product belongs to a Category."
• "A Customer has a Loyalty Status."
• This allows for more nuanced queries, such as finding all customers who have purchased products in a particular category with a certain loyalty status.
Ontology Modeling
• An Ontology is a formal, explicit specification of a shared conceptualization of a domain. It represents knowledge as a set of concepts within a domain and the relationships between those concepts. It uses a more structured and formal language (e.g., OWL - Web Ontology Language) to define classes (concepts), properties (relationships), and instances (data values). Ontologies are used to represent complex domains and enable systems to reason about the data, making them foundational for the Semantic Web and Knowledge Graphs.
• Purpose
• Knowledge Representation Model: Ontology falls under knowledge representation and is considered a form of semantic modeling that captures domain knowledge, rules, and relationships.
• Formal Model: Unlike semantic data models, ontologies are formalized using logic-based languages like OWL, RDF (Resource Description Framework), and SPARQL, making them machine-readable and interpretable.
• Example:
• In an e-commerce ontology, you might define:
• Classes: Customer, Product, Order, Payment Method
• Properties: purchases, belongs_to_category, uses_payment_method
• Instances: John Doe (instance of Customer), "Smartphone" (instance of Product)
• This allows you to infer new knowledge, such as finding all customers who have purchased electronics using a credit card, thanks to the relationships defined in the ontology.
A quick example for Ontology Modeling
Artifact 03
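To make this concrete, here is an illustrative sketch (not part of the deck) of the e-commerce ontology above in Python with rdflib, including a SPARQL query over the relationships. The namespace and instance names are assumptions; full reasoning would additionally use OWL tooling.

from rdflib import RDF, RDFS, Graph, Literal, Namespace

# Hypothetical namespace for the e-commerce ontology sketched above.
SHOP = Namespace("http://example.org/shop#")
g = Graph()
g.bind("shop", SHOP)

# Classes (concepts)
g.add((SHOP.Customer, RDF.type, RDFS.Class))
g.add((SHOP.Product, RDF.type, RDFS.Class))

# Properties (relationships)
g.add((SHOP.purchases, RDF.type, RDF.Property))
g.add((SHOP.belongs_to_category, RDF.type, RDF.Property))

# Instances (data values)
g.add((SHOP.john_doe, RDF.type, SHOP.Customer))
g.add((SHOP.smartphone, RDF.type, SHOP.Product))
g.add((SHOP.smartphone, SHOP.belongs_to_category, Literal("Electronics")))
g.add((SHOP.john_doe, SHOP.purchases, SHOP.smartphone))

# SPARQL query: which customers purchased a product in the Electronics category?
query = """
SELECT ?customer WHERE {
    ?customer shop:purchases ?product .
    ?product shop:belongs_to_category "Electronics" .
}
"""
for row in g.query(query, initNs={"shop": SHOP}):
    print(row.customer)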
Data Storage
A Deep Dive
Recap of Concepts
CAP Theorem
The CAP Theorem, also known as Brewer's Theorem, is a fundamental principle in distributed database
systems. It states that a distributed database can achieve only two out of three properties at any given time:
Consistency (C), Availability (A), and Partition Tolerance (P).

Key Takeaway: The CAP theorem helps in understanding the trade-offs in distributed database systems, guiding the choice of database design based on whether consistency, availability, or partition tolerance is the priority for a specific use case.
Implication of the CAP Theorem and Use Cases
The CAP Theorem states that in a distributed database system, you can achieve only two of the three properties: Consistency (C), Availability (A), and Partition Tolerance (P). Let's see how this works with specific database examples and use cases.
ACID Properties

The ACID properties define a set of principles that ensure reliable, consistent, and error-free transactions in database systems, particularly in SQL (relational) databases. These properties are crucial for maintaining data integrity, especially in OLTP (Online Transaction Processing) systems.

Key Takeaway: ACID properties are essential for ensuring data accuracy and reliability in relational databases, making them the preferred choice for applications that require strong consistency, such as OLTP systems.
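As an illustration (not part of the original deck), here is a minimal sketch of atomicity, the "A" in ACID, using Python's built-in sqlite3 module: either both transfer updates commit, or neither does.

import sqlite3

# Minimal sketch of an atomic transfer between two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

try:
    with conn:  # the connection is a transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on any failure both updates are rolled back, keeping the data consistent

print(conn.execute("SELECT id, balance FROM accounts").fetchall())
# Expected output: [(1, 70), (2, 80)]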
OLTP and OLAP Database Patterns
Traditionally, data architectures had a clear segregation of purpose: OLTP (Online Transaction Processing) data stores, typically used for transactional needs, and OLAP (Online Analytic Processing) data stores, typically used for reporting and analytical needs. The following table elaborates on the typical differences.
Top OLTP Databases
Top OLAP Databases
NoSQL (Not only SQL)
NoSQL is a category of non-relational databases designed to handle a wide variety of data models, including unstructured, semi-structured, and structured data. Unlike traditional SQL databases, which use fixed schemas and tabular structures, NoSQL databases offer more flexibility, scalability, and the ability to handle big data and real-time applications.
Purpose
With the explosion of data generated by web applications, IoT devices, social media, and big data analytics, traditional SQL databases struggled to handle these modern data requirements. NoSQL databases emerged to address the following needs:
• Scalability: NoSQL databases can scale horizontally across multiple servers, making them ideal for handling large volumes of data and high-traffic workloads.
• Flexibility: They support flexible schema designs, allowing data structures to evolve over time without requiring costly database migrations.
• Handling Big Data: Suitable for handling diverse data types, such as text, images, videos, JSON, XML, and binary data.
• High Performance: Designed to manage high read/write speeds, making them suitable for real-time applications.
Types of NoSQL
NoSQL: Document Store
Document stores, also known as document-oriented databases, are a type of NoSQL database that store data in a semi-structured format, typically using JSON, BSON (Binary JSON), or XML documents. Each document can have a flexible schema, meaning different documents can have different fields, data types, and structures, making document stores highly adaptable and ideal for handling dynamic or evolving data.
Key Features of Document Stores
• Flexible Schema: Unlike traditional relational databases, document stores allow for a flexible schema, enabling you to store varied data structures within the same collection.
• Hierarchical Data Representation: Documents can contain nested structures, arrays, and sub-documents, making them suitable for complex, hierarchical data.
• High Scalability: Document stores can scale horizontally across multiple servers, handling large volumes of data and high-traffic workloads efficiently.
• Rich Query Capabilities: They offer powerful querying options, allowing for searches, filtering, and aggregation on data within the documents.
MongoDB: A Deep Dive
MongoDB is a widely used document-oriented NoSQL database that stores data in a flexible, JSON-like format called BSON (Binary JSON). It's designed to handle unstructured and semi-structured data efficiently, making it suitable for applications that require flexibility, scalability, and high performance.

Key Concepts

Example 1: E-commerce Product Catalog
In an e-commerce application, the product catalog can vary greatly in structure. Some products might have different attributes, like sizes, colors, or specifications. With MongoDB, you can store each product as a document in a collection.
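For illustration, a minimal pymongo sketch of such a catalog; the connection string, database, and field names are hypothetical:

from pymongo import MongoClient

# Hypothetical sketch: flexibly-structured products in one MongoDB collection.
client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Documents in the same collection can have different shapes.
products.insert_many([
    {"name": "Smartphone", "category": "Electronics", "price": 699,
     "specs": {"storage_gb": 128, "color": "black"}},
    {"name": "T-Shirt", "category": "Apparel", "price": 19,
     "sizes": ["S", "M", "L"], "colors": ["red", "blue"]},
])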
Querying the Product Catalog in MongoDB
1. Find Products in a Specific Category
Objective: Find all products in the "Electronics" category.
2. Find Products Within a Price Range
Objective: Retrieve all products in the "Electronics" category that cost between $500 and $800.
3. Retrieve Desired Products
A sketch of the first two queries is shown below.
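Continuing the hypothetical pymongo sketch above:

from pymongo import MongoClient

products = MongoClient("mongodb://localhost:27017")["shop"]["products"]

# 1. All products in the "Electronics" category.
for doc in products.find({"category": "Electronics"}):
    print(doc["name"])

# 2. Electronics priced between $500 and $800, inclusive.
price_filter = {"category": "Electronics", "price": {"$gte": 500, "$lte": 800}}
for doc in products.find(price_filter):
    print(doc["name"], doc["price"])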


Example 2: User Profiles in a Social Media Platform
In a social media application, user profiles can have diverse attributes. MongoDB's flexible schema allows for varying data types and structures:
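A minimal sketch of such profiles (the field names are hypothetical, not from the deck):

from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017")["social"]["users"]

# Two profiles with different shapes can live in the same collection.
users.insert_many([
    {"username": "alice", "interests": ["hiking", "photography"],
     "followers": 1200},
    {"username": "bob", "bio": "Coffee enthusiast",
     "location": {"city": "Pune", "country": "India"}},
])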
NoSQL: Columnar Store
Columnar stores, or column-family databases, are databases that store data by columns rather than by rows, making them highly efficient for analytical and read-heavy workloads. Unlike traditional row-based storage, where data is stored and read row by row, columnar stores allow you to read only the necessary columns, leading to faster query performance, especially for aggregation and analytics operations.
Key Features
• Column-based Storage: Data is stored column by column, enabling high compression rates and efficient data retrieval for analytical queries.
• High Performance for Aggregation: Since only the relevant columns are read during a query, operations like SUM, AVG, COUNT, and other aggregations are much faster.
• Efficient Data Compression: Storing data by columns allows for better compression, reducing storage costs and improving I/O performance.
• Scalability: Designed to handle large volumes of data, making them suitable for big data analytics and data warehousing.

IMP: Columnar stores like ClickHouse, Amazon Redshift, and Google BigQuery are essential for modern data analytics, offering high performance, scalability, and efficient storage. ClickHouse, in particular, stands out for its real-time analytics capabilities, making it an excellent choice for log analysis, time-series data, and BI reporting. Its columnar architecture ensures that only the necessary data is read, enabling rapid query performance and efficient data processing.
Columnar Store Use Cases
1. Data Warehousing and Business Intelligence (BI): Columnar databases excel in aggregating and analyzing large volumes of historical data, making them ideal for data warehousing and BI applications. Example: Analyzing sales data to generate monthly or quarterly reports.
2. Real-time Analytics: Suitable for processing and analyzing streaming data in real time, enabling insights into live data. Example: Monitoring website traffic or user activity in real time.
3. Time-Series Data Analysis: Efficiently handles time-series data where large amounts of data are collected over time, such as financial market data or IoT sensor readings. Example: Tracking stock prices, energy consumption, or weather patterns over time.
4. Big Data Processing: Ideal for big data applications where you need to process large datasets across distributed systems quickly. Example: Processing large volumes of social media data to derive insights into user behavior.
5. ETL (Extract, Transform, Load) Operations: Columnar databases are efficient for ETL workflows, where data needs to be extracted, transformed, and loaded into a data warehouse for further analysis. Example: Transforming and loading customer data from multiple sources into a central data warehouse.
ClickHouse: A Deep Dive
ClickHouse is an open-source, high-performance columnar database management system designed for real-time analytics. Developed by Yandex, it's optimized for handling large volumes of data quickly, making it one of the fastest columnar stores available.
Features
• High-Performance Queries: ClickHouse is optimized for fast analytical queries, capable of processing billions of rows per second.
• Columnar Storage: Stores data by columns, enabling efficient compression and retrieval, particularly for read-heavy analytics workloads.
• Distributed and Scalable: Supports distributed deployment, allowing you to scale horizontally across multiple servers.
• SQL Support: Offers a SQL-like querying language, making it accessible for users familiar with SQL.
• Data Compression: Uses advanced compression techniques, reducing storage costs and improving I/O performance.
Example: Real-Time Web Traffic Analytics
A company wants to analyze real-time web traffic data to monitor website activity, such as page
views, unique visitors, and session duration.
Table Structure (Simplified Example) and Sample Data
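A minimal sketch using the clickhouse-connect Python client; the host, table, and column names are illustrative assumptions, not the original lab code:

from datetime import datetime

import clickhouse_connect

# Connect and create a simplified web-traffic table.
client = clickhouse_connect.get_client(host="localhost")
client.command("""
    CREATE TABLE IF NOT EXISTS web_traffic (
        event_time DateTime,
        country String,
        device String,
        page_views UInt32,
        session_duration_sec UInt32
    ) ENGINE = MergeTree ORDER BY event_time
""")

# Insert a few sample rows.
client.insert(
    "web_traffic",
    [
        [datetime(2024, 1, 1, 10, 0), "India", "mobile", 3, 180],
        [datetime(2024, 1, 1, 10, 1), "USA", "desktop", 5, 240],
        [datetime(2024, 1, 1, 10, 2), "India", "mobile", 2, 95],
    ],
    column_names=["event_time", "country", "device",
                  "page_views", "session_duration_sec"],
)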

Total Page Views by Country
This query retrieves the total number of page views per country, sorted by the highest page views.

Average Session Duration for Mobile Users
This query calculates the average session duration for users accessing the website via mobile devices.
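Sketches of both queries against the hypothetical table above:

import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Total page views per country, highest first.
result = client.query("""
    SELECT country, sum(page_views) AS total_views
    FROM web_traffic
    GROUP BY country
    ORDER BY total_views DESC
""")
print(result.result_rows)

# Average session duration for mobile users.
result = client.query("""
    SELECT avg(session_duration_sec) AS avg_session_sec
    FROM web_traffic
    WHERE device = 'mobile'
""")
print(result.result_rows)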
NoSQL: Key-Value Store
A key-value store is a type of NoSQL database that stores data as a collection of key-value pairs, similar to a dictionary or hash table. Each key is unique and maps directly to a value, which can be a simple string, number, JSON object, XML, or even a binary data object. Key-value stores are highly optimized for simplicity, performance, and scalability, making them ideal for applications requiring rapid data access and updates.

Features
• Simplicity: The data model is straightforward, consisting of keys and their corresponding values, which makes it easy to use and implement.
• High Performance: Optimized for quick read and write operations due to their simple data structure.
• Scalability: Can scale horizontally by distributing data across multiple nodes, making them highly scalable and capable of handling large amounts of data.
• Flexibility: Values can store various data types, allowing for different data structures.
Key-Value Store Use Cases
Caching
• Key-value stores are frequently used for caching data to reduce database load and improve application response time (a caching sketch follows after this list).
• Example: Using Redis to cache frequently accessed data, such as product details or user sessions.
Session Management
• Ideal for storing user session data in web applications, allowing quick access and updates.
• Example: Storing user login sessions in Redis for an e-commerce site.
Real-time Analytics
• Suitable for applications requiring high-speed data access and updates, like tracking user activities in real time.
• Example: Using Amazon DynamoDB to track user actions in a gaming application.
Leaderboards and Gaming
• Excellent for maintaining real-time leaderboards in online games.
• Example: Using Redis to maintain a live ranking of players based on their scores.
IoT Data Storage
• Can efficiently handle data generated by IoT devices, such as sensor readings.
• Example: Using DynamoDB to store sensor data from connected devices in a smart home system.
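A minimal cache-aside sketch with the redis-py client; the key pattern and loader function are hypothetical:

import json

import redis

cache = redis.Redis(host="localhost", port=6379)

def load_product_from_db(product_id: int) -> dict:
    # Placeholder for a real database lookup.
    return {"id": product_id, "name": "Smartphone", "price": 699}

def get_product(product_id: int) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: skip the database
    product = load_product_from_db(product_id)  # cache miss: query the database
    cache.set(key, json.dumps(product), ex=300)  # cache for 5 minutes
    return product

print(get_product(42))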
Q&A
