Data Engineering - Session 02

Uploaded by

Divam


Course Curriculum

• Session 01 – Theory
• Introduction to Enterprise Data, Data Engineering,
Modern Data Applications & Patterns, Data
Frameworks, Components & Best Practices
• Session 02 – Theory & Lab Demos
• Introduction to Data Stores: SQL, NoSQL, File Systems,
Data Lakes, Data Warehouses, Data Mesh, Cloud Data
Products, Lab Demos of select data stores
• Session 03 – Theory & Lab Demos
• Data Architecture Layers, Data Pipelines,
Transformation, Orchestration, Data Aggregation vs
Federation, Lab Demos of select Data Pipeline
Products
• Session 04 – Theory & In-Class Design
• Data Governance: Data Catalogs, Data Quality,
Lineage, Provenance, Data Security, Regulatory
Compliance, Real-World Application Data Design
• Tutorials
Introduction to Data Modeling

Data Modeling is the process of creating a visual representation or blueprint of how data will be stored, organized, and managed in a database or data system. It involves defining data elements, structures, and relationships, allowing organizations to understand and organize their data effectively for various applications, such as data storage, retrieval, and analytics.
Traditional Data Modeling Methods
• Physical Data Model: Represents the actual implementation of the database, including table structures, columns, data types, indexes, primary keys, foreign keys, and storage details. Guides database administrators and engineers in building the actual database structure.
  Ex: In a relational database, the "Customer" table might be implemented with columns such as customer_id (integer, primary key), name (varchar), email (varchar), and phone_number (varchar), with details on indexing and constraints.
• Logical Data Model: A more detailed version of the conceptual model that defines how data should be organized, including attributes, data types, and relationships, without considering the actual database implementation. Provides a detailed understanding of the data requirements and is often used as a bridge between conceptual and physical data models.
  Ex: For a "Customer" entity, the logical model might specify attributes such as Customer_ID, Name, Email, and Phone_Number, defining relationships and cardinality (e.g., one customer can place multiple orders).
• Conceptual Data Model: Provides a high-level view of the data and focuses on defining the entities, their attributes, and relationships. It answers the "what" without worrying about how data is stored or processed. Used during the initial design phase to outline the scope and structure of the data.
  Ex: An e-commerce data model might include entities like Customer, Product, Order, and Payment, with relationships such as "Customer places an Order" and "Order contains Product."
A quick example for Conceptual-Logical-Physical Modeling
• Artifact 02
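As a rough illustration of the physical level, the "Customer" example can be sketched with Python's built-in sqlite3 module. The customer columns follow the example in the text; the orders table, the index name, and the sample row are assumptions added only to show the one-to-many relationship from the logical model.

```python
import sqlite3

# Physical model: concrete tables, column types, keys, and an index.
# The orders table and idx_orders_customer are illustrative assumptions.
DDL = """
CREATE TABLE customer (
    customer_id  INTEGER PRIMARY KEY,
    name         VARCHAR(100) NOT NULL,
    email        VARCHAR(255) NOT NULL UNIQUE,
    phone_number VARCHAR(20)
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       REAL NOT NULL
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
"""


def build_database():
    """Create an in-memory database with the physical schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the foreign key
    conn.executescript(DDL)
    return conn


def orders_per_customer(conn):
    """One customer can place multiple orders: count them per customer."""
    return conn.execute(
        "SELECT c.name, COUNT(o.order_id) "
        "FROM customer c LEFT JOIN orders o ON o.customer_id = c.customer_id "
        "GROUP BY c.customer_id ORDER BY c.customer_id"
    ).fetchall()


conn = build_database()
conn.execute(
    "INSERT INTO customer VALUES (1, 'Asha', 'asha@example.com', '555-0100')"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, 1, 20.0), (2, 1, 35.5)]
)
print(orders_per_customer(conn))
```

The conceptual and logical levels have no executable form; only the physical level commits to types, keys, and indexes, which is why it is the one shown as code.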
Advanced Data Modeling Techniques

Semantic Data Modeling: meaning representation

Ontology based Modeling: knowledge representation

Semantic Data Modeling
• The Semantic Data Model is a high-level data model that defines the meaning (semantics) of data and the relationships between different data elements. It goes beyond traditional data models by incorporating the context and meaning of the data, ensuring that data relationships and dependencies are explicitly defined and understood. It’s often represented using entity-relationship diagrams (ERDs), enriched with more semantic details about the data.
• Purpose
  • It operates at a conceptual level, focusing on the meaning of data rather than its physical storage.
  • Contextual Relationships: Data is modeled with an emphasis on relationships, hierarchies, and constraints, offering a more expressive way of understanding data compared to purely structural models.
• Example:
  • In a retail scenario, instead of just defining Customer and Product entities, a semantic data model would include relationships such as:
    • "A Customer purchases a Product."
    • "A Product belongs to a Category."
    • "A Customer has a Loyalty Status."
  • This allows for more nuanced queries, such as finding all customers who have purchased products in a particular category with a certain loyalty status.
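The retail relationships above can be sketched as explicit facts in plain Python. This is a minimal illustration, not a semantic modeling tool: the customer and product names, the relationship labels, and the helper functions are all hypothetical.

```python
# Each fact is a (subject, relationship, object) statement.
# All names below are made-up sample data.
FACTS = [
    ("alice", "purchases", "laptop"),
    ("bob", "purchases", "novel"),
    ("carol", "purchases", "phone"),
    ("laptop", "belongs_to", "Electronics"),
    ("phone", "belongs_to", "Electronics"),
    ("novel", "belongs_to", "Books"),
    ("alice", "has_loyalty_status", "Gold"),
    ("bob", "has_loyalty_status", "Gold"),
    ("carol", "has_loyalty_status", "Silver"),
]


def objects_of(subject, relationship):
    """Return every object linked to `subject` by `relationship`."""
    return {o for s, r, o in FACTS if s == subject and r == relationship}


def customers_by_category_and_status(category, status):
    """Customers with `status` who purchased a product in `category`."""
    customers = {s for s, r, _ in FACTS if r == "purchases"}
    result = []
    for customer in sorted(customers):
        bought_in_category = any(
            category in objects_of(product, "belongs_to")
            for product in objects_of(customer, "purchases")
        )
        has_status = status in objects_of(customer, "has_loyalty_status")
        if bought_in_category and has_status:
            result.append(customer)
    return result


print(customers_by_category_and_status("Electronics", "Gold"))
```

Because the relationships are named, the query reads close to the sentence it answers, which is the point the slide makes about semantic models.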
Ontology Modeling
• An Ontology is a formal, explicit specification of a shared conceptualization of a domain. It represents knowledge as a set of concepts within a domain and the relationships between those concepts. It uses a more structured and formal language (e.g., OWL - Web Ontology Language) to define classes (concepts), properties (relationships), and instances (data values). Ontologies are used to represent complex domains and enable systems to reason about the data, making them foundational for the Semantic Web and Knowledge Graphs.
• Purpose
  • Knowledge Representation Model: Ontology falls under knowledge representation and is considered a form of semantic modeling that captures domain knowledge, rules, and relationships.
  • Formal Model: Unlike semantic data models, ontologies are formalized using standards such as OWL and RDF (Resource Description Framework) and queried with SPARQL, making them machine-readable and interpretable.
• Example:
  • In an e-commerce ontology, you might define:
    • Classes: Customer, Product, Order, Payment Method
    • Properties: purchases, belongs_to_category, uses_payment_method
    • Instances: John Doe (instance of Customer), "Smartphone" (instance of Product)
  • This allows you to infer new knowledge, such as finding all customers who have purchased electronics using a credit card, thanks to the relationships defined in the ontology.
A quick example for Ontology Modeling
• Artifact 03
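To make the inference step concrete, here is a toy sketch in plain Python rather than OWL or RDF. The classes, properties, and the instances John Doe and Smartphone come from the example; the second customer, the "Mobile Phones" category, the sub_category_of property, and the single inference rule are assumptions. Attaching uses_payment_method directly to the customer is also a simplification.

```python
# A tiny knowledge base of (subject, property, object) triples.
TRIPLES = {
    ("John Doe", "type", "Customer"),
    ("Jane Roe", "type", "Customer"),          # hypothetical instance
    ("Smartphone", "type", "Product"),
    ("Cookbook", "type", "Product"),           # hypothetical instance
    ("John Doe", "purchases", "Smartphone"),
    ("Jane Roe", "purchases", "Cookbook"),
    ("John Doe", "uses_payment_method", "Credit Card"),
    ("Jane Roe", "uses_payment_method", "Credit Card"),
    ("Smartphone", "belongs_to_category", "Mobile Phones"),
    ("Cookbook", "belongs_to_category", "Books"),
    ("Mobile Phones", "sub_category_of", "Electronics"),  # assumed hierarchy
}


def infer_categories(triples):
    """Apply one rule until nothing changes: a product in category X
    also belongs to every category that X is a sub-category of."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != "belongs_to_category":
                continue
            for s2, p2, o2 in list(inferred):
                if p2 == "sub_category_of" and s2 == o:
                    new_fact = (s, "belongs_to_category", o2)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred


def customers_buying(category, payment_method, triples):
    """Customers who bought something in `category` using `payment_method`."""
    kb = infer_categories(triples)
    customers = {s for s, p, o in kb if p == "type" and o == "Customer"}
    products = {s for s, p, o in kb
                if p == "belongs_to_category" and o == category}
    return sorted(
        c for c in customers
        if (c, "uses_payment_method", payment_method) in kb
        and any((c, "purchases", prod) in kb for prod in products)
    )


print(customers_buying("Electronics", "Credit Card", TRIPLES))
```

No triple states that the Smartphone is an Electronics product; the answer depends on a fact derived by the rule, which is the kind of reasoning a real OWL reasoner performs at much larger scale.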
Data Storage
A Deep Dive
Recap of Concepts
CAP Theorem
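The CAP Theorem, also known as Brewer's Theorem, is a fundamental principle in distributed database systems. It states that a distributed database can guarantee at most two of three properties at the same time: Consistency (C), Availability (A), and Partition Tolerance (P). Because network partitions cannot be ruled out in practice, the working choice during a partition is between consistency and availability.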

Key Takeaway: The CAP theorem helps understand the trade-offs in distributed database systems, guiding the choice of database design based on whether consistency, availability, or partition tolerance is the priority for a specific use case.
Implication of the CAP Theorem and Use Cases
The CAP Theorem states that in a distributed database system, you can achieve only two out of the three properties: Consistency (C), Availability (A), and Partition Tolerance (P). Let’s see how this works with specific database examples and use cases.
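The trade-off can be illustrated with a toy two-replica model in Python. This is a sketch under strong simplifying assumptions, not how any real database replicates: the TinyCluster class, its "CP" and "AP" modes, and the simulate helper are all hypothetical.

```python
class Replica:
    """One node holding a copy of the data."""

    def __init__(self):
        self.data = {}


class TinyCluster:
    """Two replicas of a key-value map; `mode` is 'CP' or 'AP'."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, key, value):
        """Client writes through replica A."""
        if self.partitioned:
            if self.mode == "CP":
                # Cannot replicate to B, so refuse rather than diverge.
                raise RuntimeError("unavailable during partition")
            self.a.data[key] = value  # AP: accept locally, B is now stale
            return
        self.a.data[key] = value
        self.b.data[key] = value

    def read_from_b(self, key):
        """Another client reads through replica B."""
        return self.b.data.get(key)


def simulate(mode):
    """Write once, split the network, then attempt a second write."""
    cluster = TinyCluster(mode)
    cluster.write("stock", 10)   # replicated to both nodes
    cluster.partitioned = True   # network split between A and B
    try:
        cluster.write("stock", 9)
        accepted = True
    except RuntimeError:
        accepted = False
    return accepted, cluster.a.data["stock"], cluster.read_from_b("stock")


print(simulate("CP"))
print(simulate("AP"))
```

In CP mode the second write is rejected and both replicas still agree; in AP mode the write is accepted but a reader on replica B sees the old value until the partition heals.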
ACID Properties

The ACID properties (Atomicity, Consistency, Isolation, Durability) define a set of principles that ensure reliable, consistent, and error-free transactions in database systems, particularly in SQL (relational) databases. These properties are crucial for maintaining data integrity, especially in OLTP (Online Transaction Processing) systems.
Key Takeaway: ACID properties are essential for ensuring data
accuracy and reliability in relational databases, making them
the preferred choice for applications that require strong
consistency, such as OLTP systems.
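A small sketch with Python's built-in sqlite3 module shows atomicity and consistency in action; isolation and durability are not demonstrated here. The account table, the balances, and the transfer function are assumptions made up for the illustration.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two sample accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


def balances(conn):
    return conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()


conn = open_bank()
print(transfer(conn, 1, 2, 30), balances(conn))
print(transfer(conn, 1, 2, 500), balances(conn))
```

In the failing transfer the credit to account 2 has already been applied inside the transaction when the debit violates the constraint, so the rollback has to undo it. That undo is the atomicity guarantee.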
OLTP and OLAP Database Patterns
Traditionally, data architectures had a clear segregation of purpose: OLTP (Online Transaction Processing) data stores, typically used for transactional needs, and OLAP (Online Analytical Processing) data stores, typically used for reporting and analytical needs. The following table elaborates the typical differences.
Top OLTP Databases
Top OLAP Databases
NoSQL (Not only SQL)
NoSQL is a category of non-relational databases designed to handle a wide variety of data models, including unstructured, semi-structured, and structured data. Unlike traditional SQL databases, which use fixed schemas and tabular structures, NoSQL databases offer more flexibility, scalability, and the ability to handle big data and real-time applications.
Purpose
With the explosion of data generated by web applications, IoT devices, social media, and big data analytics, traditional SQL databases struggled to handle these modern data requirements. NoSQL databases emerged to address the following needs:
• Scalability: NoSQL databases can scale horizontally across multiple servers, making them ideal for handling large volumes of data and high-traffic workloads.
• Flexibility: They support flexible schema designs, allowing data structures to evolve over time without requiring costly database migrations.
• Handling Big Data: Suitable for handling diverse data types, such as text, images, videos, JSON, XML, and binary data.
• High Performance: Designed to manage high read/write speeds, making them suitable for real-time applications.
Types of NoSQL
NoSQL: Document Store
Document stores, also known as document-oriented databases, are a type of NoSQL database that stores data in a semi-structured format, typically using JSON, BSON (Binary JSON), or XML documents. Each document can have a flexible schema, meaning different documents can have different fields, data types, and structures, making document stores highly adaptable and ideal for handling dynamic or evolving data.
Key Features of Document Stores
• Flexible Schema: Unlike traditional relational databases, document stores allow for a flexible schema, enabling you to store varied data structures within the same collection.
• Hierarchical Data Representation: Documents can contain nested structures, arrays, and sub-documents, making them suitable for complex, hierarchical data.
• High Scalability: Document stores can scale horizontally across multiple servers, handling large volumes of data and high-traffic workloads efficiently.
• Rich Query Capabilities: They offer powerful querying options, allowing for searches, filtering, and aggregation on data within the documents.
MongoDB: A Deep Dive
MongoDB is a widely used document-oriented NoSQL database that stores data in a flexible, JSON-like format
called BSON (Binary JSON). It’s designed to handle unstructured and semi-structured data efficiently, making it
suitable for applications that require flexibility, scalability, and high performance.

Key Concepts
Example 1: E-commerce Product Catalog
In an e-commerce application, the product catalog can vary greatly in structure. Some products might have different attributes like sizes, colors, or specifications. With MongoDB, you can store each product as a document in a collection.
Querying Product Catalog in MongoDB
1. Find Products in a Specific Category
Objective: Find all products in the "Electronics" category.

2. Find Products Within a Price Range

Objective: Retrieve all products in the "Electronics" category that cost between $500 and $800.

3. Retrieve Desired Products
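The first two queries above can be sketched as MongoDB filter documents, written here as Python dicts. Since no server is assumed, a small in-memory matcher stands in for the database and supports only equality, $gte, and $lte. The sample products, their fields, and the matches and find helpers are hypothetical.

```python
# Sample product documents; note that each one has different fields.
PRODUCTS = [
    {"_id": 1, "name": "Laptop", "category": "Electronics", "price": 750,
     "specs": {"ram_gb": 16}},
    {"_id": 2, "name": "Headphones", "category": "Electronics", "price": 120,
     "colors": ["black", "white"]},
    {"_id": 3, "name": "T-Shirt", "category": "Clothing", "price": 20,
     "sizes": ["S", "M", "L"]},
]

# Filters in MongoDB's query-document syntax.
IN_ELECTRONICS = {"category": "Electronics"}
ELECTRONICS_500_TO_800 = {
    "category": "Electronics",
    "price": {"$gte": 500, "$lte": 800},
}

# Local stand-ins for two comparison operators.
OPERATORS = {
    "$gte": lambda value, bound: value >= bound,
    "$lte": lambda value, bound: value <= bound,
}


def matches(document, query):
    """Evaluate a small subset of filter syntax: equality, $gte, $lte."""
    for field, condition in query.items():
        if field not in document:
            return False
        value = document[field]
        if isinstance(condition, dict):
            if not all(OPERATORS[op](value, bound)
                       for op, bound in condition.items()):
                return False
        elif value != condition:
            return False
    return True


def find(collection, query):
    """Return every document in the list that satisfies the filter."""
    return [doc for doc in collection if matches(doc, query)]


print([p["name"] for p in find(PRODUCTS, IN_ELECTRONICS)])
print([p["name"] for p in find(PRODUCTS, ELECTRONICS_500_TO_800)])
```

Against a real deployment, the same filter dicts would be passed to a collection's find method in the pymongo driver, for example on a collection named products (the name is an assumption).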

Example 2: User Profiles in a Social Media Platform
• In a social media application, user profiles can have diverse attributes. MongoDB’s flexible schema allows for varying data types and structures:
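As a hedged illustration of what varying structures can look like, the sketch below keeps three profile documents with different optional fields in one list and reports which profiles use which field. The usernames, fields, and the field_usage helper are invented for the example.

```python
# Three profiles in the same "collection", each with a different shape.
PROFILES = [
    {"username": "asha", "email": "asha@example.com",
     "interests": ["hiking", "photography"]},
    {"username": "ben", "email": "ben@example.com",
     "location": {"city": "Pune", "country": "India"}, "followers": 120},
    {"username": "chen", "email": "chen@example.com"},
]


def field_usage(documents):
    """Map each field name to the usernames whose profile contains it."""
    usage = {}
    for doc in documents:
        for field in doc:
            usage.setdefault(field, []).append(doc["username"])
    return usage


for field, users in field_usage(PROFILES).items():
    print(field, users)
```

A relational table would need nullable columns or extra tables for interests, location, and followers; here each document simply carries the fields it has.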
NoSQL: Columnar Store
Columnar stores, or column-family databases, are databases Key Features
that store data by columns rather than by rows, making them
highly efficient for analytical and read-heavy workloads. Unlike Column-based Storage: Data is
High Performance for
traditional row-based storage, where data is stored and read stored column by column,
Aggregation: Since only the
row by row, columnar stores allow you to read only the relevant columns are read
enabling high compression
necessary columns, leading to faster query performance, during a query, operations like
rates and efficient data retrieval
SUM, AVG, COUNT, and other
especially for aggregation and analytics operations. for analytical queries.
aggregations are much faster.

Efficient Data Compression:

Scalability: Designed to handle
Storing data by columns allows
large volumes of data, making
for better compression,
them suitable for big data
reducing storage costs and
analytics and data warehousing.
improving I/O performance.

IMP: Columnar stores like ClickHouse, Amazon Redshift, and Google BigQuery are essential for
modern data analytics, offering high performance, scalability, and efficient storage.
ClickHouse, in particular, stands out for its real-time analytics capabilities, making it an excellent
choice for log analysis, time-series data, and BI reporting. Its columnar architecture ensures that only
the necessary data is read, enabling rapid query performance and efficient data processing.
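The difference between row and column layouts can be sketched in a few lines of Python. This is a conceptual illustration only: real columnar engines use far more elaborate on-disk formats and codecs, and the sample orders and helper names are assumptions.

```python
# Row-oriented layout: one record per order.
ROWS = [
    {"order_id": 1, "country": "IN", "amount": 120.0},
    {"order_id": 2, "country": "US", "amount": 80.0},
    {"order_id": 3, "country": "IN", "amount": 50.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Compress runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


COLUMNS = to_columns(ROWS)

# An aggregate touches a single column instead of every full row.
total_amount = sum(COLUMNS["amount"])

# Values of one column are similar, so sorted runs compress well.
encoded_country = run_length_encode(sorted(COLUMNS["country"]))

print(total_amount, encoded_country)
```

Run-length encoding is used here only as the simplest example of why storing similar values together helps compression.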
Columnar Store Use Cases
1. Data Warehousing and Business Intelligence (BI):
  • Columnar databases excel in aggregating and analyzing large volumes of historical data, making them ideal for data warehousing and BI applications.
  • Example: Analyzing sales data to generate monthly or quarterly reports.
2. Real-time Analytics:
  • Suitable for processing and analyzing streaming data in real time, enabling insights into live data.
  • Example: Monitoring website traffic or user activity in real time.
3. Time-Series Data Analysis:
  • Efficiently handles time-series data where large amounts of data are collected over time, such as financial market data or IoT sensor readings.
  • Example: Tracking stock prices, energy consumption, or weather patterns over time.
4. Big Data Processing:
  • Ideal for big data applications where you need to process large datasets across distributed systems quickly.
  • Example: Processing large volumes of social media data to derive insights into user behavior.
5. ETL (Extract, Transform, Load) Operations:
  • Columnar databases are efficient for ETL workflows, where data needs to be extracted, transformed, and loaded into a data warehouse for further analysis.
  • Example: Transforming and loading customer data from multiple sources into a central data warehouse.
ClickHouse: A Deep Dive
ClickHouse is an open-source, high-performance columnar database management system designed for real-time analytics. Originally developed at Yandex, it’s optimized for handling large volumes of data quickly, making it one of the fastest columnar stores available.
Features
• High-Performance Queries: ClickHouse is optimized for fast analytical queries, capable of processing billions of rows per second.
• Columnar Storage: Stores data by columns, enabling efficient compression and retrieval, particularly for read-heavy analytics workloads.
• Distributed and Scalable: Supports distributed deployment, allowing you to scale horizontally across multiple servers.
• Data Compression: Uses advanced compression techniques, reducing storage costs and improving I/O performance.
• SQL Support: Offers a SQL-like querying language, making it accessible for users familiar with SQL.
Example: Real-Time Web Traffic Analytics
A company wants to analyze real-time web traffic data to monitor website activity, such as page
views, unique visitors, and session duration.
Table Structure (Simplified Example)

Insert some data

Total Page Views by Country

This query retrieves the total number of page views per country, sorted by the highest page views.

Average Session Duration for Mobile Users

This query calculates the average session duration for users accessing the website via mobile devices.
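The two queries described above can be sketched in standard SQL. To keep the example runnable without a ClickHouse server, Python's built-in sqlite3 module is used as a stand-in. The web_traffic table, its columns, and the sample rows are assumptions; in ClickHouse itself the CREATE TABLE statement would also need an ENGINE clause (for example MergeTree with an ORDER BY key).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table structure for page-view events.
conn.execute("""
    CREATE TABLE web_traffic (
        event_time       TEXT,
        user_id          INTEGER,
        page_url         TEXT,
        country          TEXT,
        device_type      TEXT,
        session_duration INTEGER  -- seconds
    )
""")

# Insert some made-up sample data.
conn.executemany(
    "INSERT INTO web_traffic VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2024-01-01 10:00:00", 1, "/home", "India", "mobile", 120),
        ("2024-01-01 10:01:00", 2, "/products", "India", "desktop", 300),
        ("2024-01-01 10:02:00", 3, "/home", "USA", "mobile", 60),
        ("2024-01-01 10:03:00", 1, "/cart", "India", "mobile", 180),
    ],
)

# Total page views per country, highest first.
PAGE_VIEWS_BY_COUNTRY = """
    SELECT country, COUNT(*) AS page_views
    FROM web_traffic
    GROUP BY country
    ORDER BY page_views DESC
"""

# Average session duration for mobile users.
AVG_MOBILE_SESSION = """
    SELECT AVG(session_duration)
    FROM web_traffic
    WHERE device_type = 'mobile'
"""

page_views = conn.execute(PAGE_VIEWS_BY_COUNTRY).fetchall()
avg_mobile = conn.execute(AVG_MOBILE_SESSION).fetchone()[0]
print(page_views, avg_mobile)
```

Both queries read only one or two columns of the table, which is the access pattern a columnar engine is built to serve.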
NoSQL: Key-Value Store
A key-value store is a type of NoSQL database that stores data as a collection of key-value pairs, similar to a dictionary or hash table. Each key is unique and maps directly to a value, which can be a simple string, number, JSON object, XML, or even a binary data object.
Key-value stores are highly optimized for simplicity, performance, and scalability, making them ideal for applications requiring rapid data access and updates.
Features
• Simplicity: The data model is straightforward, consisting of keys and their corresponding values, which makes it easy to use and implement.
• High Performance: Optimized for quick read and write operations due to their simple data structure.
• Scalability: Can scale horizontally by distributing data across multiple nodes, making them highly scalable and capable of handling large amounts of data.
• Flexibility: Values can store various data types, allowing for different data structures.
Key-Value Store Use Cases
• Caching
  • Key-value stores are frequently used for caching data to reduce database load and improve application response time.
  • Example: Using Redis to cache frequently accessed data, such as product details or user sessions.
• Session Management
  • Ideal for storing user session data in web applications, allowing quick access and updates.
  • Example: Storing user login sessions in Redis for an e-commerce site.
• Real-time Analytics
  • Suitable for applications requiring high-speed data access and updates, like tracking user activities in real time.
  • Example: Using Amazon DynamoDB to track user actions in a gaming application.
• Leaderboards and Gaming
  • Excellent for maintaining real-time leaderboards in online games.
  • Example: Using Redis to maintain a live ranking of players based on their scores.
• IoT Data Storage
  • Can efficiently handle data generated by IoT devices, such as sensor readings.
  • Example: Using DynamoDB to store sensor data from connected devices in a smart home system.
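The caching and session use cases above rest on one idea: look up a value by key, and let it expire after a while. A minimal in-memory sketch follows. It is a teaching stand-in, not Redis or DynamoDB; the TTLStore class, the key names, and the injectable clock are assumptions.

```python
import time


class TTLStore:
    """In-memory key-value store whose entries can expire."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so expiry can be tested
        self._items = {}      # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        """Store a value, optionally expiring after `ttl_seconds`."""
        expires_at = None
        if ttl_seconds is not None:
            expires_at = self._clock() + ttl_seconds
        self._items[key] = (value, expires_at)

    def get(self, key, default=None):
        """Return the value, or `default` if missing or expired."""
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # lazily evict the expired entry
            return default
        return value


# A fake clock makes the expiry visible without sleeping.
now = [0.0]
store = TTLStore(clock=lambda: now[0])
store.set("session:42", {"user": "asha"}, ttl_seconds=30)
store.set("product:7", {"name": "Laptop"})  # no expiry

print(store.get("session:42"))
now[0] = 31.0  # 31 seconds later
print(store.get("session:42"), store.get("product:7"))
```

Redis offers the same behaviour natively through an expiry option on its SET command; the sketch only shows why a key-to-value lookup with expiry is enough for caches and sessions.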
Q&A
