
Data Engineering

Behind the Scene of Data


17 November, 2024
A Little About Me

Hoda Ragaie (@hodaragaie)
Data Engineer II @Raisa

I am a 2022 GUC graduate with a degree in Computer Engineering. I first worked at Raisa as a Data Engineer intern, and came back to a full-time role after graduating, over two years ago now. My bachelor's thesis was in data, and I've always been passionate about it.
What is it?

[Image: generated by ChatGPT, with manual enhancements]

Data Engineers wear different hats

[Image: generated by Copilot]
[Image: "The 2024 MAD (Machine Learning, Artificial Intelligence & Data) Landscape", a dense ecosystem map spanning infrastructure, analytics, machine learning & AI, enterprise/horizontal/industry applications, open-source infrastructure, and data sources & APIs. Version 1.0, March 2024, by Matt Turck (@mattturck), Aman Kabeer (@AmanKabeer11) & FirstMark (@firstmarkcap). Blog post: mattturck.com/MAD2024. Interactive version: MAD.firstmarkcap.com]
When there's data, there's a data engineer.
And data is everywhere...

Power of Data

Customer Data
Delivery Services
Website Analytics
Supplier Data
Social Media
Weather Data
Data Sources
How is data created?

Source systems are largely out of the data engineer's control, so plan for incident response & monitoring.

Understand how the upstream source system's architecture is designed, along with its strengths and weaknesses.
Agree on a data contract with source system stakeholders:
What data is being extracted
Via what method (full, incremental)
How often
Schema changes
Point of contact
Uptime
Data quality

Using bad data to make decisions is much worse than having no data ("datastrophes").
Data Storage
Where is data stored?

Purpose & use case:


Identify the purpose of storing the data. What is it used for?
Stream data processing (real-time):
track orders in real time
Batch processing:
reports & analysis (daily)
Update patterns:
Is the solution optimized for quick bulk updates & inserts?
Is the solution optimized for complex queries & analysis?
Cost:
Storage and compute scalability costs
Data Storage

Row-Based Storage Model:
Modifying data is easier; a new row just gets appended at the end.
Slower aggregation, since the data for each row has to be loaded first.
Takes up more space for indexes.

Column-Based Storage Model:
Query performance is much faster for analytics.
Requires less space.
Data modifications are more complex.

Example (Name, Gender, Age): row-based storage lays records out as "Ahmed Male 63 | Malak Female 29", while column-based storage lays them out as "Ahmed Malak | Male Female | 63 29".
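To make the difference concrete, here is a minimal Python sketch of the two layouts; the records and the extra "Sara" row are hypothetical, not from the slides:

```python
# Row-based layout: each record is stored contiguously; appends are cheap.
rows = [("Ahmed", "Male", 63), ("Malak", "Female", 29)]
rows.append(("Sara", "Female", 41))  # a new row just goes at the end

# Column-based layout: each column is stored contiguously; an aggregation
# reads only the column it needs instead of loading whole rows.
columns = {
    "name": ["Ahmed", "Malak"],
    "gender": ["Male", "Female"],
    "age": [63, 29],
}
avg_age = sum(columns["age"]) / len(columns["age"])  # touches only "age"
```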
Data Storage

Relational Database:
Designed for real-time operations and transactions (OLTP)
Optimized for fast data entry and quick updates
Fast lookups done through indexing, to ensure we don't scan the entire table to find records satisfying the WHERE clause
Row-based storage: data stored row by row

Data Warehouse:
Designed for complex analysis and reporting (OLAP)
Often column-based storage: data is stored column by column
This makes them perform well for complex data transformations, aggregations, statistical calculations, or evaluation of complex conditions on large datasets
Still structured data
Data Storage
Object Storage

Has grown in popularity with the rise of big data

Stores any type: txt, csv, json, images, videos, audio, ...
Fully-managed cloud object stores (Amazon S3, Azure Blob, Google Cloud Storage)
Excellent performance for large batch reads and writes
Infrequently accessed storage classes cost less than frequently accessed ones
Can configure retention policies
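As a small illustration of storage classes, a hedged boto3 sketch; the bucket, key, and file names are hypothetical, and STANDARD_IA is S3's infrequent-access class:

```python
import boto3

s3 = boto3.client("s3")
# Upload a file into the cheaper infrequent-access storage class.
s3.upload_file(
    "daily_backup.bak",
    "my-data-bucket",                          # hypothetical bucket
    "backups/2024-11-17/daily_backup.bak",     # hypothetical key
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)
```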
Data Storage
Data Warehouse Architecture

Cloud Provider Flexibility:
Snowflake can be deployed on multiple cloud providers, including AWS, Azure, and Google Cloud.

Separation of Compute and Storage:
Enables independent scaling of each. This means you can scale up or down based on your compute needs without affecting storage.

Storage on Object Stores:
Snowflake's storage is built on top of cloud object stores (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage), ensuring durability and scalability.

Fully Managed Service:
Snowflake handles infrastructure management, security, and optimization. It automatically partitions data for fast querying and provides robust security features. (DWH as a Service)
Data Storage
Data Warehouse Additional Features

Python Transformations:
You can write Python transformations directly in Snowflake using notebooks or stored procedures, making it easy to process and analyze data (a sketch follows this slide).

Copilot Integration:
Snowflake has integrated Copilot, providing AI-powered assistance for various tasks within the platform.

Streamlit Apps:
Snowflake integrates with Streamlit, allowing you to build and deploy interactive data applications. Streamlit is a
framework for creating data apps quickly and easily using Python.
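One way to express such a Python transformation is Snowpark, Snowflake's Python DataFrame API. A minimal sketch, assuming hypothetical connection parameters and table names:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Hypothetical connection parameters.
session = Session.builder.configs({
    "account": "YOUR_ACCOUNT", "user": "YOUR_USER", "password": "YOUR_PASSWORD",
    "warehouse": "COMPUTE_WH", "database": "ANALYTICS", "schema": "PUBLIC",
}).create()

# Hypothetical tables: aggregate delivered orders by city.
orders = session.table("RAW_ORDERS")
(orders.filter(col("STATUS") == "DELIVERED")
       .group_by("CITY")
       .agg(avg("DELIVERY_MIN").alias("AVG_DELIVERY_MIN"))
       .write.save_as_table("MART_DELIVERY_TIMES", mode="overwrite"))
```

Snowpark builds a lazy query plan, so the filter and aggregation execute on the warehouse's compute rather than on the client.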
Data Storage
Data Warehouse Connectors

Python Connector:
Allows you to connect to Snowflake from Python applications, enabling data manipulation and analysis using Python libraries (a minimal example follows this list).
ODBC Connector:
Provides a standard interface for connecting to Snowflake from various applications that support ODBC,
facilitating data access and integration.
JDBC Connector:
Enables Java applications to connect to Snowflake, allowing for seamless data interaction and manipulation
within Java environments
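A minimal example of the Python connector; the account details are placeholders:

```python
import snowflake.connector

# Hypothetical account details.
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="COMPUTE_WH",
)
cur = conn.cursor()
try:
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone())
finally:
    cur.close()
    conn.close()
```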
Data Pipeline
ETL: Data Source A / Data Source B → TRANSFORM → Data Storage
ELT: Data Source A / Data Source B → Data Storage → TRANSFORM
Data Ingestion
How to ingest the data?

FTP Access:
Direct access to a live FTP file system for real-time data retrieval.
Automated scripts pull new files and parse them for loading into SQL Server or other data solutions (see the sketch after this list).
Incremental loading ensures efficiency, with error handling and scheduling for seamless operation.
Backup Files:
Daily database backups sent to S3 and fully loaded into SQL Server.
Pub/Sub Model:
Implemented real-time data updates using a publish-subscribe approach.
Shifted to incremental loading for efficiency.
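The FTP sketch referenced above, using Python's ftplib; the host, credentials, and paths are hypothetical, and "already present locally" stands in for whatever incremental bookkeeping the real pipeline uses:

```python
from ftplib import FTP
import os

LOCAL_DIR = "incoming"  # hypothetical landing folder
os.makedirs(LOCAL_DIR, exist_ok=True)

# Hypothetical host and credentials.
ftp = FTP("ftp.example.com")
ftp.login("user", "password")
ftp.cwd("/exports")

for name in ftp.nlst():
    local_path = os.path.join(LOCAL_DIR, name)
    if os.path.exists(local_path):
        continue  # incremental: only pull files we haven't seen yet
    with open(local_path, "wb") as f:
        ftp.retrbinary(f"RETR {name}", f.write)
    # ...parse the file here and load it into SQL Server...

ftp.quit()
```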
Data Ingestion
How to ingest the data?

Web Scraping:
Extracts data from websites using automated scripts.
Collected data is processed, transformed, and stored in structured formats for analysis.
Ideal for gathering publicly available data.
Email Listener Tool:
Develop a tool to monitor email inboxes and automatically extract attachments (e.g., Excel, PDFs, CSVs).
Files are parsed and loaded into the database for processing (a sketch follows).
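A minimal sketch of such an email listener using Python's imaplib and email modules; the mailbox, credentials, and attachment handling are hypothetical:

```python
import email
import imaplib

# Hypothetical mailbox and credentials.
mail = imaplib.IMAP4_SSL("imap.example.com")
mail.login("ingest@example.com", "password")
mail.select("INBOX")

# Fetch only messages we haven't processed yet.
_, data = mail.search(None, "UNSEEN")
for num in data[0].split():
    _, msg_data = mail.fetch(num, "(RFC822)")
    msg = email.message_from_bytes(msg_data[0][1])
    for part in msg.walk():
        filename = part.get_filename()
        if filename and filename.lower().endswith((".csv", ".xlsx", ".pdf")):
            with open(filename, "wb") as f:
                f.write(part.get_payload(decode=True))
            # ...parse the attachment and load it into the database...
mail.logout()
```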
Data Ingestion

EXTRACT TEXT FROM IMAGE [ OCR ]

Supplier Invoice

Supplier: Fresh Produce Co.
Invoice #: 12345
Date: 2024-11-01

Items:
- Tomatoes: 100 kg @ $2.00/kg = $200.00
- Cucumbers: 50 kg @ $1.50/kg = $75.00
- Lettuce: 30 kg @ $1.00/kg = $30.00

Subtotal: $305.00 ❌
Tax (10%): $30.50
Total: $335.50

Contact: 123-456-7890
Email: [email protected]
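A hedged sketch of the OCR step; the slide does not name a specific engine, so this assumes the open-source Tesseract via pytesseract, with a hypothetical file name:

```python
import pytesseract
from PIL import Image

# Requires the Tesseract binary to be installed on the machine.
text = pytesseract.image_to_string(Image.open("supplier_invoice.png"))
print(text)
# Downstream parsing (regexes for "Invoice #", line items, totals) plus
# validation, such as recomputing the subtotal from the line items,
# helps catch OCR mis-reads like the flagged value on the slide.
```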
[Diagram: files uploaded from the source system (x 10), together with run-job metadata such as Date: 17-11-2024.]

Security: Principle of least privilege. Access only to the essential data and resources needed to perform the intended task.

Observability & Monitoring: "Data is a silent killer." Monitor, log, alert. Focus on Data Observability Driven Development.

Data Quality: Can I trust this data? Accuracy, completeness, timeliness.

Incident Response: Everything breaks all the time and mistakes happen. Design so that you can find the root cause rapidly.

Data Modeling: Make data available in a usable form.

Orchestration: Coordinating many jobs to run efficiently together on a schedule cadence. As simple as a cron job, or a scheduler tool like Airflow.

Data Architecture: Understand the needs of the business and gather requirements for use cases, then translate them into a design, balancing cost and operational simplicity against current and future needs.

Software Engineering: A central skill. Data processing code needs to be written, so you need to be proficient in several languages and code-testing methodologies. Some data engineers even contribute to open-source data projects.
Data Modeling
Type 1 SCD:
If a record in a dimension table changes, the existing record is updated or overwritten. Simple & efficient (merge).
Type 2 SCD:
Maintains historical data by creating a new record with new timestamps (start & end date) when changes occur. Allows for full historic tracking. A minimal sketch of both follows.
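The sketch mentioned above, in pandas, using a hypothetical customer dimension where city is the tracked attribute:

```python
import pandas as pd

dim = pd.DataFrame([
    {"customer_id": 1, "city": "Cairo", "start_date": "2023-01-01", "end_date": None},
])

def scd_type1(dim, customer_id, new_city):
    # Type 1: overwrite in place; no history is kept.
    dim.loc[dim.customer_id == customer_id, "city"] = new_city
    return dim

def scd_type2(dim, customer_id, new_city, change_date):
    # Type 2: close the current record, then append a new one.
    current = (dim.customer_id == customer_id) & dim.end_date.isna()
    dim.loc[current, "end_date"] = change_date
    new_row = {"customer_id": customer_id, "city": new_city,
               "start_date": change_date, "end_date": None}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = scd_type2(dim, 1, "Alexandria", "2024-11-17")
```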
Data Modeling

[Image: example dimension tables illustrating Type 2 SCD vs Type 1 SCD]

Data Transformation

Python or SQL-based
DWHs like Snowflake have string-matching functions built-in
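For example, Snowflake's built-in EDITDISTANCE function can rank fuzzy string matches. A minimal sketch via the Python connector, with placeholder credentials and a hypothetical suppliers table:

```python
import snowflake.connector

# Connection details as in the connector sketch earlier (hypothetical).
conn = snowflake.connector.connect(user="YOUR_USER", password="YOUR_PASSWORD",
                                   account="YOUR_ACCOUNT")
cur = conn.cursor()
# EDITDISTANCE is a built-in Snowflake string function; the suppliers
# table here is hypothetical.
cur.execute("""
    SELECT supplier_name,
           EDITDISTANCE(supplier_name, 'Fresh Produce Co.') AS dist
    FROM suppliers
    ORDER BY dist
    LIMIT 5
""")
print(cur.fetchall())
```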
Data Transformation
Data Serving

Weather Data + Sales Data → Aggregate →

keep pizza warmer during longer delivery times in winter
invest in better insulated delivery bags
optimize delivery routes
promotional campaigns

Data is at its best when it leads to action.
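A toy pandas sketch of that kind of aggregation; the file names, columns, and the 10°C cutoff are all hypothetical:

```python
import pandas as pd

# Hypothetical daily extracts: deliveries(date, duration_min), weather(date, temp_c).
deliveries = pd.read_parquet("deliveries.parquet")
weather = pd.read_parquet("weather.parquet")

joined = deliveries.merge(weather, on="date")
by_weather = (
    joined.assign(cold=joined.temp_c < 10)
          .groupby("cold")["duration_min"]
          .mean()
)
print(by_weather)  # compare average delivery time on cold vs mild days
```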
Raisa Energy
At Raisa Energy, we leverage our extensive data on oil and natural gas wells to make
informed investment decisions. Unlike operators who are responsible for drilling and
extracting resources, our role is to evaluate these wells and decide whether to invest in them.
We earn a percentage of the profits and production from these wells and contribute a share
of the associated expenses.
Full-Picture

[Diagram: SQL Server feeds Snowflake through an Extract-Load Tool. The Data Science, Development, and Data Engineering teams, plus Business Users and an Application, read and write through an in-house Snowflake connector.]
Extract-Load Tool

[Diagram: driven by Registry.yml, the tool constructs a query (SELECT * FROM table_A) against SQL Server, writes the result to a Parquet file, exposes it to Snowflake as an external table, and syncs updates, deletes, and inserts.]
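A hedged sketch of the extract step; this is not Raisa's actual tool, just one way to dump a SQL Server table to Parquet with pandas and pyodbc (the connection string is a placeholder):

```python
import pandas as pd
import pyodbc

# Hypothetical SQL Server connection string.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
df = pd.read_sql("SELECT * FROM table_A", conn)
df.to_parquet("table_A.parquet", index=False)  # needs pyarrow or fastparquet
# The Parquet file can then be uploaded to object storage and registered
# as an external table in Snowflake for the data-sync step.
```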
Full-Picture

[Same diagram as above, repeated.]
Snowflake Connector
In-house Snowflake Python Connector Tool
Purpose:
Enforces table and column naming conventions for consistent DWH organization.
Ensures data science teams follow standard practices.
Features:
Historical Tracking: Enables monitoring of table changes, if enabled.
Benefits:
Improved data governance and team collaboration
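A toy sketch of the kind of naming-convention check such a connector might perform; the specific pattern is hypothetical, not Raisa's actual convention:

```python
import re

# Hypothetical convention: layer prefix plus snake_case, e.g. stg_supplier_invoices.
TABLE_NAME_PATTERN = re.compile(r"^(stg|int|mart)_[a-z0-9_]+$")

def validate_table_name(name: str) -> None:
    # Reject names that break the convention before the DDL reaches Snowflake.
    if not TABLE_NAME_PATTERN.match(name):
        raise ValueError(f"table name '{name}' violates the naming convention")

validate_table_name("stg_supplier_invoices")  # passes
```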
Full-Picture

[Same diagram as above, repeated.]
Transformations

[Diagram: source tables A, B, and C feed views A and B, which produce transformed tables A, B, C, ...]

Each team member writes SQL scripts in their own way
Data transformations are scattered and not re-usable
Debugging and maintaining scripts is very time consuming
Lack of lineage and documentation
Lack of version control
Harder to maintain data quality
Transformations

Open-Source: dbt is an open-source tool that connects to your data warehouse.


Project Organization: Helps you organize your queries into a single, version-controlled project.
Python Support: Since Snowflake supports running Python, dbt allows you to perform transformations using
Python.
Utilizes DWH Resources: Transformations run on the data warehouse, leveraging its compute resources and
scaling capabilities.
Data Lineage and Documentation: Tracks data lineage and provides documentation for the data models
you develop.
Data Testing: Includes built-in data tests to proactively identify issues, and supports custom data tests.
Community and Documentation: Well-documented with a large online community for support.
SQL Conventions: Enforces conventions for writing SQL queries, ensuring consistency across the team.
Learning Resources
Books
Fundamentals of Data Engineering by Joe Reis and Matt Housley
SQL Cookbook by Anthony Molinaro and Robert de Graaf
Fluent Python by Luciano Ramalho
Courses (Tool Specific)
Microsoft Data Engineering Certification
Snowflake Certification
dbt Certification
Read Medium Posts
Follow People and Engage in Discussions on Data Engineering
Raisa Blog: http://tech.raisa.com/
We are hiring!
https://raisa.recruitee.com
Thank you
