
Data Engineering

Behind the Scene of Data


17 November, 2024
A Little About Me

Hoda Ragaie (@hodaragaie)
Data Engineer II @Raisa

I am a 2022 GUC graduate with a degree in Computer Engineering. I first worked at Raisa as a Data Engineer intern, and came back to a full-time role after graduating, over two years ago now. My bachelor's thesis was in data, and I've always been passionate about it.
What is it?

[Image: generated by ChatGPT, with manual enhancements]

Data Engineers wear different hats

[Image: generated by Copilot]
[Image: "The 2024 MAD (Machine Learning, Artificial Intelligence & Data) Landscape", a dense ecosystem map spanning infrastructure, analytics, machine learning & AI, enterprise/horizontal/industry applications, open-source infrastructure, and data sources & APIs. Version 1.0, March 2024, by Matt Turck (@mattturck), Aman Kabeer (@AmanKabeer11) & FirstMark (@firstmarkcap). Blog post: mattturck.com/MAD2024. Interactive version: MAD.firstmarkcap.com]
When there's data, there's a data engineer.
And data is everywhere...

Power of Data

Customer Data
Delivery Services
Website Analytics
Supplier Data
Social Media
Weather Data
Data Sources
How is data created?

Source systems are largely out of the data engineer's control, so plan for incident response & monitoring.

Understand how the upstream source system's architecture is designed, along with its strengths and weaknesses.
Agree on a data contract with source system stakeholders:
What data is being extracted
Via what method (full, incremental)
How often
Schema changes
Point of contact
Uptime
Data quality

Using bad data to make decisions is much worse than having no data ("datastrophes").
Data Storage
Where is data stored?

Purpose & use case:


Identify the purpose of storing the data. What is it used for?
Stream data processing (real-time):
track orders in real time
Batch processing:
reports & analysis (daily)
Update patterns:
Is the solution optimized for quick bulk updates & inserts?
Is the solution optimized for complex queries & analysis?
Cost:
Storage and compute scalability costs
Data Storage

Row-Based Storage Model:
Modifying data is easier; a new row just gets appended at the end.
Slower aggregation, since the data for each row has to be loaded first.
Takes up more space for indexes.

Column-Based Storage Model:
Query performance is much faster for analytics.
Requires less space.
Data modifications are more complex.

Example (Name, Gender, Age): row-based storage lays records out as "Ahmed Male 63 | Malak Female 29", while column-based storage lays them out as "Ahmed Malak | Male Female | 63 29".
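To make the difference concrete, here is a minimal Python sketch of the two layouts; the records and the extra "Sara" row are hypothetical, not from the slides:

```python
# Row-based layout: each record is stored contiguously; appends are cheap.
rows = [("Ahmed", "Male", 63), ("Malak", "Female", 29)]
rows.append(("Sara", "Female", 41))  # a new row just goes at the end

# Column-based layout: each column is stored contiguously; an aggregation
# reads only the column it needs instead of loading whole rows.
columns = {
    "name": ["Ahmed", "Malak"],
    "gender": ["Male", "Female"],
    "age": [63, 29],
}
avg_age = sum(columns["age"]) / len(columns["age"])  # touches only "age"
```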
Data Storage

Relational Database:
Designed for real-time operations and transactions (OLTP)
Optimized for fast data entry and quick updates
Fast lookups done through indexing, to ensure we don't scan the entire table to find records satisfying the WHERE clause
Row-based storage: data stored row by row

Data Warehouse:
Designed for complex analysis and reporting (OLAP)
Often column-based storage: data is stored column by column
This makes them perform well for complex data transformations, aggregations, statistical calculations, or evaluation of complex conditions on large datasets
Still structured data
Data Storage
Object Storage

Has grown in popularity with the rise of big data

Stores any type: txt, csv, json, images, videos, audio, ...
Fully-managed cloud object stores (Amazon S3, Azure Blob, Google Cloud Storage)
Excellent performance for large batch reads and writes
Infrequently accessed storage classes cost less than frequently accessed ones
Can configure retention policies
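As a small illustration of storage classes, a hedged boto3 sketch; the bucket, key, and file names are hypothetical, and STANDARD_IA is S3's infrequent-access class:

```python
import boto3

s3 = boto3.client("s3")
# Upload a file into the cheaper infrequent-access storage class.
s3.upload_file(
    "daily_backup.bak",
    "my-data-bucket",                          # hypothetical bucket
    "backups/2024-11-17/daily_backup.bak",     # hypothetical key
    ExtraArgs={"StorageClass": "STANDARD_IA"},
)
```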
Data Storage
Data Warehouse Architecture

Cloud Provider Flexibility:
Snowflake can be deployed on multiple cloud providers, including AWS, Azure, and Google Cloud.

Separation of Compute and Storage:
Enables independent scaling of each. This means you can scale up or down based on your compute needs without affecting storage.

Storage on Object Stores:
Snowflake's storage is built on top of cloud object stores (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage), ensuring durability and scalability.

Fully Managed Service:
Snowflake handles infrastructure management, security, and optimization. It automatically partitions data for fast querying and provides robust security features. (DWH as a Service)
Data Storage
Data Warehouse Additional Features

Python Transformations:
You can write Python transformations directly in Snowflake using notebooks or stored procedures, making it easy to process and analyze data (a sketch follows this slide).

Copilot Integration:
Snowflake has integrated Copilot, providing AI-powered assistance for various tasks within the platform.

Streamlit Apps:
Snowflake integrates with Streamlit, allowing you to build and deploy interactive data applications. Streamlit is a
framework for creating data apps quickly and easily using Python.
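One way to express such a Python transformation is Snowpark, Snowflake's Python DataFrame API. A minimal sketch, assuming hypothetical connection parameters and table names:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col

# Hypothetical connection parameters.
session = Session.builder.configs({
    "account": "YOUR_ACCOUNT", "user": "YOUR_USER", "password": "YOUR_PASSWORD",
    "warehouse": "COMPUTE_WH", "database": "ANALYTICS", "schema": "PUBLIC",
}).create()

# Hypothetical tables: aggregate delivered orders by city.
orders = session.table("RAW_ORDERS")
(orders.filter(col("STATUS") == "DELIVERED")
       .group_by("CITY")
       .agg(avg("DELIVERY_MIN").alias("AVG_DELIVERY_MIN"))
       .write.save_as_table("MART_DELIVERY_TIMES", mode="overwrite"))
```

Snowpark builds a lazy query plan, so the filter and aggregation execute on the warehouse's compute rather than on the client.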
Data Storage
Data Warehouse Connectors

Python Connector:
Allows you to connect to Snowflake from Python applications, enabling data manipulation and analysis using Python libraries (a minimal example follows this list).
ODBC Connector:
Provides a standard interface for connecting to Snowflake from various applications that support ODBC,
facilitating data access and integration.
JDBC Connector:
Enables Java applications to connect to Snowflake, allowing for seamless data interaction and manipulation
within Java environments
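A minimal example of the Python connector; the account details are placeholders:

```python
import snowflake.connector

# Hypothetical account details.
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="COMPUTE_WH",
)
cur = conn.cursor()
try:
    cur.execute("SELECT CURRENT_VERSION()")
    print(cur.fetchone())
finally:
    cur.close()
    conn.close()
```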
Data Pipeline
ETL: Data Source A / Data Source B → TRANSFORM → Data Storage
ELT: Data Source A / Data Source B → Data Storage → TRANSFORM
Data Ingestion
How to ingest the data?

FTP Access:
Direct access to a live FTP file system for real-time data retrieval.
Automated scripts pull new files and parse them for loading into SQL Server or other data solutions (see the sketch after this list).
Incremental loading ensures efficiency, with error handling and scheduling for seamless operation.
Backup Files:
Daily database backups sent to S3 and fully loaded into SQL Server.
Pub/Sub Model:
Implemented real-time data updates using a publish-subscribe approach.
Shifted to incremental loading for efficiency.
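The FTP sketch referenced above, using Python's ftplib; the host, credentials, and paths are hypothetical, and "already present locally" stands in for whatever incremental bookkeeping the real pipeline uses:

```python
from ftplib import FTP
import os

LOCAL_DIR = "incoming"  # hypothetical landing folder
os.makedirs(LOCAL_DIR, exist_ok=True)

# Hypothetical host and credentials.
ftp = FTP("ftp.example.com")
ftp.login("user", "password")
ftp.cwd("/exports")

for name in ftp.nlst():
    local_path = os.path.join(LOCAL_DIR, name)
    if os.path.exists(local_path):
        continue  # incremental: only pull files we haven't seen yet
    with open(local_path, "wb") as f:
        ftp.retrbinary(f"RETR {name}", f.write)
    # ...parse the file here and load it into SQL Server...

ftp.quit()
```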
Data Ingestion
How to ingest the data?

Web Scraping:
Extracts data from websites using automated scripts.
Collected data is processed, transformed, and stored in structured formats for analysis.
Ideal for gathering publicly available data.
Email Listener Tool:
Develop a tool to monitor email inboxes and automatically extract attachments (e.g., Excel, PDFs, CSVs).
Files are parsed and loaded into the database for processing (a sketch follows).
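A minimal sketch of such an email listener using Python's imaplib and email modules; the mailbox, credentials, and attachment handling are hypothetical:

```python
import email
import imaplib

# Hypothetical mailbox and credentials.
mail = imaplib.IMAP4_SSL("imap.example.com")
mail.login("ingest@example.com", "password")
mail.select("INBOX")

# Fetch only messages we haven't processed yet.
_, data = mail.search(None, "UNSEEN")
for num in data[0].split():
    _, msg_data = mail.fetch(num, "(RFC822)")
    msg = email.message_from_bytes(msg_data[0][1])
    for part in msg.walk():
        filename = part.get_filename()
        if filename and filename.lower().endswith((".csv", ".xlsx", ".pdf")):
            with open(filename, "wb") as f:
                f.write(part.get_payload(decode=True))
            # ...parse the attachment and load it into the database...
mail.logout()
```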
Data Ingestion

EXTRACT TEXT FROM IMAGE [ OCR ]

Supplier Invoice

Supplier: Fresh Produce Co.
Invoice #: 12345
Date: 2024-11-01

Items:
- Tomatoes: 100 kg @ $2.00/kg = $200.00
- Cucumbers: 50 kg @ $1.50/kg = $75.00
- Lettuce: 30 kg @ $1.00/kg = $30.00

Subtotal: $305.00 ❌
Tax (10%): $30.50
Total: $335.50

Contact: 123-456-7890
Email: [email protected]
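A hedged sketch of the OCR step; the slide does not name a specific engine, so this assumes the open-source Tesseract via pytesseract, with a hypothetical file name:

```python
import pytesseract
from PIL import Image

# Requires the Tesseract binary to be installed on the machine.
text = pytesseract.image_to_string(Image.open("supplier_invoice.png"))
print(text)
# Downstream parsing (regexes for "Invoice #", line items, totals) plus
# validation, such as recomputing the subtotal from the line items,
# helps catch OCR mis-reads like the flagged value on the slide.
```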
[Diagram: files uploaded from the source system (x 10), together with run-job metadata such as Date: 17-11-2024.]

Security: Principle of least privilege. Access only to the essential data and resources needed to perform the intended task.

Observability & Monitoring: "Data is a silent killer." Monitor, log, alert. Focus on Data Observability Driven Development.

Data Quality: Can I trust this data? Accuracy, completeness, timeliness.

Incident Response: Everything breaks all the time and mistakes happen. Design so that you can find the root cause rapidly.

Data Modeling: Make data available in a usable form.

Orchestration: Coordinating many jobs to run efficiently together on a schedule cadence. As simple as a cron job, or a scheduler tool like Airflow.

Data Architecture: Understand the needs of the business and gather requirements for use cases, then translate them into a design, balancing cost and operational simplicity against current and future needs.

Software Engineering: A central skill. Data processing code needs to be written, so you need to be proficient in several languages and code-testing methodologies. Some data engineers even contribute to open-source data projects.
Data Modeling
Type 1 SCD:
If a record in a dimension table changes, the existing record is updated or overwritten. Simple & efficient (merge).
Type 2 SCD:
Maintains historical data by creating a new record with new timestamps (start & end date) when changes occur. Allows for full historic tracking. A minimal sketch of both follows.
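The sketch mentioned above, in pandas, using a hypothetical customer dimension where city is the tracked attribute:

```python
import pandas as pd

dim = pd.DataFrame([
    {"customer_id": 1, "city": "Cairo", "start_date": "2023-01-01", "end_date": None},
])

def scd_type1(dim, customer_id, new_city):
    # Type 1: overwrite in place; no history is kept.
    dim.loc[dim.customer_id == customer_id, "city"] = new_city
    return dim

def scd_type2(dim, customer_id, new_city, change_date):
    # Type 2: close the current record, then append a new one.
    current = (dim.customer_id == customer_id) & dim.end_date.isna()
    dim.loc[current, "end_date"] = change_date
    new_row = {"customer_id": customer_id, "city": new_city,
               "start_date": change_date, "end_date": None}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

dim = scd_type2(dim, 1, "Alexandria", "2024-11-17")
```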
Data Modeling

[Image: example dimension tables illustrating Type 2 SCD vs Type 1 SCD]

Data Transformation

Python or SQL-based
DWHs like Snowflake have string-matching functions built-in
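For example, Snowflake's built-in EDITDISTANCE function can rank fuzzy string matches. A minimal sketch via the Python connector, with placeholder credentials and a hypothetical suppliers table:

```python
import snowflake.connector

# Connection details as in the connector sketch earlier (hypothetical).
conn = snowflake.connector.connect(user="YOUR_USER", password="YOUR_PASSWORD",
                                   account="YOUR_ACCOUNT")
cur = conn.cursor()
# EDITDISTANCE is a built-in Snowflake string function; the suppliers
# table here is hypothetical.
cur.execute("""
    SELECT supplier_name,
           EDITDISTANCE(supplier_name, 'Fresh Produce Co.') AS dist
    FROM suppliers
    ORDER BY dist
    LIMIT 5
""")
print(cur.fetchall())
```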
Data Transformation
Data Serving

Weather Data + Sales Data → Aggregate →

keep pizza warmer during longer delivery times in winter
invest in better insulated delivery bags
optimize delivery routes
promotional campaigns

Data is at its best when it leads to action.
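A toy pandas sketch of that kind of aggregation; the file names, columns, and the 10°C cutoff are all hypothetical:

```python
import pandas as pd

# Hypothetical daily extracts: deliveries(date, duration_min), weather(date, temp_c).
deliveries = pd.read_parquet("deliveries.parquet")
weather = pd.read_parquet("weather.parquet")

joined = deliveries.merge(weather, on="date")
by_weather = (
    joined.assign(cold=joined.temp_c < 10)
          .groupby("cold")["duration_min"]
          .mean()
)
print(by_weather)  # compare average delivery time on cold vs mild days
```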
Raisa Energy
At Raisa Energy, we leverage our extensive data on oil and natural gas wells to make
informed investment decisions. Unlike operators who are responsible for drilling and
extracting resources, our role is to evaluate these wells and decide whether to invest in them.
We earn a percentage of the profits and production from these wells and contribute a share
of the associated expenses.
Full-Picture

[Diagram: SQL Server feeds Snowflake through an Extract-Load Tool. The Data Science, Development, and Data Engineering teams, plus Business Users and an Application, read and write through an in-house Snowflake connector.]
Extract-Load Tool

[Diagram: driven by Registry.yml, the tool constructs a query (SELECT * FROM table_A) against SQL Server, writes the result to a Parquet file, exposes it to Snowflake as an external table, and syncs updates, deletes, and inserts.]
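A hedged sketch of the extract step; this is not Raisa's actual tool, just one way to dump a SQL Server table to Parquet with pandas and pyodbc (the connection string is a placeholder):

```python
import pandas as pd
import pyodbc

# Hypothetical SQL Server connection string.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
df = pd.read_sql("SELECT * FROM table_A", conn)
df.to_parquet("table_A.parquet", index=False)  # needs pyarrow or fastparquet
# The Parquet file can then be uploaded to object storage and registered
# as an external table in Snowflake for the data-sync step.
```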
Full-Picture

[Same diagram as above, repeated.]
Snowflake Connector
In-house Snowflake Python Connector Tool
Purpose:
Enforces table and column naming conventions for consistent DWH organization.
Ensures data science teams follow standard practices.
Features:
Historical Tracking: Enables monitoring of table changes, if enabled.
Benefits:
Improved data governance and team collaboration
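A toy sketch of the kind of naming-convention check such a connector might perform; the specific pattern is hypothetical, not Raisa's actual convention:

```python
import re

# Hypothetical convention: layer prefix plus snake_case, e.g. stg_supplier_invoices.
TABLE_NAME_PATTERN = re.compile(r"^(stg|int|mart)_[a-z0-9_]+$")

def validate_table_name(name: str) -> None:
    # Reject names that break the convention before the DDL reaches Snowflake.
    if not TABLE_NAME_PATTERN.match(name):
        raise ValueError(f"table name '{name}' violates the naming convention")

validate_table_name("stg_supplier_invoices")  # passes
```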
Full-Picture

[Same diagram as above, repeated.]
Transformations

[Diagram: source tables A, B, and C feed views A and B, which produce transformed tables A, B, C, ...]

Each team member writes SQL scripts in their own way
Data transformations are scattered and not re-usable
Debugging and maintaining scripts is very time consuming
Lack of lineage and documentation
Lack of version control
Harder to maintain data quality
Transformations

Open-Source: dbt is an open-source tool that connects to your data warehouse.


Project Organization: Helps you organize your queries into a single, version-controlled project.
Python Support: Since Snowflake supports running Python, dbt allows you to perform transformations using
Python.
Utilizes DWH Resources: Transformations run on the data warehouse, leveraging its compute resources and
scaling capabilities.
Data Lineage and Documentation: Tracks data lineage and provides documentation for the data models
you develop.
Data Testing: Includes built-in data tests to proactively identify issues, and supports custom data tests.
Community and Documentation: Well-documented with a large online community for support.
SQL Conventions: Enforces conventions for writing SQL queries, ensuring consistency across the team.
Learning Resources
Books
Fundamentals of Data Engineering by Joe Reis and Matt Housley
SQL Cookbook by Anthony Molinaro and Robert de Graaf
Fluent Python by Luciano Ramalho
Courses (Tool Specific)
Microsoft Data Engineering Certification
Snowflake Certification
dbt Certification
Read Medium Posts
Follow People and Engage in Discussions on Data Engineering
Raisa Blog: http://tech.raisa.com/
We are hiring!
https://raisa.recruitee.com
Thank you
