Data Engineering - Behind the Scene of Data by Hoda Ragaie
Generated by copilot
THE 2024 MAD (MACHINE LEARNING, ARTIFICIAL INTELLIGENCE & DATA) LANDSCAPE
[Figure: landscape map. Top-level categories: Infrastructure; Analytics; Machine Learning & Artificial Intelligence; Applications (Enterprise); Applications (Industry). Subcategories shown include Storage, Data Lakes / Lakehouses, Data Warehouses, Streaming / In-Memory, ETL / ELT / Reverse ETL, Data Integration, Data Governance & Catalog, Orchestration, and Data Quality & Observability.]
Version 1.0 - March 2024 © Matt Turck (@mattturck) , Aman Kabeer (@AmanKabeer11) & FirstMark (@firstmarkcap) Blog post: mattturck.com/MAD2024 Interactive version: MAD.firstmarkcap.com Comments? Email [email protected]
When there's data, there's a data engineer.
and data is everywhere ...
Power of Data
Customer Data Delivery Services
Using bad data to make decisions is much worse than having no data -
datastrophes
Data Storage
Where is data stored?
Row-based storage:
Modifying data is easier. A new row just gets appended at the end.
Aggregation is slower, since the data for each row has to be loaded first.
Takes up more space for indexes.
Column-based storage:
Query performance is much faster for analytics.
Requires less space.
Data modifications are more complex.
Example table:
Ahmed Male 63
Malak Female 29
Stored row by row: Ahmed Male 63 | Malak Female 29
Stored column by column: Ahmed Malak | Male Female | 63 29
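The trade-off between the two layouts can be sketched in plain Python, using the two records from this slide. This is only an illustration of the idea, not how any particular database stores data; the function names are made up for the example.

```python
# The two records shown on the slide.
records = [("Ahmed", "Male", 63), ("Malak", "Female", 29)]
COLUMNS = ("name", "gender", "age")

# Row-based layout: each record is kept together.
row_store = list(records)

# Column-based layout: each column is kept together.
column_store = {
    column: [record[i] for record in records]
    for i, column in enumerate(COLUMNS)
}


def append_to_row_store(store, record):
    """Adding a row is a single append at the end."""
    store.append(record)


def append_to_column_store(store, record):
    """Adding a row touches every column, so modifications cost more."""
    for column, value in zip(COLUMNS, record):
        store[column].append(value)


def average_age_row_store(store):
    """Aggregation has to walk every whole record to reach one field."""
    return sum(record[2] for record in store) / len(store)


def average_age_column_store(store):
    """Aggregation reads only the one column it needs."""
    ages = store["age"]
    return sum(ages) / len(ages)


print(average_age_row_store(row_store))
print(average_age_column_store(column_store))
```

Both functions return the same average; the difference is how much data each one has to read to get there.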
Data Storage
Relational Database:
Designed for real-time operations and transactions (OLTP).
Optimized for fast data entry and quick updates.
Fast lookups are done through indexing, to ensure we don't scan the entire table to find records satisfying the WHERE clause.
Row-based storage: data is stored row by row.
Data Warehouse:
Designed for complex analysis and reporting (OLAP).
Often column-based storage: data is stored column by column.
This makes them perform well for complex data transformations, aggregations, statistical calculations, or evaluation of complex conditions on large datasets.
Still structured data.
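To see what "fast lookups through indexing" means in practice, here is a small sketch using SQLite from the Python standard library as a stand-in for a relational database. The table name, index name, and helper function are assumptions made for the example; the exact wording of the query plan differs between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last field of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, gender TEXT, age INTEGER)"
)
conn.executemany(
    "INSERT INTO people (name, gender, age) VALUES (?, ?, ?)",
    [("Ahmed", "Male", 63), ("Malak", "Female", 29)],
)

lookup = "SELECT * FROM people WHERE name = ?"

# Without an index, the WHERE clause is answered by scanning the table.
plan_before = query_plan(conn, lookup, ("Malak",))

# With an index on the filtered column, the engine can search instead.
conn.execute("CREATE INDEX idx_people_name ON people (name)")
plan_after = query_plan(conn, lookup, ("Malak",))

print(plan_before)
print(plan_after)
```

The second plan should mention the index by name, which is the engine confirming it no longer needs to read every row.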
Data Storage
Object Storage
Python Transformations:
You can write Python transformations directly in Snowflake using notebooks or stored procedures, making it
easy to process and analyze data.
Copilot Integration:
Snowflake has integrated Copilot, providing AI-powered assistance for various tasks within the platform.
Streamlit Apps:
Snowflake integrates with Streamlit, allowing you to build and deploy interactive data applications. Streamlit is a
framework for creating data apps quickly and easily using Python.
Data Storage
Data Warehouse Connectors
Python Connector:
Allows you to connect to Snowflake from Python applications, enabling data manipulation and analysis using
Python libraries.
ODBC Connector:
Provides a standard interface for connecting to Snowflake from various applications that support ODBC,
facilitating data access and integration.
JDBC Connector:
Enables Java applications to connect to Snowflake, allowing for seamless data interaction and manipulation
within Java environments.
Data Pipeline
ETL ELT
TRANSFORM
FTP Access:
Direct access to a live FTP file system for real-time data retrieval.
Automated scripts pull new files and parse them for loading into SQL Server or other data solutions.
Incremental loading ensures efficiency, with error handling and scheduling for seamless operation.
Backup Files:
Daily database backups sent to S3 and fully loaded into SQL Server.
Pub/Sub Model:
Implemented real-time data updates using a publish-subscribe approach.
Shifted to incremental loading for efficiency.
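Incremental loading with error handling, as mentioned for the FTP and Pub/Sub sources, can be sketched roughly as follows. The file listing is a plain list standing in for whatever the real source returns, and `LoadState`, `select_new_files`, `run_incremental_load`, and `fake_load` are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class LoadState:
    """Remembers which files were already loaded successfully."""
    loaded_files: set = field(default_factory=set)


def select_new_files(listing, state):
    """Keep only the files that have not been loaded yet."""
    return sorted(name for name in listing if name not in state.loaded_files)


def run_incremental_load(listing, state, load_file):
    """Load new files one by one; a failure does not stop the run."""
    loaded, failed = [], []
    for name in select_new_files(listing, state):
        try:
            load_file(name)
        except Exception as exc:
            # Failed files are not marked as loaded, so they are retried next run.
            failed.append((name, str(exc)))
            continue
        state.loaded_files.add(name)
        loaded.append(name)
    return loaded, failed


def fake_load(name):
    """Stand-in loader: pretends files starting with 'bad' cannot be parsed."""
    if name.startswith("bad"):
        raise ValueError("could not parse " + name)


state = LoadState()
first_run = run_incremental_load(["a.csv", "bad.csv"], state, fake_load)
second_run = run_incremental_load(["a.csv", "bad.csv", "c.csv"], state, fake_load)
print(first_run)
print(second_run)
```

On the second run only the new file is loaded, and the failed file is retried rather than silently dropped. A scheduler would call `run_incremental_load` on a timer and persist the state between runs.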
Data Ingestion
How to ingest the data?
Web Scraping:
Extracts data from websites using automated scripts.
Collected data is processed, transformed, and stored in structured formats for analysis.
Ideal for gathering publicly available data.
Email Listener Tool:
Develop a tool to monitor email inboxes and automatically extract attachments (e.g., Excel, PDFs, CSVs).
Files are parsed and loaded into the database for processing.
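The attachment-extraction step of an email listener might look like the sketch below, which uses only the Python standard library. It works on raw message bytes, so connecting to a real mailbox is left out; the sample message, addresses, file contents, and function names are invented for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from pathlib import PurePosixPath

# File types the listener cares about (an assumption for this sketch).
ALLOWED_SUFFIXES = {".csv", ".xlsx", ".pdf"}


def extract_attachments(raw_message):
    """Return (filename, content bytes) for each attachment of interest."""
    message = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in message.iter_attachments():
        filename = part.get_filename()
        if not filename:
            continue
        if PurePosixPath(filename).suffix.lower() not in ALLOWED_SUFFIXES:
            continue
        found.append((filename, part.get_payload(decode=True)))
    return found


def build_sample_message():
    """Build a small test email with one CSV and one image attached."""
    msg = EmailMessage()
    msg["From"] = "supplier@example.com"
    msg["To"] = "data-team@example.com"
    msg["Subject"] = "Invoice"
    msg.set_content("Invoice attached.")
    msg.add_attachment(
        b"item,amount\nwidget,305.00\n",
        maintype="application",
        subtype="octet-stream",
        filename="invoice.csv",
    )
    msg.add_attachment(
        b"\x89PNG", maintype="image", subtype="png", filename="logo.png"
    )
    return msg.as_bytes()


attachments = extract_attachments(build_sample_message())
for filename, content in attachments:
    print(filename, len(content))
```

The returned bytes can then be handed to whichever parser fits the file type before loading into the database.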
Data Ingestion
Supplier Invoice
Subtotal: $305.00 ❌
Tax (10%): $30.50
Total: $335.50
Contact: 123-456-7890
Email: [email protected]
+ run job metadata
❌
upload
x 10 x 10
Date: 17-11-2024
Source System
Security:
Principle of least privilege. Access only to the essential data and resources needed to perform the intended task.
Observability & Monitoring:
"Data is a silent killer". Monitor, log, alert. Focus on Data Observability Driven Development.
Data Quality:
Can I trust this data? Accuracy, completeness, timeliness.
Incident Response:
Everything breaks all the time and mistakes happen. Design to be able to find the root cause rapidly.
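The three data quality questions (accuracy, completeness, timeliness) can each be turned into a small check. The sketch below uses the figures from the supplier invoice slide; the field names, tolerance, and freshness window are assumptions chosen for illustration.

```python
from datetime import date, timedelta


def check_completeness(record, required_fields):
    """Return the required fields that are missing or empty."""
    return [f for f in required_fields if record.get(f) in (None, "")]


def check_accuracy(record, tolerance=0.005):
    """Check that the invoice amounts agree with each other."""
    problems = []
    if abs(record["subtotal"] * record["tax_rate"] - record["tax"]) > tolerance:
        problems.append("tax does not match subtotal * tax_rate")
    if abs(record["subtotal"] + record["tax"] - record["total"]) > tolerance:
        problems.append("total does not match subtotal + tax")
    return problems


def check_timeliness(record, today, max_age_days):
    """Check that the record is not older than the allowed window."""
    return (today - record["date"]) <= timedelta(days=max_age_days)


invoice = {
    "subtotal": 305.00,
    "tax_rate": 0.10,
    "tax": 30.50,
    "total": 335.50,
    "date": date(2024, 11, 17),
}

print(check_completeness(invoice, ["subtotal", "tax", "total", "date"]))
print(check_accuracy(invoice))
print(check_timeliness(invoice, today=date(2024, 11, 18), max_age_days=1))
```

Checks like these are cheap to run on every load, and logging their results gives the monitoring and incident-response side something concrete to alert on.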
Python or SQL-Based
A DWH like Snowflake has built-in string matching functions.
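On the Python side, approximate string matching can be sketched with the standard library's `difflib`. This is a stand-in for illustration, not the warehouse's own functions; the reference names, the cutoff, and the helper names are assumptions.

```python
from difflib import get_close_matches


def normalize(name):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())


def match_name(raw_name, reference_names, cutoff=0.8):
    """Return the closest reference name, or None if nothing is close enough."""
    lookup = {normalize(name): name for name in reference_names}
    hits = get_close_matches(normalize(raw_name), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


reference = ["Acme Supplies", "Globex Trading"]
print(match_name("  ACME  Suplies ", reference))
print(match_name("Initech", reference))
```

The cutoff controls how strict the match is; values that match nothing are better surfaced for review than forced onto the nearest name.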
Data Transformation
Data Serving
[Diagram labels: Weather Data, Sales Data, SQL Server, Extract-Load Tool, In-house Snowflake Connector (read/write), Data Science Team, Development Team, Data Engineering Team, Business Users, Application]
Extract-Load Tool
[Diagram labels: SELECT * FROM table_A, SQL Server, Registry.yml, External Table, updates / deletes / inserts]
Data Sync
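One way to think about a sync step that carries updates, deletes, and inserts is as a diff between a source snapshot and the target, keyed on a primary key. The sketch below is a generic illustration under that assumption, not a description of the actual tool; `diff_tables` and the sample rows are invented.

```python
def diff_tables(source_rows, target_rows, key="id"):
    """Compare two snapshots and return (inserts, updates, deletes)."""
    source = {row[key]: row for row in source_rows}
    target = {row[key]: row for row in target_rows}

    # Rows present in the source but not yet in the target.
    inserts = [source[k] for k in source if k not in target]
    # Rows present in both, but with different content.
    updates = [source[k] for k in source if k in target and source[k] != target[k]]
    # Keys that disappeared from the source.
    deletes = [k for k in target if k not in source]
    return inserts, updates, deletes


source_rows = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b2"},
    {"id": 4, "value": "d"},
]
target_rows = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 3, "value": "c"},
]

inserts, updates, deletes = diff_tables(source_rows, target_rows)
print(inserts, updates, deletes)
```

Comparing full snapshots is simple but reads everything each time; for large tables a change log or a last-modified column would usually be used instead.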
Full-Picture
Snowflake Connector
In-house Snowflake Python Connector Tool
Purpose:
Enforces table and column naming conventions for consistent DWH organization.
Ensures data science teams follow standard practices.
Features:
Historical Tracking: Enables monitoring of table changes, if enabled.
Benefits:
Improved data governance and team collaboration
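Enforcing naming conventions before a table is written could look something like the sketch below. The actual conventions used by the in-house tool are not given in the slides, so the snake_case rule here, along with `NAME_PATTERN` and `validate_table`, is purely an assumption for illustration.

```python
import re

# Assumed convention: lowercase snake_case, starting with a letter.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def validate_table(table_name, column_names):
    """Return a list of naming problems; an empty list means the table passes."""
    problems = []
    if not NAME_PATTERN.fullmatch(table_name):
        problems.append("table name not snake_case: " + table_name)
    for column in column_names:
        if not NAME_PATTERN.fullmatch(column):
            problems.append("column name not snake_case: " + column)
    return problems


print(validate_table("sales_daily", ["store_id", "sale_date"]))
print(validate_table("SalesDaily", ["store id", "amount"]))
```

A wrapper around the connector can run this check first and refuse the write when the list is not empty, which is how a shared tool keeps every team on the same standard.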
Full-Picture
Transformations
[Diagram labels: Source Table A, Source Table B, Source Table C, Transformed Table A, Transformed Table B, Transformed Table C, View A, Transformed View B]
Each team member writes SQL scripts in their own way
Data transformations are scattered and not reusable
Debugging and maintaining scripts is very time-consuming
Lack of lineage and documentation
Lack of version control
Harder to maintain data quality
...
Transformations