DE Unit 4
Feature           | Data Warehouse | Data Lake     | Data Lakehouse   | Data Platform
ACID Transactions | Yes            | No            | Yes              | Yes
Machine Learning  | Limited        | Excellent     | Excellent        | Excellent
Use Cases         | BI & Reporting | Big Data & AI | Hybrid Workloads | End-to-end data management
D. Data Ingestion
The Ingestion Stage in the Data Engineering Lifecycle is
the process of collecting raw data from various sources and bringing it into
storage systems like Data Warehouses, Data Lakes, or Data Lakehouses.
Efficient data ingestion ensures seamless downstream processing, analytics,
and machine learning.
This document discusses key concepts in data ingestion, including:
• Bounded vs. Unbounded Data
• Frequency of Data Ingestion
• Synchronous vs. Asynchronous Processing
• Serialization & Deserialization
• Throughput & Elastic Scalability
• Reliability & Durability
• Payload Considerations
• Push vs. Pull vs. Poll Patterns
1. Bounded vs. Unbounded Data
Feature | Bounded Data (Batch) | Unbounded Data (Streaming)
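As a rough illustration of the contrast, the sketch below reads a bounded dataset (a file with a known end) in one batch and consumes an unbounded source (a generator standing in for a message queue or sensor feed) event by event. It is a minimal, library-free Python sketch; the file name, the fake event generator, and the cutoff used to end the demo are illustrative assumptions, not part of the original notes.

# Sketch: bounded (batch) vs. unbounded (streaming) ingestion in plain Python.
# The CSV path and the fake event generator are placeholders for real sources.
import random
import time
from itertools import islice

def ingest_bounded(path):
    """Bounded data has a known end: read the whole file, then the job finishes."""
    with open(path, encoding="utf-8") as f:
        rows = f.readlines()
    return len(rows)

def unbounded_source():
    """Unbounded data never ends: events keep arriving, like a message queue."""
    while True:
        yield {"ts": time.time(), "value": random.random()}
        time.sleep(0.1)

def ingest_unbounded(source, limit):
    """A streaming job processes events as they arrive; 'limit' only exists so the demo stops."""
    for event in islice(source, limit):
        print("ingested", event)

if __name__ == "__main__":
    # Create a tiny bounded dataset so the batch path is runnable end to end.
    with open("batch.csv", "w", encoding="utf-8") as f:
        f.write("id,value\n1,10\n2,20\n")
    print("bounded rows:", ingest_bounded("batch.csv"))
    ingest_unbounded(unbounded_source(), limit=5)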
2. Late-Arriving Data
Streaming pipelines must handle data that arrives out of order or is delayed
due to network issues, retries, or system failures.
Causes of Late-Arriving Data
• Network latency.
• Device or sensor buffering delays.
• Message queue backlogs.
• Time zone inconsistencies.
• Source system failures and retries.
Handling Late Data
• Event Time vs. Processing Time:
o Event Time: When the event actually happened.
o Processing Time: When the event is processed by the system.
• Windowing Strategies:
o Fixed Windows: A specific duration (e.g., 10 minutes).
o Sliding Windows: Overlapping time intervals.
o Session Windows: Grouping events based on user activity.
• Watermarking: Defines a threshold up to which late events are still accepted and beyond which they are discarded (e.g., Apache Flink, Apache Beam); see the sketch after this list.
• Reprocessing Mechanisms: Some frameworks (Kafka, Flink) allow
reprocessing of past data when necessary.
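As a rough illustration of these ideas, the sketch below assigns events to fixed 10-minute windows by event time and drops anything that arrives too far behind a simple watermark (the latest event time seen, minus an allowed lateness). It is a minimal, library-free Python sketch: the event fields, the 5-minute allowed lateness, and the naive watermark rule are illustrative assumptions, not how Flink or Beam implement watermarking internally.

# Minimal sketch (illustrative only): fixed event-time windows plus a simple
# watermark that discards events arriving too long after the watermark.
from collections import defaultdict

WINDOW_SECONDS = 600          # fixed 10-minute windows (example value)
ALLOWED_LATENESS = 300        # accept events up to 5 minutes late (assumption)

windows = defaultdict(list)   # window start -> events assigned to that window
watermark = 0                 # naive watermark: max event time seen so far

def process(event):
    """Assign an event to its event-time window unless it is too late."""
    global watermark
    event_time = event["event_time"]           # when the event actually happened
    watermark = max(watermark, event_time)     # advance the watermark
    if event_time < watermark - ALLOWED_LATENESS:
        return False                           # too late: discard (or dead-letter)
    window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(event)
    return True

# Example: the third event arrives 500s behind the watermark (> 300s) and is dropped.
events = [
    {"event_time": 1000, "value": 1},
    {"event_time": 1700, "value": 2},   # watermark advances to 1700
    {"event_time": 1200, "value": 3},   # 500s behind the watermark -> dropped
]
for e in events:
    print(e["event_time"], "accepted" if process(e) else "dropped")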
File Format | Pros                                  | Cons
ORC         | Similar to Parquet, good for big data | Higher storage overhead
Key Considerations
• Use Parquet/ORC for analytics workloads.
• Use Avro for schema evolution needs.
• Avoid CSV/JSON for large-scale data ingestion unless required.
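As a rough example of the Parquet recommendation, the snippet below writes a small batch of records to a Parquet file and reads back only selected columns using pyarrow. The column names, file path, and the choice of pyarrow itself are assumptions for illustration; in practice the same data could equally be written by Spark, pandas, or an ingestion framework.

# Sketch: serializing a small batch of ingested records to Parquet with pyarrow.
# Column names and the output path are made up for this example.
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny batch of ingested records (normally this comes from the source system).
batch = pa.table({
    "event_id": [1, 2, 3],
    "user": ["a", "b", "c"],
    "amount": [9.99, 14.50, 3.25],
})

# Columnar, compressed on-disk format suited to analytics workloads.
pq.write_table(batch, "events.parquet")

# Reading back only the columns a query needs is cheap with Parquet.
events = pq.read_table("events.parquet", columns=["event_id", "amount"])
print(events.to_pydict())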