
DE Unit 4

The document discusses various data storage technologies and architectures, including magnetic disk drives, solid-state drives, and random access memory, highlighting their characteristics, advantages, and limitations. It also covers data storage systems such as data warehouses, data lakes, and data lakehouses, explaining their definitions, use cases, and architectures. Additionally, it emphasizes the importance of serialization, compression, and caching in data engineering, and provides insights into the evolution of data storage solutions in modern data architectures.


A. Raw Ingredients of Data Storage


In most data architectures, data frequently passes
through magnetic storage, SSDs, and memory as it works its way through the
various processing phases of a data pipeline. Data storage and query systems
generally follow complex recipes involving distributed systems, numerous
services, and multiple hardware storage layers. These systems require the right
raw ingredients to function correctly.
Magnetic Disk Drive
Magnetic disks utilize spinning platters coated with a ferromagnetic film (Figure
6-3). This film is magnetized by a read/write head during write operations to
physically encode binary data. The read/write head detects the magnetic field
and outputs a bitstream during read operations. Magnetic disk drives have
been around for ages. Still, they form the backbone of bulk data storage
systems because they are significantly cheaper than SSDs per gigabyte of
stored data.
IBM developed magnetic disk drive technology in the 1950s.
Since then, magnetic disk capacities have grown steadily. The first commercial
magnetic disk drive, the IBM 350, had a capacity of 3.75 megabytes. As of this
writing, magnetic drives storing 20 TB are commercially available.
Magnetic drives have several important limitations. First, disk transfer speed, the rate at which data can be read
and written, does not scale in proportion with disk capacity. Disk capacity
scales with areal density (gigabits stored per square inch), whereas transfer
speed scales with linear density (bits per inch).
A second major limitation is seek time. To access data, the
drive must physically relocate the read/write heads to the appropriate track on
the disk. Third, in order to find a particular piece of data on the disk, the disk
controller must wait for that data to rotate under the read/write heads. This
leads to rotational latency.
Solid-State Drive
Solid-state drives (SSDs) store data as charges in flash memory
cells. SSDs eliminate the mechanical components of magnetic drives; the data
is read by purely electronic means. SSDs can look up random data in less than
0.1 ms (100 microseconds). In addition, SSDs can scale both data-transfer
speeds and IOPS by slicing storage into partitions with numerous storage
controllers running in parallel. Commercial SSDs can support transfer speeds of
many gigabytes per second and tens of thousands of IOPS.
Random Access Memory
RAM has several notable characteristics:
• Is attached to a CPU and mapped into CPU address space.
• Stores the code that CPUs execute and the data that this code directly processes.
• Is volatile, while magnetic drives and SSDs are nonvolatile. Though they may occasionally fail and corrupt or lose data, drives generally retain data when powered off.
• Offers significantly higher transfer speeds and faster retrieval times than SSD storage. DDR5 memory, the latest widely used standard for RAM, offers data retrieval latency on the order of 100 ns, roughly 1,000 times faster than SSD.
• Is significantly more expensive than SSD storage, at roughly $10/GB (at the time of this writing).
• Is limited in the amount of RAM attached to an individual CPU and memory controller.
• Is still significantly slower than CPU cache, a type of memory located directly on the CPU die or in the same package.
Networking and CPU
While storage standards such as redundant arrays of independent disks
(RAID) parallelize on a single server, cloud object storage clusters operate at a
much larger scale, with disks distributed across a network and even multiple
data centers and availability zones.
Availability zones are a standard cloud construct
consisting of compute environments with independent power, water, and other
resources. Multizonal storage enhances both the availability and durability of
data.
CPUs handle the details of servicing requests,
aggregating reads, and distributing writes. Storage becomes a web application
with an API, backend service components, and load balancing. Network device
performance and network topology are key factors in realizing high
performance.
Serialization
Serialization is another raw storage ingredient and a critical
element of database design. The decisions around serialization will inform how
well queries perform across a network, CPU overhead, query latency, and
more. Designing a data lake, for example, involves choosing a base storage
system (e.g., Amazon S3) and standards for serialization that balance
interoperability with performance considerations.
Data stored in system memory by software is generally not in a
format suitable for storage on disk or transmission over a network. Serialization
is the process of flattening and packing data into a standard format that a
reader will be able to decode.
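As a minimal sketch of this idea (Python standard library only), the example below serializes an in-memory record into a flat byte format that can be written to disk or sent over a network, then deserializes it on the reading side:

```python
import json

# An in-memory record, here a plain Python dictionary
record = {"order_id": 1001, "amount": 49.95, "currency": "USD"}

# Serialization: flatten and pack the data into a standard byte format
payload = json.dumps(record).encode("utf-8")

# The payload can now be written to disk or transmitted over a network.

# Deserialization: a reader decodes the bytes back into a usable structure
decoded = json.loads(payload.decode("utf-8"))
assert decoded == record
```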
Compression
Compression is another critical component of storage engineering.
On a basic level, compression makes data smaller, but compression algorithms
interact with other details of storage systems in complex ways. Highly efficient
compression has three main advantages in storage systems. First, the data is
smaller and thus takes up less space on the disk. Second, compression
increases the practical scan speed per disk. With a 10:1 compression ratio, we
go from scanning 200 MB/s per magnetic disk to an effective rate of 2 GB/s per
disk. Third, when data is transferred over a network, compression increases the effective network bandwidth.
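As a rough illustration of compression ratios (standard library only; the payload here is deliberately repetitive, and real-world ratios vary widely with the data):

```python
import gzip

# Highly repetitive data compresses very well; real-world ratios vary widely
raw = b"timestamp,level,message\n" + b"2024-01-01T00:00:00,INFO,ok\n" * 10_000

compressed = gzip.compress(raw)

ratio = len(raw) / len(compressed)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {ratio:.0f}:1")
```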
Caching
As we analyze storage systems, it is helpful to put every type of storage we
utilize inside a cache hierarchy (Table 6-1). Most practical data systems rely on
many cache layers assembled from storage with varying performance
characteristics. This starts inside CPUs; processors may deploy up to four cache
tiers. We move down the hierarchy to RAM and SSDs. Cloud object storage is a
lower tier that supports long-term data retention and durability while allowing
for data serving and dynamic data movement in pipelines.
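A toy, application-level sketch of the caching idea (the slow read below is simulated rather than a real object-store call): a small in-memory cache sits in front of a much slower storage tier, so repeated reads are served from memory.

```python
import time
from functools import lru_cache

@lru_cache(maxsize=1024)
def fetch_report(report_id: str) -> bytes:
    # Simulate a slow read from a lower storage tier (e.g., object storage)
    time.sleep(0.5)
    return f"report body for {report_id}".encode("utf-8")

start = time.perf_counter()
fetch_report("monthly-sales")            # cache miss: hits the slow tier
first = time.perf_counter() - start

start = time.perf_counter()
fetch_report("monthly-sales")            # cache hit: served from memory
second = time.perf_counter() - start

print(f"first call: {first:.3f}s, second call: {second:.6f}s")
```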
B. Data Storage Systems
Single Machine Versus Distributed Storage
As data storage and access patterns become more complex and
outgrow the usefulness of a single server, distributing data to more than
one server becomes necessary. Storing data across multiple servers is known as
distributed storage.
Distributed storage coordinates the activities of multiple servers
to store, retrieve, and process data faster and at a larger scale, all while
providing redundancy in case a server becomes unavailable. Distributed
storage is common in architectures where you want built-in redundancy and
scalability for large amounts of data. For example, object storage, Apache
Spark, and cloud data warehouses rely on distributed storage architectures.

Eventual Versus Strong Consistency


Distributed databases often relax consistency according to the BASE pattern: basically available, soft-state, eventual consistency.
Basically available
Consistency is not guaranteed, but efforts at database
reads and writes are made on a best-effort basis, meaning consistent data is
available most of the time.
Soft-state
The state of the transaction is fuzzy, and it’s uncertain
whether the transaction is committed or uncommitted.
Eventual consistency
At some point, reading data will return consistent values.
The opposite of eventual consistency is strong
consistency. With strong consistency, the distributed database ensures that
writes to any node are first distributed with a consensus and that any reads
against the database return consistent values.
Generally, data engineers make decisions about consistency in
three places. First, the database technology itself sets the stage for a certain
level of consistency. Second, configuration parameters for the database will
have an impact on consistency. Third, databases often support some
consistency configuration at an individual query level.
File Storage
A file storage system organizes data into files, which have the following characteristics:
Finite length
A file is a finite-length stream of bytes.
Append operations
We can append bytes to the file up to the limits of the host storage system.
Random access
We can read from any location in the file or write updates to any location.
Local disk storage
The most familiar type of file storage is an operating system–managed
filesystem on a local disk partition of SSD or magnetic disk. Local filesystems
generally support full read after write consistency; reading immediately after
a write will return the written data. Operating systems also employ various
locking strategies to manage concurrent writing attempts to a file.
Network-attached storage
Network-attached storage (NAS) systems provide a file
storage system to clients over a network. NAS is a prevalent solution for
servers; they quite often ship with built-in dedicated NAS interface
hardware. While there are performance penalties to accessing the filesystem
over a network, significant advantages to storage virtualization also exist,
including redundancy and reliability, fine-grained control of resources,
storage pooling across multiple disks for large virtual volumes, and file
sharing across multiple machines. Engineers should be aware of the
consistency model provided by their NAS solution, especially when multiple
clients will potentially access the same data.
Block Storage
Fundamentally, block storage is the type of raw storage
provided by SSDs and magnetic disks. In the cloud, virtualized block storage
is the standard for VMs. These block storage abstractions allow fine control
of storage size, scalability, and data durability beyond that offered by raw
disks.
Blocks on magnetic disks are geometrically arranged on a
physical platter. Two blocks on the same track can be read without moving
the head, while reading two blocks on separate tracks requires a seek. Seek
time can occur between blocks on an SSD, but this is infinitesimal compared
to the seek time for magnetic disk tracks.
Object Storage
Object storage contains objects of all shapes and sizes (Figure 6-
8). The term object storage is somewhat confusing because object has
several meanings in computer science. In this context, we’re talking about a
specialized file-like construct. It could be any type of file—TXT, CSV, JSON,
images, videos, or audio.
Object stores have grown in importance and popularity with
the rise of big data and the cloud. Amazon S3, Azure Blob Storage, and
Google Cloud Storage (GCS) are widely used object stores. In addition, many
cloud data warehouses (and a growing number of databases) utilize object
storage as their storage layer, and cloud data lakes generally sit on object
stores.
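A minimal sketch of reading and writing objects, assuming the boto3 package, configured AWS credentials, and a hypothetical bucket and key; objects are written and read as whole units rather than modified in place:

```python
import boto3  # AWS SDK for Python; assumes credentials are already configured

s3 = boto3.client("s3")

# Write an object; the bucket and key names here are placeholders
s3.put_object(
    Bucket="example-data-lake",
    Key="raw/orders/2024-01-01.json",
    Body=b'{"order_id": 1001, "amount": 49.95}',
)

# Read the object back as a whole unit
obj = s3.get_object(Bucket="example-data-lake", Key="raw/orders/2024-01-01.json")
print(obj["Body"].read())
```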
Streaming Storage
Streaming data has different storage requirements than
nonstreaming data. In the case of message queues, stored data is temporal
and expected to disappear after a certain duration. However, distributed,
scalable streaming frameworks like Apache Kafka now allow extremely long-
duration streaming data retention. Kafka supports indefinite data retention
by pushing old, infrequently accessed messages down to object storage.
Kafka competitors (including Amazon Kinesis, Apache Pulsar, and Google
Cloud Pub/Sub) also support long data retention.
C. Data Engineering Storage Abstractions
Data engineering storage abstractions are data organization and query patterns
that sit at the heart of the data engineering lifecycle and are built atop the data
storage systems
Data Engineering involves designing, building, and managing data
storage and processing systems to support analytics, business intelligence (BI),
and machine learning (ML). Modern data architectures include Data
Warehouses, Data Lakes, and Data Lakehouses, which serve different
purposes based on data types, processing needs, and business objectives. Data
Platforms provide end-to-end solutions that integrate multiple storage and
processing technologies.
This guide explores these concepts in depth, including their architectures, use
cases, advantages, and real-world examples.
1. The Data Warehouse
Definition
A Data Warehouse (DW) is a centralized repository designed for storing
structured data optimized for analytical processing (OLAP - Online Analytical
Processing). It enables businesses to generate reports, dashboards, and
insights based on historical data.
Characteristics
• Schema-on-write: Data is structured before ingestion.
• ACID compliance: Ensures data integrity and consistency.
• Optimized for analytics: Uses columnar storage for fast queries.
• Slow-changing data: Best for historical and aggregated data.
• Supports SQL queries: Business analysts use SQL-based tools for
reporting.
Architecture
1. Data Sources → Extracted from databases, applications, logs, and APIs.
2. ETL (Extract, Transform, Load) Process → Data is cleaned and
transformed.
3. Data Warehouse Storage → Organized in tables and columns.
4. Query Engine → Used for analytics, reporting, and business intelligence
(BI).
Examples of Data Warehouses
• Amazon Redshift (AWS)
• Google BigQuery (Google Cloud)
• Snowflake (Cloud-based)
• Microsoft Azure Synapse Analytics
Use Cases
• Retail & E-commerce: Customer segmentation, sales forecasting.
• Banking & Finance: Fraud detection, risk management.
• Healthcare: Patient history analysis, drug discovery.
• Marketing & Advertising: Campaign performance analysis.
Example
A retail company uses Snowflake to store sales transaction data. Analysts run
SQL queries to track monthly revenue trends across different locations.
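A sketch of what such an analyst query might look like from Python, assuming the snowflake-connector-python package and hypothetical account, credential, table, and column names:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Connection parameters below are placeholders
conn = snowflake.connector.connect(
    account="xy12345", user="analyst", password="***",
    warehouse="ANALYTICS_WH", database="SALES", schema="PUBLIC",
)

query = """
    SELECT store_location,
           DATE_TRUNC('month', sold_at) AS month,
           SUM(amount) AS revenue
    FROM sales_transactions
    GROUP BY store_location, month
    ORDER BY month, store_location
"""

cur = conn.cursor()
try:
    for location, month, revenue in cur.execute(query):
        print(location, month, revenue)
finally:
    cur.close()
    conn.close()
```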
2. The Data Lake
Definition
A Data Lake is a centralized storage repository that can store structured, semi-
structured, and unstructured data in raw form. Unlike a data warehouse, it
does not require predefined schemas and supports a schema-on-read
approach.
Characteristics
• Schema-on-read: Data is transformed only when queried.
• Scalability: Handles massive datasets, including logs, videos, and images.
• Cost-effective: Uses cheap cloud object storage (e.g., Amazon S3).
• Supports multiple data types: Structured (CSV, JSON), Semi-structured
(XML, Parquet), Unstructured (images, audio, video).
• Supports big data processing frameworks: Apache Spark, Hadoop,
Presto.
Architecture
1. Data Ingestion → Collects raw data from logs, IoT devices, and APIs.
2. Data Storage → Stored in cloud object storage (AWS S3, Azure Blob
Storage).
3. Data Processing → Uses frameworks like Apache Spark, Presto, or
Athena.
4. Data Querying → Queries data with tools like Hive, Dremio, Trino.
Examples of Data Lakes
• AWS S3 + AWS Glue
• Azure Data Lake Storage
• Google Cloud Storage (GCS)
• Hadoop Distributed File System (HDFS)
Use Cases
• Big Data Analytics: IoT sensor data processing, clickstream analysis.
• Machine Learning & AI: Storing large ML training datasets.
• Media & Entertainment: Video and audio content archiving.
• Healthcare: Storing medical images and genomic data.
Example
A healthcare company stores MRI scans and patient records in Azure Data
Lake Storage. Machine learning models analyze MRI images to detect
anomalies.
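A minimal sketch of schema-on-read querying over a data lake, assuming PySpark, a hypothetical S3 path, and hypothetical column names; the structure is applied only when the Parquet files are read:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-read").getOrCreate()

# Path and column names are placeholders; the schema is applied on read
events = spark.read.parquet("s3://example-data-lake/raw/clickstream/")

daily_counts = (
    events
    .filter(F.col("event_type") == "page_view")
    .groupBy(F.to_date("event_time").alias("day"))
    .count()
)

daily_counts.show()
```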
3. The Data Lakehouse
Definition
A Data Lakehouse is a hybrid architecture that combines the flexibility of a
Data Lake with the structured querying and performance of a Data
Warehouse. It enables schema enforcement, ACID transactions, and real-time
data processing within a Data Lake.
Characteristics
• Schema enforcement: Unlike Data Lakes, it supports structured tables.
• Supports ACID transactions: Ensures data consistency.
• Optimized for analytics: Provides SQL-based querying like a Data
Warehouse.
• Cost-effective storage: Uses cheap object storage like AWS S3 or Google
Cloud Storage.
• Unified data processing: Works with both BI tools and ML frameworks.
Architecture
1. Data Ingestion → Raw data from IoT, logs, databases.
2. Delta Lake / Apache Iceberg / Hudi → Adds transactional support.
3. Unified Query Layer → Apache Spark, Trino, Dremio for querying.
4. BI & ML Integration → Supports dashboards and AI workloads.
Examples of Data Lakehouse Technologies
• Databricks Delta Lake (AWS, Azure)
• Apache Iceberg (Open-source)
• Apache Hudi (Real-time updates)
• Google BigLake (Google Cloud)
Use Cases
• Real-time Analytics: Streaming data insights.
• Data Science & ML: Unified storage for ML models and raw data.
• Enterprise Data Management: Single platform for structured &
unstructured data.
Example
A telecom company uses Delta Lake (Databricks) to store call logs and
customer data. Analysts use SQL queries for churn prediction while machine
learning models train on raw data.
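A sketch of the lakehouse pattern, assuming a Spark session configured with the Delta Lake extensions (the delta-spark package) and a hypothetical storage path; writing in Delta format layers ACID transactions over object storage, and the same table then serves both SQL analytics and ML workloads:

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake extensions (delta-spark) are available to Spark
spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

calls = spark.createDataFrame(
    [("c1", 120, "2024-01-01"), ("c2", 45, "2024-01-01")],
    ["customer_id", "duration_sec", "call_date"],
)

# Writing in Delta format adds transactional guarantees on top of object storage
calls.write.format("delta").mode("append").save("s3://example-bucket/lake/call_logs")

# The same table can be read back for BI queries or ML feature pipelines
spark.read.format("delta").load("s3://example-bucket/lake/call_logs") \
    .groupBy("customer_id").sum("duration_sec").show()
```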
4. Data Platforms
Definition
A Data Platform is an end-to-end system that integrates data storage,
processing, and analytics in a single environment. It provides a unified
interface for data ingestion, transformation, governance, security, and
visualization.
Characteristics
• Multi-cloud & hybrid support: Works across AWS, Azure, Google Cloud.
• ETL/ELT Integration: Supports Apache Spark, dbt, Airflow.
• Data Governance & Security: Manages access control, compliance.
• AI & ML Capabilities: Supports TensorFlow, PyTorch, AutoML.
• Self-service analytics: Users can query data with SQL, Python, or drag-
and-drop tools.
Examples of Data Platforms
• Google Cloud BigQuery + Looker
• Databricks Unified Data Analytics
• Snowflake Data Cloud
• AWS Data Platform (S3, Redshift, Glue, Athena)
• Microsoft Azure Synapse Analytics
Use Cases
• Enterprise Data Management: Data-driven decision-making.
• Customer 360 Platforms: Unified view of customer interactions.
• Predictive Analytics: AI-powered forecasting models.
Example
An e-commerce company integrates Google Cloud BigQuery with Looker to
analyze customer purchase patterns and recommend products in real-time.

Comparison of Data Architectures

Feature | Data Warehouse | Data Lake | Data Lakehouse | Data Platform
Data Type | Structured | All (raw, semi-structured, unstructured) | Structured + unstructured | All data types
Schema | Schema-on-write | Schema-on-read | Schema enforcement | Schema-on-read & write
Cost | Expensive | Cost-effective | Moderate | Varies
Performance | High for SQL | Slow for queries | Optimized for analytics | High performance
ACID Transactions | Yes | No | Yes | Yes
Machine Learning | Limited | Excellent | Excellent | Excellent
Use Cases | BI & reporting | Big data & AI | Hybrid workloads | End-to-end data management

D. Data Ingestion
The Ingestion Stage in the Data Engineering Lifecycle is
the process of collecting raw data from various sources and bringing it into
storage systems like Data Warehouses, Data Lakes, or Data Lakehouses.
Efficient data ingestion ensures seamless downstream processing, analytics,
and machine learning.
This section discusses key concepts in data ingestion, including:
• Bounded vs. Unbounded Data
• Frequency of Data Ingestion
• Synchronous vs. Asynchronous Processing
• Serialization & Deserialization
• Throughput & Elastic Scalability
• Reliability & Durability
• Payload Considerations
• Push vs. Pull vs. Poll Patterns

1. Bounded vs. Unbounded Data


Data ingestion processes handle bounded and unbounded data based on
the nature of the data source.
Bounded Data
• Definition: A finite dataset that has a known beginning and end.
• Examples:
o Historical batch data (e.g., CSV, JSON, Parquet files uploaded at
once).
o Database snapshots or periodic backups.
o Logs collected in daily/hourly batches.
• Processing Model: Uses batch processing with tools like Apache Spark,
AWS Glue, or Google Dataflow.
• Use Case:
o A company imports a monthly customer transaction report into a
data warehouse for analysis.
Unbounded Data
• Definition: A continuous stream of data that has no predefined end.
• Examples:
o Streaming event logs from a web application.
o Sensor data from IoT devices.
o Live social media feeds (Twitter, Facebook).
• Processing Model: Uses real-time stream processing with Kafka, Apache
Flink, or Apache Pulsar.
• Use Case:
o A stock trading platform processes live market data streams to
detect anomalies in real-time.

Feature | Bounded Data (Batch) | Unbounded Data (Streaming)
Data Size | Finite | Infinite
Processing | Batch (ETL) | Continuous (Streaming)
Examples | Logs, snapshots, reports | IoT data, web clickstream, market data
Tools | Apache Spark, AWS Glue | Apache Flink, Kafka Streams

2. Frequency of Data Ingestion


The frequency of data ingestion depends on business requirements and
system capabilities.
Batch Ingestion
• Data is ingested at scheduled intervals (e.g., hourly, daily, weekly).
• Example:
o A retail store uploads daily sales reports at midnight into a Data
Warehouse.
• Tools: Apache Airflow, AWS Glue, Azure Data Factory.
Micro-Batch Ingestion
• A hybrid approach where small batches are ingested at frequent
intervals (e.g., every 5 minutes).
• Example:
o A fraud detection system ingests financial transactions every 10
minutes for analysis.
• Tools: Apache Spark Structured Streaming, Snowflake Streams.
Real-time (Streaming) Ingestion
• Data is ingested continuously as it arrives.
• Example:
o A bank streams customer transactions to detect fraud in real-
time.
• Tools: Apache Kafka, Apache Pulsar, AWS Kinesis.
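A minimal sketch of continuous (streaming) ingestion, assuming the kafka-python package, a broker at localhost:9092, and a hypothetical transactions topic:

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "transactions",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Messages are processed continuously as they arrive
for message in consumer:
    event = message.value
    print(f"partition={message.partition} offset={message.offset} event={event}")
```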
3. Synchronous vs. Asynchronous Processing
The ingestion process can be designed to be synchronous or asynchronous,
depending on latency and performance needs.
Synchronous Processing
• Definition: The system waits for a response before proceeding to the
next step.
• Characteristics:
o Provides immediate feedback, but the pipeline stalls if responses are slow.
o Used in critical applications (e.g., financial transactions).
• Example:
o A banking system synchronously validates a credit card
transaction before approving it.
• Tools: REST APIs, gRPC, PostgreSQL triggers.
Asynchronous Processing
• Definition: The system does not wait for a response before continuing.
• Characteristics:
o Higher throughput and better scalability.
o Used in big data pipelines and background processing.
• Example:
o A social media platform asynchronously ingests and analyzes user
activity logs for recommendations.
• Tools: Apache Kafka, RabbitMQ, AWS SQS.
4. Serialization & Deserialization
Data is serialized during ingestion and deserialized when consumed.
Serialization
• Definition: Converts structured data into a format that can be
transmitted or stored efficiently.
• Formats:
o JSON (JavaScript Object Notation) – Human-readable but less
efficient.
o Avro – Binary format, used in Hadoop and Kafka.
o Parquet – Columnar storage, optimized for analytics.
Deserialization
• Definition: Converts serialized data back into its original structure.
• Example:
o A Kafka consumer deserializes Avro messages before storing them
in a Data Lake.
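A short sketch of binary serialization and deserialization with Avro, assuming the fastavro package and a hypothetical Transaction record schema:

```python
import io
from fastavro import parse_schema, reader, writer  # pip install fastavro

schema = parse_schema({
    "type": "record",
    "name": "Transaction",
    "fields": [
        {"name": "txn_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

records = [{"txn_id": "t-1", "amount": 19.99}, {"txn_id": "t-2", "amount": 5.00}]

# Serialization: write records into a compact Avro byte stream
buf = io.BytesIO()
writer(buf, schema, records)

# Deserialization: a consumer decodes the bytes back into records
buf.seek(0)
for record in reader(buf):
    print(record)
```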
5. Throughput & Elastic Scalability
Throughput
• Definition: The amount of data ingested per second.
• Measured in: Events per second (EPS), MB/s, GB/s.
• Example:
o A video streaming platform ingests 1 GB/sec of user playback
data.
Elastic Scalability
• Definition: The ability of a system to dynamically scale resources based
on demand.
• Example:
o AWS Kinesis automatically scales ingestion pipelines during peak
traffic.
• Tools: Apache Flink, Google Pub/Sub, AWS Lambda.
6. Reliability & Durability
Reliability
• Definition: Ensures data is ingested without loss or corruption.
• Example:
o Kafka can provide exactly-once semantics by combining idempotent
producers with transactions.
Durability
• Definition: Ensures persisted data is not lost even after failures.
• Example:
o Amazon S3 provides 11 nines (99.999999999%) durability for
stored data.
7. Payload Considerations
Definition:
A payload is the actual data being transmitted. Payload size impacts
ingestion speed and costs.
Optimized Payload Strategies
• Compression (Gzip, Snappy) reduces storage costs.
• Partitioning (HDFS, Parquet) improves read performance.
• Batching (Kafka, Kinesis) increases throughput.

8. Push vs. Pull vs. Poll Patterns


Data ingestion follows three primary patterns:
Push Pattern
• Definition: The source system pushes data to the destination in real-
time.
• Example:
o IoT devices push sensor readings to AWS IoT Core.
• Tools: Webhooks, Kafka Producers, MQTT.
Pull Pattern
• Definition: The destination requests data from the source periodically.
• Example:
o A BI dashboard queries a database every hour to fetch new data.
• Tools: REST APIs, GraphQL.
Poll Pattern
• Definition: The destination system continuously checks if new data is
available.
• Example:
o A mobile app polls a server every 10 seconds for updates.
• Tools: AWS SQS, Apache Flink.
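A simplified sketch of the poll pattern, assuming the requests package and a hypothetical HTTP endpoint that returns a JSON list of updates:

```python
import time
import requests  # pip install requests

ENDPOINT = "https://api.example.com/updates"   # hypothetical endpoint
POLL_INTERVAL_SECONDS = 10

last_seen_id = None
while True:
    response = requests.get(ENDPOINT, params={"since": last_seen_id}, timeout=5)
    response.raise_for_status()
    updates = response.json()           # assumes the endpoint returns a JSON list
    for update in updates:
        print("ingested:", update)
        last_seen_id = update.get("id")
    time.sleep(POLL_INTERVAL_SECONDS)   # wait before checking again
```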
E. Batch Ingestion Considerations
Batch ingestion is a crucial aspect of data engineering, enabling the
movement of large volumes of data from source systems to data
warehouses, lakes, or other analytical stores.
1. Snapshot vs. Differential Extraction
Snapshot Extraction
• In this approach, the entire dataset is extracted from the source and
loaded into the target system.
• Useful when:
o The dataset is relatively small.
o Changes in data cannot be tracked.
o A complete refresh is acceptable.
Example:
A retail company extracts all sales transaction records from an OLTP database
daily and reloads them into a data warehouse. Since the dataset is relatively
small, a full snapshot ingestion is feasible.
Challenges:
• Inefficient for large datasets, leading to performance issues.
• Higher storage and processing costs.
Differential (Incremental) Extraction
• Only extracts the data that has changed since the last extraction.
• Commonly implemented using timestamps, change data capture (CDC),
or versioning techniques.
Example:
A bank processes millions of transactions daily. Instead of extracting the entire
transaction table, the system fetches only the transactions from the last 24
hours based on the updated_at timestamp.
Advantages:
• Reduces data transfer and processing time.
• More efficient use of storage and compute resources.
Challenges:
• Requires mechanisms to track changes in source data.
• Potential complexity in maintaining historical data.
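A self-contained sketch of timestamp-based differential extraction, using an in-memory SQLite table as a stand-in for the source system; only rows changed since the last recorded watermark are extracted:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T09:00:00"),
     (2, 25.5, "2024-01-02T14:30:00"),
     (3, 7.25, "2024-01-03T08:15:00")],
)

# Watermark recorded by the previous ingestion run
last_watermark = "2024-01-02T00:00:00"

# Differential extraction: fetch only rows modified after the watermark
rows = conn.execute(
    "SELECT id, amount, updated_at FROM transactions WHERE updated_at > ?",
    (last_watermark,),
).fetchall()

print(rows)                               # rows 2 and 3 only
new_watermark = max(r[2] for r in rows)   # persist this for the next run
```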

2. File-Based Export and Ingestion


File-based ingestion involves extracting data from source systems into
structured files (e.g., CSV, JSON, Parquet) and loading them into a target
system.
Common File Formats:
• CSV: Simple, but lacks schema enforcement.
• JSON: Semi-structured, useful for NoSQL databases.
• Parquet/Avro: Columnar storage, optimized for analytical workloads.
Example:
A healthcare system exports patient appointment data as Parquet files from an
electronic medical records system and loads it into a data lake for analysis.
Batch Processing Tools for File-Based Ingestion
• Apache NiFi for orchestrating file movement.
• AWS S3, Azure Blob, Google Cloud Storage for storing and retrieving
files.
• Apache Spark, Databricks for processing large-scale data files.
Challenges:
• Ensuring schema evolution and consistency.
• Handling failures during file transfers.
• Managing large files efficiently (e.g., splitting files into smaller chunks for
parallel processing).

3. ETL vs. ELT


ETL (Extract, Transform, Load)
• Transformation happens before loading into the target system.
• Useful for:
o Traditional data warehouses (e.g., Oracle, Teradata).
o Enforcing strict data governance and quality.
Example:
A telecom company extracts customer call records from multiple sources,
processes them using Apache Talend, applies transformations (e.g., filtering
invalid records), and loads the cleansed data into an Oracle Data Warehouse.
Challenges:
• Can be slower for large datasets.
• Requires dedicated processing infrastructure.
ELT (Extract, Load, Transform)
• Data is extracted and loaded first, then transformed within the target
system.
• Works best with modern cloud-based data warehouses (e.g., Snowflake,
BigQuery, Redshift).
Example:
A fintech company ingests raw transaction logs into Snowflake, then uses SQL
queries to clean, aggregate, and analyze the data.
Advantages:
• More scalable for big data processing.
• Leverages cloud-native compute power.
Challenges:
• Requires a powerful target system to handle transformations.
• Governance and data quality checks must be implemented carefully.

4. Data Migration Considerations


Data migration involves transferring data from one system to another, often
during system upgrades, cloud adoption, or mergers.
Types of Data Migration
1. Storage Migration: Moving data between storage systems.
2. Database Migration: Transferring data from one database to another.
3. Cloud Migration: Moving on-premises data to cloud platforms.
4. Application Migration: Shifting entire applications along with data.
Example:
A financial institution migrates customer records from an on-premises Oracle
database to Amazon Redshift using AWS Database Migration Service (DMS).
Best Practices for Data Migration
• Data Validation: Ensure data integrity before and after migration.
• Incremental Migration: Move data in batches to minimize downtime.
• Backup Strategy: Maintain backups to recover from failures.
Challenges:
• Data format differences between source and destination.
• Handling large-scale migrations without affecting business operations.

E. Batch Ingestion Considerations


Batch ingestion is a common data engineering pattern where large volumes of
data are collected, processed, and loaded at scheduled intervals.
1. Snapshot vs. Differential Extraction
When ingesting data in batches, organizations must decide whether to extract
full snapshots or only changes since the last ingestion.
Snapshot Extraction
• Captures the entire dataset at each ingestion cycle.
• Suitable when:
o The dataset is small or moderate in size.
o There is no reliable way to track incremental changes.
o A complete refresh is required for consistency.
• Challenges:
o High storage and compute costs.
o Longer processing times.
Differential (Incremental) Extraction
• Extracts only data that has changed since the last ingestion.
• Typically implemented using:
o Timestamps (e.g., last_updated field).
o Change Data Capture (CDC) mechanisms.
o Versioning with audit logs.
• Suitable when:
o The dataset is large, and full extraction is impractical.
o Data consistency can be maintained with deltas.
• Challenges:
o Requires reliable tracking mechanisms.
o Potential issues with missing or late-arriving updates.

2. File-Based Export and Ingestion


Batch ingestion often relies on file-based methods for transporting data.
File Formats
• CSV: Simple but lacks schema enforcement.
• JSON: Flexible but can be inefficient for large-scale analytics.
• Parquet/ORC: Optimized for analytics, supports schema evolution and
columnar storage.
• Avro: Good for streaming and batch processing, supports schema
evolution.
Storage and Transfer Considerations
• On-Premises vs. Cloud: Files may be stored in local systems (e.g., NAS,
HDFS) or cloud storage (e.g., S3, Azure Blob).
• Batch Scheduling: Tools like Apache Airflow, AWS Glue, or Azure Data
Factory orchestrate file-based ingestion.
• Error Handling: Logging, retry mechanisms, and validation checks should
be in place.

3. ETL vs. ELT


The choice between Extract, Transform, Load (ETL) and Extract, Load,
Transform (ELT) depends on system architecture and processing needs.
ETL (Extract, Transform, Load)
• Data is transformed before loading into the target system.
• Traditional approach used in data warehouses (e.g., Oracle, Teradata).
• Pros:
o Reduces processing burden on target systems.
o Ensures data consistency and validation before ingestion.
• Cons:
o Transformation adds latency.
o Requires dedicated ETL infrastructure (e.g., Informatica, Talend).
ELT (Extract, Load, Transform)
• Raw data is first loaded into a data lake or warehouse, and
transformations occur afterwards.
• Common in modern cloud-based architectures (e.g., Snowflake,
BigQuery).
• Pros:
o Scales well with large datasets.
o Leverages cloud-based processing (e.g., Spark, dbt).
• Cons:
o Data quality issues may arise if transformation is not properly
managed.
o Requires powerful storage and compute resources.

4. Data Migration Considerations


When migrating data from one system to another, batch ingestion plays a
crucial role.
Migration Strategies
• Lift-and-Shift: Directly moving data as-is.
• Schema Transformation: Mapping data structures to the target format.
• Historical Data Loading: Migrating old data while ensuring consistency.
• Validation & Reconciliation: Ensuring no data loss or corruption during
transfer.
Challenges in Data Migration
• Downtime: Large migrations can impact system availability.
• Data Integrity: Ensuring no duplicates, missing records, or corrupted
data.
• Performance Optimization: Efficient extraction and loading techniques
to minimize processing time.
F. Message and Stream Ingestion Considerations
Streaming data ingestion differs from batch ingestion as it deals with real-time
or near-real-time data.
1. Schema Evolution
When dealing with real-time message streams, the data schema can change
over time. If not managed properly, schema changes can lead to ingestion
failures, data inconsistencies, or even pipeline breakdowns.
Challenges in Schema Evolution
• Backward and Forward Compatibility: Ensuring new schema changes do
not break existing consumers.
• Schema Registry Management: Keeping track of schema versions (e.g.,
using Apache Avro with Confluent Schema Registry).
• Handling Null Values: When a new field is added, older messages won’t
have that field, so handling defaults is important.
• Enforcing Data Contracts: Establishing clear rules on how schema
changes can be introduced.
Solutions
• Using Schema Registries: Tools like Apache Avro, Protocol Buffers
(ProtoBuf), or JSON Schema help manage schema changes.
• Schema Evolution Policies:
o Additive changes (e.g., new fields) should have default values.
o Deprecating fields instead of removing them.
o Using versioned APIs for stream processing.
• Schema Validation in Pipelines: Implement checks before publishing or
consuming data.
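A sketch of an additive schema change handled with a default value, assuming the fastavro package and a hypothetical Event record; a message written with the old schema remains readable under the new schema because the added field has a default:

```python
import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

old_schema = parse_schema({
    "type": "record", "name": "Event",
    "fields": [{"name": "user_id", "type": "string"}],
})

# The new version adds a field with a default, keeping backward compatibility
new_schema = parse_schema({
    "type": "record", "name": "Event",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
})

# A producer wrote this message with the old schema
buf = io.BytesIO()
schemaless_writer(buf, old_schema, {"user_id": "u-42"})

# A consumer reads it with the new schema; the missing field gets its default
buf.seek(0)
print(schemaless_reader(buf, old_schema, new_schema))
# {'user_id': 'u-42', 'country': 'unknown'}
```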

2. Late-Arriving Data
Streaming pipelines must handle data that arrives out of order or is delayed
due to network issues, retries, or system failures.
Causes of Late-Arriving Data
• Network latency.
• Device or sensor buffering delays.
• Message queue backlogs.
• Time zone inconsistencies.
• Source system failures and retries.
Handling Late Data
• Event Time vs. Processing Time:
o Event Time: When the event actually happened.
o Processing Time: When the event is processed by the system.
• Windowing Strategies:
o Fixed Windows: A specific duration (e.g., 10 minutes).
o Sliding Windows: Overlapping time intervals.
o Session Windows: Grouping events based on user activity.
• Watermarking: Defines a threshold where late events are accepted up to
a certain point but discarded afterward (e.g., Apache Flink, Apache
Beam).
• Reprocessing Mechanisms: Some frameworks (Kafka, Flink) allow
reprocessing of past data when necessary.
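A simplified, framework-free sketch of watermarking: the pipeline tracks the furthest event time seen so far, and events arriving more than the allowed lateness behind it are rejected (or routed elsewhere for reprocessing):

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=5)
max_event_time_seen = datetime.min

def accept(event_time: datetime) -> bool:
    """Return True if the event is within the watermark, False if it is too late."""
    global max_event_time_seen
    max_event_time_seen = max(max_event_time_seen, event_time)
    watermark = max_event_time_seen - ALLOWED_LATENESS
    return event_time >= watermark

print(accept(datetime(2024, 1, 1, 12, 0)))   # True: advances the watermark
print(accept(datetime(2024, 1, 1, 12, 10)))  # True: advances it further
print(accept(datetime(2024, 1, 1, 12, 2)))   # False: more than 5 minutes late
```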

3. Ordering and Multiple Delivery Considerations


Streaming systems often deal with out-of-order messages and multiple
deliveries, which can cause inconsistencies in downstream processing.
Ordering Guarantees
• At-Least-Once Delivery: Messages may be duplicated but never lost
(e.g., Kafka, RabbitMQ).
• At-Most-Once Delivery: No duplicates, but messages may be lost (useful
for low-latency use cases).
• Exactly-Once Delivery: Guarantees no duplicates and no loss (more
complex, requires idempotency).
• Partitioning for Order Preservation: Kafka preserves order within
partitions, so ensuring correct partitioning helps maintain event order.
Handling Duplicate Messages
• Deduplication Techniques:
o Assigning unique message IDs (UUIDs) and checking before
processing.
o Using idempotent operations in downstream systems.
o Storing processed message IDs in a stateful store (e.g., Redis,
RocksDB).
Reordering Techniques
• Event Timestamps and Buffering: Sort messages before processing.
• Sequence Numbers: Ensure messages are processed in order.
• Checkpointing in Stream Processing Engines: Use tools like Flink, Spark
Streaming to recover state in case of failure.
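A simplified sketch of deduplication for at-least-once delivery: each message carries a unique ID, and IDs that were already processed are skipped. A production system would keep this state in a durable store such as Redis or RocksDB.

```python
processed_ids = set()   # stand-in for a durable, stateful store

def handle(message: dict) -> None:
    msg_id = message["id"]
    if msg_id in processed_ids:
        return                      # duplicate delivery: ignore it
    processed_ids.add(msg_id)
    print("processing", message["payload"])

# The same message delivered twice is only processed once
handle({"id": "m-1", "payload": "order created"})
handle({"id": "m-1", "payload": "order created"})
handle({"id": "m-2", "payload": "order shipped"})
```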

4. Error Handling and Dead-Letter Queues (DLQs)


Ingesting data streams in real time means errors will occur—malformed
messages, system failures, and processing errors must be handled properly.
Common Errors in Streaming Pipelines
• Malformed Data: Corrupt or non-parseable JSON, Avro, or CSV files.
• Schema Violations: Incoming messages do not match the expected
schema.
• Network Failures: Connection issues with message brokers or
consumers.
• Processing Failures: Bugs in stream processing logic causing crashes.
Dead-Letter Queues (DLQs)
A Dead-Letter Queue (DLQ) is a mechanism to store failed or unprocessable
messages for later review.
How DLQs Work
• When a message cannot be processed after multiple retries, it is moved
to a dedicated queue.
• Operators can analyze and reprocess failed messages separately.
• Popular brokers like Kafka, AWS SQS, and RabbitMQ have built-in DLQ
support.
Strategies for Handling Failed Messages
1. Retry Mechanisms:
o Immediate retries for transient failures.
o Exponential backoff strategies (gradually increasing retry time).
2. Logging and Alerting:
o Alerting systems to detect spikes in DLQ messages.
o Logging error details for debugging.
3. Reprocessing from DLQ:
o Manual intervention for critical failures.
o Automated replaying of DLQ messages when the issue is resolved.
4. Filtering Out Poison Messages:
o Messages that repeatedly fail should be moved to a separate
quarantine queue.
o Prevents pipeline slowdowns due to persistent failures.
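A simplified sketch of retries with exponential backoff and a dead-letter queue, with a plain Python list standing in for the broker-managed DLQ:

```python
import time

MAX_RETRIES = 3
dead_letter_queue = []            # stand-in for a broker-managed DLQ

def process(message: dict) -> None:
    if message.get("malformed"):
        raise ValueError("cannot parse message")
    print("processed", message["id"])

def ingest(message: dict) -> None:
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            process(message)
            return
        except Exception as exc:
            last_error = exc
            time.sleep(0.5 * 2 ** attempt)   # exponential backoff between retries
    # All retries exhausted: park the message for later inspection
    dead_letter_queue.append({"message": message, "error": str(last_error)})

ingest({"id": "m-1"})
ingest({"id": "m-2", "malformed": True})
print("DLQ contents:", dead_letter_queue)
```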

G. Ways to Ingest Data


Data ingestion is the process of collecting, processing, and loading data into a
storage system or data warehouse for further analysis.
1. Direct Database Connection
A direct database connection involves querying a database directly using APIs,
ODBC, JDBC, or SQL clients to extract data for ingestion.
How It Works
• A script or ETL tool connects to the source database.
• Runs SQL queries (full extracts or incremental pulls).
• Data is transferred via network to the destination system (data lake,
warehouse, or another database).
Pros
✅ Simple to implement for small-to-medium datasets.
✅ Works well when data access is structured and controlled.
✅ No need for additional components like message queues.
Cons
❌ High query load can impact database performance.
❌ Requires handling schema changes manually.
❌ Not suitable for real-time ingestion.
Use Cases
• Pulling reports from transactional databases.
• Extracting daily data from operational systems.

2. Change Data Capture (CDC)


Change Data Capture (CDC) is a technique to track and capture changes
(INSERT, UPDATE, DELETE) in a database and replicate them elsewhere.
How It Works
• Uses transaction logs, triggers, or timestamp-based tracking.
• Captures only incremental changes instead of full table extracts.
• Sends changes to a target system like a data warehouse or message
queue.
Types of CDC Methods
• Log-Based CDC: Reads transaction logs (e.g., Debezium, Oracle
GoldenGate).
• Trigger-Based CDC: Uses database triggers to record changes.
• Timestamp-Based CDC: Extracts records with a last modified timestamp.
Pros
✅ Efficient, as it processes only changes.
✅ Minimizes load on source systems.
✅ Enables real-time or near-real-time replication.
Cons
❌ Requires access to database transaction logs.
❌ Can be complex to set up and maintain.
❌ Some databases may not support log-based CDC.
Use Cases
• Real-time data replication (e.g., syncing PostgreSQL to a data
warehouse).
• Keeping data lakes updated with operational data.

3. Message Queues and Event-Streaming Platforms


Message queues and event-streaming platforms provide asynchronous and
scalable ways to ingest high-volume, real-time data.
How It Works
• Data is pushed into a message queue (Kafka, RabbitMQ, AWS SQS).
• Consumers (ETL processes, streaming jobs) read and process messages.
• Messages can be processed in real time or stored for batch processing.
Pros
✅ High scalability for real-time data ingestion.
✅ Supports decoupling producers and consumers.
✅ Ensures reliability with features like retries and dead-letter queues (DLQs).
Cons
❌ Requires additional infrastructure and monitoring.
❌ Ordering and duplicate handling can be complex.
❌ Real-time processing increases operational complexity.
Use Cases
• Real-time event-driven architectures.
• Streaming logs, sensor data, or IoT telemetry.

4. Managed Data Connectors


Managed data connectors are pre-built integrations that simplify data ingestion
from various sources to a destination system.
How It Works
• Cloud providers (AWS Glue, Google Dataflow) or third-party tools
(Fivetran, Stitch) provide pre-configured data ingestion pipelines.
• Data is automatically extracted, transformed, and loaded into a data
warehouse or data lake.
Pros
✅ Reduces engineering effort with pre-built connectors.
✅ Handles schema changes and incremental ingestion.
✅ Cloud-native and scalable.
Cons
❌ Vendor lock-in and potential high costs.
❌ Limited customization and control.
❌ Might not support all data sources.
Use Cases
• Quick data ingestion from SaaS apps (Salesforce, Google Analytics).
• Automating ETL/ELT workflows for non-engineering teams.

5. Databases and File Export


Some systems generate periodic data exports in structured formats (CSV, JSON,
Avro, Parquet) which are then ingested into a target system.
How It Works
• The source system exports files on a scheduled basis.
• The files are uploaded to cloud storage (S3, GCS, Azure Blob).
• ETL pipelines pick up and process the files.
Pros
✅ Simple and widely used method for batch ingestion.
✅ Works well with legacy systems.
✅ Suitable for large-scale historical data transfers.
Cons
❌ Not ideal for real-time ingestion.
❌ Managing file versioning and schema evolution can be tricky.
❌ Large CSV/JSON files can slow down processing.
Use Cases
• Batch ingestion for analytics dashboards.
• Periodic sync of operational databases to data lakes.
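A sketch of the export-and-upload flow, assuming pandas with a Parquet engine (such as pyarrow), boto3 with configured credentials, and hypothetical bucket and key names:

```python
import boto3
import pandas as pd

# 1. The source system exports a table as a Parquet file on a schedule
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})
orders.to_parquet("orders_2024-01-01.parquet", index=False)

# 2. The file is uploaded to cloud storage for downstream ETL pipelines to pick up
s3 = boto3.client("s3")
s3.upload_file(
    Filename="orders_2024-01-01.parquet",
    Bucket="example-data-lake",
    Key="exports/orders/orders_2024-01-01.parquet",
)
```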

6. Practical Issues with Common File Formats


Different file formats impact the efficiency of ingestion.

File Format | Pros | Cons
CSV | Simple, human-readable | No schema, inefficient for large data
JSON | Flexible, widely used in APIs | Larger size, no native schema enforcement
Avro | Schema evolution support, efficient for row-based processing | Less human-readable
Parquet | Optimized for analytics, columnar format | Not ideal for row-based queries
ORC | Similar to Parquet, good for big data | Higher storage overhead

Key Considerations
• Use Parquet/ORC for analytics workloads.
• Use Avro for schema evolution needs.
• Avoid CSV/JSON for large-scale data ingestion unless required.

7. Transfer Appliances for Data Migration


For large-scale data migration (petabytes of data), physical transfer appliances
provide an efficient solution.
How It Works
• Data is copied to a physical device (AWS Snowball, Google Transfer
Appliance).
• The device is shipped to the cloud provider’s data center.
• Data is uploaded to cloud storage for further processing.
Pros
✅ Faster than internet-based transfers for large datasets.
✅ Reliable for migrating petabytes of historical data.
✅ Secure with encryption and tamper-proof hardware.
Cons
❌ Not real-time, involves logistics delays.
❌ Requires planning and coordination.
❌ Not suitable for continuous ingestion.
Use Cases
• Migrating on-premises databases to the cloud.
• Large-scale backup restoration.
