DELTA COMPRESSION

Complete Technical Reference with Xdelta Comparison


Architecture · Algorithms · Syntax · Advantages · Applications · Deep Comparison

1. What Is Delta Compression?

Delta compression (also called differential compression or binary diffing) is a
technique that stores or transmits only the differences — the "delta" — between two
versions of data, rather than storing or sending each version in full. The result is a
compact patch file that, when applied to the original (source), reproduces the updated
(target) exactly.

The fundamental insight: if two versions of a file share 95% of their content, why store
or transmit 200% of the data? Store 100% + 5% instead.

📌 Core Principle: Delta = Target − Source | Target = Source + Delta

1.1 Delta vs. Regular Compression


Aspect           Regular Compression            Delta Compression
Input            Single file                    Two files: source + target
Exploits         Repetition within one file     Similarity between two files
Output           Compressed file (standalone)   Patch file (requires source to use)
Decode requires  Just the compressed file       Source file + patch file
Best for         Shrinking any single file      Distributing updated versions of files
Example tools    gzip, zstd, bzip2, lzma        xdelta3, bsdiff, rdiff, fossil

1.2 Historical Context


Delta compression has its roots in source code version control. Early VCS tools like
SCCS (1972) and RCS (1982) stored file revisions as deltas to save disk space. The
idea expanded to binary files with tools like the rsync algorithm (1996), bsdiff (2003),
and xdelta (1997–present). Today it underpins OS updates, game patches, container
registries, and cloud storage sync.

2. How Delta Compression Works — Deep Internals

2.1 The Three Core Operations


Every delta encoding algorithm reduces to three fundamental operations. The patch file
is a sequence of these instructions:

Operation     Instruction        Example               Meaning
COPY          COPY(offset, len)  COPY(1024, 256)       Reuse 256 bytes from source at offset 1024
INSERT / ADD  ADD(data[])        ADD([0x48,0x65,...])  Insert these raw bytes (new content not in source)
DELETE        skip / omit        (implicit)            Bytes in source not referenced are simply skipped
RUN           RUN(byte, len)     RUN(0x00, 512)        Output 512 repeated zero bytes (sparse-file encoding)

2.2 Encoding Pipeline — Step by Step


Step 1: Chunking / Blocking
The source file is divided into blocks. Block size varies by algorithm: rolling-hash
algorithms use variable-length blocks (content-defined chunking), while simpler
algorithms use fixed-size blocks. Smaller blocks increase the chance of finding
matches but create more overhead per match.

Source (1000 bytes): [Block0: 0-63][Block1: 64-127][Block2: 128-191]...
Target (1020 bytes): [Block0: same][Block1: CHANGED][Block2: same][NEW: 20 bytes]

Step 2: Fingerprinting / Hashing
Each source block gets a fast fingerprint (hash). The fingerprint table allows O(1)
lookup: given a block of target data, instantly check if it exists in the source. The key
choice here is the hash function:
• Rolling hash (Rabin-Karp, Adler-32): can slide one byte at a time — ideal for
variable-length chunk scanning
• Cryptographic hash (MD4, SHA1): collision-resistant but slower — used in
rsync-style algorithms where trust matters
• Suffix arrays (bsdiff): finds all matches across the entire source simultaneously
— most powerful but most memory-intensive
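A minimal sketch of the fingerprint table, using the standard library's Adler-32 as the fast weak hash (block size and names are illustrative). Real tools confirm weak-hash candidates with a strong hash before trusting a match, since weak hashes can collide.

```python
import zlib

# Fixed-size blocks fingerprinted with Adler-32, indexed for O(1) lookup.
BLOCK = 16
source = b"The quick brown fox jumps over the lazy dog. " * 4

table = {}
for offset in range(0, len(source), BLOCK):
    block = source[offset:offset + BLOCK]
    table.setdefault(zlib.adler32(block), []).append(offset)

# A target block identical to a source block is found instantly:
probe = source[32:48]
hits = table.get(zlib.adler32(probe), [])
print(hits)  # offsets of candidate matches in the source
```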

Step 3: Match Finding


The target is scanned byte-by-byte (or chunk-by-chunk). For each position, the hash is
looked up in the source table:
FOR each position P in target:
    hash = rolling_hash(target[P .. P+block_size])
    IF hash in source_table:
        match = source_table[hash]
        extend match forward and backward (greedy extension)
        emit COPY(match.offset, match.length)
        advance P by match.length
    ELSE:
        collect byte as literal (ADD data)
        P++

Step 4: Instruction Encoding


The stream of COPY/ADD/RUN instructions is serialized into a binary patch format.
Different algorithms use different formats:
• VCDIFF (RFC 3284): standardized, used by xdelta3 and Google open-vcdiff
• bsdiff format: custom binary format with bzip2 compression over 3 streams
• rdiff format: rsync-style librsync format with MD4 signatures
• IPS/UPS/BPS: simple formats for ROM patching with fixed-block matching

Step 5: Secondary Compression


The raw instruction stream often still contains compressible patterns. A secondary
general-purpose compressor (gzip, bzip2, lzma, zstd) is applied over the delta, yielding
additional 20-60% size reduction with no semantic change.
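This stacking effect can be shown with a short Python sketch, using zlib from the standard library as the secondary compressor over a stand-in instruction stream:

```python
import zlib

# The serialized instruction stream (here a stand-in byte string with
# repetitive structure) shrinks further under a general-purpose
# compressor applied on top of the delta.
raw_patch = b"COPY 0 4096\nADD 12 bytes...\n" * 50

packed = zlib.compress(raw_patch, level=9)
print(len(raw_patch), len(packed))  # the packed patch is much smaller

restored = zlib.decompress(packed)
assert restored == raw_patch  # lossless: no semantic change
```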

2.3 Decoding — Reconstruction


Reconstruction is always faster than encoding because no searching is required. The
decoder simply executes each instruction sequentially:

OPEN source file, patch file, output file

FOR each instruction in patch:
    IF instruction == COPY(src_offset, length):
        read length bytes from source[src_offset]
        write to output
    IF instruction == ADD(data, length):
        write data[0..length] to output
    IF instruction == RUN(byte, length):
        write byte repeated length times to output

CLOSE all files
VERIFY output checksum (if embedded in patch)

⚡ Decode Speed: Decoding is typically 10-50× faster than encoding. It is
sequential, cache-friendly, and requires no hash tables — just memcpy operations.

2.4 Key Algorithms in Detail


A. Rolling Hash (Rabin-Karp Style)
Used by: xdelta3, casync, zsync
A rolling hash maintains a hash value for a sliding window of bytes. Removing the
leftmost byte and adding the rightmost byte updates the hash in O(1) time — no need
to rehash the entire window.
// Rabin-Karp rolling hash
hash = initial_hash(window[0..B-1])
for i in range(len(data) - B):
if hash in source_table:
process_match(source_table[hash], i)
// Slide window: O(1) update
hash = (hash * BASE - data[i] * BASE^B + data[i+B]) % MOD
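The same update rule can be checked with a runnable Python sketch; BASE, MOD, and the window size B are illustrative constants, not any tool's actual parameters:

```python
# Sliding the window by one byte with an O(1) update gives the same
# value as rehashing the window from scratch.
BASE, MOD, B = 257, (1 << 61) - 1, 16
data = bytes(range(200)) * 3

def full_hash(window: bytes) -> int:
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h

h = full_hash(data[0:B])
pow_B = pow(BASE, B, MOD)          # precomputed BASE^B mod MOD
for i in range(len(data) - B):
    # O(1) slide: drop data[i], append data[i+B]
    h = (h * BASE - data[i] * pow_B + data[i + B]) % MOD
    assert h == full_hash(data[i + 1:i + 1 + B])
print("rolling hash matches full rehash at every position")
```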

B. Suffix Array / Suffix Sort (bsdiff)


Used by: bsdiff, bspatch
Builds a sorted suffix array of the source — an index of every suffix of the source in
lexicographic order. This allows binary search to find the longest match for any target
position anywhere in the source in O(log N) time. Much more powerful than rolling
hash (finds distant matches, handles rearrangements) but uses O(N) extra memory for
the suffix array.
# Build suffix array once for source
SA = build_suffix_array(source) # O(N log N) or O(N) with SA-IS

# For each target position, binary-search the suffix array


for pos in range(len(target)):
best_match = binary_search(SA, target[pos:])
emit_instructions(best_match)
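A naive but runnable Python sketch of the idea follows. It materializes the sorted suffix list for clarity, which real implementations avoid (they binary-search over suffix indices and use O(N)-ish construction); the example strings are illustrative.

```python
import bisect

# Sort all suffixes of the source, then binary-search to find the
# longest match for a piece of target content.
source = b"the quick brown fox jumps over the lazy dog"
sa = sorted(range(len(source)), key=lambda i: source[i:])
suffixes = [source[i:] for i in sa]   # materialized only for this demo

def longest_match(target: bytes) -> tuple:
    """Return (source_offset, match_length) of the longest prefix of
    target found anywhere in source."""
    lo = bisect.bisect_left(suffixes, target)
    best = (0, 0)
    for idx in (lo - 1, lo):           # the best match neighbours the insertion point
        if 0 <= idx < len(sa):
            off = sa[idx]
            n = 0
            while n < len(target) and off + n < len(source) \
                    and source[off + n] == target[n]:
                n += 1
            if n > best[1]:
                best = (off, n)
    return best

print(longest_match(b"the lazy cat"))  # → (31, 9): matches "the lazy "
```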

C. Block Signature / Rsync Algorithm


Used by: rsync, rdiff, librsync
Designed for network sync where source and target are on different machines. The
receiver sends block signatures (weak rolling hash + strong MD4 hash) to the sender.
The sender identifies which blocks of the new file match the old file's signatures and
sends only the differences.
# Receiver (has old file):
signatures = []
for block in chunk(old_file, BLOCK_SIZE):
    signatures.append((adler32(block), md4(block), block_index))
send(signatures)  # tiny — just hashes

# Sender (has new file):
delta = []
for position in rolling_scan(new_file):
    if weak_hash matches and strong_hash matches:
        delta.append(COPY_INSTRUCTION)
    else:
        delta.append(ADD_DATA)
send(delta)  # much smaller than new file

D. Content-Defined Chunking (CDC)


Used by: casync, restic, Bup, ZFS dedup
Instead of fixed-size blocks, CDC uses the data content itself to determine split points
— wherever the rolling hash value falls below a threshold. This creates variable-length
chunks where identical content always maps to the same chunk regardless of position,
making it robust to insertions and deletions.
# FastCDC / Gear-hash CDC
pos = 0
chunk_start = 0
hash = 0

while pos < len(data):
    hash = (hash << 1) + GEAR[data[pos]]
    if (hash & MASK) == 0 or pos - chunk_start >= MAX_CHUNK:
        emit_chunk(data[chunk_start:pos])
        chunk_start = pos
        hash = 0
    pos += 1
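A runnable Python version of the sketch above. The GEAR table, MASK, and MAX_CHUNK values are illustrative choices, not any tool's exact parameters; the demo shows that most chunks survive an insertion unchanged, which is exactly the robustness fixed-size blocking lacks.

```python
import random

# Gear-hash CDC: cut wherever the low bits of the rolling hash are zero.
random.seed(42)
GEAR = [random.getrandbits(32) for _ in range(256)]
MASK = (1 << 10) - 1          # ~1 KiB average chunk
MAX_CHUNK = 4096              # forced cut to bound chunk size

def chunks(data: bytes) -> list:
    out, start, h = [], 0, 0
    for pos, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h & MASK) == 0 or pos + 1 - start >= MAX_CHUNK:
            out.append(data[start:pos + 1])
            start, h = pos + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(7)
original = bytes(random.getrandbits(8) for _ in range(20000))
edited = original[:5000] + b"INSERTED BYTES" + original[5000:]

a, b = set(chunks(original)), set(chunks(edited))
shared = len(a & b)
print(f"{shared} of {len(a)} original chunks reused after an insertion")
```

Because the cut decision depends only on recent bytes, chunk boundaries resynchronize shortly after the insertion point.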

3. Delta Compression Tools — Syntax & Examples

3.1 diff / patch — Text Delta (GNU)


The oldest and most universal delta tool. Works line-by-line on text files. Not suitable
for binary data but essential for source code, configuration files, and documents.

Basic Syntax
# Create a text patch (unified diff format)
diff -u old_file.txt new_file.txt > changes.patch

# More verbose context diff
diff -c old_file.txt new_file.txt > changes.context

# Recursive directory diff
diff -ruN old_dir/ new_dir/ > dir_changes.patch

# Apply a patch
patch old_file.txt < changes.patch

# Apply patch to a directory
patch -p1 < dir_changes.patch

# Dry run (check without applying)
patch --dry-run -p1 < dir_changes.patch

# Reverse / undo a patch
patch -R old_file.txt < changes.patch

Unified Diff Format Explained


--- old_file.txt 2024-01-01 10:00:00   # source file + timestamp
+++ new_file.txt 2024-01-02 10:00:00   # target file + timestamp
@@ -3,7 +3,8 @@          # hunk header:
                         #   -3,7 = source: start line 3, 7 lines
                         #   +3,8 = target: start line 3, 8 lines
 context line (unchanged)    # lines with space = context
-removed line                # lines with - = deleted from source
+added line                  # lines with + = added in target
+another new line
 more context

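The same format can be produced programmatically from Python's standard library with difflib, which is handy for generating unified diffs without shelling out:

```python
import difflib

# difflib emits the unified-diff format explained above.
old = ["line one\n", "line two\n", "line three\n"]
new = ["line one\n", "line 2 (changed)\n", "line three\n", "line four\n"]

text = "".join(difflib.unified_diff(old, new, fromfile="old_file.txt",
                                    tofile="new_file.txt"))
print(text)
```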
3.2 bsdiff / bspatch — Binary Delta


Produces small patches for binary files using suffix sorting. Excellent compression
ratio but slow encoding and high memory use.

Installation & Syntax


# Install
sudo apt install bsdiff
# Create binary patch
bsdiff old_binary new_binary patch.bsdiff

# Apply binary patch
bspatch old_binary new_binary patch.bsdiff

# Example: patch an executable


bsdiff app_v1.0 app_v2.0 app_v1to2.bsdiff
bspatch app_v1.0 app_restored.bin app_v1to2.bsdiff

# Verify restore is identical


md5sum app_v2.0 app_restored.bin
# Both should match

3.3 rdiff / librsync — Network-Aware Delta


Implements the rsync rolling-checksum algorithm. Useful when source and target are
on different machines — you only need to transfer block signatures, not the source file
itself.
Installation & Syntax
# Install
sudo apt install rdiff

# Step 1: Generate signature of old file (small — just hashes)


rdiff signature old_file.bin old_file.sig

# Step 2: Compute delta from signature + new file
rdiff delta old_file.sig new_file.bin delta.rdiff

# Step 3: Apply delta to reconstruct new file
rdiff patch old_file.bin delta.rdiff new_file_restored.bin

# All-in-one local usage:
rdiff signature old_file.bin sig && rdiff delta sig new_file.bin patch && rdiff patch old_file.bin patch restored.bin

# Custom block size (larger blocks = smaller signature, less precision)
rdiff -b 2048 signature old_file.bin old_file.sig
3.4 rsync — Network Delta Sync
The most widely used delta sync tool. Uses a rolling checksum over the network and
transfers only changed blocks. Not a patch-file tool but a real-time delta transfer
system.

Syntax & Options


# Basic remote sync (delta transfer by default)
rsync -av source/ user@remote:/destination/

# Archive mode + compression + progress


rsync -avz --progress source/ user@remote:/dest/

# Show what would change (dry run)


rsync -avn source/ user@remote:/dest/

# Only show files that differ


rsync -rcnv --delete source/ dest/

# Bandwidth limit (KBps)


rsync --bwlimit=1000 -av src/ user@host:/dst/

# Exclude patterns
rsync -av --exclude='*.tmp' --exclude='.git/' src/ dst/

# Checksum-based comparison (slower but precise)


rsync -avc src/ dst/

# Block size tuning (larger = faster for big files, less precision)
rsync --block-size=4096 -av src/ dst/

3.5 git — Version Control Delta


Git uses delta compression internally in packfiles. When you push, pull, or run git gc,
Git finds similar blobs and stores only their deltas. Understanding git's delta mechanics
helps optimize repository size.

Git Delta Commands


# Show pack statistics (how many deltas exist)
git count-objects -vH

# Repack with aggressive delta compression


git gc --aggressive

# Manual repack with depth and window control


git repack -a -d --depth=250 --window=250
# --depth = max delta chain length (default 50)
# --window = objects considered per delta base (default 10)

# Show delta chain for a specific object


git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype) %(deltabase)'

# See pack file contents


git verify-pack -v .git/objects/pack/*.idx | head -20

# Compare with the size of a plain text diff between two revisions
git diff HEAD~1 HEAD -- path/to/file | wc -c   # size of text diff

3.6 zstd — Dictionary (Reference-Based) Compression


Zstandard supports dictionary compression, where a 'dictionary' (a reference file or
training set) is used as a shared reference during compression. This is effectively
dictionary-based delta compression — if the file is similar to the dictionary, it
compresses much smaller.

Syntax
# Train a dictionary from sample files
zstd --train samples/*.bin -o samples.dict

# Compress using dictionary (much smaller if similar to training data)
zstd -D samples.dict new_file.bin -o new_file.bin.zst

# Decompress with dictionary
zstd -D samples.dict -d new_file.bin.zst -o new_file_restored.bin

# Compare: with vs without dictionary
zstd new_file.bin -o without_dict.zst && ls -lh without_dict.zst
zstd -D samples.dict new_file.bin -o with_dict.zst && ls -lh with_dict.zst
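The reference-based idea can also be illustrated with the Python standard library alone: zlib supports preset dictionaries, and zstd's -D flag works analogously. Data resembling the dictionary compresses far better than without it; the HTTP-like strings here are purely illustrative.

```python
import zlib

# Dictionary-based compression: the dictionary acts as shared reference
# material that both compressor and decompressor already hold.
dictionary = b"GET /api/v1/users HTTP/1.1\r\nHost: example.com\r\n" * 4
message = b"GET /api/v1/users HTTP/1.1\r\nHost: example.com\r\nX-Id: 7\r\n"

plain = zlib.compress(message, 9)                       # no dictionary

c = zlib.compressobj(9, zlib.DEFLATED, zdict=dictionary)
with_dict = c.compress(message) + c.flush()             # with dictionary

print(len(plain), len(with_dict))  # with_dict is noticeably smaller

d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(with_dict) == message  # decoder needs the same dict
```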
3.7 Python — Delta Compression from Scratch
Implementing a simple delta encoder in Python illustrates the core concepts without
external dependencies.

Simple Block-Level Delta in Python


import hashlib, json

BLOCK_SIZE = 512

def make_signature(data: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Build a block hash index of source data."""
    index = {}
    for i in range(0, len(data), block_size):
        block = data[i:i+block_size]
        h = hashlib.md5(block).hexdigest()
        index[h] = i
    return index

def compute_delta(source: bytes, target: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Compute a delta (list of instructions) from source to target."""
    sig = make_signature(source, block_size)
    ops = []
    i = 0
    while i < len(target):
        block = target[i:i+block_size]
        h = hashlib.md5(block).hexdigest()
        if h in sig and len(block) == block_size:
            ops.append({'op': 'COPY', 'src_offset': sig[h], 'length': len(block)})
        else:
            ops.append({'op': 'ADD', 'data': list(block)})
        i += block_size
    return ops

def apply_delta(source: bytes, ops: list) -> bytes:
    """Reconstruct target from source + delta instructions."""
    out = bytearray()
    for op in ops:
        if op['op'] == 'COPY':
            off, ln = op['src_offset'], op['length']
            out.extend(source[off:off+ln])
        elif op['op'] == 'ADD':
            out.extend(bytes(op['data']))
    return bytes(out)

# ── Demo ─────────────────────────────────────────────
source = b'Hello World! ' * 100 + b'original end'
target = b'Hello World! ' * 100 + b'UPDATED end!!!'

ops = compute_delta(source, target)
copies = sum(1 for o in ops if o['op'] == 'COPY')
adds = sum(1 for o in ops if o['op'] == 'ADD')
print(f'Instructions: {copies} COPYs, {adds} ADDs')

restored = apply_delta(source, ops)
print(f'Restore matches: {restored == target}')

patch_json = json.dumps(ops).encode()
print(f'Source: {len(source)} B | Target: {len(target)} B | Delta: {len(patch_json)} B')

3.8 Advanced Python — Using xdelta3 Library


# pip install xdelta3
import xdelta3
import hashlib

def delta_encode_file(source_path: str, target_path: str, patch_path: str):
    with open(source_path, 'rb') as f: src = f.read()
    with open(target_path, 'rb') as f: tgt = f.read()

    patch = xdelta3.encode(src, tgt)

    with open(patch_path, 'wb') as f: f.write(patch)

    ratio = len(patch) / len(tgt) * 100
    print(f'Source:  {len(src):>10,} bytes')
    print(f'Target:  {len(tgt):>10,} bytes')
    print(f'Patch:   {len(patch):>10,} bytes ({ratio:.1f}% of target)')
    print(f'Savings: {100-ratio:.1f}%')

def delta_decode_file(source_path: str, patch_path: str, output_path: str):
    with open(source_path, 'rb') as f: src = f.read()
    with open(patch_path, 'rb') as f: patch = f.read()

    restored = xdelta3.decode(src, patch)

    with open(output_path, 'wb') as f: f.write(restored)

    # Verify
    h = hashlib.sha256(restored).hexdigest()
    print(f'SHA256: {h}')
    print(f'Restored: {len(restored):,} bytes')

# Usage
delta_encode_file('app_v1.bin', 'app_v2.bin', 'v1_to_v2.xdelta')
delta_decode_file('app_v1.bin', 'v1_to_v2.xdelta', 'app_v2_restored.bin')

3.9 Shell Script — Full Delta Backup System


#!/bin/bash
# delta_backup.sh — incremental delta backup using xdelta3

set -euo pipefail

BACKUP_DIR='/var/backups/delta'
SOURCE_DIR='/var/data'
LOG="$BACKUP_DIR/backup.log"
DATE=$(date +%Y%m%d_%H%M%S)
TODAY_SNAP="$BACKUP_DIR/snapshot_${DATE}.tar"

mkdir -p "$BACKUP_DIR"

# Create today's snapshot


echo "[$(date)] Creating snapshot..." | tee -a "$LOG"
tar -cf "$TODAY_SNAP" -C "$SOURCE_DIR" .

# Find the most recent previous snapshot


PREV_SNAP=$(ls -t "$BACKUP_DIR"/snapshot_*.tar 2>/dev/null | sed -n '2p')

if [ -z "$PREV_SNAP" ]; then
echo "[$(date)] No previous snapshot — storing full backup" | tee -a "$LOG"
else
PATCH_FILE="$BACKUP_DIR/delta_${DATE}.xdelta"
echo "[$(date)] Creating delta patch..." | tee -a "$LOG"
xdelta3 -e -9 -f -s "$PREV_SNAP" "$TODAY_SNAP" "$PATCH_FILE"
ORIG_MB=$(du -m "$TODAY_SNAP" | cut -f1)
DELTA_MB=$(du -m "$PATCH_FILE" | cut -f1)
SAVING=$(echo "scale=1; 100 - $DELTA_MB * 100 / $ORIG_MB" | bc)
    echo "[$(date)] Snapshot: ${ORIG_MB}MB | Patch: ${DELTA_MB}MB | Saving: ${SAVING}%" | tee -a "$LOG"
# Remove full snapshot — keep only delta
rm "$TODAY_SNAP"
fi

echo "[$(date)] Backup complete" | tee -a "$LOG"

4. Advantages of Delta Compression

4.1 Massive Bandwidth & Storage Savings


For versioned data (software, databases, disk images, documents), delta compression
typically reduces the patch size to 1–10% of the full file size. A 500 MB application
update might be distributable as a 5–15 MB delta patch, reducing CDN bandwidth
costs by 97%.
📊 Real figures: Linux kernel 6.6→6.7 source tarball: 134 MB full | ~4 MB xdelta
patch (97% savings). Firefox 120→121: 80 MB full | 2.1 MB delta patch (97.4%
savings).

4.2 Works on Any Binary Data


Unlike text diff tools that operate line-by-line, binary delta tools work byte-by-byte on
any file type — executables, databases, ISO images, VM disks, firmware, compressed
archives (before compression), multimedia files, and more.

4.3 Extremely Fast Decoding


Patch application (decoding) is nearly always O(N) in the output size and is typically
memory-bandwidth limited. Modern systems can reconstruct files at 1–5 GB/s during
decode — faster than copying a file on most storage devices.

4.4 Enables Incremental Distribution Models


Delta compression makes true incremental update systems feasible. Software can ship
continuous updates without users having to download full installers. This is the
foundation of modern update infrastructure: Windows Update, macOS Software
Update, apt/yum package managers, and game platforms like Steam all use delta
principles.

4.5 Reduces Flash Write Cycles (Embedded / IoT)


For embedded systems and IoT devices, flash memory has a limited write cycle count.
By using delta patches, only the changed sectors of firmware need to be rewritten,
dramatically extending hardware lifespan and speeding up OTA updates.

4.6 Complementary to Regular Compression


Delta and regular compression are not mutually exclusive — they stack. The delta
removes inter-version redundancy; a secondary compressor (zstd, lzma) removes intra-
patch redundancy. Combined, they often outperform either technique alone by 2–5x.

4.7 Supports Verifiable Integrity


Most delta formats embed checksums (CRC32, Adler-32, MD4, SHA1) for both the
source and the resulting output. This ensures that a corrupted patch or mismatched
source file is detected before producing incorrect output — critical for software
distribution and medical/industrial systems.

4.8 Open Standards Exist


VCDIFF (RFC 3284 / ISO 13586) provides a vendor-neutral, documented patch
format. Libraries implementing VCDIFF exist for C, C++, Java, Go, Rust, Python, and
more, enabling multi-platform interoperability without proprietary lock-in.
5. Disadvantages & Limitations

5.1 Source File Dependency — The Core Limitation


Delta patches are completely useless without the exact source file they were computed
against. If a user has a different version, a modified copy, or no copy at all, the patch
cannot be applied. This creates significant complexity in distribution systems that must
support users at many different starting versions.
⚠️Chain complexity: To support users on v1.0, v1.5, and v2.0 all updating to
v3.0, you need three separate patches. For N source versions, you need N patches
per release.

5.2 Poor Performance on Dissimilar Files


When source and target share little content, the patch size approaches the target file
size — offering no benefit. Encryption, compression, and random data all appear
completely different even for minor logical changes, making delta compression
worthless in those cases.

5.3 Encoding is CPU and Memory Intensive


Building the hash index or suffix array for large files requires significant CPU time and
RAM. Encoding a 4 GB VM image can take minutes and require 5+ GB of RAM. This
is acceptable for centralized build servers but impossible on constrained devices.

5.4 No Built-in Version Management


Delta compression produces point-to-point patches. Managing a chain of patches
(v1→v2→v3→v4), handling rollbacks, and maintaining patch history requires
additional infrastructure. Delta compression is a building block, not a complete version
management system.

5.5 Encryption Destroys Delta Efficiency


Encrypting or re-compressing a file before delta encoding produces near-random
output — even a 1-byte change in plaintext cascades into completely different
ciphertext bytes. The delta patch grows to nearly the size of the target. Always delta
before encrypting, never after.
5.6 Delta Chain Brittleness
If any file in a long delta chain (v1→v2→v3→...→vN) is corrupted or lost, all
subsequent patches become unrecoverable without the intermediate versions.
Production systems mitigate this by inserting full snapshots periodically (e.g., every 10
versions or weekly).

5.7 Binary Patches Are Not Human-Readable


Unlike text patches (diff output), binary delta patches cannot be inspected, reviewed, or
manually modified. Debugging a broken patch requires specialized tools and deep
knowledge of the patch format. Code review workflows cannot use binary patches for
security audits.

5.8 File Rearrangement Breaks Compression


If the target is a reordering of source content (e.g., a sorted vs. unsorted dataset),
rolling-hash algorithms may miss all matches. Only suffix-array based approaches
(bsdiff) reliably handle large-scale rearrangements, at the cost of much higher memory
use.

6. Real-World Applications & Use Cases

6.1 Operating System Updates


Every major OS uses delta compression for software updates. Windows Update uses
Microsoft's delta compression (Cabinet/.delta format). macOS uses BOM-based binary
delta for system updates. Linux distributions use debdelta (apt) and deltarpm/drpm
(yum/dnf) for package updates.
# Debian/Ubuntu — view delta update activity
sudo apt-get update -o APT::Get::Show-Upgraded=1
# APT fetches deltas automatically when available

# RPM delta packages — Fedora/RHEL


sudo dnf install deltarpm
sudo dnf upgrade # uses drpm delta packages automatically

# Generate an RPM delta package manually
makedeltarpm old.rpm new.rpm delta.drpm
applydeltarpm delta.drpm new_from_delta.rpm
6.2 Game Patching & Digital Distribution
Game studios distribute patches that transform installed game data files. Steam's
Content Delivery Network uses a proprietary delta system. GOG, Epic Games, and
indie developers commonly use xdelta3 or bsdiff for smaller games.
#!/bin/bash
# Game update script example (URL and data.pak file name are illustrative)
GAME_DIR='/opt/mygame'
PATCH_URL='https://example.com/patches/game_update.xdelta'

# Download patch (a few MB instead of a full GB install)
curl -L "$PATCH_URL" -o /tmp/game_update.xdelta

# Apply patch to a temporary file, then swap in on success
if xdelta3 -d -f \
    -s "$GAME_DIR/data.pak" \
    /tmp/game_update.xdelta \
    "$GAME_DIR/data.pak.new"; then
  mv "$GAME_DIR/data.pak.new" "$GAME_DIR/data.pak"
  echo 'Update successful'
else
  echo 'Patch application failed!'
fi

6.3 Container & VM Image Distribution


Container registries (Docker Hub, GHCR) store image layers as content-addressed
blobs. Tools like casync, dragonfly, and nydus use delta/dedup techniques to avoid re-
downloading unchanged layers. VM image distribution uses xdelta or bsdiff to ship
only changed disk blocks.
# Docker layer delta inspection
docker history myimage:latest --no-trunc

# Manual VM disk delta


xdelta3 -e -B 2147483648 -9 \
-s ubuntu-22.04-base.qcow2 \
ubuntu-22.04-updated.qcow2 \
vm_delta.xdelta3

# casync — content-addressable delta sync for OS images


casync make --store=./store image_v2.catar ./image_v2_dir/
casync extract --store=./store image_v2.catar ./restored_dir/

6.4 Database Backup & Replication


Database WAL (Write-Ahead Log) in PostgreSQL and MySQL binlog are themselves
forms of delta encoding — recording only changed rows. Binary delta tools add
another layer for backup storage compression.
# PostgreSQL WAL archiving with delta compression
# In postgresql.conf (sketch; %p is the WAL file path, %f its name):
# archive_command = 'xdelta3 -e -9 -s /archive/prev_wal %p /archive/%f.xd3'

# MySQL binlog is already a delta stream
mysqlbinlog --start-datetime='2024-01-01' binlog.000001 > changes.sql

# Compress daily DB dumps using delta against yesterday
# (note: delta works better on the uncompressed dumps; gzip first is shown
# only for brevity)
YESTERDAY_DUMP='/backups/db_yesterday.sql.gz'
TODAY_DUMP='/tmp/db_today.sql.gz'
mysqldump mydb | gzip > "$TODAY_DUMP"
xdelta3 -e -9 -s "$YESTERDAY_DUMP" "$TODAY_DUMP" '/backups/delta_today.xd3'
echo "Delta size: $(du -sh /backups/delta_today.xd3)"

6.5 IoT & Embedded Firmware OTA


# Server-side: create firmware delta patch
xdelta3 -e -S none -9 -s fw_v2.3.bin fw_v2.4.bin fw_v2.3_to_v2.4.xd3
echo "Firmware: $(du -sh fw_v2.4.bin) | Patch: $(du -sh fw_v2.3_to_v2.4.xd3)"

# Device-side C code using libxdelta3:


# xd3_decode_memory(
# old_fw_ptr, old_fw_size, /* source */
# patch_ptr, patch_size, /* delta */
# new_fw_buf, &new_fw_size, /* output */
# max_output_size,
# XD3_SKIP_EMIT /* flags */
# );
6.6 Source Code Version Control
# Git uses delta internally — inspect packfile deltas
git gc --aggressive --prune=now
git count-objects -vH

# Fossil VCS — explicit delta storage


fossil diff --from v1.0 --to v2.0 > v1_to_v2.diff

# SVN stores deltas in FSFS/BDB backends


svnadmin dump /path/to/repo | grep -c 'Text-delta:'

6.7 ROM Hacking & Digital Preservation


# Fan translation / ROM patch distribution
# Distributing the patch is legal — original ROM not included

# Apply a BPS patch (beat patch system — common for ROM hacks)
# Using flips or another BPS patcher:
flips --apply translation.bps original.rom patched.rom

# Apply an IPS patch (flips handles IPS too)
flips --apply patch.ips original.rom patched.rom

# Apply an xdelta ROM patch
xdelta3 -d -s original.rom patch.xdelta patched.rom

7. Delta Compression vs. Xdelta — Detailed Comparison

This section clarifies an important distinction: 'Delta Compression' is a broad concept
— a family of techniques. 'Xdelta' is one specific implementation of delta compression.
Here we compare them across every meaningful dimension.

💡 Analogy: Delta compression is like 'sorting algorithms' (the concept). Xdelta is
like 'quicksort' (one specific algorithm in that family).

7.1 Nature & Scope


Dimension        Delta Compression (General)              Xdelta3 (Specific Tool)
What is it?      A concept / technique family             A specific software tool & library
Scope            Any algorithm that encodes differences   Rolling-hash + VCDIFF implementation
Standards        Many (VCDIFF, bsdiff, IPS, rsync, etc.)  VCDIFF only (RFC 3284)
Implementations  Dozens of tools & libraries              One canonical tool (xdelta3)
Author           Concept: distributed / academic          Joshua MacDonald (jmacd)
License          Varies by implementation                 Apache 2.0

7.2 Algorithm Comparison


Algorithm Aspect        Delta (General Spectrum)                                   Xdelta3 (Specific)
Matching method         Rolling hash, suffix arrays, CDC, rsync, LCS               Rolling hash only (Rabin-Karp style)
Block size              Fixed or variable (CDC)                                    Variable (hash-defined boundaries)
Match quality           Varies: suffix arrays find best; rolling hash near-optimal Near-optimal for sequential data
Rearrangement handling  bsdiff (suffix array) handles it well                      Limited: rolling hash is sequential
Secondary compression   Varies: gzip, bzip2, lzma, Huffman, or none                DJW/FGK adaptive Huffman
Source window           Full file (bsdiff) or configurable (xdelta)                Configurable (-B flag, default 64MB)

7.3 Performance Profile


Metric           diff/patch             bsdiff       rsync            xdelta3
File type        Text only              Any binary   Any binary       Any binary
Encode speed     Very fast              Very slow    Medium           Fast
Decode speed     Fast                   Fast         N/A (live)       Very fast
Patch size       Large (line-based)     Very small   N/A              Small
Memory (encode)  Low                    2-3x source  Low (streaming)  ~source window
Memory (decode)  Low                    Low          Low              Very low
Network support  No                     No           Yes (native)     With piping
Standard format  Unified diff           Proprietary  Proprietary      VCDIFF RFC 3284
Streaming/pipe   Yes                    No           Yes (live)       Yes (-c flag)
Source required  No (text-decode mode)  Yes          Yes              Yes

7.4 Feature Matrix — All Major Delta Tools


Feature           xdelta3    bsdiff      rdiff     rsync    git pack      zstd dict
Binary files      Yes        Yes         Yes       Yes      Yes           Yes
Open standard     VCDIFF     No          librsync  No       No            Zstd
Pipe/stream       Yes        No          Yes       Yes      No            Yes
Checksum verify   Built-in   None        MD4       MD4/SHA  SHA1          XXHash
Compr. level      0-9        Fixed       Fixed     Fixed    Configurable  1-22
In-memory API     Yes        No          Yes       No       No            Yes
Network native    No         No          Partial   Yes      Yes           No
Small file perf.  Good       Excellent   Good      Poor     Good          Good
Large file perf.  Very good  Poor (RAM)  Good      Good     Good          Good
Decode speed      Very fast  Fast        Fast      N/A      Fast          Very fast

7.5 When to Use Which


Scenario                        Best Choice           Why
Binary software/game patches    xdelta3               VCDIFF standard, fast decode, good compression
Smallest possible binary patch  bsdiff                Suffix array finds optimal matches, smallest patches
Sync files over network         rsync                 Native network protocol, no intermediate files
Text / source code patches      diff/patch            Human-readable, universal, mergeable
ROM hacking patches             xdelta3 or IPS        Community standard, easy to apply
Git history efficiency          git repack            Built-in, automatic, integrated with workflow
Similar-file compression        zstd -D dict          No source file needed at decode, very fast
IoT firmware OTA                xdelta3 / libxdelta3  C library embeddable, small decode footprint
Database backup deltas          xdelta3 or rdiff      xdelta3 for local, rdiff for network-aware
Container image sync            casync or dragonfly   Content-defined chunking for layer dedup
Very large files (> 4 GB)       xdelta3 with -B       64-bit offsets, configurable windows
Embedded C with no deps         custom / miniz        Implement simple block-delta + deflate

7.6 Side-by-Side Code Comparison


Creating a Binary Patch
# ── General delta (bsdiff) ─────────────────────────
bsdiff old_file new_file patch.bsdiff

# ── Xdelta3 ────────────────────────────────────────
xdelta3 -e -s old_file new_file patch.xdelta

# ── rdiff ──────────────────────────────────────────
rdiff signature old_file old_file.sig
rdiff delta old_file.sig new_file delta.rdiff

# ── Python xdelta3 (in-memory) ─────────────────────
import xdelta3
patch = xdelta3.encode(open('old','rb').read(), open('new','rb').read())
open('patch.xdelta','wb').write(patch)

Applying a Binary Patch


# ── General delta (bspatch) ────────────────────────
bspatch old_file new_file patch.bsdiff

# ── Xdelta3 ────────────────────────────────────────
xdelta3 -d -s old_file patch.xdelta new_file

# ── rdiff ──────────────────────────────────────────
rdiff patch old_file delta.rdiff new_file

# ── Python xdelta3 ─────────────────────────────────
restored = xdelta3.decode(open('old','rb').read(), open('patch.xdelta','rb').read())
open('new','wb').write(restored)

8. Advanced Topics

8.1 Delta Chains & Snapshot Strategy


In a real update system, users may be on many different versions. A naive approach
requires N patches per release (one per supported source version). Two strategies
reduce this:
• Baseline + delta chain: keep one full snapshot every K versions; apply a chain of
at most K deltas. Tradeoff: up to K apply operations needed.
• Parallel deltas from common base: each new version has a patch from the
canonical 'stable base'. Every user downloads patches from the same source
regardless of their current version (requires a 'downgrade to base, then upgrade'
workflow).
• Bintray/Deltarpm approach: generate deltas from the N most recent versions to
the current release, covering most real-world users while limiting patch count.
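The baseline + delta chain bookkeeping can be sketched as follows; the function name and the convention that "delta i transforms version i into i+1" are illustrative assumptions, not any real tool's API:

```python
def chain_to_restore(version: int, snapshot_interval: int):
    """Return (snapshot_version, deltas) for the baseline + delta chain
    strategy: a full snapshot exists every `snapshot_interval` versions,
    and delta i is assumed to transform version i into version i + 1."""
    base = (version // snapshot_interval) * snapshot_interval
    deltas = list(range(base, version))   # applied in order from the snapshot
    return base, deltas

# Snapshots at v0, v5, v10, ...: restoring v7 needs snapshot v5 plus
# deltas 5->6 and 6->7; a snapshot version itself needs no deltas.
assert chain_to_restore(7, 5) == (5, [5, 6])
assert chain_to_restore(10, 5) == (10, [])
```

The snapshot interval is the knob: a smaller interval means fewer apply operations per restore but more full-size snapshots to store.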

8.2 Delta Compression + Encryption


Never apply delta compression after encryption. The correct order is:
# WRONG: encrypt then delta (delta will be huge — encrypted data is near-random)
openssl enc -aes-256-cbc -in new_file -out new_file.enc
xdelta3 -e -s old_file.enc new_file.enc patch   # TERRIBLE — no savings!

# CORRECT: delta first, then encrypt the (small) patch

xdelta3 -e -s old_file new_file patch                  # small patch
openssl enc -aes-256-cbc -in patch -out patch.enc      # encrypt the patch

# On client: decrypt then apply

openssl enc -d -aes-256-cbc -in patch.enc -out patch
xdelta3 -d -s old_file patch new_file
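The effect is easy to demonstrate without real crypto. The toy chained construction below (hash-based, deliberately not reversible or secure) mimics CBC's avalanche behavior: a single changed plaintext byte scrambles every ciphertext block after it, which is exactly what leaves a delta encoder nothing to match:

```python
import hashlib

def toy_cbc_encrypt(key: bytes, plaintext: bytes, block: int = 16) -> bytes:
    """Hash-based CBC-style chaining (NOT real or reversible encryption):
    each ciphertext block mixes in the previous one, so one changed
    plaintext byte alters every block from that point on."""
    prev = hashlib.blake2b(key, digest_size=block).digest()  # stand-in IV
    out = bytearray()
    for i in range(0, len(plaintext), block):
        chunk = plaintext[i:i + block].ljust(block, b"\0")
        prev = hashlib.blake2b(key + prev + chunk, digest_size=block).digest()
        out += prev
    return bytes(out)

old = b"A" * 4096
new = b"A" * 2048 + b"B" + b"A" * 2047   # one byte changed mid-file
c_old = toy_cbc_encrypt(b"key", old)
c_new = toy_cbc_encrypt(b"key", new)
# Ciphertexts agree only before the change point; everything after the
# changed block differs, so a delta of the ciphertexts saves almost nothing.
```

Delta-then-encrypt avoids this entirely: the encoder sees the highly similar plaintexts, and only the small patch is ever encrypted.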

8.3 Measuring Delta Efficiency


#!/bin/bash
# Compare all delta tools on a pair of files
OLD=$1
NEW=$2

OLD_SIZE=$(stat -c%s "$OLD")
NEW_SIZE=$(stat -c%s "$NEW")

echo "Source: ${OLD_SIZE} bytes | Target: ${NEW_SIZE} bytes"
echo "────────────────────────────────────────────────"

# xdelta3
xdelta3 -e -9 -f -s "$OLD" "$NEW" /tmp/patch.xd3 2>/dev/null
SZ=$(stat -c%s /tmp/patch.xd3)
echo "xdelta3 -9: ${SZ} bytes ($(echo "scale=1; $SZ*100/$NEW_SIZE" | bc)% of target)"

# bsdiff
bsdiff "$OLD" "$NEW" /tmp/patch.bsdiff 2>/dev/null
SZ=$(stat -c%s /tmp/patch.bsdiff)
echo "bsdiff: ${SZ} bytes ($(echo "scale=1; $SZ*100/$NEW_SIZE" | bc)% of target)"

# rdiff
rdiff signature "$OLD" /tmp/old.sig 2>/dev/null
rdiff delta /tmp/old.sig "$NEW" /tmp/patch.rdiff 2>/dev/null
SZ=$(stat -c%s /tmp/patch.rdiff)
echo "rdiff: ${SZ} bytes ($(echo "scale=1; $SZ*100/$NEW_SIZE" | bc)% of target)"

# zstd without delta (baseline)
zstd -19 -q "$NEW" -o /tmp/new.zst
SZ=$(stat -c%s /tmp/new.zst)
echo "zstd -19: ${SZ} bytes ($(echo "scale=1; $SZ*100/$NEW_SIZE" | bc)% of target, no delta)"

9. Troubleshooting & Common Pitfalls


Symptom                               Root Cause                               Solution
Patch larger than target              Files share little content               Distribute full file; delta not beneficial
Wrong source checksum error           Mismatched source version                Match exact source version used during encoding
Out of memory encoding                Source window too large                  Reduce block/window size or add swap
Delta patch on encrypted file fails   Encryption destroys similarity           Delta before encrypting, never after
Encoding is extremely slow            Suffix-array algorithm on large file     Use xdelta3 (rolling hash) for speed
Decode produces wrong file            Corrupt patch or wrong tool              Re-download patch; verify checksums
bsdiff out of memory                  Suffix array needs 2–3× source RAM       Use xdelta3 for large files (> 500 MB)
rsync not saving bandwidth            Block size mismatch after reorder        Use --checksum or increase block size
Chain delta restore fails             Intermediate version missing             Rebuild chain from nearest full snapshot
Patch fails on 32-bit OS              File > 4 GB with 32-bit offsets          Use 64-bit build of xdelta3

10. Quick Reference Cheat Sheet

Delta Compression — All Tools


Tool             Create Patch                                                  Apply Patch
diff/patch       diff -u old new > file.patch                                  patch old < file.patch
xdelta3          xdelta3 -e -s old new patch                                   xdelta3 -d -s old patch new
bsdiff           bsdiff old new patch                                          bspatch old new patch
rdiff            rdiff signature old sig; rdiff delta sig new patch            rdiff patch old patch new
rsync            (no patch file — live sync)                                   rsync -av src/ user@host:/dst/
zstd dict        zstd --train samples/* -o dict; zstd -D dict new -o new.zst   zstd -D dict -d new.zst -o new
git              (automatic in packfiles)                                      git checkout / git pull
Python xdelta3   p = xdelta3.encode(src, tgt)                                  r = xdelta3.decode(src, patch)

xdelta3 Flag Quick Reference


Flag Meaning
-e Encode (create patch)
-d Decode (apply patch)
-t Test (encode + decode + verify)
-c Stdin/stdout pipe mode
-f Force overwrite output
-s FILE Source file
-S djw|fgk|none Secondary compressor
-B SIZE Source window size in bytes (default 64MB)
-0 to -9 Compression level (0=fastest, 9=best)
-v / -q Verbose / quiet
-n Skip checksum verification
Most popular tools for binary delta patching?
Binary delta patching tools are categorized by their specific strengths, such as
compression ratio, speed, or network efficiency. The most popular and prominent tools
include:
• Xdelta3: This is one of the most widely used tools for binary software and game
patches. It implements the VCDIFF (RFC 3284) open standard and uses a rolling-hash
algorithm that is near-optimal for sequential data. It is favored for its very fast decode
speeds and good compression, making it a standard for community ROM hacking and
industrial software distribution.
• Bsdiff / Bspatch: This tool is the best choice when the priority is achieving the
smallest possible binary patch. It uses a powerful suffix-array algorithm that can
find matches across the entire source simultaneously, even handling reordered data.
However, it is much slower than Xdelta3 and requires significant memory (2–3x the
source file size) during the encoding process.
• Rsync: While primarily a real-time delta transfer system rather than a standalone
patch-file tool, it is cited as the most widely-used delta sync tool in existence. It uses a
rolling checksum to transfer only changed blocks over a network, making it essential
for remote file synchronization.
• Rdiff (librsync): Based on the rsync algorithm, rdiff is used for network-aware
deltas. Its primary advantage is that it only requires a small "signature" file (containing
hashes) to compute differences, rather than needing the full source file locally.
• Zstandard (zstd): Although a general-purpose compressor, it supports dictionary-
based compression, which functions as a form of delta compression. It is highly
efficient for compressing many similar small files and offers extremely fast
performance.
• Git: Internally, Git uses delta compression to manage its packfiles. It automatically
identifies similar blobs and stores only their differences to keep repository sizes
manageable.
• ROM Patching Formats (IPS, BPS, UPS): These are specialized, simpler formats
used extensively in the ROM hacking and digital preservation communities. BPS
(Beat Patch System) is a common standard for distributing legal fan translations and
game modifications.
• Casync: Used in modern container and OS image distribution, this tool utilizes
content-defined chunking (CDC) to identify identical content regardless of its position
in a file, which is robust against insertions and deletions.
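The content-defined chunking idea behind casync can be sketched in a few lines; the gear table, mask, and minimum chunk size below are illustrative choices, not casync's actual parameters:

```python
import hashlib

def cdc_chunks(data: bytes, mask: int = 0x3FF, min_size: int = 64) -> list:
    """Toy content-defined chunking: a gear-style rolling value over the
    bytes decides chunk boundaries, so identical runs of content produce
    identical chunks even after insertions shift their position."""
    # Per-byte "gear" values derived deterministically from the byte value.
    gear = [int.from_bytes(hashlib.blake2b(bytes([b]), digest_size=4).digest(), "big")
            for b in range(256)]
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + gear[b]) & 0xFFFFFFFF     # rolling hash update
        if i - start + 1 >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])      # boundary: cut a chunk
            start, h = i + 1, 0                   # hash restarts per chunk
    if start < len(data):
        chunks.append(data[start:])               # trailing partial chunk
    return chunks
```

Because boundaries depend only on local content, not byte offsets, inserting data near the start of a file changes a few chunks around the edit while later chunks keep hashing to the same values and can be deduplicated.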
How is delta compression used in software and game updates?
Delta compression is a foundational technology for modern software and game
distribution, enabling incremental updates that transmit only the differences (the
"delta") between an existing version and a new one. Instead of downloading a full
multi-gigabyte installer, the user downloads a compact patch file that, when applied to
their local files, reproduces the updated version exactly.
Real-World Applications
• Operating Systems: Every major OS utilizes this technique to reduce the size of
system updates. For instance, Windows Update uses Microsoft's Cabinet/.delta format,
while macOS uses BOM-based binary deltas. In the Linux ecosystem, tools like
debdelta (Debian/Ubuntu) and deltarpm (Fedora/RHEL) automatically fetch deltas for
package upgrades.
• Game Patching: Platforms like Steam use proprietary delta systems to transform
installed game data files. Smaller studios and indie developers often use standard tools
like xdelta3 or bsdiff to distribute patches. This is especially effective for large "data
pack" files (e.g., a monolithic .pak archive), where a few megabytes of changed assets
can be patched into a gigabyte-sized archive.
• Version Control: Systems like Git use delta compression internally in "packfiles" to
store repository history efficiently by identifying similar blobs and storing only their
differences.
Key Benefits for Distribution
• Bandwidth Savings: Delta patches typically reduce update sizes to 1–10% of the
full file size. Examples include the Firefox browser, where an 80 MB full update can
be reduced to a 2.1 MB delta patch (a 97.4% saving).
• Decode Speed: Applying a patch is significantly faster than downloading a full file.
Modern decoders can reconstruct files at speeds of 1–5 GB/s, making the process
limited by storage speed rather than CPU performance.
• Hardware Longevity: In IoT and embedded systems, delta patches extend the
lifespan of flash memory by only rewriting the specific changed sectors of firmware
rather than the entire chip.
The Technical Process
The update process follows a specific instruction-based logic:
1. Instruction Set: A patch file contains a sequence of commands: COPY (reuse data
from the old version), ADD (insert new data not found in the old version), or RUN
(repeat a specific byte).
2. Decoding (Reconstruction): The user's device executes these instructions
sequentially to build the new file. This is 10-50× faster than encoding because it
requires no searching—just simple memory copying operations.
3. Integrity Verification: Most formats embed checksums (like SHA1 or MD4) to
ensure that the resulting updated file is bit-for-bit identical to the intended version.
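The instruction model above can be sketched as a toy decoder. This mirrors the COPY/ADD/RUN decode loop of VCDIFF-style formats, but uses plain Python tuples in place of a real wire encoding:

```python
def apply_delta(source: bytes, instructions) -> bytes:
    """Toy reconstruction loop over a COPY/ADD/RUN instruction stream
    (plain tuples stand in for a real wire format such as VCDIFF's)."""
    out = bytearray()
    for op, *args in instructions:
        if op == "COPY":                 # reuse a span of the source
            offset, length = args
            out += source[offset:offset + length]
        elif op == "ADD":                # literal bytes absent from source
            (data,) = args
            out += data
        elif op == "RUN":                # repeat one byte `length` times
            byte, length = args
            out += bytes([byte]) * length
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return bytes(out)

old = b"hello world"
new = apply_delta(old, [
    ("COPY", 0, 6),        # "hello "
    ("ADD", b"brave "),    # inserted literal
    ("COPY", 6, 5),        # "world"
    ("RUN", ord("!"), 3),  # "!!!"
])
assert new == b"hello brave world!!!"
```

The loop does no searching, only copying, which is why decoding is so much cheaper than encoding: all the hard work of finding the COPY spans happened on the publisher's machine.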
Implementation Strategies and Limitations
• Delta Chains: To support users upgrading from various previous versions,
developers use delta chains (v1→v2→v3) or generate multiple patches from a
common "stable base".
• Source Dependency: A patch is useless without the exact source file it was created
from. If a user’s local file is corrupted or is a different version than expected, the patch
cannot be applied.
• Encryption/Compression Conflict: If a developer encrypts or re-compresses a file
before delta encoding, the efficiency is destroyed because even a 1-byte change in the
source results in a completely different, randomized output.
Which tool provides the smallest possible binary patch?
bsdiff is the best choice when the priority is achieving the smallest possible binary
patch.
The primary reasons for its superior compression include:
• Suffix-Array Algorithm: Unlike rolling-hash tools, bsdiff builds a sorted suffix
array of the entire source file. This allows it to perform a binary search to find the
longest match for any target position anywhere in the source.
• Global Match Finding: It can find matches across the entire source simultaneously,
making it more powerful than tools like xdelta3, which are more restricted to sequential
data.
• Handling Rearrangements: Because it uses suffix sorting, it reliably handles large-
scale data reordering and rearrangements that might cause other algorithms to miss
matches.
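The lookup can be sketched as follows; this is a naive toy (bsdiff itself uses a far faster suffix sort and search), but the binary-search-over-sorted-suffixes idea is the same:

```python
def build_suffix_array(source: bytes):
    """Naive O(n^2 log n) suffix array; bsdiff uses a much faster suffix
    sort, but lookups against the result work the same way."""
    return sorted(range(len(source)), key=lambda i: source[i:])

def longest_match(source: bytes, sa, target: bytes):
    """Binary-search the sorted suffixes for the source offset whose
    suffix shares the longest common prefix with `target`."""
    lo, hi = 0, len(sa)
    while lo < hi:                        # find the insertion point
        mid = (lo + hi) // 2
        if source[sa[mid]:] < target:
            lo = mid + 1
        else:
            hi = mid
    best = (0, 0)
    for idx in (lo - 1, lo):              # best match borders that point
        if 0 <= idx < len(sa):
            off, n = sa[idx], 0
            while (off + n < len(source) and n < len(target)
                   and source[off + n] == target[n]):
                n += 1
            if n > best[1]:
                best = (off, n)
    return best

src = b"the quick brown fox"
sa = build_suffix_array(src)
assert longest_match(src, sa, b"quick brown cat") == (4, 12)
```

Because the suffixes are globally sorted, a match anywhere in the source is found in O(log n) comparisons, regardless of how far the data has moved, which is what makes suffix-array encoders robust to reordering.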
However, achieving this minimal patch size comes with significant trade-offs. The
encoding process is very slow and highly memory-intensive, requiring approximately
2–3× the size of the source file in RAM. For this reason, while bsdiff is ideal for
minimizing file size, it is often bypassed in favor of xdelta3 in scenarios where
encoding speed or large file sizes (greater than 500 MB) are a factor.
