
File Types

Uploaded by

César Martín

| Aspect | Text files | Sequence files |
| --- | --- | --- |
| File Format | Human-readable plain text | Binary format (not human-readable) |
| Data Organization | Lines of text (fields usually delimited by spaces or commas) | Key-value pairs in a binary format |
| Efficiency | Less efficient in terms of storage and processing | More efficient storage and faster processing |
| Compression Support | Limited, often requires external tools (e.g., Gzip) | Built-in compression (e.g., Snappy, Gzip) |
| Serialization | Requires custom parsing for structured data | Automatic serialization using Hadoop Writable types |
| Splittability | Splittable when uncompressed (gzip-compressed text is not) | Splittable, making distributed processing easier |
| Metadata Support | No metadata support; data schema inferred manually | Supports metadata, allowing easier schema management |
| Integration with Hadoop | Supported but not optimized for Hadoop | Optimized for use with Hadoop and MapReduce |
| Use Cases | Best for small or simple datasets and for debugging | Best for large datasets in MapReduce applications |
| Read/Write Performance | Slower for structured data | Faster read/write due to binary encoding |
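The Data Organization and Serialization rows can be illustrated with a small sketch. It writes the same records once as comma-delimited text and once as length-prefixed binary key-value pairs. The binary layout here is a simplified stand-in invented for illustration, not the actual Hadoop SequenceFile format, and the names `read_text`, `write_binary`, and `read_binary` are hypothetical.

```python
import io
import struct

rows = [("alice", 30), ("bob", 25)]

# Text file: one comma-delimited line per record.
text = "".join(f"{name},{age}\n" for name, age in rows)


def read_text(data):
    """Parse each line by hand; field types must be restored manually."""
    out = []
    for line in data.splitlines():
        name, age = line.split(",")
        out.append((name, int(age)))
    return out


def write_binary(records):
    """Write each record as: 4-byte key length, UTF-8 key, 4-byte int value.

    Assumption: this toy layout only mimics the key-value idea; it is not
    the real SequenceFile on-disk format.
    """
    buf = io.BytesIO()
    for key, value in records:
        raw = key.encode("utf-8")
        buf.write(struct.pack(">I", len(raw)))
        buf.write(raw)
        buf.write(struct.pack(">i", value))
    return buf.getvalue()


def read_binary(data):
    """Read records back using the length prefixes; no delimiter parsing needed."""
    out, pos = [], 0
    while pos < len(data):
        (key_len,) = struct.unpack_from(">I", data, pos)
        pos += 4
        key = data[pos:pos + key_len].decode("utf-8")
        pos += key_len
        (value,) = struct.unpack_from(">i", data, pos)
        pos += 4
        out.append((key, value))
    return out


binary = write_binary(rows)
text_rows = read_text(text)
binary_rows = read_binary(binary)
```

Both readers recover the same records, but the text reader depends on a delimiter convention and manual type conversion, while the binary reader relies on a fixed, typed layout.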

| Aspect | ORC Files | Parquet Files |
| --- | --- | --- |
| File Format | Columnar, optimized for Hive | Columnar, supported by multiple big data platforms |
| Compression Support | High compression (Zlib, Snappy, LZO) | Multiple compression algorithms (Snappy, Gzip, LZO) |
| Efficiency | More efficient for read-heavy operations | Efficient for both read-heavy and write-heavy workloads |
| Data Organization | Column-oriented storage | Column-oriented storage |
| Splittability | Splittable for parallel processing in distributed systems | Splittable for parallel processing |
| Schema Evolution | Supports schema evolution (adding/removing columns) | Supports schema evolution (renaming, adding columns) |
| Metadata Support | Extensive, supports predicate pushdown | Extensive, supports predicate pushdown for faster queries |
| File Size | Generally produces smaller file sizes than Parquet | Slightly larger file sizes compared to ORC |
| Read/Write Performance | Better read performance for large datasets | Balanced performance for both read and write operations |
| Compatibility | Optimized for Hive and Hadoop ecosystems | Broad compatibility (supports Spark, Hive, Impala, etc.) |
| Use Cases | Best for Hive workloads and read-heavy analytics | Best for multi-platform use and general-purpose analytics |
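The Data Organization and Metadata Support rows describe column-oriented storage and predicate pushdown. A minimal pure-Python sketch of the idea follows: rows are regrouped into column chunks with per-chunk min/max statistics, and a filter skips whole chunks whose statistics rule them out. The names `to_row_groups` and `scan_greater_than` and the sample data are hypothetical; real ORC and Parquet readers are considerably more elaborate.

```python
rows = [
    {"id": 1, "city": "Lima", "amount": 10},
    {"id": 2, "city": "Quito", "amount": 20},
    {"id": 3, "city": "Lima", "amount": 30},
    {"id": 4, "city": "Bogota", "amount": 40},
    {"id": 5, "city": "Quito", "amount": 50},
    {"id": 6, "city": "Lima", "amount": 60},
]


def to_row_groups(records, size):
    """Split records into groups and store each group column by column.

    Each group also keeps (min, max) per column, a simplified version of
    the statistics that columnar formats keep in their metadata.
    """
    groups = []
    for start in range(0, len(records), size):
        chunk = records[start:start + size]
        columns = {name: [r[name] for r in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan_greater_than(groups, filter_col, threshold, select_col):
    """Return select_col values where filter_col > threshold.

    Groups whose maximum cannot satisfy the predicate are skipped without
    reading their values (the pushdown step). Only the two columns involved
    are touched; the others are never read.
    """
    result, skipped = [], 0
    for group in groups:
        _, highest = group["stats"][filter_col]
        if highest <= threshold:
            skipped += 1
            continue
        filter_vals = group["columns"][filter_col]
        select_vals = group["columns"][select_col]
        for f, s in zip(filter_vals, select_vals):
            if f > threshold:
                result.append(s)
    return result, skipped


groups = to_row_groups(rows, size=3)
matching_ids, skipped_groups = scan_greater_than(groups, "amount", 35, "id")
```

With a group size of 3, the first group has a maximum `amount` of 30, so the query `amount > 35` skips it entirely and reads only the second group.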
