hive-file-formats-compressions-usecases

Use cases for selecting Hive file formats & compression codecs

Pros & Cons - Hive File formats

| Format | Pros | Cons |
|---|---|---|
| Text | Lightweight with no dependencies; popular format for data exchange; suits datasets that have no schema or that arrive through CSV-based ingestion | Cannot handle nested data structures; no/minimal compression; slow to read and write |
| RC | Better compression than the text format; stores data as blobs; column oriented | Fixed column family; minimal compression; not intended for transactional, row-level workloads |
| ORC | Best compression of all the formats; stores data as blob stripes; column oriented | Fixed column family; costly write operations; not aimed at frequent row-level writes (although Hive ACID tables are built on ORC) |
| Avro | Supports schema evolution; language independent and high performance; best for heavy write operations; suits transactional, row-at-a-time writes; row oriented | Does not support Enum, Null, or Timestamp-with-zone types; slow serialization/deserialization; schema dependent |
| Parquet | Best suited to analytical operations; supports complex and nested data types; column oriented | Costly write operations; not intended for transactional, row-level workloads |
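
The row-oriented versus column-oriented distinction in the table above can be sketched in a few lines of Python. This is only a toy model: the sample records and the helper names (`to_columns`, `total_amount_row_layout`, `total_amount_column_layout`) are made up for illustration, and real ORC, Parquet, or Avro files add encodings, stripes or row groups, and metadata on top of this idea.

```python
# Toy illustration of row vs. column layout; not how any real file format is implemented.
rows = [
    {"id": 1, "city": "Pune", "amount": 120.0},
    {"id": 2, "city": "Delhi", "amount": 80.5},
    {"id": 3, "city": "Pune", "amount": 42.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict that maps column name -> list of values."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)


def total_amount_row_layout(records):
    # Row layout: every whole record is visited even though only one field is needed.
    return sum(record["amount"] for record in records)


def total_amount_column_layout(cols):
    # Column layout: only the "amount" column is touched; "id" and "city" are never read.
    return sum(cols["amount"])


print(total_amount_row_layout(rows), total_amount_column_layout(columns))
```

Both functions return the same total, but the column layout reads a single list, which is why column-oriented formats favour analytical scans. Appending one record is the opposite case: the row layout adds one item, while the column layout has to touch every column, which matches the "costly write operations" entries above.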
Use cases for choosing the best file format and compression codec

| Scenario | Ideal file format | Ideal compression |
|---|---|---|
| Small data with a simple structure; no external dependencies; no nested or complex structures | Text | No compression required |
| More compression required; heavy analytical operations; system reads columns rather than rows | ORC | Snappy (higher disk usage, less CPU) or GZIP (lower disk usage, more CPU) |
| Heavy write operations; must support schema evolution; system reads entire rows rather than columns | Avro | Snappy (higher disk usage, less CPU) or GZIP (lower disk usage, more CPU) |
| Heavy analytical operations; system processes a set of columns instead of entire rows | Parquet | Snappy (higher disk usage, less CPU) or GZIP (lower disk usage, more CPU) |
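
The scenarios above can be turned into a small decision helper. The following is a minimal sketch, not a definitive rule set: `Workload`, `recommend_format`, `create_table_ddl`, and the `sales` table are hypothetical names introduced here. It assumes the `orc.compress` and `parquet.compression` table properties, and it maps the GZIP choice to ZLIB for ORC on the assumption that ZLIB is the closest ORC codec. Avro output compression is usually configured through session settings, so it is left out of the generated DDL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload; the field names are illustrative."""
    small_and_simple: bool = False
    write_heavy: bool = False
    needs_schema_evolution: bool = False
    reads_whole_rows: bool = False
    needs_max_compression: bool = False


def recommend_format(w: Workload) -> str:
    """Map a workload to a Hive STORED AS keyword, following the scenarios above."""
    if w.small_and_simple:
        return "TEXTFILE"
    if w.write_heavy or w.needs_schema_evolution or w.reads_whole_rows:
        return "AVRO"
    if w.needs_max_compression:
        return "ORC"
    return "PARQUET"  # column-oriented analytics without a hard compression target


def create_table_ddl(table: str, columns_sql: str, fmt: str, low_cpu: bool = True) -> str:
    """Build a CREATE TABLE statement; low_cpu=True picks Snappy, otherwise the denser codec."""
    base = f"CREATE TABLE {table} ({columns_sql}) STORED AS {fmt}"
    if fmt == "ORC":
        codec = "SNAPPY" if low_cpu else "ZLIB"  # assumption: ZLIB stands in for GZIP
        return f"{base} TBLPROPERTIES ('orc.compress'='{codec}')"
    if fmt == "PARQUET":
        codec = "SNAPPY" if low_cpu else "GZIP"
        return f"{base} TBLPROPERTIES ('parquet.compression'='{codec}')"
    return base  # TEXTFILE needs no codec; the Avro codec is not set in the DDL here


fmt = recommend_format(Workload(needs_max_compression=True))
print(create_table_ddl("sales", "id INT, amount DOUBLE", fmt, low_cpu=False))
```

The order of the checks is a design choice: row-oriented needs (heavy writes, schema evolution, whole-row reads) are tested before compression, so a write-heavy workload lands on Avro even if it also wants small files.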
Compression rate

| File format | Compression |
|---|---|
| TEXT | 0% |
| RC | 15% |
| ORC | 60% |
| AVRO | 30% |
| PARQUET | 25% |

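
As a rough worked example, the rates listed above can be used to estimate on-disk size. This sketch assumes that "N% compression" means the stored data is N% smaller than the same data in the text format; the post does not state the baseline, so treat the results as ballpark figures. The names `COMPRESSION_PCT` and `estimated_size_gb` are illustrative.

```python
# Compression rates as listed above, in percent (assumed to be relative to plain text).
COMPRESSION_PCT = {"TEXT": 0, "RC": 15, "ORC": 60, "AVRO": 30, "PARQUET": 25}


def estimated_size_gb(raw_gb: float, file_format: str) -> float:
    """Estimate the stored size in GB for raw_gb of text data in the given format."""
    pct = COMPRESSION_PCT[file_format.upper()]
    return raw_gb * (100 - pct) / 100


for name in COMPRESSION_PCT:
    print(f"{name}: {estimated_size_gb(100, name)} GB")
```

Under that assumption, 100 GB of text data would occupy about 40 GB as ORC and about 75 GB as Parquet. Actual ratios depend heavily on the data and on the codec chosen.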
