Hive File Compression Codecs

Compression Codecs supported by Hive

Snappy
GZip
BZip2
LZO
LZ4



Snappy

A fast compression and decompression library, optimized for processing speed rather than for minimizing file size
Offers a good trade-off between CPU usage and storage
Compressed outputs are splittable when used with ORC/Parquet file formats
Does not support splitting when used with plain text/CSV files
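
As a minimal HiveQL sketch, Snappy is typically enabled per table when the storage format supports it (the table and column names here are hypothetical):

  -- Minimal sketch: an ORC table compressed with Snappy
  -- (table/column names are hypothetical)
  CREATE TABLE sales_orc (
    id INT,
    amount DOUBLE
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY');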



GZip

One of the highest compression ratios of these codecs in terms of storage space
Produces smaller output files but needs more CPU (processing time) for compression and decompression
Con: output files are not splittable
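
A minimal sketch of writing GZip-compressed query output using Hadoop's built-in GzipCodec (the output directory and source table name are hypothetical):

  -- Minimal sketch: GZip-compressed text output from a query
  SET hive.exec.compress.output=true;
  SET mapreduce.output.fileoutputformat.compress=true;
  SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

  INSERT OVERWRITE DIRECTORY '/tmp/sales_gzip'
  SELECT * FROM sales;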



LZO

A lossless data compression codec with very fast decompression
Output files are splittable if they are indexed
The compression level (trading ratio against speed) is configurable without affecting decompression speed
During compression, LZO needs an additional buffer to create indexes, but no extra resources are needed during decompression
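
A sketch of an LZO-backed text table, assuming the third-party hadoop-lzo library is installed and on Hive's classpath (the table name is hypothetical):

  -- Minimal sketch: a text table stored as LZO-compressed files
  -- (requires the hadoop-lzo library; table name is hypothetical)
  CREATE TABLE sales_lzo (
    id INT,
    amount DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
  -- Splitting the .lzo files requires indexing them first, e.g. with
  -- hadoop-lzo's com.hadoop.compression.lzo.DistributedLzoIndexer job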



BZip2

High compression ratio, but needs more CPU time than the other codecs
Typically compresses files to roughly 10% to 15% of their original size
Compressed outputs are splittable
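
Hive reads .bz2 text files transparently, and they stay splittable; a minimal sketch (the table name and file path are hypothetical):

  -- Minimal sketch: BZip2-compressed text files remain splittable
  -- (table name and path are hypothetical)
  CREATE TABLE logs_text (line STRING)
  STORED AS TEXTFILE;

  LOAD DATA INPATH '/data/logs/app.log.bz2' INTO TABLE logs_text;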



LZ4

Fastest compression and decompression of these codecs, using the least CPU
Belongs to the LZ77 family of compression algorithms
Compressed outputs can be splittable when used inside container formats such as ORC or SequenceFile
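
A sketch of block-compressed SequenceFile output with Hadoop's built-in Lz4Codec, which keeps the output splittable (the table names are hypothetical):

  -- Minimal sketch: LZ4-compressed, block-level SequenceFile output
  -- (table names are hypothetical)
  SET hive.exec.compress.output=true;
  SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.Lz4Codec;
  SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

  CREATE TABLE sales_lz4 (id INT, amount DOUBLE)
  STORED AS SEQUENCEFILE;

  INSERT OVERWRITE TABLE sales_lz4
  SELECT id, amount FROM sales;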

Pros & Cons - Hive File formats

Format: Text
Pros: Lightweight with no dependencies; popular format for data exchange; suits datasets that have no schema or rely on CSV-based ingestion
Cons: Cannot handle nested data structures; no/minimal compression; slow to read and write

Format: RC
Pros: Better compression than text format; stores data as binary blobs; column oriented
Cons: Fixed column family; no/minimal compression; not intended to support transactions

Format: ORC
Pros: Best compression of all the formats; stores data as stripes of blobs; column oriented
Cons: Fixed column family; costly write operations; not intended to support transactions

Format: Avro
Pros: Supports schema evolution; language independent and high performance; best for heavy write operations; supports transactions
Cons: Row oriented; does not support Enum, Null, or Timestamp-with-zone types; slow serialization/deserialization; schema dependent

Format: Parquet
Pros: Best suited for analytical operations; supports complex and nested data types; column oriented
Cons: Costly write operations; not intended to support transactions
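
To make the comparison concrete, here is a sketch declaring the same hypothetical table in each format; only the STORED AS clause changes (STORED AS AVRO assumes Hive 0.14 or later):

  -- Minimal sketch: the same table declared in different Hive file formats
  -- (table/column names are hypothetical)
  CREATE TABLE events_text    (id INT, payload STRING) STORED AS TEXTFILE;
  CREATE TABLE events_rc      (id INT, payload STRING) STORED AS RCFILE;
  CREATE TABLE events_orc     (id INT, payload STRING) STORED AS ORC;
  CREATE TABLE events_avro    (id INT, payload STRING) STORED AS AVRO;
  CREATE TABLE events_parquet (id INT, payload STRING) STORED AS PARQUET;
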
Use cases for choosing the best file format and compression codec

Scenario: Small data with a simple structure; no external dependencies; no nested or complex structures
Ideal file format: Text
Ideal compression: No compression required

Scenario: More compression required; high analytical workload; the system reads columns rather than rows
Ideal file format: ORC
Ideal compression: Snappy (more disk usage, less CPU) or GZIP (less disk usage, more CPU)

Scenario: Heavy write operations; schema evolution must be supported; the system reads entire rows rather than columns
Ideal file format: Avro
Ideal compression: Snappy (more disk usage, less CPU) or GZIP (less disk usage, more CPU)

Scenario: Heavy analytical operations; the system processes sets of columns instead of entire rows
Ideal file format: Parquet
Ideal compression: Snappy (more disk usage, less CPU) or GZIP (less disk usage, more CPU)
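
As an example of applying one of these rows, a sketch of converting a hypothetical raw text table into Snappy-compressed Parquet with CTAS:

  -- Minimal sketch: convert a raw text table to Snappy-compressed Parquet
  -- (table names are hypothetical)
  CREATE TABLE sales_parquet
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
  AS
  SELECT * FROM sales_raw_text;
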
Compression rate

File Format: TEXT offers 0% compression
File Format: RC offers 15% compression
File Format: ORC offers 60% compression
File Format: AVRO offers 30% compression
File Format: PARQUET offers 25% compression
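
These percentages vary with the actual data; a sketch of checking the real on-disk size of a (hypothetical) table after loading:

  -- Minimal sketch: inspect the on-disk size of a table after loading
  -- (table name is hypothetical)
  ANALYZE TABLE sales_parquet COMPUTE STATISTICS;
  DESCRIBE FORMATTED sales_parquet;  -- look for totalSize under Table Parameters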
