Hive File Compression Codecs

Compression Codecs supported by Hive

Snappy
GZip
BZip2
LZO
LZ4



Snappy

A fast compression and decompression library, optimized for processing speed rather than for minimizing file size
Offers a good trade-off between CPU usage and storage
Compressed outputs are splittable when used with ORC/Parquet file formats
Does not support splitting when used with plain text/CSV files
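
As a minimal HiveQL sketch, Snappy is typically enabled per table when the storage format supports it (the table and column names here are hypothetical):

  -- Minimal sketch: an ORC table compressed with Snappy
  -- (table/column names are hypothetical)
  CREATE TABLE sales_orc (
    id INT,
    amount DOUBLE
  )
  STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'SNAPPY');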



GZip

One of the highest compression ratios of these codecs in terms of storage space
Produces smaller output files but needs more CPU (processing time) for compression and decompression
Con: output files are not splittable
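
A minimal sketch of writing GZip-compressed query output using Hadoop's built-in GzipCodec (the output directory and source table name are hypothetical):

  -- Minimal sketch: GZip-compressed text output from a query
  SET hive.exec.compress.output=true;
  SET mapreduce.output.fileoutputformat.compress=true;
  SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

  INSERT OVERWRITE DIRECTORY '/tmp/sales_gzip'
  SELECT * FROM sales;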



LZO

A lossless data compression codec with very fast decompression
Output files are splittable if they are indexed
The compression level (trading ratio against speed) is configurable without affecting decompression speed
During compression, LZO needs an additional buffer to create indexes, but no extra resources are needed during decompression
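
A sketch of an LZO-backed text table, assuming the third-party hadoop-lzo library is installed and on Hive's classpath (the table name is hypothetical):

  -- Minimal sketch: a text table stored as LZO-compressed files
  -- (requires the hadoop-lzo library; table name is hypothetical)
  CREATE TABLE sales_lzo (
    id INT,
    amount DOUBLE
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
  -- Splitting the .lzo files requires indexing them first, e.g. with
  -- hadoop-lzo's com.hadoop.compression.lzo.DistributedLzoIndexer job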



BZip2

High compression ratio, but needs more CPU time than the other codecs
Typically compresses files to roughly 10% to 15% of their original size
Compressed outputs are splittable
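
Hive reads .bz2 text files transparently, and they stay splittable; a minimal sketch (the table name and file path are hypothetical):

  -- Minimal sketch: BZip2-compressed text files remain splittable
  -- (table name and path are hypothetical)
  CREATE TABLE logs_text (line STRING)
  STORED AS TEXTFILE;

  LOAD DATA INPATH '/data/logs/app.log.bz2' INTO TABLE logs_text;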



LZ4

Fastest compression and decompression of these codecs, using the least CPU
Belongs to the LZ77 family of compression algorithms
Compressed outputs can be splittable when used inside container formats such as ORC or SequenceFile
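
A sketch of block-compressed SequenceFile output with Hadoop's built-in Lz4Codec, which keeps the output splittable (the table names are hypothetical):

  -- Minimal sketch: LZ4-compressed, block-level SequenceFile output
  -- (table names are hypothetical)
  SET hive.exec.compress.output=true;
  SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.Lz4Codec;
  SET mapreduce.output.fileoutputformat.compress.type=BLOCK;

  CREATE TABLE sales_lz4 (id INT, amount DOUBLE)
  STORED AS SEQUENCEFILE;

  INSERT OVERWRITE TABLE sales_lz4
  SELECT id, amount FROM sales;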

Pros & Cons - Hive File formats

Format: Text
Pros: Lightweight with no dependencies; popular format for data exchange; suits datasets that have no schema or rely on CSV-based ingestion
Cons: Cannot handle nested data structures; no/minimal compression; slow to read and write

Format: RC
Pros: Better compression than text format; stores data as binary blobs; column oriented
Cons: Fixed column family; no/minimal compression; not intended to support transactions

Format: ORC
Pros: Best compression of all the formats; stores data as stripes of blobs; column oriented
Cons: Fixed column family; costly write operations; not intended to support transactions

Format: Avro
Pros: Supports schema evolution; language independent and high performance; best for heavy write operations; supports transactions
Cons: Row oriented; does not support Enum, Null, or Timestamp-with-zone types; slow serialization/deserialization; schema dependent

Format: Parquet
Pros: Best suited for analytical operations; supports complex and nested data types; column oriented
Cons: Costly write operations; not intended to support transactions
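
To make the comparison concrete, here is a sketch declaring the same hypothetical table in each format; only the STORED AS clause changes (STORED AS AVRO assumes Hive 0.14 or later):

  -- Minimal sketch: the same table declared in different Hive file formats
  -- (table/column names are hypothetical)
  CREATE TABLE events_text    (id INT, payload STRING) STORED AS TEXTFILE;
  CREATE TABLE events_rc      (id INT, payload STRING) STORED AS RCFILE;
  CREATE TABLE events_orc     (id INT, payload STRING) STORED AS ORC;
  CREATE TABLE events_avro    (id INT, payload STRING) STORED AS AVRO;
  CREATE TABLE events_parquet (id INT, payload STRING) STORED AS PARQUET;
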
Use cases for choosing the best file format and compression codec

Scenario: Small data with a simple structure; no external dependencies; no nested or complex structures
Ideal file format: Text
Ideal compression: No compression required

Scenario: More compression required; high analytical workload; the system reads columns rather than rows
Ideal file format: ORC
Ideal compression: Snappy (more disk usage, less CPU) or GZIP (less disk usage, more CPU)

Scenario: Heavy write operations; schema evolution must be supported; the system reads entire rows rather than columns
Ideal file format: Avro
Ideal compression: Snappy (more disk usage, less CPU) or GZIP (less disk usage, more CPU)

Scenario: Heavy analytical operations; the system processes sets of columns instead of entire rows
Ideal file format: Parquet
Ideal compression: Snappy (more disk usage, less CPU) or GZIP (less disk usage, more CPU)
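
As an example of applying one of these rows, a sketch of converting a hypothetical raw text table into Snappy-compressed Parquet with CTAS:

  -- Minimal sketch: convert a raw text table to Snappy-compressed Parquet
  -- (table names are hypothetical)
  CREATE TABLE sales_parquet
  STORED AS PARQUET
  TBLPROPERTIES ('parquet.compression' = 'SNAPPY')
  AS
  SELECT * FROM sales_raw_text;
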
Compression rate

File Format: TEXT offers 0% compression
File Format: RC offers 15% compression
File Format: ORC offers 60% compression
File Format: AVRO offers 30% compression
File Format: PARQUET offers 25% compression
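
These percentages vary with the actual data; a sketch of checking the real on-disk size of a (hypothetical) table after loading:

  -- Minimal sketch: inspect the on-disk size of a table after loading
  -- (table name is hypothetical)
  ANALYZE TABLE sales_parquet COMPUTE STATISTICS;
  DESCRIBE FORMATTED sales_parquet;  -- look for totalSize under Table Parameters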
