Hive File formats

File formats/SerDes (Serializer/Deserializer) specify the underlying storage format of a Hive table.
Important factors to consider when choosing the right file format:
1. Fault tolerance
2. Consistency
3. A widely accepted format for sharing across applications
4. Support for fast analytical operations

(Row Object) ==> Serialization ==> (Output File Format) ==> Deserialization ==> (Row Object)

Text - STORED AS TEXTFILE
RC (Row Columnar) - STORED AS RCFILE
ORC (Optimized Row Columnar) - STORED AS ORC
Avro - STORED AS AVRO
Parquet - STORED AS PARQUET
Sequence - STORED AS SEQUENCEFILE
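As a sketch of how the shorthand works, a `STORED AS` clause expands to the SerDe and input/output format classes listed in the sections below. The table names here are hypothetical; the class names are the ones this post lists for ORC:

```sql
-- Shorthand: Hive picks the matching SerDe and I/O format classes.
CREATE TABLE logs_orc (id INT, msg STRING)
STORED AS ORC;

-- Equivalent long form, spelling out the same classes explicitly.
CREATE TABLE logs_orc_explicit (id INT, msg STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
```

Running `DESCRIBE FORMATTED logs_orc` shows which SerDe and input/output format classes a table actually uses.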



The Text file format is the default for Hive tables.
It is the de facto format for CSV, TSV (tab-separated) and other special-character-separated values.
Input format  - org.apache.hadoop.mapred.TextInputFormat
Output format - org.apache.hadoop.mapred.TextOutputFormat
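A minimal sketch of a CSV-backed text table (the table name and path are hypothetical):

```sql
-- TEXTFILE is the default, but it is shown explicitly here.
CREATE TABLE employees_txt (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Load a local CSV file into the table.
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees_txt;
```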



The RC file format stores data in binary key/value pairs.
The source file is partitioned into row splits (horizontally), and each row split is then partitioned column-wise (vertically).
For each row split, the key holds the split's metadata and the value holds the split's data.
Input format  - org.apache.hadoop.hive.ql.io.RCFileInputFormat
Output format - org.apache.hadoop.hive.ql.io.RCFileOutputFormat
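A short sketch of creating and populating an RC table, assuming a hypothetical existing text-format table `employees_txt`:

```sql
CREATE TABLE employees_rc (id INT, name STRING, salary DOUBLE)
STORED AS RCFILE;

-- Re-serializes the rows into RCFile's key/value row splits.
INSERT OVERWRITE TABLE employees_rc
SELECT * FROM employees_txt;
```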



ORC is the best of these formats in terms of file size compression.
It improves performance when reading, writing and processing data in Hive.
The input file is divided into multiple row groups called stripes.
Each stripe holds around 250 MB of data and has three parts:
the first holds index data, the second holds row data, and the last is the stripe footer.
The footer contains column-level aggregations such as count, min, max and sum for that stripe.
Compression is controlled with orc.compress="ZLIB" (default), "SNAPPY", or "NONE" (no compression).
Serde - org.apache.hadoop.hive.ql.io.orc.OrcSerde
Input format - org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
Output format - org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
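A sketch of an ORC table with the compression codec set through the table property mentioned above (table name is hypothetical):

```sql
CREATE TABLE employees_orc (id INT, name STRING, salary DOUBLE)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```

Without the TBLPROPERTIES clause, ORC falls back to its default ZLIB compression.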



Avro is a row-based binary file format and a language-independent format for sharing data across applications.
It is the format best suited to streaming data pipelines built with Kafka, etc.
It supports schema evolution, since it stores its metadata as a JSON schema that is flexible to modify.
Serde - org.apache.hadoop.hive.serde2.avro.AvroSerDe
Input format  - org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
Output format - org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
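One way to see the JSON-schema flexibility is to declare the table from an Avro schema literal; Hive then derives the columns from the schema. The record and field names below are illustrative — note the nullable `email` field with a default, the kind of addition that schema evolution allows without breaking old readers:

```sql
CREATE TABLE employees_avro
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal' = '{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id",    "type": "int"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}');
```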



Parquet is a columnar format and the best choice for data-analytics workloads.
It supports advanced data types and nested data structures.
Data is organized in column groups rather than row groups, so loading data for specific columns is faster.
Serde - org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Input format  - org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output format - org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
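A sketch of a Parquet table using the nested types mentioned above (names are hypothetical):

```sql
-- Nested structures: address is a struct, phones is an array.
CREATE TABLE customers_parquet (
  id      INT,
  name    STRING,
  address STRUCT<street:STRING, city:STRING>,
  phones  ARRAY<STRING>
)
STORED AS PARQUET;

-- Columnar layout means a query like this only needs to read the name column.
SELECT name FROM customers_parquet;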



The Sequence file format was developed to overcome the small-files problem in the Big Data world.
It combines the many small files generated by reducers into one big file.
The Sequence format also stores data in binary key/value pairs.
Input format  - org.apache.hadoop.mapred.SequenceFileInputFormat
Output format - org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
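A sketch of writing into a Sequence table, assuming a hypothetical source table `events_txt`; the compression settings shown are optional:

```sql
CREATE TABLE events_seq (id INT, payload STRING)
STORED AS SEQUENCEFILE;

-- Optionally compress the output; BLOCK compresses batches of key/value records.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;

INSERT OVERWRITE TABLE events_seq
SELECT * FROM events_txt;
```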
