Hive File formats

File formats/SerDes (Serializer/Deserializer) specify the underlying storage format of a Hive table.
Important factors to consider when choosing the right file format:
1. Fault tolerance
2. Consistency
3. A widely accepted format for sharing across applications
4. Support for fast analytical operations

(Row Object) ==> Serialization ==> (Output File Format) ==> Deserialization ==> (Row Object)

Text - STORED AS TEXTFILE
RC (Row Columnar) - STORED AS RCFILE
ORC (Optimized Row Columnar) - STORED AS ORC
Avro - STORED AS AVRO
Parquet - STORED AS PARQUET
Sequence - STORED AS SEQUENCEFILE
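As a sketch of how the shorthand works, a `STORED AS` clause expands to the SerDe and input/output format classes listed in the sections below. The table names here are hypothetical; the class names are the ones this post lists for ORC:

```sql
-- Shorthand: Hive picks the matching SerDe and I/O format classes.
CREATE TABLE logs_orc (id INT, msg STRING)
STORED AS ORC;

-- Equivalent long form, spelling out the same classes explicitly.
CREATE TABLE logs_orc_explicit (id INT, msg STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
```

Running `DESCRIBE FORMATTED logs_orc` shows which SerDe and input/output format classes a table actually uses.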



The Text file format is the default for Hive tables.
It is the de facto format for CSV, TSV (tab-separated) and other special-character-separated values.
Input format  - org.apache.hadoop.mapred.TextInputFormat
Output format - org.apache.hadoop.mapred.TextOutputFormat
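A minimal sketch of a CSV-backed text table (the table name and path are hypothetical):

```sql
-- TEXTFILE is the default, but it is shown explicitly here.
CREATE TABLE employees_txt (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- Load a local CSV file into the table.
LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees_txt;
```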



The RC file format stores data in binary key/value pairs.
The source file is partitioned into row splits (horizontally), and each row split is then partitioned column-wise (vertically).
For each row split, the key holds the split's metadata and the value holds the split's data.
Input format  - org.apache.hadoop.hive.ql.io.RCFileInputFormat
Output format - org.apache.hadoop.hive.ql.io.RCFileOutputFormat
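A short sketch of creating and populating an RC table, assuming a hypothetical existing text-format table `employees_txt`:

```sql
CREATE TABLE employees_rc (id INT, name STRING, salary DOUBLE)
STORED AS RCFILE;

-- Re-serializes the rows into RCFile's key/value row splits.
INSERT OVERWRITE TABLE employees_rc
SELECT * FROM employees_txt;
```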



ORC is the best of these formats in terms of file size compression.
It improves performance when reading, writing and processing data in Hive.
The input file is divided into multiple row groups called stripes.
Each stripe holds around 250 MB of data and has three parts:
the first holds index data, the second holds row data, and the last is the stripe footer.
The footer contains column-level aggregations such as count, min, max and sum for that stripe.
Compression is controlled with orc.compress="ZLIB" (default), "SNAPPY", or "NONE" (no compression).
Serde - org.apache.hadoop.hive.ql.io.orc.OrcSerde
Input format - org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
Output format - org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
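A sketch of an ORC table with the compression codec set through the table property mentioned above (table name is hypothetical):

```sql
CREATE TABLE employees_orc (id INT, name STRING, salary DOUBLE)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```

Without the TBLPROPERTIES clause, ORC falls back to its default ZLIB compression.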



Avro is a row-based binary file format and a language-independent format for sharing data across applications.
It is the format best suited to streaming data pipelines built with Kafka, etc.
It supports schema evolution, since it stores its metadata as a JSON schema that is flexible to modify.
Serde - org.apache.hadoop.hive.serde2.avro.AvroSerDe
Input format  - org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat
Output format - org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat
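One way to see the JSON-schema flexibility is to declare the table from an Avro schema literal; Hive then derives the columns from the schema. The record and field names below are illustrative — note the nullable `email` field with a default, the kind of addition that schema evolution allows without breaking old readers:

```sql
CREATE TABLE employees_avro
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal' = '{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id",    "type": "int"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}');
```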



Parquet is a columnar format and the best choice for data-analytics workloads.
It supports advanced data types and nested data structures.
Data is organized in column groups rather than row groups, so loading data for specific columns is faster.
Serde - org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Input format  - org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output format - org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
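A sketch of a Parquet table using the nested types mentioned above (names are hypothetical):

```sql
-- Nested structures: address is a struct, phones is an array.
CREATE TABLE customers_parquet (
  id      INT,
  name    STRING,
  address STRUCT<street:STRING, city:STRING>,
  phones  ARRAY<STRING>
)
STORED AS PARQUET;

-- Columnar layout means a query like this only needs to read the name column.
SELECT name FROM customers_parquet;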



The Sequence file format was developed to overcome the small-files problem in the Big Data world.
It combines the many small files generated by reducers into one big file.
The Sequence format also stores data in binary key/value pairs.
Input format  - org.apache.hadoop.mapred.SequenceFileInputFormat
Output format - org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
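A sketch of writing into a Sequence table, assuming a hypothetical source table `events_txt`; the compression settings shown are optional:

```sql
CREATE TABLE events_seq (id INT, payload STRING)
STORED AS SEQUENCEFILE;

-- Optionally compress the output; BLOCK compresses batches of key/value records.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;

INSERT OVERWRITE TABLE events_seq
SELECT * FROM events_txt;
```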
