hive-table-bucketing

hive-table-bucketing

December 30, 2022

Hive Table Bucketing

Hive Table Bucketing

What is Bucketing Hive Bucketing allows us to divide Partitioned data into further level called Buckets It helps table sampling more efficiently than non-bucketed tables on very large Datasets Bucketing increases performance, if Map side join performed on two sorted bucketing tables by efficient Merge sort Bucketing can be implemented by using CLUSTERED BY or DISTRIBUTE BY with SORT BY keywords on bucketing column CLUSTERED BY is shortcut for DISTRIBUTE BY WITH SORT BY on single column Partitioning creates directories, Bucketing creates files in each partitioned directory We can apply Bucketing with out Partitioning
Examples create external table emp_part_buck_tb(id int,fname string,lname string,email string,gender string,ip_address string) partitioned by (country string,state string,city string) clustered by(id) into 5 buckets row format delimited fields terminated by ',' lines terminated by ' ' stored as textfile location '/empstgnew344' TBLPROPERTIES("skip.header.line.count"="1"); insert into emp_part_buck_tb partition(country,state,city) select id,fname,lname,email,gender,ip_address,country, state,city from emp_staging; Country >>\|__State >>>>\|___City >>>>>>>>>\|___file1 >>>>>>>>>\|___file2 >>>>>>>>>\|___file3 >>>>>>>>>\|___file4 >>>>>>>>>\|___file5
Please ensure below property was enabled and then load data from the staging table set hive.enforce.bucketing=true; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; If necessary as per Data set hive.exec.max.dynamic.partitions={No. of desired partitions}; set hive.exec.max.dynamic.partitions.pernode={No. of desired partitions per node};
Examples `insert into table emp_partbuck_tb partition(gender) select id,first_name,last_name,email,company,gender from EMP_STG;`
Bucketing without Partitioning Examples create external table emp_buck_tb(id int,fname string,lname string,email string,gender string,ip_address string,city string,state string,country string) clustered by(id) into 5 buckets row format delimited fields terminated by ',' lines terminated by ' ' stored as textfile location '/empstgnew345' TBLPROPERTIES("skip.header.line.count"="1"); insert into emp_buck_tb select id,fname,lname,email,gender,ip_address, city, state,country from emp_staging; Data \|___file1 \|___file2 \|___file3 \|___file4 \|___file5
Table Sampling Queries Examples `select * from emp_buck_tb TABLESAMPLE(BUCKET 1 OUT OF 4); select * from emp_buck_tb TABLESAMPLE(BUCKET 2 OUT OF 4 on LNAME); select * from emp_buck_tb TABLESAMPLE(BUCKET 3 OUT OF 4 on RAND()); select * from emp_buck_tb TABLESAMPLE(BUCKET 4 OUT OF 4 on RAND()) LIMIT 2;`

Comments