hive-table-bucketing

Hive Table Bucketing
Hive Table Bucketing

What is Bucketing

  • Hive Bucketing allows us to divide Partitioned data into further level called Buckets
  • It helps table sampling more efficiently than non-bucketed tables on very large Datasets
  • Bucketing increases performance, if Map side join performed on two sorted bucketing tables by efficient Merge sort
  • Bucketing can be implemented by using CLUSTERED BY or DISTRIBUTE BY with SORT BY keywords on bucketing column
  • CLUSTERED BY is shortcut for DISTRIBUTE BY WITH SORT BY on single column
  • Partitioning creates directories, Bucketing creates files in each partitioned directory
  • We can apply Bucketing with out Partitioning


create external table emp_part_buck_tb(id int,fname string,lname string,email string,gender string,ip_address string)
partitioned by (country string,state string,city string)
clustered by(id) into 5 buckets
row format delimited
fields terminated by ','
lines terminated by '
'
stored as textfile
location '/empstgnew344'
TBLPROPERTIES("skip.header.line.count"="1");

insert into emp_part_buck_tb partition(country,state,city)
select id,fname,lname,email,gender,ip_address,country, state,city from emp_staging;

Country
>>|__State
>>>>|___City
>>>>>>>>>|___file1
>>>>>>>>>|___file2
>>>>>>>>>|___file3
>>>>>>>>>|___file4
>>>>>>>>>|___file5



  • Please ensure below property was enabled and then load data from the staging table
  • set hive.enforce.bucketing=true;
  • set hive.exec.dynamic.partition=true;
  • set hive.exec.dynamic.partition.mode=nonstrict;
  • If necessary as per Data
  • set hive.exec.max.dynamic.partitions={No. of desired partitions};
  • set hive.exec.max.dynamic.partitions.pernode={No. of desired partitions per node};


insert into table emp_partbuck_tb partition(gender)
select id,first_name,last_name,email,company,gender from EMP_STG;

Bucketing without Partitioning

create external table emp_buck_tb(id int,fname string,lname string,email string,gender string,ip_address string,city string,state string,country string)
clustered by(id) into 5 buckets
row format delimited
fields terminated by ','
lines terminated by '
'
stored as textfile
location '/empstgnew345'
TBLPROPERTIES("skip.header.line.count"="1");

insert into emp_buck_tb
select id,fname,lname,email,gender,ip_address, city, state,country from emp_staging;

Data
|___file1
|___file2
|___file3
|___file4
|___file5

Table Sampling Queries

select * from emp_buck_tb TABLESAMPLE(BUCKET 1 OUT OF 4);
select * from emp_buck_tb TABLESAMPLE(BUCKET 2 OUT OF 4 on LNAME);
select * from emp_buck_tb TABLESAMPLE(BUCKET 3 OUT OF 4 on RAND());
select * from emp_buck_tb TABLESAMPLE(BUCKET 4 OUT OF 4 on RAND()) LIMIT 2;

Comments

Popular posts from this blog

hadoop-installation-ubuntu

jenv-tool

hive-installation-in-ubuntu