hive-order-sort-distribute-cluster

Hive Order by, Sort by, Cluster by and Distribute by
Hive Order by, Sort by, Cluster by and Distribute by



"Order by" clause ensures the global sorting on the column(s) specified.
Very costly operation as all the data has to pass to single reducer irrespective of mappers count.
Syntax: select <col list> from <table name> ORDER BY <col_list> <asc or desc> limit <no. of rows> offset <offset value>



DATA
10
2
7
4
5
6
3
8
9
1
  
MAPPER-1
10
2
7
4
MAPPER-2
5
6
3
MAPPER-3
8
9
1
  
REDUCER-1
1
2
3
4
5
6
7
8
9
10
  
OUTPUT
1
2
3
4
5
6
7
8
9
10
  


"Sort by" clause used to sort on the column(s) specified on the local/reducer level.
If multiple reducers are in place, mutliple groups of sorted rows will be returned.
Syntax: select <col list> from <table name> SORT BY <col_list> <asc or desc> limit <no. of rows> offset <offset value>



DATA
10
2
7
4
5
6
3
8
9
1
  
MAPPER-1
10
2
7
4
MAPPER-2
5
6
3
MAPPER-3
8
9
1
  
REDUCER-1
2
4
7
10
REDUCER-2
3
5
6
REDUCER-3
1
8
9
  
OUTPUT
2
4
7
10
3
5
6
1
8
9
  


"Distribute by" clause controls the allocation of the map output among the reducers.
The Data will be grouped by the distributed column and each group will be sent a single and same sreducer.
But there is no gurantee of allocation, but normally partial allocations of same set of data to two different reducers is prevented.
Sorting is not possible on the data on the reducer level.
Syntax: select <col list> from <table name> DISTRIBUTE BY by <col_list> <asc or desc> limit <no. of rows> offset <offset value>



DATA
X,10
Y,2
X,7
Y,4
Z,5
X,1
Y,3
Z,8
Z,9
X,6
  
MAPPER-1
X,10
Y,2
X,7
Y,4
MAPPER-2
Z,5
X,1
Y,3
MAPPER-3
Z,8
Z,9
X,6
  
REDUCER-1
X,10
X,7
X,1
X,6
REDUCER-2
Y,2
Y,4
Y,3
REDUCER-3
Z,5
Z,8
Z,9
  
OUTPUT
X,10
X,7
X,1
X,6
Y,2
Y,4
Y,3
Z,5
Z,8
Z,9
  


"Cluster by" combines both "Distribute by" and "Sort by".
It ensures that grouping of the data by the specified column and each group is sorted and sent to same and single reducer.
A limitation that it distributes and sorts on same column. If we want to sort by different column then we need to use distribute and sort clauses.
Compressed outputs can be splittable.
Syntax: select <col list> from <table name> CLUSTER BY <col_list> <asc or desc> limit <no. of rows> offset <offset value>



DATA
X,10
Y,2
X,7
Y,4
Z,5
X,1
Y,3
Z,8
Z,9
X,6
  
MAPPER-1
X,10
Y,2
X,7
Y,4
MAPPER-2
Z,5
X,1
Y,3
MAPPER-3
Z,8
Z,9
X,6
  
REDUCER-1
X,1
X,6
X,7
X,10
REDUCER-2
Y,2
Y,3
Y,4
REDUCER-3
Z,5
Z,8
Z,9
  
OUTPUT
X,1
X,6
X,7
X,10
Y,2
Y,3
Y,4
Z,5
Z,8
Z,9
  
Compiled on SATURDAY, 27-JULY-2024, 10:01:08 AM IST

Comments

Popular posts from this blog

hadoop-installation-ubuntu

jenv-tool

hive-installation-in-ubuntu