Lets work on emp table.
Ranking Functions: These functions are used to find top(n) rows.
2. Rank(): Returns rank of each row within the partition of a result set. Based on partition condition, it provides rank for each record. For employees (7902,7788) salary is repeated and these records got same rank. But the next record got rank of 14. In general we should get next rank value of 13. For rank 12, we got 2 records. For Rank() we will get the next rank 14 (12+no of same repeated values (i.e 2) ).
hive> select empno,e.deptno,d.dname,e.sal,rank() over( order by sal ) rank from emp e join dept d on e.deptno = d.deptno;
3. Dense_Rank(): We will overcome above problems using this function. Returns rank of rows with in the partition of result set without any gaps. Rank of a row is one plus the number of distinct rank.
hive>select empno,e.deptno,d.dname,e.sal,dense_rank() over( order by sal ) rank from emp e join dept d on e.deptno = d.deptno;
4. Ntile(): Distributes records in an ordered partition into specified no of groups. Groups are started from 1 for each group.
hive>select empno,e.deptno,d.dname,e.sal,ntile(3) over(partition by e.deptno order by sal ) rank from emp e join dept d on e.deptno = d.deptno;
LAG: Displays preceeding employee number department wise. Default step value is 1. We can change this value by lag(<column>,<n>)
hive> select empno,e.deptno,d.dname,lag(empno) over (partition by e.deptno order by empno) followed_empno from emp e join dept d on e.deptno = d.deptno;
LEAD: Displays following employee number department wise. Default step value is 1. We can change this value by lead(<column>,<n>)
hive> select empno,e.deptno,d.dname,lead(empno) over (partition by e.deptno order by empno) followed_empno from emp e join dept d on e.deptno = d.deptno;
FIRST_VALUE: Returns first value of the window group.
hive>select empno,e.deptno,d.dname,first_value(empno) over (partition by e.deptno ) preceeding_empno from emp e join dept d on e.deptno = d.deptno;
LAST_VALUE: Returns last value of the window group.
hive> select empno,e.deptno,d.dname,last_value(empno) over (partition by e.deptno ) lastvalue from emp e join dept d on e.deptno = d.deptno;
1. Row_Number(): Returns sequential number of a row within a partition of the result set without any gap.
hive> select empno,e.deptno,d.dname,e.sal,row_number() over(partition by e.deptno order by sal ) rownum from emp e join dept d on e.deptno = d.deptno;
For empno (7902,7788) salary is same (3000). In general for this scenario, same number should repeat both the records.
2. Rank(): Returns rank of each row within the partition of a result set. Based on partition condition, it provides rank for each record. For employees (7902,7788) salary is repeated and these records got same rank. But the next record got rank of 14. In general we should get next rank value of 13. For rank 12, we got 2 records. For Rank() we will get the next rank 14 (12+no of same repeated values (i.e 2) ).
hive> select empno,e.deptno,d.dname,e.sal,rank() over( order by sal ) rank from emp e join dept d on e.deptno = d.deptno;
3. Dense_Rank(): We will overcome above problems using this function. Returns rank of rows with in the partition of result set without any gaps. Rank of a row is one plus the number of distinct rank.
hive>select empno,e.deptno,d.dname,e.sal,dense_rank() over( order by sal ) rank from emp e join dept d on e.deptno = d.deptno;
hive>select empno,e.deptno,d.dname,e.sal,ntile(3) over(partition by e.deptno order by sal ) rank from emp e join dept d on e.deptno = d.deptno;
5. Percent_Rank(): Calculates the relative rank of a row within a group of rows. Value between 0 and 1.
hive> select empno,e.deptno,d.dname,e.sal,percent_rank() over(partition by e.deptno order by sal) rank from emp e join dept d on e.deptno = d.deptno;
6. Cume_Dist(): Calculates the cumulative distribution of a value in a group of values.
hive>select empno,e.deptno,d.dname,e.sal,cume_dist() over(partition by e.deptno order by sal ) rank from emp e join dept d on e.deptno = d.deptno;