Tuesday 27 February 2018

Hive - Analytical Functions

Lets work on emp table.

LAG: Displays preceeding employee number department wise. Default step value is 1. We can change this value by lag(<column>,<n>)

hive> select empno,e.deptno,d.dname,lag(empno) over (partition by e.deptno order by empno) followed_empno from emp e join dept d on e.deptno = d.deptno;




LEAD: Displays following employee number department wise. Default step value is 1. We can change this value by lead(<column>,<n>)

hive> select empno,e.deptno,d.dname,lead(empno) over (partition by e.deptno order by empno) followed_empno from emp e join dept d on e.deptno = d.deptno;



FIRST_VALUE: Returns first value of the window group.

hive>select empno,e.deptno,d.dname,first_value(empno) over (partition by e.deptno ) preceeding_empno from emp e join dept d on e.deptno = d.deptno;


LAST_VALUE: Returns last value of the window group.

hive> select empno,e.deptno,d.dname,last_value(empno) over (partition by e.deptno ) lastvalue from emp e join dept d on e.deptno = d.deptno;



Ranking Functions: These functions are used to find top(n) rows.

1. Row_Number(): Returns sequential number of a row within a partition of the result set without any gap.

hive> select empno,e.deptno,d.dname,e.sal,row_number() over(partition by e.deptno order by sal ) rownum from emp e join dept d on e.deptno = d.deptno;

For empno (7902,7788) salary is same (3000). In general for this scenario, same number should repeat both the records. 


2. Rank(): Returns rank of each row within the partition of a result set. Based on partition condition, it provides rank for each record. For employees (7902,7788) salary is repeated and these records got same rank. But the next record got rank of 14. In general we should get next rank value of 13. For rank 12, we got 2 records. For Rank() we will get the next rank 14 (12+no of same repeated values (i.e 2) ).

hive> select empno,e.deptno,d.dname,e.sal,rank() over( order by sal ) rank from emp e join dept d on e.deptno = d.deptno;


3. Dense_Rank(): We will overcome above problems using this function. Returns rank of rows with in the partition of result set without any gaps. Rank of a row is one plus the number of distinct rank.

hive>select empno,e.deptno,d.dname,e.sal,dense_rank() over( order by sal ) rank from emp e join dept d on e.deptno = d.deptno;



4. Ntile(): Distributes records in an ordered partition into specified no of groups. Groups are started from 1 for each group.

hive>select empno,e.deptno,d.dname,e.sal,ntile(3) over(partition by e.deptno order by sal ) rank from emp e join dept d on e.deptno = d.deptno;


5. Percent_Rank(): Calculates the relative rank of a row within a group of rows. Value between 0 and 1.

hive> select empno,e.deptno,d.dname,e.sal,percent_rank() over(partition by e.deptno order by sal) rank from emp e join dept d on e.deptno = d.deptno;


6. Cume_Dist(): Calculates the cumulative distribution of a value in a group of values.

hive>select empno,e.deptno,d.dname,e.sal,cume_dist() over(partition by e.deptno order by sal ) rank from emp e join dept d on e.deptno = d.deptno;



Sunday 25 February 2018

Hive - Reading XML data

sample xml file.
Source file: emp.xml

<xml>
<employee>
<id>123</id>
<Name>Test1</Name>
<Branch>ECE</Branch>
  </employee>
<employee>
<id>453</id>
<Name>Test2</Name>
<Branch>EEE</Branch>
  </employee>
<employee>
<id>789</id>
<Name>Test3</Name>
<Branch>CSE</Branch>
  </employee>
<employee>
<id>100</id>
<Name>Test4</Name>
<Branch>IT</Branch>
  </employee>
</xml>


2. copy above jar file in lib of apache home folder in Hadoop.

3. Copy the given xml file to hdfs location. Here i Copied emp.xml file to /hive/externaldata/xml folder.
     hdfs dfs -put /home/emp.xml /hive/externaldata/xml

4. Login Hive, and run following command.

hive> create external table xml_file (id int,
Name string,
Branch string)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.id"="/employee/id/text()",
"column.xpath.Name"="/employee/Name/text()",
"column.xpath.Branch"="/employee/Branch/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/hive/externaldata/xml/'
TBLPROPERTIES (
"xmlinput.start"="<employee",
"xmlinput.end"="</employee>"
);

5. hive>select * from xml_file;

xml_file.id xml_file.name xml_file.branch
123 Test1 ECE
453 Test2 EEE
789 Test3 CSE
100 Test4 IT


Facebook