Sunday, 25 February 2018

Hive - Reading XML data

sample xml file.
Source file: emp.xml

<xml>
<employee>
<id>123</id>
<Name>Test1</Name>
<Branch>ECE</Branch>
  </employee>
<employee>
<id>453</id>
<Name>Test2</Name>
<Branch>EEE</Branch>
  </employee>
<employee>
<id>789</id>
<Name>Test3</Name>
<Branch>CSE</Branch>
  </employee>
<employee>
<id>100</id>
<Name>Test4</Name>
<Branch>IT</Branch>
  </employee>
</xml>


2. copy above jar file in lib of apache home folder in Hadoop.

3. Copy the given xml file to hdfs location. Here i Copied emp.xml file to /hive/externaldata/xml folder.
     hdfs dfs -put /home/emp.xml /hive/externaldata/xml

4. Login Hive, and run following command.

hive> create external table xml_file (id int,
Name string,
Branch string)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.id"="/employee/id/text()",
"column.xpath.Name"="/employee/Name/text()",
"column.xpath.Branch"="/employee/Branch/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
location '/hive/externaldata/xml/'
TBLPROPERTIES (
"xmlinput.start"="<employee",
"xmlinput.end"="</employee>"
);

5. hive>select * from xml_file;

xml_file.id xml_file.name xml_file.branch
123 Test1 ECE
453 Test2 EEE
789 Test3 CSE
100 Test4 IT


No comments:

Post a Comment

Facebook