Hi all,
We are loading data from a CSV file in HDFS (accessed using the Big Data File stage) into a HIVE table using the JDBC stage in DataStage 11.5. Load performance is very poor: it takes about 22 seconds to insert one record into the HIVE table. Can you please let us know what can be done to improve the performance of loading through the JDBC stage?
We suspect the data is being inserted one row at a time into the HIVE table, even though we set 2000 rows per transaction.
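For comparison, below is a minimal sketch (standalone Java over the HiveServer2 JDBC driver, not the JDBC stage itself) of the single-statement bulk load that per-row INSERTs are competing with: since the CSV is already in HDFS, one LOAD DATA INPATH statement moves the whole file into the table. The host, port, user, table name and path are placeholders, not values from our job.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveBulkLoad {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; the URL, user, table and path below are assumptions.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default", "etl", "");
             Statement stmt = con.createStatement()) {
            // Moves (not copies) the staged CSV from its HDFS location into the
            // table's directory in one operation, instead of row-by-row INSERTs
            // that each pay the full job-startup overhead.
            stmt.execute("LOAD DATA INPATH '/user/etl/staging/customers.csv' "
                       + "INTO TABLE customers");
        }
    }
}

If your version of the JDBC stage exposes before/after SQL properties, the same statement could presumably be issued from there rather than from standalone code.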
Thanks in advance.
Loading data from HDFS file into HIVE table using Datastage
Which distribution of Hadoop are you using? From what I can gather, the BigData File stage is primarily aimed at IBM's BigInsights, and I'd imagine there may be issues when interacting with other distributions.
Have you tried using the File Connector stage instead? WebHDFS/HTTPFS is standard with most HDFS versions I think?
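If the File Connector is not an option, here is a minimal sketch of what a WebHDFS write looks like underneath, using nothing but HttpURLConnection: the NameNode is asked for a write location and answers with a redirect to a DataNode, which then receives the file content. The host, port, user and paths are placeholders, and the default HTTP port (50070) may differ on your distribution.

import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WebHdfsPut {
    public static void main(String[] args) throws IOException {
        String nameNode = "http://namenode.example.com:50070";  // assumed host/port
        String hdfsPath = "/user/etl/staging/customers.csv";    // assumed target path
        String localCsv = "customers.csv";                      // assumed local file

        // Step 1: ask the NameNode where to write; it answers with a 307
        // redirect whose Location header points at a DataNode.
        URL createUrl = new URL(nameNode + "/webhdfs/v1" + hdfsPath
                + "?op=CREATE&overwrite=true&user.name=etl");
        HttpURLConnection nn = (HttpURLConnection) createUrl.openConnection();
        nn.setRequestMethod("PUT");
        nn.setInstanceFollowRedirects(false);
        String dataNodeUrl = nn.getHeaderField("Location");
        nn.disconnect();

        // Step 2: PUT the file content to the DataNode URL returned above.
        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dn.setRequestMethod("PUT");
        dn.setDoOutput(true);
        try (OutputStream out = dn.getOutputStream()) {
            Files.copy(Paths.get(localCsv), out);
        }
        System.out.println("WebHDFS responded: " + dn.getResponseCode()); // 201 = created
    }
}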
Hi, I have been facing a similar issue. I am using the Hive Connector stage to load and extract data.
However, the speed is dismal. Is there something we can do to improve the performance of loading into Hive? Having said that, I don't expect Hive loading to be as fast as a conventional database, since Hive is a database-like interface rather than a database in the typical sense: beneath the surface there are MapReduce jobs doing the work.
Nevertheless, do we know of some ways to get this tuned? I see an array size property in the ODBC stage but not in the native Hive Connector stage.
Any info on fine-tuning performance here would be really helpful.
I have talked with other customers who use the File Connector exclusively for loading, writing directly to the HDFS file that Hive is abstracting, precisely for performance reasons.
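Stripped down to the plain Hadoop FileSystem API, that approach looks roughly like the sketch below: the extract file is copied straight under the directory the Hive table points at, so no per-row INSERT traffic is involved. The cluster URI, local file name and warehouse path are placeholders, not anything from a specific job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDirectWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed cluster URI

        try (FileSystem fs = FileSystem.get(conf)) {
            // Copy the extract straight under the directory the Hive table
            // (external or managed) points at; the next query sees the rows
            // without any per-row inserts.
            fs.copyFromLocalFile(
                    new Path("customers.csv"),
                    new Path("/apps/hive/warehouse/customers/customers.csv"));
        }
    }
}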
Ernie
Ernie Ostic
blogit!
Open IGC is Here! (https://dsrealtime.wordpress.com/2015/0 ... ere/)
We use the BigData stage in a job to load data to HDFS and then use a script to create the HIVE table with the correct partitions. We store data in a /folder/structure/for_Hive/tableName/yyyy/mm/dd folder format, and the HIVE tables are partitioned on year, month and day. Both loading to HDFS and creating the HIVE table are executed from a job sequence.
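For anyone wanting to reproduce that pattern outside the script, a rough sketch of the "create the table, then register the loaded folder as a partition" step over Hive JDBC might look like this; the table name, columns and the specific yyyy/mm/dd values are assumptions based on the folder layout above, not the actual job design.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveAddPartition {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver.example.com:10000/default", "etl", "");
             Statement stmt = con.createStatement()) {
            // External table partitioned the same way the HDFS folders are laid out.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS tableName ("
                       + "  id INT, name STRING) "
                       + "PARTITIONED BY (yyyy STRING, mm STRING, dd STRING) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                       + "LOCATION '/folder/structure/for_Hive/tableName'");
            // Point one partition at the folder the DataStage job just loaded.
            stmt.execute("ALTER TABLE tableName ADD IF NOT EXISTS "
                       + "PARTITION (yyyy='2016', mm='01', dd='15') "
                       + "LOCATION '/folder/structure/for_Hive/tableName/2016/01/15'");
        }
    }
}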
Thanks
Karthick