Read zip file

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

eli.nawas_AUS
Premium Member
Posts: 39
Joined: Tue Apr 15, 2014 9:14 am

Read zip file

Post by eli.nawas_AUS »

I would like to be able to read a compressed (probably gzip, maybe other) file directly into DataStage from Hadoop. What is the correct way to do this? I have tried the Expand stage but have not been able to make it work; I get the error "The decode operator is being used on a non-encoded data set".

What is the best way to do this?
rameshrr3
Premium Member
Posts: 609
Joined: Mon May 10, 2004 3:32 am
Location: BRENTWOOD, TN

Post by rameshrr3 »

Can you try the External Source stage? I think that's the best bet. Did you try the Decode stage? Although I think it works on decoding compressed data sets, not sure about files.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Hadoop does not store compressed files in this sense. How are the data being retrieved from Hadoop and written into an archive? Yes, you could use an External Source stage.

If you have version 8.7 or later, why not use the Big Data File stage to access data from Hadoop directly? (Actually it uses MapReduce under the covers, but this is transparent.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
eli.nawas_AUS
Premium Member
Posts: 39
Joined: Tue Apr 15, 2014 9:14 am

Post by eli.nawas_AUS »

I had not tried the External Source stage. That seems to be working, with this command:

hadoop dfs -cat /DL/INCOMING/GOOGLE_DFA/temp/aaa.gz | gunzip -c

I was wondering why the Expand stage does not seem to work, though. The zipped files are simply put onto Hadoop for storage; we don't want to unzip them, given the amount of space these files take up. I was using the BDFS stage to read from Hadoop. I also tried putting the zipped file on the server and reading it with a Sequential File stage, and that does not work either.
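A quick way to sanity-check that pipeline outside DataStage: `hadoop dfs -cat` just streams the raw .gz bytes to stdout, so cat-ing a local gzip file through `gunzip -c` exercises the same pipe. This is only a local sketch with made-up file names, not the HDFS path from the job.

```shell
# Build a throwaway gzip file and stream it through gunzip -c,
# standing in for: hadoop dfs -cat /DL/.../aaa.gz | gunzip -c
tmpdir=$(mktemp -d)
printf 'id,name\n1,alice\n' > "$tmpdir/aaa"
gzip -c "$tmpdir/aaa" > "$tmpdir/aaa.gz"
cat "$tmpdir/aaa.gz" | gunzip -c   # prints the original records
rm -r "$tmpdir"
```

If this prints the original rows on the command line, the same `| gunzip -c` filter should hand uncompressed records to the External Source stage.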
eli.nawas_AUS
Premium Member
Posts: 39
Joined: Tue Apr 15, 2014 9:14 am

Post by eli.nawas_AUS »

It looks to me like the Expand stage can only unpack a file that was created in DataStage's own compressed format; is that correct? It cannot unzip an ordinary gzip file? I've tried using the Sequential File stage and the Data Set stage to read the zipped file and send it to Expand, and neither works.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Check the documentation. It explicitly states that it only works with data sets:
The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously compressed data set back into a sequence of records from a stream of raw binary data. The complement to the Expand stage is the Compress stage.
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

How about gunzip as a filter command on a Sequential File stage? That should work fine, even in version 8.x.
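Conceptually, setting the Sequential File stage's Filter property to `gunzip -c` means the stage feeds the raw file bytes to the filter's stdin and parses its stdout as records. A rough command-line equivalent, using a throwaway gzip file (all names here are invented for the demo):

```shell
# Emulate "Sequential File stage + Filter = gunzip -c":
# the decompressed stream on stdout is what the stage would parse.
f="$(mktemp -u).gz"
printf 'row1\nrow2\n' | gzip -c > "$f"
gunzip -c < "$f"                   # this stream becomes the stage's record input
rm "$f"
```

So any decompressor that reads stdin and writes stdout (`gunzip -c`, `bzip2 -dc`, etc.) should in principle work as the filter, which covers the "maybe other" compression formats mentioned earlier.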
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.