I would like to be able to read a compressed (probably gzip, maybe other formats) file directly into DataStage from Hadoop. What is the correct way to do this? I have tried the Expand stage but have not been able to make it work (I get the error "The decode operator is being used on a non-encoded data set").
Read zip file
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Hadoop does not store compressed files in this sense. How are the data being retrieved from Hadoop and written into an archive? Yes, you could use an External Source stage.
If you have version 8.7 or later, why not use the Big Data File stage to access data from Hadoop directly? (Actually it uses MapReduce under the covers, but this is transparent.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
I had not tried the External Source stage. That seems to be working, with the command:
hadoop dfs -cat /DL/INCOMING/GOOGLE_DFA/temp/aaa.gz | gunzip -c
I was wondering why the Expand stage does not seem to work, though. The gzipped files are simply put onto Hadoop for storage; we don't want to unzip them, given the amount of space these files take up. I was using the BDFS stage to read from Hadoop. I also tried just having the gzipped file on the server and using a Sequential File stage, which also does not work.
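For anyone trying the same thing, the pattern above can be sanity-checked locally before wiring it into the External Source stage's Source Program property. This is a minimal sketch: the `/tmp/aaa.gz` file and its contents are made up for illustration, and only the cat-then-gunzip piping is being demonstrated (the `hadoop dfs -cat` part obviously needs a Hadoop client and the real HDFS path):

```shell
#!/bin/sh
# Build a small gzip file standing in for the archive stored on HDFS.
printf 'col1,col2\n1,2\n' | gzip -c > /tmp/aaa.gz

# Same pattern as the External Source command, with the local file
# replacing "hadoop dfs -cat <hdfs-path>": stream the compressed bytes
# and decompress them to stdout so DataStage sees plain records.
cat /tmp/aaa.gz | gunzip -c
```

In the job itself, the whole pipeline (hadoop dfs -cat piped into gunzip -c) goes into the stage's source command, so the records arrive already decompressed and no Expand stage is needed.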
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
It looks to me like the Expand stage can only unpack a file that has been created in DataStage's own compressed format. Is that correct? It cannot unzip an ordinary gzip file? I've tried using a Sequential File stage and a Data Set stage to read the gzipped file and send it to Expand, and it doesn't seem to work.
Check the documentation. It explicitly states it only works with data sets:
The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously compressed data set back into a sequence of records from a stream of raw binary data. The complement to the Expand stage is the Compress stage.
-craig
"You can never have too many knives" -- Logan Nine Fingers