I would like to be able to read a compressed (probably gzip, maybe other formats) file directly into DataStage from Hadoop. What is the correct way to do this? I have tried the Expand stage but have not been able to make it work (I get the error "The decode operator is being used on a non-encoded data set").
Read zip file
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Hadoop does not store compressed files in this sense. How are the data being retrieved from Hadoop and written into an archive? Yes, you could use an External Source stage.
If you have version 8.7 or later, why not use the Big Data File stage to access data from Hadoop directly? (Actually it uses MapReduce under the covers, but this is transparent.)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
I had not tried the External Source stage. That seems to be working, with the command:
hadoop dfs -cat /DL/INCOMING/GOOGLE_DFA/temp/aaa.gz | gunzip -c
I was wondering why the Expand stage does not seem to work, though. The gzipped files are simply put onto Hadoop for storage; we don't want to unzip them, given the amount of space these files take up. I was using the BDFS stage to read from Hadoop. I also tried just having the gzipped file on the server and using a Sequential File stage, which also does not work.
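For anyone trying the same thing, the pattern above can be sanity-checked locally before wiring it into the External Source stage's Source Program property. This is a minimal sketch: the `/tmp/aaa.gz` file and its contents are made up for illustration, and only the cat-then-gunzip piping is being demonstrated (the `hadoop dfs -cat` part obviously needs a Hadoop client and the real HDFS path):

```shell
#!/bin/sh
# Build a small gzip file standing in for the archive stored on HDFS.
printf 'col1,col2\n1,2\n' | gzip -c > /tmp/aaa.gz

# Same pattern as the External Source command, with the local file
# replacing "hadoop dfs -cat <hdfs-path>": stream the compressed bytes
# and decompress them to stdout so DataStage sees plain records.
cat /tmp/aaa.gz | gunzip -c
```

In the job itself, the whole pipeline (hadoop dfs -cat piped into gunzip -c) goes into the stage's source command, so the records arrive already decompressed and no Expand stage is needed.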
-
- Premium Member
- Posts: 39
- Joined: Tue Apr 15, 2014 9:14 am
It looks to me like the Expand stage can only unpack a file that has been created in DataStage's own compressed format. Is that correct? It cannot unzip an ordinary gzip file? I've tried using a Sequential File stage and a Data Set stage to read the gzipped file and send it to Expand, and it doesn't seem to work.
Check the documentation. It explicitly states it only works with data sets:
The Expand stage uses the UNIX uncompress or GZIP utility to expand a data set. It converts a previously compressed data set back into a sequence of records from a stream of raw binary data. The complement to the Expand stage is the Compress stage.
-craig
"You can never have too many knives" -- Logan Nine Fingers