Page 1 of 1

How to read comma delimited files with missing column values

Posted: Mon Jan 22, 2018 9:59 am
by k.prashanthi
Hello,

I have a new requirement to work with comma delimited files with no quotes for blank values but I have never come across with this kind of files to work with. For example:

D,0,SB,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
D,1,IN,,0,0,0,0,90000000,0,0,0,0,0,0,0,0,0,0
D,4,FEDOUT,,0,0,0,0,90000000,0,0,0,0,0,0,0
D,6,EB,,0,0,0,0,0,0
D,7,DEP,LOC,0,0,0,0,90000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,90000000

How to read this kind of files??

TIA.

Posted: Mon Jan 22, 2018 10:07 am
by PaulVL
Does the Sequential File stage with nullable columns not work for you?

Posted: Mon Jan 22, 2018 10:36 am
by k.prashanthi
Hi Paul,

No. It did not work. Any other ideas?

Posted: Mon Jan 22, 2018 12:46 pm
by chulett
Specifics, please. What exactly "did not work" about it?

Posted: Mon Jan 22, 2018 4:32 pm
by BillB
The server Sequential File stage will read a CSV file like your example. Set 'Outputs-Format-Missing columns action' to 'Map empty string', and 'Outputs-Columns-Incomplete Column' to 'Replace'.

If you need your CSV file to be an input to a parallel job, consider writing a pre-processor server job that reads your example CSV and writes a second CSV that becomes the parallel job input.

Posted: Mon Jan 22, 2018 6:56 pm
by chulett
Right, was going to get there. :wink:

Another option, a Server Shared Container in your PX job that just handles the sequential file read.

Posted: Wed Jan 24, 2018 12:54 pm
by k.prashanthi
BillB,

That would work but we don't have server job as we have Data Quality license only. Any other ideas on this?

Posted: Wed Jan 24, 2018 5:01 pm
by PaulVL
if all else fails, you could modify the input file.

https://askubuntu.com/questions/20414/f ... g-commands

Command:
sed -i 's/original/new/g' file.txt


SO something like this:
sed -i 's/,,/,"",/g' file.txt

Posted: Sun Jan 28, 2018 7:06 pm
by ray.wurlod
And the sed command that Paul suggested could even be a Filter in the Sequential File stage.
Or executed via a before-job subroutine.