Recursive Row Split and Append

gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

I have a file containing a single column, with values such as:

Code:

x
av
y
b
cc
mmmm
a
x
mmmm
lt
mmmm
z
mmmm
ax

The file needs to be parsed so that the input is split into 2 files at the point where the value 'mmmm' is first encountered:
1st file
--------

Code:

x
av
y
b
cc
mmmm

and the 2nd file
--------

Code:

a
x
mmmm
lt
mmmm
z
mmmm
ax

In the second pass, the 2nd file becomes the input and is processed in the same manner, except that after splitting at 'mmmm', the first part is appended to the 1st file produced during the first pass:
1st file
--------

Code:

x
av
y
b
cc
mmmm
a
x
mmmm

and the 2nd file
--------

Code:

lt
mmmm
z
mmmm
ax

The process of splitting the 2nd file and appending the first part of the split to the 1st file continues until no 'mmmm' values remain. So, in the example, the final output will be:
1st file
--------

Code:

x
av
y
b
cc
mmmm
a
x
mmmm
lt
mmmm
z
mmmm
ax

which is the same as the original input file, and the 2nd file's content is irrelevant since there are no more 'mmmm' values to process.
How can I achieve this, especially if the number of occurrences of 'mmmm' is on the order of hundreds?

gateleys
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

I think I need to digest this a bit, but it doesn't seem (at first reading) that you need anything recursive or even multiple passes, since the 2nd file is of no consequence to you.

Can you look at the problem as if it were a machine with two "states", one before the mode switch and one after? From your example, your output is identical to your input, so I'm not 100% sure what you are doing.
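
To make the two-state idea concrete, here is a minimal sketch in Python rather than DataStage. It assumes the rows are already in memory as a list of strings and that the marker row stays with the first part, as in the example above. The function name `split_at_first_marker` is made up for illustration.

```python
from typing import Iterable, List, Tuple

MARKER = "mmmm"  # delimiter value taken from the example


def split_at_first_marker(rows: Iterable[str],
                          marker: str = MARKER) -> Tuple[List[str], List[str]]:
    """Split rows into (top, bottom) at the first occurrence of marker.

    The marker row itself is kept in the top part.
    """
    top: List[str] = []
    bottom: List[str] = []
    seen_marker = False  # state: False = before the switch, True = after
    for row in rows:
        if seen_marker:
            bottom.append(row)
        else:
            top.append(row)
            if row == marker:
                seen_marker = True  # the one and only state change
    return top, bottom
```

The whole split is done in one pass, and the only thing carried from row to row is the single flag.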
dls
Premium Member
Posts: 96
Joined: Tue Sep 09, 2003 5:15 pm

Post by dls »

It's not April 1st yet, so....

Under what circumstance would the final output differ from the initial input?
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

The example that I gave was meant to hide the other complexities of my logic from you guys and just show the part of the process where I was having a problem. In my job, the input file and the output file WILL BE different. But, right now, I just wanted help in splitting the file and appending the parts again to get back to the original file based on the logic (example).

Of course, I will be performing other tasks with the split files in a way that will result in a different output file.

gateleys
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

As you've described the algorithm the output file must end up exactly the same as the input file. Can you clarify the algorithm to show how it might be different?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
djm
Participant
Posts: 68
Joined: Wed Mar 02, 2005 3:42 am
Location: N.Z.

Post by djm »

Probably a dopey question, given that you have flagged your server as Windows, but just in case you have flagged your DataStage client as Windows while your DataStage server is UNIX, there is a simple solution. Alternatively, if you are using a DataStage Windows server, you could install the Microsoft product "Windows Services for UNIX" (a.k.a. SFU), which is a free, but big, download available at the Microsoft web site. That will give you a pretty much full UNIX environment on a Windows box.

The UNIX command to do this is csplit. If you want more details, try issuing the command man csplit at the UNIX command prompt.
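
For anyone without a UNIX prompt handy, the general idea of cutting the input into pieces at every marker can be approximated in a few lines of Python. This is only a sketch of the idea, not a reimplementation of csplit's exact behaviour or options. It assumes the marker row closes each piece, as in the original example, and the function name `split_into_pieces` is made up for illustration.

```python
from typing import Iterable, List


def split_into_pieces(rows: Iterable[str], marker: str = "mmmm") -> List[List[str]]:
    """Cut rows into consecutive pieces, each ending at a marker row.

    Rows after the last marker (if any) form a final piece without a marker.
    """
    pieces: List[List[str]] = []
    current: List[str] = []
    for row in rows:
        current.append(row)
        if row == marker:
            pieces.append(current)  # close the piece at the marker
            current = []
    if current:
        pieces.append(current)      # trailing rows after the last marker
    return pieces
```

Each piece could then be written to its own file, or processed and appended to a single output.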

David
(Previously known as D)

Be altruistic and donate your spare CPU cycles to research. http://www.worldcommunitygrid.org/team/ ... TZ9H4CGVP1
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

ray.wurlod wrote: As you've described the algorithm the output file must end up exactly the same as the input file. Can you clarify the algorithm to show how it might be different?

Ray, what I need to do is append the content of file2 (flat, with 1 column) to the tail of file1 (flat, with the same column definition as file2) based on a match on a 'known' string in file1. The logic is:
1. If the input (file1) contains the string 'mmmm', then write all rows with @INROWNUM <= the row number of 'mmmm' to TOPoutfile.
2. Write all rows below 'mmmm' to BOTTOMoutfile.
(Essentially, this splits file1 into top and bottom parts, based on 'mmmm'.)
3. Then, append file2 (the file to be included) to TOPoutfile.
4. Keep scanning (like fractals) BOTTOMoutfile for the string 'mmmm', and perform the split-and-append (as discussed in steps 1-3) until there are no more 'mmmm' left.
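
Read literally, steps 1-4 seem to amount to copying file1 and inserting the content of file2 after every 'mmmm' row. The sketch below shows that reading in Python rather than DataStage. It assumes each top part is appended to TOPoutfile in order and that any rows left after the last 'mmmm' end up at the tail. The function name `insert_after_marker` is made up for illustration.

```python
from typing import Iterable, Iterator, List


def insert_after_marker(rows: Iterable[str], include_rows: List[str],
                        marker: str = "mmmm") -> Iterator[str]:
    """Yield every input row, followed by include_rows after each marker row."""
    for row in rows:
        yield row
        if row == marker:
            # Equivalent to "append file2 to TOPoutfile" after each split.
            yield from include_rows
```

For example, `list(insert_after_marker(["a", "mmmm", "b"], ["inc1", "inc2"]))` gives `["a", "mmmm", "inc1", "inc2", "b"]`. If this reading is right, a single pass covers any number of 'mmmm' occurrences.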

So, if you look at this, it might 'look' dopey to some, but I posted it for a reason. I had only mentioned appending of the BOTTOMoutfile in my previous posting, which might have caused the confusion.

Further, it would be great if I could toggle the filenames of file1 and BOTTOMoutfile for every subsequent run so that I could handle all occurrences of the comparison string 'mmmm'. (Hence, my other post) :) .

I have actually managed to split the file into TOP and BOTTOM, and append the to-be-included file to the TOP file. The only problem I have is that the number of occurrences of 'mmmm' can vary from a few hundred to a thousand. Thus, this design calls for multiple runs of the same job with the 2 filenames toggling.

Thanks.
gateleys
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Recursion is multiple runs anyway. Why not keep it simple? Use StartLoop and EndLoop activities in a job sequence, toggle the file names as advised in the other post, and provide an appropriate exit condition.
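
The loop structure can be sketched outside DataStage as follows. This is Python, not a job sequence, and only illustrates the control flow: `run_one_pass` is a hypothetical stand-in for one run of the split job, the work file names are made up, and the exit condition assumed here is "no 'mmmm' found in this pass".

```python
import shutil
from pathlib import Path

MARKER = "mmmm"  # delimiter value taken from the example


def run_one_pass(in_path: Path, bottom_path: Path, top_path: Path,
                 marker: str = MARKER) -> bool:
    """Stand-in for one run of the split job (hypothetical helper).

    Appends rows up to and including the first marker to top_path and
    writes the remaining rows to bottom_path. Returns True if the marker
    was found. If it is not found, every row is appended to top_path.
    """
    found = False
    with in_path.open() as src, top_path.open("a") as top, \
            bottom_path.open("w") as bottom:
        for line in src:
            if found:
                bottom.write(line)
            else:
                top.write(line)
                if line.rstrip("\n") == marker:
                    found = True
    return found


def split_and_append(input_path: Path, top_path: Path, work_dir: Path,
                     max_passes: int = 10_000) -> int:
    """Run passes until no marker is left; return the number of passes."""
    current = work_dir / "work_a.txt"
    spare = work_dir / "work_b.txt"
    shutil.copyfile(input_path, current)   # never modify the original input
    top_path.write_text("")                # start with an empty output file
    for passes in range(1, max_passes + 1):
        found = run_one_pass(current, spare, top_path)
        if not found:                      # exit condition: no marker this pass
            return passes
        current, spare = spare, current    # toggle the two file names
    raise RuntimeError("marker still present after max_passes passes")
```

With the 14-row example input (four 'mmmm' rows), this runs five passes: four that find a marker and a last one that flushes the remaining row. The `max_passes` guard is there only so that a mistake in the exit condition cannot loop forever.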
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

ray.wurlod wrote: Use StartLoop and EndLoop activities in a job sequence, toggle the file names as advised in the other post, and provide an appropriate exit condition.

Thanks Ray, but StartLoop and EndLoop stages are not available in DS7.0. Is there any other way to run the same job, say n number of times, with varying filenames?

gateleys
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Gateleys,

You can create a server job, leave the canvas empty, and write your loop in DS/BASIC in the Job Control portion. Use the drop-down list at the top to create the skeleton code to call a job (it automagically will create the attach, set params, reset & run, wait for completion and close for your called job(s)). You can then change the parameter values for the filename at each loop iteration.
gateleys
Premium Member
Posts: 992
Joined: Mon Aug 08, 2005 5:08 pm
Location: USA

Post by gateleys »

Arnd, Thanks.
Yeah, that seems to be the only option that I have now. Will get back with the results.

gateleys