Recursive Row Split and Append

gateleys · Post by **gateleys** » Tue Mar 28, 2006 11:00 am

I have a file containing a single column with values such as-

x
av
y
b
cc
mmmm
a
x
mmmm
lt
mmmm
z
mmmm
ax

The file needs to be parsed such that the input is split into 2 files when the value 'mmmm' is encountered for the first time-
1st file
--------

x
av
y
b
cc
mmmm

and the 2nd file
------------

a
x
mmmm
lt
mmmm
z
mmmm
ax

In the second pass, the 2nd file becomes the input and is processed in the same manner, except that after splitting at 'mmmm', the first part is appended to the first file produced during the first pass-
1st file
--------

x
av
y
b
cc
mmmm
a
x
mmmm

and 2nd file
--------

lt
mmmm
z
mmmm
ax

The process of splitting the 2nd file and appending the first part of the split to the 1st file goes on until there is no more 'mmmm' value. So, in the example, the final output will be-
1st file
--------

x
av
y
b
cc
mmmm
a
x
mmmm
lt
mmmm
z
mmmm
ax

which is the same as the original input file, and the 2nd file's content is irrelevant since there are no more 'mmmm' values to process.
How can I achieve this, especially if the number of occurrences of 'mmmm' is on the order of hundreds?

gateleys

ArndW · Post by **ArndW** » Tue Mar 28, 2006 11:18 am

I think I need to digest this a bit, but it doesn't seem (at first reading) that you need anything recursive or even multiple pass - since the 2nd file is of no consequence to you.

Can you look at the problem as if it were a machine with two "states" - one before the mode switch, and one after? From your example your output is identical to your input, so I'm not 100% sure of what you are doing.
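
A minimal sketch of the two-state idea, written in Python rather than DataStage. It assumes the rows are already in a list and that the marker is the literal row 'mmmm'; the function name `split_at_first_marker` is made up for illustration.

```python
def split_at_first_marker(rows, marker="mmmm"):
    """Single pass: rows up to and including the first marker go to `top`,
    everything after it goes to `bottom`."""
    top, bottom = [], []
    seen_marker = False  # False = state before the switch, True = state after
    for row in rows:
        if seen_marker:
            bottom.append(row)
        else:
            top.append(row)
            if row == marker:
                seen_marker = True  # switch state once, on the first marker
    return top, bottom


rows = ["x", "av", "y", "b", "cc", "mmmm", "a", "x", "mmmm", "lt"]
top, bottom = split_at_first_marker(rows)
print(top)
print(bottom)
```

Because the state flips only once, later 'mmmm' rows pass straight through to the bottom part, which matches the first-pass split in the original example.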

dls · Post by **dls** » Tue Mar 28, 2006 11:26 am

It's not April 1st yet, so....

Under what circumstance would the final output differ from the initial input?

gateleys · Post by **gateleys** » Tue Mar 28, 2006 12:35 pm

The example that I had given was to abstract the other complexities of my logic from you guys and just exhibit the process wherein I was having a problem. In my job, the input file and the output file WILL BE different. But, right now, I just wanted help in splitting the file and appending them again to get back to the original file based on the logic (example).

Of course, I will be performing other tasks with the split files in a way that will result in a different output file.

gateleys

ray.wurlod · Post by **ray.wurlod** » Tue Mar 28, 2006 3:25 pm

As you've described the algorithm the output file must end up exactly the same as the input file. Can you clarify the algorithm to show how it might be different?

djm · Post by **djm** » Tue Mar 28, 2006 4:00 pm

Probably a dopey question, given you have flagged that your server is Windows, but just in case you have flagged your DataStage Client as Windows but your DataStage server is UNIX, there is a simple solution. Alternatively, if you are using a DataStage Windows server, you could install the Microsoft product "Windows Services for UNIX" (a.k.a. SFU), which is a free - but big - download available at the Microsoft web site. That will give you a pretty much full UNIX environment on a Windows box.

The UNIX command to do this is csplit. If you want more details, try issuing the command man csplit at the UNIX command prompt.
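
For readers without a UNIX shell at hand, the kind of split being suggested (one piece per 'mmmm') can be approximated in Python. This is only a rough analogue under the assumption that each piece should end with its marker row; it is not a reimplementation of csplit, and `split_into_pieces` is a hypothetical name.

```python
def split_into_pieces(rows, marker="mmmm"):
    """Split rows into consecutive pieces, each ending with a marker row.
    Rows left over after the final marker form the last piece."""
    pieces, current = [], []
    for row in rows:
        current.append(row)
        if row == marker:
            pieces.append(current)  # close the piece at the marker
            current = []
    if current:
        pieces.append(current)      # trailing rows with no marker after them
    return pieces


pieces = split_into_pieces(["x", "mmmm", "a", "mmmm", "z"])
print(pieces)
```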

David

gateleys · Post by **gateleys** » Tue Mar 28, 2006 4:42 pm

ray.wurlod wrote:As you've described the algorithm the output file must end up exactly the same as the input file. Can you clarify the algorithm to show how it might be different?

Ray, what I need to do is append the content of file2 (flat, with 1 column) to the tail of file1 (flat, with same column definition as file2) based on a match on a 'known' string in file1. The logic is-
1. If the input (file1) contains the string 'mmmm', then write all rows with @INROWNUM <= rownum of 'mmmm' to a TOPoutfile.
2. And all rows below 'mmmm' need to be written to BOTTOMoutfile.
(Essentially, splitting file1 into top and bottom parts, based on 'mmmm').
3. Then, append file2 (to be included file), to TOPoutfile.
4. Keep scanning (like fractals) the BOTTOMoutfile for the string 'mmmm', and perform the split-and-append (as discussed in steps 1-3) until there are no more 'mmmm' left.
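
Read this way, steps 1-4 amount to inserting the contents of file2 after every 'mmmm' row, which a single pass can do. The Python sketch below illustrates that reading; it assumes the rows left over after the last 'mmmm' are written to the end of the output (as in the first post's example) and that the included rows are not themselves rescanned for the marker. The name `include_after_marker` is hypothetical.

```python
def include_after_marker(rows, include_rows, marker="mmmm"):
    """Return a copy of rows with include_rows inserted after every marker row."""
    out = []
    for row in rows:
        out.append(row)
        if row == marker:
            # Equivalent to "append file2 to TOPoutfile" after each split.
            out.extend(include_rows)
    return out


file1 = ["x", "mmmm", "a", "mmmm", "z"]
file2 = ["inc1", "inc2"]
merged = include_after_marker(file1, file2)
print(merged)
```

The cost is one read of file1 regardless of how many times 'mmmm' occurs, so hundreds or a thousand occurrences make no difference to the number of passes.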

So, if you look at this, it might 'look' dopey to some, but I posted it with a reason. I had only mentioned appending of the BOTTOMout file in my previous posting...which might have caused the confusion.

Further, it would be great if I could toggle the filenames of file1 and BOTTOMoutfile for every subsequent run so that I could handle all the occurrences of the comparison string 'mmmm'. (Hence, my other post)

I have actually managed to split the file into TOP and BOTTOM and to append the to-be-included file to the TOP file. The only problem I am having is that the number of occurrences of 'mmmm' can vary from a few hundred to a thousand. Thus, this design calls for multiple runs of the same job with the 2 filenames toggling.

Thanks.
gateleys

ray.wurlod · Post by **ray.wurlod** » Tue Mar 28, 2006 6:13 pm

Recursion is multiple runs anyway. Why not keep it simple? Use StartLoop and EndLoop activities in a job sequence, toggle the file names as advised in the other post, and provide an appropriate exit condition.

gateleys · Post by **gateleys** » Wed Mar 29, 2006 7:22 am

ray.wurlod wrote:Use StartLoop and EndLoop activities in a job sequence, toggle the file names as advised in the other post, and provide an appropriate exit condition.

Thanks Ray, but StartLoop and EndLoop stages are not available in DS7.0. Is there any other way to run the same job, say n number of times, with varying filenames?

gateleys

ArndW · Post by **ArndW** » Wed Mar 29, 2006 7:38 am

Gateleys,

you can create a server job, then leave the canvas empty and create your loop in DS/BASIC in the Job Control portion. Use the drop-down list at the top to create your skeleton code to call a job (it will automagically create the attach, set params, reset & run, wait for completion and close for your called job(s)). You can then change the parameter values for the filename at each loop iteration.
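
The shape of such a control loop can be sketched outside DataStage. The Python below is not job-control BASIC and uses no DataStage calls; `run_split_job` is a hypothetical stand-in for one run of the split job, and all file names are invented for the demonstration. It shows the two file names being swapped on each iteration and the loop exiting once a run finds no 'mmmm'.

```python
import os
import tempfile

MARKER = "mmmm"


def run_split_job(in_path, top_path, bottom_path, include_path):
    """Hypothetical stand-in for one job run. Appends the rows up to the first
    marker plus the include file to top_path, writes the rest to bottom_path,
    and returns True if a marker was found."""
    with open(in_path) as f:
        rows = f.read().splitlines()
    with open(include_path) as f:
        include_rows = f.read().splitlines()
    if MARKER in rows:
        cut = rows.index(MARKER) + 1
        top, bottom, found = rows[:cut] + include_rows, rows[cut:], True
    else:
        top, bottom, found = rows, [], False  # no marker: flush the tail to top
    with open(top_path, "a") as f:
        f.writelines(r + "\n" for r in top)
    with open(bottom_path, "w") as f:
        f.writelines(r + "\n" for r in bottom)
    return found


def run_until_done(in_path, scratch_path, top_path, include_path, max_runs=10000):
    """Run the job repeatedly, toggling the input and bottom file names.
    Note: the original input file is overwritten from the second run on,
    so this should be pointed at a copy."""
    src, dst = in_path, scratch_path
    for run in range(1, max_runs + 1):
        if not run_split_job(src, top_path, dst, include_path):
            return run               # exit condition: no marker left
        src, dst = dst, src          # this run's bottom is the next run's input
    raise RuntimeError("marker still present after max_runs runs")


with tempfile.TemporaryDirectory() as d:
    in_path = os.path.join(d, "file1.txt")
    scratch = os.path.join(d, "bottom.txt")
    top_out = os.path.join(d, "top.txt")
    inc = os.path.join(d, "file2.txt")
    with open(in_path, "w") as f:
        f.write("x\nmmmm\na\nmmmm\nz\n")
    with open(inc, "w") as f:
        f.write("INC\n")
    runs = run_until_done(in_path, scratch, top_out, inc)
    with open(top_out) as f:
        result = f.read().splitlines()

print(runs, result)
```

The `max_runs` guard is a design choice for the sketch: with up to a thousand markers expected, a hard ceiling keeps a bad exit condition from looping forever.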

gateleys · Post by **gateleys** » Wed Mar 29, 2006 7:50 am

Arnd, Thanks.
Yeah, that seems to be the only option that I have now. Will get back with the results.

gateleys