How to join huge flat files together

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
xli
Charter Member
Posts: 74
Joined: Fri May 09, 2003 12:31 am

How to join huge flat files together

Post by xli »

Hi,

I have 7 big flat files, each containing 24 million records, sorted by the first numeric field and delimited by Ctrl-B (octal 002).

I want to join them together side by side, for example :

File A:

1^Ba^Bb
2^Bx^By
....

File B:
1^Bc^Bd
2^Bm^Bn
....

After the merge, it should look like:

File C :
1^Ba^Bb^Bc^Bd
2^Bx^By^Bm^Bn
....

I have a DataStage routine to join flat files together. It works fine with small volumes of records, but it failed to process these huge files with the error below:

Abnormal termination of stage Merging4To1..Transform_data detected

From previous run
DataStage Job 870 Phantom 10927
jobnotify: Error 904 occurred.
DataStage Phantom Finished.
[79844] DSD.StageRun Merging8To1. Merging4To1.Transform_data 1 -2/50 - terminated.

Can anyone tell me how to join them together by DataStage or Unix ?

I will appreciate any advice.

xli
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

This really belongs on the server forum.

The routine probably barfed at a 2GB limit - remember that DataStage is a 32-bit environment, so one must assume that a 32-bit number controls the offset address within a text file.

You have mentioned in an earlier post that the Merge stage also does not cope with large files.

Therefore, it seems that you need something written at the operating system level, or to load the data into a large table. Insert the rows from the first file, and update the rows from the other files, into a different set of non-key columns. How many columns in total does the result have?

(It might be prudent to take your DBA out for a drink very soon!)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

Adding to what Ray has suggested, maybe you can concatenate and sort the files, then use the Aggregator stage (specifying the sort key column and sequence) as a control break. Your target will have all columns from both files. The first file's columns can be obtained with the First() aggregate function and the second file's with the Last() aggregate function.

However, the limitation is that this will work only for a predefined number of columns.

You also have other options, such as a multi-valued hashed file.

If you are trying it in Unix, you can use the paste / awk / sed commands.

One other option is to concatenate and sort the files on the key, define your target with 'no end-of-line terminator', and include your own eol_column as the last field. For every second line, write a newline ('\n') character into the eol_column. This will concatenate every 2 records in the file.
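A minimal sketch of this concatenate-and-sort pairing at the Unix level (assuming GNU sort and a bash/ksh shell for the $'\002' quoting; fileA and fileB recreate the sample data from the question):

```shell
# Recreate the sample files: Ctrl-B (octal 002) delimited, sorted on field 1
printf '1\002a\002b\n2\002x\002y\n' > fileA
printf '1\002c\002d\n2\002m\002n\n' > fileB

# sort -m merges already-sorted inputs; -s keeps ties in argument order,
# so each key's fileA line lands immediately before its fileB line.
sort -m -s -t$'\002' -k1,1n fileA fileB |
# paste - - glues every pair of adjacent lines (a 2-record control break) ...
paste -d$'\002' - - |
# ... and cut drops the repeated key (field 4 of 6, assuming 3 fields per record)
cut -d$'\002' -f1-3,5-6 > fileC
```

For seven files the same idea holds with `paste - - - - - - -` and a wider cut list, provided every key appears in every file.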
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

If, as earlier posts suggest, your files are all sorted on the same key already, you may be able to use paste alone.

You will need the -d option to specify your non-default delimiter character. Or allow the default delimiter (tab) to separate the original lines and change it afterwards with a tr command.
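For the sample data in the question, a minimal paste sketch might look like this (assuming a bash/ksh shell for the $'\002' quoting and exactly three fields per input record; the file names are illustrative):

```shell
# Sample inputs: Ctrl-B (octal 002) delimited, both sorted on field 1
printf '1\002a\002b\n2\002x\002y\n' > fileA
printf '1\002c\002d\n2\002m\002n\n' > fileB

# paste -d joins line N of each file with Ctrl-B instead of the default tab;
# cut then drops the repeated key from the second file (field 4 of 6).
paste -d$'\002' fileA fileB | cut -d$'\002' -f1-3,5-6 > fileC

# fileC now holds: 1^Ba^Bb^Bc^Bd and 2^Bx^By^Bm^Bn
```

Note that paste pairs lines purely by position, so it relies on every file carrying the same keys in the same order - exactly the precondition stated above.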
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
If you already have BASIC code for merging the files, you only need to identify the row limit before it crashes, then close the output file and open a new one, and so on until all rows are merged.
After you have all your merged files, simply load them into your DB or do whatever else you need with them.
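The same chunking idea can be sketched at the OS level with split, assuming GNU coreutils (the 1900m ceiling and the merged_part_ prefix are arbitrary choices to stay under the 2 GB mark):

```shell
# Illustrative inputs in the question's Ctrl-B (octal 002) delimited layout
printf '1\002a\002b\n2\002x\002y\n' > fileA
printf '1\002c\002d\n2\002m\002n\n' > fileB

# split -C caps each output piece at the given size without breaking a line,
# producing merged_part_aa, merged_part_ab, ... ready for separate loads.
paste -d$'\002' fileA fileB | split -C 1900m - merged_part_
```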
IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

Some things are better done outside of DataStage

Post by 1stpoint »

This is a typical pre-process type of problem that is easily solved outside of DataStage. Too much emphasis is being placed on DataStage to do flat file manipulation. This can be much more easily done in Python or Perl and then the resulting file is your input for your DataStage job.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

Why introduce a new programming language when it can be done in DataStage? Too often people with pet programming languages come in and develop something complex that gets tacked onto the start or end of a DataStage sequence. Within a year the programmer has moved on and the script is unmaintainable and unscalable.

Are your files already sorted? If they are not, then you are going to have problems no matter what approach you take. You are trying to join seven 24-million-row files together without any indexing.

If they are already sorted and you can rely on each file having identically ordered key fields, then try this design: read in all seven files and pass the data to a Link Collector stage with round-robin ordering. Pass this to a Transformer. In the Transformer, build up the output record in stage variables and output every 7th row.
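Outside DataStage the same accumulate-and-release logic can be sketched in awk (a hypothetical two-file version, with a stable merge standing in for the round-robin collector; n would be 7 for the seven-file case):

```shell
# Sample sorted inputs, Ctrl-B (octal 002) delimited
printf '1\002a\002b\n2\002x\002y\n' > fileA
printf '1\002c\002d\n2\002m\002n\n' > fileB

# Interleave the files by key, then build up each output record the way
# stage variables would, releasing one merged row per n input rows.
sort -m -s -t$'\002' -k1,1n fileA fileB |
awk -F$'\002' -v OFS=$'\002' -v n=2 '
    NR % n == 1 { rec = $0; next }                  # first file: keep the key
    { for (i = 2; i <= NF; i++) rec = rec OFS $i }  # others: append non-key fields
    NR % n == 0 { print rec }'
```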
Sainath.Srinivasan
Participant
Posts: 3337
Joined: Mon Jan 17, 2005 4:49 am
Location: United Kingdom

Post by Sainath.Srinivasan »

A point to note when you are developing an ETL process is that there is no need to do everything in DataStage. If DataStage were expected to do everything, you would not have been provided the option of pre- and post-job routines or calls like DSExecute.

Also, DataStage uses the external OS scheduler to run its scheduled jobs. And when DataStage moved ahead to DataStage EE, routines changed from their UniVerse BASIC base to C/C++. In other words, things are moving ahead.

The topic of maintenance must also be addressed: even a job stored within DataStage must be maintained. Any experienced ETL person will agree that DataStage jobs are never finished the first time, as we always receive change requests.

In this case, we can develop with simple external tools - such as paste or sed - and when the requirement moves on to become NOT(simple), we can re-address it in some other way - which may be using DataStage.

Until then, it is better to achieve the results in whichever way is simpler, more effective, less time-consuming and less costly. But always DO YOUR DOCUMENTATION properly to let others know about the process.
Post Reply