Merging 2 files with special handling if there is a match

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

sarahendie
Charter Member
Posts: 7
Joined: Tue Mar 16, 2004 7:37 pm

Merging 2 files with special handling if there is a match

Post by sarahendie »

Using DataStage PX, I am trying to figure out how to do the following:

I have 2 files that I need to merge together such that all records from File 1 and File 2 are written out UNLESS there is a match between the 2 files (key: Billing#, Ship#, Item#). If there is a match, only the 2nd File's version of the record is written.

File 1 (Customer Billing Level Prices):
Billing # - Ship# - Item# - Price
100020 - 001 - 222548 - 0.85
111111 - 001 - 160011 - 1.10
111111 - 002 - 160011 - 1.10
111111 - 003 - 160011 - 1.10

File 2 (Customer Shipping Level Prices):
Billing # - Ship# - Item# - Price
111111 - 002 - 160011 - 2.50
657454 - 001 - 555555 - 1.25

If a Customer Shipping Level Price exists, it needs to be used in place of the corresponding Customer Billing Level Price.

Resulting Output:
Billing # - Ship# - Item# - Price
100020 - 001 - 222548 - 0.85
111111 - 001 - 160011 - 1.10
111111 - 002 - 160011 - 2.50
111111 - 003 - 160011 - 1.10
657454 - 001 - 555555 - 1.25
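
To spell the rule out, here is a minimal Python sketch of the intended result (plain Python purely for illustration, not DataStage; the rows mirror the example above):

# Illustrative only: File 2 overrides File 1 on a matching
# (billing_nr, ship_nr, item_nr) key; unmatched rows from both files pass through.
file1 = [  # Customer Billing Level Prices
    ("100020", "001", "222548", 0.85),
    ("111111", "001", "160011", 1.10),
    ("111111", "002", "160011", 1.10),
    ("111111", "003", "160011", 1.10),
]
file2 = [  # Customer Shipping Level Prices
    ("111111", "002", "160011", 2.50),
    ("657454", "001", "555555", 1.25),
]

# Load File 1 first, then File 2, so File 2 overwrites on matching keys.
merged = {}
for billing, ship, item, price in file1 + file2:
    merged[(billing, ship, item)] = price

for (billing, ship, item), price in sorted(merged.items()):
    print(billing, ship, item, price)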

Does anyone have any suggestions on how this can be done? Any help would be appreciated!! Thanks.

Sara Henderson
Information Systems Department
Univar USA Inc.
bcarlson
Premium Member
Posts: 772
Joined: Fri Oct 01, 2004 3:06 pm
Location: Minnesota

Post by bcarlson »

We have run into this as well. Here's what we have done:

1) After each import (on the output link, that is), add a Column Generator stage to add a file_id field that is hard-coded to 1 or 2 (depending on which link it is attached to)
2) Funnel the results together
3) Sort on your data keys (billing_nr, ship_nr, item_nr), and (last of all) the file_id (sort file_id ascending)
4) Use the Remove Duplicates stage, keying on just the data keys (not file_id), and choose to retain the last duplicate. If there are duplicates, only the record with file_id 2 will be kept.

Of course, this assumes that there are no duplicates within each individual file - those would also be dropped in this process.
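
A rough Python equivalent of those four steps, purely to illustrate the logic (the actual job uses Column Generator, Funnel, Sort, and Remove Duplicates stages; the field layout below is just an assumption for the example):

from itertools import groupby
from operator import itemgetter

# Sample rows: (billing_nr, ship_nr, item_nr, price)
file1 = [("100020", "001", "222548", 0.85),
         ("111111", "001", "160011", 1.10),
         ("111111", "002", "160011", 1.10),
         ("111111", "003", "160011", 1.10)]
file2 = [("111111", "002", "160011", 2.50),
         ("657454", "001", "555555", 1.25)]

# Step 1: tag each record with a file_id (1 for File 1, 2 for File 2).
tagged = [row + (1,) for row in file1] + [row + (2,) for row in file2]

# Steps 2-3: funnel (concatenate) the two streams and sort on the data keys,
# with file_id ascending as the last sort key.
data_key = itemgetter(0, 1, 2)
tagged.sort(key=lambda r: data_key(r) + (r[4],))

# Step 4: remove duplicates on the data keys only, retaining the LAST record
# of each group, i.e. the file_id 2 record whenever both files have the key.
result = [list(grp)[-1] for _, grp in groupby(tagged, key=data_key)]

for billing, ship, item, price, file_id in result:
    print(billing, ship, item, price)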

Hope this helps.
Brad.
T42
Participant
Posts: 499
Joined: Thu Nov 11, 2004 6:45 pm

Post by T42 »

Use the Change Capture stage - see Chapter 30 of the Parallel Job Developer Guide (parjdev.pdf). Handle the results in a downstream Filter or Transformer stage. For fields that are not used in the comparison, you can use a Merge stage to combine them back into the main stream after the Change Capture stage.

Correction: now that I read the OP a bit more closely, this is actually something a full outer join would handle, with File 2 as the primary (left) input to the Join stage.
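
A rough illustration of that join-based reading of the problem (plain Python rather than a Join stage; the dict lookups stand in for a full outer join on the three keys, with File 2's price taking precedence when both sides match):

# Index both files by (billing_nr, ship_nr, item_nr).
file1 = {("100020", "001", "222548"): 0.85,
         ("111111", "001", "160011"): 1.10,
         ("111111", "002", "160011"): 1.10,
         ("111111", "003", "160011"): 1.10}
file2 = {("111111", "002", "160011"): 2.50,
         ("657454", "001", "555555"): 1.25}

# Full outer join: take every key seen in either file, then "coalesce"
# the price, preferring File 2 over File 1 when both are present.
result = {key: file2.get(key, file1.get(key))
          for key in set(file1) | set(file2)}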