Using DataStage PX, I am trying to figure out how to do the following:
I have 2 files that I need to merge together such that all records from File 1 and File 2 are written out UNLESS there is a match between the 2 files (key: Billing#, Ship#, Item#). If there is a match, only the 2nd File's version of the record is written.
File 1 (Customer Billing Level Prices):
Billing # - Ship# - Item# - Price
100020 - 001 - 222548 - 0.85
111111 - 001 - 160011 - 1.10
111111 - 002 - 160011 - 1.10
111111 - 003 - 160011 - 1.10
File 2 (Customer Shiping Level Prices):
Billing # - Ship# - Item# - Price
111111 - 002 - 160011 - 2.50
657454 - 001 - 555555 - 1.25
If there is a Customer Shipping Level Price that exists, it need to be used in place of the corresponding Customer Billing Level price.
Resulting Output:
Billing # - Ship# - Item# - Price
100020 - 001 - 222548 - 0.85
111111 - 001 - 160011 - 1.10
111111 - 002 - 160011 - 2.50
111111 - 003 - 160011 - 1.10
657454 - 001 - 555555 - 1.25
Does anyone have any suggestions on how this can be done? Any help would be appreciated!! Thanks.
Sara Henderson
Information Systems Department
Univar USA Inc.
Merging 2 files with specail handling if there is a match
Moderators: chulett, rschirm, roy
-
- Charter Member
- Posts: 7
- Joined: Tue Mar 16, 2004 7:37 pm
We have into this as well. Here's what we have done:
1) After each import (on the output link, that is), add a Generator stage to add a file_id field that is hardcoded to 1 or 2 (depending on which link it is attached to)
2) Funnel the results together
3) Sort on your data keys (billing_nr, ship_nr, item_nr), and (last of all) the file_id (sort file_id ascending)
4) Use the Remove Duplicates Stage, using just the data keys (not file_id), and choose to Retain the last duplicate. If there are duplicates, only the rec with file_id 2 will be kept
Of course, this assumes that there are not duplicates within each individual file - those would be dropped also in this process.
Hope this helps.
Brad.
1) After each import (on the output link, that is), add a Generator stage to add a file_id field that is hardcoded to 1 or 2 (depending on which link it is attached to)
2) Funnel the results together
3) Sort on your data keys (billing_nr, ship_nr, item_nr), and (last of all) the file_id (sort file_id ascending)
4) Use the Remove Duplicates Stage, using just the data keys (not file_id), and choose to Retain the last duplicate. If there are duplicates, only the rec with file_id 2 will be kept
Of course, this assumes that there are not duplicates within each individual file - those would be dropped also in this process.
Hope this helps.
Brad.
Use the Change Capture Stage. Chapter 30 on your Parallel Job Developer Guide (parjdev.pdf). Handle the results on the following filter stage/transform stage. For fields that are not used, you can use a merge stage to combine those into the main stream after the Change Capture stage.
Correction: Now that I read the OP a bit more closely, that is actually something a full outer join would do, with file 2 as the primary (left) join stage.
Correction: Now that I read the OP a bit more closely, that is actually something a full outer join would do, with file 2 as the primary (left) join stage.