DSXchange: DataStage and IBM Websphere Data Integration Forum
View next topic
View previous topic
Add To Favorites
This topic has been marked "Resolved."
Author Message
hitmanthesilentassasin
Participant



Joined: 13 Mar 2007
Posts: 150

Points: 1382

Post Posted: Thu Jun 13, 2013 2:46 pm Reply with quote    Back to top    

DataStage® Release: 8x
Job Type: Parallel
OS: Unix
Hi,

I was looking for alternatives to perform the match. I have 2 different addresses captured in a single record which I wanted to use to identify duplicates. The ways I figured out is to split the records and have a single set of matching fields and run the matching specifications against it. However, this would generate a different matchsetid which could lead to complication of merge. another way is to have 2 set of matching keys like housenumber1, housenumber1_2, but in this scenario it wouldn't match across the fields. Is there a way out of this?

thanks!!
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Post Posted: Thu Jun 13, 2013 4:49 pm Reply with quote    Back to top    

Hi,

What is the business rule for the match? Is it 'the same person independent of the address'? In other words, what else is being used to match besides address components?

_________________
Regards,
Robert
Rate this response:  
Not yet rated
stuartjvnorton
Participant



Joined: 19 Apr 2007
Posts: 523
Location: Melbourne
Points: 3890

Post Posted: Thu Jun 13, 2013 5:30 pm Reply with quote    Back to top    

To follow on from Robert's post, presumably you have other fields to match against? eg: customers with both shipping and billing addresses in the customer master table.

A way I've done it in the past is to split the records so that if the customer has 2 different addresses, you make 2 records, each with 1 address. You can then use the PK to group them on the first pass and then use subsequent passes to do the "real" matching.
Experiment with using Independent or Transitive as the match type to see which gives you the better matches.

Remember, in the end though there is 1 customer record, so you may have do some transformation or reconstruction before you can do the Survive (if you are surviving to the same structure).
Rate this response:  
Not yet rated
hitmanthesilentassasin
Participant



Joined: 13 Mar 2007
Posts: 150

Points: 1382

Post Posted: Fri Jun 14, 2013 1:06 am Reply with quote    Back to top    

Hi,

Thanks for your replies, yes there are other fields like the names to identify the customers. even if we split the record then lets say record with key x -1 (with first address) is pointing to record with key z and x-2(with second address) is pointing to the record y merging could be a bit cumbersome.

is there a way out of this? can we not simply use multiple fields indicating to be used as one.
Rate this response:  
Not yet rated
ray.wurlod

Premium Poster
Participant

Group memberships:
Premium Members, Inner Circle, Australia Usergroup, Server to Parallel Transition Group

Joined: 23 Oct 2002
Posts: 54524
Location: Sydney, Australia
Points: 295662

Post Posted: Fri Jun 14, 2013 1:10 am Reply with quote    Back to top    

Probably not. While you could create a virtual column made up of all of those address fields, given that it actually contains two addresses per record how would you expect the Standardize stage to tr ...

_________________
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Rate this response:  
Not yet rated
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Post Posted: Fri Jun 14, 2013 12:05 pm Reply with quote    Back to top    

Can you please provide and example of your input and your desired results? Please try to use 'real' data, not hypothetical data.

I am asking for clarification because there is a QualityStage capability that may help here (unduplicate independent), but it acts differently (on purpose) than the 'default' of 'unduplicate dependent'.

_________________
Regards,
Robert
Rate this response:  
Not yet rated
hitmanthesilentassasin
Participant



Joined: 13 Mar 2007
Posts: 150

Points: 1382

Post Posted: Sat Jun 15, 2013 3:31 am Reply with quote    Back to top    

Hi Robert the sample is as shown below. It needs to be a criss-cross match between 2 sets of address lines. to standardize the 2 sets I could run the standardization twice to generate the standardized fields and then pass the data through the match stage to identify the duplicate customers within the same dataset.

https://plus.google.com/photos/116171145114013064554/albums/5889638574516657121

I hope I have given the required information.
Rate this response:  
Not yet rated
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Post Posted: Sat Jun 15, 2013 6:31 am Reply with quote    Back to top    

Indeed, the way to go is with Unduplicate Independent:

1. 'Stack' (my term) your data so that the result looks something like this:
Code:
Type   FirstName   LastName   Address   Phone   email
3   BillTo   John   Smith   123 Mary Street   40400000   abc@gmail.com
3   ShipTo   John   Smith   456 Elizabeth Street   40400000   abc@gmail.com
4   BillTo   John   Smith   456 Elizabeth Street      
5   BillTo   John   Smith   123 Mary Street      


The 'ID' is something I made up based on your spreadsheet (the first data row was on Excel row 3). I assume each input row will have a unique identifier.

Also note that I have removed 'empty' rows - where there was no BillTo or ShipTo to match on.

2. Standardize the data as normal.

3. Create an 'Undup Independent' match and create your passes as normal.

4. Add one final pass that blocks on the input ID only. There will be no matching fields. This pass is the 'glue' that will bring the records together.

If you did normal 'Undup Dependent' matching, you would probably end up with something like:

Code:
Set   ID   Type   FirstName   LastName   Address   Phone   email
100   3   BillTo   John   Smith   123 Mary Street   40400000   abc@gmail.com
200   3   ShipTo   John   Smith   456 Elizabeth Street   40400000   abc@gmail.com
200   4   BillTo   John   Smith   456 Elizabeth Street      
100   5   BillTo   John   Smith   123 Mary Street      


Making the match 'Undup Independent', and adding a pass to matching only on ID will 'join' match Sets 100 and 200 above, making the output look like this:

Code:
Set   ID   Type   FirstName   LastName   Address   Phone   email
100   3   BillTo   John   Smith   123 Mary Street   40400000   abc@gmail.com
100   3   ShipTo   John   Smith   456 Elizabeth Street   40400000   abc@gmail.com
100   4   BillTo   John   Smith   456 Elizabeth Street      
100   5   BillTo   John   Smith   123 Mary Street   


A few things about Unduplicate Independent:

1. ALL records go into EVERY pass. This is in contrast to the Undup Dependent where only the masters and unmatched records are delivered to subsequent passes.
2. The output pass from Unduplicate Independent is not useful - it should always be the last pass number.
3. When developing in the Match Designer, create one pass at a time, and put the others in the 'holding area'.

I hope this helps.

_________________
Regards,
Robert
Rate this response:  
Not yet rated
hitmanthesilentassasin
Participant



Joined: 13 Mar 2007
Posts: 150

Points: 1382

Post Posted: Tue Jun 18, 2013 5:44 pm Reply with quote    Back to top    

Hi Robert,

your trick does the job well

the only part I am not clear here is when we are grouping by the key in the last pass how come the keys are re-arranged?

Thank you very much. Appreciate your help!! Very Happy
Rate this response:  
Not yet rated
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Post Posted: Wed Jun 19, 2013 10:07 am Reply with quote    Back to top    

Hi,

Sorry, but not sure what you mean by 'keys are re-arranged'? Can you please elaborate?

_________________
Regards,
Robert
Rate this response:  
Not yet rated
hitmanthesilentassasin
Participant



Joined: 13 Mar 2007
Posts: 150

Points: 1382

Post Posted: Wed Jun 19, 2013 3:55 pm Reply with quote    Back to top    

I meant the keys were pointing to a single master record instead of multiple different records when they were blocked together.
Rate this response:  
Not yet rated
rjdickson
Participant



Joined: 16 Jun 2003
Posts: 378
Location: Chicago, USA
Points: 2531

Post Posted: Wed Jun 19, 2013 4:02 pm Reply with quote    Back to top    

Before the final pass of the Unduplicate Independent, your results would look like this:
Code:
Set   ID   Type   FirstName   LastName   Address   Phone   email
100   3   BillTo   John   Smith   123 Mary Street   40400000   abc@gmail.com
200   3   ShipTo   John   Smith   456 Elizabeth Street   40400000   abc@gmail.com
200   4   BillTo   John   Smith   456 Elizabeth Street       
100   5   BillTo   John   Smith   123 Mary Street     


The final pass (the one that just blocks on ID) matches the two records that originated with ID 3. That is what triggers the 'complication of merge' (your words from your original post). One record in is Set 100, the other is in Set 200, so QualityStage will, in that pass, effectively merge sets 100 and 200 so that the final output is one Set with all records:
Code:
Set   ID   Type   FirstName   LastName   Address   Phone   email
100   3   BillTo   John   Smith   123 Mary Street   40400000   abc@gmail.com
100   3   ShipTo   John   Smith   456 Elizabeth Street   40400000   abc@gmail.com
100   4   BillTo   John   Smith   456 Elizabeth Street       
100   5   BillTo   John   Smith   123 Mary Street   


Did I answer your question?

_________________
Regards,
Robert
Rate this response:  
Not yet rated
hitmanthesilentassasin
Participant



Joined: 13 Mar 2007
Posts: 150

Points: 1382

Post Posted: Wed Jun 19, 2013 4:30 pm Reply with quote    Back to top    

rjdickson wrote:
Did I answer your question?


Sure, you did - Thanks again!!
Rate this response:  
Not yet rated
Display posts from previous:       

Add To Favorites
View next topic
View previous topic
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum



Powered by phpBB © 2001, 2002 phpBB Group
Theme & Graphics by Daz :: Portal by Smartor
All times are GMT - 6 Hours