multiple addresses in the single record

Infosphere's Quality Product

Moderators: chulett, rschirm

Post Reply
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

multiple addresses in the single record

Post by hitmanthesilentassasin »

Hi,

I was looking for alternatives to perform the match. I have 2 different addresses captured in a single record which I wanted to use to identify duplicates. The ways I figured out is to split the records and have a single set of matching fields and run the matching specifications against it. However, this would generate a different matchsetid which could lead to complication of merge. another way is to have 2 set of matching keys like housenumber1, housenumber1_2, but in this scenario it wouldn't match across the fields. Is there a way out of this?

thanks!!
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA
Contact:

Post by rjdickson »

Hi,

What is the business rule for the match? Is it 'the same person independent of the address'? In other words, what else is being used to match besides address components?
Regards,
Robert
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

To follow on from Robert's post, presumably you have other fields to match against? eg: customers with both shipping and billing addresses in the customer master table.

A way I've done it in the past is to split the records so that if the customer has 2 different addresses, you make 2 records, each with 1 address. You can then use the PK to group them on the first pass and then use subsequent passes to do the "real" matching.
Experiment with using Independent or Transitive as the match type to see which gives you the better matches.

Remember, in the end though there is 1 customer record, so you may have do some transformation or reconstruction before you can do the Survive (if you are surviving to the same structure).
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Hi,

Thanks for your replies, yes there are other fields like the names to identify the customers. even if we split the record then lets say record with key x -1 (with first address) is pointing to record with key z and x-2(with second address) is pointing to the record y merging could be a bit cumbersome.

is there a way out of this? can we not simply use multiple fields indicating to be used as one.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Probably not. While you could create a virtual column made up of all of those address fields, given that it actually contains two addresses per record how would you expect the Standardize stage to treat that free-form data? A moment's reflection will show that you need one record per address; as Stuart noted you can merge them subsequently, after you've standardized them, possibly in a first match pass.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA
Contact:

Post by rjdickson »

Can you please provide and example of your input and your desired results? Please try to use 'real' data, not hypothetical data.

I am asking for clarification because there is a QualityStage capability that may help here (unduplicate independent), but it acts differently (on purpose) than the 'default' of 'unduplicate dependent'.
Regards,
Robert
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Hi Robert the sample is as shown below. It needs to be a criss-cross match between 2 sets of address lines. to standardize the 2 sets I could run the standardization twice to generate the standardized fields and then pass the data through the match stage to identify the duplicate customers within the same dataset.

https://plus.google.com/photos/11617114 ... 4516657121

I hope I have given the required information.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA
Contact:

Post by rjdickson »

Indeed, the way to go is with Unduplicate Independent:

1. 'Stack' (my term) your data so that the result looks something like this:

Code: Select all

Type	FirstName	LastName	Address	Phone	email
3	BillTo	John	Smith	123 Mary Street	40400000	abc@gmail.com
3	ShipTo	John	Smith	456 Elizabeth Street	40400000	abc@gmail.com
4	BillTo	John	Smith	456 Elizabeth Street		
5	BillTo	John	Smith	123 Mary Street		
The 'ID' is something I made up based on your spreadsheet (the first data row was on Excel row 3). I assume each input row will have a unique identifier.

Also note that I have removed 'empty' rows - where there was no BillTo or ShipTo to match on.

2. Standardize the data as normal.

3. Create an 'Undup Independent' match and create your passes as normal.

4. Add one final pass that blocks on the input ID only. There will be no matching fields. This pass is the 'glue' that will bring the records together.

If you did normal 'Undup Dependent' matching, you would probably end up with something like:

Code: Select all

Set	ID	Type	FirstName	LastName	Address	Phone	email
100	3	BillTo	John	Smith	123 Mary Street	40400000	abc@gmail.com
200	3	ShipTo	John	Smith	456 Elizabeth Street	40400000	abc@gmail.com
200	4	BillTo	John	Smith	456 Elizabeth Street		
100	5	BillTo	John	Smith	123 Mary Street		
Making the match 'Undup Independent', and adding a pass to matching only on ID will 'join' match Sets 100 and 200 above, making the output look like this:

Code: Select all

Set	ID	Type	FirstName	LastName	Address	Phone	email
100	3	BillTo	John	Smith	123 Mary Street	40400000	abc@gmail.com
100	3	ShipTo	John	Smith	456 Elizabeth Street	40400000	abc@gmail.com
100	4	BillTo	John	Smith	456 Elizabeth Street		
100	5	BillTo	John	Smith	123 Mary Street	
A few things about Unduplicate Independent:

1. ALL records go into EVERY pass. This is in contrast to the Undup Dependent where only the masters and unmatched records are delivered to subsequent passes.
2. The output pass from Unduplicate Independent is not useful - it should always be the last pass number.
3. When developing in the Match Designer, create one pass at a time, and put the others in the 'holding area'.

I hope this helps.
Regards,
Robert
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

Hi Robert,

your trick does the job well

the only part I am not clear here is when we are grouping by the key in the last pass how come the keys are re-arranged?

Thank you very much. Appreciate your help!! :D
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA
Contact:

Post by rjdickson »

Hi,

Sorry, but not sure what you mean by 'keys are re-arranged'? Can you please elaborate?
Regards,
Robert
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

I meant the keys were pointing to a single master record instead of multiple different records when they were blocked together.
rjdickson
Participant
Posts: 378
Joined: Mon Jun 16, 2003 5:28 am
Location: Chicago, USA
Contact:

Post by rjdickson »

Before the final pass of the Unduplicate Independent, your results would look like this:

Code: Select all

Set   ID   Type   FirstName   LastName   Address   Phone   email 
100   3   BillTo   John   Smith   123 Mary Street   40400000   abc@gmail.com 
200   3   ShipTo   John   Smith   456 Elizabeth Street   40400000   abc@gmail.com 
200   4   BillTo   John   Smith   456 Elizabeth Street       
100   5   BillTo   John   Smith   123 Mary Street      
The final pass (the one that just blocks on ID) matches the two records that originated with ID 3. That is what triggers the 'complication of merge' (your words from your original post). One record in is Set 100, the other is in Set 200, so QualityStage will, in that pass, effectively merge sets 100 and 200 so that the final output is one Set with all records:

Code: Select all

Set   ID   Type   FirstName   LastName   Address   Phone   email 
100   3   BillTo   John   Smith   123 Mary Street   40400000   abc@gmail.com 
100   3   ShipTo   John   Smith   456 Elizabeth Street   40400000   abc@gmail.com 
100   4   BillTo   John   Smith   456 Elizabeth Street       
100   5   BillTo   John   Smith   123 Mary Street   
Did I answer your question?
Regards,
Robert
hitmanthesilentassasin
Participant
Posts: 150
Joined: Tue Mar 13, 2007 1:17 am

Post by hitmanthesilentassasin »

rjdickson wrote:Did I answer your question?
Sure, you did - Thanks again!!
Post Reply