multiple addresses in a single record
-
- Participant
- Posts: 150
- Joined: Tue Mar 13, 2007 1:17 am
Hi,
I was looking for alternatives to perform the match. I have two different addresses captured in a single record, and I want to use both to identify duplicates. One way I figured out is to split the records so that there is a single set of matching fields, and then run the match specification against that. However, this would generate different match set IDs, which could complicate the merge. Another way is to have two sets of matching keys, like HouseNumber1 and HouseNumber1_2, but in that scenario it wouldn't match across the fields. Is there a way out of this?
thanks!!
-
- Participant
- Posts: 527
- Joined: Thu Apr 19, 2007 1:25 am
- Location: Melbourne
To follow on from Robert's post, presumably you have other fields to match against? eg: customers with both shipping and billing addresses in the customer master table.
A way I've done it in the past is to split the records so that if the customer has 2 different addresses, you make 2 records, each with 1 address. You can then use the PK to group them on the first pass and then use subsequent passes to do the "real" matching.
Experiment with using Independent or Transitive as the match type to see which gives you the better matches.
Remember, in the end though there is 1 customer record, so you may have to do some transformation or reconstruction before you can do the Survive (if you are surviving to the same structure).
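To illustrate the split-into-one-record-per-address idea, here is a minimal Python sketch. All field names (pk, bill_to, ship_to, etc.) are made up for illustration; they are not QualityStage column names.

```python
# Hypothetical sketch of the "split" step described above: one input row
# holding both a billing and a shipping address becomes one output row per
# address, keyed by the original PK so a later pass can regroup them.

def stack_addresses(record):
    """Yield one row per populated address in the input record."""
    for addr_type in ("bill_to", "ship_to"):
        address = record.get(addr_type)
        if address:  # skip 'empty' slots with no address to match on
            yield {
                "pk": record["pk"],
                "type": addr_type,
                "name": record["name"],
                "address": address,
            }

customers = [
    {"pk": 3, "name": "John Smith",
     "bill_to": "123 Mary Street", "ship_to": "456 Elizabeth Street"},
    {"pk": 4, "name": "John Smith",
     "bill_to": "456 Elizabeth Street", "ship_to": None},
]

stacked = [row for rec in customers for row in stack_addresses(rec)]
# pk 3 produces two rows (one per address); pk 4 produces one.
```

The shared pk is what lets a first pass group the split rows back together, with later passes doing the "real" matching.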
-
- Participant
- Posts: 150
- Joined: Tue Mar 13, 2007 1:17 am
Hi,
Thanks for your replies. Yes, there are other fields, like the names, to identify the customers. Even if we split the record, suppose the record with key x-1 (the first address) matches the record with key z, while x-2 (the second address) matches the record with key y; merging could then be a bit cumbersome.
Is there a way out of this? Can we not simply indicate that multiple fields are to be used as one?
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
Probably not. While you could create a virtual column made up of all of those address fields, given that it actually contains two addresses per record how would you expect the Standardize stage to treat that free-form data? A moment's reflection will show that you need one record per address; as Stuart noted you can merge them subsequently, after you've standardized them, possibly in a first match pass.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Can you please provide an example of your input and your desired results? Please try to use 'real' data, not hypothetical data.
I am asking for clarification because there is a QualityStage capability that may help here (unduplicate independent), but it acts differently (on purpose) than the 'default' of 'unduplicate dependent'.
Regards,
Robert
-
- Participant
- Posts: 150
- Joined: Tue Mar 13, 2007 1:17 am
Hi Robert, the sample is as shown below. It needs to be a criss-cross match between the two sets of address lines. To standardize the two sets, I could run the standardization twice to generate the standardized fields, and then pass the data through the Match stage to identify the duplicate customers within the same dataset.
https://plus.google.com/photos/11617114 ... 4516657121
I hope I have given the required information.
Indeed, the way to go is with Unduplicate Independent:
1. 'Stack' (my term) your data so that the result looks something like this:
Code: Select all
ID  Type    FirstName  LastName  Address               Phone     email
3   BillTo  John       Smith     123 Mary Street       40400000  abc@gmail.com
3   ShipTo  John       Smith     456 Elizabeth Street  40400000  abc@gmail.com
4   BillTo  John       Smith     456 Elizabeth Street
5   BillTo  John       Smith     123 Mary Street
The 'ID' is something I made up based on your spreadsheet (the first data row was on Excel row 3). I assume each input row will have a unique identifier.
Also note that I have removed 'empty' rows - where there was no BillTo or ShipTo to match on.
2. Standardize the data as normal.
3. Create an 'Undup Independent' match and create your passes as normal.
4. Add one final pass that blocks on the input ID only. There will be no matching fields. This pass is the 'glue' that will bring the records together.
If you did normal 'Undup Dependent' matching, you would probably end up with something like:
Code: Select all
Set  ID  Type    FirstName  LastName  Address               Phone     email
100  3   BillTo  John       Smith     123 Mary Street       40400000  abc@gmail.com
200  3   ShipTo  John       Smith     456 Elizabeth Street  40400000  abc@gmail.com
200  4   BillTo  John       Smith     456 Elizabeth Street
100  5   BillTo  John       Smith     123 Mary Street
Making the match 'Undup Independent', and adding a pass that matches only on ID, will 'join' match Sets 100 and 200 above, making the output look like this:
Code: Select all
Set  ID  Type    FirstName  LastName  Address               Phone     email
100  3   BillTo  John       Smith     123 Mary Street       40400000  abc@gmail.com
100  3   ShipTo  John       Smith     456 Elizabeth Street  40400000  abc@gmail.com
100  4   BillTo  John       Smith     456 Elizabeth Street
100  5   BillTo  John       Smith     123 Mary Street
A few things about Unduplicate Independent:
1. ALL records go into EVERY pass. This is in contrast to Undup Dependent, where only the masters and unmatched records are delivered to subsequent passes.
2. The output pass from Unduplicate Independent is not useful - it should always be the last pass number.
3. When developing in the Match Designer, create one pass at a time, and put the others in the 'holding area'.
I hope this helps.
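The 'glue' pass effectively merges any match sets that share an input ID. A minimal union-find sketch in Python of that merging behaviour (an illustration of the concept only, not QualityStage internals; the function and variable names are made up):

```python
# Illustrative union-find sketch: rows that share an input ID force their
# match sets to merge, collapsing Sets 100 and 200 into one final set.

def merge_sets(rows):
    """rows: list of (set_id, input_id). Returns {set_id: final_set_id}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the smaller set id

    seen = {}  # input_id -> first set_id seen with that ID
    for set_id, input_id in rows:
        if input_id in seen:
            union(seen[input_id], set_id)  # same ID => merge the two sets
        else:
            seen[input_id] = set_id
        find(set_id)
    return {s: find(s) for s, _ in rows}

# The example from the post: ID 3 appears in Sets 100 and 200,
# so everything collapses into Set 100.
print(merge_sets([(100, 3), (200, 3), (200, 4), (100, 5)]))
# {100: 100, 200: 100}
```

This is the same transitive-closure idea at the set level: one shared ID is enough to chain two otherwise separate match sets together.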
Regards,
Robert
-
- Participant
- Posts: 150
- Joined: Tue Mar 13, 2007 1:17 am
Before the final pass of the Unduplicate Independent, your results would look like this:
Code: Select all
Set  ID  Type    FirstName  LastName  Address               Phone     email
100  3   BillTo  John       Smith     123 Mary Street       40400000  abc@gmail.com
200  3   ShipTo  John       Smith     456 Elizabeth Street  40400000  abc@gmail.com
200  4   BillTo  John       Smith     456 Elizabeth Street
100  5   BillTo  John       Smith     123 Mary Street
The final pass (the one that just blocks on ID) matches the two records that originated with ID 3. That is the source of the 'complication of merge' (your words from your original post). One record is in Set 100, the other is in Set 200, so QualityStage will, in that pass, effectively merge Sets 100 and 200 so that the final output is one Set containing all the records:
Code: Select all
Set  ID  Type    FirstName  LastName  Address               Phone     email
100  3   BillTo  John       Smith     123 Mary Street       40400000  abc@gmail.com
100  3   ShipTo  John       Smith     456 Elizabeth Street  40400000  abc@gmail.com
100  4   BillTo  John       Smith     456 Elizabeth Street
100  5   BillTo  John       Smith     123 Mary Street
Did I answer your question?
Regards,
Robert
-
- Participant
- Posts: 150
- Joined: Tue Mar 13, 2007 1:17 am