Multi-valued domain matching

Infosphere's Quality Product

Moderators: chulett, rschirm

Post Reply
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Multi-valued domain matching

Post by ymadden@bmi.com »

I'm not sure if the post title is accurate. Maybe that is why I'm not getting any search hits.

We have a requirement to match on attributes that can have multiple values, e.g., the multiple contacts below
Reference Set -
Company A, Contacts {Person A, Person B, Person C}
Company B, Contacts {Person D, Person E}
Data Set -
Company A, Contacts {Person B}

The desired result is that Company A should match based on the Person B contact. I'm not exactly sure how to accomplish this. Exploding the reference set so that there is a record per Person seems unreasonable. Would using a vector or subrecord column type be the correct approach? The documentation seems a little confusing.

Any input is appreciated.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

How are your data organised? In particular,does the Contacts table have one row per Person/Company combination? If so, you can pretty much use an exact match (or UNCERT if there are likely to be spelling errors) based on Company.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Post by ymadden@bmi.com »

thanks Ray. Our data is in relational tables, so there's a Company table and a Contacts table. In reality, they're not company and contacts, those are just simple examples. We also have other multi-valued attributes of the business objects that we need to match on. Some of these need to be used in blocking strategies too. We've never used the subrecord or vector structures, so I'm wondering if they would be a better alternative to transforming the data into all possible combinations of every multi-valued attribute. Producing these combinations could create billions of records in the datasets for some business objects.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You have to ask yourself whether these billions of rows are the correct answer. If not, tighten your blocking or matching criteria. As a general rule, blocking in the first pass should eliminate as much as possible from each block of "matchable" records.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ymadden@bmi.com
Premium Member
Premium Member
Posts: 21
Joined: Fri Feb 01, 2013 9:13 am

Post by ymadden@bmi.com »

That billion row question is the exact one I am asking myself now. I've read back over the documentation and can see how we can restructure the billion rows by combining records to have vector columns. Has anyone ever actually used a vector column in a match command, and what were the benefits (performance, accuracy, etc.) or drawbacks of doing so?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I think you would be better off parsing the multi-valued fields into separate words (for example by converting the internal delimiter character to space) and using the MULT_EXACT or MULT_UNCERT comparisons in QualityStage match.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply