Pattern Matching

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
datastagedw
Participant
Posts: 53
Joined: Fri Mar 07, 2008 1:17 am

Pattern Matching

Post by datastagedw »

Hi All,

We have a requirement where in after the look up, not only the records with exact match but also any source records matching the pattern in the look up dataset should also pass.

We used the below option however there can be performance issues as the source data cann be huge:

Have a left outer join based on some dummy key column (generated on either side of the join). This will resut in cartesian product. Then using a transformer, using count function doing the pattern match and placing a constraint accordingly. the output will be 1 for pattern matched records and 0 for unmatched records.

Is there any other approach which can give better performance? Please help
ETL DEVELOPER
pandeesh
Premium Member
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

you can try with index function
pandeeswaran
datastagedw
Participant
Posts: 53
Joined: Fri Mar 07, 2008 1:17 am

Post by datastagedw »

Thanks for the quick reply. The issue is that the substring to match is dynamic rather than static. The substring values to be looked for come from the reference dataset.
ETL DEVELOPER
pandeesh
Premium Member
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

Can you please post your sample source and reference data as well as your expected result?
pandeeswaran
datastagedw
Participant
Posts: 53
Joined: Fri Mar 07, 2008 1:17 am

Post by datastagedw »

Sample Source data:

earabcrt
qewtuy
ghruo
ehraabc
abcrtch
anmnoas
mnopwwas
asdasmno
asdad
asdas
ryrED

Reference dataset
abc
mno

Output results:
earabcrt
ehraabc
abcrtch
anmnoas
mnopwwas
asdasmno

'abc' and 'mno' sub strings should be available in the source column
ETL DEVELOPER
samyamkrishna
Premium Member
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

try this unix script


while read record
do
cat srcsam.txt | grep '$record' >> tgtsam.txt
done < refsam.txt

Thsi should work
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I don't believe DataStage can do this. You might try using a sparse lookup (assuming you have a database that supports the same) with LIKE in the user-defined SQL.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
datastagedw
Participant
Posts: 53
Joined: Fri Mar 07, 2008 1:17 am

Post by datastagedw »

Thanks for the reply.. We are using SQL 2005, need to check if it supports sparse look up.

In the meanwhile, we decided to go with the approach I mentioned in the post and do a performance testing and observe the results and time taken
ETL DEVELOPER
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Only DB2 and Oracle support sparse lookup. Sorry.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
pandeesh
Premium Member
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Re: Pattern Matching

Post by pandeesh »

datastagedw wrote:Hi All,
Have a left outer join based on some dummy key column (generated on either side of the join). This will resut in cartesian product. Then using a transformer, using count function doing the pattern match and placing a constraint accordingly. can give better performance? Please help
Hi,

Can you please explain this approach little bit?

My source file contains below records:
source
--------
abc
bca
axdbc
cbhss
bcda


Reference
----------
bc

My output should be

abc
bca
axdbc

Thanks
pandeeswaran
pandeesh
Premium Member
Premium Member
Posts: 1399
Joined: Sun Oct 24, 2010 5:15 am
Location: CHENNAI, TAMIL NADU

Post by pandeesh »

ray.wurlod wrote: You might try using a sparse lookup (assuming you have a database that supports the same) with LIKE in the user-defined SQL.
Hi Ray,

I have tried the sparse lookup like below using oracle tables:

Source table contains a field VAL

and the records are

abc
bca
abcde
be

Reference table contains a field VAL and it contains only one record bc.

So the result should be

abc
bca
abcde

Job design:

Code: Select all



src table------------lkp stage(sparse)----------->seq file
                              |
                              |
                            ref table

In the ref table , i have specified the lookup as sparse and the user-defined sql is

Code: Select all

SELECT VAL FROM WHERE VAL LIKE '%ORCHESTRATE.VAL%'
in the src table, the query is like

Code: Select all

SELECT VAL FROM src
But nothing gets matched.
Please help me.

Thanks
pandeeswaran
samyamkrishna
Premium Member
Premium Member
Posts: 258
Joined: Tue Jul 04, 2006 10:35 pm
Location: Toronto

Post by samyamkrishna »

Hi Pandeesh,

I think the source and reference should be interchanged when you are trying that.
Post Reply