Remove duplicates after concatenation

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
spuli
Participant
Posts: 40
Joined: Thu Apr 09, 2015 12:13 pm

Remove duplicates after concatenation

Post by spuli »

Hi All,

I have got a requirement to eliminate the duplicates after concatenating a set of columns.
Col1 value: 123
Col2 value: 234
Col3 value: 123
Col4 value: 222
Col5 value: 222

After concatenating with pipe I should get the value 123|234|222||

I have the below logic to achieve this result:

Appending column by column to a stage variable and before appending checking the existing concatenated value using Index() function. If the result of Index() is 0 then append otherwise skip the column.

Please provide your suggestions if there is any other better logic.

Thanks in advance.
Thanks,
Sai
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

There are certainly other options that are more complex but are they 'better'? Hard to say. Find something that works for you and is 'self documented' properly. So do you actually only have five columns to check or that just a made up example that is only 'like' your source? (sorry, pet peeve alert)

For a finite 'small' number, rather than Index you might consider a brute force check:

Code: Select all

Col1:  Put in string
Col2:  If Col2 != Col1, append to string
Col3:  If Col3 != Col2 and Col3 != Col1, append to string.
Lather, rinse, repeat, then add any missing pipes for the empties. Or stick with the Index function if you've already gotten that to work. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
UCDI
Premium Member
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

logic may not be sufficient.
what if you had

12345 234 345
would your algorithm fail to work because 234 is in 12345?
if the data is fixed width or has various criteria your code will work, but you should be aware of the implications so you can make an informed decision.

You can probably rig the logic to not double the separator if you want it that way, this is just per your requirement though. I personally would toss the extras unless dictated from on-high to keep them.

I can't think of anything any more efficient in datastage. C would eat it alive if you need more speed, or have a LOT of columns. If you try to do this on 100 column table with a few billion rows, you would want to do it in C instead. If you are for real just doing 5 columns of a moderate. The issue is the N*N search for duplicates, each column you add starts at the beginning and searches the string again for a dup, then the next one searches all that again + the last one... this algorithm will explode with a large table. With the C, you can brute force search it O(N) and be done with it. Someone here just did something like this (n squared string searching) and the C ran almost 60 times faster for a fairly small input size...
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I was assuming they needed the extra pipes as they were doing what looked to me like essentially a de-dup and coalesce / shuffle back into the current structure, front to back. And yes, also assuming a relatively 'small' number of columns with a volume NOT measured in the bajillions. :wink:

And to your point when using Index to check for substring matches, in order to avoid the dreaded false positive, I always wrap the full string to check and substring in delimiters. Meaning you would look for "|234|" in "|12345|234|345|" and thus avoid them.
-craig

"You can never have too many knives" -- Logan Nine Fingers
JRodriguez
Premium Member
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City
Contact:

Post by JRodriguez »

We want it fast, codeless?
-Read the data
-Horizontal pivot
-Apply Unique sort
-Vertical pivot
-Concat columns
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yup, that was one of the 'other methods' I was thinking of when I mentioned there were others I could think of. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
spuli
Participant
Posts: 40
Joined: Thu Apr 09, 2015 12:13 pm

Post by spuli »

Hi JRodriguez,

Thanks for posting your idea. I thought of this approach before. There is a difficulty in this. I have to retain the rank of the columns.

Lets say col3 value is already in col1, I have to move col4 to col3 only if it is not a duplicate of existing columns.

Please let me know if there is any way to keep track of column order
Thanks,
Sai
spuli
Participant
Posts: 40
Joined: Thu Apr 09, 2015 12:13 pm

Post by spuli »

Thanks for your suggestion CHulett. I have 24 columns not 5.
Thanks,
Sai
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's fine as it doesn't really affect my answer, it's just one-time coding which you could grind out in all the time spent here looking for something else. And would perform just fine I would wager. :wink:

Now if you had said 240...
-craig

"You can never have too many knives" -- Logan Nine Fingers
UCDI
Premium Member
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

24 would only be a problem if the "real" width of the strings were rather large (hundreds... 500+ might begin to have an impact, but it really depends on the machine behind the work). If they are typical ( < 100, say?) youll be fine with what you have.

I would keep what you have unless you can find a problem with it. If it performs within your needs and does the job, its fine. If not, there are ways to improve it.

If the pivot approach reorders the data, there is probably a way to recover it by doing something like appending a counter column that stores the order. But if what you have is working, leave it alone seems best to me.
spuli
Participant
Posts: 40
Joined: Thu Apr 09, 2015 12:13 pm

Post by spuli »

Thank you all for your wonderful suggestions.
Thanks,
Sai
Post Reply