Removal of Repeated Characters

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

oacvb
Participant
Posts: 128
Joined: Wed Feb 18, 2004 5:33 am

Removal of Repeated Characters

Post by oacvb »

We have a requirement to remove repeated characters such as AAA or BBB from a Name field when the same character occurs more than three times in a row. I tried the Convert function in a Transformer stage with the string AAA, but it removed the character even where it appeared only once. Please let me know how this can be implemented in a Transformer stage. I tried a server routine, but it can't be called from a parallel job.
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

You'll most likely have to write your own BASIC routine. Use a BASIC Transformer stage in a parallel job to call it. It's not shown in the Palette. Instead, go to Repository, Stage Types, Parallel, Processing, BASIC Transformer.
Choose a job you love, and you will never have to work a day in your life. - Confucius
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

You could write the routine in C++ and (having compiled and linked it and created a reference to it in DataStage) call it from a parallel Transformer stage.

Or you could use a BASIC Transformer stage in a parallel job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

There is a recent thread on hand-rolling an e-replace function in C that is probably a good starting point if you are not strong at C. It would be similar to that, just a bit of adjusted logic.

You can probably also handle this with pattern action... I personally wouldn't, but you could.
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

DataStage 9.1 has eReplace in the parallel Transformer stage.

Ray/qt_ky, if the OP were to replace strings such as AAAA with A using eReplace, how would he do it?

I would think that the OP would have to call eReplace 26 times in a nested manner. Agree?
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

26 for caps, 26 again for lower case, 10 more for digits, then symbols, and that assumes you can find a way to do it. With Unicode it would be intractable, and it is horrible even for simple ASCII.

The C or BASIC way would be to copy the original string into a new string, one byte at a time, dropping duplicates as you go (if current != previous, copy, else skip). This is an O(N) operation, which is pretty much as good as it gets here (you could also divide it across many threads if the strings were gigantic, but that is usually not necessary). It's a very simple and short chunk of code, and I highly recommend doing it this way.
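
A minimal C sketch of the copy-and-skip idea described above, assuming the caller supplies an output buffer at least as large as the input. The name squeeze_runs is made up for illustration. It collapses every run to a single character, so it does not yet handle the "more than thrice" rule from the original post.

```c
#include <stddef.h>

/* Hypothetical helper: collapse every run of identical bytes to one byte.
   dst must have room for at least strlen(src) + 1 bytes. */
void squeeze_runs(const char *src, char *dst)
{
    size_t out = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        /* Copy the first byte, and any byte that differs from its predecessor. */
        if (i == 0 || src[i] != src[i - 1])
            dst[out++] = src[i];
    }
    dst[out] = '\0'; /* terminate the hand-built string */
}
```

Called with "AAABBC", this leaves "ABC" in dst.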
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

No need to hard-code a scenario for every possible character...

Loop through the string from first to last position. Initialize a counter variable. Initialize a previous character variable.

If current character = previous character, increment a counter, else reset the counter.

If counter > 3 (or whatever your rule may be), then do something.
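
The steps above could be sketched in C roughly as follows. This assumes that "do something" means dropping the extra character, and that the caller supplies an output buffer at least as large as the input. The name limit_runs and the max_run parameter are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: copy src to dst, keeping at most max_run
   consecutive copies of any character.
   dst must have room for at least strlen(src) + 1 bytes. */
void limit_runs(const char *src, char *dst, size_t max_run)
{
    size_t out = 0;
    size_t count = 0;  /* length of the current run so far */
    char prev = '\0';  /* previous character; '\0' never occurs inside the string */

    for (size_t i = 0; src[i] != '\0'; i++) {
        if (src[i] == prev) {
            count++;           /* same character again: the run grows */
        } else {
            count = 1;         /* new character: start a new run */
            prev = src[i];
        }
        if (count <= max_run)  /* only copy while the run is within the limit */
            dst[out++] = src[i];
    }
    dst[out] = '\0';
}
```

With max_run set to 3, the input "AAAAABBBBC" comes out as "AAABBBC".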
abc123
Premium Member
Posts: 605
Joined: Fri Aug 25, 2006 8:24 am

Post by abc123 »

qt_ky, I am assuming that you are talking about a parallel transformer loop, right?
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

The generic pseudocode I outlined would be valid for any programming language, BASIC routine, parallel routine, etc.

In theory, it could even be done using looping within a Parallel Transformer stage, although it would be a bit cumbersome.
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

For clarity, that is the same algorithm I described, except that I suggested copying into a temporary buffer for simplicity and speed in the low-level languages. Details aside, I think this is the best algorithm for general strings (it can be improved for specific strings, of course).
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Note there was one detail in the original post to remove repetitive characters that occur more than thrice... :wink:
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

That is true! The core algorithm is unchanged, but you do need to handle this detail. That looks extra convoluted to do in DataStage Transformer logic.
chucksmith
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

Back to the original question, please note that the convert() function deals with single byte comparison/conversion. The change() function deals with substrings.
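
To illustrate the difference in plain C (this is only an analogy, not DataStage code, and both function names are made up): per-character deletion drops every matching character wherever it appears, which matches what the OP saw, while substring deletion only drops whole occurrences of the substring.

```c
#include <string.h>

/* Per-character deletion (the behaviour the OP saw with Convert):
   every byte that appears in `set` is dropped, wherever it occurs. */
void delete_chars(char *s, const char *set)
{
    size_t out = 0;

    for (size_t i = 0; s[i] != '\0'; i++) {
        if (strchr(set, s[i]) == NULL)
            s[out++] = s[i];
    }
    s[out] = '\0';
}

/* Substring deletion: only whole occurrences of `sub` are dropped.
   The scan resumes at the join, so a match formed by the removal
   is removed as well. */
void delete_substring(char *s, const char *sub)
{
    size_t n = strlen(sub);
    char *p = s;

    if (n == 0)
        return; /* nothing to delete; also avoids an endless loop */
    while ((p = strstr(p, sub)) != NULL)
        memmove(p, p + n, strlen(p + n) + 1); /* shift the tail left, including the 0 end */
}
```

Given "ANNA AAA", delete_chars with the set "AAA" leaves "NN ", while delete_substring with "AAA" leaves "ANNA ".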
asorrell
Posts: 1707
Joined: Fri Apr 04, 2003 2:00 pm
Location: Colleyville, Texas

Post by asorrell »

On a related note, because this is a per-character comparison of a string, it's going to slow down your job a lot, regardless of methodology. Be aware of the impact if this is a time-critical job that processes a lot of records.
Andy Sorrell
Certified DataStage Consultant
IBM Analytics Champion 2009 - 2020
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

It shouldn't. I recently standardized a text file in a similar way (removal of extra characters). The file was about 60 MB of text. Execution time was less than 3 seconds, and that includes reading the file and writing the fixed output on top of the processing time. That was one-shot hack code, so I did not even multi-thread it. If I had, it would have taken roughly 1 second on a typical 4-CPU machine.

I've done a couple of string standardization routines for DataStage and they are invariably the fastest stages in the job.

Here is a quick, rough cut at it, since the topic has stayed alive for so long.
-----------

#include <string.h>

#define SLOT_LEN  100 // max length of input string (including 0 end), tweak if needed.
#define NUM_SLOTS 10  // tweak for your system, this is fine for 4-8 cpu/threads.

static char outbuff[NUM_SLOTS * SLOT_LEN]; // yes, it's an evil global variable.

char *strmax3(char *buff)
{
    // for parallel execution: a micro "memory manager" that rotates through
    // the output slots. it's faster than allocating new memory for each input.
    static unsigned int which = 0;
    char *out;
    size_t len, dx = 0, lc;

    // null string or string too short to check: do nothing, return the input and exit.
    if (!buff) return buff;
    len = strlen(buff);
    if (len < 4) return buff;
    // added guard: input too long for a slot, return it untouched rather than overflow.
    if (len >= SLOT_LEN) return buff;

    which = (which + 1) % NUM_SLOTS;
    out = &outbuff[which * SLOT_LEN];

    // seeds the algorithm with the first three characters to simplify the loop.
    out[dx++] = buff[0];
    out[dx++] = buff[1];
    out[dx++] = buff[2];
    for (lc = 3; lc < len; lc++)
    {
        // copy unless this is the 4th (or later) identical char in a row.
        if (!(buff[lc] == buff[lc-1] && buff[lc] == buff[lc-2] && buff[lc] == buff[lc-3]))
            out[dx++] = buff[lc];
    }
    out[dx] = 0; // standard C end of string MUST be added to hand-cooked strings.
    return out;
}