Removal of Repeated Characters

oacvb · Post by **oacvb** » Wed Oct 19, 2016 3:48 pm

We have a requirement to remove repeated characters, such as AAA or BBB, from a Name field when a character occurs more than thrice. I tried the Convert function in a Transformer stage with the string AAA, but it removed the character even where it appeared only once. Please let me know how this can be implemented in a Transformer stage. I also tried a server routine, but it can't be called from a parallel job.

qt_ky · Post by **qt_ky** » Thu Oct 20, 2016 9:18 am

You'll most likely have to write your own BASIC routine. Use a BASIC Transformer stage in a parallel job to call it. It's not shown in the Palette. Instead, go to Repository, Stage Types, Parallel, Processing, BASIC Transformer.

ray.wurlod · Post by **ray.wurlod** » Thu Oct 20, 2016 12:54 pm

You could write the routine in C++ and (having compiled and linked it and created a reference to it in DataStage) call it from a parallel Transformer stage.

Or you could use a BASIC Transformer stage in a parallel job.

UCDI · Post by **UCDI** » Fri Oct 21, 2016 10:27 am

There is a recent thread on hand-rolling an e-replace function in C that is probably a good starting point if you are not strong at C. It would be similar to that, just a bit of adjusted logic.

You can probably also handle this with pattern action... I personally wouldn't, but you could.

abc123 · Post by **abc123** » Mon Nov 14, 2016 11:33 am

DataStage 9.1 has eReplace in the parallel Transformer stage.

Ray/qt_ky, if the OP were to replace strings such as AAAA with A using eReplace, how would he do it?

I would think that the OP would have to call eReplace 26 times in a nested manner. Agree?

UCDI · Post by **UCDI** » Mon Nov 14, 2016 1:43 pm

26 for caps, 26 again for lower case, 10 more for digits, then symbols... and that assumes you can find a way to do it at all. With Unicode it would be intractable, and it is horrible even for plain ASCII.

The C or BASIC way would be to copy the original string into a new string, one byte at a time, dropping duplicates as you go (if current != previous, copy, else skip). This is an O(N) operation, which is pretty much as good as it gets here (you could also divide it across many threads if the strings were gigantic, but that is usually not necessary). It's a very simple and short chunk of code; I highly recommend doing it this way...
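A minimal C sketch of the copy-and-skip idea described above. The function name `squeeze_repeats` is hypothetical, and the caller is assumed to supply an output buffer of at least `strlen(src) + 1` bytes. Note that this collapses every run to a single character, so it does not yet apply the "more than thrice" rule from the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a DataStage function): copy src into dst,
   skipping any byte that equals the byte immediately before it.
   Assumption: dst has room for at least strlen(src) + 1 bytes. */
char *squeeze_repeats(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);

    size_t len = strlen(src);
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        /* Always keep the first byte; afterwards keep a byte only
           when it differs from the previous input byte. */
        if (i == 0 || src[i] != src[i - 1]) {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0'; /* terminate the hand-built string */
    return dst;
}
```

With a 32-byte buffer `buf`, `squeeze_repeats("AAABBBC", buf)` yields `"ABC"`. The comparison is byte-wise and case-sensitive, so `"Aaron"` is left unchanged.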

qt_ky · Post by **qt_ky** » Mon Nov 14, 2016 3:09 pm

No need to hard-code a scenario for every possible character...

Loop through the string from first to last position. Initialize a counter variable. Initialize a previous character variable.

If current character = previous character, increment a counter, else reset the counter.

If counter > 3 (or whatever your rule may be), then do something.
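The counter approach above could look roughly like this in C. This is a sketch under assumptions: the name `limit_runs` and the `max_run` parameter are illustrative, the "do something" step is taken to mean "skip the character", and the caller is assumed to provide an output buffer of at least `strlen(src) + 1` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: keep at most max_run consecutive copies of any
   character and drop the rest of the run.
   Assumption: dst has room for at least strlen(src) + 1 bytes. */
char *limit_runs(const char *src, char *dst, size_t max_run)
{
    assert(src != NULL && dst != NULL);

    size_t len = strlen(src);
    size_t out = 0;
    size_t run = 0;    /* the counter: length of the current run */
    char prev = '\0';  /* the previous character */

    for (size_t i = 0; i < len; i++) {
        if (i > 0 && src[i] == prev) {
            run++;     /* same as previous: extend the run */
        } else {
            run = 1;   /* different character: reset the counter */
        }
        if (run <= max_run) {
            dst[out++] = src[i]; /* still within the allowed run length */
        }
        prev = src[i];
    }
    dst[out] = '\0';
    return dst;
}
```

With `max_run` set to 3, `"AAAAAB"` becomes `"AAAB"` and `"Smith"` is unchanged. Passing the limit as a parameter keeps the rule ("or whatever your rule may be") out of the loop body.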

abc123 · Post by **abc123** » Mon Nov 14, 2016 3:40 pm

qt_ky, I am assuming that you are talking about a parallel transformer loop, right?

qt_ky · Post by **qt_ky** » Tue Nov 15, 2016 8:32 am

The generic pseudo code I outlined would be valid for any programming language, BASIC routine, Parallel routine, etc.

In theory, it could even be done using looping within a Parallel Transformer stage, although it would be a bit cumbersome.

UCDI · Post by **UCDI** » Tue Nov 15, 2016 9:05 am

For clarity, that is the same algorithm I described, except that I suggested copying into a temp buffer for simplicity and speed in the low-level languages. Details aside, I think this is the best algorithm for general strings (it can be improved for specific strings, of course).

qt_ky · Post by **qt_ky** » Tue Nov 15, 2016 9:56 am

Note there was one detail in the original post to remove repetitive characters that occur more than thrice...

UCDI · Post by **UCDI** » Tue Nov 15, 2016 2:46 pm

That is true! The core algorithm is unchanged, but you do need to handle this detail. That looks to be extra convoluted in DataStage Transformer logic.

chucksmith · Post by **chucksmith** » Thu Nov 17, 2016 9:38 am

Back to the original question, please note that the convert() function deals with single byte comparison/conversion. The change() function deals with substrings.
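The difference between per-character and substring semantics can be illustrated with two plain C analogues. These are not the DataStage implementations of convert() or change(); the names `remove_chars` and `remove_substring` are hypothetical, and both assume the caller supplies an output buffer of at least `strlen(src) + 1` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Character-set semantics: drop every byte of src that appears
   anywhere in set. A set of "AAA" behaves exactly like "A". */
char *remove_chars(const char *src, const char *set, char *dst)
{
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        if (strchr(set, src[i]) == NULL) {
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';
    return dst;
}

/* Substring semantics: drop only complete occurrences of sub,
   scanning left to right. */
char *remove_substring(const char *src, const char *sub, char *dst)
{
    size_t sub_len = strlen(sub);
    size_t out = 0;

    assert(sub_len > 0); /* an empty pattern would never advance */
    while (*src != '\0') {
        if (strncmp(src, sub, sub_len) == 0) {
            src += sub_len;      /* skip the whole match */
        } else {
            dst[out++] = *src++; /* copy one byte through */
        }
    }
    dst[out] = '\0';
    return dst;
}
```

Given the input `"ANNA"` and the pattern `"AAA"`, `remove_chars` returns `"NN"`, while `remove_substring` returns `"ANNA"` unchanged. The first result matches the symptom in the original post, where a character was removed even though it appeared only once.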

asorrell · Post by **asorrell** » Thu Nov 17, 2016 9:46 am

On a related note, because this is a per-character comparison of a string, it's going to slow down your job a lot, regardless of methodology. Be aware of the impact if this is a time-critical job that processes a lot of records.

UCDI · Post by **UCDI** » Thu Nov 17, 2016 11:03 am

It shouldn't. I recently standardized a text file in a similar way (removal of extra characters). The file was about 60MB of text. Execution time was less than 3 seconds and that includes reading the file and writing the fixed file output on top of the processing time. That was a one-shot hack code, so I did not even multi-thread it. If I had multi-threaded it, it would have been ~ 1 second on a typical 4 cpu machine.

I've done a couple of string standardization routines for DataStage and they are invariably the fastest stages in the job.

Here is a quick, rough cut at it, since the topic has stayed alive for so long.
-----------

#include <string.h>

#define SLOT_COUNT 10   // number of rotating output slots; tweak for your system, fine for 4-8 cpu/threads.
#define SLOT_SIZE  100  // max length of input string (including 0 end); tweak if needed.

char outbuff[SLOT_COUNT * SLOT_SIZE]; // yes, it's an evil global variable.

char *strmax3(char *buff)
{
    // For parallel execution: a micro "memory manager".
    // It's faster than allocating new memory for each input.
    // Caveat: the slot index is not synchronized, so this assumes each
    // process or thread runs with its own copy of these variables.
    static unsigned int which = 0;

    // Null string, or string too short to check: do nothing, return the input.
    if (!buff) return buff;
    size_t len = strlen(buff);
    if (len < 4) return buff;
    // Added guard: an input that does not fit in one slot is returned
    // unchanged instead of overflowing into the next slot.
    if (len >= SLOT_SIZE) return buff;

    which = (which + 1) % SLOT_COUNT;
    char *out = &outbuff[which * SLOT_SIZE];
    size_t dx = 0, lc;

    // Seed the output with the first three characters to simplify the loop.
    out[dx++] = buff[0];
    out[dx++] = buff[1];
    out[dx++] = buff[2];
    for (lc = 3; lc < len; lc++)
    {
        // Copy unless this would be the 4th (or later) identical char in a row.
        if (!(buff[lc] == buff[lc-1] && buff[lc] == buff[lc-2] && buff[lc] == buff[lc-3]))
            out[dx++] = buff[lc];
    }
    out[dx] = 0; // standard C end of string MUST be added to hand-cooked strings.
    return out;
}