Removal of Repeated Characters
We have a requirement to remove repeated characters such as AAA or BBB from Name when a character occurs more than three times. I tried the Convert function in a Transformer stage with the string AAA, but it removed the character even where it appeared only once. Please let me know how this can be implemented in a Transformer stage. I tried a server routine, but it can't be called from a parallel job.
You'll most likely have to write your own BASIC routine. Use a BASIC Transformer stage in a parallel job to call it. It's not shown in the Palette. Instead, go to Repository, Stage Types, Parallel, Processing, BASIC Transformer.
Choose a job you love, and you will never have to work a day in your life. - Confucius
You could write the routine in C++ and (having compiled and linked it and created a reference to it in DataStage) call it from a parallel Transformer stage.
Or you could use a BASIC Transformer stage in a parallel job.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
26 for caps, 26 again for lower case, 10 more for numbers, then symbols, and that assumes you can find a way to do it. If you had Unicode it would be intractable, and it is horrible even for simple ASCII.
The C or BASIC way would be to copy the original string into a new string, one byte at a time, dropping duplicates as you go (if current != previous, copy; else skip). This is an O(N) operation, which is pretty much as good as it gets here (you could also divide the work across many threads if the strings were gigantic, but that is usually not necessary). It's a very simple and short chunk of code, and I highly recommend doing it this way.
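A minimal sketch of that byte-by-byte copy, assuming single-byte characters and a destination buffer the caller has sized to at least strlen(src) + 1. The name squeeze_repeats is made up for illustration. Note that it collapses every run down to one character, which is stricter than the "more than three" rule in the original question.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src into dst, skipping any byte equal to the byte before it.
   Assumption: dst has room for strlen(src) + 1 bytes.
   Returns the length of the string written to dst. */
size_t squeeze_repeats(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        /* the first byte is always copied; later bytes only if they differ */
        if (i == 0 || src[i] != src[i - 1])
            dst[out++] = src[i];
    }
    dst[out] = '\0';
    return out;
}
```

Each input byte is examined once, so the cost grows linearly with the string length.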
No need to hard-code a scenario for every possible character...
Loop through the string from first to last position. Initialize a counter variable. Initialize a previous character variable.
If current character = previous character, increment a counter, else reset the counter.
If counter > 3 (or whatever your rule may be), then do something.
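The pseudocode above could look roughly like this in C. This is only a sketch: the function name limit_runs and the max_run parameter are invented here, "do something" is assumed to mean "drop the extra character", and the caller is assumed to supply a destination buffer of at least strlen(src) + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src into dst, keeping at most max_run consecutive copies of any
   character. Returns the length of the string written to dst. */
size_t limit_runs(const char *src, char *dst, size_t max_run)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    size_t run = 0;     /* length of the current run of identical characters */
    char prev = '\0';   /* previous character seen */
    for (size_t i = 0; src[i] != '\0'; i++) {
        if (i > 0 && src[i] == prev)
            run++;      /* same as the previous character: extend the run */
        else
            run = 1;    /* different character: start a new run */
        if (run <= max_run)
            dst[out++] = src[i];
        prev = src[i];
    }
    dst[out] = '\0';
    return out;
}
```

With max_run set to 3, an input of AAAAABBBBC comes out as AAABBBC. The same loop, counter, and previous-character variable carry over directly to a BASIC routine.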
The generic pseudo code I outlined would be valid for any programming language, BASIC routine, Parallel routine, etc.
In theory, it could even be done using looping within a Parallel Transformer stage, although it would be a bit cumbersome.
Back to the original question, please note that the convert() function deals with single byte comparison/conversion. The change() function deals with substrings.
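To illustrate the difference, here is a plain C emulation of the two behaviours. These are not the DataStage functions themselves; drop_chars and drop_substr are hypothetical names, and the sketch only models deletion, assuming the destination buffer is at least as large as the source. It shows why passing AAA to a character-based function removes every single A.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Character semantics: drop every byte of src that appears anywhere in set.
   A set of "AAA" is the same as a set of "A". */
void drop_chars(const char *src, const char *set, char *dst)
{
    assert(src != NULL && set != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; src[i] != '\0'; i++)
        if (strchr(set, src[i]) == NULL)
            dst[out++] = src[i];
    dst[out] = '\0';
}

/* Substring semantics: drop only whole, non-overlapping occurrences of sub. */
void drop_substr(const char *src, const char *sub, char *dst)
{
    assert(src != NULL && sub != NULL && dst != NULL);
    size_t n = strlen(sub), out = 0;
    while (*src != '\0') {
        if (n > 0 && strncmp(src, sub, n) == 0)
            src += n;               /* skip the matched substring */
        else
            dst[out++] = *src++;    /* copy one byte */
    }
    dst[out] = '\0';
}
```

For the input "ANNA AAA", drop_chars with "AAA" leaves "NN " while drop_substr with "AAA" leaves "ANNA ".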
Chuck Smith
www.anotheritco.com
It shouldn't. I recently standardized a text file in a similar way (removal of extra characters). The file was about 60MB of text. Execution time was less than 3 seconds and that includes reading the file and writing the fixed file output on top of the processing time. That was a one-shot hack code, so I did not even multi-thread it. If I had multi-threaded it, it would have been ~ 1 second on a typical 4 cpu machine.
I've done a couple of string standardization routines for DataStage, and they are invariably the fastest stages in the job.
Here is a quick, rough cut at it, since the topic has stayed alive for so long.
-----------
#include <string.h>

#define SLOTS 10      /* number of rotating output slots; tweak for your system */
#define SLOT_LEN 100  /* max length of input string (including 0 end); tweak if needed */

char outbuff[SLOTS * SLOT_LEN]; // yes, it's an evil global variable.

char *strmax3(char *buff)
{
    // a micro "memory manager": rotate through fixed slots, which is faster
    // than allocating new memory for each input.
    // note: the rotation is not atomic, and a result is only valid until its slot is reused.
    static unsigned int which = 0;

    // null string, or string too short to check: do nothing, return the input.
    if (!buff) return buff;
    size_t len = strlen(buff);
    if (len < 4) return buff;
    // input too long for a slot: return it unchanged rather than overflow the buffer.
    if (len >= SLOT_LEN) return buff;

    which = (which + 1) % SLOTS;
    char *out = &outbuff[which * SLOT_LEN];
    size_t dx = 0, lc;

    // seed the output with the first three characters to simplify the loop.
    out[dx++] = buff[0];
    out[dx++] = buff[1];
    out[dx++] = buff[2];
    for (lc = 3; lc < len; lc++)
    {
        // copy the character unless it is the 4th (or later) identical one in a row.
        if (!(buff[lc] == buff[lc-1] && buff[lc] == buff[lc-2] && buff[lc] == buff[lc-3]))
            out[dx++] = buff[lc];
    }
    out[dx] = 0; // the standard C end of string MUST be added to hand-cooked strings.
    return out;
}
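Because strmax3 above writes into a shared global buffer, the returned pointer is only good until that region is reused. One possible alternative, sketched here under the assumption that the caller can supply its own buffer, is a variant that takes the destination and its capacity as arguments. The name strmax3_r and its return convention are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy in to out, dropping the 4th and later identical characters in a row.
   Returns 0 on success, or -1 if a pointer is NULL or cap is smaller than
   strlen(in) + 1 (a conservative check: the output is never longer than
   the input). */
int strmax3_r(const char *in, char *out, size_t cap)
{
    if (in == NULL || out == NULL) return -1;
    size_t len = strlen(in);
    if (len + 1 > cap) return -1;

    size_t dx = 0;
    for (size_t i = 0; i < len; i++) {
        /* true when in[i] matches the three input characters before it */
        int fourth = i >= 3 && in[i] == in[i - 1]
                            && in[i] == in[i - 2]
                            && in[i] == in[i - 3];
        if (!fourth)
            out[dx++] = in[i];
    }
    assert(dx < cap);   /* holds because dx <= len < cap */
    out[dx] = '\0';
    return 0;
}
```

Each caller (or thread) owns its output buffer, so no state is shared between calls, at the cost of one extra argument pair.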