Remove multiple ASCII characters from input data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Vrisha
Premium Member
Posts: 60
Joined: Sat Jul 15, 2017 9:32 pm
Location: Texas,USA


Post by Vrisha »

Hi,

I have a requirement to remove multiple ASCII characters from the input data.
E.g., the input data has the values below:

column name - Description
==================
Paratransit¿Web-Based¿Trip¿Boo
TRE DFW ROW Maintenance Tie Re

Input data length - Char(30)
Output data length - Varchar(30)

I am getting an error 'A data conversion error was encountered in bulk load mode for row '


1) Transformation rule - Convert(char(160):char(168):char(194),'',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION)))) - same error as above

2) CONVERT(CHAR(194), "", CONVERT(CHAR(191), "", Convert(char(160),'',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION)))))) - same error as above

3) When I changed the data length to 40, everything loaded, but I am not allowed to do that.

4) I changed the write mode from bulk load to 'Insert' and the table action to
'Create' in the target Oracle stage, and am getting the error 'Error code: 12899, Error message: ORA-12899: value too large for column "MSPO_TEST"."DESC_PROJECTS2"."DESCRIPTION" (actual: 34, maximum: 30)'.


How should I remove multiple ASCII characters from the input data? I am able to remove a single ASCII character, char(160), using the Ereplace function (which I posted about in the forum).

Please let me know what mistake I am making. Thanks.
Suja
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Perhaps you should look into why they are like that and see what needs to be done to correctly 'translate' them rather than simply removing them... just a thought.
-craig

"You can never have too many knives" -- Logan Nine Fingers
Vrisha
Premium Member
Posts: 60
Joined: Sat Jul 15, 2017 9:32 pm
Location: Texas,USA

Post by Vrisha »

Hi Craig,

The source application team can't provide quality data. They told us to remove the ASCII characters using DataStage and load the data into the target.

In the source Oracle database we can't see these ASCII characters. So, using the export wizard, I exported the data into a CSV file and, by eyechecking it, found char values of 160, 191, and 194 in the input.

I don't know how to write logic to remove/translate them apart from what I tried above.
Suja
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

I found this to be extra painful in DataStage and just did it in C.

Make a lookup table like this:

static const unsigned char pt[256] = {
' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ','!','"','#','$','%','&','\'','(',
')','*','+',',','-','.','/','0','1','2','3','4','5','6','7','8','9',':',';','<',
'=','>','?','@','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P',
'Q','R','S','T','U','V','W','X','Y','Z','[','\\',']','^','_','`','a','b','c','d',
'e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x',
'y','z','{','|','}','~',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' '
};

and then just loop over the string:

for (i = 0; oldstring[i] != '\0'; ++i)
    newstring[i] = pt[(unsigned char)oldstring[i]];
newstring[i] = '\0';  /* don't forget to terminate the new string */

and then do an internal and external trim on space.
Vrisha
Premium Member
Posts: 60
Joined: Sat Jul 15, 2017 9:32 pm
Location: Texas,USA

Post by Vrisha »

Thank you, UCDI.

Your solution is a bit confusing to me. I want to use Ereplace or Convert to remove multiple ASCII characters.

I don't know what mistake I am making in the Convert syntax for handling multiple ASCII characters.
Suja
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

Ereplace is horrible for this. You have to string something like 100 of them together; it's slow and the syntax is absolutely insane. It looks like
ereplace(ereplace(ereplace(ereplace(... which loops over the same string over 100 times, replacing 1 bad character at a time in each pass.

My solution is to use C or BASIC or whatever to do it all at once in 1 loop.

If you are 100% sure you only have 3 bad characters and will only ever have those 3 characters in the future, Ereplace should work when chained like that.

But you said you only spot-checked and your source is unreliable, so I advise getting ALL the nonstandard characters out at once.

My lookup table replace approach kills everything that is not a normal character, not just the 3 you have found so far.

It is simple code: you make a lookup table over the entire byte range (0-255) and replace each byte of the string with the lookup-table value, which is a space for all junk characters and unchanged for the usable values. Then you can clean up the extra spaces in DataStage or in the C, whichever you prefer.

I can give you the BASIC code instead if you prefer, which you can use directly in a BASIC Transformer instead of a regular Transformer. It's a little slower, but we are talking seconds per millions of records, hardly worth a mention in the grand scheme.
Vrisha
Premium Member
Posts: 60
Joined: Sat Jul 15, 2017 9:32 pm
Location: Texas,USA

Post by Vrisha »

The problem is resolved.

I checked Ray's post (Line Breaks in the flat files) and found the clue. It worked.

Steps
1) Created 3 stage variables:
svChar160 - Char(160,@TRUE)
svChar191 - Char(191,@TRUE)
svChar194 - Char(194,@TRUE)

2) Transformation rule in the Transformer:
Convert(svChar160:svChar191:svChar194,' ',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION))))

Source data - Paratransit¿Web-Based¿Trip¿Boo

Target data - ParatransitWeb-BasedTripBoo

Thank you, Ray for your valuable post.
Suja
Vrisha
Premium Member
Posts: 60
Joined: Sat Jul 15, 2017 9:32 pm
Location: Texas,USA

Post by Vrisha »

Thank you UCDI for your valuable suggestion.
Suja
BSridhar
Participant
Posts: 4
Joined: Fri May 25, 2018 9:53 am

Remove multiple ASCII (>256) characters from input data

Post by BSridhar »

Char and Convert do not support characters whose code value is greater than 256. What should I do in this case?

Thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

UniChar() for Unicode data. ASCII stops at 255.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
BSridhar
Participant
Posts: 4
Joined: Fri May 25, 2018 9:53 am

Post by BSridhar »

Hi Ray,

I couldn't find UniChar in version 11.5. Could you please guide me to it? I am sorry.

Thank You
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

That suggests that you don't have NLS enabled, in which case you can't process anything other than ASCII.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Re: Remove multiple ASCII (>256) characters from input data

Post by UCDI »

BSridhar wrote: Char and Convert do not support characters whose code value is greater than 256. What should I do in this case?
Not sure about DataStage, but 'most' software, if told a 'text' file is in a 1-byte encoding (ASCII or one-off encodings), will read Unicode letters as a group of multiple ASCII symbols.

For English and generally Western languages, most of the Unicode becomes some sort of space/blank symbol plus the ASCII letter, so hello looks like h e l l o and so on. However, any truly non-English or non-ASCII codes will make a mess, and any code that attempts to clean it up will damage the actual data.

So while it is 'sort of' safe to hack around on 'American English text' that happens to be stored in Unicode as if it were ASCII, you are playing with fire if you do this. If you are using Unicode, use Unicode for everything. If you are using ASCII, you can promote to Unicode if needed or stay in ASCII. Avoid mixing them as much as possible. And there are other encodings from older systems that you sometimes have to use too, mostly 1-byte but NOT ASCII.

I'll do a little of that hacking, but not in DataStage -- I have some code that can read dataset headers, which are apparently in Unicode for the internal textual parts... :)