Remove multiple ASCII characters from input data
Moderators: chulett, rschirm, roy
Hi,
I have a requirement to remove multiple ASCII characters from the input data.
E.g., the input data has the values below:
column name - Description
==================
Paratransit¿Web-Based¿Trip¿Boo
TRE DFW ROW Maintenance Tie Re
Input data length - Char(30)
Output data length - Varchar(30)
I am getting the error 'A data conversion error was encountered in bulk load mode for row'.
1) Transformation rule - Convert(char(160):char(168):char(194),'',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION))) - error as above
2) CONVERT(CHAR(194), "", CONVERT(CHAR(191), "", Convert(char(160),'',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION)))))) - error as above
3) When I changed the data length to 40, everything got loaded, but I am not allowed to do that.
4) I changed the bulk load write mode to 'Insert' and the table action to 'Create' in the target Oracle stage, and got the error 'Error code: 12899, Error message: ORA-12899: value too large for column "MSPO_TEST"."DESC_PROJECTS2"."DESCRIPTION" (actual: 34, maximum: 30)'.
How should I remove multiple ASCII characters from the input data? I am able to remove one ASCII character, char(160), using the Ereplace function (which I posted in the forum).
Please let me know what mistake I am making. Thanks.
Suja
Hi Craig,
The source application team can't provide quality data. They told us to remove the ASCII characters using DataStage and load into the target.
In the source Oracle database we can't see the ASCII characters, so I exported the data to a CSV file using the export wizard and, by eye-check, found char values of 160, 191 and 194 in the input.
I don't know how to write logic to remove / translate them apart from what I tried above.
Suja
I found this to be extra painful in DataStage and just did it in C.
Make a lookup table like this:
static const unsigned char pt[256] = {
' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ','!','"','#','$','%','&','\'','(',
')','*','+',',','-','.','/','0','1','2','3','4','5','6','7','8','9',':',';','<',
'=','>','?','@','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P',
'Q','R','S','T','U','V','W','X','Y','Z','[','\\',']','^','_','`','a','b','c','d',
'e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x',
'y','z','{','|','}','~',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',
' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' ',' '
};
and then just loop over the string:
for (i = 0; oldstring[i] != '\0'; i++) newstring[i] = pt[(unsigned char)oldstring[i]];
and then do an internal and external trim on space.
Don't forget to fix the end of the string: newstring[i] = '\0';
Ereplace is horrible for this: you have to string together something like 100 of them, it's slow, and the syntax is absolutely insane. It looks like
ereplace(ereplace(ereplace(ereplace .... which passes over the same string 100+ times, replacing one bad character per pass.
My solution is to use C or BASIC or whatever to do it all at once in one loop.
If you are 100% sure you only have 3 bad characters, and will only ever have those 3 in the future, Ereplace should work when chained like this.
But you said you only spot-checked and your source is unreliable, so I advise getting ALL the nonstandard characters out at once.
My lookup-table replace approach kills everything that is not a normal character, not just the 3 you have found so far.
It is simple code: you make a lookup table covering the entire byte range (0-255) and replace each character of the string with the lookup-table value, which is a space for all junk characters and unchanged for the usable values. Then you can clean up the extra spaces in DataStage or in the C, whichever you prefer.
I can give you the BASIC code instead if you prefer, which you can use directly in a BASIC Transformer instead of a regular Transformer. It's a little slower, but we're talking seconds per millions of records, hardly worth a mention in the grand scheme.
The problem is resolved.
I checked Ray's post (Line Breaks in the flat files) and found the clue. It worked.
Steps:
1) Created 3 stage variables:
svChar160 - Char(160,@TRUE)
svChar191 - Char(191,@TRUE)
svChar194 - Char(194,@TRUE)
2) Transformation rule in the Transformer:
Convert(svChar160:svChar191:svChar194,' ',TrimB(TrimF(NullToEmpty(In_ACACTIVITY.DESCRIPTION))))
Source data - Paratransit¿Web-Based¿Trip¿Boo
Target data - ParatransitWeb-BasedTripBoo
Thank you, Ray, for your valuable post.
Suja
Remove multiple ASCII (>256) characters from input data
Char and Convert do not support characters whose ASCII value is greater than 256. What should I do in this case?
Thanks
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Re: Remove multiple ASCII (>256) characters from input data
BSridhar wrote: Char and Convert do not support characters whose ASCII value is greater than 256. What to do in this case?
I'm not sure about DataStage, but 'most' software, if told a 'text' file is in a 1-byte code (ASCII or one-off encodings), will read Unicode letters as a group of multiple ASCII symbols.
For English and generally Western languages, most of the Unicode becomes some sort of space/blank symbol plus the ASCII letter, so "hello" looks like "h e l l o" and so on. However, any truly non-English or non-ASCII codes will make a mess, and any code that attempts to clean them up will damage the actual data.
So while it is 'sort of' safe to hack around on 'American English text' that happens to be stored in Unicode as if it were ASCII, you are playing with fire doing this. If you are using Unicode, use Unicode for everything. If you are using ASCII, you can promote to Unicode if needed or stay in ASCII. Avoid mixing them as much as possible. And there are other encodings from older systems that you sometimes have to use too, mostly 1-byte but NOT ASCII.
I'll do a little of the hacking, but not in DataStage -- I have some code that can read dataset headers, which are apparently in Unicode for the internal textual parts...