Handling non English characters while using substring

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

UTF8 and bounded varchars come with the sort of baggage that you're seeing. That's due to UTF8 having a variable number of bytes per character. You should switch to nvarchar (or the equivalent of adding Unicode extended property to your varchar).

Also if you look closely at the output that you think is fine, you will see you only have 29 characters. With nvarchar you would have 30 "complete" characters.

Mike
lindatrgeo
Participant
Posts: 3
Joined: Mon Jun 15, 2009 5:01 am

Post by lindatrgeo »

Thanks Mike for the response.

In fact, we were using NVARCHAR initially. In that case, DS reads the special character as single byte. When applying SUBSTR [1,30], it extracted 31 bytes of data. But the target Oracle table has size restriction of 30 bytes. So the record was getting rejected by DB for exceeding the length.

While using VARCHAR, the byte size goes in sync with DB.
Mike
Premium Member
Premium Member
Posts: 1021
Joined: Sun Mar 03, 2002 6:01 pm
Location: Tampa, FL

Post by Mike »

As you see, bytes and characters have a variable ratio when you are dealing with UTF8. Characters will range in size from 1 to 4 bytes. Your 30 byte column in Oracle could potentially hold as few as 7 and as many as 30 complete characters. If you don't stick to character semantics, you run the risk of truncating a string in the middle of a multi-byte character.
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I've never understood why so many people who deal with multi-byte characters stick to BYTE semantics in Oracle. Why not alter that column in the table to be VARCHAR2(30 CHAR) instead?
-craig

"You can never have too many knives" -- Logan Nine Fingers
lindatrgeo
Participant
Posts: 3
Joined: Mon Jun 15, 2009 5:01 am

Post by lindatrgeo »

Thanks Mike and Craig !

@Mike, That's exactly what happened in the scenario I explained. The string got truncated in the middle of multi byte character. As a workaround, I added a check to see if the 30th character is alphanumeric. If not I extract only the 29 characters to avoid and loading issue.

@Craig, My target system is a legacy one, and they don't allow any change in their DB :(
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Just for completeness, the substring functions work with characters, not with bytes.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Post Reply