UTF-8 to UTF-16 conversion in DataStage

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.


dscatetl
Premium Member
Posts: 2
Joined: Fri Dec 19, 2014 10:07 am

UTF-8 to UTF-16 conversion in DataStage

Post by dscatetl »

Hi, I have searched DSXchange for an answer but couldn't find one that worked. I did find a thread with the same issue, but the suggested resolution, setting the job-level NLS map to the destination format, did not work for me. So I am posting here.

The source is UTF-8 from Salesforce, and the destination Sequential File stage is intended to be encoded as UTF-16. Our data in Salesforce, in our database, and in our DataStage project is all UTF-8. So my goal is to convert the data to UTF-16 in the output file we send to the vendor.

UStringToString and StringToUstring are not working for the Unicode-to-Unicode conversion. I have since tried the UNIX command iconv, which works, but I would like to learn a DataStage-only solution, if one exists.
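
For reference, here is a minimal sketch of what that iconv step does, written in Python (the file names are just placeholders):

    # Equivalent of: iconv -f UTF-8 -t UTF-16 in.csv > out.csv
    # Decode the UTF-8 bytes once, then re-encode the text as UTF-16.
    with open("in.csv", "r", encoding="utf-8") as src:
        text = src.read()
    with open("out.csv", "w", encoding="utf-16") as dst:
        dst.write(text)  # Python's "utf-16" codec writes a BOM in native byte order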

So just to be clear, the flow is: Salesforce.com (UTF-8) -> Xfm (StringToUstring to UTF-16) -> Sequential File (UTF-16). This produces garbage data. If I change everything back to UTF-8, the data comes through fine, so I assume the DataStage conversion functions are not working. I know they work for non-Unicode to Unicode, because I have seen that work in the past, but Unicode to Unicode is not working properly. Any ideas?

Also, I should mention that the field data types were set to NVarChar, and I also tried VarChar with Extended = Unicode.

Thanks, Matt
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City

Post by JRodriguez »

Hi Matt,
Internally, DataStage parallel jobs use UTF-16 encoding by default, regardless of the encoding of sources and targets. This means that the data flowing from one stage to the next is already encoded in UTF-16, so there is no need to use functions in a Transformer stage to convert from one encoding to another; DataStage NLS support takes care of all of that. For this to work, character fields must be defined as NVarChar/NChar or set to Extended = Unicode, and, very important, the columns must be long enough to hold the new multibyte values.
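
As a rough analogy only (a Python sketch, not DataStage internals): decode once at the source, keep abstract characters in the middle, encode once at the target. An extra conversion in the middle is exactly what corrupts the data:

    # Decode once at the edge, encode once at the edge.
    source_bytes = "Café".encode("utf-8")      # bytes arriving from the source
    internal = source_bytes.decode("utf-8")    # source NLS map: bytes -> characters
    target_bytes = internal.encode("utf-16")   # target NLS map: characters -> bytes

    # A second "conversion" in the middle mangles the result:
    mangled = internal.encode("utf-16").decode("utf-8", errors="replace")
    print(mangled)  # mojibake -- the "garbage" described above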

DataStage is not able to automatically identify the NLS of your sources; you need to define the NLS of your source correctly. This is a tricky piece, and you need the provider of the data to state the encoding, otherwise you will be guessing. You also need to define the NLS of your target properly, and sometimes that is tricky too, especially for UTF-16, because of all its flavors. Ask what the expected UTF-16 is: UTF-16LE or UTF-16BE? Do you need a BOM marker? Basically, you need to find out the endianness (byte order) expected by the consumer of your result file.
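
To see the difference between the flavors at the byte level, a quick Python illustration:

    # The same two characters in each UTF-16 flavor:
    s = "Hi"
    print(s.encode("utf-16-le").hex())  # 48006900      little-endian, no BOM
    print(s.encode("utf-16-be").hex())  # 00480069      big-endian, no BOM
    print(s.encode("utf-16").hex())     # fffe48006900  BOM first, then native order
                                        # (shown here on a little-endian machine)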


For your task, I would do the following:

- Define the Salesforce stage NLS map as UTF-8
- Define all character fields/columns as NChar/NVarChar, or set them to Extended = Unicode
- Define column lengths long enough to hold the multibyte values, or leave the length empty
- Map source columns directly to target columns, with no conversion functions in your Transformer
- Define the NLS of the target correctly; UTF-16 comes in a lot of flavors
- Finally, test your result file with a good editor; viewing the results in DataStage will look like garbage if it is not configured properly. MS Excel is a good choice, as it will detect the encoding for you automatically (a byte-level check is also sketched below)
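
If you want a byte-level check rather than trusting an editor, a small Python sketch like this shows whether a BOM is present and which byte order the file uses (the file name is a placeholder):

    # Peek at the first bytes of the result file to identify the UTF-16 flavor.
    with open("vendor_output.txt", "rb") as f:
        head = f.read(2)
    if head == b"\xff\xfe":
        print("UTF-16LE with BOM")
    elif head == b"\xfe\xff":
        print("UTF-16BE with BOM")
    else:
        print("No BOM; first bytes:", head.hex())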

This worked like a charm for an Anti-Money Laundering process that I helped a customer implement several years ago, which takes in records in every possible human language.

Try it and let us know how it goes...
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses