Chinese Char / UTF-8

Posted: Thu Oct 20, 2016 7:59 am
by dj
Hi,

I'm trying to read a file containing Chinese data. It fails when the columns are defined as CHAR, but it works with VARCHAR.

The existing layouts for the other regions use CHAR, and we are trying to minimize changes to the layout.

The file is a fixed-width file with Unicode Char columns:

FIRSTNAME:Char(30)-Unicode
MiddleName:Char(1)-Unicode
LastName:Char(30)-Unicode

Existing data:
COLNIE MPROLL
Chinese data:
李娜 MPROLL

When viewed in a hex editor, each Chinese character takes about 3 bytes.
So I tried FIRSTNAME as 6 bytes of data + 24 pad characters, but no luck.
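
The hex-editor observation can be reproduced with a minimal Python sketch. Python is used here only for illustration and is not part of DataStage; the variable names are made up for this example.

```python
# Illustration only: inspect the UTF-8 bytes of the sample name from the post.
name = "李娜"
encoded = name.encode("utf-8")

# Both characters have code points above U+07FF (and below U+10000),
# so UTF-8 encodes each of them in 3 bytes.
utf8_hex = encoded.hex(" ")
char_count = len(name)
byte_count = len(encoded)

print(utf8_hex)    # e6 9d 8e e5 a8 9c
print(char_count)  # 2
print(byte_count)  # 6
```

So the 6 bytes of data counted in the hex editor correspond to only 2 characters.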

External ustring too short. Imported only 0 external characters into a ustring of fixed length 1.
##W IIS-DSEE-TFIG-00201 09:53:53(001) <SQ,0> Field "MiddleName" has import error and no default value; data: <empty>, at offset: 787


Is only VarChar supported for multi-byte data?

Thanks in advance!

Posted: Fri Oct 21, 2016 8:48 am
by PaulVL
Any reason you are not using UTF-16?

Chinese is a double-byte character set.

Posted: Wed Oct 26, 2016 3:35 am
by dj
I'm looking at the various options for handling the Chinese/Thai data.

To get started, I selected UTF-8, as it handles multi-byte data.

1) Does UTF-8 (Unicode) still not handle double-byte data?
2) And does it always have to be variable length?

Thanks

Posted: Thu Oct 27, 2016 9:09 am
by ray.wurlod
UTF-8 is an encoding of the Unicode code points, and does handle multi-byte data (though using up to four bytes per character).

VarChar will give you fewer problems than Char, because the latter requires fields to be padded to length.
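
To illustrate the padding point, here is a minimal Python sketch of the arithmetic, assuming a FIRSTNAME field of width 30 encoded as UTF-8. The helpers `pad_to_chars` and `pad_to_bytes` are hypothetical names for this example, and this is not a description of how DataStage itself imports the field.

```python
def pad_to_chars(value: str, width: int) -> bytes:
    """Pad with spaces to `width` characters, then encode as UTF-8."""
    return value.ljust(width).encode("utf-8")


def pad_to_bytes(value: str, width: int) -> bytes:
    """Encode as UTF-8, then pad with spaces to `width` bytes."""
    encoded = value.encode("utf-8")
    if len(encoded) > width:
        raise ValueError("value does not fit in the field")
    return encoded + b" " * (width - len(encoded))


by_chars = pad_to_chars("李娜", 30)
by_bytes = pad_to_bytes("李娜", 30)

print(len(by_chars.decode("utf-8")), len(by_chars))  # 30 characters, 34 bytes
print(len(by_bytes.decode("utf-8")), len(by_bytes))  # 26 characters, 30 bytes

# For pure ASCII data the two layouts are identical, which is why the
# existing regions never hit the problem.
print(pad_to_chars("COLNIE", 30) == pad_to_bytes("COLNIE", 30))  # True
```

A reader that counts 30 characters will run 4 characters into the following fields when the record was padded to 30 bytes, so the padding unit has to match whatever the reader counts.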

Posted: Thu Oct 27, 2016 6:06 pm
by abyss
Not sure about Thai, but always use UTF-16 for Chinese, Japanese and Korean characters.

Posted: Mon Oct 31, 2016 8:21 am
by pjedson
Handling NLS data is a bit tricky.
Troubleshooting depends on the database involved and the OS.

If ODBC stages are used, check the following:
Check the IANAAppCodePage value in odbc.ini
Use wide character types wherever possible

Hope this helps.

Posted: Mon Nov 07, 2016 7:50 am
by dj
Thanks for your replies.

Are there any other differences between UTF-8 and UTF-16 apart from the space (bytes) used?

1) We were able to use UTF-8 for Thai and Chinese data, with sequential files as both input and output.

2) Is there a way to check byte usage for UTF-8 vs. UTF-16 within DataStage? Right now I don't have a temporary database to check it.

3) Is it possible to read UTF-16 mainframe input, process it in DataStage, and load it into MDM tables (UTF-8)?

Thanks.
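
On points 2 and 3, the byte counts can be compared without any database, and the conversion itself is a plain decode and re-encode. A minimal Python sketch follows, assuming the UTF-16 input is big-endian with no byte-order mark (an assumption; check the actual file). The helper names are hypothetical, and this runs outside DataStage.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (utf8_bytes, utf16_bytes) for text, with no byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


def utf16be_to_utf8(data: bytes) -> bytes:
    """Decode big-endian UTF-16 bytes and re-encode them as UTF-8."""
    return data.decode("utf-16-be").encode("utf-8")


for sample in ("MPROLL", "李娜", "สวัสดี"):
    print(sample, encoded_sizes(sample))
# MPROLL (6, 12)
# 李娜 (6, 4)
# สวัสดี (18, 12)

# Round trip: the same text stored as UTF-16 converts to its UTF-8 form.
converted = utf16be_to_utf8("李娜".encode("utf-16-be"))
print(converted == "李娜".encode("utf-8"))  # True
```

For these samples, Latin text is smaller in UTF-8, while the Chinese and Thai text is smaller in UTF-16, so which encoding uses less space depends on the mix of data.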