Transferring Language-Sensitive Data from SQL Server to Oracle

Post questions here related to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Transferring Language-Sensitive Data from SQL Server to Oracle

Post by loveojha2 »

Hi All,

Good to see this forum active and happening after a long, long time.

Here is the description of what I am trying to accomplish.

I am transferring data stored in SQL Server nvarchar columns (language-sensitive: Japanese and English, plus possibly some other languages) to nvarchar columns in an Oracle database with the UTF8 character set enabled.

The SQL Server database is on Latin1-General (from sp_helpsort).
The Oracle database is on UTF8 (NLS_NCHAR_CHARACTERSET is UTF8 and NLS_LANGUAGE is American).
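
For reference, a minimal sketch of how the SQL Server setting above can be read (sp_helpsort is a documented system procedure):

Code:

-- Reports the server-wide default sort order and character set;
-- this is where the Latin1-General description above comes from.
EXEC sp_helpsort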

I am using DRS stage for the source as well as target connections.

The NLS map specified in the source DRS stage (SQL Server) is set to the default (which is MS1252).
The NLS map specified in the target DRS stage (Oracle) is set to UTF8.

The problem is that the English data is transferring correctly, but the non-English data is appearing as upside-down question marks (¿).

Hope this describes the problem correctly.

Thanks in advance.
Success consists of getting up just one more time than you fall.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Why would you even expect "non-English data" (more technically, I guess, characters outside the specified map set) to be handled properly by a particular character map? (Almost?) all maps incorporate the "English" characters, so choose the one that accurately represents how those other characters are natively encoded.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Post by loveojha2 »

Hey Ray,

Thanks for the quick answer. Although I could not read your post completely, I got the point that the NLS setting for SQL Server (the character map) is inappropriate. But can you please let me know what character map I should choose for the SQL Server character set Latin1_General? I don't see such a character set in the list.

Thanks again.
Success consists of getting up just one more time than you fall.
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

I work with SQL Server 2000, so I can only give input from that vantage point. I am curious as to how the data can be stored under a Latin1_General collation and still hold Japanese characters. Though Latin1_General can handle Unicode data, I am not sure it can represent Japanese characters correctly. I suspect that the default collation for your SQL Server instance is Latin1_General, but could you check to make sure the specific columns where your data resides are set to the default, please?

You can check this by having Enterprise Manager (or another available tool) generate the SQL script for the table. Examine that, and if there are collations other than the instance default, they will be evident on the field(s) you are concerned with.
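
Alternatively, a quick metadata query will show the same thing. A minimal sketch, assuming a placeholder table name (INFORMATION_SCHEMA.COLUMNS is standard in SQL Server 2000):

Code:

-- Lists each column's collation; rows showing something other than
-- the instance default are the ones to look at.
SELECT COLUMN_NAME, DATA_TYPE, COLLATION_NAME
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'your_table'   -- placeholder: your actual table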

If you are dealing with Japanese characters mixed with English, then I would expect that your fields should actually be one of the following:
1. Japanese_BIN, sort order 192
2. Japanese_CI_AS, sort order 193
3. Japanese_CS_AS, sort order 200

Otherwise, I don't see how you could be under the configuration you described. Once again, I am coming at this from the viewpoint of SQL Server 2000. If you are on SQL Server 2005, then what I have stated here may be moot (I have no experience with that version yet).
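
For illustration only, declaring a column with one of those collations would look something like this (a sketch; the table and column names are made up):

Code:

-- Hypothetical table whose nvarchar column explicitly uses a
-- Japanese collation for comparisons and sorting.
CREATE TABLE jp_test
(
    data1 nvarchar(2000) COLLATE Japanese_CI_AS
)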

I hope that helps in some way... :?
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

Your problem stems from the fact that you declare LATIN-1 in your source but have characters that are not in that character set. You will need to discover what mapping is used in the source before you can do anything at all.
loveojha2
Participant
Posts: 362
Joined: Thu May 26, 2005 12:59 am

Post by loveojha2 »

Thanks jdmiceli and ArndW for your replies and for helping me out.

Here is what I have done.

create table langdata (data1 nvarchar(2000))

sp_help langdata gives:

Code:

Column_name  Type      Computed  Length  Prec  Scale  Nullable  TrimTrailingBlanks  FixedLenNullInSource  Collation
data1        nvarchar  no        4000                 yes       (n/a)               (n/a)                 SQL_Latin1_General_CP1_CI_AS
Then I did this:

Code:

insert into langdata values (N'この種類のドキュメントでは、添付の保存はサポートされていません')

insert into langdata values (N'loveojha')
And running select * from langdata gives:

Code:

この種類のドキュメントでは、添付の保存はサポートされていません
loveojha
Meaning it shows the data correctly.

The Japanese data was copied from a Notepad file.


Thanks
Success consists of getting up just one more time than you fall.
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

OK, your Japanese text says that saving attachments isn't supported in this type of document...

It works in your SQL Server because you told it that you have LATIN-1 data. When you inserted your cut-and-pasted data, it did no conversion, assuming that your screen text is also LATIN-1. The same happens when you read from the database: it does no conversion, and your display manager is smart enough to convert the text into Unicode.

I will assume that you are using UTF8 in Notepad, but that doesn't have to be the case. You could be using some Windows Shift-JIS character set.

Your data gets mangled when DataStage tries to convert what it thinks is one-byte-per-character LATIN-1 encoding into UTF8.
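
One way to check what is really stored is to look at the raw bytes. A diagnostic sketch against the langdata table from the earlier post:

Code:

-- Casting to varbinary exposes the bytes SQL Server actually stored,
-- so you can see whether the Japanese characters survived intact.
SELECT data1,
       CAST(data1 AS varbinary(60)) AS raw_bytes
FROM   langdata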

The best solution is to use the correct UTF8 encoding in your source database, for both the nvarchar columns and the display type.

The second-best is to "lie" and state UTF8 when reading from SQL Server.
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Yeah, what he said... :P
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"
ArndW
Participant
Posts: 16318
Joined: Tue Nov 16, 2004 9:08 am
Location: Germany

Post by ArndW »

jd - now I'm not sure if what I posted made sense...
jdmiceli
Premium Member
Posts: 309
Joined: Wed Feb 22, 2006 10:03 am
Location: Urbandale, IA

Post by jdmiceli »

Well, try this and see if it makes any sense.

Disclaimer: I am getting this information from documentation and NOT practical experience. Therefore, it should be validated (if it makes any sense at all in the first place! :roll: )

SQL Server 2000 takes the Windows locale settings and their associated collation settings into consideration when it is first installed. So, if the Windows locale settings do not have the proper character code pages in place at the time SQL Server is set up, then additional steps need to be taken when they are added later.

SQL Server also has its own internal collations that determine how it sorts character data stored in either single-byte or double-byte character sets. We know from loveojha2's last post that the default collation SQL Server is applying to nvarchar data (Unicode representation) is SQL_Latin1_General_CP1_CI_AS (the Windows equivalent would be Latin1_General_CI_AS). Break all that apart and this is what it means:
1. Latin1_General = code page 1252 (SQL Server-level code page?) --> Latin1 (ANSI), used by Western European languages
2. CP1 = Code Page 1 (got this from a post online, but I don't know if it refers to a default value for the Windows locale or what)
3. CI = case insensitive
4. AS = accent sensitive

It is interesting to note that this SQL collation also denotes kanatype insensitivity and width insensitivity for Unicode data, and SQL Server sort order ID 52 on code page 1252 for non-Unicode data.
You can make sure your SQL Server system supports things correctly by typing "select * from ::fn_helpcollations()" to get a list of the Windows collations that are covered by your install.
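
For example, to see just the Japanese collations your install supports, something like this should work (a sketch using the documented fn_helpcollations function):

Code:

-- Lists the available Japanese collations and their descriptions.
SELECT name, description
FROM   ::fn_helpcollations()
WHERE  name LIKE 'Japanese%'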

I think you then need to collect the same type of info from your target system (I don't work with Oracle, so I have no idea how). I would also think you need to do the same with the system running DataStage, to make sure that its Windows locale collation settings allow for the representation of the data your source is sending to it.
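
For the Oracle side, here is a minimal sketch using what I understand to be the standard nls_database_parameters dictionary view (unverified on my part, since I don't work with Oracle):

Code:

-- Run on the Oracle target: shows the database and national
-- character sets, the latter being what NVARCHAR2 columns use.
SELECT parameter, value
FROM   nls_database_parameters
WHERE  parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET')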

I don't know if any of this has made any sense at all, but I guess what I would try doing is to make sure that all steps treat the data as nvarchar, from DataStage through to Oracle. From your source job, view the data and make sure that the source data is represented correctly when looking at it from the RDBMS stage. Maybe create a job that extracts some of the data from the source and puts it into a hashed file, so you can see what DataStage is doing with the data. If that appears fine, then push the data from the hashed file to a test table with the appropriate structure on Oracle.

I'm just not proficient enough with Internationalization to offer anything else. I hope I didn't say anything too stupid :oops:

Toodles!
Bestest!

John Miceli
System Specialist, MCP, MCDBA
Berkley Technology Services


"Good Morning. This is God. I will be handling all your problems today. I will not need your help. So have a great day!"