Parallel routine to count number of bytes

Alma1
Participant
Posts: 5
Joined: Thu Mar 09, 2017 5:27 am

Post by Alma1 »

Hi, I need to count the number of bytes in every input field of a positional (fixed-width) TXT file with no separators.
I need to use a parallel job, but parallel jobs have no function that counts bytes (the Len function counts characters).
I've written a parallel routine in C that, when tested with an external gcc compiler, correctly counts the number of bytes using the strlen() function.

#include <string.h>

/* Returns the number of bytes before the first NUL terminator. */
int ParLenByte(char *s)
{
    return (int)strlen(s);
}

In DataStage I read the TXT file with a Sequential File stage (or Complex Flat File stage), using type CHAR to split the records into fields of known length.
Then I pass the records to a Transformer, where I call the parallel routine.
The result is the length of the CHAR type defined in DataStage, not the real number of bytes.

If I read the entire record as VARCHAR (without splitting it into fields) it works, so I presume DataStage passes a truncated string to the parallel routine when I read it as type CHAR.

Any suggestion?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Welcome.

I assume you want the length of the data in the CHAR field, something automatically padded with (typically) spaces out to its full size... a.k.a. the nature of the beast. Meaning a CHAR(10) that looks like this:

Code: Select all

"6CHARS    "
You want to know that the actual data length is 6 rather than 10, is that correct? So that you can do what next? I'm thinking that knowing what comes next, and what that knowledge would be used for, can lead to a best-practice solution which probably does not include the need for a custom "parallel routine to count bytes".
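To make the 6-versus-10 distinction concrete, here is a minimal C sketch. It assumes the field arrives as a NUL-terminated string padded with ASCII spaces; `trimmed_len` is a hypothetical helper name, not a DataStage function, and how DataStage actually hands a CHAR column to a parallel routine is not confirmed here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of a NUL-terminated field,
   ignoring trailing space padding (0x20). */
size_t trimmed_len(const char *s)
{
    size_t n = strlen(s);              /* bytes up to the terminator, padding included */
    while (n > 0 && s[n - 1] == ' ')
        n--;                           /* step back over trailing spaces */
    return n;
}
```

For the padded value "6CHARS    ", strlen() gives 10 and trimmed_len() gives 6, so which number is "right" depends on whether the padding is meant to be counted.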
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Do you need to be able to work with multi-byte data?

In the BASIC Transformer stage you have access to three length functions.
LEN returns the number of characters
BYTELEN returns the number of bytes
DISPLEN returns the number of display positions (e.g. when using double-width or half-width characters)
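The difference between a character count and a byte count can be sketched in plain C. This assumes the data is UTF-8 and NUL-terminated; `utf8_char_count` is a hypothetical helper that only illustrates the idea, it is not how LEN or BYTELEN are implemented, and display width (DISPLEN) is not covered.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a NUL-terminated UTF-8 string
   by skipping continuation bytes (bit pattern 10xxxxxx). */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;                   /* lead byte or single-byte character */
    }
    return count;
}
```

For three Arabic letters encoded in UTF-8 as the bytes D8 A8 D9 86 D9 83, strlen() reports 6 bytes while utf8_char_count() reports 3 characters.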
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

strlen counts whitespace; it counts everything until it hits 0 (the end of a C string) in ASCII, and the wide version behaves similarly for multi-byte data.

CHAR in DataStage is padded with spaces so that it always consumes the maximum length.
VARCHAR in DataStage holds the data as it is, up to the maximum length (longer values are truncated).

I would read the data as VARCHAR, and I think a regular Transformer can get the length of the string there. Is that possible with your design?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Would still like to know the "why" of this.
boxtoby
Premium Member
Posts: 138
Joined: Mon Mar 13, 2006 5:11 pm
Location: UK

Post by boxtoby »

As it's a Unix flat file, would the Unix command "wc" not suffice?
Bob Oxtoby
Alma1
Participant
Posts: 5
Joined: Thu Mar 09, 2017 5:27 am

Post by Alma1 »

chulett wrote: I assume you want the length of the data in the CHAR field, something automatically padded with (typically) spaces out to its full size... a.k.a. the nature of the beast. Meaning a CHAR(10) that looks like this:

Code: Select all

"6CHARS    "
You want to know that the actual data length is 6 rather than 10, is that correct? So that you can do what next? I'm thinking that knowing what comes next, and what that knowledge would be used for, can lead to a best-practice solution which probably does not include the need for a custom "parallel routine to count bytes".

No, I want to count the spaces as well, so I want 10 as the result.

ray.wurlod wrote: Do you need to be able to work with multi-byte data?

In the BASIC Transformer stage you have access to three length functions.
LEN returns the number of characters
BYTELEN returns the number of bytes
DISPLEN returns the number of display positions (e.g. when using double-width or half-width characters)

I don't want to use the BASIC Transformer, only the parallel Transformer.

UCDI wrote: strlen counts whitespace; it counts everything until it hits 0 (the end of a C string) in ASCII, and the wide version behaves similarly for multi-byte data.

CHAR in DataStage is padded with spaces so that it always consumes the maximum length.
VARCHAR in DataStage holds the data as it is, up to the maximum length (longer values are truncated).

I would read the data as VARCHAR, and I think a regular Transformer can get the length of the string there. Is that possible with your design?

I tried reading the entire record as VARCHAR, but when I cut it up with a substring function and apply the routine to the pieces, I get the same wrong result.

chulett wrote: Would still like to know the "why" of this.

The reason is that I read files from foreign banks, for example Arabic ones, so a character that appears to have a length of 1 can take 2 bytes.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Including your reason in your original post would have gone a long way towards shortening this conversation. Always best to lead with that rather than your perceived solution. IMHO.
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

If you want the raw number of bytes for Unicode or multi-byte characters, you are going to have to extract the data in a way that lets you look at bytes, maybe extract it with SQL as hex, and then count those.

DataStage can handle multi-byte characters to get the data to you, but I don't know that a string length will give you what you want, because the length of five 2-byte letters is 5, not 10... you have to force it to be bytes and count those. And you can't just do it casually with C or the like, because a 2-byte char has stuff like 00 A3 or whatever, and that 00 converted to a byte looks like the end of the string... in fact, are you sure that your strlen approach actually gives the correct answer?
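Whether the embedded 00 byte is a problem depends on the encoding, which the thread does not state. As a rough sketch, the block below encodes U+00A3 (the "00 A3" case) in UTF-8 and in both UTF-16 byte orders. The array names and `utf16_byte_len` are hypothetical, and the helper assumes UTF-16 data that ends with a two-byte 00 00 terminator.

```c
#include <stddef.h>
#include <string.h>

/* U+00A3 in three encodings, each followed by its terminator. */
const char POUND_UTF8[]    = { '\xC2', '\xA3', '\0' };
const char POUND_UTF16BE[] = { '\x00', '\xA3', '\0', '\0' };
const char POUND_UTF16LE[] = { '\xA3', '\x00', '\0', '\0' };

/* Hypothetical helper: byte length of UTF-16 data that ends with a
   two-byte 00 00 terminator aligned on a code-unit boundary. */
size_t utf16_byte_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' || s[n + 1] != '\0')
        n += 2;                        /* advance one 16-bit code unit */
    return n;
}
```

With these buffers strlen() returns 2 for the UTF-8 form, 0 for UTF-16 big-endian, and 1 for UTF-16 little-endian, while utf16_byte_len() returns 2 for both UTF-16 forms. UTF-8 never produces a zero byte for a non-NUL character, which would be consistent with strlen() tests passing on UTF-8 input.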
Alma1
Participant
Posts: 5
Joined: Thu Mar 09, 2017 5:27 am

Post by Alma1 »

Yes, I'm sure.
I've done a lot of tests with an external gcc compiler using Arabic and Cyrillic strings,
and the result is right.
JRodriguez
Premium Member
Posts: 425
Joined: Sat Nov 19, 2005 9:26 am
Location: New York City

Post by JRodriguez »

...While reading the file with the Sequential File stage or Complex Flat File stage, you would need to use the NChar data type or the Unicode extended attribute, and you would need to define the NLS for the file to handle the multi-byte characters.
Julio Rodriguez
ETL Developer by choice

"Sure we have lots of reasons for being rude - But no excuses
Alma1
Participant
Posts: 5
Joined: Thu Mar 09, 2017 5:27 am

Post by Alma1 »

How do I find the correct NLS map file type?