
Parallel routine to count number of bytes

Posted: Thu Mar 09, 2017 8:18 am
by Alma1
Hi, I need to count the number of bytes in every field of a positional (fixed-width) TXT input file with no separators.
I need to use a parallel job, but the parallel Transformer has no function that counts bytes (the Len function counts characters).
I've written a parallel routine in C that, when tested with an external gcc compiler, correctly counts the number of bytes with the strlen() function.

#include <string.h>

/* Returns the number of bytes before the terminating NUL. */
int ParLenByte(char *s)
{
    return (int)strlen(s);
}

In DataStage I read the TXT file with a Sequential File stage (or Complex Flat File) using type CHAR to split the records into fields of known length.
Then I pass the records to a Transformer, where I call the parallel routine.
The result is the length of the CHAR type defined in DataStage, not the actual number of bytes.

If I read the entire record as VARCHAR (without splitting it into fields) it works, so I presume DataStage passes a truncated string to the parallel routine when I read it as CHAR type.

Any suggestions?

Posted: Thu Mar 09, 2017 10:19 am
by chulett
Welcome.

I assume you want the length of the data in the CHAR field, something automatically padded with (typically) spaces out to its full size... a.k.a. the nature of the beast. Meaning a CHAR(10) that looks like this:

"6CHARS    "
You want to know that the actual data length is 6 rather than 10, is that correct? So that you can do what next? I'm thinking knowing what comes next / what that knowledge would be used for can lead to a best practice solution which probably does not include the need for a custom "parallel routine to count bytes".

Posted: Thu Mar 09, 2017 1:42 pm
by ray.wurlod
Do you need to be able to work with multi-byte data?

In the BASIC Transformer stage you have access to three length functions.
LEN returns the number of characters
BYTELEN returns the number of bytes
DISPLEN returns the number of display positions (e.g. when using double-width or half-width characters)
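
For readers working in C rather than BASIC, the difference between a character count and a byte count can be sketched as below. This is only an illustration and assumes the data is UTF-8; the function names utf8_byte_len and utf8_char_len are made up for this example and are not DataStage functions.

```c
#include <stddef.h>
#include <string.h>

/* Bytes before the terminating NUL: exactly what strlen() reports. */
size_t utf8_byte_len(const char *s)
{
    return strlen(s);
}

/* Characters (code points) in a UTF-8 string: count every byte that
   is not a continuation byte. Continuation bytes look like 10xxxxxx. */
size_t utf8_char_len(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For a four-letter Arabic word encoded in UTF-8, where each letter takes two bytes, utf8_char_len gives 4 and utf8_byte_len gives 8.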

Posted: Thu Mar 09, 2017 3:17 pm
by UCDI
strlen counts whitespace; it counts every byte until the 0 terminator (end of the C string) is hit, and the wide version behaves similarly for wide characters.

CHAR in DataStage is padded with spaces so that it always consumes the max length.
VARCHAR in DataStage holds the data as it is, up to the max length (anything longer is truncated).

I would get the data as varchar, and I think a regular transformer can get the length of the string there? Is that possible with your design?
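
A small sketch of the padding point, using the CHAR(10) example from earlier in the thread: strlen includes the trailing pad spaces, and a trimmed length has to be computed separately. The helper name trimmed_byte_len is hypothetical, and the sketch assumes the pad character is an ordinary space.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes with trailing pad spaces ignored.
   strlen() alone would include them. */
size_t trimmed_byte_len(const char *s)
{
    size_t n = strlen(s);
    while (n > 0 && s[n - 1] == ' ')
        --n;
    return n;
}
```

With the padded value "6CHARS    " (six characters plus four spaces), strlen returns 10 and trimmed_byte_len returns 6.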

Posted: Thu Mar 09, 2017 3:54 pm
by chulett
Would still like to know the "why" of this.

Posted: Fri Mar 10, 2017 7:34 am
by boxtoby
As it's a Unix flat file, would the Unix command "wc" not suffice?

Posted: Fri Mar 10, 2017 11:06 am
by Alma1
chulett wrote: I assume you want the length of the data in the CHAR field, something automatically padded with (typically) spaces out to its full size... a.k.a. the nature of the beast. Meaning a CHAR(10) that looks like this:

"6CHARS    "
You want to know that the actual data length is 6 rather than 10, is that correct? So that you can do what next? I'm thinking knowing what comes next / what that knowledge would be used for can lead to a best practice solution which probably does not include the need for a custom "parallel routine to count bytes".
No, I want to count the spaces too, so I want 10 as the result.
ray.wurlod wrote:Do you need to be able to work with multi-byte data?

In the BASIC Transformer stage you have access to three length functions.
LEN returns the number of characters
BYTELEN returns the number of bytes
DISPLEN returns the number of display positions (e.g. when using double-width or half-width characters)
I don't want to use the BASIC Transformer; I want to use the parallel Transformer.
UCDI wrote:strlen counts whitespace, it counts everything until 0 (end of C string) is hit in ascii, and similar with the wide version for multi-byte.

char in datastage is padded with spaces to always consume the max length.
varchar in datastage is what the data is, up to the max length (truncates).

I would get the data as varchar, and I think a regular transformer can get the length of the string there? Is that possible with your design?
I tried reading the entire record as VARCHAR, but when I cut it up with a substring and apply the routine to the result, I get the same wrong result.
chulett wrote:Would still like to know the "why" of this.
The reason is that I read files from foreign banks, such as Arabic ones, so a character that appears to have a length of 1 char can take 2 bytes.

Posted: Fri Mar 10, 2017 12:05 pm
by chulett
Including your reason in your original post would have gone a long way towards shortening this conversation. Always best to lead with that rather than your perceived solution. IMHO.

Posted: Fri Mar 10, 2017 1:57 pm
by UCDI
If you want the raw number of bytes for Unicode or multi-byte chars, you are going to have to extract it in a way that lets you look at the bytes, maybe SQL extract it as hex, and then count those.

DataStage can handle multi-byte characters to get the data to you, but I don't know that a string length will give you what you want, because the length of five 2-byte letters is 5, not 10... you have to force it to be bytes and count those. And you can't just do it casually with C or the like, because a 2-byte char has stuff like 00 A3 or whatever, and that 00 converted to a byte looks like the end of string... in fact, are you sure that your strlen approach actually gives the correct answer...?
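
The 00 A3 concern can be made concrete with a short sketch. It depends on the encoding: in UTF-16 the pound sign U+00A3 contains a zero byte, so strlen stops early, while UTF-8 never puts a zero byte inside a character, so strlen sees every byte. The thread does not say which encoding the routine receives, so both cases are shown; the array and function names are illustrative only.

```c
#include <stddef.h>
#include <string.h>

/* Pound sign U+00A3 followed by 'A', in two encodings.
   Each array also gets an implicit terminating NUL. */
static const char utf8_text[]    = "\xC2\xA3" "A";      /* 3 data bytes */
static const char utf16be_text[] = "\x00\xA3\x00\x41";  /* 4 data bytes */

/* What strlen() reports for each buffer. */
size_t utf8_strlen(void)    { return strlen(utf8_text); }
size_t utf16be_strlen(void) { return strlen(utf16be_text); }

/* The real payload size of the UTF-16 buffer, taken from the array
   size because strlen() cannot see past the first zero byte. */
size_t utf16be_payload(void) { return sizeof utf16be_text - 1; }
```

Here utf8_strlen returns 3, which is the true byte count, while utf16be_strlen returns 0 even though the buffer holds 4 bytes of data. So a strlen-based routine can be right for UTF-8 input and wrong for UTF-16 input.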

Posted: Mon Mar 13, 2017 5:42 am
by Alma1
Yes, I'm sure.
I've done a lot of tests with an external gcc compiler using Arabic and Cyrillic strings,
and the result is right.

Posted: Mon Mar 13, 2017 6:25 am
by JRodriguez
...While reading the file with the Sequential File stage or Complex Flat File stage, you would need to use the NChar data type or the Unicode extended attribute, and you would need to define the NLS map for the file to handle the multi-byte characters.

Posted: Mon Mar 13, 2017 9:51 am
by Alma1
How do I find the correct NLS map file type?