How does IA determine the data type of a value

This forum contains ProfileStage posts and now focuses at newer versions Infosphere Information Analyzer.

Moderators: chulett, rschirm

Post Reply
xicheng_my_love
Participant
Posts: 23
Joined: Tue Jun 19, 2007 11:51 pm

How does IA determine the data type of a value

Post by xicheng_my_love »

When running Column Analysis, the report will show data value ,length, data type and other attributes of a column. Now I need to generate the exact same report using DB procedure.
1-1TM991 data type: STRING
+091619220895 data type: INT64
32743625244 data type: INT64
What is the complete logic behind this?

Thanks
a newbie to an expert
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The complete logic is documented in the Information Analyzer User Guide.

For example, your two numbers are Int64 because they contain only numeric digits and a sign, and are too large to be Int32.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

It just tries to downcast to the smallest data type it successfully fits into, starting from the smallest and going up until nothing else works and it can only ever be a string.

Why would you want to recreate this, if you already have the CA results?
xicheng_my_love
Participant
Posts: 23
Joined: Tue Jun 19, 2007 11:51 pm

Post by xicheng_my_love »

I can deduce most of the rules from lots of Column Analysis we have done.

By coding a DB procedure to generate the Column Analysis and by coding a Java program to write the Column Analysis result from DB tables to Excel, I can have my Column Analysis results in Excel by one-click.
stuartjvnorton wrote:It just tries to downcast to the smallest data type it successfully fits into, starting from the smallest and going up until nothing else works and it can only ever be a string.

Why would you want to recreate this, if you already have the CA results?
a newbie to an expert
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

OK, now let's see you profile compliance with data rules "with one click in Excel".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Ray has a good point.

You can do a lot of stuff (especially relating to rules) through the HTTP API and IAAdmin command line, but unfortunately, comprehensive CA results are not included (trivial value/frequency stuff only, I'm afraid).

You could get it out of the reports, but another (and in my mind, the biggest) shortcoming of the API is the inability to run a report automatically through the API. Having this would allow you to write DS jobs that ran all of your reports and then picked through the results.
That would solve all of our problems in one hit (an make half the current API redundant).
xicheng_my_love
Participant
Posts: 23
Joined: Tue Jun 19, 2007 11:51 pm

Post by xicheng_my_love »

We write a DB procedure to profile data and write results into a table, then a Java program will write the data in the table into Excel.

For other profiling information,like data value, frequency count, percent, it is easy to produce in DB. For format and data type, there are two many varieties. We can just try to conclude the rules from IA column analysis reports we have produced.

This is the code for getting format and data type:

CREATE FUNCTION "F_GET_TYPE_FORMAT" (
"P_GET_WHAT" VARCHAR(10),
"P_INSTR" VARCHAR(100) )
RETURNS VARCHAR(100)

BEGIN ATOMIC

DECLARE v_format VARCHAR(100) DEFAULT '' ;
DECLARE v_type VARCHAR(10) DEFAULT '' ;
DECLARE v_pos SMALLINT ;
DECLARE v_cur_char CHAR(1) ;
DECLARE v_len SMALLINT ;

IF p_instr is null
THEN
SET v_format='NA';
SET v_type='NA';
ELSEIF p_instr=''
THEN
SET v_format='NA';
SET v_type='STRING';
ELSE

SET v_len = LENGTH(TRIM(p_instr));
SET v_pos = 1;
WHILE v_pos <= v_len DO
SET v_cur_char = SUBSTR(p_instr,v_pos,1);
IF v_cur_char IN ('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z')
THEN SET v_format=v_format||'A';
ELSEIF v_cur_char IN ('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z')
THEN SET v_format=v_format||'a';
ELSEIF v_cur_char IN ('0','1','2','3','4','5','6','7','8','9')
THEN SET v_format=v_format||'9';
ELSE
SET v_format=v_format||v_cur_char;
END IF;
SET v_pos = v_pos + 1;
END WHILE;

SET v_type='INT';

SET v_len = LENGTH(v_format);
SET v_pos = 1;
WHIL:
WHILE v_pos <= v_len DO
SET v_cur_char = SUBSTR(v_format,v_pos,1);
IF v_cur_char <> '9'
THEN
SET v_type='STRING';
LEAVE WHIL;
END IF;
SET v_pos = v_pos + 1;
END WHILE;


IF v_type='INT' and BIGINT(p_instr)<128
THEN SET v_type='INT8';
ELSEIF v_type='INT' and BIGINT(p_instr)<32768
THEN SET v_type='INT16';
ELSEIF v_type='INT' and BIGINT(p_instr)<2147483648
THEN SET v_type='INT32';
ELSEIF v_type='INT' and BIGINT(p_instr)>2147483647
THEN SET v_type='INT64';
END IF;

END IF;

IF P_GET_WHAT='T'
THEN RETURN v_type;
ELSE
RETURN v_format;
END IF;

END;

ray.wurlod wrote:OK, now let's see you profile compliance with data rules "with one click in Excel". ...
a newbie to an expert
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

It's not the point Ray is making though.
Yes, you can do the basic CA stuff yourself, but you won't get most of the useful stuff that IA gets you, like the data classes, or rules or taking default values or invalid patterns into consideration.
Post Reply