Column Analysis Sample - recommendations, best practices

This forum contains ProfileStage posts and now focuses on newer versions of InfoSphere Information Analyzer.

Moderators: chulett, rschirm

truenorth
Participant
Posts: 139
Joined: Mon Jan 18, 2010 4:59 pm
Location: San Antonio

Column Analysis Sample - recommendations, best practices

Post by truenorth »

We are experiencing performance issues because of column analysis jobs that run for hours, sometimes several at once.

We want to introduce CA sampling to our IA developers.

We already have an IBM "Guidelines for Sampling Large Files" PDF (sorry, I can't find a link for it) and a colleague's recommendations document from his past IA projects.

But I'm still looking for ideas from others. Can anyone recommend best practices for using sampling in CA?
Todd Ramirez
Sr Consultant, Data Quality
San Antonio TX
stuartjvnorton
Participant
Posts: 527
Joined: Thu Apr 19, 2007 1:25 am
Location: Melbourne

Post by stuartjvnorton »

Sampling can be a tricky one.

I have seen large tables profiled in the past with a pre-defined sample size, e.g. 1M records. You need to be careful here.
In one example, the review showed a particular field was not populated. That failed the sniff test, so I took a look. It turned out the table had 495M records and 100M of them were not populated; the 1M sample took the first 1M rows, which all fell inside that unpopulated stretch, and missed the actual data completely.

I'd be tempted to use "Every Nth". It might take a bit of finger-in-the-air estimation to get an N that is practical to run yet still gives as comprehensive a sample as you can afford. You should also sample the data over time, so you get a picture of how the table has evolved. A minimal sketch of the idea follows.
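As a rough illustration (my own sketch, not IA's implementation; the row count and the 5M sample budget are made-up numbers), every-Nth sampling looks like this:

Code: Select all


#include <iostream>
#include <string>

int main()
{
    // assumptions: total row count from a prior COUNT(*), and the largest
    // sample you can afford to push through column analysis
    const long long totalRows = 495000000;
    const long long sampleCap = 5000000;
    const long long N = (totalRows + sampleCap - 1) / sampleCap; // keep every Nth record

    std::string line;
    long long rowNum = 0;
    while (std::getline(std::cin, line))   // records piped in, one per line
    {
        // unlike a "first 1M rows" sample, this spreads the picks
        // evenly across the whole table
        if (rowNum % N == 0)
            std::cout << line << '\n';
        ++rowNum;
    }
    return 0;
}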

As a rough approach, I'd use record counts to group tables with the same N value together, then use the batchAnalysis (IIRC) IAAdmin option to batch-run each group of tables. You can throttle the number of tables that run at a time, and it only does the internal job setup for each table when the batching manager gets to it. A sketch of the grouping step is below.
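A minimal sketch of that grouping step (the table names, row counts, and the 5M budget are all invented; the real inputs would come from your own catalog counts):

Code: Select all


#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main()
{
    // assumed input: table name -> row count
    const std::vector<std::pair<std::string, long long>> tables = {
        {"CUSTOMER", 495000000}, {"ORDERS", 80000000}, {"REGION", 120000}
    };
    const long long sampleCap = 5000000;   // assumed per-table sample budget

    // bucket tables so every table in a group shares the same N
    std::map<long long, std::vector<std::string>> groups;
    for (const auto &t : tables)
    {
        long long N = (t.second + sampleCap - 1) / sampleCap;
        if (N < 1) N = 1;                  // small tables: analyze every row
        groups[N].push_back(t.first);
    }

    // each group would then become one batched run (e.g. via IAAdmin)
    for (const auto &g : groups)
    {
        std::printf("N=%lld:", g.first);
        for (const auto &name : g.second)
            std::printf(" %s", name.c_str());
        std::printf("\n");
    }
    return 0;
}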
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

We had this problem as well. While we half-solved it by putting the CA workload on its own server, I also ended up writing a C++ routine that did pretty much the same thing and could do the analysis as fast as the database could be read, piped through a QualityStage job that did the statistics on the patterns it found.

Granted, this code isn't useful for Unicode data. It would take a bit of extra work to allow that, but our data was not Unicode.

It looks something like this (you may need to customize the table for your data; e.g. I have 10 and 13 as unprintable, but you may want them as whitespace):

Code: Select all


/*
converts strings to pattern similar to quality stage/IA
B whitespace 
A char uc
a char lc
9 num
P punctuation
U unprintable
N null
*/


#include <cstring>   // for strlen and size_t

static const unsigned char pt[256] = {
'N','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','B','P','P','P','P','P','P','P','P',
'P','P','P','P','P','P','P','9','9','9','9','9','9','9','9','9','9','P','P','P',
'P','P','P','P','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','A',
'A','A','A','A','A','A','A','A','A','A','P','P','P','P','P','P','a','a','a','a',
'a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a',
'a','a','P','P','P','P','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U','U',
'U','U','U','U','U','U','U','U','U','U','U','U','U','U','U'
};


//this simple routine just replaces every character in a string with the character from the table above
//via a lookup, so 'X' becomes 'A', '7' becomes '9', etc.
//note: the caller owns the returned buffer and must delete[] it
char* pattern(char* OrigInputString)
{
    const size_t leng = strlen(OrigInputString);
    // +2 so the empty-string case has room for 'N' plus the terminator
    unsigned char* InputString = new unsigned char[leng + 2];
    if (leng == 0)
    {
        InputString[0] = 'N';   // empty string maps to the null pattern
        InputString[1] = 0;
        return (char*)InputString;
    }

    for (size_t i = 0; i < leng; i++)
    {
        // cast to unsigned char so bytes >= 0x80 index the table correctly
        InputString[i] = pt[(unsigned char)OrigInputString[i]];
    }
    InputString[leng] = 0;
    return (char*)InputString;
}
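A quick usage example (the input value is made up):

Code: Select all


char* p = pattern("TX-78201");
printf("%s\n", p);   // prints AAP99999
delete [] p;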
It's not much of an answer, but it's 'an' answer. With the high performance, we did the whole table for most of our data; no reason to sample/limit.