Info Analyzer performance/sizing

This forum contains ProfileStage posts and now focuses on newer versions of InfoSphere Information Analyzer.

Moderators: chulett, rschirm

mee
Participant
Posts: 23
Joined: Sat Mar 20, 2004 12:22 am
Location: None

Info Analyzer performance/sizing

Post by mee »

We have some large files (a few GB each) that we receive from outside vendors, and one major problem is the quality of the data in those files. We also have a fixed time window in which the profiling must complete so that we can report issues back to the vendors. We are likely to do column profiling as well as primary key inference against these files. The columns are of type varchar(256). Is there any guidance on hardware and storage? I am looking for the approximate number of CPUs/cores, memory size and disk size needed to complete the job in approximately 2 hours.

Secondly, it's likely that file sizes will grow down the line (but the profiling functionality will remain the same). Is there any way I can maintain the same 2-hour window for column profiling and key inference by doing some data partitioning and parallel processing? If so, how would that be done?

Lastly, how do I perform "join" analysis between two files to determine the "join" key between them?

Thanks in advance.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Column analysis takes as long as it takes. It is affected by several variables, not least data types and variability in the data, so it is not possible to estimate in advance how long will be required. Agreeing to a fixed time window was probably a Bad Idea unless the window is generous. There are, therefore, no "guidelines" of the type that you seek.
"Join" analysis is called "cross table analysis" or "foreign key analysis" (depending on the version that you are using). Both table analysis and cross-table analysis depend upon results from column analysis, so it is important that the column analysis is as comprehensive as possible - but, again, there are some columns that may reasonably be excluded from analysis, such as columns called COMMENTS.
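
To illustrate the general idea behind finding a "join" key (this is not how Information Analyzer implements it, just a rough sketch), one can compare the distinct values of each column pair and keep the pairs whose values overlap heavily. The table layout (a dict of column name to list of values), the function names, the 0.9 threshold and the sample data are all assumptions made for this example.

```python
def overlap_ratio(left_values, right_values):
    """Fraction of distinct non-empty left values that also occur on the right."""
    left = {v for v in left_values if v not in (None, "")}
    right = {v for v in right_values if v not in (None, "")}
    if not left:
        return 0.0
    return len(left & right) / len(left)


def candidate_join_keys(left_table, right_table, threshold=0.9, exclude=("COMMENTS",)):
    """Return (left_col, right_col, ratio) triples whose overlap meets the threshold.

    Tables are dicts mapping column name -> list of values (hypothetical layout).
    Columns named in `exclude` are skipped, e.g. free-text comment columns.
    """
    results = []
    for lcol, lvals in left_table.items():
        if lcol in exclude:
            continue
        for rcol, rvals in right_table.items():
            if rcol in exclude:
                continue
            ratio = overlap_ratio(lvals, rvals)
            if ratio >= threshold:
                results.append((lcol, rcol, ratio))
    # Strongest candidates first
    return sorted(results, key=lambda r: r[2], reverse=True)


# Made-up sample data standing in for two vendor files
orders = {"CUST_ID": ["1", "2", "2", "3"], "COMMENTS": ["ok", "", "late", "ok"]}
customers = {"ID": ["1", "2", "3", "4"], "NAME": ["Ann", "Bob", "Cy", "Di"]}

print(candidate_join_keys(orders, customers))
```

Here every distinct CUST_ID value also appears in ID, so that pair is the only candidate reported. On multi-GB files the distinct-value sets would come from the column analysis results rather than being rebuilt in memory like this.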
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

I assume that, since profiling runs on the parallel architecture, the process will scale if you throw more CPUs and RAM at it, unless there is a step in the profiling that is single-threaded. It is impossible to size your system based on such sparse information, and varchar fields can be resource hungry. Do you already have an Information Analyzer installation running? Have you got hold of some sample/test files?
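
The caveat about a single-threaded step can be made concrete with Amdahl's law: if some fraction of the run is serial, adding cores only speeds up the rest. The sketch below is a generic illustration; the 10% serial fraction is an invented figure, not a measurement of Information Analyzer.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup over a single core when `serial_fraction` of the work cannot be parallelised."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Assumed 10% serial work, purely for illustration
for cores in (2, 4, 8, 16):
    print(cores, "cores ->", round(amdahl_speedup(0.1, cores), 2), "x speedup")
```

With that assumed 10% serial portion, 16 cores give roughly a 6.4x speedup, and no number of cores can exceed 10x.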

One way to size your machine and make sure you can execute the profiling is to run it on a test server first. If you are considering separate test/dev and prod environments, you could order your dev hardware first, configure it for 2 or 4 CPUs, run the profiling on that to see how long it takes, and then size your prod server accordingly.
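
As a rough sketch of turning such test runs into a prod estimate: if you time the same profiling job at two CPU counts, you can fit a simple model T(n) = serial + parallel / n and solve for the core count that meets the window. The model, the function names, the timings (5 hours on 2 CPUs, 3 hours on 4) and the assumption that run time grows linearly with data volume are all hypothetical; real runs may also be limited by memory or disk I/O, which this ignores.

```python
import math


def fit_runtime_model(n1, t1, n2, t2):
    """Fit T(n) = serial + parallel / n to two (cores, hours) measurements."""
    if n1 == n2:
        raise ValueError("need measurements at two different core counts")
    parallel = (t1 - t2) / (1.0 / n1 - 1.0 / n2)
    serial = t1 - parallel / n1
    return serial, parallel


def cores_for_target(serial, parallel, target_hours, data_scale=1.0):
    """Smallest core count meeting target_hours, or None if the serial part alone exceeds it.

    data_scale multiplies both parts, assuming run time grows linearly with data volume.
    """
    serial *= data_scale
    parallel *= data_scale
    if target_hours <= serial:
        return None
    return max(1, math.ceil(parallel / (target_hours - serial)))


# Hypothetical test-server timings: 5.0 hours on 2 CPUs, 3.0 hours on 4 CPUs
serial, parallel = fit_runtime_model(2, 5.0, 4, 3.0)
print("serial hours:", serial, "parallel core-hours:", parallel)
print("cores for a 2-hour window:", cores_for_target(serial, parallel, 2.0))
print("same window, files twice as large:", cores_for_target(serial, parallel, 2.0, data_scale=2.0))
```

In this made-up case the fit gives 1 serial hour plus 8 core-hours of parallel work, so 8 cores meet the 2-hour window today. Once the files double in size, the serial part alone takes 2 hours and no core count can meet the window.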