Profile Stage performance

This forum contains ProfileStage posts and now focuses at newer versions Infosphere Information Analyzer.

Moderators: chulett, rschirm

Post Reply
mee
Participant
Posts: 23
Joined: Sat Mar 20, 2004 12:22 am
Location: None

Profile Stage performance

Post by mee »

We have about 100GB of data that we want to profile. For business reason, we can't use the sampling, and must profile the entire data set.

The profiling seem to be hanging (or not able to complete). The data is ASCII data. Machine is a 4 CPU.

I have two questions.

a) Has anyone else seen this behaviour ?
b) What is the largest amount of data volume that anyone has profiled in 2 to 3 hours ?

Thanks for any help here.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

At my current site a few runs were performed that ran more than 2-3 hours, and were definitely nowhere near 100GB. Of course, elapsed time is as much as anything a function of how complex the analysis you're performing is. More complex analyses take longer; this is a simple fact of life.
It being Sunday as I write I don't have access to the actual figures. However, I think you're being over-optimistic expecting to process that volume of data in so short an interval.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
mee
Participant
Posts: 23
Joined: Sat Mar 20, 2004 12:22 am
Location: None

Post by mee »

Ray, Thanks for the reply.

We are doing some basic column profiling and inter table key relationship. How much time should I allow for that volume of data ?

How much data did you profile in 2-3 hours and what sort of analysis? This would help me compare also.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Here's a reply from the guy who actually ran the Profile Stage analyses at my current site. Hope it's useful.

I'm not sure how useful this information is going to be because, as you pointed out, there are many factors that affect the processing time. Probably the most relevant example that I can recount is the table that contained 54 million rows (one of the claim history tables if memory serves me correctly). This table took 14 hours to perform the column analysis within ProfileStage. One of the factors that affects performance is the number of columns that the table contains as well as the number of rows that the table contains. You can generally get an idea of how long the column analysis will take by setting it off and checking the log to see how long it takes for ProfileStage to process each 10,000 rows. The time gap between each of the 10,000 rows can then be used, in conjunction with the total number of rows in the table, to calculate a rough estimate of how long it will take to perform the column analysis for all of the rows in that table. It won't be exact, but it would give you an idea. Once you've got this estimate, you can kill the ProfileStage job and then plan to schedule and run it "properly".

The table in question had only 28 columns. I'm afraid I don't have any size figures available. ProfileStage was running on a 4 CPU AIX machine, but other DataStage processes, also processing similar volumes, were also running during the same overnight period, some moving data from a remote Informix database to a local Oracle database.

So it really is "how long is a piece of string".
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
metabill
Premium Member
Premium Member
Posts: 26
Joined: Wed Aug 27, 2003 9:55 am

Post by metabill »

Other factors can also affect throughput for other phases of analysis. For instance, since table analysis looks at functional dependencies among the columns within a given table, the number of columns and the degree to which the column values imply dependencies will affect how many dependencies are found, and hence the elapsed time required.
namisbis
Participant
Posts: 4
Joined: Wed Feb 16, 2005 8:45 am

one trick is to use 4 cpu work for u

Post by namisbis »

i noticed that u got 4 cpu server.
One trick probably can help to fasten the column analysis:
f.inst: u hv 4 tables to analyze, u can create 4 jobs,each contains only one table on the analysis-server, profilestage wl use the 4 cpu do 4 analysis simulatanously, which wl much shrtern the time.

hope this help u.

franky
Post Reply