Comparison between Database and dataset

shiva459 · Post by **shiva459** » Thu Feb 07, 2008 9:05 pm

Hi All,

I am working in a project where is we use Datastage 7.5.2 and db2 as database.I have a senario where in I have to do a lookup on the code tables many times.
I want to know is it better to use database everytime or load the table in a dataset and then do a look up on that dataset.
If anyone has experience on tihs please let me know.

Thanks in advance

ray.wurlod · Post by **ray.wurlod** » Thu Feb 07, 2008 11:51 pm

In general, unless you explicitly specify "sparse" lookup, you will be performing your lookup against a Data Set - a virtual Data Set. There's one associated with every link.

"Sparse" lookups - directly to DB2 or Oracle tables - will be substantially slower than to a virtual Data Set (which is in local memory). However, there is an implied (and enforced) requirement that the reference Data Set must be able to be loaded into memory.

Therefore, with a very large reference table on which you can not be selective, look for alternatives - perhaps a join in the database, perhaps a Join stage in DataStage, perhaps a "sparse" lookup.

shiva459 · Post by **shiva459** » Fri Feb 08, 2008 1:05 am

Ray Thanks for your reply.The lookup that I use is static. and there will be max 500 records in it.But I use this many times.So in this case also 'Sparse' lookup is better than dataset right?

ray.wurlod wrote:In general, unless you explicitly specify "sparse" lookup, you will be performing your lookup against a Data Set - a virtual Data Set. There's one associated with every link.

"Sparse" lookups - directly to DB2 or Oracle tables - will be substantially slower than to a virtual Data Set (which is in local memory). However, there is an implied (and enforced) requirement that the reference Data Set must be able to be loaded into memory.

Therefore, with a very large reference table on which you can not be selective, look for alternatives - perhaps a join in the database, perhaps a Join stage in DataStage, perhaps a "sparse" lookup.

ArndW · Post by **ArndW** » Fri Feb 08, 2008 2:08 am

If your lookup contains only 500 records then you are much better off using a normal lookup instead of a static one.

shiva459 · Post by **shiva459** » Fri Feb 08, 2008 2:40 am

So I dont have to load the data in dataset and then do a lookup right?

ArndW · Post by **ArndW** » Fri Feb 08, 2008 3:09 am

No - as Ray already stated, if you use a database for a lookup the engine will load that result into a virtual dataset and do the reference from that.

ray.wurlod · Post by **ray.wurlod** » Fri Feb 08, 2008 3:18 am

Take a look at the job score. Your Lookup stage generates a "composite operator" (two operators called LUT_CreateOp and LUT_ProcessOp). The first of these loads data from your table into a virtual Data Set and builds an index on it. The second actually performs the lookups, via the index, into the virtual Data Set.

shiva459 · Post by **shiva459** » Fri Feb 08, 2008 4:56 am

I understand that if i use the database for a lookup it creates virtual dataset.My problem is I am doing this lookup many times and almost in all the jobs.So is it good to use the database everytime?

Thanks

shiva459 · Post by **shiva459** » Fri Feb 08, 2008 4:57 am

I understand that if i use the database for a lookup it creates virtual dataset.My problem is I am doing this lookup many times and almost in all the jobs.So is it good to use the database everytime?

Thanks

ArndW · Post by **ArndW** » Fri Feb 08, 2008 5:01 am

If you use the same SELECT for the lookup several times then it is best to perform the database lookup once to a dataset or lookup and then use that for the 2nd and further lookups.
The point of the previous posts is that a "sparse" lookup into a table is to be avoided unless absolutely necessary.

shiva459 · Post by **shiva459** » Fri Feb 08, 2008 5:12 am

Thanks a lot ArndW and Ray for helping me in this.
I will go ahead and design my job lin this way.

Thanks
Shiva

DSDexter · Post by **DSDexter** » Sat Feb 09, 2008 1:13 am

ArndW wrote:If you use the same SELECT for the lookup several times then it is best to perform the database lookup once to a dataset or lookup and then use that for the 2nd and further lookups.
The point of the ...

Why not load database into lookupfileset instead of dataset

According to my understanding lookup fileset will give you that additional throughput as compared to a dataset

ArndW · Post by **ArndW** » Sat Feb 09, 2008 3:35 am

The performance benefits that I've seen with lookup filesets isn't very significant, but lookup filesets have the drawback of not letting you view the data using any tool and need to have the keys that are going to be used hardcoded. I have used Datasets for development in the past and then, once all checking has been completed, converted them to lookup filesets.

ray.wurlod · Post by **ray.wurlod** » Mon Feb 11, 2008 10:27 pm

The main benefit of using a Lookup File Set is that the index is built when the Lookup File Set is created, and stored within its structure, rather than having to be built "on the fly" by the LUT_CreateOp operator.