Is the Lookup stage an alternative to the Hashed File stage?

Post questions here relating to DataStage Enterprise/PX Edition, covering such areas as parallel job design, parallel datasets, BuildOps, Wrappers, etc.

richdhan
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Is the Lookup stage an alternative to the Hashed File stage?

Post by richdhan »

I have been using hashed files for lookups all this time.
Is the Lookup stage an alternative to the Hashed File stage?
Can anyone explain the significance of the Lookup stage
and the difference between the Hashed File stage and the Lookup stage?

Can't we use the Hashed File stage in parallel jobs? If not, what is the
appropriate stage to use that employs a hashing algorithm?

Thanks.

--Rich
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The PX engine does not know about hashed files.

The only way you can use them is to encapsulate a server job in a shared container and run that in the parallel environment. That is probably more trouble than it's worth; the datasets used by the Lookup stage (and by the other stages that can perform reference lookups/joins in the PX environment) are memory resident in any case.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
richdhan
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Post by richdhan »

Ray, thanks for your comments.

In my server jobs I used to populate a hashed file and then use that hashed file as the lookup when loading data.

From your comments, what I understand is that I should populate a lookup dataset and then use that dataset (which is resident in memory, not in a file) as the lookup/reference when loading data.
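
In other words, something like this minimal Python sketch of the pattern (illustration only, not DataStage code; the cust_id key and the rows are made up):

    # Build phase: load the reference rows into an in-memory map keyed on
    # the lookup key -- the role the lookup dataset plays.
    reference_rows = [{"cust_id": 1, "name": "Acme"}]                 # made up
    reference = {r["cust_id"]: r for r in reference_rows}

    # Probe phase: for each input row, look the key up in memory, not on disk.
    input_rows = [{"cust_id": 1, "amt": 10}, {"cust_id": 2, "amt": 20}]
    for row in input_rows:
        match = reference.get(row["cust_id"])   # None when there is no match
        print(row, "->", match)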

Please correct me if my understanding is wrong.

Thanks

--Rich
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Well, you're part way there.

In the PX environment you get the choice of three stages that do in-memory "joins" (the Join stage, the Lookup stage and the Merge stage). You have to select the correct tool. Read about them in the Parallel Job Developer's Guide.

(If you think about it, what you were doing in server jobs with Hashed File stages was a memory-based, primary-key-based, left outer join.)
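
To put that analogy in concrete terms, here is a small Python sketch (illustration only; the column names are invented) of the left outer join that a hashed file reference lookup amounts to:

    # The reference data becomes an in-memory map keyed on the primary key
    # (the hashed file). Every stream row is kept and picks up the matching
    # reference columns; unmatched rows simply get no extra columns --
    # a memory-based, primary-key-based, left outer join.
    def reference_lookup(stream_rows, reference_rows, key):
        ref = {r[key]: r for r in reference_rows}
        for row in stream_rows:
            yield {**row, **ref.get(row[key], {})}

    orders    = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}]
    customers = [{"cust": 1, "name": "Acme"}]
    print(list(reference_lookup(orders, customers, "cust")))
    # -> [{'cust': 1, 'amt': 10, 'name': 'Acme'}, {'cust': 2, 'amt': 20}]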
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
richdhan
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Post by richdhan »

Ray, you have mentioned that the hashed file is memory based, but as far as I know it is file based (two files, .OVR and .DAT, are created), and the data is loaded into memory only if the pre-load to memory option is selected.

There is a stage in PX, the Lookup File Set (it creates a file with a .fs extension); I think this is similar to the Hashed File stage. You can use the lookup file set as the reference when loading data.

Ray, I went through the documentation but I just wanted to make sure that my understanding is right.

Correct me if I am wrong.

Thanks

--Rich
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Your beliefs about hashed files being disk based are correct, except for the file names. However, nearly everyone enables pre-load for read, which is why I claim that they are (or, more properly, can be) memory based.
The actual file names are DATA.30, OVER.30 and .Type30.

File sets and data sets are different in PX. The difference is best understood by reading Chapter 4 (Data Set Stage) and Chapter 7 (File Set Stage) in the Parallel Job Developer's Guide.

The main difference is that file sets carry formatting information.

The ".fs" files (and the ".ds" files for data sets) are just control files that describe where the data really are, when the data are stored in persistent form. What these contain will depend on the information in the configuration file and the partitioning choices made in the job design. In operation, a dataset is always in (virtual) memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.