Is the Lookup stage an alternative to the Hashed File stage?

Post questions here relating to DataStage Enterprise/PX Edition, covering such areas as parallel job design, parallel datasets, BuildOps, Wrappers, etc.

richdhan
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Is the Lookup stage an alternative to the Hashed File stage?

Post by richdhan »

I have been using hashed files for lookups all this time.
Is the Lookup stage an alternative to the Hashed File stage?
Can anyone explain the significance of the Lookup stage
and the difference between the Hashed File stage and the Lookup stage?

Can't we use the Hashed File stage in parallel jobs? If not, what is the
appropriate stage to use that employs a hashing algorithm?

Thanks.

--Rich
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The PX engine does not know about hashed files.

The only way you can use them is to encapsulate a server job in a shared container and run that in the parallel environment. That is probably more trouble than it's worth; the datasets used by the Lookup stage (and by the other stages that can perform reference lookups/joins in the PX environment) are memory resident in any case.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
richdhan
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Post by richdhan »

Ray, thanks for your comments.

In my server jobs I used to populate a hashed file and then use that hashed file as the lookup when loading data.

From your comments, what I understand is that I should populate a lookup dataset and then use that dataset (which is resident in memory, not in a file) as the lookup/reference when loading data.
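
In other words, something like this minimal Python sketch of the pattern (illustration only, not DataStage code; the cust_id key and the rows are made up):

    # Build phase: load the reference rows into an in-memory map keyed on
    # the lookup key -- the role the lookup dataset plays.
    reference_rows = [{"cust_id": 1, "name": "Acme"}]                 # made up
    reference = {r["cust_id"]: r for r in reference_rows}

    # Probe phase: for each input row, look the key up in memory, not on disk.
    input_rows = [{"cust_id": 1, "amt": 10}, {"cust_id": 2, "amt": 20}]
    for row in input_rows:
        match = reference.get(row["cust_id"])   # None when there is no match
        print(row, "->", match)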

Please correct me if my understanding is wrong.

Thanks

--Rich
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Well, you're part way there.

In the PX environment you get the choice of three stages that do in-memory "joins" (the Join stage, the Lookup stage and the Merge stage). You have to select the correct tool. Read about them in the Parallel Job Developer's Guide.

(If you think about it, what you were doing in server jobs with Hashed File stages was a memory-based, primary-key-based, left outer join.)
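
To put that analogy in concrete terms, here is a small Python sketch (illustration only; the column names are invented) of the left outer join that a hashed file reference lookup amounts to:

    # The reference data becomes an in-memory map keyed on the primary key
    # (the hashed file). Every stream row is kept and picks up the matching
    # reference columns; unmatched rows simply get no extra columns --
    # a memory-based, primary-key-based, left outer join.
    def reference_lookup(stream_rows, reference_rows, key):
        ref = {r[key]: r for r in reference_rows}
        for row in stream_rows:
            yield {**row, **ref.get(row[key], {})}

    orders    = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}]
    customers = [{"cust": 1, "name": "Acme"}]
    print(list(reference_lookup(orders, customers, "cust")))
    # -> [{'cust': 1, 'amt': 10, 'name': 'Acme'}, {'cust': 2, 'amt': 20}]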
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
richdhan
Premium Member
Posts: 364
Joined: Thu Feb 12, 2004 12:24 am

Post by richdhan »

Ray, you have mentioned that the hashed file is memory based, but as far as I know it is file based (two files, .OVR and .DAT, are created), and the data is loaded into memory only if the pre-load to memory option is selected.

There is a stage in PX, the Lookup File Set (it creates a file with a .fs extension); I think this is similar to the Hashed File stage. You can use the lookup file set as the reference when loading data.

Ray, I went through the documentation but I just wanted to make sure that my understanding is right.

Correct me if I am wrong.

Thanks

--Rich
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Your beliefs about hashed files being disk based are correct, except for the file names. However, nearly everyone enables pre-load for read, which is why I claim that they are (or, more properly, can be) memory based.
The actual file names are DATA.30, OVER.30 and .Type30.

File sets and data sets are different in PX. The difference is best understood by reading Chapter 4 (Data Set Stage) and Chapter 7 (File Set Stage) in the Parallel Job Developer's Guide.

The main difference is that file sets carry formatting information.

The ".fs" files (and the ".ds" files for data sets) are just control files that describe where the data really are, when the data are stored in persistent form. What these contain will depend on the information in the configuration file and the partitioning choices made in the job design. In operation, a dataset is always in (virtual) memory.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.