Questions regarding Hash files and hash file stage

Post questions here relating to DataStage Server Edition, covering areas such as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Let's begin with the terminology. It's properly "hashed" file, not "hash" file.

To understand the purpose of DATA.30 and OVER.30 you need to understand the internal structure of a hashed file. Records are organized into "groups" (pages, if you like), whose size is determined by the GROUP.SIZE parameter (1 = 2KB, 2 = 4KB). Ideally each group consists of only one buffer (the "primary buffer"). If the hashing algorithm needs more space than one buffer, the group daisy-chains into as many "secondary buffers" as needed. Oversized records (those larger than the LARGE.RECORD parameter) have their key stored in the group and their data stored in other secondary buffers.

Primary buffers are stored in the DATA.30 file (which grows and shrinks as required, which is where the name "dynamic" comes from); secondary buffers are stored in the OVER.30 file. So, if OVER.30 is large, this might mean that you have overflowed groups and/or oversized records. The ANALYZE.FILE utility with the STATS keyword will report the numbers of both.
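For example, at TCL (assuming the hashed file is known to the project - or via a VOC entry - as HX):

Code: Select all

ANALYZE.FILE HX STATS
Among other things, the output reports the number of overflowed groups and the number of large records, which tells you why OVER.30 is the size it is.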

RESIZE is ordinarily done at TCL level and requires that the hashed file have a VOC entry. (Search for the SETFILE command.) $DSHOME/bin/resize can also be used, but its syntax is not documented. Compilation of the job is not relevant to the resize operation. RESIZE will not affect the size of OVER.30 if OVER.30 contains oversized records, since these cannot be stored in group buffers.

"Create file" in the Hashed File stage opens a dialog in which you can specify tuning parameters when the job creates the Hashed File. If Create File is not checked then no attempt is made to create the hashed file. If "Delete before Create" is not checked, then no attempt is made to create the hashed file if it already exists. The "container directory" for the hashed file only disappears if you delete the hashed file using DELETE.FILE command (if the hashed file is in the project) or rm -r command (if the hashed file is in a directory and has no VOC entry).

A VOC entry is created only if the account (= project) name is provided. You can subsequently create a VOC entry for a directory-pathed hashed file using the SETFILE command. If you later want to delete such a hashed file, you also need to delete its VOC entry using a DELETE query.
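For example (a sketch; the path and the name HX are illustrative):

Code: Select all

SETFILE /data/hash/HX HX OVERWRITING
creates a VOC pointer named HX to the pathed hashed file. To get rid of the file later, remove the directory at the operating system level (rm -r /data/hash/HX) and then clean up the pointer at TCL:

Code: Select all

DELETE VOC HX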

Average record size for HFC (the Hashed File Calculator) is a manual calculation, and you must remember that everything is stored as a string. The ANALYZE.FILE command with the STATS option will show record sizes.
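As a rough, purely hypothetical illustration: a record with a 10-character key and three data columns averaging 8, 12 and 20 characters (remember that a number such as 123456 costs 6 bytes as a string) averages 10 + 8 + 12 + 20 = 50 bytes of data; allow a few more bytes for field marks and per-record overhead, and you might feed 60 bytes into HFC.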

Separation is the size of a group in a static hashed file, expressed in units of 512 bytes (for historical reasons). It does not apply to dynamic hashed files, for which the GROUP.SIZE parameter specifies the group buffer size.
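For comparison, a static hashed file created at TCL with illustrative type, modulo and separation values:

Code: Select all

CREATE.FILE HX2 18 401 4
gets separation 4, that is, groups of 4 x 512 = 2048 bytes.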
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

ray.wurlod wrote:"Create file" in the Hashed File stage opens a dialog in which you can specify tuning parameters to be used when the job creates the hashed file. If Create File is not checked, then no attempt is made to create the hashed file. If "Delete before Create" is not checked, then no attempt is made to create the hashed file if it already exists.
Sorry, had to pop in here and correct something.

The first and third sentences are correct. The second, however, is not - unless this is a change in behaviour in the 8.x version, which would surprise me. The hashed file is auto-created regardless of whether that box is checked or not. Granted, it is a wee bit counter-intuitive, but what that checkbox does is give you the option to override the default values the hashed file would otherwise use when auto-created. Typical choices of something to change there would be the Minimum Modulus and/or Group Size. It is also the only way to change the Type from its default of 30 (Dynamic) to one of the umpteen other 'static' types. And, as noted, it gets you access to 'Delete before Create', which is akin to dropping and recreating a database table each run.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Craig & Ray,

Thanks for your reply. I applied for membership 3-4 days ago and am still waiting for confirmation. Once I have my membership I'll read through your messages and reply back.

Thanks again.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

What Craig mentioned was introduced in version 7. Prior to that you had to validate the job to get the hashed file created.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

True.
-craig

"You can never have too many knives" -- Logan Nine Fingers
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Hello Ray & Craig,

I'm still waiting for my membership to be approved. Don't know why it's taking so long.

In the meantime I have another question.

If a hashed file is created on a 64-bit Sun OS sun4v SPARC machine (SUNW,SPARC-Enterprise-T5220), is it by default a 64-bit hashed file? Or is it initially created as a 32-bit hashed file by DataStage and needs to be explicitly converted to a 64-bit file?

Thanks.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

They are 32-bit by default and via the GUI; the only way to get one created as 64-bit from the outset is via the command line. Or you can convert an existing one, as you noted.
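For example, to convert an existing 32-bit hashed file at TCL (assuming it has a VOC entry named HX):

Code: Select all

RESIZE HX * * * 64BIT
The three asterisks keep the current type, modulo and separation; only the internal addressing changes.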
-craig

"You can never have too many knives" -- Logan Nine Fingers
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Either can be the default, as specified by the 64BIT_FILES setting in the uvconfig file. You don't want the default to be 64-bit, however, because this is wasteful for small hashed files, and the majority of hashed files are small (less than 1GB).
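The relevant line in $DSHOME/uvconfig is typically shipped as:

Code: Select all

64BIT_FILES 0
where 0 means newly created files default to 32-bit and 1 means 64-bit. A change takes effect only after regenerating the configuration with $DSHOME/bin/uvregen and restarting the engine.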
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

That's true. I always forget you can change the default, mostly because you really (really) shouldn't do it. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Thanks for your replies.

In one of the jobs, data is loaded into a hashed file. The options selected for this stage are:
1) In the Stage tab, Use Account Name is checked and Account Name is left empty.
2) In the Inputs tab, General tab, a name, say HX, is provided in File Name, and Allow stage write cache is checked.

Can I do a SELECT COUNT(*) FROM HX at TCL to count the number of records in HX?

If yes: I ran that query and found that the count it returned differs from the count against the actual table. The TCL count returned 1.2 million rows while the table count returned 1.8 million. The job which populates this is DRS stage ---> Hashed File stage.

I think I'm missing something here.

Please clarify. Thanks.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Keep in mind the fact that hashed files do a "destructive overwrite" based on the key(s) defined: "last one in wins". In other words, just because you wrote X records to the hashed file doesn't mean there will be X in there when you select back out from it... or count them.
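One quick way to check whether that explains your numbers is to count distinct key values on the source side. Assuming, purely for illustration, that the hashed file is keyed on a column called ORDER_ID in a table called SOURCE_TABLE:

Code: Select all

SELECT COUNT(DISTINCT ORDER_ID) FROM SOURCE_TABLE;
If that comes back as 1.2 million, your hashed file count is exactly what you should expect.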
-craig

"You can never have too many knives" -- Logan Nine Fingers
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Craig,

Got it!

But is the one (SELECT statement in UV) I mentioned earlier one of the ways to count or select records from a hashed file?

Thanks.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Yes. I'm pretty sure that as long as there is a VOC record for the hashed file (and there is one automatically when it is created in an "Account") then SQL statements like that are valid.

There is other (RetrieVe) syntax that is valid as well; I'll leave the finer points for Ray or Kim to add.
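From memory, and purely as a sketch against the same HX file (SAMPLE just limits how many records are listed):

Code: Select all

COUNT HX
LIST HX SAMPLE 10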
-craig

"You can never have too many knives" -- Logan Nine Fingers
mavrick21
Premium Member
Posts: 335
Joined: Sun Apr 23, 2006 11:25 pm

Post by mavrick21 »

Craig/Ray,

I searched for RESIZE and found and read several posts, but most of them had syntax for converting a 32-bit hashed file to 64-bit.

Can I use RESIZE to restructure an existing dynamic hashed file? By restructure I mean increase the number of groups, change the GROUP.SIZE value, and so on. If yes, can you please give the syntax, or point me to the post which contains it?

If no, is this the right way instead? Create a new hashed file with an increased number of groups (and other properties) and then copy the old hashed file over to the new one using a DS job.

I'm assuming that, by doing one of the above, the lookup hashed file will have most of its data in DATA.30, and this might improve lookup performance.

Thank you
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Yes, RESIZE can accept any of the dynamic hashed file properties on its command line.

Code: Select all

RESIZE hashedfilename * * * SPLIT.LOAD 80 MERGE.LOAD 60
Your assumption that things will be improved is predicated on intelligent choices being made regarding the new settings! However, your belief that lookup performance will be improved if most of the data are in primary groups (DATA.30) is exactly correct.
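For your specific case, the parameters you named go straight on the command line; the values here are purely illustrative, so base MINIMUM.MODULUS on your ANALYZE.FILE figures:

Code: Select all

RESIZE HX * * * MINIMUM.MODULUS 30011 GROUP.SIZE 2
RESIZE restructures the file in place, so there is usually no need to create a new hashed file and copy the data across with a DataStage job.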
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.