non-database sparse lookup alternative options

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

non-database sparse lookup alternative options

Post by qt_ky »

We have some real-time, always-on DataStage jobs with sparse lookups against indexed key columns, running under ISD. Our servers are on AIX.

Unfortunately, our organization has embraced database security to the point of uselessness.

What are the alternative options for enabling high volume, high performance, key-value pair, non-database equivalent to sparse lookups?
Choose a job you love, and you will never have to work a day in your life. - Confucius
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

what does non-database mean?

you can do a lookup in a 'binary' fixed-width file with a hashing algorithm.
If it will fit in memory, you can do that even faster, or use a memory-mapped file.
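
Something like this, say (just a sketch: the file name, record widths, and hash below are all made-up assumptions, and a real table would need a matching build step that zero-pads keys and leaves empty slots zeroed):

Code: Select all

/* Probe a fixed-width-record file laid out as an open-addressing hash
   table, via mmap. Sketch only; layout and names are assumptions. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define KEY_LEN 16                    /* assumed fixed key width   */
#define VAL_LEN 48                    /* assumed fixed value width */
#define REC_LEN (KEY_LEN + VAL_LEN)

/* Any decent hash will do; FNV-1a is easy to inline. */
static uint64_t fnv1a(const char *key)
{
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < KEY_LEN && key[i]; i++) {
        h ^= (unsigned char)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Linear-probe the mapped table; returns the value field or NULL on a miss. */
static const char *lookup(const char *table, uint64_t nslots, const char *key)
{
    uint64_t slot = fnv1a(key) % nslots;
    for (uint64_t i = 0; i < nslots; i++, slot = (slot + 1) % nslots) {
        const char *rec = table + slot * REC_LEN;
        if (rec[0] == '\0')
            return NULL;                       /* empty slot: key absent */
        if (memcmp(rec, key, KEY_LEN) == 0)
            return rec + KEY_LEN;              /* hit */
    }
    return NULL;
}

int main(void)
{
    int fd = open("ref.tbl", O_RDONLY);        /* hypothetical table file */
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; the OS pages in only the slots we probe. */
    const char *table = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (table == MAP_FAILED) { perror("mmap"); return 1; }

    char key[KEY_LEN] = { 0 };                 /* zero-padded probe key */
    strncpy(key, "CUST000042", KEY_LEN);
    const char *val = lookup(table, st.st_size / REC_LEN, key);
    printf("%.*s\n", VAL_LEN, val ? val : "miss");

    munmap((void *)table, st.st_size);
    close(fd);
    return 0;
}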

but we are way, way outside of datastage coding at this point...
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Non-database means not using an RDBMS-style Connector stage, and instead using something that is well outside of the DBA area of control (as that is where the security shenanigans are taking place), such as a file-based structure that exists and functions on Information Server alone...

Something like a parallel version of a hashed file in a Server job that can support higher volumes (over 2 GB). From what I recall about hashed files in Server jobs (and this was a very long time ago), when you increased the addressing beyond the default 32 bits, they could handle larger volumes but performance suffered greatly.
Choose a job you love, and you will never have to work a day in your life. - Confucius
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

ok, that is what I understood also. If you are OK going outside of DataStage and databases, it's very simple to make a C program do this: again, a memory-mapped file, or eating up a bunch of RAM on one box, or maybe a fixed-record random-access file on a huge (pricey) flash drive server. The file size, or keys being 64 or even 128 bits, wouldn't really matter (a growing number of modern computers now sport 128-bit registers). If it were an always-on service type program, it would respond in true near real time (sub-ms at the worst) on even a good desktop system.

If you are saying don't go outside of DataStage off into weird-stuff land, then I don't know. You can do a double hash, which is frustrating to set up and maintain... that is, you could do a bunch of 32-bit files, with which file the data lives in governed by another 32-bit hashed file... this seems very suboptimal and clunky, but it's a possible approach.
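
To picture the partitioning half of that (names and counts made up), the first hash only decides which sub-file to open, so each individual file stays under the 2 GB line; the lookup inside that file then proceeds as usual:

Code: Select all

#include <stdint.h>
#include <stdio.h>

#define NUM_PARTS 16    /* enough sub-files that each stays under 2 GB */

static uint64_t hash_key(const char *key)
{
    uint64_t h = 1469598103934665603ULL;       /* FNV-1a again */
    while (*key) {
        h ^= (unsigned char)*key++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* First-level hash: resolve a key to the sub-file that holds it.
   The second-level lookup inside that file proceeds as usual. */
static void partition_path(const char *key, char *path, size_t len)
{
    snprintf(path, len, "refdata.part%02llu.tbl",
             (unsigned long long)(hash_key(key) % NUM_PARTS));
}

int main(void)
{
    char path[64];
    partition_path("CUST000042", path, sizeof path);
    printf("key lives in %s\n", path);         /* open it and probe as above */
    return 0;
}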
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

qt_ky wrote:From what I recall about hashed files in Server jobs (and this was a very long time ago), when you increased the addressing beyond the default 32 bits, they could handle larger volumes but performance suffered greatly.
We had some hashed files that were created with 64-bit addressing, since they lived over the 2 GB line, and don't recall any unexpected performance issues. For whatever that is worth. :wink:

And I'm assuming the "double hash" mentioned refers to distributed hashed files; a couple of discussions are here and here.
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Thanks all for the ideas and the links so far. I do recall the double hashed file or cascading hashed file / lookup concept. Is a distributed file yet a different type of file?

I haven't ruled out going outside of DataStage to code a solution, but would also prefer to stay within the tool if possible, especially to make it easier on anyone who comes along in the future to support it.

Are there any DataStage parallel job built in stage types that can achieve this?
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

AFAIK, distributed hashed files are just the official name for what people are calling "double" or "cascading" hashed files. And I don't believe there are any built-in PX stages that can achieve this. :?


Out of morbid curiosity (since I'm working in the DoD space and we also have a crap ton of database security issues to deal with), what brought on this line of research? If you don't mind going there, of course.
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

The security group is pushing to revoke any database privilege that smells like it could be elevated. They are also pushing changes that would force us to rotate database passwords on application/system IDs that were already exempted as non-expiring, which would force us to recompile thousands of jobs within an impossible timeframe, etc.

As a result, we are looking at options to move anything on the ETL back end that we can from being table-based to file-based, and the main hangup seems to be on the equivalent sparse lookup functionality. It's disturbing to be reminded that parallel jobs don't have something similar to hashed files.
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Gotcha. We've had to fight that whole password expiration issue as well.

Silly thought perhaps, but is there anything about your ISD jobs that requires them to be parallel jobs? Server jobs work in that capacity as well; I built several of them back in the day, hence the question. Seems to me, though, that you'll be hard-pressed to replicate that sparse lookup functionality in an always-on job... how would you flow changes into the data in real time if the lookup source is, say, file based? Or is it reference data that does not change?
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Those are good points. The ISD jobs do not need to be parallel jobs; that's just how they were developed to begin with, out of familiarity, each running under a single-node config file.

Most are serving up reference data that is updated daily, weekly, monthly, or every 5 years. The smaller ones use normal lookups and don't need to be changed. The larger ones are in the 5 to 10 GB range and use sparse lookups with indexed columns. We have only one that truly requires full sparse lookup functionality, and we will not be able to modify that one. So it's really just the larger reference data we would like to move away from a database if possible. A normal lookup was tested on those in the past; after an unacceptable 45+ minute preparation time, it performed well once it finally loaded into memory, but for a real-time ISD job the long spin-up time was a no-go. We do have SSD storage available.
Choose a job you love, and you will never have to work a day in your life. - Confucius
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

So when the reference data needs to be refreshed, sounds like a scheduled shutdown / restart of the ISD jobs would be needed?
-craig

"You can never have too many knives" -- Logan Nine Fingers
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

Yes, we do scheduled bouncing for those with normal lookups. I have scripted some ISD commands to disable and then enable the ISD applications, and those scripts are called at the tail end of any sequence job that rebuilds the reference data. Those sequence jobs are scheduled to run either after hours or during a small maintenance window. For those using sparse lookups today, we don't usually bounce the ISD applications, but it sounds like that may need to change.
Choose a job you love, and you will never have to work a day in your life. - Confucius
UCDI
Premium Member
Posts: 383
Joined: Mon Mar 21, 2016 2:00 pm

Post by UCDI »

SSD was me thinking of dumping the file into RAM as fast as possible. I think the Lookup stage is doing something more beyond just a grab: if it took 45 minutes, it may not be the disk read time but the tool. I mean, you should be able to read a gig every couple of seconds, and a 10 GB file should drop into RAM in under 60 seconds on even low-end hardware. Something else is going on there; it's trying to understand the file, or reading line by line, or something.
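
An easy way to test that theory (rough sketch; the file name and chunk size are arbitrary) is to time a raw sequential read of the same file and compare it against the 45 minutes the stage took:

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define CHUNK (64 * 1024 * 1024)   /* 64 MB reads */

int main(void)
{
    FILE *f = fopen("ref.tbl", "rb");          /* hypothetical file name */
    if (!f) { perror("fopen"); return 1; }
    char *buf = malloc(CHUNK);
    if (!buf) { perror("malloc"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);       /* wall clock, not CPU time */

    long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, CHUNK, f)) > 0)
        total += (long long)n;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes in %.1f s (%.0f MB/s)\n",
           total, secs, total / secs / 1e6);

    free(buf);
    fclose(f);
    return 0;
}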
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Remember too that hashed files can be cached in memory (up to a configurable maximum size).

You can put a Hashed File stage in a Server Shared Container and use that in a parallel job (just an alternative strategy to a server job - I'd prefer the server job).
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
qt_ky
Premium Member
Posts: 2895
Joined: Wed Aug 03, 2011 6:16 am
Location: USA

Post by qt_ky »

A normal Lookup is doing something more behind the scenes that does not appear to be documented: as it reads data from the reference link, it writes it to some sort of memory-mapped file that may or may not have an index built on the key columns (I don't know), and then finally it reads that file into memory. I have posted about that hidden behavior before. Now I vaguely recall there was an APT environment variable we tested that made it do that step quickly. Will have to search...

We will also experiment with some hashed file variations and server jobs! It may be a while before I have any results to share. Still a bit miffed that parallel jobs don't have something like hashed files...
Choose a job you love, and you will never have to work a day in your life. - Confucius