non-database sparse lookup alternative options

Posted: Tue May 22, 2018 7:34 am
by qt_ky
We have some real-time, always-on DataStage jobs with sparse lookups on indexed key columns, running under ISD. Our servers are on AIX.

Unfortunately, our organization has embraced database security to the point of uselessness.

What are the alternatives for a high-volume, high-performance, key-value pair, non-database equivalent to sparse lookups?

Posted: Tue May 22, 2018 8:14 am
by UCDI
what does non-database mean?

you can do a lookup in a 'binary' fixed-width file with a hashing algorithm.
If it will fit, you can do that in memory even faster, or use a memory-mapped file.
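
Something along these lines in C, for the in-memory flavor -- the record layout (64-byte records with a 16-byte key up front), the table size, and the file name are all made up for illustration, and it assumes keys never start with a NUL byte:

Code:
/* Sketch: load a fixed-width binary file into an in-memory
 * open-addressed hash table, then answer lookups by hashing.
 * Layout is an assumption: 64-byte records, first 16 bytes = key. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC_SIZE 64
#define KEY_SIZE 16
#define NSLOTS   (1u << 24)            /* power of two; ~1 GiB of slots */

static char (*table)[REC_SIZE];        /* slot array; empty = leading NUL */

static uint64_t hash_key(const char *key)   /* FNV-1a; any decent hash works */
{
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < KEY_SIZE; i++) {
        h ^= (unsigned char)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static void insert(const char *rec)
{
    uint64_t slot = hash_key(rec) & (NSLOTS - 1);
    while (table[slot][0] != '\0')     /* linear probe to an empty slot */
        slot = (slot + 1) & (NSLOTS - 1);
    memcpy(table[slot], rec, REC_SIZE);
}

static const char *lookup(const char *key)  /* NULL if the key is absent */
{
    uint64_t slot = hash_key(key) & (NSLOTS - 1);
    while (table[slot][0] != '\0') {
        if (memcmp(table[slot], key, KEY_SIZE) == 0)
            return table[slot] + KEY_SIZE;
        slot = (slot + 1) & (NSLOTS - 1);
    }
    return NULL;
}

int main(void)
{
    table = calloc(NSLOTS, REC_SIZE);
    FILE *f = fopen("reference.dat", "rb");   /* hypothetical file name */
    if (!table || !f) return 1;

    char rec[REC_SIZE];
    while (fread(rec, REC_SIZE, 1, f) == 1)   /* one-time load into RAM */
        insert(rec);
    fclose(f);

    char key[KEY_SIZE] = "CUST000000123";     /* NUL-padded to 16 bytes */
    const char *val = lookup(key);
    printf("%.*s\n", REC_SIZE - KEY_SIZE, val ? val : "(not found)");
    return 0;
}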

but we are way, way outside of datastage coding at this point...

Posted: Tue May 22, 2018 8:33 am
by qt_ky
Non-database means not using an RDBMS-style Connector stage and instead using something that is well outside of the DBA's area of control (as that is where the security shenanigans are taking place), such as a file-based structure that exists and functions on Information Server alone...

Something like a parallel version of a hashed file in a Server job that can support higher volumes (over 2 GB). From what I recall about hashed files in Server jobs (and this was a very long time ago), when you increased the addressing beyond the default 32 bits they could handle larger volumes, but performance suffered greatly.

Posted: Wed May 23, 2018 2:25 pm
by UCDI
ok, that is what I understood also. If you are OK going outside of datastage and databases, sure, it's very simple to make a C program do this: again, a memory-mapped file, or eating up a bunch of RAM on one box, or maybe a fixed-record random-access file on a huge (pricey) flash drive server would do it. The file size, or being 64-bit or even 128-bit, wouldn't really matter (a growing number of modern computers now sport 128-bit registers). If it were an always-on service-type program, it would respond in true near real time (sub-ms at worst) on even a good desktop system.
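
A sketch of that always-on flavor, under the same made-up layout (the file would be pre-built as an open-addressed hash table; nothing here is a real format):

Code:
/* Sketch: an always-on lookup "service" over a memory-mapped file.
 * Assumes the file was pre-built as an open-addressed hash table of
 * 64-byte records (16-byte key first) with a power-of-two slot count.
 * After the one-time mmap, each probe touches a page or two, which is
 * where the sub-millisecond response comes from. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define REC_SIZE 64
#define KEY_SIZE 16

static uint64_t hash_key(const char *key)    /* FNV-1a */
{
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < KEY_SIZE; i++) {
        h ^= (unsigned char)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s table.dat\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) { perror(argv[1]); return 1; }

    const char *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    uint64_t nslots = (uint64_t)st.st_size / REC_SIZE;  /* power of two */

    char line[256];                    /* serve keys from stdin, one per line */
    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\n")] = '\0';
        char key[KEY_SIZE] = {0};
        strncpy(key, line, KEY_SIZE);  /* NUL-pad / truncate to 16 bytes */

        const char *val = NULL;
        uint64_t slot = hash_key(key) & (nslots - 1);
        while (base[slot * REC_SIZE] != '\0') {     /* probe to empty slot */
            if (memcmp(base + slot * REC_SIZE, key, KEY_SIZE) == 0) {
                val = base + slot * REC_SIZE + KEY_SIZE;
                break;
            }
            slot = (slot + 1) & (nslots - 1);
        }
        printf("%.*s\n", REC_SIZE - KEY_SIZE, val ? val : "(not found)");
        fflush(stdout);
    }
    munmap((void *)base, st.st_size);
    close(fd);
    return 0;
}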

If you are saying don't go outside of datastage off into weird stuff land, then I don't know. You can do a double hash, which is frustrating to set up and maintain... that is, you could do a bunch of 32-bit files, with which file the data was in governed by another 32-bit hashed file... this seems very suboptimal and clunky, but it's a possible approach.
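
For illustration only, the routing could look something like this in C -- bucket count, file names, and sizes are all invented:

Code:
/* Sketch of the two-level "double hash": hash once to pick which
 * bucket file holds the key (keeping each file comfortably under the
 * 2 GB line), then do an ordinary hashed probe inside that file with
 * pread. File names, counts, and sizes are all hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define REC_SIZE       64
#define KEY_SIZE       16
#define NUM_BUCKETS    16              /* 16 bucket files              */
#define SLOTS_PER_FILE (1u << 24)      /* power of two; 1 GiB per file */

static uint64_t hash_key(const char *key)    /* FNV-1a once more */
{
    uint64_t h = 1469598103934665603ULL;
    for (int i = 0; i < KEY_SIZE; i++) {
        h ^= (unsigned char)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Level 1 picks the file; level 2 linear-probes inside it. Nothing
 * is held in memory, so total volume is bounded only by disk. */
static int lookup(const char *key, char *out)
{
    uint64_t h = hash_key(key);
    char path[64];
    snprintf(path, sizeof path, "lookup.%02d.dat", (int)(h % NUM_BUCKETS));

    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    int found = -1;
    uint64_t slot = (h / NUM_BUCKETS) & (SLOTS_PER_FILE - 1);
    char rec[REC_SIZE];
    while (pread(fd, rec, REC_SIZE, (off_t)slot * REC_SIZE) == REC_SIZE
           && rec[0] != '\0') {        /* stop at an empty slot */
        if (memcmp(rec, key, KEY_SIZE) == 0) {
            memcpy(out, rec + KEY_SIZE, REC_SIZE - KEY_SIZE);
            found = 0;
            break;
        }
        slot = (slot + 1) & (SLOTS_PER_FILE - 1);
    }
    close(fd);
    return found;
}

int main(void)
{
    char key[KEY_SIZE] = "CUST000000123";
    char value[REC_SIZE - KEY_SIZE];
    if (lookup(key, value) == 0)
        printf("%.*s\n", REC_SIZE - KEY_SIZE, value);
    else
        printf("(not found)\n");
    return 0;
}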

Posted: Wed May 23, 2018 3:25 pm
by chulett
qt_ky wrote:From what I recall about hashed files in Server jobs (and this was a very long time ago), when you increased the addressing beyond the default 32 bits they could handle larger volumes, but performance suffered greatly.
We had some hashed files that were created with 64-bit addressing, since they lived over the 2 GB line, and I don't recall any unexpected performance issues. For whatever that is worth. :wink:

And I'm assuming the "double hash" mentioned refers to distributed hashed files; a couple of discussions are here and here.

Posted: Thu May 24, 2018 6:42 am
by qt_ky
Thanks all for the ideas and the links so far. I do recall the double hashed file or cascading hashed file / lookup concept. Is a distributed file yet another type of file?

I haven't ruled out going outside of DataStage to code a solution, but would also prefer to stay within the tool if possible, especially to make it easier on anyone who comes along in the future to support it.

Are there any built-in DataStage parallel job stage types that can achieve this?

Posted: Thu May 24, 2018 6:49 am
by chulett
AFAIK, distributed hashed files are just the official name for what people are calling "double" or "cascading" hashed files. And I don't believe there are any built-in PX stages that can achieve this. :?


Out of morbid curiosity (since I'm working in the DoD space and we also have a crap ton of database security issues to deal with) what brought on this line of research? If you don't mind going there, of course.

Posted: Thu May 24, 2018 6:58 am
by qt_ky
The security group is pushing to revoke any database privilege that smells like it could be elevated, and is also pushing changes that would force us to change database passwords on application/system IDs that had already been exempted as non-expiring, which would force us to recompile thousands of jobs within an impossible timeframe, etc.

As a result, we are looking at options to move anything we can on the ETL back end from being table-based to file-based, and the main hangup seems to be the equivalent of sparse lookup functionality. It's disturbing to be reminded that parallel jobs don't have anything similar to hashed files.

Posted: Thu May 24, 2018 7:16 am
by chulett
Gotcha. We've had to fight that whole password expiration issue as well.

Silly thought perhaps, but is there anything about your ISD jobs that requires them to be parallel jobs? Server jobs work in that capacity as well. Built several of them back in the day, hence the question. Seems to me, though, that you'll be hard-pressed to replicate that sparse lookup functionality in an always-on job... how would you flow changes into the data in real time if the lookup source is, say, file based? Or is it reference data that does not change?

Posted: Thu May 24, 2018 7:50 am
by qt_ky
Those are good points. The ISD jobs do not need to be parallel jobs; that's just how they were developed to begin with, out of familiarity, each running under a single-node config file.

Most are serving up reference data that is updated daily, weekly, monthly, or every 5 years. The smaller ones use normal lookups and don't need to be changed. The larger ones are in the 5 to 10 GB range and use sparse lookups with indexed columns. We have only one that truly requires full sparse lookup functionality, and we will not be able to modify that one. So it's really just the larger reference data that we would like to move away from a database if possible. A normal lookup was tested on those in the past; after an unacceptable 45+ minute preparation time, it performed well once it finally loaded into memory, but for a real-time ISD job the long spin-up time was a no-go. We do have SSD storage available.

Posted: Thu May 24, 2018 9:15 am
by chulett
So when the reference data needs to be refreshed, it sounds like a scheduled shutdown / restart of the ISD jobs would be needed?

Posted: Thu May 24, 2018 9:57 am
by qt_ky
Yes, we do scheduled bouncing for those with normal lookups. I have scripted some ISD commands to disable then enable the ISD applications, and those scripts are called at the tail end of any sequence job that rebuilds the reference data. Those sequence jobs are scheduled to run either after hours or during a small maintenance window. For those using sparse lookups today, we don't usually bounce the ISD applications, but it sounds like that may need to change.

Posted: Thu May 24, 2018 11:13 am
by UCDI
SSD was me thinking of dumping the file into RAM as fast as possible. I think the lookup stage is doing something beyond just a grab --- if it took 45 min, it may not be the disk read time but the tool. I mean, you should be able to read a gig every couple of seconds, and a 10 GB file should drop into RAM in under 60 sec on even low-end hardware. Something else is going on there -- it's trying to understand the file, or reading line by line, or something.
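
An easy way to check that, if anyone wants to: time a raw sequential read and compare against the 45 minutes. Nothing datastage-specific here:

Code:
/* Crude sequential-read check: stream a file in 1 MB chunks and
 * report MB/s, to separate raw I/O speed from whatever extra work
 * the Lookup stage is doing. Run on a cold cache for honest numbers. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror(argv[1]); return 1; }

    char *buf = malloc(CHUNK);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK)) > 0)      /* stream the whole file */
        total += n;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%lld bytes in %.2f s = %.0f MB/s\n", total, secs,
           total / 1e6 / secs);
    free(buf);
    close(fd);
    return 0;
}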

Posted: Thu May 24, 2018 5:11 pm
by ray.wurlod
Remember too that hashed files can be cached in memory (up to a configurable maximum size).

You can put a Hashed File stage in a Server Shared Container and use that in a parallel job (just an alternative strategy to a server job - I'd prefer the server job).

Posted: Fri May 25, 2018 5:55 am
by qt_ky
A normal Lookup is doing something more behind the scenes that does not appear to be documented: as it reads data from the reference link, it writes it to some sort of memory-mapped file that may or may not have an index built on the key columns (I don't know), and then finally it reads that file into memory. I have posted about that hidden feature before. Now I vaguely recall there was an APT environment variable we tested that made it do that step quickly. Will have to search...

We will also experiment with some hashed file variations and server jobs! It may be a while before I have any results to share. Still a bit miffed that parallel jobs don't have something like hashed files...