Best way for incremental loading using sequential file

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

sivakr4056@gmail.com
Participant
Posts: 4
Joined: Thu Jul 05, 2018 5:57 pm

Best way for incremental loading using sequential file

Post by sivakr4056@gmail.com »

Hi, I have an Oracle database as the source and a sequential file as the target. I would like to know the best way to do an incremental load into the sequential file based on an updated timestamp.

Below is what I am thinking. Please tell me if there is any better approach.

Parallel job stages: Oracle Connector -> Transformer -> 2 sequential files (the 1st sequential file saves the last execution timestamp, the 2nd sequential file holds the actual extracted data).

In the Oracle Connector, I use "SELECT * FROM table_name WHERE date >= last_execution_timestamp" (the last execution timestamp is what I save to the 1st sequential file from the Transformer stage).
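For example, the Oracle Connector SELECT could look something like this (the UPDATED_TS column and the LAST_EXEC_TS job parameter are just illustrative names, not from my actual job):

    -- Extract only rows changed since the last run; #LAST_EXEC_TS# is a
    -- DataStage job parameter substituted into the SQL at run time.
    SELECT *
    FROM   table_name
    WHERE  UPDATED_TS >= TO_TIMESTAMP('#LAST_EXEC_TS#', 'YYYY-MM-DD HH24:MI:SS')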

Can anyone suggest a simple and better approach for incremental loading?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Welcome!

Can you detail for us exactly what you mean by "Sq file"? I'd rather not guess what it is and how you plan on using it for an incremental load. Thanks.
-craig

"You can never have too many knives" -- Logan Nine Fingers
sivakr4056@gmail.com
Participant
Posts: 4
Joined: Thu Jul 05, 2018 5:57 pm

Post by sivakr4056@gmail.com »

Sq file -> sequential file.
The 1st sequential file contains the last updated date and the 2nd sequential file contains the incremental data.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Okay... was hoping for a wee bit more detail than that. So we're not even touching on the 'loading' aspect of this, just the extraction side for now. Fine.

Here's a high-level look at how we handle things of that nature. We generally stream data in from the source where the data is > our last high water mark. Note that I'm not talking about execution time but rather an actual timestamp value from the source. In some instances we use ">=", depending on the data and frequency of change, but then we also add duplicate handling for the overlap between two runs, since some of the information gets pulled twice.
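As a rough sketch of that extract query, assuming a source table SRC_TABLE with a LAST_UPDATE_TS column and a #HWM# job parameter holding the stored high water mark (all names invented for the example):

    -- Strict ">" avoids re-pulling rows already captured in the previous run.
    -- Switch to ">=" only if duplicates are handled downstream.
    SELECT *
    FROM   SRC_TABLE
    WHERE  LAST_UPDATE_TS > TO_TIMESTAMP('#HWM#', 'YYYY-MM-DD HH24:MI:SS')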

Then it's straight on to the target, no landed file required, with a lookup driving the action there - insert or update or discard. We also split the source field that drives the extraction off to an aggregator set to "max" and capture that output as our next starting high water mark. For us, that HWM is stored in a database table (one record per source and table) and is selected out as the parameter for the next run.
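The control table itself can be very simple; something along these lines, with all names made up for illustration. The stored value is selected as the job parameter before the run, and the aggregator's max output updates it after a successful run:

    -- One row per source/table combination holding the last high water mark.
    CREATE TABLE ETL_HIGH_WATER_MARK (
        SOURCE_NAME VARCHAR2(30),
        TABLE_NAME  VARCHAR2(30),
        HWM_TS      TIMESTAMP,
        CONSTRAINT PK_ETL_HWM PRIMARY KEY (SOURCE_NAME, TABLE_NAME)
    );

    -- Selected before the run to seed the extract's #HWM# parameter...
    SELECT HWM_TS
    FROM   ETL_HIGH_WATER_MARK
    WHERE  SOURCE_NAME = 'ORA_PROD' AND TABLE_NAME = 'SRC_TABLE';

    -- ...and updated with the aggregator's MAX(LAST_UPDATE_TS) afterwards
    -- (:new_hwm is a bind variable supplied by the job).
    UPDATE ETL_HIGH_WATER_MARK
    SET    HWM_TS = :new_hwm
    WHERE  SOURCE_NAME = 'ORA_PROD' AND TABLE_NAME = 'SRC_TABLE';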

Now, maybe you are landing the data because they are always inserts and they are all handled by a bulk loader... but you didn't share that detail with us.
-craig

"You can never have too many knives" -- Logan Nine Fingers