Incremental Extract based on a timestamp

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Incremental Extract based on a timestamp

Post by peternolan9 »

Hi All,
oh joy oh joy. :D I'm doing a DS project where ALL the source tables actually have a timestamp for their last update, and they are small enough that we can extract from the source Oracle database using a where clause on that timestamp. This is the first time ever for me...

So, I was wondering: what is the best way of setting up DS so that we can maintain a table in our staging area holding the 'last extract timestamp' for this source system, and then use that timestamp in the where clause for all the extraction processes that run against the remote database?

I can think of a few ways myself, including passing the last extract date/time as a parameter to each job that performs an extract, or even doing a remote join, but I figured that some bright spark on this list will have found 'the best' way to do this....

We are going to use ODBC or OCI and we do not have the ability to put a table into the target system that we can maintain.

Anyone have a good suggestion for doing this?

Thanks


Peter Nolan
Best Regards
Peter Nolan
www.peternolan.com
1stpoint
Participant
Posts: 165
Joined: Thu Nov 13, 2003 2:10 pm
Contact:

idea

Post by 1stpoint »

Source Stage (OCI/ODBC)
Select ...
From SourceTable
Where TimeColumn > (select max(LastExtracted) from staging_control_table)

@DATE contains the date/time the job was started; use it to update the LastExtracted value when the job finishes. I would use an Aggregator stage to save this value and an update to the "staging_control_table".

By using the subquery method you avoid a full table scan of the source and the need for a constraint in the job.
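For the control-table maintenance, a minimal sketch of the statement run once the load completes might be (the :run_start_time bind is a placeholder for the job start date/time captured via @DATE/@TIME; the table and column come from the subquery above):

Code: Select all

-- run only after the load completes, so a failed run does not move the high-water mark
-- :run_start_time is a placeholder, not an actual DataStage binding
UPDATE staging_control_table
   SET LastExtracted = :run_start_time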
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Re: Incremental Extract based on a timestamp

Post by chulett »

peternolan9 wrote:I can think of a few ways myself including passing the last extract date time as a parameter to each job that performs an extract
Peter, don't know about the 'bright spark' bit or if this is really the 'best way' to do it, but I've recently done exactly this and used the methodology quoted above.

Basically, I'm using two jobs and a Sequencer to tie them together, along with an Oracle table to store 'audit' type information for the process. This information includes the last extract timestamp and a row count from the extraction process. The first job queries this table for the max(timestamp) and creates a new record in the audit table. The previous timestamp is passed out via UserStatus and comes into the Extract job as a parameter to constrain the where clause to only transactions that have occurred after the previous extract. The Extract job also updates the audit record with the results of the extract, so it takes a second job parameter that is the 'key' to the audit record, also passed out via UserStatus from the first job. Note that the first job is capable of generating an initial timestamp when it runs for the first time, without having to 'seed' the audit table with a starter record.
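To make the shape of this concrete, here is a rough sketch of the kind of audit table and 'previous timestamp' lookup involved; the names and DDL are invented for illustration, not Craig's actual objects:

Code: Select all

-- hypothetical audit table for the incremental extract
CREATE TABLE extract_audit (
    audit_key        NUMBER  NOT NULL,  -- 'key' passed to the Extract job via UserStatus
    extract_start_ts DATE    NOT NULL,  -- timestamp used to constrain the next run
    extract_rows     NUMBER,            -- updated by the Extract job when it finishes
    CONSTRAINT extract_audit_pk PRIMARY KEY (audit_key)
);

-- first job: previous high-water mark, with a default for the very first run
SELECT NVL(MAX(extract_start_ts), TO_DATE('1900-01-01', 'YYYY-MM-DD'))
  FROM extract_audit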

I'm sure there are other ways, but this has been working flawlessly for me for several months now. Hope this helps.
-craig

"You can never have too many knives" -- Logan Nine Fingers
thamark
Premium Member
Premium Member
Posts: 43
Joined: Thu Jan 29, 2004 12:12 am
Location: US

Re: Incremental Extract based on a timestamp

Post by thamark »

The previous timestamp is passed out via UserStatus and comes into the Extract job as a parameter to constrain the where clause to only transactions that have occurred after the previous extract.
Can you please give some more detail on UserStatus? How is this passed?
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Each job has a 'User Status' area associated with it. You can 'park' information there and it is available to any other process after that... at least until the job is rerun or recompiled. A Sequencer job can 'automatically' pick up information from there and bring it into a job as a parameter; it will show up in the Parameter Helper thingy as 'stagename.$UserStatus'.

A couple of things to know. It can only pass out one 'item', but you can get around that restriction by passing out a dynamic array or a delimited string and then parsing out the individual pieces. You use DSSetUserStatus to put something in there, but since it must be Called (it's a subroutine, not a function) you have to encapsulate it in a routine if you want to use it in a derivation. My simple little routine has two lines of code:

Code: Select all

Call DSSetUserStatus(Arg1)   ;* park the value in the job's user status area
Ans = Arg1                   ;* return it unchanged so the field passes through the derivation
This allows a field to 'pass through' the routine and also end up in the user status area. Note that it is basically a hash that is holding this information, so the 'last one in wins' rule applies. No matter how many times you write to it, only the last thing you wrote will be available. I typically use this in the derivation of Stage Variables to help control the number of times this gets called.
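For example, if the wrapper routine above were saved as (say) SetUserStatus, a Transformer stage variable such as svLastStamp could have a derivation of simply (link and column names made up):

Code: Select all

SetUserStatus(lnkMaxDate.LAST_UPDATE_TS)

That way the call fires once per row, no matter how many output columns use the value.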

From Job Control you would use the DSGetJobInfo function with an InfoType of DSJ.USERSTATUS to retrieve the information. Or, as I said, automate it using a Sequencer job. These functions are documented in your online help available from the Designer.
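As a rough job-control sketch of that route (the job and parameter names here are placeholders, not taken from any real project):

Code: Select all

* Run the audit/timestamp job, read its user status, then feed it to the extract job
hAudit = DSAttachJob("GetLastExtract", DSJ.ERRFATAL)
ErrCode = DSRunJob(hAudit, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hAudit)
LastExtract = DSGetJobInfo(hAudit, DSJ.USERSTATUS)
ErrCode = DSDetachJob(hAudit)

hExtract = DSAttachJob("ExtractSourceTable", DSJ.ERRFATAL)
ErrCode = DSSetParam(hExtract, "LastExtractTimestamp", LastExtract)
ErrCode = DSRunJob(hExtract, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hExtract)
ErrCode = DSDetachJob(hExtract)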

Hope this helps,
-craig

"You can never have too many knives" -- Logan Nine Fingers
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Re: Incremental Extract based on a timestamp

Post by peternolan9 »

Hi Craig,
times like this I wish we had a place for people to publish/share DS code... because I'd really like to see how you have done this... (if you are feeling generous: peter@peternolan.com ;-) )

When you say you have used the methodology quoted above, did you mean my suggestion of a parameter, or the other approach of pushing the constraint down to the remote database?

On Auditing.

Your comment really caught my eye..... I used to do billing systems, so auditing is a 'big deal' for me.... and I did auditing in my first DS project... but the code (written by the very, very best around, none other than the great John Whyman) was, I felt, so complicated that I gave up on auditing in DS.....

I would truly love to know how to do a decent job of auditing data transfers into a database in DS without it being too hard.... I used to keep a batch number, file cycle number, row count and timestamps for each file coming into the DW, and I still do in my own software. That is what John wrote for the project I mentioned. It worked, but it meant we needed a start job and a stop job for each transfer job..... we had 70+ tables in the staging area on that project, so we needed 140 jobs to control the 70..... it all ended up being too many jobs......

Hoping you are feeling in a generous mood... ;-)

Peter

chulett wrote:
peternolan9 wrote:I can think of a few ways myself including passing the last extract date time as a parameter to each job that performs an extract
Peter, don't know about the 'bright spark' bit or if this is really the 'best way' to do it, but I've recently done exactly this and used the methodology quoted above.

Basically, I'm using two jobs and a Sequencer to tie them together, along with an Oracle table to store 'audit' type information for the process. This information includes the last extract timestamp and a row count from the extraction process. The first job queries this table for the max(timestamp) and creates a new record in the audit table. The previous timestamp is passed out via UserStatus and comes into the Extract job as a parameter to constrain the where clause to only transactions that have occurred after the previous extract. The Extract job also updates the audit record with the results of the extract, so it takes a second job parameter that is the 'key' to the audit record, also passed out via UserStatus from the first job. Note that the first job is capable of generating an initial timestamp when it runs for the first time, without having to 'seed' the audit table with a starter record.

I'm sure there are other ways, but this has been working flawlessly for me for several months now. Hope this helps.
Best Regards
Peter Nolan
www.peternolan.com
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Re: Incremental Extract based on a timestamp

Post by peternolan9 »

peternolan9 wrote:Hi All,
oh joy oh joy. :D I'm doing a DS project where ALL the source tables actually have a timestamp for their last update, and they are small enough that we can extract from the source Oracle database using a where clause on that timestamp. This is the first time ever for me...
Hi All,
just thinking about this some more....

I was thinking that, since the system I am getting data from is an operational system where data will be updated during the extraction process, surely I will need two timestamps.

I will need a timestamp for last date-time extracted. That goes without saying.

But surely I will also need a timestamp for 'start of extraction processing', so that any rows updated while the extraction is taking place are not extracted this time; they will be extracted next time. In other words, the last-update datetime of a row must be BETWEEN the lastextract datetime and the thisextract datetime.
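In the extract stage's user-defined SQL that would look roughly like this (column names are invented, and #LastExtractTS#/#ThisExtractTS# stand for job parameters); it is written as a half-open range rather than a strictly inclusive BETWEEN so a row stamped exactly on the old boundary is not picked up twice:

Code: Select all

SELECT *  -- in practice, list the columns needed for staging
  FROM SourceTable
 WHERE LastUpdateTS >  TO_DATE('#LastExtractTS#', 'YYYY-MM-DD HH24:MI:SS')
   AND LastUpdateTS <= TO_DATE('#ThisExtractTS#', 'YYYY-MM-DD HH24:MI:SS')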

If I don't do this I will see rows inserted during the extraction process either twice or not at all, depending on whether the row was inserted before I got to it or after I had passed it, and on whether I use the start or the end of the extraction process as the lastextracted datetime....

Or am I crazy?
Best Regards
Peter Nolan
www.peternolan.com
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Re: idea

Post by peternolan9 »

1stpoint wrote:Source Stage (OCI/ODBC)
Select ...
From SourceTable
Where TimeColumn > (select max(LastExtracted) from staging_control_table)

@DATE contains the date/time the job was started; use it to update the LastExtracted value when the job finishes. I would use an Aggregator stage to save this value and an update to the "staging_control_table".

By using the subquery method you avoid a full table scan of the source and the need for a constraint in the job.
Hi 1stpoint,
but our staging control table will not be in the same database as our operational data, and we have no way of putting that table into the operational system.... (or I doubt it, anyway...)

Will what you suggest still work?

Thanks

Peter
Best Regards
Peter Nolan
www.peternolan.com
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Peter

The time should be based on the extract system's time. The first job should extract the current time from this source and store it in a hash file and in the target. This time could be displayed on the EDW reports. This way all tables will have a snapshot at exactly the same time. Hopefully this will maintain integrity across tables. This is the best methodology that I have seen. If someone has a better idea then please share.
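A minimal sketch of that first job's source query, assuming an Oracle source (the format mask is just an example):

Code: Select all

-- capture the source system's own clock once, at the very start of the batch;
-- the single row returned is written to a hashed file and to the target
SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD HH24:MI:SS') AS THIS_EXTRACT_TS
  FROM DUAL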
Mamu Kim
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

Search for "self-healing process". I responded to a similar question regarding a Unidata source.
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Post by ogmios »

Be careful with Oracle and its multi-versioning, which could possibly break your scheme. :P

Ogmios
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Re: idea

Post by peternolan9 »

1stpoint wrote:Source Stage (OCI/ODBC)
Select ...
From SourceTable
Where TimeColumn > (select max(LastExtracted) from staging_control_table)

@DATE contains the date/time the job was started; use it to update the LastExtracted value when the job finishes. I would use an Aggregator stage to save this value and an update to the "staging_control_table".

By using the subquery method you avoid a full table scan of the source and the need for a constraint in the job.
Hi 1stpoint/All,
the more I think about this, the more I believe that we must select using a BETWEEN lastextracttstamp AND thisextracttstamp.

Why? If we select using just 'later than lastextracttstamp', we get the following two possibilities.

1. A row is inserted into the table before we reach that position in the table.
2. A row is inserted into the table after we have scanned past the page it was inserted into, but we have not finished scanning yet.

In case 1 we will extract the row in this pass, because the lastupdatetstamp on the row will be greater than the last extract. But we will also get it next time, because the lastupdatetstamp will still be greater than the lastextracttstamp saved at the start of this run.

In case 2 we will miss the row this time but we will get it next time.

Clearly, if we try to fix case 1 by saving the end-of-extract tstamp as the lastextracttstamp, then we will completely miss the row in case 2.

Obviously we cannot turn on repeatable read for an extract due to contention.

So, the more I think about it, the more I believe that extracting data from a source database that is being updated during the extract requires two timestamps: lastextracttstamp and thisextracttstamp, where thisextracttstamp is set at the start of extract processing. This way we will see any record that was updated before extract processing started. We will not see any record updated after extract processing started; we will get that the next time around.

Anybody out there see a hole in this thinking?


Also, we are first going to try to do this via an Oracle statement which uses a subselect on the control table and a select on the source table, and see if Oracle is smart enough to perform the subselect once and send the two tstamps to the source system so that performance will be OK (see the sketch below).

The second option is to fetch the two tstamps from the database and pass them as parameters to each extract job.
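For the first option, the statement sent to the source would look roughly like this, assuming the control table (with hypothetical columns LastExtractTS and ThisExtractTS, one row) can be referenced from the query at all, which is the open question:

Code: Select all

SELECT *  -- in practice, list the columns needed for staging
  FROM SourceTable
 WHERE LastUpdateTS >  (SELECT LastExtractTS FROM staging_control_table)
   AND LastUpdateTS <= (SELECT ThisExtractTS FROM staging_control_table)

The second option would simply replace the two subselects with #LastExtractTS# and #ThisExtractTS# job parameters.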

I'll let you all know how it goes....

Thanks
Best Regards
Peter Nolan
www.peternolan.com
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Re: idea

Post by ogmios »

Peter,

I see one big hole: http://asktom.oracle.com/pls/ask/f?p=49 ... 677475837,.

This has bitten me a few times already. In Oracle you should make sure no one is updating the table you're getting the updates from. In other databases you would have a similar problem, but in Oracle it is more "real".

Ogmios
ogmios
Participant
Posts: 659
Joined: Tue Mar 11, 2003 3:40 pm

Re: idea

Post by ogmios »

ogmios wrote:Peter,

I see one big hole: http://asktom.oracle.com/pls/ask/f?p=49 ... 677475837,.

This has bitten me a few times already. In Oracle you should make sure no one is updating the table you're getting the updates from. In other databases you would have a similar problem, but in Oracle it is more "real".

With Oracle, the commit point and the start of your extract query are what matter most. Any change made to your table after you start the extract query will not be seen: Oracle multi-versioning. However, if a process is updating the table at that time, it may insert a row with a current timestamp which you would then miss. For one project where every change was important, I used Oracle Streams to get the changes out of the system.

So your cases 1 and 2 can never occur.

Ogmios
peternolan9
Participant
Posts: 214
Joined: Mon Feb 23, 2004 2:10 am
Location: Dublin, Ireland
Contact:

Re: idea

Post by peternolan9 »

Hi Ogmios,

hhhmmmm, maybe it was not 'Joy oh Joy' after all :cry:

I always thought the last-updated timestamp would be set to the system time at commit time, unless it is the program that puts the timestamp into the column.

I seem to remember having this discussion about 15 years ago regarding DB2, and being told that if you use the system datetime and assign it to the last update timestamp, you can be assured it will be the system datetime at the time of the commit on the row.... does Oracle do something different?.....

If so, the last-updated timestamp is about as useful as an ashtray on a motorbike (and I used to own one...)

Peter

ogmios wrote:
ogmios wrote:Peter,

I see one big hole: http://asktom.oracle.com/pls/ask/f?p=49 ... 677475837,.


With Oracle, the commit point and the start of your extract query are what matter most. Any change made to your table after you start the extract query will not be seen: Oracle multi-versioning.

Ogmios
Best Regards
Peter Nolan
www.peternolan.com