Live Data Feed capturing through DS

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

manish1005
Participant
Posts: 39
Joined: Thu Nov 23, 2006 11:23 pm

Live Data Feed capturing through DS

Post by manish1005 »

Hi,

I have been given a scenario in which I have to capture data from a live feed (which can be in XML or CSV format), apply a couple of transforms and then store the data in a database (DB2).

I wanted to know whether this can be done using any of the DS stages. I obviously don't want to download the whole feed into one file and then read it through the Sequential File or XML file stage; instead I want to do some sort of near real time transformation.

Any ideas, anyone?
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

The RTI stage plugin is available, or MQ.
Can you explain a bit more about your requirement?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
manish1005
Participant
Posts: 39
Joined: Thu Nov 23, 2006 11:23 pm

Post by manish1005 »

Actually, I have a custom Java application on a server that streams out some data every couple of seconds for about 4-5 hours a day. I can configure the port used for streaming the data out.

The data stream is in XML format.

Now I want to collect that data stream on another system (where I have DataStage installed). As and when data is collected, it has to be fed into the database (DB2).

I don't know how to connect the Real Time XML Input stage to the data feed, and then where and how to parse it and store the values in relational DB2 tables.

By the way, what is MQ?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You would have to purchase the SOA Edition of DataStage to get true 'real time' processing of data. And possibly the Java PACK if you needed to use a Java 'app' to process the data. SOA would allow you to deploy your ETL job as a web service and thus process your XML in real time as your Java app called it.

Otherwise you'd need to do something more like 'near real time' processing. Small or 'micro' batches depending on your terminology, launching your processing job every X minutes to grab whatever is available to be processed.
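
As a very rough sketch of that micro-batch approach (the project and job names below are made up, and dsjob options can vary between releases, so check them on yours), a small scheduler could simply invoke the job on a fixed interval:

```java
// Minimal micro-batch scheduler sketch. Assumes dsjob is on the PATH;
// "MyProject" and "LoadFeedJob" are placeholder names, not anything
// from this thread.
import java.util.concurrent.TimeUnit;

public class MicroBatchScheduler {
    public static void main(String[] args) throws Exception {
        while (true) {
            // Launch the DataStage job and wait for it to finish.
            Process p = new ProcessBuilder(
                    "dsjob", "-run", "-jobstatus", "MyProject", "LoadFeedJob")
                    .inheritIO()
                    .start();
            System.out.println("dsjob exit code: " + p.waitFor());

            // Sleep until the next micro batch (5 minutes here).
            TimeUnit.MINUTES.sleep(5);
        }
    }
}
```

In practice you'd more likely drive this from a job sequence or an external scheduler, but the principle is the same.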

As to the last question: http://en.wikipedia.org/wiki/WebSphere_MQ
-craig

"You can never have too many knives" -- Logan Nine Fingers
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You are the right candidate to use MQSeries. You can make the stage listen to the port where the data is being sent out, and you can feed it into the XML stage to have persistent storage.
Do you already have the RTI stage installed?
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
manish1005
Participant
Posts: 39
Joined: Thu Nov 23, 2006 11:23 pm

Post by manish1005 »

Do you already have the RTI stage installed?
Under the Real Time category I only have XML stages installed; NO MQSeries stages are installed.

I have been reading about MQSeries and SOA for a few hours now. I figured out that MQSeries would best suit my needs, as suggested by kumar. But for that I would need to *BUY* the SOA edition of DS AND the IBM MQSeries product. Is that correct?

I guess I would be better off writing a Java application to collect the feed and write it to a file, and then running a DS job at regular intervals to collect the data from the file, to simulate a near real time feed (as suggested by chulett).

The only problem with that approach is how to clear the data after it has been read by the Sequential File stage. Also, while the job is in progress more data can come in, so I cannot just use an after-job routine to delete and recreate the file.
The problem becomes more complex if I have to simulate the functionality of the MQ stage, which deletes already-read data only after the job has finished *successfully*.
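
One way around this that I can think of (just a sketch, all paths and file names below are invented) is to have the Java collector write each interval's data to a temporary file and rename it only once it is complete, so the DS job only ever reads finished files and can safely delete exactly the ones it has loaded:

```java
// "Write then rename" hand-off sketch. New data always lands in a .tmp
// file first, so the DataStage job can read and then delete feed_*.xml
// without racing against the writer. Paths are illustrative only.
import java.io.File;
import java.io.FileWriter;
import java.text.SimpleDateFormat;
import java.util.Date;

public class FeedBatchWriter {
    public static void main(String[] args) throws Exception {
        String stamp = new SimpleDateFormat("yyyyMMdd_HHmmss").format(new Date());
        File tmp   = new File("C:/feed/incoming/feed_" + stamp + ".tmp");
        File ready = new File("C:/feed/ready/feed_" + stamp + ".xml");

        FileWriter out = new FileWriter(tmp);
        try {
            // ... write the records received in this interval ...
            out.write("<records>...</records>\n");
        } finally {
            out.close();
        }

        // Only now does the file become visible to the DataStage job.
        if (!tmp.renameTo(ready)) {
            throw new RuntimeException("Could not publish " + ready);
        }
    }
}
```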

Any comments?
jhmckeever
Premium Member
Posts: 301
Joined: Thu Jul 14, 2005 10:27 am
Location: Melbourne, Australia
Contact:

Post by jhmckeever »

manish1005,

I've used a similar technique to read and process data streamed into a named pipe. The pipe was created with 'mkfifo' at the start of the job and removed at the end (using before-job and after-job ExecSH entries).

In my case, the pipe was being populated by the 'top' command to capture performance statistics whilst numerous jobs were running.

You could just leave your job running for the duration - you wouldn't necessarily need to run it periodically, although that would also work. You could get your Java process (or whatever populates the pipe) to output some special character string (guaranteed not to appear in genuine data!) to inform your job that the datastream is finished, allowing your job to shutdown gracefully.
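
If it helps, the writing side can be tiny. The sketch below assumes a Unix-style FIFO (created beforehand, e.g. by a before-job ExecSH running 'mkfifo /tmp/feed_pipe') and an end-of-feed marker string; both the path and the marker are only examples, and the marker must never occur in genuine data:

```java
// Sketch of the feed producer writing into a named pipe. The pipe path
// and the terminator string are assumptions for illustration only.
import java.io.FileWriter;
import java.io.PrintWriter;

public class FeedPipeWriter {
    public static void main(String[] args) throws Exception {
        // Opening the FIFO blocks until the reading job opens its end.
        PrintWriter pipe = new PrintWriter(new FileWriter("/tmp/feed_pipe"));
        try {
            // ... stream records out as they are produced ...
            pipe.println("<record>...</record>");
        } finally {
            // Marker line the DS job can test for to shut down gracefully.
            pipe.println("##END-OF-FEED##");
            pipe.close();   // closing also sends EOF down the pipe
        }
    }
}
```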

HTH
J.
John McKeever
Data Migrators
MettleCI (https://www.mettleci.com) - DevOps for DataStage
http://www.datamigrators.com/
manish1005
Participant
Posts: 39
Joined: Thu Nov 23, 2006 11:23 pm

Post by manish1005 »

jhmckeever, thanks for the reply. So here I will have to create the named pipe beforehand or through the Java application, and will have to share the path/name with the DS job.

If I get it right, the named pipe will be accessed through the Sequential File stage like any normal file.
A simple job design (keeping aside the XML requirements for the time being):
Sequential File (with the path of the named pipe) ---> Transformer ---> DB2.

You could just leave your job running for the duration - you wouldn't necessarily need to run it periodically, although that would also work.
But what I couldn't get is: why will the job access the same Sequential File stage again and again as the data comes in? Or is there something else that needs to be done to map the named pipe in DataStage?

Also, I am using Windows 2000 Server, so I will probably need to figure out how to use named pipes on Windows.
kumar_s
Charter Member
Posts: 5245
Joined: Thu Jun 16, 2005 11:00 pm

Post by kumar_s »

You can use the same Sequential File stage with the name of the pipe mentioned in it.
If you read one record, the next record will be available to read.
And you can read it periodically, as the data will be available in the pipe fed by your Java stream.
Windows also has a named pipe option; you can check for that.
Impossible doesn't mean 'it is not possible' actually means... 'NOBODY HAS DONE IT SO FAR'
jhmckeever
Premium Member
Posts: 301
Joined: Thu Jul 14, 2005 10:27 am
Location: Melbourne, Australia
Contact:

Post by jhmckeever »

manish,

Yes - You'll need to synchronise the name of the pipe your Java app is writing to with the one your DS job is reading from.

Depending on your configuration you could either get DS to invoke the Java app to populate the pipe, or get your Java app to invoke your DS job, with either one (Java app or DS job) passing the name of the shared pipe to the other.
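
For example (purely a sketch; the project, job and parameter names are placeholders, and dsjob syntax should be checked against your release), the Java app could launch the job and hand it the shared pipe name as a job parameter:

```java
// Sketch: the Java feed application starts the DataStage job via the
// dsjob command line and passes the pipe path through a job parameter.
// "MyProject", "FeedLoadJob" and "PipeName" are all placeholder names.
public class LaunchFeedJob {
    public static void main(String[] args) throws Exception {
        String pipePath = "/tmp/feed_pipe";
        Process p = new ProcessBuilder(
                "dsjob", "-run",
                "-param", "PipeName=" + pipePath,
                "MyProject", "FeedLoadJob")
                .inheritIO()
                .start();
        System.out.println("dsjob exit code: " + p.waitFor());
    }
}
```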

Yes - The DS jobs will access the named pipe as a sequential file.

The job will continue to read the sequential file until the pipe is closed, or until your job shuts down. I don't know what would happen if you put an EOF on the pipe - maybe that would work?
HTH,
J.
John McKeever
Data Migrators
MettleCI (https://www.mettleci.com) - DevOps for DataStage
http://www.datamigrators.com/
Smeitei
Participant
Posts: 28
Joined: Tue Jan 23, 2007 3:14 pm

Post by Smeitei »

MQ Series... Get the XML feed through MQ and run a 24x7 job that keeps reading the XML input and writes it to a file. You can have daily unload files which you can sum up over the weekend and process for insertion/update, or as your frequency demands.