Data Set vs Sequential Stage

pkothana
Participant
Posts: 50
Joined: Tue Oct 14, 2003 6:12 am

Post by pkothana »

Hi All,

Could anybody tell me whether using the Data Set stage in place of the Sequential File stage (in PX jobs) has any advantages in terms of performance or any other criteria? Also, please let me know if there are any drawbacks to using the Data Set stage.

Thanks in advance,
Regards,
Pinkesh
Peytot
Participant
Posts: 145
Joined: Wed Jun 04, 2003 7:56 am
Location: France

Post by Peytot »

Data Set file:
- It keeps the parallelism, so if a first job creates a Data Set file, a second job that reads it will run faster.
- You cannot read these files with Server jobs.
- Only PX can read this kind of file.
- You can view the data in the stage.

Sequential file:
- You lose your parallelism, so you lose performance.
- You cannot view the data in the stage, but you can access the file outside DataStage (under Unix or Windows if you wish). You can modify the data for testing, for example.
- You can archive them.

Regards,

Pey
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

A dataset is an internal data staging file format for Parallel jobs. If Parallel jobs are going to have a common/repeated dataset for merge or lookup operations, landing it to a dataset is beneficial. It preserves the data so that many jobs could benefit by using the exact same data during their operations.

A sequential file has no referencing capability. It has to reside on a specific server file system.

The dataset format is proprietary and the only way to inspect it is via DataStage. From a staging standpoint, it is not useful as a means of preparing a ready-to-load "file", because of its proprietary nature and tendency to be non-persistent. The data in a sequential file is easy to audit, as it can be inspected with just about any text tool (more, cat, grep, vi, etc.).

A sequential file is easy to manipulate, and Ralph Kimball (oooohhmmm) recommends it as the preferred method for milestone/recovery/restart staging formats because of the audit/transportability/ease-of-use of this format. Use datasets if you need reference capabilities and the data is not persistent, meaning temporary work files.
Kenneth Bland

bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Post by bigpoppa »

You can also read/write from/to a file set using PX. A file set is a set of partitioned sequential files, similar to a dataset, yet viewable w/o PX.
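To make the "set of partitioned sequential files" idea concrete, here is a rough Python sketch of the general concept only. It is not how PX implements file sets; the file names (part_0.csv, etc.), the CRC32 hash partitioner and the helper functions are all made up for illustration. The point is that each partition is a plain text file that any tool can open, while rows with the same key stay together in one partition.

```python
import csv
import tempfile
import zlib
from pathlib import Path


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioning on a key value (illustrative only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def write_partitioned(rows, key_index, num_partitions, out_dir):
    """Write rows into one plain-text CSV file per partition."""
    out_dir = Path(out_dir)
    paths = [out_dir / f"part_{i}.csv" for i in range(num_partitions)]
    files = [p.open("w", newline="", encoding="utf-8") for p in paths]
    try:
        writers = [csv.writer(f) for f in files]
        for row in rows:
            writers[partition_of(row[key_index], num_partitions)].writerow(row)
    finally:
        for f in files:
            f.close()
    return paths


def read_partition(path):
    """Read a single partition back; each file is readable on its own."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# Hypothetical sample data: 20 rows spread over 5 customer keys.
rows = [[f"cust{i % 5}", str(i)] for i in range(20)]

with tempfile.TemporaryDirectory() as tmp:
    paths = write_partitioned(rows, key_index=0, num_partitions=4, out_dir=tmp)
    partitions = [read_partition(p) for p in paths]

for i, part in enumerate(partitions):
    print(f"partition {i}: {len(part)} rows")
```

A downstream reader could process each partition file separately without repartitioning, which is the property that a single flat sequential file does not give you.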

A PX trick when using datasets is to keep your data byte-aligned. I got this tip a while ago, and from what I understand, byte-aligned data is easier and faster for PX to process.

- BP
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Can you expand on this; in particular how can data NOT be byte-aligned? Do you mean word-aligned?
What are the implications for NLS, where the number of bytes used to store any particular character may be one, two, three or even four?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
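As a small illustration of the variable-width point raised above, the following Python sketch shows how many bytes individual characters occupy. It assumes UTF-8 encoding; the character map actually configured for NLS in a given DataStage project may differ, so treat this only as an example of the general effect.

```python
# Characters chosen to cover the 1-, 2-, 3- and 4-byte cases in UTF-8.
samples = [
    "A",           # Latin capital letter A
    "\u00e9",      # e with acute accent
    "\u20ac",      # euro sign
    "\U0001F600",  # an emoji outside the Basic Multilingual Plane
]

# Map each character to its encoded length in bytes.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s)")
```

So a field of N characters has no fixed byte length under such an encoding, which is why any alignment rule stated in bytes needs care once NLS is involved.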
bigpoppa
Participant
Posts: 190
Joined: Fri Feb 28, 2003 11:39 am

Post by bigpoppa »

Ray,

You're right. Word-aligned, I believe. I don't know much more than what I already posted. I don't know about the NLS stuff.

- BP
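For anyone wondering what word alignment means in general, here is a minimal Python sketch using the standard struct module. It says nothing about how PX lays out dataset records internally (nobody in this thread has confirmed that); it only shows that a record with natively aligned fields can take more bytes than the same fields tightly packed, because padding is inserted so each field starts on its natural boundary.

```python
import struct

# Record layout for illustration: a 1-byte signed char followed by an int.
# "@" uses the platform's native sizes and alignment (padding may be added).
# "=" uses standard sizes with no alignment padding.
native_size = struct.calcsize("@bi")
packed_size = struct.calcsize("=bi")

print(f"native (aligned) record size: {native_size} bytes")
print(f"packed (unaligned) record size: {packed_size} bytes")
print(f"padding bytes added by alignment: {native_size - packed_size}")
```

On most mainstream platforms the aligned layout is larger because the int is pushed to a 4-byte boundary. Aligned fields can usually be fetched by the CPU in a single access, which is the usual reason given for alignment helping performance.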