Data Set vs Sequential Stage

pkothana · Post by **pkothana** » Fri Nov 07, 2003 4:25 am

Hi All,

Could anybody let me know whether using the DataSet stage in place of the Sequential File stage (in PX jobs) has any advantages in terms of performance or any other criteria? Also, please let me know if there are any drawbacks to using the DataSet stage.

Thanks in advance,
Regards,
Pinkesh

Peytot · Post by **Peytot** » Fri Nov 07, 2003 7:33 am

DataSet file:
- It keeps the parallelism, so if a first job creates a DataSet file, a second job that reads it will run faster.
- Server jobs cannot read these files.
- Only PX can read this kind of file.
- You can do a View Data in the stage.

Sequential file:
- You lose your parallelism, so you lose performance.
- You cannot do a View Data in the stage, but you can open the file outside the stage (under Unix or Windows, if you wish). You can modify the data for a test, for example.
- You can archive them.

Regards,

Pey

kcbland · Post by **kcbland** » Fri Nov 07, 2003 8:11 am

A dataset is an internal data staging file format for Parallel jobs. If Parallel jobs are going to have a common/repeated dataset for merge or lookup operations, landing it to a dataset is beneficial. It preserves the data so that many jobs could benefit by using the exact same data during their operations.

A sequential file has no referencing capability. It has to reside on a specific server file system.

The dataset format is proprietary and the only way to inspect it is via DataStage. From a staging standpoint, it is not useful as a means of preparing a ready-to-load "file", because of its proprietary nature and tendency to be non-persistent. A sequential file makes the data easy to audit, as it can be inspected with just about any text tool (more, cat, grep, vi, etc.).

A sequential file is easy to manipulate, and Ralph Kimball (oooohhmmm) recommends it as the preferred method for milestone/recovery/restart staging formats because of the audit/transportability/ease-of-use of this format. Use datasets if you need reference capabilities and the data is non-persistent, meaning temporary work files.

bigpoppa · Post by **bigpoppa** » Fri Nov 07, 2003 10:07 am

You can also read/write from/to a file set using PX. A file set is a set of partitioned sequential files, similar to a dataset, yet viewable w/o PX.
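To make the "set of partitioned sequential files" idea concrete, here is a minimal Python sketch. It is only an illustration of the general technique (hash-partitioning rows on a key into several plain-text files), not the actual file set or dataset format that PX writes. The function name `write_partitioned`, the `partNNNN.csv` naming, and the sample rows are all made up for the example.

```python
import csv
import os
import tempfile
import zlib


def write_partitioned(rows, key_index, n_parts, out_dir):
    """Hash-partition rows on one key column into n_parts plain CSV files.

    Hypothetical helper: this mimics the idea of partitioned sequential
    files, not DataStage's real on-disk layout.
    """
    paths = [os.path.join(out_dir, f"part{p:04d}.csv") for p in range(n_parts)]
    files = [open(path, "w", newline="") for path in paths]
    try:
        writers = [csv.writer(f) for f in files]
        for row in rows:
            # crc32 is stable across runs, unlike Python's built-in hash() on str.
            part = zlib.crc32(str(row[key_index]).encode("utf-8")) % n_parts
            writers[part].writerow(row)
    finally:
        for f in files:
            f.close()
    return paths


# Made-up sample data: (customer key, name).
rows = [("1001", "alice"), ("1002", "bob"), ("1003", "carol"), ("1001", "alice-2")]
out_dir = tempfile.mkdtemp()
paths = write_partitioned(rows, key_index=0, n_parts=2, out_dir=out_dir)

# Each partition is an ordinary text file, so any tool can read it back.
for path in paths:
    with open(path, newline="") as f:
        print(os.path.basename(path), list(csv.reader(f)))
```

Rows with the same key always land in the same partition file, which is why a later job that reads the partitions as-is does not have to repartition the data first, and each part stays readable with more, cat, or grep.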

A PX trick when using datasets is to keep your data byte-aligned. I got this tip a while ago, and from what I understand, byte-aligned data is easier and faster for PX to process.

- BP

ray.wurlod · Post by **ray.wurlod** » Fri Nov 07, 2003 5:09 pm

Can you expand on this? In particular, how can data NOT be byte-aligned? Do you mean word-aligned?
What are the implications for NLS, where the number of bytes used to store any particular character may be one, two, three or even four?
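The one-to-four-bytes point can be checked quickly in Python. This sketch assumes UTF-8 as the encoding; the character map configured for a given NLS project may be different, so treat it as an illustration of variable-width characters only. The sample characters are arbitrary choices.

```python
# Four sample characters, written as escapes so the code points are unambiguous.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin small e with acute",
    "\u20ac": "euro sign",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

widths = []
for ch, label in samples.items():
    width = len(ch.encode("utf-8"))  # bytes needed for this one character
    widths.append(width)
    print(f"U+{ord(ch):04X} {label}: {width} byte(s)")

# A string's character count and byte count diverge once non-ASCII appears.
text = "".join(samples)
char_count = len(text)
byte_count = len(text.encode("utf-8"))
print(char_count, "characters,", byte_count, "bytes")
```

So a column declared as N characters does not have a fixed byte length under a variable-width encoding, which is what makes any alignment rule harder to reason about for character data.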

bigpoppa · Post by **bigpoppa** » Tue Nov 11, 2003 1:11 pm

Ray,

You're right. Word-aligned, I believe. I don't know much more than what I already posted. I don't know about the NLS stuff.

- BP
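For anyone unsure what word alignment means in general, Python's `struct` module gives a small illustration. This says nothing about how PX actually lays out records internally (that is not covered in this thread); it only shows the generic effect of alignment padding on a record made of a 1-byte field and a 4-byte integer.

```python
import struct

# Record layout: one signed char (1 byte) followed by one int (4 bytes).
# "=" means standard sizes with no padding; "@" means the platform's
# native sizes and alignment rules.
packed_size = struct.calcsize("=bi")
native_size = struct.calcsize("@bi")

# Putting the wider field first avoids padding between the two fields.
reordered_size = struct.calcsize("@ib")

print("packed, no alignment :", packed_size)
print("native, char then int:", native_size)
print("native, int then char:", reordered_size)
```

The packed layout is 5 bytes. On most common platforms the native char-then-int layout is larger, because padding is inserted so that the integer starts on a 4-byte boundary; the exact native figures depend on the machine and compiler conventions.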