How to calculate the size of the text file

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.

Moderators: chulett, rschirm, roy

paddu
Premium Member
Posts: 232
Joined: Tue Feb 22, 2005 11:14 am
Location: California

How to calculate the size of the text file

Post by paddu »

I know it's not a DataStage question, but this information would be very useful to DataStage users.

I do not know how to calculate the size of a text file.
I know it depends on the number of columns and the total length of the columns.

Please shed some light on this.

Thanks
paddu
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The operating system will report the size of the text file when you do a DIR command.

Calculating - or, more accurately, estimating - in advance is a bit more tricky.

For example, numbers are converted into strings of characters. Do you have in mind a fixed-width format (in which case the file size is simply a product of the number of lines and the line length, and the line length is simply the sum of the field lengths plus 2 for the line terminator)?

Or do you have in mind delimited format (in which case you will need to estimate the average field size for each field - for example, how many digits on average in an Integer, how many characters on average in a VarChar)? Sum these averages, add one byte per delimiter (the number of fields minus one) plus two for the line terminator, and multiply by the number of lines.
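
As a minimal Python sketch of both estimates (the function names and sample average widths are hypothetical, not anything DataStage-specific):

Code:

# Sketch of the two estimates above. Assumes a Windows CR+LF line
# terminator (2 bytes) and single-character delimiters.

def fixed_width_size(field_widths, n_rows, terminator_len=2):
    # Fixed-width: every row is the sum of the field widths plus terminator.
    return (sum(field_widths) + terminator_len) * n_rows

def delimited_size(avg_field_widths, n_rows, delimiter_len=1, terminator_len=2):
    # Delimited: use the *average* width of each field, plus one
    # delimiter between each pair of fields, plus the terminator.
    n_fields = len(avg_field_widths)
    row_len = (sum(avg_field_widths)
               + (n_fields - 1) * delimiter_len
               + terminator_len)
    return row_len * n_rows

# Hypothetical example: five fields averaging 6, 8, 8, 7 and 3 characters.
print(delimited_size([6, 8, 8, 7, 3], 15_000_000))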
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
paddu
Premium Member
Posts: 232
Joined: Tue Feb 22, 2005 11:14 am
Location: California

Post by paddu »

Ray,

I wanted to know, by and large, how we estimate the size of a file, but anyway, I have a pipe-delimited file with 5 columns (the expected field sizes are Numeric 6, Char 10, Date 8, Char 10 and NChar 3).

The number of rows expected is 15 million.
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

One approach:
Run your job to load 1,000 rows into a sequential file and check the size of the file.
Multiply that size by 15,000.
You will get the size estimate for 15 million rows!
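
A hedged sketch of this sampling approach in Python (the file name is an assumption, purely for illustration):

Code:

import os

SAMPLE_ROWS = 1_000
TARGET_ROWS = 15_000_000

# Size on disk of the 1,000-row sample file (hypothetical path).
sample_bytes = os.path.getsize("sample_1000_rows.txt")

# Scale linearly up to the full row count.
estimate_bytes = sample_bytes * (TARGET_ROWS // SAMPLE_ROWS)
print("Estimated size: %.0f KB" % (estimate_bytes / 1024))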
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
paddu
Premium Member
Posts: 232
Joined: Tue Feb 22, 2005 11:14 am
Location: California

Post by paddu »

Narasimha - no, it did not work that way.

The file whose description I gave before is about 684,078 KB for exactly 15,446,662 records.

I thought there would be a general formula to calculate it.

Thanks
paddu

narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Well, that was just a thought.
Can you share how much variance you got with this estimate?

- What was the size of the file into which you loaded 1,000 rows?
- When you multiply that size by 15,446.662 (from your total row count), what is the variance from your final size of 684,078 KB?
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
paddu
Premium Member
Posts: 232
Joined: Tue Feb 22, 2005 11:14 am
Location: California

Post by paddu »

Oh oh.

The 684,078 KB file is my source (pipe delimited), which has 5 columns, 15,446,662 rows and the lengths I mentioned earlier.

Coming to the text file (pipe delimited) loaded with 1,000 rows: its size is 34 KB.
34 times 15,446.662 is 525,186.508 KB.

I hope now you see the difference.
paddu
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There's no general formula. You must construct a formula for each file.

Let's consider your example of a pipe-delimited file with 5 columns (field sizes expected: Numeric 6, Char 10, Date 8, Char 10 and NChar 3). Numeric 6 may contain from 1 to 6 characters; let's assume it's always six characters. Then the record size is the sum of:

Code:

Numeric(6)                  6 bytes
Char(10)                   10 bytes (12 if quoted)
Date(8)                     8 bytes 
Char(10)                   10 bytes (12 if quoted)
NChar(3)                    6 bytes (8 or 10 if quoted, depending on the quote character used)
Pipe delimiters             4 bytes 
Line terminator (Windows)   2 bytes 
Worst case is (6+12+8+12+10+4+2) = 54 bytes per line. For 15,446,662 lines that equates to 834,119,748 bytes.
Best case is (6+10+8+10+6+4+2) = 46 bytes per line. For 15,446,662 lines that equates to 710,546,452 bytes.
Dare I suggest that your Char(10) columns are not fully populated? That is, the delimited strings sometimes contain fewer than 10 characters, in which case they are more accurately described as VarChar(10).
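
Ray's arithmetic, as a small Python sketch (the byte counts simply mirror the table above; nothing here is DataStage-specific):

Code:

ROWS = 15_446_662
DELIMITERS = 4   # pipes between 5 fields
TERMINATOR = 2   # CR+LF on Windows

# Worst case: both Char(10) fields quoted (12 bytes each), NChar(3)
# quoted with a double-byte quote character (10 bytes).
worst = (6 + 12 + 8 + 12 + 10 + DELIMITERS + TERMINATOR) * ROWS

# Best case: no quoting; NChar(3) is 6 bytes (2 bytes per character).
best = (6 + 10 + 8 + 10 + 6 + DELIMITERS + TERMINATOR) * ROWS

print(worst)  # 834,119,748 bytes (54 bytes per line)
print(best)   # 710,546,452 bytes (46 bytes per line)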
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
paddu
Premium Member
Posts: 232
Joined: Tue Feb 22, 2005 11:14 am
Location: California

Post by paddu »

You are absolutely right; not all the Char(10) values are fully populated.

Thank you, Ray, for the information.
I never thought of the pipe delimiter and line terminator bytes :(

Every day is a learning day.
Thanks :)
paddu
narasimha
Charter Member
Posts: 1236
Joined: Fri Oct 22, 2004 8:59 am
Location: Staten Island, NY

Post by narasimha »

Yeah, as Ray said, there's no general formula, nor anything like Oracle's average row length, to estimate the row size.
But I would be curious to know the size of your target file after loading all 15,446,662 records into it.
(Just in case there was some transformation or loss in the process of loading.)
Narasimha Kade

Finding answers is simple, all you need to do is come up with the correct questions.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There's probably an end-of-file marker as well (Ctrl-Z on Windows; UNIX files don't store one - Ctrl-D merely signals end of input at a terminal), but I didn't think a difference of one byte would be of concern.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
paddu
Premium Member
Posts: 232
Joined: Tue Feb 22, 2005 11:14 am
Location: California

Post by paddu »

Narasimha ,

I am loading the file into a hashed file, and I do not have access to the hashed file directory (no admin rights :cry: ).
DSguru2B
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

Load it into your project directory.
Also, if you want to calculate the file size on Windows, I posted a small routine that uses the BASIC STATUS command to get the file size. Search for it; it's here somewhere in the forum.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The calculations are different for hashed files, as there are internal pointers that must also be taken into account. The Hashed File Calculator (on your DataStage client CD, in the unsupported utilities folder) knows about these. As an estimate, you can use 14 bytes of overhead per row for regular hashed files, or 26 bytes per row for 64-bit hashed files.
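
A rough Python sketch of that rule of thumb (the data bytes per row are a made-up assumption; the Hashed File Calculator remains the authoritative tool):

Code:

ROWS = 15_446_662
OVERHEAD_REGULAR = 14  # bytes/row overhead, regular hashed file
OVERHEAD_64BIT = 26    # bytes/row overhead, 64-bit hashed file

DATA_BYTES = 24  # assumed data bytes per row, purely for illustration

print((DATA_BYTES + OVERHEAD_REGULAR) * ROWS)  # regular hashed file estimate
print((DATA_BYTES + OVERHEAD_64BIT) * ROWS)    # 64-bit hashed file estimate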
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
paddu
Premium Member
Posts: 232
Joined: Tue Feb 22, 2005 11:14 am
Location: California

Post by paddu »

The hashed file size is 2 KB. (I loaded only 3 columns, of lengths Char(10), NChar(3) and Date 8, as per the requirement.)