How to calculate the size of the text file
Moderators: chulett, rschirm, roy
I know it's not a DataStage question, but this information would be very useful to DataStage users.
I do not know how to calculate the size of a text file.
I know it depends on the number of columns and the total length of the columns.
Please throw some light on this.
Thanks
paddu
-
- Participant
- Posts: 54607
- Joined: Wed Oct 23, 2002 10:52 pm
- Location: Sydney, Australia
- Contact:
The operating system will report the size of the text file when you do a DIR command.
Calculating - or, more accurately, estimating - in advance is a bit more tricky.
For example, numbers are converted into strings of characters. Do you have in mind a fixed-width format (in which case the file size is simply a product of the number of lines and the line length, and the line length is simply the sum of the field lengths plus 2 for the line terminator)?
Or do you have in mind delimited format (in which case you will need to estimate the average field size for each field - for example, how many digits on average in an Integer, how many characters on average in a VarChar)? Sum these averages, add one byte per delimiter plus two for the line terminator, and multiply by the number of lines.
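The two estimates described above can be sketched in a few lines of code. This is a minimal illustration, not anything from the thread; the field widths in the example are hypothetical averages.

```python
def fixed_width_size(line_length, rows, terminator_bytes=2):
    """Fixed-width format: every line is the same length, so the file
    size is simply (line length + terminator) times the row count."""
    return (line_length + terminator_bytes) * rows

def delimited_size(avg_field_widths, rows, terminator_bytes=2):
    """Delimited format: sum the average field widths, add one byte per
    delimiter (one fewer than the field count) plus the terminator."""
    delimiter_bytes = len(avg_field_widths) - 1
    per_line = sum(avg_field_widths) + delimiter_bytes + terminator_bytes
    return per_line * rows

# Example: 5 fields averaging 6, 10, 8, 10 and 3 characters, 1,000 rows.
# 37 data bytes + 4 delimiters + 2 terminator = 43 bytes per line.
print(delimited_size([6, 10, 8, 10, 3], 1000))  # 43000
```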
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Well, that was just a thought.
Can you share how much variance you got with this estimate?
- What was the size of the file into which you loaded 1,000 rows?
- When you multiply that size by 15,446.662 (from your total row count), what is the variance from your final size of 684,078 KB?
Narasimha Kade
Finding answers is simple, all you need to do is come up with the correct questions.
oh oh
The 684,078 KB file is my source (pipe delimited), which has 5 columns, 15,446,662 rows, and the lengths I mentioned earlier.
As for loading into a text file (pipe delimited) with 1,000 rows, the size is 34 KB.
34 times 15,446.662 is 525,186.508.
I hope now you see the difference.
paddu
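The sample-and-extrapolate approach in this post (measure a 1,000-row file, then scale up by the total row count) can be checked with straightforward arithmetic; a sketch, using the numbers quoted in the thread:

```python
# Numbers reported in the thread.
sample_rows = 1000
sample_kb = 34            # measured size of the 1,000-row file
total_rows = 15_446_662   # rows in the full source file
actual_kb = 684_078       # reported size of the full source file

# Scale the sample size up to the full row count.
estimate_kb = sample_kb * (total_rows / sample_rows)

print(round(estimate_kb, 3))         # 525186.508
print(actual_kb - round(estimate_kb, 3))  # the variance in KB
```

The gap (roughly 159,000 KB) is exactly the kind of variance Narasimha asked about; it suggests the sampled rows were not representative of the average row width.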
Last edited by paddu on Thu Feb 15, 2007 5:28 pm, edited 1 time in total.
There's no general formula. You must construct a formula for each file.
Let's consider your example of a pipe delimited file with 5 columns (the expected field sizes are Numeric 6, Char 10, Date 8, Char 10 and NChar 3). Numeric 6 may contain from 1 to 6 characters - let's assume it's always six characters. Then the record size is the sum of:
Code: Select all
Numeric(6) 6 bytes
Char(10) 10 bytes (12 if quoted)
Date(8) 8 bytes
Char(10) 10 bytes (12 if quoted)
NChar(3) 6 bytes (8 or 10 if quoted, depending on the quote character used)
Pipe delimiters 4 bytes
Line terminator (Windows) 2 bytes
Worst case is (6+12+8+12+10+4+2) = 54 bytes per line. For 15,446,662 lines that equates to 834,119,748 bytes.
Best case is (6+10+8+10+6+4+2) = 46 bytes per line. For 15,446,662 lines that equates to 710,546,452 bytes.
Dare I suggest that your Char(10) columns are not fully populated? That is, the delimited strings sometimes contain fewer than 10 characters, in which case they are more accurately described as VarChar(10).
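The worst-case and best-case arithmetic above can be reproduced directly; a short sketch using the byte counts from the table:

```python
rows = 15_446_662
delimiters = 4   # pipes between 5 fields
terminator = 2   # CR+LF on Windows

# Worst case: both Char(10) fields quoted (12 bytes) and NChar(3) quoted (10).
worst = (6 + 12 + 8 + 12 + 10 + delimiters + terminator) * rows

# Best case: all fields unquoted (NChar(3) still takes 6 bytes as two-byte chars).
best = (6 + 10 + 8 + 10 + 6 + delimiters + terminator) * rows

print(worst)  # 834119748
print(best)   # 710546452
```

Both bounds exceed the actual 684,078 KB file, which supports the suggestion that the Char(10) columns behave like VarChar(10).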
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
Yeah, as Ray said, there is no general formula, nor anything like Oracle's average row length, to estimate the row size.
But I would be curious to know the size of your target file after loading all 15,446,662 records into it.
(Just in case there was some transformation/loss in the process of loading)
Narasimha Kade
Finding answers is simple, all you need to do is come up with the correct questions.
Load it in your project directory.
Also, if you want to calculate the file size on Windows, I posted a small routine that uses the STATUS BASIC command to get the file size. Search for it; it's here somewhere in the forum.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
The calculations are different for hashed files, as there are internal pointers that must also be taken into account. The Hashed File Calculator (on your DataStage client CD, in the unsupported utilities folder) knows about these. As an estimate, you can use 14 bytes per row of overhead for regular hashed files, or 26 bytes per row for 64-bit hashed files.
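The rule of thumb above (per-row pointer overhead on top of the data bytes) can be sketched as follows; the 46-byte data figure in the example is an assumption carried over from the earlier best-case calculation, not something stated for hashed files:

```python
def hashed_file_estimate(rows, avg_data_bytes_per_row, sixty_four_bit=False):
    """Rough hashed-file size: data bytes per row plus the per-row
    pointer overhead quoted in the post (14 regular, 26 for 64-bit)."""
    overhead = 26 if sixty_four_bit else 14
    return rows * (avg_data_bytes_per_row + overhead)

# Example with an assumed 46 data bytes per row:
print(hashed_file_estimate(15_446_662, 46))        # regular, 60 bytes/row
print(hashed_file_estimate(15_446_662, 46, True))  # 64-bit, 72 bytes/row
```

For anything beyond a rough estimate, the Hashed File Calculator utility mentioned above is the right tool.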
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.