Dataset

Posted: Fri Sep 03, 2010 4:32 am
by balaya.ds
While loading a dataset, how many files are created internally?

And what is the default path of a dataset?

Posted: Fri Sep 03, 2010 4:42 am
by arvind_ds
No idea about the number of files, but the path is whatever you have defined in your configuration file for the Datasets/Scratch folder location. By default it should go to the Project directory.

Re: Dataset

Posted: Fri Sep 03, 2010 4:42 am
by ramsubbiah
balaya.ds wrote:while loading dataset how many files are created internally?

and what is default path of dataset ...?
You need to define your dataset path; the dataset will reside on whatever path you have defined.

While loading the dataset, the records will be loaded into it based on the nodes you have defined in the configuration file.

Posted: Fri Sep 03, 2010 6:15 am
by ray.wurlod
At least one file per resource disk mentioned in the node pool that the Data Set stage is using from the configuration file. More than one file if the operating system limits file size (for example to 2GB). More than one file (potentially) if you append to the Data Set.
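Taking the statement above at face value, the lower bound can be estimated by counting the `resource disk` entries under each node in the configuration file. The following is only a rough sketch: the two-node configuration, the host name and all paths are made up for illustration, and the real count can be higher for the reasons already given (file size limits, appends).

```python
import re
from collections import defaultdict

# Hypothetical two-node configuration, for illustration only.
SAMPLE_CONFIG = '''
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds1" {pools ""}
    resource disk "/data/ds2" {pools ""}
    resource scratchdisk "/scratch/n1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds1" {pools ""}
    resource scratchdisk "/scratch/n2" {pools ""}
  }
}
'''

NODE_RE = re.compile(r'^\s*node\s+"([^"]+)"')
# Matches "resource disk" only; "resource scratchdisk" lines are ignored.
DISK_RE = re.compile(r'^\s*resource\s+disk\s+"([^"]+)"')


def resource_disks_by_node(config_text):
    """Map each node name to the resource disk paths declared under it."""
    disks = defaultdict(list)
    current = None
    for line in config_text.splitlines():
        node = NODE_RE.match(line)
        if node:
            current = node.group(1)
            continue
        disk = DISK_RE.match(line)
        if disk and current is not None:
            disks[current].append(disk.group(1))
    return dict(disks)


def minimum_data_files(config_text):
    """Lower bound only: one data file per resource disk entry per node."""
    by_node = resource_disks_by_node(config_text)
    return sum(len(paths) for paths in by_node.values())


if __name__ == "__main__":
    for node, paths in resource_disks_by_node(SAMPLE_CONFIG).items():
        print(node, paths)
    print("minimum data files:", minimum_data_files(SAMPLE_CONFIG))
```

The parser is line-based and deliberately naive; it does not handle node pools, so it assumes the Data Set stage uses every node in the file.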

Posted: Wed Sep 15, 2010 8:53 am
by kvsudheer
ray.wurlod wrote:At least one file per resource disk mentioned in the node pool that the Data Set stage is using from the configuration file. More than one file if the operating system limits file size (for example t ...
My question is along the same lines as this post, so I am continuing in this thread. Please let me know if I should start a new thread.

Until now I have been thinking that when we create a dataset, the descriptor file is stored on the resource disk and the data file(s) are stored in scratch disk space.
But as per this post, I understand that even the data files are stored on the resource disk. Does that mean the dataset has nothing to do with the scratch disk?

I request you to kindly guide me in this regard.

Thanks,
Sudheer

Posted: Wed Sep 15, 2010 9:34 am
by priyadarshikunal
The descriptor is stored in the folder specified in the Data Set stage property (the project directory if no folder is specified), and the data files are stored on the resource disk specified in the configuration file. Hence, the scratch disk is never used for dataset storage unless the resource and scratch disks are the same in the configuration file, or maybe for virtual datasets (not for datasets themselves).

Scratch disk is used as a buffer between/for processes and should get cleaned up after job completion.

Posted: Wed Sep 15, 2010 12:58 pm
by vivekgadwal
priyadarshikunal wrote: The descriptor is stored in folder specified in dataset stage property (project directory if no folder specified) and datafiles are stored in Resource disk specified in configuration file.
Right on.

The resource disk is permanent storage for the data set's data files. However, as the discussion is going on, I think the question I am about to post is relevant here. Our resource disk is on a path that is filling up pretty fast. I was going through that directory the other day and found data files sitting there from a couple of years ago. However, it is quite intermittent: the same data file is not present for every day, only for what seem to me to be random dates.

The question is, why are these still sitting there for all these dates? All of these data sets are set to "Overwrite" on every run. So why are these data files sitting there from time immemorial?

Posted: Wed Sep 15, 2010 3:46 pm
by kwwilliams
Many reasons. The first one I would suspect is that you changed the path for your descriptor file, or someone deleted the descriptor file. When a data set is set to overwrite, the job reads the descriptor to delete the data from each of the node locations. If someone deleted your descriptor, it wouldn't have this information. If you moved the path for your descriptor, it would be like a new data set and wouldn't overwrite anything.

Compare the date on the descriptors to the date on the data files in your resource locations and clean up the ones where the dates don't match.
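The date comparison could be scripted roughly as follows. This is a sketch under assumptions, not a tested cleanup tool: it assumes data file names begin with the descriptor file name followed by a dot (the thread only says the data set name gets an extended suffix), the function name and the one-hour tolerance are arbitrary, and it only reports candidates without deleting anything.

```python
import os


def find_stale_data_files(descriptor_path, resource_dir, tolerance_seconds=3600):
    """List data files named after the descriptor that are much older than it.

    Assumption: data file names start with the descriptor file name plus ".".
    """
    base = os.path.basename(descriptor_path)
    descriptor_mtime = os.path.getmtime(descriptor_path)
    stale = []
    for name in sorted(os.listdir(resource_dir)):
        if not name.startswith(base + "."):
            continue  # belongs to some other data set
        full = os.path.join(resource_dir, name)
        if not os.path.isfile(full):
            continue
        # Older than the descriptor by more than the tolerance: a candidate
        # left behind by an earlier run that was never overwritten.
        if descriptor_mtime - os.path.getmtime(full) > tolerance_seconds:
            stale.append(full)
    return stale
```

Run it once per descriptor and per resource disk location, review the list by hand, and only then decide what to remove.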

Posted: Thu Sep 16, 2010 7:28 am
by vivekgadwal
kwwilliams wrote: Compare the date on the descriptors to the date on the data files in your resource locations and clean up the ones where the dates don't match.
Thanks for your response. I am not sure if somebody deleted that descriptor file, as I am relatively new at this place. Ever since I got here, though, none of that has happened. Anyway, is it okay if I do a simple "rm" on those unnecessary data files? Would that have any other repercussions?

Posted: Thu Sep 16, 2010 8:46 am
by kwwilliams
You would need to make sure that none of your dataset descriptors corresponds to the data files you are removing. You wouldn't want to delete data that someone is dependent upon. Most environments will have a handful of environment variables used to direct the location of the descriptor. If you're not sure, ask someone who has been there for a while. If they're as old as you say, then I would think that it would be safe to remove them.

Posted: Thu Sep 16, 2010 8:50 am
by vivekgadwal
Thanks. I will make sure about that. :D

Posted: Thu Sep 16, 2010 8:54 am
by kumar_s
kwwilliams wrote: You would need to try to ensure that the locations of your dataset descriptor does not match to the data file you are removing. You wouldn't want to delete data that someone is dependent upon. Most environments will have a handful of environmental variables used to direct the location of the descriptor. If you're not sure ask someone who has been there for a while. If they're as old as you say, then I would think that it would be safe to remove.
Righto!
Deleting the data files without the right information may lead to disaster.
Why not "orchadmin delete" the descriptor files that you are sure about?
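For descriptors you are sure about, the call could be wrapped so that it defaults to a dry run. A minimal sketch, assuming `orchadmin` is on the PATH and the parallel engine environment (configuration file and so on) is already set up in the shell; the function names and the example path are made up, and only the plain `orchadmin delete <descriptor>` form mentioned above is used.

```python
import shutil
import subprocess


def build_delete_command(descriptor_path):
    """Build the argument list for 'orchadmin delete <descriptor>'."""
    return ["orchadmin", "delete", descriptor_path]


def delete_dataset(descriptor_path, dry_run=True):
    """Return the command; run it only when dry_run is False."""
    cmd = build_delete_command(descriptor_path)
    if dry_run:
        return cmd  # nothing is executed, the caller can just inspect it
    if shutil.which("orchadmin") is None:
        raise RuntimeError("orchadmin not found on PATH")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Hypothetical descriptor path, printed rather than executed.
    print(" ".join(delete_dataset("/projects/demo/sales.ds")))
```

Keeping the dry run as the default means a typo in the path costs nothing until someone deliberately passes `dry_run=False`.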

Posted: Thu Sep 16, 2010 9:02 am
by vivekgadwal
kumar_s wrote: Why not "orchadmin delete" the descriptor file which you are sure about.
Normally we would do that. However, there are these files sitting from 2 years ago. I see the same file (of course, the extended name - some Hex things appended to the data set name - is different) again for later dates (including the latest date).

Posted: Thu Sep 16, 2010 9:06 am
by kwwilliams
vivekgadwal wrote:
kumar_s wrote: Why not "orchadmin delete" the descriptor file which you are sure about.
The question being posed is why he would have older dates on dataset resource files than exist on the descriptor files. His dataset overwrite function is not working and he was seeking an answer as to why. Orchadmin is not needed for this situation.