Sequential file name capture in a job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

trammohan
Participant
Posts: 47
Joined: Thu Nov 13, 2003 12:47 pm

Post by trammohan »

Hi,

I am using a Sequential File stage to read data from *XYZ*.txt files. When I select the file name column option, the file name in the output comes through as *XYZ*.txt.
My question is: how do I get the actual file name while reading the data from the sequential files?

Thanks in advance....
trm
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
In your output link, change the Read Method to File Pattern;
this will let you use wildcards for the file name.

IHTH,
Roy R.
trammohan
Participant
Posts: 47
Joined: Thu Nov 13, 2003 12:47 pm

Post by trammohan »

Hi Roy,

Thanks. I want to include the file name in the column list while reading the data. I want the actual file name, not the file name with the wildcard characters.

Thanks
trm
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

As far as I am aware this is not possible, but would be happy to be proven wrong. Each row in the stream of rows that is being processed may have come from any of the files. Reading from the files that match a pattern is like using cat to make a single stream in a filter, except that you can get some parallelism happening.
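
To make the cat analogy concrete, here is a minimal sketch in Python rather than DataStage. It is only an illustration: the function names and the sample file contents are made up for the example, and it says nothing about how the Sequential File stage is implemented. The first reader concatenates every matching file into one stream, so the origin of each row is lost; the second tags each row with its file name as it is read.

```python
import glob
import os
import tempfile


def read_like_cat(pattern):
    """Yield every line from every matching file as one stream (origin is lost)."""
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            for line in handle:
                yield line.rstrip("\n")


def read_with_origin(pattern):
    """Yield (file name, line) pairs so each row keeps its source file."""
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            for line in handle:
                yield os.path.basename(path), line.rstrip("\n")


def demo():
    """Build two hypothetical sample files and read them both ways."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, rows in (("trm1.txt", ["a", "b"]), ("trm2.txt", ["c"])):
            with open(os.path.join(tmp, name), "w") as handle:
                handle.write("\n".join(rows) + "\n")
        pattern = os.path.join(tmp, "trm*.txt")
        return list(read_like_cat(pattern)), list(read_with_origin(pattern))


if __name__ == "__main__":
    plain, tagged = demo()
    print(plain)
    print(tagged)
```

Once the rows are in the single stream there is nothing left to tell them apart, which is why the file name has to be attached at read time.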
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India

Post by thebird »

Hi,

There is an environment variable, APT_IMPORT_PATTERN_USES_FILESET, which, when set to TRUE, returns the exact name of the file from which each record is read.

There was a post regarding this in the Developer net forum, which was answered by Danny Owen.

I have used this in one scenario, and it does work fine with the File Pattern option. But there was one issue: if no files match the pattern, the job aborts.

Hope this helps.

Regards,

The Bird.
trammohan
Participant
Posts: 47
Joined: Thu Nov 13, 2003 12:47 pm

Post by trammohan »

Hi The Bird,

When I set APT_IMPORT_PATTERN_USES_FILESET to TRUE, it prints the output file name, not the input file name.

Is there any other parameter to set for the input file name?

trm
thebird
Participant
Posts: 254
Joined: Thu Jan 06, 2005 12:11 am
Location: India

Post by thebird »

Hi trm,

There are no other parameters or variables that you have to set for this. If this variable is set and:

1. the File Pattern option is set in your source Sequential File stage to read the multiple source files, and

2. the File Name Column option is chosen in the source Sequential File stage and the additional column (for the source file name) is defined in the Columns tab,

then you should be able to see the source file name from which each record is read when you do a View Data on the source stage. You should also be able to carry this column forward to the downstream stages.


Hope this solves your problem.

Regards,

The Bird.
trammohan
Participant
Posts: 47
Joined: Thu Nov 13, 2003 12:47 pm

Post by trammohan »

Hi Bird,

I have 2 input files (trm1.txt and trm2.txt). It picks up the trm2.txt file name and puts it in the file_name column even for trm1.txt records.
trm
anton
Premium Member
Posts: 20
Joined: Wed Jul 19, 2006 9:32 am

Post by anton »

this is precisely my experience as well: APT_IMPORT_PATTERN_USES_FILESET causes each node in the APT configuration to pick up one file name and use it for all the files that node processes.

so in my case i have two nodes and 200 files. with APT_IMPORT_PATTERN_USES_FILESET set to true for the job, the sequential file stage populates the file name column, but there are only two unique values in it instead of 200.

file1.dat,data1,data2
file2.dat,data3,data4
file1.dat,data5,data6 <-- this actually came from file3.dat
file2.dat,data7,data8 <-- this actually came from file4.dat
...

alternatively, if i naively assume that things work in a "common sense" way and set no variables, specify the file name column in the sequential file stage, set the read method to file pattern, and feed it the wildcard matching my files, then every single row gets my wildcard, not the actual expanded file name.

*.dat,data1,data2 <-- this actually came from file1.dat
*.dat,data3,data4 <-- this actually came from file2.dat
*.dat,data5,data6 <-- this actually came from file3.dat
*.dat,data7,data8 <-- this actually came from file4.dat

therefore the file name column option in the sequential file stage is pretty much useless and misleading, and so is the APT_IMPORT_PATTERN_USES_FILESET variable.

so the question remains: is there a simple (config-time) option to preserve the file name for files read by pattern in the sequential file stage?

thank you.

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The Sequential File stage can generate two additional columns, one containing the file name of the file currently being read, the other containing the line number within that file of the record currently being read.

But, as noted, it may be wise to set APT_IMPORT_PATTERN_USES_FILESET to False. Or at the very least to experiment. That reported behaviour suggests a small bug.
anton
Premium Member
Posts: 20
Joined: Wed Jul 19, 2006 9:32 am

Post by anton »

ray.wurlod wrote:The Sequential File stage can generate two additional columns, one containing the file name of the file currently being read, the other containing the line number within that file of the record currently being read.

But, as noted, it may be wise to set APT_IMPORT_PATTERN_USES_FILESET to False. Or at the very least to experiment. That reported behaviour suggests a small bug.

thank you for your response, but i am afraid you did not read my post correctly or the post i was replying to.

let me try again.

given in sequential file stage:
- "file name column" is set under "options"
- file pattern is set to /dir/file*
- read method is set to "file pattern"

APT_IMPORT_PATTERN_USES_FILESET is not present or explicitly set to false:
- i get /dir/file* as the value of the file name column for all records in every file

APT_IMPORT_PATTERN_USES_FILESET is set to true
- i get just one unique file name as the value of the file name column for all records in every file. under a 2-node configuration i get two unique file names, and so on; with 100 different files, only two file names are ever used.

once again, both "file name column" and APT_IMPORT_PATTERN_USES_FILESET do not work in this situation.
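
one way to confirm this symptom is to compare the distinct values in the file name column with the files that actually match the pattern. below is a rough Python sketch, not DataStage code. it assumes the job output has been landed as comma-separated text with the file name as the first field; the function names and the sample rows are hypothetical.

```python
import csv
import io


def distinct_file_names(rows_text, column=0):
    """Return the set of values found in the file name column."""
    reader = csv.reader(io.StringIO(rows_text))
    return {row[column] for row in reader if row}


def missing_file_names(rows_text, expected_names, column=0):
    """Return the expected file names that never appear in the column."""
    return sorted(set(expected_names) - distinct_file_names(rows_text, column))


# Hypothetical landed output in which only two names ever appear.
SAMPLE = (
    "file1.dat,data1,data2\n"
    "file2.dat,data3,data4\n"
    "file1.dat,data5,data6\n"
    "file2.dat,data7,data8\n"
)
# Hypothetical list of the files that matched the pattern.
EXPECTED = ["file1.dat", "file2.dat", "file3.dat", "file4.dat"]

if __name__ == "__main__":
    print(missing_file_names(SAMPLE, EXPECTED))
```

any name reported as missing is a file whose rows were labelled with some other file's name (or with the wildcard itself).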
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The "file name column" property must refer to a column (type VarChar probably) that is defined on the output link. Is this the case with your design?

The same is true for the file row number property, if you use that.
anton
Premium Member
Posts: 20
Joined: Wed Jul 19, 2006 9:32 am

Post by anton »

ray.wurlod wrote:The "file name column" property must refer to a column (type VarChar probably) that is defined on the output link. Is this the case with your design?

The same is true for the file row number property, if you use that.

yes, and, as i mentioned, it gets populated - just with the wrong data.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

If that's the case (I have not had a chance to check yet), you need to report the bug through your support provider. They will also demand a reproducible case, so have that ready so they can't stall you.
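
As a starting point for such a reproducible case, the sketch below (Python, outside DataStage) writes a handful of small test files in which every row carries the name of its own file as the first field. Running the job over them and comparing that field with the generated file name column shows any mislabelled rows directly. The file naming scheme, the row layout and both function names are assumptions made for illustration; the comparison also assumes the column may hold a full path, so only the base name is compared.

```python
import os
import tempfile


def write_test_files(directory, count=4, rows_per_file=3):
    """Create file1.dat..fileN.dat; every row starts with its own file name."""
    paths = []
    for index in range(1, count + 1):
        name = f"file{index}.dat"
        path = os.path.join(directory, name)
        with open(path, "w") as handle:
            for row in range(1, rows_per_file + 1):
                handle.write(f"{name},row{row}\n")
        paths.append(path)
    return paths


def mismatched_rows(rows):
    """Return rows whose reported file name differs from the embedded one.

    Each row is a tuple (reported_name, embedded_name, payload), where
    reported_name is the value of the generated file name column.
    """
    return [row for row in rows if os.path.basename(row[0]) != row[1]]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_test_files(tmp):
            print(path)
```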
anton
Premium Member
Posts: 20
Joined: Wed Jul 19, 2006 9:32 am

Post by anton »

according to IBM this is fixed in patch 96576 for DS EE 7.5.1A; we have yet to try it in our environment.