Join heterogenous data

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

Post Reply
miamichillin99
Participant
Posts: 24
Joined: Wed Dec 01, 2004 11:12 am

Join heterogenous data

Post by miamichillin99 »

Let me start by saying thanks to the MODS and Admin of this great site. I'm new to DS and this site is going to help me a lot. Great Job!!!

No onto my Q:
I'm trying to join two data sources and retrieve the differences. for example:

Code: Select all

DataSource1
EmployeeID            Name
1                            xx
2                            yy

DataSource2
EmplyeeID             Name
1                           xx
2                           yy
3                           zz
The join would return only the record for EmployeeId #3. Also to add fuel to the fire DataSource1 is a sequential file and DataSource2 is an ODBC stage. So far I've created a sequential file going into a Transformer and the ODBC going to a hashed file and into the same Transformer. I'm trying to make the Transformer handle the join logic but it's not working.

If anyone has any ideas I would really appreciate it.
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

So you set the output constraint to the link variable:

Code: Select all

DataSource2.NotFound
:?:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Welcome aboard! :D

Since this is a server job, you really should have posted it on the server job forum.

The most obvious approach would be to load the contents of the text file into a hashed file with two key columns and no non-key columns.

The job to isolate the differences then extracts rows from your database (ODBC) table, performs a lookup (in a Transformer stage) against that hashed file, and sends the original row out of the Transformer only if the lookup succeeds (that is, if the reference input link's NOTFOUND variable is true). You handle this using a constraint expression on the output link from the Transformer stage.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
miamichillin99
Participant
Posts: 24
Joined: Wed Dec 01, 2004 11:12 am

Post by miamichillin99 »

Thanks chuck and ray. I'm really learning a lot here. I had the lookup going to the ODBC when it was suppose to go to the sequential file. I also added a hashed file to be created from the sequential file for faster lookups. I then added the correct NOTFOUND logic to the Transformer's contraints and now by looking at the rows moving from stage to stage it looks to be working better. But it's not perfect. I'm getting an error going from my FTP plugin (thats the sequential file) to my hashed file:

Code: Select all

IncrementalOperations..HAOldOperations.IDENT2: Unbalanced or unescaped quote character
At row 12457, link "FTPOut", while processing column "DESCRIPTION"
Unbalanced or unescaped quote character
I've looked at the sequential file at row 12457 and found nothing strange. Then when I look at the data in the "View Data..." of the Hashed File stage Inputs tab I see that the data is not in the same order as the sequential file before the FTP plug in. I have the file name in the has stage set to the same for the input and output. Is this correct or am I suppose to point the input to the incoming FTP filename?

So my question is how do I know what record is causing the problem. Is it record 12457 of the sequential file or is it record 12457 of the re-ordered hash file?

Sorry about posting in the wrong place. I'll get it right next time. Mods feel free to move this post to the correct forum.
[/code]
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

First, hash file row order does not resemble the order of your input data. It can be ordered on output, but for a lookup it does not matter.

I guess that one of the descriptions has an odd number of quote characters in it.

In your FTP stage, what are the settings on the Output / Properties tab? Fixed width? Quote character? Column delimiter? Escape character?

Does you actual input file have quoted columns. Perhaps you should try setting the Quote character to 000.
miamichillin99
Participant
Posts: 24
Joined: Wed Dec 01, 2004 11:12 am

Post by miamichillin99 »

Hi chuck.

My setting are all default for the one's you asked:
Fixed width = No
Quote char = "
delimiter = ,
escape = \ (I also tried leaving this blank)

My sequential file does have quoted columns and is comma delimited.
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

I suggest you do a test.

Code: Select all

FTP ---> SEQ
Tell the ftp stage that you have one unknown or varchar column, and set your quote character and column delimiter character to 000.

Let's see what actually appears in the sequential file.

You could also telnet to the location of the source file and use vi (or head, tail, and od) to look at the questionable row. I still think it has an embedded quote in the description text.
miamichillin99
Participant
Posts: 24
Joined: Wed Dec 01, 2004 11:12 am

Post by miamichillin99 »

Chuck,

using the debug with breakpoints and watches set up I was able to find the bad record. There was a quote (") in the middle of a column in the sequential file. Actually I found a few of them which are there because the file is comma delimited but has quotes around each column. This file is created using the Microsoft ODBC text driver and it places them by default around each column.

Thanks for all your help.
:D
miamichillin99
Participant
Posts: 24
Joined: Wed Dec 01, 2004 11:12 am

quotes within a field

Post by miamichillin99 »

Has anyone encountered the handling of quotes incoming from an ODBC source stage. I have an ODBC connection that will create a sequential file in the end. But some CHAR fields have quotes inside of them and it's creating a file with quotes which then cannot be used as an input to a hashed file like below:

Code: Select all

EmployeeID, Name
"12345","Dave "Matthews""
As you'll notice Matthews is in quotes but it's part of the entire second column, Name. Any way around Ascential thinking that the quotes within the column Name don't mean the field is ending. I thought Ascential would know that in a comma delimited file the column doesn't end until you hit the comma.
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

Make sure you use the "Get SQL Info" button on the ODBC stage so that it properly registers the information about the ODBC driver in use. You'll see the Quote character text box right next to it probably change values now that it knows more about the ODBC driver.
Kenneth Bland

Rank: Sempai
Belt: First degree black
Fight name: Captain Hook
Signature knockout: right upper cut followed by left hook
Signature submission: Crucifix combined with leg triangle
miamichillin99
Participant
Posts: 24
Joined: Wed Dec 01, 2004 11:12 am

Post by miamichillin99 »

kc,

Thanks for your input. Unfortunately I had already done that. I did it again anyways and it left the quote character as a double quote("). I still have the sequential file as I posted previously. It still has quotes within the fields.
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

I think you are back to trying to tell DataStage to ignore quotes when reading the file, and then using

Code: Select all

Trim(YourColumn,'"',"A")
to strip out the quotes. You still could run into problems with commas within strings. By the way, that second argument in the Trim() is single quote, double quote, single quote.
miamichillin99
Participant
Posts: 24
Joined: Wed Dec 01, 2004 11:12 am

Post by miamichillin99 »

Chuck,

Good to hear from you again. Actually I did try the QUOTE = 000 but just like you said now my problem is comma's inside the fields. So it's a constant problem. Either with commas or double quotes. I didn't try the Trim function though, because I would like to keep the fields as they come from the ODBC. So if they have a double quote or a comma I want to keep those characters in the resulting sequential file. How can I accomplish this?
:)
chucksmith
Premium Member
Premium Member
Posts: 385
Joined: Wed Jun 16, 2004 12:43 pm
Location: Virginia, USA
Contact:

Post by chucksmith »

Let me see if I understand the job. Does it look like this:

Code: Select all

ODBC ---> XFR ---> SEQ 2 ---> XFR ---> HASH
                               |
                               V
                   SEQ 1 ---> XFR ---> SEQ 3
Where ODBC is DataSource2 and SEQ 1 is DataSource1?

If so, could you try this:

Code: Select all

ODBC ---> XFR ---> HASH
                    |
                    V
        SEQ 1 ---> XFR ---> SEQ 3
If you eliminate SEQ 2, the problem is gone. Otherwise, try changing the quote character in SEQ 2 to something other than double quote, perhaps a pipe.

If my picture is wrong, please revise it.
Post Reply