Match free format text


sdf


Post by sdf »

Hi there,

Is there a way (using Integrity) to extract and match names and addresses from a free-format text file, e.g. *.txt files or a file whose lines are of type char(1000)?



Lady S.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Basically no.
Because of its mainframe heritage, INTEGRITY (at least at current version) can work only with fixed-width format files.
A good plan is to use DataStage to convert the data to fixed-width format, then call INTEGRITY through the INTEGRITY stage type, passing it the stream of data and receiving the result. By this means, you do not have to create any on-disk files (unless you explicitly choose to).
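
As a rough illustration of the conversion step (outside DataStage itself), the sketch below pads or truncates each free-format line to a fixed width in Python. The 1000-character width comes from the char(1000) in the question; the function name, the sample lines and the truncation behaviour are assumptions for illustration, not anything DataStage or INTEGRITY prescribes.

```python
def to_fixed_width(lines, width=1000):
    """Yield each input line as a record of exactly `width` characters."""
    for line in lines:
        text = line.rstrip("\r\n")
        # Pad short lines with spaces; truncate long ones
        # (assumption: losing the overflow is acceptable).
        yield text[:width].ljust(width)


# Hypothetical sample input: one short line and one over-long line.
records = list(to_fixed_width(["J. Smith, 1 Main St\n", "x" * 1200], width=1000))
```

Every record that comes out has the same length, which is the property a fixed-width consumer needs.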
AmosR
Participant
Posts: 13
Joined: Thu Jan 02, 2003 7:14 am

Post by AmosR »

Hi Guys,

If I have a description field that can contain free text such as product names, packaging descriptions, and persons' names and addresses (in other words, anything is possible), can I use the Integrity stage to extract some sense out of it?
(assuming it's all in the same known field)

Has anyone tried it? How good is it?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Yes, but you still have to set up the rules within INTEGRITY, including possible redefinitions (overlays) of data format.
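
To make the idea of a redefinition (overlay) concrete: the same fixed-width record is read through more than one field layout, and the rules decide which reading applies. The layouts, field names and offsets below are hypothetical and written in Python purely to illustrate the concept; this is not INTEGRITY syntax.

```python
# Two hypothetical layouts over the same 40-character record: (field, start, end).
PERSON_LAYOUT = [("name", 0, 20), ("address", 20, 40)]
PRODUCT_LAYOUT = [("product", 0, 30), ("pack_size", 30, 40)]


def apply_layout(record, layout):
    """Slice one fixed-width record according to a layout and strip padding."""
    return {field: record[start:end].strip() for field, start, end in layout}


# Hypothetical record: a name and an address, each padded to 20 characters.
record = "JOHN SMITH".ljust(20) + "12 HIGH ST SYDNEY".ljust(20)
as_person = apply_layout(record, PERSON_LAYOUT)
as_product = apply_layout(record, PRODUCT_LAYOUT)
```

Only one of the two readings makes sense for this record, which is why the rules that choose between overlays still have to be written by hand.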
timwalsh
Participant
Posts: 29
Joined: Tue Mar 04, 2003 7:48 am

Post by timwalsh »

Integrity works as well as, if not better than, its competitors at free-form data investigation, standardization and matching.

However, be aware that you must write custom rules no matter which tool you use. Depending on the complexity of your data, extracting value from it can sometimes be difficult.

Be prepared to investigate your source data before you can identify trends and start looking at patterns. You also need a master data list so that you can match your data after you standardize it.
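
One common way to start that kind of investigation is to reduce each value to a character pattern and count how often each pattern occurs. The sketch below does this in Python; the pattern alphabet (A for letters, 9 for digits, everything else kept as-is) and the sample values are assumptions for illustration, not INTEGRITY's own investigation output.

```python
from collections import Counter


def char_pattern(value):
    """Map letters to 'A' and digits to '9'; keep other characters unchanged."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


# Hypothetical sample values from a free-text field.
samples = ["12 HIGH ST", "47 PARK RD", "PO BOX 123", "J SMITH"]
pattern_counts = Counter(char_pattern(s) for s in samples)
```

Frequent patterns are a reasonable place to write the first rules; rare ones are candidates for manual review.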

Have you contacted someone from Ascential or a Data Cleansing expert to evaluate your situation and offer a solution? I would suggest a combination of Integrity and DataStage, if you have the luxury of having both tools!

Please let us know if you need more info!

Tim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The "two tools" solution is excellent. DataStage (6.0 or later) can reformat the data into the fixed-width format required by INTEGRITY (within these fields the data can be free format, but rules must have been created for making sense of them). From DataStage you can invoke INTEGRITY through a stage, so the data do not touch down on disk on the way in. The results are returned as the output of that stage, so they do not touch down on disk on the way out either. Throughput is excellent. Since the Parallel Extender architecture underpins both products, the advantages of this technology can be obtained too, allowing efficient processing of huge volumes of data.