Before or after??

This forum supports all Data Quality issues relating to DataStage and other strategies.

Moderators: chulett, rschirm

Post Reply
dh_Madhu
Premium Member
Posts: 65
Joined: Sat Apr 23, 2005 3:19 am
Location: Stirling, Scotland

Before or after??

Post by dh_Madhu »

What are the advantages or consequences of addressing the data quality issues after loading the data into the warehouse?
What would be the ideal approach for a banking domain?
Thanks.
Regards,
Madhu Dharmapuri
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

After has no real advantages for the ETL process. But that should not preclude ongoing data quality audits of the DW to ensure that processes are not out of control.

It is much better to address data quality issues before or during ETL. That way you are much more likely to load "good" data and handle fewer rejects.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
roy
Participant
Posts: 2598
Joined: Wed Jul 30, 2003 2:05 am
Location: Israel

Post by roy »

Hi,
If your process checks whether the beneficiary name(s) changed and updates on that basis, then cleansing the names after the ETL will cause them to pop up as updates when in fact they may not be (the same will happen with every field that is cleansed and changed).
So I think doing it before the ETL is the better option.
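Roy's point can be illustrated with a toy sketch (field values and the `cleanse` rule are invented for illustration): if change detection compares a raw incoming value against the already-cleansed warehouse value, an unchanged record looks like an update.

```python
# Sketch: cleansing after load makes unchanged source rows look changed.
def cleanse(name):
    """A stand-in standardisation rule: trim and upper-case."""
    return name.strip().upper()

warehouse_name = cleanse("  john smith ")   # stored cleansed: "JOHN SMITH"
source_name = "  john smith "               # the same raw value arrives again

# Naive change detection compares raw source to cleansed target:
looks_changed = source_name != warehouse_name           # True - a false update

# Cleansing BEFORE the compare removes the false positive:
really_changed = cleanse(source_name) != warehouse_name  # False
print(looks_changed, really_changed)  # True False
```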

IHTH,
Roy R.
Time is money but when you don't have money time is all you can afford.

Search before posting:)

Join the DataStagers team effort at:
http://www.worldcommunitygrid.org
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

You can do data quality checking before, during and after ETL load. The most effective data quality checking up front is profiling, with ProfileStage being the obvious choice for a DataStage project and the upcoming releases making it even better.

In a banking domain where your amounts must tally the profiling tells you what data you are going to have trouble loading, what amount fields have negative or null values etc. A little profiling up front can save you a lot of pain later on.
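As a rough illustration of that kind of up-front check (plain Python rather than ProfileStage, with an invented column name), profiling an amount field for nulls and negatives might look like:

```python
# Hypothetical profiling sketch: count nulls and negative values in an
# "amount" column before attempting the warehouse load.
def profile_amounts(rows, key="amount"):
    """Return simple profile metrics for a numeric field."""
    stats = {"total": 0, "null": 0, "negative": 0}
    for row in rows:
        stats["total"] += 1
        value = row.get(key)
        if value is None:
            stats["null"] += 1
        elif value < 0:
            stats["negative"] += 1
    return stats

sample = [{"amount": 100.0}, {"amount": None}, {"amount": -25.0}]
print(profile_amounts(sample))  # {'total': 3, 'null': 1, 'negative': 1}
```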

In Dune the motto was "the spice must flow"; in BI loads the motto is "the data must flow", so you often let through data that is missing referential integrity and needs repairing (via augmentation), or you default empty fields, make up descriptions, and so on.

I place each row of data in four categories:
- Processed without issue.
- Processed but failing one or more business rules.
- Rejected failing one or more business rules.
- Rejected by the architecture.

In-transit data quality can count these four categories and turn them into metrics. The last category can lead to row leakage, where transformer rejects or database rejects are dropped off; you need good reject handling to make sure you trap these. Our business rules are usually written in a transformer and can be set to reject or continue on a rule-by-rule basis.
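Those first three categories could be counted in transit with something like the sketch below (rule names, the row layout, and the reject-or-continue flag are invented for illustration). The fourth category, rejects by the architecture, would be counted in the reject-handling path rather than here.

```python
# Sketch: classify each row against business rules and accumulate
# counts that can be published as data quality metrics.
from collections import Counter

def classify(row, rules):
    """rules: list of (name, predicate, reject_on_fail) tuples."""
    failed = [name for name, pred, _ in rules if not pred(row)]
    rejecting = [name for name, pred, rej in rules if rej and not pred(row)]
    if not failed:
        return "processed_clean"
    if rejecting:
        return "rejected_business_rule"
    return "processed_with_issues"

rules = [
    ("amount_not_null", lambda r: r.get("amount") is not None, True),
    ("has_description", lambda r: bool(r.get("description")), False),
]

rows = [
    {"amount": 10, "description": "fee"},   # clean
    {"amount": 10, "description": ""},      # fails a non-rejecting rule
    {"amount": None, "description": "fee"}, # fails a rejecting rule
]
metrics = Counter(classify(r, rules) for r in rows)
print(metrics)
```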

After-load data quality can measure what percentage of your data has known data quality issues such as augmentation and defaulted fields. As an ongoing exercise, a good data analyst can also proactively find issues that were missed during ETL testing.

Another form of after-load data quality is the quality assurance audit on row counts. Kim Duke has done some work in this area, and his qa table approach is included in the etlstats download.
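A minimal row-count reconciliation in the spirit of that qa-table approach (function and key names invented) might be: rows read should equal rows loaded plus rows rejected, and any remainder is leakage worth investigating.

```python
# Sketch: compare rows read from source against rows loaded plus
# rows rejected; any difference indicates row leakage.
def audit_row_counts(rows_in, rows_loaded, rows_rejected):
    leaked = rows_in - (rows_loaded + rows_rejected)
    return {"in": rows_in, "loaded": rows_loaded,
            "rejected": rows_rejected, "leaked": leaked}

result = audit_row_counts(1000, 990, 8)
print(result)  # 'leaked': 2 -> two rows unaccounted for
```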
kduke
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

You can pay me now or pay me later but quality matters.
Mamu Kim
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

In that particular situation, quantity matters too! :lol:
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne
Contact:

Post by vmcburney »

How many DWs can tell you the quality of the quantity?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Every well-implemented one! 8)
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I really should add to Vincent's more serious post that a tool such as QualityStage can be an immense asset in standardising the form of data, which is part of cleansing. You can develop separately and then integrate into DataStage job streams, or not, depending on how you want to process. Madhu, you will remember that some of this was done at your gig in Australia late last year and early this year.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
dh_Madhu
Premium Member
Posts: 65
Joined: Sat Apr 23, 2005 3:19 am
Location: Stirling, Scotland

Post by dh_Madhu »

Thanks to all for the inputs.
From my side, I always believe that prevention is better than cure, which is what most of you have suggested. I also think that junk data should reside in a system for the least amount of time: the longer it stays, the costlier it becomes to handle, and Ray, I do remember those DQ issues that were hogging the show for exactly these reasons. As part of the cure, I believe a post-load DQ operation of much lower intensity also has to be in place, at least until things settle down.
I could not describe what post-load cleaning involves here, as I am not clear how, or why in the first place, anyone would wish to handle the whole lot of quality issues after the load.
However, my views have also been conveyed, and the discussion has generated quite an interest in developing a new insight into how to go about Data Quality issues.
Regards,
Madhu Dharmapuri
Post Reply