What is your company doing about data quality?

This forum is in support of all issues about Data Quality regarding DataStage and other strategies.

Moderators: chulett, rschirm

gleman
Participant
Posts: 8
Joined: Mon Oct 28, 2002 7:33 pm

What is your company doing about data quality?

Post by gleman »

Back in June, Metagenix conducted a survey of database professionals and asked them how they were treating the issue of data quality. We were quite surprised to note that very few businesses had a formal data quality program.

So what steps are you taking on data quality? Does your company have a formal program of process improvement for data quality? Do you track data quality? Are measurements taken? Do you cleanse after the fact, or do you try to fix root causes? How are people attacking this problem?
datastage
Participant
Posts: 229
Joined: Wed Oct 23, 2002 10:10 am
Location: Omaha

Re: What is your company doing about data quality?

Post by datastage »

I'm not surprised at all that the survey yielded a small number of businesses with formal data quality programs. And I think the most common scenario is cleansing after the fact, or cleansing only in one island and not fixing the root causes. If an ODS has dirty data and it sources a data mart, I feel what typically happens is that the ETL process will either clean or at least find a way to handle issues caused by bad data, leaving the data in the ODS uncleansed and the root cause completely unaddressed.

The question I have is: will the song remain the same, or will there be a trend in the near future to actually resolve these data quality issues?
gleman wrote:Back in June, Metagenix conducted a survey of database professionals and asked them how they were treating the issue of data quality. We were quite surprised to note that very few businesses had a formal data quality program.

So what steps are you taking on data quality? Does your company have a formal program of process improvement for data quality? Do you track data quality? Are measurements taken? Do you cleanse after the fact, or do you try to fix root causes? How are people attacking this problem?
srinivasb
Participant
Posts: 42
Joined: Mon Mar 10, 2003 10:52 pm
Location: UK
Contact:

Data Quality is an issue

Post by srinivasb »

Hi,

I agree with you that corporate awareness of data quality is quite low. While acronyms like "ETL" and "ODS" ring a bell with most CXOs, data quality does not figure very high on their radar. The reason for this is the assumption that "once our DW is in place, data quality will be very high."

It is rarely recognized that source data can be unclean irrespective of the system. Most companies don't realize that around 60% of DW effort goes into data quality, cleansing, and transformation.


Maybe ETL tools like DataStage should build quality-manager functionality into the ETL itself.
This would enable a user to get metrics on data quality once the ETL process has executed.
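
Just to illustrate the sort of thing I mean, here is a rough sketch of post-run metric collection (the file name, delimiter and columns are made up; this is not an existing DataStage or Quality Manager feature):

# Rough sketch only: count rows and null/blank values per column in
# whatever delimited file the ETL job produced. "load_output.csv" is a
# hypothetical example file.
import csv
from collections import defaultdict

def column_metrics(path, delimiter=","):
    row_count = 0
    null_counts = defaultdict(int)
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter=delimiter):
            row_count += 1
            for col, value in row.items():
                if value is None or value.strip() == "":
                    null_counts[col] += 1
    return row_count, null_counts

if __name__ == "__main__":
    rows, nulls = column_metrics("load_output.csv")
    print("rows processed:", rows)
    for col, n in nulls.items():
        print("%s: %d nulls (%.1f%%)" % (col, n, 100.0 * n / rows))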

Could you please share your best-practices document for data quality at the enterprise level?

Regards
Srinivas.B
gleman
Participant
Posts: 8
Joined: Mon Oct 28, 2002 7:33 pm

ETL Tools with DQ Built In

Post by gleman »

srinivasb wrote: Data quality does not figure very high on their radar. The reason for this is the assumption that "once our DW is in place, data quality will be very high." It is rarely recognized that source data can be unclean irrespective of the system. Most companies don't realize that around 60% of DW effort goes into data quality, cleansing, and transformation.
Actually, it's the DW that often points out the DQ problems in the source systems. People run their legacy systems for years thinking everything's ok, but the data is only exposed within the silo of that particular application. It's not until that data gains visibility outside of the legacy system that the problems become apparent. This is usually followed by a fire drill of "let's clean up this mess" -- expressed as a big cleansing project.
srinivasb wrote: Maybe ETL tools like DataStage should build quality-manager functionality into the ETL itself. This would enable a user to get metrics on data quality once the ETL process has executed.
I'm doubtful that ETL vendors will step up to this kind of functionality. Most of them seem focused on providing bigger and better plumbing, but don't pay much attention to what's going through the pipes. That's in spite of what you point out as around 2/3 of the total cost being devoted to cleaning up the mess. The best you can do for metrics in an ETL tool is to look at log reports and examine the number of reject rows, time to load, etc.

As for QM and DataStage, while they're sold in the same "package", there isn't any integration between them that I'm aware of. If you want to integrate data quality management with ETL, then you need to find a data quality management package that supports a web services approach and let the ETL tool call the DQ package. For example, MetaPure from Metagenix ( http://www.metagenix.com ) does what you've suggested. The ETL tool can execute a MetaPure package at the beginning, during, and/or at the end of a run to collect metrics and validate data.
srinivasb
Participant
Posts: 42
Joined: Mon Mar 10, 2003 10:52 pm
Location: UK
Contact:

Re: Ken's reply

Post by srinivasb »

Hi Ken,

Thanks for the knowledge sharing. On one project we executed, we had thought of a rule-based engine: something like a database table that stores all business rules, with versions and other details.

At run time, the ETL process/tool checks the rules out, so changes to business rules can be made without touching the ETL layer.
It is highly theoretical, but have you tried any such approach? We did not have the time to try it out.
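
To make the idea a bit more concrete, a minimal sketch of what we had in mind is below (table name, columns and the rule format are all invented; we never built this):

# Sketch of a versioned business-rule table read by the ETL process at
# run time. Everything here (names, rule format) is invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE business_rule (
                    rule_name  TEXT,
                    version    INTEGER,
                    is_active  INTEGER,   -- 1 = current version
                    expression TEXT       -- a Python expression over a row dict
                )""")
conn.execute("INSERT INTO business_rule VALUES (?, ?, ?, ?)",
             ("customer_age_range", 2, 1, "0 < int(row['age']) < 120"))
conn.commit()

def load_active_rules(connection):
    # The ETL job calls this at start-up, so rules can change without
    # touching the ETL layer itself.
    cur = connection.execute(
        "SELECT rule_name, expression FROM business_rule WHERE is_active = 1")
    return dict(cur.fetchall())

def apply_rules(row, rules):
    # Return the names of the rules this row violates.
    return [name for name, expr in rules.items()
            if not eval(expr, {}, {"row": row})]

rules = load_active_rules(conn)
print(apply_rules({"age": "34"}, rules))    # []
print(apply_rules({"age": "250"}, rules))   # ['customer_age_range']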

Regards
Srinivas.B
srinivasb
Participant
Posts: 42
Joined: Mon Mar 10, 2003 10:52 pm
Location: UK
Contact:

Re: Ken's reply

Post by srinivasb »

Hi,

My actual question was: how do we integrate a rule-based engine with our DataStage ETL tool? Have you done that?

Regards
Srinivas.B
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Integrating a rule based engine

Post by kcbland »

I initially designed two approaches to prove a point. When you introduce anything to a single-threaded process, you are limited to single-threaded execution. So making an external call to an API to interrogate the cleanliness of each row means you have the interface overhead on top of the single-threaded execution. The other approach is to prepare a file and hand it all to the API to handle the whole file at once. This allows the cleansing process to scale to the right level to handle the volume, either through partitioned parallelism or multithreaded execution.

This is exactly the same situation when using any 3rd-party sort/aggregate tool inside DataStage. For example, CoSort's plugin can be used to sort and aggregate data on the incoming data stream, but for optimal performance it's actually faster to spool to a file and then call the external CoSort engine to handle it according to its own nature.

I designed my rule-based "engine" to use a source file. A parameter and configuration file contains weighting information so that the engine can size the incoming file (number of rows, number of columns) and determine how many processes to invoke to handle the source data. It then physically splits the source data file into N smaller files, gives each file to a process, and recombines the result files from each process to preserve the original row order. This allows a small file to use just one process, while larger files will be handled by multiple processes.
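
In rough pseudo-Python, the driver looks something like this (the real engine isn't Python, and the sizing logic here is simplified):

# Simplified sketch of the split / process-in-parallel / recombine driver.
from multiprocessing import Pool

def cleanse_chunk(lines):
    # Stand-in for handing one physical chunk to a cleansing process.
    return [line.upper() for line in lines]

def run(path, target_rows_per_process=100000, max_processes=8):
    with open(path) as fh:
        lines = fh.readlines()

    # Size the incoming file and decide how many processes to invoke
    # (the real engine weights this with a parameter/configuration file).
    n_procs = min(max_processes, max(1, len(lines) // target_rows_per_process))
    chunk_size = max(1, -(-len(lines) // n_procs))   # ceiling division
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    with Pool(n_procs) as pool:
        results = pool.map(cleanse_chunk, chunks)    # one chunk per process

    # Recombine in chunk order so the original row order is preserved.
    with open(path + ".cleansed", "w") as out:
        for chunk in results:
            out.writelines(chunk)

if __name__ == "__main__":
    run("source.dat")   # hypothetical landed source file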

Hope this helps,
-Ken
gleman
Participant
Posts: 8
Joined: Mon Oct 28, 2002 7:33 pm

Integrating a Rule Based Engine with DS

Post by gleman »

It's interesting that you took that approach. We took a different approach with MetaPure ( http://www.metagenix.com/products ) and allowed the ETL tool to call us at the transaction level while maintaining the advantages of multi-threaded execution. The key element in this strategy was to make the business rules engine visible to the rest of the world via a web service interface. Here's a little background:

MetaPure is a DQ-based rules engine. The user can build a library of rules for managing DQ. They can build extremely sophisticated rules that span multiple data sources and implement complicated logic. It works with parameters to determine how to map actual data into the rules. It has a very nice interface for generating common types of data quality checks. And it's got a completely browser-based web interface, so there are none of those pesky system management issues or DLL problems to go through. The desktop installation instructions are to point your browser at a web address.

MetaPure has a SOAP based web service interface that can validate an individual transaction. When we designed it we were meeting a requirement of having applications call it for data validation/cleansing, but an ETL job is really just another application. Thus, you get the multiple threads of execution by having the ETL job make the SOAP request during the transformation pipeline. MetaPure is multi-threaded (it's a web service), so you get the full benefit of the pipeline. We also have a "rule checker" that will process an entire data source in batch mode, and it's got the intelligence to break the input stream into partitions and process in parallel, but the advantage of being able to call these rules from within the ETL environment is pretty sexy.
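
Conceptually, the call from the transformation pipeline looks something like the sketch below. To be clear, the URL, payload and response fields are invented for illustration; this is not MetaPure's actual interface, and a real call would be a SOAP envelope rather than plain JSON.

# Generic illustration of an ETL transform calling a rules web service
# for each row. Endpoint, payload and response shape are hypothetical.
import json
import urllib.request

VALIDATE_URL = "http://rules.example.com/validate"   # hypothetical endpoint

def validate_row(row):
    # POST one transaction to the rules engine and return its verdict.
    body = json.dumps({"ruleset": "customer_load", "row": row}).encode("utf-8")
    req = urllib.request.Request(VALIDATE_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())   # e.g. {"valid": False, "failed_rules": [...]}

# Inside the pipeline, each thread or partition simply does:
#   verdict = validate_row(current_row)
#   if not verdict["valid"]:
#       route current_row to the reject link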

MetaPure also keeps quality metrics of what's going on, so you get a complete historical picture of how you're doing. You also get a nice drill-down capability to look at the snapshots of the data as it moved through, as well as the ability to process events. For instance, a user might notice that in early March the error rate spiked, and then realize that this was the same time that a new application was brought on line far upstream.

I've always been frustrated by the fact that you can write a set of business rules in one application, but then you can't do anything about them in another. Why should a QM user have to rewrite all of their rules (in a different language even) just to implement the checks on the ETL side? Debugging is hard enough without having to go through it twice. MetaPure solves that one nicely -- write the rule once, and then access the rules engine from multiple points.

True, you could write all of these rules in DataStage (a very tall order), and you could even build the huge infrastructure needed to implement the quality management aspects. But in the end you'd only have an ETL process that can manage quality. With the approach of separating the business rule processing into a separate web service, you get access to those hard-earned business rules across the enterprise. Wouldn't it be nice if the operational systems had access to the same business rules and clean data started arriving in your staging area?

If you're currently doing a DataStage project, consider how easy your life would be if you could discover business rules during data profiling and automatically implement checks for those rules as part of your DataStage jobs. There'd be some integration work involved (MetaRecon doesn't have a concept of a "business rule", just profiles), but that's the basic value proposition of separating business rule processing out into a web service based approach.

There's a reason there's so much excitement out there about web services. Frankly, I'm surprised that the ETL vendors haven't figured this out yet. But at least most ETL tools support the ability to call a web service, so you can implement this approach today.

One last plug: If you're a consultant doing this kind of work for a client, you really ought to contact Metagenix to become a Certified System Integrator. Your cost will be minimal and the value you add for your clients will be huge.

http://www.metagenix.com/products
kcbland
Participant
Posts: 5208
Joined: Wed Jan 15, 2003 8:56 am
Location: Lutz, FL
Contact:

Post by kcbland »

You're absolutely correct. Quality Manager, Integrity, and MetaRecon are not really integrated with DataStage, more like bundled.

As DataStage Server processes each row, the overhead of doing any external call for any reason will stall the data stream until that call returns. Using multiple job instances only magnifies that stall, as many parallel jobs will now be making the exact same calls. Using the DataStage Parallel Extender edition would leave you in the exact same place.

My experience is to segment and draw extremely clear lines between each of the phases of ETL processing. I would NEVER design a process that actively sources a database table, cleanses to source system standards, transforms, cleanses again to target system standards, and loads ALL IN ONE CONTINUOUS data stream. This is a fundamental design choice where I differ from many architects who poo-poo landing data. In almost every warehouse I have ever built, the source availability and target availability windows NEVER coincide. A near-24x7 warehouse is not achievable if you cannot snapshot your source data as quickly as possible when it's available. A near-24x7 warehouse is not achievable if you cannot prepare load data for the target and then apply those inserts/updates within a very, very short window.

Whew, that being said, you arrive at a job stream that will land data, and therefore can take advantage of a prepared data set that can be handed to an engine to process however best it can.

Thanks!
gleman
Participant
Posts: 8
Joined: Mon Oct 28, 2002 7:33 pm

Post by gleman »

I agree on the need for a staging area. Going directly from source to target rarely works out in practice the way it looks in theory.

I disagree about the function call overhead using the parallel extender. You've got a truly multi-threaded solution with parallel extender. The limiting factor is how fast the data stream can serve data, not how fast the processors can perform the computations for the function calls. That's why the performance improvement isn't completely linear -- the data can only arrive so fast.

If you want to build an external process for handling a staging file instead of going through the ETL tool, that approach will work just fine (that was the first generation of usage for MetaPure, before we added the web services interface). But most people aren't going to be in a position to cobble together their own cleansing/validation system to work during one of these stages. The importance of having a rules-based engine do this work, either as part of an ETL process or just as part of the overall pipeline, cannot be overstated.

The problem is that every ETL tool I've seen (and that's a lot of them!) has no concept of a business rule. They don't let the user think of business rules abstractly and then apply those rules to ETL jobs, applications, batch checks, etc. The business rules engine approach handles that nicely, and the added benefit is that your ETL programming becomes more efficient, because you only have to define a rule once and then apply it around the enterprise wherever you need it.
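
The point is easiest to see with a toy example: define each rule once as data, and let the ETL job, a batch audit, and an online application all evaluate the same definitions. (Everything below is invented for illustration; a real rules engine keeps these definitions centrally, not in application code.)

# Illustration only: one set of abstract rule definitions, shared by
# several callers. Names and rules are made up.
RULES = {
    "valid_postcode": lambda row: len(row.get("postcode", "")) in (5, 9),
    "amount_positive": lambda row: float(row.get("amount", 0)) > 0,
}

def check(row):
    # Shared by the ETL job, a nightly batch audit and an online application.
    return [name for name, rule in RULES.items() if not rule(row)]

# ETL job:      route rows with failures to a reject file
# Batch audit:  count failures per rule across a whole table
# Application:  refuse the transaction and display the failed rule names
print(check({"postcode": "12345", "amount": "19.99"}))   # []
print(check({"postcode": "12", "amount": "-5"}))         # ['valid_postcode', 'amount_positive']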