Data quality checks in data warehouse/ETL

A forum for discussing DataStage® basics. If you're not sure where your question goes, start here.


Tsholedi
Participant
Posts: 1
Joined: Mon Mar 18, 2019 9:30 pm

Data quality checks in data warehouse/ETL

Post by Tsholedi »

Hi Team

I need help with a data quality process/model in the data warehouse.
Kindly advise how to do it based on completeness, accuracy, validity and uniqueness.

I need advice on how to design a process to cater for data quality in DataStage.

Regards
Zandile
Zandile Moloi
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Welcome aboard.

This is an enormous topic! Entire books have been written about how to do it.

Begin by understanding that data quality is one component of an overall governance program. The business must specify the rules that determine how data need to be in order to be fit for purpose.

From that, the business should be recording the policies, rules and a glossary of business terms that will control their information. Information is a very valuable asset, and it belongs to the business. If they're not prepared to own it, walk away.

To gain an understanding of what the data look like, the data need to be profiled. The heavy lifting can be done with a tool like Information Analyzer, but its results need to be reviewed by a business analyst, possibly with input from subject matter experts and data architect types. Information Analyzer can profile for structure, classification, format, etc., and also for compliance with data rules. In the longer term, regular monitoring for compliance with data rules is a very convenient mechanism for scoring overall data quality across the organization.
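
To make that concrete, here is a very rough sketch in plain Python (not Information Analyzer; the column names and the email format rule are assumptions made up for illustration) of the kind of per-column profile such a tool produces: completeness, uniqueness and, where a format rule exists, format compliance.

    import re

    # Toy input: in practice these rows would come from the profiled source.
    rows = [
        {"customer_id": "C001", "email": "meg@example.com"},
        {"customer_id": "C002", "email": ""},
        {"customer_id": "C002", "email": "not-an-email"},
    ]

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def profile(rows, column, format_re=None):
        """Per-column completeness, uniqueness and optional format compliance."""
        values = [r.get(column) for r in rows]
        populated = [v for v in values if v not in (None, "")]
        result = {
            "completeness": len(populated) / len(values),                 # share of rows populated
            "uniqueness": len(set(populated)) / max(len(populated), 1),   # share of populated values that are distinct
        }
        if format_re is not None:
            result["format_compliance"] = (
                sum(1 for v in populated if format_re.match(v)) / max(len(populated), 1)
            )
        return result

    for column, fmt in (("customer_id", None), ("email", EMAIL_RE)):
        print(column, profile(rows, column, fmt))

Tracking these percentages per column and per rule over time is essentially the "regular monitoring" scoring mentioned above.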

It's only once all of the above has been completed (at least for the pilot project - don't try to boil the ocean) that DataStage and QualityStage developers should become involved. Techniques available include:
  • standardizing data (e.g. MEG = MARGARET, DR = DOCTOR)
  • matching data if required
  • applying data rules (Data Rules stage --> pass/fail and details; sketched below)
  • capturing exceptions (Exceptions stage)
  • managing exceptions through the Data Quality Console
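
Along the same lines, here is a minimal sketch in plain Python (this is not QualityStage; the rule names, the standardization table and the fields are assumptions for illustration) of the pass/fail-with-details pattern: standardize values first, evaluate named rules against each record, and route anything that fails to an exceptions stream together with the rules it broke.

    # Standardization lookups, e.g. MEG = MARGARET, DR = DOCTOR
    STANDARDIZE = {"MEG": "MARGARET", "DR": "DOCTOR"}

    # Row-level data rules, named so failures can be reported by rule
    ROW_RULES = {
        "customer_id_present": lambda r: bool(r.get("customer_id")),
        "email_present":       lambda r: bool(r.get("email")),
    }

    def apply_rules(records):
        passed, exceptions = [], []
        seen_ids = set()                      # supports a simple uniqueness rule
        for rec in records:
            # Standardize before checking
            rec = dict(rec, title=STANDARDIZE.get(rec.get("title"), rec.get("title")))
            failed = [name for name, check in ROW_RULES.items() if not check(rec)]
            if rec.get("customer_id") in seen_ids:
                failed.append("customer_id_unique")
            seen_ids.add(rec.get("customer_id"))
            if failed:
                exceptions.append({"record": rec, "failed_rules": failed})
            else:
                passed.append(rec)
        return passed, exceptions

    good, bad = apply_rules([
        {"customer_id": "C001", "email": "meg@example.com", "title": "MEG"},
        {"customer_id": "C001", "email": "",                "title": "DR"},
    ])
    print(len(good), "passed;", bad)

In the real tooling, that second output is what the Exceptions stage captures and what then gets worked through the Data Quality Console.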
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.