
4.4.1.5 What to do with data of poor quality?

Business users are aware that data quality is a serious and expensive problem, so most organizations are likely to support initiatives to improve it. Most users, however, have little idea where data quality problems originate or what should be done about them. The best solution is to capture the data accurately in the first place. If poor-quality data is nevertheless found, there are three choices:

1. Discard the offending records
2. Send the offending record(s) to a suspense file for later processing
3. Tag the data with an error condition and pass it through to the next step in the pipeline

The third approach is recommended. Bad fact table data can be tagged with an audit dimension record ID that describes the overall data quality condition of the offending fact row. Bad dimension data can also be tagged using an audit dimension or, in the case of missing or garbage data, flagged with unique error values in the field itself. Both types of bad data can also be eliminated from query results if desired. We discuss this topic further in section 6.2.4. The other two choices either introduce distortion in the data or may compromise its integrity, as records are missing.
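
As a minimal illustration of the recommended approach, the sketch below (Python, with hypothetical field names and audit codes that are not taken from any specific product or schema) tags each incoming fact row with an audit indicator instead of discarding it, so downstream steps and queries can filter or report on data quality as needed.

# Minimal sketch of choice 3: tag poor-quality rows and pass them on.
# The field names (customer_key, order_amount) and audit codes are
# hypothetical examples, not part of any particular schema.

AUDIT_OK = 1             # row passed all quality checks
AUDIT_MISSING_FIELD = 2  # a required field was missing
AUDIT_OUT_OF_RANGE = 3   # a numeric value failed a range check

def tag_fact_row(row):
    """Attach an audit key describing the row's quality condition."""
    if row.get("customer_key") is None:
        row["audit_key"] = AUDIT_MISSING_FIELD
        # Missing dimension value: substitute a unique error value in the field itself.
        row["customer_key"] = -1
    elif not (0 <= row.get("order_amount", 0) <= 1_000_000):
        row["audit_key"] = AUDIT_OUT_OF_RANGE
    else:
        row["audit_key"] = AUDIT_OK
    return row  # the row continues to the next step in the pipeline

# Every row flows through; none are silently dropped.
incoming = [
    {"customer_key": 42, "order_amount": 120.0},
    {"customer_key": None, "order_amount": 75.5},
]
tagged = [tag_fact_row(dict(r)) for r in incoming]

Queries can then join on the audit key to include, exclude, or report on rows by quality condition, which is what makes it possible to eliminate bad data from query results when desired.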

In big data environments, this topic becomes more of an open question than in a traditional data warehouse, as we will see in section 6.3.3.1 below.

4.4.1.6 When is it important to capture and integrate metadata?

The list of requirements might include one or several of the following capabilities (the first two are sketched in code after the list):

• Impact analysis (e.g., which views and reports in the production database would be impacted by changes being prepared in development for the next version)
• Lineage analysis (e.g., where does the data in this report come from, and which transformations has it gone through), and
• Automating updates / additions to the target presentation server model upon corresponding changes in the source model.
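
As a rough sketch of the first two capabilities, the following Python fragment (hypothetical object names, a deliberately tiny dependency graph) treats metadata as a directed graph from sources to reports: lineage analysis walks the graph upstream, impact analysis walks it downstream.

# Sketch only: a toy metadata repository as a directed dependency graph.
# Object names (stg_orders, dim_customer, sales_report, ...) are hypothetical.
from collections import defaultdict

# An edge (a, b) means "b is derived from a" (a feeds b).
edges = [
    ("src.orders", "stg_orders"),
    ("stg_orders", "fact_sales"),
    ("src.customers", "dim_customer"),
    ("dim_customer", "fact_sales"),
    ("fact_sales", "sales_report"),
]

downstream = defaultdict(set)   # used for impact analysis
upstream = defaultdict(set)     # used for lineage analysis
for a, b in edges:
    downstream[a].add(b)
    upstream[b].add(a)

def closure(start, graph):
    """All objects reachable from `start` by following `graph` edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Impact analysis: what is affected if src.customers changes?
print(sorted(closure("src.customers", downstream)))  # ['dim_customer', 'fact_sales', 'sales_report']

# Lineage analysis: where does sales_report's data come from?
print(sorted(closure("sales_report", upstream)))     # ['dim_customer', 'fact_sales', 'src.customers', 'src.orders', 'stg_orders']

In practice these dependencies span the repositories of several tools, which is exactly why the integration discussed next matters.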

In any of these cases, there needs to be a way to integrate the various metadata repositories of the selected off-the-shelf tools, which may include a modeling tool, an ETL tool, the target database and a BI tool. The DW/BI architecture must address this matter. Metadata management tools are starting to become available as extensions of an ETL and/or a data quality suite and enable features such as the above. However, a single-source DW/BI technology provider is the best hope for a single, integrated metadata repository. Otherwise, standards such as CWM for exchanging metadata among different systems are now mature and in use for these purposes, and some tools based on this standard can read and write to the major DW/BI vendors' metadata repositories.
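
To make the integration point concrete, here is a purely illustrative sketch (hypothetical tool export formats, not CWM or any vendor's actual schema) that normalizes lineage metadata from two different tools into one common record shape, which is the kind of consolidation a shared repository or an exchange standard provides.

# Illustrative only: two hypothetical tool-specific metadata exports,
# normalized into a single common lineage record. A real integration would
# use the tools' actual APIs or an exchange standard such as CWM.
from dataclasses import dataclass

@dataclass(frozen=True)
class LineageRecord:
    source: str   # upstream object
    target: str   # downstream object
    tool: str     # which repository the fact came from

def from_etl_tool(export_row):
    # Hypothetical ETL-tool export shape: {"in": ..., "out": ...}
    return LineageRecord(export_row["in"], export_row["out"], tool="etl")

def from_bi_tool(export_row):
    # Hypothetical BI-tool export shape: {"report": ..., "depends_on": [...]}
    return [LineageRecord(dep, export_row["report"], tool="bi")
            for dep in export_row["depends_on"]]

etl_export = [{"in": "stg_orders", "out": "fact_sales"}]
bi_export = [{"report": "sales_report", "depends_on": ["fact_sales", "dim_customer"]}]

repository = [from_etl_tool(r) for r in etl_export]
for r in bi_export:
    repository.extend(from_bi_tool(r))
# `repository` now holds lineage facts from both tools in one shape,
# ready to feed the impact and lineage traversals sketched earlier.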