

A data target with duplicate data produces inaccurate results and is frequently inconsistent; ETL tools can help with this operation (see chapter 9 and [1], p. 383). There is very seldom a universal column that makes the matching operation easy; the only clues are generally in columns carrying similar data. Before pair-wise comparisons are performed among all records in the dataset (an operation with O(n²) complexity), some systems allow specifying a discriminating column or set of columns (e.g., zipCode for matching vendor data), so that only records with the same values on that field or fields are compared. This is called blocking, and in big data systems blocking is receiving a lot of attention (see section 6.3.3.1 below). Comparisons using fuzzy matches are frequent on data elements such as names or addresses, where similar but different spellings or conventions are used; others, such as phone numbers, need exact matches. Survivorship is the process of combining a set of matching records into a unified, single record.
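
To make the blocking, fuzzy-matching, and survivorship steps concrete, the following Python sketch deduplicates a list of vendor records. The field names (zipCode, name, phone), the similarity threshold, and the keep-the-most-complete-value survivorship rule are illustrative assumptions, not prescriptions from any particular ETL tool.

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def deduplicate(records, name_threshold=0.85):
    # Blocking: only records sharing the same zipCode are compared,
    # which avoids the full O(n^2) pair-wise comparison over the dataset.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["zipCode"]].append(rec)

    survivors = []
    for block in blocks.values():
        merged = [False] * len(block)
        for i, j in combinations(range(len(block)), 2):
            if merged[i] or merged[j]:
                continue
            a, b = block[i], block[j]
            # Fuzzy match on name, exact match on phone.
            name_sim = SequenceMatcher(None, a["name"].lower(),
                                       b["name"].lower()).ratio()
            if name_sim >= name_threshold and a["phone"] == b["phone"]:
                # Survivorship: merge b into a, keeping the most complete
                # value for each field of the unified record.
                for field, value in b.items():
                    if not a.get(field):
                        a[field] = value
                merged[j] = True
        survivors.extend(r for k, r in enumerate(block) if not merged[k])
    return survivors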
6.2.5 Data Quality at Deployment Stage
We concentrate in this section on data quality assurance testing ([1], pp. 546-547), among the various other tests to be performed during the alpha and beta testing of the deployment stage. This is basically a process that makes sure that the contents loaded (or to be loaded) into the data warehouse are correct. The audit information captured by the flows may tell us we have the right number of rows, and referential integrity checking may tell us everything matches up. But correctness can only be verified by running a set of queries or reports against the source system(s), running the corresponding queries or reports against the data warehouse, and comparing the results.
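
As a hedged illustration of this kind of correctness check, the sketch below runs an aggregate query against a source system and the corresponding query against the warehouse, then compares monthly totals. The connection objects, the SQL dialect, the table names (orders, fact_sales), and the column names are hypothetical placeholders for illustration only; real systems would use their own drivers and reconciliation queries.

def reconcile_monthly_sales(source_conn, dw_conn, tolerance=0.01):
    # Corresponding queries: one phrased against the source schema,
    # one against the warehouse schema (both assumed for illustration).
    source_sql = ("SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) "
                  "FROM orders GROUP BY month")
    dw_sql = ("SELECT order_month, SUM(sales_amount) "
              "FROM fact_sales GROUP BY order_month")

    source_totals = dict(source_conn.execute(source_sql).fetchall())
    dw_totals = dict(dw_conn.execute(dw_sql).fetchall())

    # Compare results month by month; anything beyond the tolerance is flagged.
    mismatches = []
    for month in sorted(set(source_totals) | set(dw_totals)):
        s = source_totals.get(month, 0)
        d = dw_totals.get(month, 0)
        if abs(s - d) > tolerance:
            mismatches.append((month, s, d))
    return mismatches  # an empty list means source and warehouse agree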
When there is a single source, determining the corresponding queries and reports is much easier than when integrating data from several sources. When there are several sources and they are large, the difficult part is determining sufficient coverage through queries and determining the corresponding warehouse queries, including the complex calculations needed by the downstream BI applications. Engage business users and the source system owners to help create a solid set of data checks. When there are differences that cannot easily be explained (existing reports may have long been in error), document them and get help from an internal audit team.
During this last phase, data assurance testing may be done at three levels:
(i) At the dimension or fact initial-load test level, with a representative subset of real data,
(ii) With the primary test dataset used for end-to-end system testing (usually a small dataset), and
(iii) With the real, live data, both the initial load and the incremental data loads, as mentioned in section 5.3.6 above.
Make sure that automation is in place at least for the first two levels, saving a log of the results of every run and comparing test results against known correct results. It is also recommended to divide the automated tests into a basic acceptance test, run very frequently to verify that builds are correct; a regression test suite, to make sure that functional changes in a new build do not break anything; and a complete test suite covering all branches of the ETL processing.
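
A minimal sketch of such an automated harness is shown below, assuming each test case pairs a warehouse query with its known correct result and that dw_conn is a DB-API style connection; the same runner could be pointed at a basic acceptance suite, a regression suite, or the complete suite. All names are illustrative assumptions.

import csv
import datetime

def run_suite(suite_name, dw_conn, test_cases, log_path="dq_test_log.csv"):
    """test_cases: list of (test_name, sql, expected_rows) tuples."""
    rows = []
    for test_name, sql, expected_rows in test_cases:
        actual_rows = dw_conn.execute(sql).fetchall()
        status = "PASS" if actual_rows == expected_rows else "FAIL"
        rows.append((datetime.datetime.now().isoformat(), suite_name,
                     test_name, status))
    # Keep a log of every run so results can be compared across builds.
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerows(rows)
    return all(row[3] == "PASS" for row in rows)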
Go-live is described in section 5.3.6 above. After the system goes live, it is possible that some data quality errors might slip through the net, even if you thought data quality assurance had weeded them all out.