

A data target with duplicate data produces inaccurate results and is frequently inconsistent; ETL tools can help with this operation (see chapter 9 and [1], p. 383). There is very seldom a universal column that makes the matching operation easy; the only clues are generally in columns carrying similar data. Before pair-wise comparisons are performed among all records in the dataset (an operation with O(n²) complexity), some systems allow specifying a discriminating column or set of columns (e.g., zipCode for matching vendor data), so that only records with the same values on that field or fields are compared. This is called blocking, and in big data systems blocking is receiving a lot of attention (see section 6.3.3.1 below). Comparisons using fuzzy matches are frequent on data elements such as names or addresses, where similar but different spellings or conventions are used; others, such as phone numbers, need exact matches. Survivorship is the process of combining a set of matching records into a unified, single record.
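
To make the blocking, fuzzy-matching, and survivorship steps concrete, the following Python sketch deduplicates a list of vendor records. The field names (zipCode, name, phone), the similarity threshold, and the keep-the-most-complete-value survivorship rule are illustrative assumptions, not prescriptions from any particular ETL tool.

from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def deduplicate(records, name_threshold=0.85):
    # Blocking: only records sharing the same zipCode are compared,
    # which avoids the full O(n^2) pair-wise comparison over the dataset.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["zipCode"]].append(rec)

    survivors = []
    for block in blocks.values():
        merged = [False] * len(block)
        for i, j in combinations(range(len(block)), 2):
            if merged[i] or merged[j]:
                continue
            a, b = block[i], block[j]
            # Fuzzy match on name, exact match on phone.
            name_sim = SequenceMatcher(None, a["name"].lower(),
                                       b["name"].lower()).ratio()
            if name_sim >= name_threshold and a["phone"] == b["phone"]:
                # Survivorship: merge b into a, keeping the most complete
                # value for each field of the unified record.
                for field, value in b.items():
                    if not a.get(field):
                        a[field] = value
                merged[j] = True
        survivors.extend(r for k, r in enumerate(block) if not merged[k])
    return survivors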
6.2.5 Data Quality at Deployment Stage
We concentrate in this section on data quality assurance testing ([1], pp. 546-547), among the various other tests to be performed during the alpha and beta testing of the deployment stage. This is basically a process that makes sure that the contents loaded (or to be loaded) into the data warehouse are correct. The audit information captured by the flows may tell us we have the right number of rows, and referential integrity checking may tell us everything matches up. But correctness can only be verified by running a set of queries or reports against the source system(s), running the corresponding queries or reports against the data warehouse, and comparing the results.
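
As a hedged illustration of this kind of correctness check, the sketch below runs an aggregate query against a source system and the corresponding query against the warehouse, then compares monthly totals. The connection objects, the SQL dialect, the table names (orders, fact_sales), and the column names are hypothetical placeholders for illustration only; real systems would use their own drivers and reconciliation queries.

def reconcile_monthly_sales(source_conn, dw_conn, tolerance=0.01):
    # Corresponding queries: one phrased against the source schema,
    # one against the warehouse schema (both assumed for illustration).
    source_sql = ("SELECT strftime('%Y-%m', order_date) AS month, SUM(amount) "
                  "FROM orders GROUP BY month")
    dw_sql = ("SELECT order_month, SUM(sales_amount) "
              "FROM fact_sales GROUP BY order_month")

    source_totals = dict(source_conn.execute(source_sql).fetchall())
    dw_totals = dict(dw_conn.execute(dw_sql).fetchall())

    # Compare results month by month; anything beyond the tolerance is flagged.
    mismatches = []
    for month in sorted(set(source_totals) | set(dw_totals)):
        s = source_totals.get(month, 0)
        d = dw_totals.get(month, 0)
        if abs(s - d) > tolerance:
            mismatches.append((month, s, d))
    return mismatches  # an empty list means source and warehouse agree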
When there is a single source, determining the corresponding queries and reports is much easier than when integrating data from several sources. When there are several sources and they are large, the difficult part is determining sufficient coverage through queries and determining the corresponding warehouse queries, including the complex calculations needed by the downstream BI applications. Engage business users and the source system owners to help create a solid set of data checks. When there are differences that cannot easily be explained (existing reports may have long been in error), document them and get help from an internal audit team.
During this last phase, data assurance testing may be done at three levels:
(i) At the dimension or fact initial-load test level, with a representative subset of real data,
(ii) With the primary test dataset used for end-to-end system testing (usually a small dataset), and
(iii) With the real, live data, both the initial load and the incremental data loads, as mentioned in section 5.3.6 above.
Make sure that automation is in place at least for the first two levels, saving a log of the results of every run and comparing test results against known correct results. It is also recommended to divide the automated tests into a basic acceptance test, run very frequently to verify that builds are correct; a regression test suite, to make sure that functional changes in a new build do not break anything; and a complete test suite covering all branches of the ETL processing.
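
A minimal sketch of such an automated harness is shown below, assuming each test case pairs a warehouse query with its known correct result and that dw_conn is a DB-API style connection; the same runner could be pointed at a basic acceptance suite, a regression suite, or the complete suite. All names are illustrative assumptions.

import csv
import datetime

def run_suite(suite_name, dw_conn, test_cases, log_path="dq_test_log.csv"):
    """test_cases: list of (test_name, sql, expected_rows) tuples."""
    rows = []
    for test_name, sql, expected_rows in test_cases:
        actual_rows = dw_conn.execute(sql).fetchall()
        status = "PASS" if actual_rows == expected_rows else "FAIL"
        rows.append((datetime.datetime.now().isoformat(), suite_name,
                     test_name, status))
    # Keep a log of every run so results can be compared across builds.
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerows(rows)
    return all(row[3] == "PASS" for row in rows)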
Go-live is described in section 5.3.6 above. After the system goes live, it is possible that some data quality errors might slip through the net, even if you thought data quality assurance had weeded them all out.