

— Some tools are delving into data quality governance, allowing users with a data steward profile to define validity rules for data elements on data domains, to cleanse invalid data, and to monitor data quality on selected datasets over time (a small illustrative sketch of such rules appears after this list). More on this below.
— Some tools allow IT to enforce governance from two points of view (a configuration sketch also follows this list):
1. The system resources being used, imposing limits on the size and time span of user-generated content (transformations and the resulting data model and instances), and
2. Promotion processes that move business user-generated content to a pilot phase and then to production, each phase adding more governance and quality checks. Importantly, the tool may let the IT architect retrieve the transformations done by the business user in the technical user interface needed to complete the job: for instance, to improve performance on large volumes, the IT user needs a more technical, traditional interface.
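As a concrete, purely illustrative reading of the steward-defined rules mentioned in the first item, the following Python sketch defines per-field validity rules, profiles a dataset against them, and keeps a dated snapshot so quality can be monitored over time. The rule names, fields, and reference domains are assumptions made for the example, not the API of any particular tool.

```python
# Hypothetical sketch of steward-defined validity rules applied to a dataset;
# the fields, rules, and reference domains are illustrative assumptions.
import re
from datetime import date

# A "rule" is a field name plus a predicate over a single value.
validity_rules = {
    "email":   lambda v: bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v or "")),
    "country": lambda v: v in {"US", "CA", "MX"},          # assumed reference domain
    "age":     lambda v: isinstance(v, int) and 0 < v < 120,
}

def profile(records):
    """Return, per field, the fraction of records that pass its validity rule."""
    scores = {}
    for field, rule in validity_rules.items():
        passed = sum(1 for r in records if rule(r.get(field)))
        scores[field] = passed / len(records) if records else 1.0
    return scores

customers = [
    {"email": "ana@example.com", "country": "US", "age": 34},
    {"email": "not-an-email",    "country": "Brasil", "age": 34},
]

# Monitoring over time: store one dated snapshot per run so quality trends
# on the selected dataset can be tracked.
snapshot = {"run_date": date.today().isoformat(), "scores": profile(customers)}
print(snapshot)   # e.g. {'run_date': ..., 'scores': {'email': 0.5, 'country': 0.5, 'age': 1.0}}
```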
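For the IT governance item, here is a similarly hypothetical sketch of how resource limits and promotion phases could be captured; every field name, threshold, and phase name below is invented for illustration rather than taken from a real product.

```python
# Illustrative sketch only: one way the IT-side limits and promotion stages
# described above could be expressed.
from dataclasses import dataclass, field

@dataclass
class ResourceLimits:
    max_rows: int = 1_000_000        # cap on the size of user-generated datasets
    max_storage_mb: int = 512        # cap on materialized results
    retention_days: int = 30         # time span before user-generated content expires

@dataclass
class PromotionPhase:
    name: str
    checks: list = field(default_factory=list)   # governance/quality gates for the phase

PIPELINE = [
    PromotionPhase("sandbox",    ["schema validated"]),
    PromotionPhase("pilot",      ["schema validated", "steward sign-off"]),
    PromotionPhase("production", ["schema validated", "steward sign-off",
                                  "performance reviewed by IT"]),
]

def within_limits(row_count: int, storage_mb: int, limits: ResourceLimits) -> bool:
    """Point of view 1: refuse user content that exceeds the configured resource caps."""
    return row_count <= limits.max_rows and storage_mb <= limits.max_storage_mb

def can_promote(current_phase: str, completed_checks: set) -> bool:
    """Point of view 2: content moves to the next phase only when its checks are done."""
    names = [p.name for p in PIPELINE]
    next_phase = PIPELINE[names.index(current_phase) + 1]
    return set(next_phase.checks) <= completed_checks

print(within_limits(250_000, 128, ResourceLimits()))                      # True
print(can_promote("sandbox", {"schema validated", "steward sign-off"}))   # True -> pilot
```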
This space is very young, and the state of the art of the existing products is being challenged (and rapidly making
progress) in a number of areas:
The first area is data steward input. Customers are asking for explicit approval steps for data steward users, who in some cases know the quality of their data better than the tools do. Indeed, the cleansing done, the matches found, and the survivorship performed by the system are, in general, suboptimal: recent evidence reported in [17] shows that even specialized, domain-specific tools achieve precision (percentage of correct error detections) and recall (percentage of real errors detected) in the 60–70% range on average.
Let's illustrate these points with a customer deduplication example. A match workflow may suggest that the user run cleansing/standardization first. Once names and addresses have been cleaned and put into standardized formats, matching may detect that what at first appeared to be two customers is in fact one. Perhaps one entry has a PO box address and the other has a street address, but the rest of the data indicates that it may be the same customer. The system might suggest that the two records are duplicates, but this may turn out not to be the case. And even if a data steward user does approve the suggestion, at the survivorship phase it is not clear which address to pick, and the default survivorship rule might not be the right choice. The tool should let the data steward review and ultimately decide about duplicate set suggestions and about surviving address values.
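The flow just described (standardize, suggest matches, let the steward approve and pick the surviving values) can be sketched in a few lines of Python. The standardization, matching, and survivorship rules below are deliberately naive placeholders chosen for the example, not the logic of any specific product.

```python
# Minimal sketch of the deduplication flow described above.
def standardize(rec):
    rec = dict(rec)
    rec["name"] = " ".join(rec["name"].lower().split())
    rec["address"] = rec["address"].upper().replace(".", "").strip()
    return rec

def is_probable_match(a, b):
    # Naive rule: the same standardized name marks a candidate duplicate,
    # even if one address is a PO box and the other a street address.
    return a["name"] == b["name"]

def survive(a, b, steward_choice=None):
    # Default survivorship: keep the most recently updated record.
    # An explicit steward override ("a" or "b") takes precedence, as argued above.
    if steward_choice == "a":
        return a
    if steward_choice == "b":
        return b
    return a if a["updated"] >= b["updated"] else b

r1 = {"id": 1, "name": "Jane  Doe", "address": "P.O. Box 12", "updated": "2016-03-01"}
r2 = {"id": 2, "name": "jane doe",  "address": "42 Main St.", "updated": "2017-01-15"}
a, b = standardize(r1), standardize(r2)

if is_probable_match(a, b):
    # Surface the pair as a *suggestion*; the data steward approves it and picks
    # the surviving address instead of the system deciding silently.
    golden = survive(a, b, steward_choice="b")
    print("suggested merge ->", golden)
```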
A second area is relationships across datasets (e.g., referential integrity). Simply put, when data is deleted or updated, the relationships with other datasets typically need to be maintained by hand.
To illustrate this point, let's continue our example and assume that our user wants to combine this clean, deduplicated customer dataset with a sales dataset, which will very likely have entries for the two merged customers under the two previously distinct keys. After the match, one of the keys is logically invalid: all sales records that carry the invalid key would have to be corrected to maintain referential integrity and/or to get correct results, and in today's systems this has to be done by hand. Otherwise, several bad things may happen: (a) if referential integrity between the two tables is enforced by the DBMS, deleting the record with the invalid key may not go through (and our user might not understand why the deduplication operation fails); (b) if it is a soft delete, or (c) if there is no referential integrity enforcement between these two datasets, then sales belonging to the customer record with the invalidated key will not be taken into account, and results will be wrong.
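What follows is a sketch of the key repair that, as noted, currently has to be done by hand; pandas is used here purely as an illustrative assumption about tooling, and the column names and key mapping are invented for the example.

```python
# Sketch of repairing foreign keys in a sales dataset after two customers are merged.
import pandas as pd

# Customer 2 was merged into customer 1 during deduplication.
key_map = {2: 1}   # invalidated key -> surviving key

sales = pd.DataFrame({
    "sale_id":     [100, 101, 102],
    "customer_id": [1, 2, 2],
    "amount":      [50.0, 75.0, 20.0],
})

# Rewrite every foreign key that points at an invalidated customer record,
# so that no sale is silently dropped from per-customer results.
sales["customer_id"] = sales["customer_id"].replace(key_map)

print(sales.groupby("customer_id")["amount"].sum())
# customer_id
# 1    145.0   <- all three sales now roll up to the surviving record
```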