

— Some tools are delving into data quality governance, allowing users with a data steward profile to define validity rules for data elements on data domains, to cleanse invalid data, and to monitor data quality on selected datasets over time (a small illustrative sketch of such rules appears after this list). More on this below.
— Some tools allow IT to enforce governance from two points of view (a configuration sketch also follows this list):
1. The system resources being used, imposing limits on the size and time span of user-generated content (transformations and the resulting data model and instances), and
2. Promotion processes that move business user-generated content to a pilot phase and then to production, each phase adding more governance and quality checks. Importantly, the tool may let the IT architect retrieve the transformations done by the business user in the technical user interface needed to complete the job: for instance, to improve performance on large volumes, the IT user needs a more technical, traditional interface.
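As a concrete, purely illustrative reading of the steward-defined rules mentioned in the first item, the following Python sketch defines per-field validity rules, profiles a dataset against them, and keeps a dated snapshot so quality can be monitored over time. The rule names, fields, and reference domains are assumptions made for the example, not the API of any particular tool.

```python
# Hypothetical sketch of steward-defined validity rules applied to a dataset;
# the fields, rules, and reference domains are illustrative assumptions.
import re
from datetime import date

# A "rule" is a field name plus a predicate over a single value.
validity_rules = {
    "email":   lambda v: bool(re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", v or "")),
    "country": lambda v: v in {"US", "CA", "MX"},          # assumed reference domain
    "age":     lambda v: isinstance(v, int) and 0 < v < 120,
}

def profile(records):
    """Return, per field, the fraction of records that pass its validity rule."""
    scores = {}
    for field, rule in validity_rules.items():
        passed = sum(1 for r in records if rule(r.get(field)))
        scores[field] = passed / len(records) if records else 1.0
    return scores

customers = [
    {"email": "ana@example.com", "country": "US", "age": 34},
    {"email": "not-an-email",    "country": "Brasil", "age": 34},
]

# Monitoring over time: store one dated snapshot per run so quality trends
# on the selected dataset can be tracked.
snapshot = {"run_date": date.today().isoformat(), "scores": profile(customers)}
print(snapshot)   # e.g. {'run_date': ..., 'scores': {'email': 0.5, 'country': 0.5, 'age': 1.0}}
```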
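For the IT governance item, here is a similarly hypothetical sketch of how resource limits and promotion phases could be captured; every field name, threshold, and phase name below is invented for illustration rather than taken from a real product.

```python
# Illustrative sketch only: one way the IT-side limits and promotion stages
# described above could be expressed.
from dataclasses import dataclass, field

@dataclass
class ResourceLimits:
    max_rows: int = 1_000_000        # cap on the size of user-generated datasets
    max_storage_mb: int = 512        # cap on materialized results
    retention_days: int = 30         # time span before user-generated content expires

@dataclass
class PromotionPhase:
    name: str
    checks: list = field(default_factory=list)   # governance/quality gates for the phase

PIPELINE = [
    PromotionPhase("sandbox",    ["schema validated"]),
    PromotionPhase("pilot",      ["schema validated", "steward sign-off"]),
    PromotionPhase("production", ["schema validated", "steward sign-off",
                                  "performance reviewed by IT"]),
]

def within_limits(row_count: int, storage_mb: int, limits: ResourceLimits) -> bool:
    """Point of view 1: refuse user content that exceeds the configured resource caps."""
    return row_count <= limits.max_rows and storage_mb <= limits.max_storage_mb

def can_promote(current_phase: str, completed_checks: set) -> bool:
    """Point of view 2: content moves to the next phase only when its checks are done."""
    names = [p.name for p in PIPELINE]
    next_phase = PIPELINE[names.index(current_phase) + 1]
    return set(next_phase.checks) <= completed_checks

print(within_limits(250_000, 128, ResourceLimits()))                      # True
print(can_promote("sandbox", {"schema validated", "steward sign-off"}))   # True -> pilot
```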
This space is very young, and the state of the art of the existing products is being challenged (and rapidly making
progress) in a number of areas:
The first area is data steward input. Customers are asking for explicit approval steps for data steward users, who in some cases know the quality of their data better than the tools do. Indeed, the cleansing done, the matches found, and the survivorship performed by the system are, in general, suboptimal: recent evidence reported in [17] shows that even specialized, domain-specific tools achieve precision (percentage of correct error detections) and recall (percentage of real errors detected) in the 60–70% range on average.
Let's illustrate these points with a customer deduplication example. A match workflow may suggest that the user run cleansing/standardization first. Once names and addresses have been cleaned and put into standardized formats, matching may detect that what at first appeared to be two customers is in fact one. Perhaps one entry has a PO box address and the other has a street address, but the rest of the data indicates that it may be the same customer. The system might suggest that the two records are duplicates, but this may turn out not to be the case. And even if a data steward user does approve the suggestion, at the survivorship phase it is not clear which address to pick, and the default survivorship rule might not be the right choice. The tool should let the data steward review and ultimately decide about duplicate set suggestions and about surviving address values.
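The flow just described (standardize, suggest matches, let the steward approve and pick the surviving values) can be sketched in a few lines of Python. The standardization, matching, and survivorship rules below are deliberately naive placeholders chosen for the example, not the logic of any specific product.

```python
# Minimal sketch of the deduplication flow described above.
def standardize(rec):
    rec = dict(rec)
    rec["name"] = " ".join(rec["name"].lower().split())
    rec["address"] = rec["address"].upper().replace(".", "").strip()
    return rec

def is_probable_match(a, b):
    # Naive rule: the same standardized name marks a candidate duplicate,
    # even if one address is a PO box and the other a street address.
    return a["name"] == b["name"]

def survive(a, b, steward_choice=None):
    # Default survivorship: keep the most recently updated record.
    # An explicit steward override ("a" or "b") takes precedence, as argued above.
    if steward_choice == "a":
        return a
    if steward_choice == "b":
        return b
    return a if a["updated"] >= b["updated"] else b

r1 = {"id": 1, "name": "Jane  Doe", "address": "P.O. Box 12", "updated": "2016-03-01"}
r2 = {"id": 2, "name": "jane doe",  "address": "42 Main St.", "updated": "2017-01-15"}
a, b = standardize(r1), standardize(r2)

if is_probable_match(a, b):
    # Surface the pair as a *suggestion*; the data steward approves it and picks
    # the surviving address instead of the system deciding silently.
    golden = survive(a, b, steward_choice="b")
    print("suggested merge ->", golden)
```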
A second area is relationships across datasets (e.g., referential integrity). Simply put, when data is deleted or updated, the relationships with other datasets typically need to be maintained by hand.
To illustrate this point, let's continue our example and assume that our user wants to combine this clean, deduplicated customer dataset with a sales dataset, which will very likely have entries for the two merged customers under the two previously distinct keys. After the match, one of the keys is logically invalid: all sales records that carry the invalid key would have to be corrected to maintain referential integrity and/or to get correct results, and in today's systems this has to be done by hand. Otherwise, several bad things may happen: (a) if referential integrity between the two tables is enforced by the DBMS, deleting the record with the invalid key may not go through (and our user might not understand why the deduplication operation fails); (b) if it is a soft delete, or (c) if there is no referential integrity enforcement between these two datasets, then sales belonging to the customer record with the invalidated key will not be taken into account, and results will be wrong.
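What follows is a sketch of the key repair that, as noted, currently has to be done by hand; pandas is used here purely as an illustrative assumption about tooling, and the column names and key mapping are invented for the example.

```python
# Sketch of repairing foreign keys in a sales dataset after two customers are merged.
import pandas as pd

# Customer 2 was merged into customer 1 during deduplication.
key_map = {2: 1}   # invalidated key -> surviving key

sales = pd.DataFrame({
    "sale_id":     [100, 101, 102],
    "customer_id": [1, 2, 2],
    "amount":      [50.0, 75.0, 20.0],
})

# Rewrite every foreign key that points at an invalidated customer record,
# so that no sale is silently dropped from per-customer results.
sales["customer_id"] = sales["customer_id"].replace(key_map)

print(sales.groupby("customer_id")["amount"].sum())
# customer_id
# 1    145.0   <- all three sales now roll up to the surviving record
```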