WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

As the reader will have understood by now, self-service systems have no notion of a data model that captures relationships among several datasets: they work on one dataset at a time. They have yet to capture more model semantics, such as dimensional models, or at least foreign key/primary key relationships among datasets. ETLs can be programmed to produce either of these through techniques such as surrogate key generation, automated mapping of source keys to surrogates, and replacement of source keys (both valid and invalid) with the single surrogate for matched records in all fact tables linked to the customer table. Needless to say, this remains out of reach for most business users: only technically savvy users would get this right (and on a more technical interface). Simpler manual solutions could be proposed through specific workflows, such as replacing the invalid keys from non-surviving records with those from surviving records in other datasets, as indicated by the user, but this goes beyond the state of the art of today’s tools.
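To make the ETL technique above concrete, the following is a minimal sketch of surrogate key generation and fact-table re-keying. It assumes that de-duplication has already produced groups of matched source keys; the function names, the `customer_id` column and the sample keys are hypothetical and chosen only for illustration.

```python
from itertools import count


def build_surrogate_map(source_keys, match_groups):
    """Map every source key to a surrogate key.

    source_keys:  all customer source keys (hypothetical input).
    match_groups: sets of source keys judged to be the same customer.
    All keys in one match group share a single surrogate.
    """
    next_surrogate = count(1)
    key_map = {}
    for group in match_groups:
        surrogate = next(next_surrogate)
        for source_key in group:
            key_map[source_key] = surrogate
    # Unmatched records each receive their own surrogate.
    for source_key in source_keys:
        if source_key not in key_map:
            key_map[source_key] = next(next_surrogate)
    return key_map


def rekey_fact_rows(fact_rows, key_map, key_column="customer_id"):
    """Replace the source key in each fact row with its surrogate.

    Assumes every key found in the fact rows is present in key_map;
    a missing key raises KeyError rather than being silently dropped.
    """
    return [{**row, key_column: key_map[row[key_column]]} for row in fact_rows]


key_map = build_surrogate_map(["C1", "C2", "C3"], [{"C1", "C2"}])
orders = [
    {"order": 100, "customer_id": "C1"},
    {"order": 101, "customer_id": "C2"},
    {"order": 102, "customer_id": "C3"},
]
rekeyed_orders = rekey_fact_rows(orders, key_map)
```

In this sketch, orders 100 and 101 end up pointing at the same surrogate because `C1` and `C2` were matched. The same `key_map` would have to be applied to every fact table linked to the customer table, which is the part that is hard to expose to business users.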

Third, and related to the previous example, users could benefit greatly from two-way content authoring between IT and business users. This means not only that IT should be able to take content authored by business users and operationalize it (rather than redoing from scratch everything the business user did), for example to deal with data modeling aspects, as the previous example shows, or with incremental updates to the original datasets, but also the converse: when the content is in pilot phase or in production, and functional requirements affecting the earlier data preparation change, new data elements become available in the sources, or new datasets could help improve insights, business users would like to work on their content again (rather than redoing from scratch everything they and the IT user did).

Finally, a fourth area is support for a broader set of domains, as well as support for user-defined domains (a.k.a. business types), i.e., the ability for advanced business users and data stewards to define validation, cleansing and de-duplication rules for their own data domains in an easy and data-driven manner. Self-service data preparation systems have an opportunity here to dramatically increase their usefulness by introducing these domains. To mention a few possibilities:

i) The ability to involve profiling during the definition of a data domain would streamline the development of the validation, cleansing and matching rules of the domain. We already commented on this in section 6.3.1, but we believe that it is in self-service systems where this functionality is the most cost-effective. Profiling a column would immediately reveal the usual patterns and/or frequent erroneous values of a domain under definition, and enable the user to construct data validation, correction and matching rules for the domain directly from these patterns.

ii) The ability to automatically detect columns as being of a certain domain (a problem that is being framed as a machine learning classification problem) will improve the automated suggestion capabilities and the subsequent execution of data validation and cleansing on these columns;

iii) The ability to use domains in keyword search will improve the discovery power of the system in finding datasets containing instances of these domains;

iv) The ability to use columns assigned to a domain in operations such as joins may make these operations much more robust, as the automated application of validation/cleansing operations on domain data will produce more, and better-quality, join results. A simple, pervasive example is date/time data, which may appear in different formats in two dataset columns and so not be directly joinable. Applying the right cleansing/standardization rules will convert the data into joinable values. Again, the system may make smart suggestions to help the user, given the data instances at hand.
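Items i) and iv) can be illustrated with a small sketch: profiling a column into character-class patterns, then standardizing dates so that two columns in different formats become joinable. This is an illustration under stated assumptions, not a description of how any particular tool works; the function names, the `DATE_FORMATS` list and the sample rows are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical formats accepted by a "date" domain. Order matters for
# ambiguous inputs: the first format that parses wins.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def profile_patterns(values):
    """Count character-class patterns: digits become 9, letters become A."""
    def pattern(value):
        return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))
    return Counter(pattern(v) for v in values)


def standardize_date(value, formats=DATE_FORMATS):
    """Return the ISO 8601 date string, or None if no format matches."""
    for fmt in formats:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def join_on_date(left, right, left_col, right_col):
    """Inner-join two lists of row dicts on their standardized date columns."""
    index = {}
    for row in right:
        key = standardize_date(row[right_col])
        if key is not None:
            index.setdefault(key, []).append(row)
    joined = []
    for row in left:
        key = standardize_date(row[left_col])
        # Rows whose date cannot be parsed (key is None) never match.
        for match in index.get(key, []):
            joined.append({**row, **match})
    return joined


sales = [{"day": "2021-03-04", "amount": 10}]
weather = [{"date": "04/03/2021", "temp": 12}]
patterns = profile_patterns(["2021-03-04", "04/03/2021", "2021-12-25"])
joined_rows = join_on_date(sales, weather, "day", "date")
```

Here `patterns` shows two distinct shapes in the profiled values (`9999-99-99` and `99/99/9999`), which is the kind of evidence a user could turn into the format list of the domain. A direct equality join of `sales` and `weather` would return nothing, whereas the standardized join matches the two rows.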