

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
As the reader will have understood by now, self-service systems have no notion of building a data model that
captures relationships among several datasets: they work on one dataset at a time. They have yet to capture richer
model semantics, such as dimensional models, or at least foreign key/primary key relationships among datasets.
ETL tools can be programmed to produce either of these through technicalities such as surrogate key generation,
automated mapping of source keys to surrogates, and replacement of the source keys (both valid and invalid) with
the single surrogate for matched records, in all fact tables linked to the customer table. Needless to say, this remains
out of reach for most business users: only technically savvy users would get this right (and on a more technical
interface). Simpler manual solutions could be proposed through specific workflows, such as replacing the invalid keys
from non-surviving records with those from surviving records in other datasets as indicated by the user, but this goes
beyond the state of the art of today’s tools.
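
To make the key-replacement step concrete, the following is a minimal sketch in Python, not a description of any particular ETL product. It assumes that matching has already grouped the source keys that refer to the same customer; the function names, the `customer_key` column and the sample keys are all hypothetical.

```python
from itertools import count


def build_surrogate_map(match_groups):
    """Assign one surrogate key per matched group and map every source key to it."""
    next_key = count(1)
    mapping = {}
    for group in match_groups:
        surrogate = next(next_key)
        for source_key in group:
            mapping[source_key] = surrogate
    return mapping


def rekey_fact_rows(rows, key_column, mapping):
    """Return copies of the fact rows with the source key replaced by its surrogate."""
    return [{**row, key_column: mapping[row[key_column]]} for row in rows]


# Hypothetical input: C001 and C017 were matched as the same customer.
match_groups = [["C001", "C017"], ["C002"]]
orders = [
    {"order_id": 1, "customer_key": "C001"},
    {"order_id": 2, "customer_key": "C017"},
    {"order_id": 3, "customer_key": "C002"},
]

surrogates = build_surrogate_map(match_groups)
rekeyed_orders = rekey_fact_rows(orders, "customer_key", surrogates)
```

After re-keying, orders 1 and 2 point to the same surrogate, so they roll up to a single customer. The same mapping would have to be applied to every fact table linked to the customer table, which is the part that is hard to expose to business users.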
Third, and related to the previous example, users could benefit greatly from two-way content authoring between IT
and business users. This means not only that IT should be able to take content authored by business users and
operationalize it (rather than redoing from scratch everything the business user did), for example to deal with
data modeling aspects as the previous example shows, or with incremental updates to the original datasets, but that
the converse should be true as well: when the content is in pilot phase or in production, and functional requirements
affecting the earlier data preparation change, new data elements become available in the sources, or new datasets
could help improve insights, business users would like to work on their content again (rather than redoing from
scratch everything they and the IT user did).
Finally, a fourth area is support for a broader set of domains, as well as support for user-defined domains
(a.k.a. business types), i.e., the ability for advanced business users and data stewards to define validation, cleansing
and de-duplication rules for their own data domains in an easy and data-driven manner. Self-service data preparation
systems have an opportunity here to dramatically increase their usefulness by introducing these domains. To mention
a few of the possibilities:
i) The ability to involve profiling during the definition of a data domain would streamline the development of the
validation, cleansing and matching rules of the domain. We already commented on this in section 6.3.1, but we
believe that it is in self-service systems where this functionality is the most cost-effective. Profiling a column
would immediately reveal the usual patterns and/or frequent erroneous values of a domain under definition, and
enable the user to construct data validation, correction and matching rules for the domain directly from these
patterns.
ii) The ability to automatically detect that a column belongs to a certain domain (a problem that is being
framed as a machine learning classification problem) will improve the automated suggestion capabilities and the
subsequent execution of data validation and cleansing in these columns;
iii) The ability to use domains in keyword search will improve the discovery power of the system in finding datasets
containing instances of these domains;
iv) The ability to use columns assigned to a domain in operations such as joins may make these operations much
more robust, as the automated application of validation/cleansing operations on domain data will produce
more, and better-quality, join results. A simple, pervasive example is date/time data, which may appear in
different formats in two dataset columns and so not be directly joinable. Applying the right cleansing/
standardization rules will convert the data into joinable values. Again, the system may make smart
suggestions to help the user, given the data instances at hand.
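
Two of the points above, pattern profiling (i) and domain-aware joins (iv), can be illustrated with a small sketch in Python. It is only an illustration under stated assumptions: the character-class pattern notation, the list of accepted date formats (in particular, reading `05/03/2017` as day/month/year) and all function and column names are hypothetical choices, not features of any specific tool.

```python
import re
from collections import Counter
from datetime import datetime


def value_pattern(value):
    """Abstract a value into a character-class pattern: digits -> 9, letters -> A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile_patterns(values):
    """Count how often each pattern occurs in a column."""
    return Counter(value_pattern(v) for v in values)


# Hypothetical formats accepted by a user-defined "date" domain.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def standardize_date(value):
    """Return the ISO form of a date, or None if no accepted format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def join_on_date(left, right, key):
    """Inner join two lists of rows on a date column after standardizing it."""
    index = {}
    for row in right:
        index.setdefault(standardize_date(row[key]), []).append(row)
    joined = []
    for row in left:
        std = standardize_date(row[key])
        if std is None:
            continue  # invalid dates never match
        for match in index.get(std, []):
            joined.append({**match, **row, key: std})
    return joined


# Hypothetical column under profiling.
profile = profile_patterns(["2017-03-05", "2017-04-01", "05/03/2017", "n/a"])

# Hypothetical datasets whose date columns use different formats.
sales = [{"day": "2017-03-05", "amount": 10}]
calendar = [{"day": "05/03/2017", "holiday": False}]
joined = join_on_date(sales, calendar, "day")
```

The profile exposes two date layouts and one stray `n/a` value, which is the kind of evidence a user could turn into validation and standardization rules for the domain. With those rules applied, the two rows join on `2017-03-05` even though a plain equality join on the raw strings would return nothing.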