

Other aspects of our experience with data quality tools include the following:
— On profiling tools: Most profiling tools have user interfaces where users can manually specify columns to inspect. In large enterprises, the data to profile is large, and manual profiling would be very impractical. For relationship profiling on columns of two tables, tools traditionally allow pairwise column dependency profiling. Architects need to look for solutions that automate profiling across all columns and over entire schemas (a sketch of such automation follows this item). We have also found that it is best to look for a data profiling solution that enables you to construct data validation and correction routines directly from the profiling reports. This helps you combine the data inspection and correction phases, which streamlines your overall data quality development process. We will look at this aspect again under self-service data preparation systems below, as business users are those who would benefit most from this streamlining.
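To make the automation point concrete, here is a minimal sketch in Python (our own illustration, not the interface of any particular product; pandas DataFrames stand in for already-extracted tables, and names such as profile_schema are hypothetical). It profiles every column of every table and tests every cross-table column pair for an inclusion dependency, so no manual column picking is needed:

```python
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Single-column profile: null rate, distinct count, key candidacy."""
    n = len(series)
    distinct = series.nunique(dropna=True)
    return {
        "null_rate": float(series.isna().mean()) if n else 0.0,
        "distinct": distinct,
        "candidate_key": n > 0 and distinct == n,  # all values present and unique
    }

def profile_schema(tables: dict) -> dict:
    """Profile every column of every table, then test every cross-table
    column pair for an inclusion dependency (a foreign-key candidate)."""
    report = {"columns": {}, "inclusions": []}
    for tname, df in tables.items():
        for col in df.columns:
            report["columns"][f"{tname}.{col}"] = profile_column(df[col])
    # Pairwise inclusion check across *all* columns -- no manual picking.
    cols = [(t, c, set(df[c].dropna())) for t, df in tables.items() for c in df.columns]
    for t1, c1, v1 in cols:
        for t2, c2, v2 in cols:
            if t1 != t2 and v1 and v1 <= v2:  # values of c1 all appear in c2
                report["inclusions"].append(f"{t1}.{c1} -> {t2}.{c2}")
    return report

tables = {
    "orders": pd.DataFrame({"id": [1, 2, 3], "cust": [10, 10, 20]}),
    "customers": pd.DataFrame({"id": [10, 20], "name": ["a", None]}),
}
print(profile_schema(tables)["inclusions"])  # e.g. ['orders.cust -> customers.id', ...]
```

On large schemas this quadratic pairwise pass would need sampling or approximate checks to stay tractable, which is exactly the kind of automation to look for in a product rather than build by hand.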
— Data mining, statistics and machine learning technology is increasingly being applied to address data quality problems. We devote a subsection of the big data quality section to this subject (see 6.3.3.3 below), even though the subject is not strictly specific to big data.
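As a small taste of what this looks like in practice (our own illustration, using scikit-learn's IsolationForest; the data and the contamination setting are made-up assumptions), statistically anomalous rows can be flagged as quality suspects for human review:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Numeric measurements, with two rows whose values look implausible
# (e.g., a mis-keyed decimal point) -- classic data quality suspects.
rows = np.array([
    [52.0, 170.0], [61.5, 165.0], [58.0, 172.0], [63.0, 168.0],
    [55.0, 175.0], [5900.0, 171.0], [57.0, 17.2],  # last two look wrong
])

# contamination is the expected share of bad rows; a tunable assumption.
model = IsolationForest(contamination=0.3, random_state=0)
flags = model.fit_predict(rows)  # -1 marks an outlier

for row, flag in zip(rows, flags):
    if flag == -1:
        print("review:", row)  # route to a data steward, do not auto-fix
```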
— Data governance aspects such as measuring and monitoring quality are often neglected. Customers want to easily drill and navigate to the specific areas of the warehouse where quality is being monitored and find out about quality measures. Our experience is that this area is often overlooked and, when addressed, it is done in a separate product, which brings usability and productivity problems. On the other hand, it is true that measuring and monitoring quality is the natural role of a data steward, who is a less technical user, and this explains why data quality governance may be built on a separate product offering; that separate product may be a self-service data preparation product, which we comment on in section 6.3.4 below. Data governance should also include review steps for automatic data quality actions. People do not always trust changes proposed by automatic data quality actions in production systems. This is a healthy attitude, as experience today shows that the results of these tools on real-world datasets are far from perfect. Reviewing such proposed changes is another of the duties of data stewards; a sketch of this review step follows this item.
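The review step can be as simple as a queue of proposed changes that nothing applies until a steward signs off. The sketch below is our own illustration; all class and field names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    """One change suggested by an automatic data quality rule."""
    table: str
    row_id: int
    column: str
    old_value: object
    new_value: object
    rule: str          # which rule produced the suggestion
    approved: bool = False

@dataclass
class ReviewQueue:
    """Proposals wait here until a steward approves or rejects them."""
    pending: list = field(default_factory=list)

    def propose(self, p: Proposal) -> None:
        self.pending.append(p)       # nothing touches production yet

    def approve(self, p: Proposal) -> None:
        p.approved = True            # only approved changes get applied

    def apply_approved(self, apply_fn) -> None:
        for p in [p for p in self.pending if p.approved]:
            apply_fn(p)              # e.g., issue an UPDATE downstream
            self.pending.remove(p)

queue = ReviewQueue()
queue.propose(Proposal("customers", 42, "country", "U.S.A", "US",
                       rule="standardize_country_codes"))
# A steward inspects queue.pending, then:
queue.approve(queue.pending[0])
queue.apply_approved(lambda p: print(f"UPDATE {p.table} SET {p.column}="
                                     f"{p.new_value!r} WHERE id={p.row_id}"))
```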
— If quality processing of both streaming data and “static” data needs to occur in the project, you should favor tools or a streaming-oriented architecture that allow operating on both types of data. This favors reuse of existing pipelines on either type of data, and it avoids having to develop explicit batch-oriented “delta” loads, by applying the same changes made during an initial load to data being streamed in as it changes in the sources; a sketch of this reuse follows this item. In our experience, attention needs to be paid to the degree to which these architectures are robust with respect to schema changes at the data source. Indeed, we have seen that pipelines frequently break upon source schema changes.
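The reuse argument is easy to see in code. In the following sketch (our own illustration; cleanse and the sample records are hypothetical), one cleansing function serves both the batch initial load and the change stream, so there is no separate “delta” logic to write or to keep in sync:

```python
from typing import Iterable, Iterator

def cleanse(record: dict) -> dict:
    """The one cleansing rule set, shared by batch and streaming paths."""
    out = dict(record)
    out["email"] = (out.get("email") or "").strip().lower()
    out["country"] = {"U.S.A": "US", "UK": "GB"}.get(out.get("country"),
                                                     out.get("country"))
    return out

def run_batch(rows: Iterable[dict]) -> list:
    """Initial load: cleanse a finite collection in one pass."""
    return [cleanse(r) for r in rows]

def run_stream(events: Iterator[dict]) -> Iterator[dict]:
    """Change stream: cleanse records one by one as they arrive."""
    for event in events:
        yield cleanse(event)   # same rules, no separate "delta" logic

initial = [{"email": " A@X.COM ", "country": "U.S.A"}]
changes = iter([{"email": "B@Y.COM", "country": "UK"}])
print(run_batch(initial))          # [{'email': 'a@x.com', 'country': 'US'}]
print(list(run_stream(changes)))   # [{'email': 'b@y.com', 'country': 'GB'}]
```

Note that cleanse assumes specific field names; this is the fragility mentioned above, since a schema change at the source breaks both paths at once.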
— Last, I would like to comment on the relationship between integrity rule violations and data quality. In theory, developers could rely on the constraints on the target DBMS set up by the project’s database designer (e.g., key constraints, foreign key / primary key constraints, NOT NULL, and the like). In practice, developers have to go through the trouble of developing validation rules in the ETL tool, which sounds like duplicate work, even though these days the ELT processing mode has become mainstream, as mentioned above in section 3. The reason is that database engines catch constraint violations with proprietary SQL syntax (even though the SQL-99 standard has comprehensive support for integrity constraints), and ETL engines do not go to the trouble of generating specific code for each engine. Personally, having worked first in the database field and later in data integration, it came as a surprise to me that I never saw an ETL tool that lets developers provide treatment for rows violating a DBMS constraint; a sketch of such treatment follows.
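To close, here is a minimal sketch of the missing treatment, using Python’s built-in sqlite3 (our own illustration; a real ETL engine would have to generate the equivalent engine-specific SQL, which is precisely the work they avoid). Rows that violate a declared constraint are caught and routed to a reject table instead of aborting the load:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
con.execute("CREATE TABLE rejects (id INTEGER, email TEXT, reason TEXT)")

incoming = [
    (1, "a@x.com"),
    (1, "b@y.com"),   # duplicate key -> violates PRIMARY KEY
    (2, None),        # violates NOT NULL
    (3, "c@z.com"),
]

for row in incoming:
    try:
        # Let the DBMS enforce the constraints the designer declared...
        con.execute("INSERT INTO customers VALUES (?, ?)", row)
    except sqlite3.IntegrityError as exc:
        # ...and give violating rows explicit treatment instead of failing.
        con.execute("INSERT INTO rejects VALUES (?, ?, ?)", (*row, str(exc)))

print(con.execute("SELECT * FROM customers").fetchall())
# [(1, 'a@x.com'), (3, 'c@z.com')]
print(con.execute("SELECT * FROM rejects").fetchall())
# e.g. [(1, 'b@y.com', 'UNIQUE constraint failed: customers.id'),
#       (2, None, 'NOT NULL constraint failed: customers.email')]
```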