


Other aspects of our experience with data quality tools include the following:

On profiling tools: most profiling tools have user interfaces where users manually specify the columns to inspect. In large enterprises, the data to profile is large, and manual profiling would be very impractical. For relationship profiling on columns of two tables, tools traditionally allow pairwise column dependency profiling. Architects need to look for solutions that automate profiling across all columns and over entire schemas. We have also found that it is best to look for a data profiling solution that enables you to construct data validation and correction routines directly from the profiling reports. This will help you combine the data inspection and correction phases, which will streamline your overall data quality development process. We will look again at this aspect in the discussion of self-service data preparation systems below, as business users are those who would benefit most from streamlining.
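As an illustration of schema-wide automation, here is a minimal sketch of profiling every column of every table with no manual selection, assuming the tables are available as pandas DataFrames; profile_column, profile_schema, and the statistics chosen are invented for the example, not any particular vendor’s API.

    # Minimal schema-wide profiling sketch (assumed: pandas DataFrames).
    import pandas as pd

    def profile_column(series: pd.Series) -> dict:
        """Basic per-column statistics most profilers start from."""
        n = len(series)
        return {
            "null_pct": float(series.isna().mean() * 100) if n else 0.0,
            "distinct": int(series.nunique()),
            "candidate_key": n > 0 and series.nunique() == n,
        }

    def profile_schema(tables: dict) -> dict:
        """Profile all columns of all tables, with no manual selection."""
        return {
            (name, col): profile_column(df[col])
            for name, df in tables.items()
            for col in df.columns
        }

A report produced this way can feed validation directly: for instance, a column profiled at 0% nulls can automatically yield a NOT NULL validation rule for the correction phase.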

Data mining, statistics, and machine learning technology is increasingly being applied to address data quality problems. We devote a subsection (see 6.3.3.3 below) of the big data quality section to this subject, even though the subject is not strictly specific to big data.
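As one concrete illustration of the kind of technique involved, here is a minimal sketch of statistically flagging suspect rows with scikit-learn’s IsolationForest; the column selection, the 1% contamination rate, and the fillna(0) imputation are assumptions of the example, not a recommendation.

    # Minimal anomaly-flagging sketch (assumed: scikit-learn, numeric columns).
    import pandas as pd
    from sklearn.ensemble import IsolationForest

    def flag_suspect_rows(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
        """Mark rows whose numeric profile is a statistical outlier."""
        model = IsolationForest(contamination=0.01, random_state=0)
        out = df.copy()
        # fit_predict returns -1 for outliers and 1 for inliers.
        out["suspect"] = model.fit_predict(out[numeric_cols].fillna(0)) == -1
        return out

Rows flagged this way would go to a data steward for review rather than being corrected automatically, in line with the governance point below.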

Data governance aspects such as measuring and monitoring quality are often neglected. Customers want to easily drill down and navigate to the specific areas of the warehouse where quality is being monitored and find out about the quality measures there. Our experience is that this area is often overlooked and, when addressed, it is done in a separate product, which brings usability and productivity problems. On the other hand, it is true that measuring and monitoring quality is the natural role of a data steward, who is a less technical user, and this explains why data quality governance may be built on a separate product offer (that separate product may be a self-service data preparation product, which we comment on in section 6.3.4 below).
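To make “quality measures” concrete, here is a minimal sketch of the per-table indicators a monitoring layer might compute for drill-down; the metric set and record layout are invented for the example, not any product’s data model.

    # Minimal quality-monitoring sketch (assumed: pandas DataFrames).
    from datetime import datetime, timezone
    import pandas as pd

    def quality_metrics(table: str, df: pd.DataFrame, key_cols: list) -> dict:
        """Simple, drillable quality indicators for one table."""
        rows = len(df)
        return {
            "table": table,
            "measured_at": datetime.now(timezone.utc).isoformat(),
            "row_count": rows,
            "completeness_pct": float(df.notna().mean().mean() * 100) if rows else 100.0,
            "duplicate_keys": int(df.duplicated(subset=key_cols).sum()),
        }

Stored over time, records like these give the steward the navigation and trend view described above.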

Data governance should also include review steps for automatic data quality actions. People do not always trust changes proposed by automatic data quality actions in production systems. This is a healthy attitude, as experience today shows that the results of such tools on real-world datasets are far from perfect. Reviewing these changes is another of the duties of data stewards.
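A minimal sketch of such a review step, assuming automatic corrections are staged as proposals rather than applied directly; the ChangeProposal structure and the apply_fn hook are invented for illustration.

    # Minimal steward-review sketch; all names are invented.
    from dataclasses import dataclass

    @dataclass
    class ChangeProposal:
        table: str
        row_key: str
        column: str
        old_value: object
        new_value: object
        reason: str             # e.g., "standardized postal code"
        approved: bool = False  # flipped only by an explicit steward decision

    def apply_approved(proposals: list, apply_fn) -> int:
        """Apply only the proposals a steward has explicitly approved."""
        applied = 0
        for p in proposals:
            if p.approved:
                apply_fn(p)     # e.g., issue the UPDATE against the target
                applied += 1
        return applied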

If quality processing of both streaming data and “static” data needs to occur in the project, you should favor tools or a streaming-oriented architecture that allow operating on both types of data. This favors reuse of existing pipelines on either type of data, and it avoids having to develop explicit batch-oriented “delta” loads, by applying the same changes performed during an initial load to data being streamed in as it changes at the sources. In our experience, attention needs to be paid to the degree to which these architectures are robust with respect to schema changes at the data source. Indeed, we have seen that pipelines frequently break upon source schema changes.
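A minimal sketch of the reuse idea, assuming one cleansing routine shared by the batch and streaming paths; the field names are invented, and the .get() defaults are one defensive tactic against the schema drift just mentioned.

    # Minimal shared batch/stream cleansing sketch; field names invented.
    from typing import Iterable, Iterator

    def cleanse(record: dict) -> dict:
        """Normalize one record, whether a batch row or a stream event."""
        # .get() defaults keep the routine from breaking when the source
        # schema adds or drops columns.
        return {
            "customer_id": str(record.get("customer_id", "")).strip(),
            "email": str(record.get("email", "")).strip().lower(),
            "country": str(record.get("country", "unknown")).upper(),
        }

    def run_batch(rows: Iterable) -> list:
        return [cleanse(r) for r in rows]

    def run_stream(events: Iterator) -> Iterator:
        for event in events:      # e.g., consumed from a message broker
            yield cleanse(event)  # same logic, no separate "delta" load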

Last, I’d like to comment on the relationship between integrity rule violations and data quality. In theory, developers could rely on the constraints set up on the target DBMS by the project’s database designer (e.g., key constraints, foreign key / primary key constraints, NOT NULL, and the like). In practice, developers have to go through the trouble of developing validation rules in the ETL tool, which sounds like duplicate work, even though the ELT processing mode has become mainstream these days, as mentioned above in section 3. The reason is that database engines catch constraint violations with proprietary SQL syntax (even though the SQL-99 standard has comprehensive support for integrity constraints), and ETL engines do not go to the trouble of generating engine-specific code. (Personally, having worked first in the database field and later in data integration, it came as a surprise to me that I never saw an ETL tool allow developers to provide treatment for rows violating a DBMS constraint.)
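To illustrate the duplicated work in question, here is a minimal sketch of ETL-side rules mirroring two typical DBMS constraints and giving violating rows an explicit treatment (routing to a reject set) instead of a load failure; the constraints and field names are assumptions of the example, not any tool’s generated code.

    # Minimal ETL-side constraint-validation sketch; names are invented.
    def validate(row: dict, known_customer_ids: set) -> list:
        """Return constraint-style violations for one row."""
        violations = []
        if row.get("order_id") is None:
            violations.append("NOT NULL violated: order_id")
        if row.get("customer_id") not in known_customer_ids:
            violations.append("FK violated: customer_id -> customers")
        return violations

    def split_rows(rows: list, known_customer_ids: set):
        """Route valid rows to the load and invalid rows to a reject set."""
        accepted, rejected = [], []
        for row in rows:
            errors = validate(row, known_customer_ids)
            if errors:
                rejected.append({"row": row, "errors": errors})
            else:
                accepted.append(row)
        return accepted, rejected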