The integrity constraints that should be added to the target model are discussed on p. 330 (NULL constraints) and pp. 332–334 (primary key and foreign key constraints). In a nutshell, it is recommended to declare dimension columns NOT NULL (but not necessarily fact table columns), to use surrogate keys as primary keys in dimensions (but not necessarily in fact tables) and, where possible, to declare and enforce foreign keys between fact and dimension tables (unless enforcing them becomes too heavy for the time allotted).
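As a concrete illustration, the following sketch shows what these recommendations might look like when declared programmatically. It uses SQLAlchemy Core as one possible tool, and the table and column names are purely hypothetical.

    # Minimal sketch of the recommended constraints, using SQLAlchemy Core
    # (our choice of tool; table and column names are hypothetical).
    from sqlalchemy import (MetaData, Table, Column, Integer, String,
                            Numeric, Date, ForeignKey)

    metadata = MetaData()

    # Dimension: surrogate primary key, descriptive columns declared NOT NULL.
    customer_dim = Table(
        "customer_dim", metadata,
        Column("customer_key", Integer, primary_key=True),      # surrogate key
        Column("customer_name", String(100), nullable=False),   # NOT NULL
        Column("customer_city", String(100), nullable=False),   # NOT NULL
    )

    # Fact: foreign key enforced back to the dimension; measures may stay nullable.
    sales_fact = Table(
        "sales_fact", metadata,
        Column("customer_key", Integer,
               ForeignKey("customer_dim.customer_key"), nullable=False),
        Column("sale_date", Date, nullable=False),
        Column("sale_amount", Numeric(12, 2)),  # nullable measure is acceptable
    )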
6.2.4 Data Quality at Development Stage
At this stage, the key requirements (p. 381) are to develop a system that loads valid data, corrects invalid data where it can, and tags the invalid data it cannot correct (e.g., there may not be default rules for missing data in all cases); and then to highlight and explain the modifications made by standardization and cleansing rules, as well as the data that was rejected and why. Quality screens, also called validation rules, are the heart of the data quality architecture. Typical examples include single-column rules, multiple-column rules, and business rules and/or queries that test that integrity constraints hold. Profiling the data helps determine the necessary validation rules and the ensuing cleansing and matching rules that correct invalid and/or duplicate data.
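The following is a minimal sketch of what such quality screens can look like when hand-coded; the rules and field names are hypothetical and would in practice be derived from profiling the source data.

    # Illustrative quality screens (validation rules) over incoming rows,
    # expressed as plain Python predicates.
    from datetime import date

    def screen_row(row):
        """Return a list of rule names the row violates (empty list = clean)."""
        violations = []

        # Single-column rule: quantity must be a positive number.
        if not isinstance(row.get("quantity"), (int, float)) or row["quantity"] <= 0:
            violations.append("quantity_not_positive")

        # Single-column rule: country code must come from a known reference list.
        if row.get("country") not in {"US", "IN", "GB", "DE"}:
            violations.append("unknown_country_code")

        # Multiple-column rule: a shipment cannot precede its order.
        if isinstance(row.get("order_date"), date) and isinstance(row.get("ship_date"), date):
            if row["ship_date"] < row["order_date"]:
                violations.append("ship_before_order")

        return violations

    # Tag each incoming record instead of silently dropping it.
    rows = [{"quantity": 5, "country": "US",
             "order_date": date(2017, 1, 2), "ship_date": date(2017, 1, 5)},
            {"quantity": -1, "country": "XX",
             "order_date": date(2017, 1, 2), "ship_date": date(2017, 1, 1)}]
    for row in rows:
        row["quality_violations"] = screen_row(row)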
ETL tools vary a lot in their built-in data cleaning capabilities, but they generally allow the user to specify cleaning functionality via a proprietary rule language and/or transformations supported by a graphical user interface for building data pipelines. Validation rules usually have to be coded with column expression editors and hand-written functions, which can determine characteristics of incoming data with the help of aggregation functions. Cleansing can be done using a programming library that provides type conversions, string, arithmetic, scientific functions and the like. Some tools provide high-level transforms to cleanse specific domains (e.g., addresses, names, phone numbers). Matching is generally done through joins, string-matching functions (including fuzzy and wildcard matching) and, in some tools, high-level transforms for specifying matching policies based on several fields together with survivorship rules.
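By way of illustration, the snippet below sketches a hand-rolled cleansing function and a crude fuzzy match using only the Python standard library; the similarity threshold and the fields chosen are assumptions, not features of any particular ETL tool.

    # Sketch of cleansing and matching logic of the kind described above.
    import re
    from difflib import SequenceMatcher

    def cleanse_phone(raw):
        """Keep digits only; return None if the result is clearly not a phone number."""
        digits = re.sub(r"\D", "", raw or "")
        return digits if 7 <= len(digits) <= 15 else None

    def fuzzy_match(name_a, name_b, threshold=0.85):
        """Crude duplicate detection on names via a similarity ratio."""
        ratio = SequenceMatcher(None, name_a.lower().strip(),
                                name_b.lower().strip()).ratio()
        return ratio >= threshold

    print(cleanse_phone("(212) 555-0134"))   # "2125550134"
    print(fuzzy_match("Persistent Systems Ltd", "Persistent Systems Ltd."))  # True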
Rejecting invalid data is a topic of much debate. Some tips about what to do with such data appear on p. 383, "Responding to Quality Events"; the overall recommendation is to accept as much data as possible into the warehouse, tagging fact and dimension data with a special audit dimension (as illustrated in figure 9-2, p. 385) so that data that is not quite valid can be corrected down the line and, if desired, eliminated from query results. Statuses such as normal, anomalous, outlier, and impossible may be used. Also, as we discuss in the big data section below, data cleansing rules should, at a minimum, differentiate missing numeric fact values between "cannot exist" and "exists but is unknown". Ultimately, this is quite dependent on the type of data and on the type of analysis that will be performed once the data is accepted into the warehouse. As we will see in section 6.3.3.1 below, what to reject, and even what level of data quality is necessary, in big data repositories such as data lakes is more of an open question than in a traditional data warehouse: use cases are more varied, the queries to be asked of the data are figured out after it is ingested, and analytics workloads tend to be more advanced.
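A minimal sketch of this kind of tagging is shown below; the status thresholds and the field names are placeholders, and real rules would depend on the data being loaded.

    # Audit-style tagging for fact rows, using the status vocabulary above.
    def audit_status(amount, upper=1_000_000.0):
        """Classify a numeric measure as normal / anomalous / outlier / impossible."""
        if amount is None:
            return "anomalous"      # missing measure: tag it, do not silently reject it
        if amount < 0:
            return "impossible"
        if amount > upper:
            return "outlier"
        return "normal"

    def classify_missing(value, can_exist):
        """For missing measures, differentiate 'cannot exist' from 'exists but is unknown'."""
        if value is not None:
            return "present"
        return "exists_but_unknown" if can_exist else "cannot_exist"

    row = {"sale_amount": None, "discount_amount": None, "is_discounted": False}
    row["sale_amount_audit"] = audit_status(row["sale_amount"])               # 'anomalous'
    row["sale_amount_missing"] = classify_missing(row["sale_amount"], True)   # 'exists_but_unknown'
    row["discount_missing"] = classify_missing(row["discount_amount"],
                                               row["is_discounted"])          # 'cannot_exist'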
As [1], p. 383 suggests, it is recommended to persist audit data for each ETL pipeline run, along with detailed errors, in a specific error event schema. Not only will this provide the necessary explanation of what happened during an ETL pipeline run, it will also be the basis for making data quality auditable and monitorable over time. Make sure the schema has detailed structures to persist errors at the record level, and that the audit data contains measures of data quality for the pipelines being run, along with checksums that tie back to the data sources to cross-check the integrity of the data ([1], pp. 448 and 453). This tie-back to the source systems is important because it contributes to establishing the credibility of the warehouse.
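To make this concrete, the following sketch persists one audit record per pipeline run plus record-level error events; the schema fields and the checksum choice (an MD5 digest over key columns) are our assumptions rather than a prescription from [1].

    # Sketch of per-run audit data and record-level error events.
    import hashlib, json
    from datetime import datetime, timezone

    def source_checksum(rows, key_fields):
        """Checksum over key columns, to tie the load back to the source extract."""
        digest = hashlib.md5()
        for row in rows:
            digest.update(json.dumps([row.get(f) for f in key_fields],
                                     default=str).encode())
        return digest.hexdigest()

    def build_audit_record(pipeline, rows, error_events):
        return {
            "pipeline": pipeline,
            "run_at": datetime.now(timezone.utc).isoformat(),
            "rows_read": len(rows),
            "rows_in_error": len({e["row_id"] for e in error_events}),
            "source_checksum": source_checksum(rows, key_fields=["id"]),
        }

    # Record-level error events: one entry per violated rule, per record.
    error_events = [{"row_id": 42, "rule": "unknown_country_code",
                     "column": "country", "value": "XX"}]
    rows = [{"id": i, "country": "US"} for i in range(100)]
    audit = build_audit_record("daily_sales_load", rows, error_events)

In practice, both the audit record and the error events would be written to dedicated tables so that data quality measures can be queried and trended over time, in line with the monitoring goal described above.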