
The integrity constraints that should be added to the target model are discussed on p. 330 (NULL constraints) and pp. 332–334 (primary key and foreign key constraints). In a nutshell, it is recommended to declare dimension columns as NOT NULL (but not necessarily fact table columns), to use surrogate keys as primary keys in dimensions (but not necessarily in fact tables) and, if possible, to declare and enforce foreign keys between fact and dimension tables (unless doing so becomes too heavy for the time allotted).
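As an illustration only, the sketch below declares these constraints with SQLAlchemy table metadata; the table and column names (dim_customer, fact_sales) are hypothetical and not taken from the referenced text.

# Minimal sketch of the recommended constraints, using SQLAlchemy metadata.
# Table and column names (dim_customer, fact_sales, ...) are hypothetical.
from sqlalchemy import (MetaData, Table, Column, Integer, Numeric,
                        String, ForeignKey, create_engine)

metadata = MetaData()

# Dimension: surrogate primary key, descriptive columns declared NOT NULL.
dim_customer = Table(
    "dim_customer", metadata,
    Column("customer_key", Integer, primary_key=True),   # surrogate key
    Column("customer_name", String(100), nullable=False),
    Column("customer_city", String(100), nullable=False),
)

# Fact: foreign key enforced against the dimension; measures may stay nullable.
fact_sales = Table(
    "fact_sales", metadata,
    Column("sales_id", Integer, primary_key=True),
    Column("customer_key", Integer,
           ForeignKey("dim_customer.customer_key"), nullable=False),
    Column("sales_amount", Numeric(12, 2)),               # nullable measure
)

if __name__ == "__main__":
    # Emit the DDL against an in-memory SQLite database for inspection.
    engine = create_engine("sqlite:///:memory:")
    metadata.create_all(engine)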

6.2.4 Data Quality at Development Stage

At this stage, the key requirements (p. 381) are to develop a system that can load valid data, correcting invalid data and tagging the invalid data it cannot correct (e.g., there may be no default rules for missing data in all cases); and then to highlight and explain the modifications related to standardization and rules, as well as the data that was rejected, and why. Quality screens, also called validation rules, are the heart of the data quality architecture. Typical examples of validation rules are single-column rules, multiple-column rules, and business rules and/or queries testing that integrity constraints hold. Profiling the data will help determine the necessary validation rules and the ensuing cleansing and matching rules that correct invalid and/or duplicate data.
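To make the idea of quality screens concrete, the following sketch (our illustration, with invented column names and rules, not code from the referenced text) applies a single-column rule, a multi-column rule, and a simple business rule to an incoming record and collects the violations instead of silently dropping the row.

# Hypothetical quality screens: each screen is a (name, predicate) pair that
# a record must satisfy; violations are collected for later review.
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Screen = Tuple[str, Callable[[Record], bool]]

screens: List[Screen] = [
    # Single-column screen: the column must be present and non-empty.
    ("customer_name_not_null", lambda r: bool(r.get("customer_name"))),
    # Multi-column screen: ship date cannot precede order date.
    ("ship_after_order", lambda r: r.get("ship_date") is None
                                   or r["ship_date"] >= r["order_date"]),
    # Business-rule screen: discounts are bounded.
    ("discount_in_range", lambda r: 0 <= r.get("discount", 0) <= 0.5),
]

def run_screens(record: Record) -> List[str]:
    """Return the names of the screens this record violates."""
    return [name for name, check in screens if not check(record)]

if __name__ == "__main__":
    rec = {"customer_name": "", "order_date": "2017-01-10",
           "ship_date": "2017-01-08", "discount": 0.8}
    print(run_screens(rec))
    # -> ['customer_name_not_null', 'ship_after_order', 'discount_in_range']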

ETL tools vary widely in their built-in data cleaning capabilities, but they allow the user to specify cleaning functionality via proprietary rule languages and/or transformations supported by a graphical user interface for building data pipelines. Validation rules generally must be coded with column expression editors and hand-coded functions, which can determine characteristics of the incoming data with the help of aggregation functions. Cleansing can be done using a programming library that helps with type conversions, string, arithmetic, and scientific functions and the like. Some tools provide high-level transforms to cleanse specific domains (e.g., addresses, names, phone numbers). Matching is generally done through joins, string-matching functions (including fuzzy and wildcard matching) and, in some tools, high-level transforms for specifying matching policies based on several fields and survivorship rules.
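Outside of such tools, the same matching idea can be sketched with the standard library alone. The example below is our illustration, with invented field names and an assumed 0.85 similarity threshold: it scores candidate duplicates by fuzzy similarity on two fields and applies a trivial survivorship rule.

# Rough fuzzy-matching sketch using only the Python standard library.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized similarity of two strings, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Declare a match when name and city are both similar enough."""
    return (similarity(rec_a["name"], rec_b["name"]) >= threshold and
            similarity(rec_a["city"], rec_b["city"]) >= threshold)

def survivor(rec_a: dict, rec_b: dict) -> dict:
    """Trivial survivorship rule: keep the most recently updated record."""
    return rec_a if rec_a["updated"] >= rec_b["updated"] else rec_b

if __name__ == "__main__":
    a = {"name": "Acme Corp.", "city": "Pune", "updated": "2017-03-01"}
    b = {"name": "ACME Corp", "city": "Pune", "updated": "2017-01-15"}
    if is_match(a, b):
        print(survivor(a, b))   # keeps the newer record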

Rejecting invalid data is a topic of much debate. Some tips about what to do with such data appear on p. 383, "responding to quality events"; the overall recommendation is to accept as much data as possible into the warehouse, tagging fact and dimension data with a special audit dimension, as illustrated in figure 9-2, p. 385, so that data that is not quite valid can be corrected down the line, and can also be eliminated from query results if desired. Statuses such as normal, anomalous, outlier, and impossible may be used. Also, as we discuss in the big data section below, as a minimum, data cleansing rules should differentiate missing numeric fact values between "cannot exist" and "exists but is unknown". Ultimately, this is quite dependent on the type of data and on the type of analysis that will be performed once the data is accepted into the warehouse. As we will see in section 6.3.3.1 below, what to reject, and even what level of data quality is necessary, in big data repositories such as data lakes is more of an open question than in a traditional data warehouse, as use cases are more varied, the queries to be asked of the data are figured out after it is ingested, and analytics workloads tend to be more advanced.
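By way of illustration, the fragment below is our sketch of this tagging approach: the status values follow the text, while the record layout and the two "missing" markers are assumptions. Each incoming fact row is tagged with an audit status instead of being rejected, and a measure that cannot exist is kept distinct from one that exists but is unknown.

# Sketch of tagging rows with an audit/quality status rather than rejecting them.
# Statuses follow the text; the handling of missing measures is an assumption.
from enum import Enum
from typing import Optional

class QualityStatus(Enum):
    NORMAL = "normal"
    ANOMALOUS = "anomalous"
    OUTLIER = "outlier"
    IMPOSSIBLE = "impossible"

# Two distinct "missing" markers, so downstream analysis can tell them apart.
CANNOT_EXIST = "cannot_exist"          # the measure does not apply to this row
UNKNOWN = "exists_but_unknown"         # the source failed to send a value

def tag_fact(amount: Optional[float], applicable: bool) -> dict:
    """Return the stored measure plus its audit status."""
    if not applicable:
        return {"amount": None, "missing_reason": CANNOT_EXIST,
                "status": QualityStatus.NORMAL}
    if amount is None:
        return {"amount": None, "missing_reason": UNKNOWN,
                "status": QualityStatus.ANOMALOUS}
    if amount < 0:
        return {"amount": amount, "missing_reason": None,
                "status": QualityStatus.IMPOSSIBLE}
    return {"amount": amount, "missing_reason": None,
            "status": QualityStatus.NORMAL}

if __name__ == "__main__":
    print(tag_fact(None, applicable=True))    # exists but is unknown
    print(tag_fact(None, applicable=False))   # cannot exist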

As [1], p. 383 suggests, it is recommended to persist audit data for each ETL pipeline run, along with detailed errors, in a specific error event schema. Not only will it provide the necessary explanations for what happened during an ETL pipeline run, it will also be the basis for making data quality auditable and monitorable over time. Make sure it has detailed structures to persist errors at the record level, and that the audit data contains measures of data quality for the pipelines being run, along with checksums tying back to the data sources to cross-check the integrity of the data ([1], pp. 448 and 453). The tie-back to the source data systems is important because it contributes to establishing the credibility of the warehouse.
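A very small sketch of such an error event schema and per-run audit record is given below; the table layout and the use of row counts plus an MD5 checksum to tie the load back to the source extract are our assumptions, not the schema from [1].

# Hypothetical error-event and audit tables for ETL runs, kept in SQLite.
import hashlib
import sqlite3

DDL = """
CREATE TABLE audit_run (
    run_id          INTEGER PRIMARY KEY,
    pipeline_name   TEXT NOT NULL,
    started_at      TEXT NOT NULL,
    rows_read       INTEGER,
    rows_loaded     INTEGER,
    source_checksum TEXT        -- ties the load back to the source extract
);
CREATE TABLE error_event (
    event_id   INTEGER PRIMARY KEY,
    run_id     INTEGER NOT NULL REFERENCES audit_run(run_id),
    record_key TEXT,            -- identifies the offending source record
    screen     TEXT NOT NULL,   -- which validation rule fired
    detail     TEXT
);
"""

def source_checksum(rows) -> str:
    """MD5 over the raw source rows, used to cross-check integrity later."""
    digest = hashlib.md5()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    rows = [("cust-1", 120.0), ("cust-2", None)]
    conn.execute(
        "INSERT INTO audit_run VALUES (1, 'load_sales', '2017-06-01', ?, ?, ?)",
        (len(rows), 1, source_checksum(rows)))
    conn.execute(
        "INSERT INTO error_event VALUES (1, 1, 'cust-2', 'amount_not_null', "
        "'missing sales amount')")
    print(conn.execute("SELECT * FROM error_event").fetchall())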