

— Completeness, defined as expected comprehensiveness; generally, the proportion of missing item values among those expected as mandatory (for a way to measure this and related dimensions, see the sketch after this list).
— Validity, also called conformity, the degree to which data follows the specification of its value domain (data type, size, range, and format).
— Consistency, the degree to which no conflicting versions of the same data item(s) appear in different places.
— Accuracy, the degree to which data correctly describes the real-world object or event being described (besides numerical and ordinal data, this also covers typographical errors or misspellings in string data).
— Timeliness, also called currency, the degree to which data represents reality from the required point in time.
— Integrity, the degree to which the data captures the semantic business rules governing the components of the data model (for instance, functional dependencies among data items, primary key / foreign key relationships among data items of different data sets, and, more generally, business rules governing the values of different records within the same or across different data sets).
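Several of these dimensions lend themselves to straightforward automated measurement. The following is a minimal sketch, assuming a small tabular data set held in a pandas DataFrame; the column names, the e-mail format rule, and the ordinal rating scale are illustrative assumptions, not prescriptions.

# Minimal, illustrative sketch: measuring a few of the dimensions above on a
# small tabular data set. Column names, the e-mail format rule, and the
# ordinal scale are hypothetical.
import pandas as pd

records = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email":       ["a@example.com", "b@example.com", "b@example", None],
    "rating":      ["high", "High", "hgih", "low"],
})

# Completeness: proportion of non-missing values among mandatory fields.
mandatory = ["customer_id", "email", "rating"]
completeness = 1 - records[mandatory].isna().mean().mean()

# Validity: degree to which values follow the specification of their value
# domain (here, a simple e-mail format and an agreed ordinal rating scale).
email_ok = records["email"].dropna().str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
scale = {"very high", "high", "medium", "low", "very low"}
rating_ok = records["rating"].str.lower().isin(scale)
validity = (email_ok.mean() + rating_ok.mean()) / 2

# Consistency (one narrow facet): no conflicting versions of the same item,
# approximated here as "at most one e-mail address per customer_id".
conflicts = records.groupby("customer_id")["email"].nunique().gt(1).sum()

print(f"completeness={completeness:.2f}  validity={validity:.2f}  "
      f"conflicting customer ids={conflicts}")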
A typical cause of low-quality (a.k.a. dirty) data is the collection of data from end users through data capture forms (data may be mistyped, or entered in the wrong field). Input masks can address some of these problems, but most probably won't correct them all: for instance, there may be no formal standards in the organization on how to enter qualitative attributes such as ordinal rating scales (e.g., very high / high / medium / low / very low), so data may become inaccurate or even inconsistent, i.e., there may be significant variations across time and across types of qualified items for this attribute. Another example: information may be deliberately distorted at data entry, for instance by entering a fictional name or identifier in a person data entry screen; it is not uncommon to find that this is caused by a broken business process. Overloaded codes are another curse inherited from an era when saving bits was important; fields with codes or comments used for unofficial or undocumented purposes are yet another. The list is long.
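As one illustration of cleaning the rating-scale example above, the following sketch normalizes free-text ordinal entries to a canonical scale. The canonical values and the synonym map are assumptions made for the example, not an organizational standard.

# Illustrative sketch: standardizing a qualitative rating attribute captured
# as free text. The canonical scale and the synonym map are assumed here.
from typing import Optional

CANONICAL_SCALE = ["very low", "low", "medium", "high", "very high"]

SYNONYMS = {
    "vh": "very high", "v.high": "very high",
    "hi": "high",
    "med": "medium", "avg": "medium",
    "lo": "low",
    "vl": "very low", "v.low": "very low",
}

def normalize_rating(raw: Optional[str]) -> Optional[str]:
    """Map a raw form entry to the canonical ordinal scale; None if unmappable."""
    if raw is None:
        return None
    value = raw.strip().lower()
    value = SYNONYMS.get(value, value)
    return value if value in CANONICAL_SCALE else None

# Unmappable entries come back as None for review rather than being guessed at.
entries = ["High", "v.high", "  med ", "hgih", None]
print([normalize_rating(e) for e in entries])
# -> ['high', 'very high', 'medium', None, None]

Returning None for unrecognized entries, rather than guessing, keeps the decision about ambiguous values with a person rather than hiding it in code.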
Validation rules may catch some of these errors, and other technology services that improve data quality can also help, but they will certainly not capture all possible data quality issues. As discussed throughout this chapter, ensuring that data is clean (or at least, clean enough) is the result of data governance, which implies an organizational commitment to a continuous quality improvement process. According to [3], the optimal level of data maintenance is not to achieve perfect data, but only a level at which the costs of the work to clean the data do not exceed the savings from the costs inflicted by poor-quality data.
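To make the point about validation rules concrete, here is a minimal sketch of record-level checks that could run at data capture time. The field names, the placeholder-name list, and the rules themselves are illustrative assumptions.

# Minimal sketch of record-level validation rules applied at data capture
# time. Field names, the placeholder-name list, and the rules are illustrative.
from datetime import date
from typing import List

def validate_person(record: dict) -> List[str]:
    """Return the list of rule violations; an empty list means the record passes."""
    errors = []
    name = record.get("name", "").strip()
    if not name:
        errors.append("name is mandatory")                    # completeness
    elif name.lower() in {"test", "asdf", "xxx"}:
        errors.append("name looks like a placeholder")        # deliberate distortion
    if record.get("birth_date") and record["birth_date"] > date.today():
        errors.append("birth_date cannot be in the future")   # validity
    if (record.get("start_date") and record.get("end_date")
            and record["end_date"] < record["start_date"]):
        errors.append("end_date precedes start_date")         # cross-field integrity
    return errors

print(validate_person({"name": "asdf",
                       "start_date": date(2024, 5, 1),
                       "end_date": date(2024, 1, 1)}))
# -> ['name looks like a placeholder', 'end_date precedes start_date']

Even so, such rules only catch what they were written to catch; governance is what keeps the rule set, and the definitions behind it, current.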
The glossary in chapter 9 below serves as a crash introduction to the vocabulary of data quality. In particular, we talk about data cleansing (or cleaning), and it is important to clarify its meaning with respect to a crucial aspect of data quality, namely, defining the validity of data, i.e., reaching agreement on the definitions of data elements and their shared value domains, in order to strive for enterprise-wide consistency and accuracy of data elements. This is the role of the data stewards discussed in the previous section. As mentioned in chapter 5, the secret sauce for integrating data from different departments using a dimensional data model is to strive for standardized, conformed dimensions and conformed facts. Lack of conformity of dimensions or facts is certainly a data quality problem under our definition. Data cleansing may help in determining and cleansing the dimension attribute values to be shared.
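As a closing illustration of conformed dimensions, the sketch below maps two departmental encodings of a region attribute onto one shared value set so that their facts can be compared. The departmental codes, the crosswalk, and the conformed values are hypothetical; in practice they would be agreed and maintained by the data stewards.

# Illustrative sketch: conforming a "region" dimension attribute shared by
# two departmental data sets. Codes, crosswalk, and value set are hypothetical.
import pandas as pd

# Each department encodes region differently in its source system.
sales = pd.DataFrame({"region_code": ["N", "S", "N"], "revenue": [120, 80, 95]})
support = pd.DataFrame({"region_code": ["north", "SOUTH", "north"], "tickets": [12, 7, 9]})

# The conformed value domain and the crosswalk from departmental codes to it.
CONFORMED_REGIONS = {"North", "South"}
CROSSWALK = {"n": "North", "north": "North", "s": "South", "south": "South"}

sales["region"] = sales["region_code"].str.lower().map(CROSSWALK)
support["region"] = support["region_code"].str.lower().map(CROSSWALK)
assert set(sales["region"]) <= CONFORMED_REGIONS
assert set(support["region"]) <= CONFORMED_REGIONS

# With both fact tables referencing the same conformed values, they can be
# combined meaningfully.
report = (sales.groupby("region")["revenue"].sum().to_frame()
               .join(support.groupby("region")["tickets"].sum()))
print(report)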