


- Completeness, defined as expected comprehensiveness, generally measured as the proportion of missing item values among those expected as mandatory;
- Validity, also called conformity, the degree to which data follows the specification of its value domain (data type, size, range, and format);
- Consistency, the degree to which no conflicting versions of the same data item(s) appear in different places;
- Accuracy, the degree to which data correctly describes the real-world object or event being described (besides numerical and ordinal data, this also covers typographical errors or misspellings in string data);
- Timeliness, also called currency, the degree to which data represents reality from the required point in time;
- Integrity, the degree to which the data captures the semantic business rules governing the components of the data model (for instance, functional dependencies among data items, primary key / foreign key relationships among data items of different data sets and, more generally, business rules governing the values of different records within the same or across different data sets).
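To make a few of these dimensions concrete, here is a minimal sketch, in Python, of how completeness and validity could be measured over a handful of records; the record layout, the list of mandatory fields, and the domain rules (email format, age range) are invented for illustration and are not taken from the text.

```python
import re

# Hypothetical customer records; field names, mandatory fields, and domain
# rules are illustrative assumptions, not taken from the whitepaper.
records = [
    {"id": "C001", "email": "anna@example.com", "age": 34},
    {"id": "C002", "email": "", "age": 213},                # missing email, implausible age
    {"id": "C003", "email": "bob_at_example", "age": 41},   # malformed email
]

MANDATORY = ["id", "email", "age"]
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def completeness(rows):
    """Proportion of mandatory item values that are actually present."""
    expected = len(rows) * len(MANDATORY)
    present = sum(1 for r in rows for f in MANDATORY if r.get(f) not in (None, ""))
    return present / expected

def validity(rows):
    """Proportion of rows whose values conform to their declared value domains."""
    def row_ok(r):
        return (bool(EMAIL_RE.match(r.get("email", "")))
                and isinstance(r.get("age"), int)
                and 0 < r["age"] < 120)
    return sum(row_ok(r) for r in rows) / len(rows)

print(f"completeness: {completeness(records):.2f}")  # 0.89 (8 of 9 mandatory values present)
print(f"validity:     {validity(records):.2f}")      # 0.33 (only C001 satisfies every domain rule)
```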

A typical cause of low-quality (a.k.a. dirty) data is the collection of data by end users through data capture forms (data may be mistyped, or entered in the wrong field). Input masks can address some of these problems, but most probably won't correct them all: for instance, there may be no formal standards in the organization on how to enter qualitative attributes such as ordinal rating scales (e.g., very high / high / medium / low / very low), so the data may become inaccurate or even inconsistent, i.e., there may be significant variations across time and across the type of item being rated for this attribute. Another example: information may be deliberately distorted at data entry, for instance by entering a fictional name or identifier in a person data entry screen; it is not uncommon to find that this is caused by a broken business process. Overloaded codes are another curse of an era when saving bits was important; fields with codes or comments used for unofficial or undocumented purposes are yet another. The list is long.
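As an illustration of the ordinal-rating problem just described, the following sketch normalizes free-text rating entries onto a single canonical scale and flags anything it cannot interpret for manual review; the canonical scale matches the example above, while the observed aliases are invented.

```python
# A minimal sketch of standardizing an ordinal rating attribute captured as free text.
# The alias table is an illustrative assumption, not a value set from the whitepaper.

CANONICAL = ["very high", "high", "medium", "low", "very low"]

# Variants observed in the field, mapped to the canonical scale.
ALIASES = {
    "v. high": "very high", "vh": "very high",
    "hi": "high",
    "med": "medium", "avg": "medium",
    "lo": "low",
    "v. low": "very low", "vl": "very low",
}

def normalize_rating(raw: str) -> str | None:
    """Return the canonical rating, or None if the entry cannot be interpreted."""
    value = raw.strip().lower()
    if value in CANONICAL:
        return value
    return ALIASES.get(value)  # None signals a value that needs manual review

for entered in ["High", " med ", "VH", "dunno"]:
    print(f"{entered!r:10} -> {normalize_rating(entered)!r}")
# 'High'     -> 'high'
# ' med '    -> 'medium'
# 'VH'       -> 'very high'
# 'dunno'    -> None
```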

Validation rules may catch some of these errors, and other technology services that improve data quality can also help, but they will certainly not capture all possible data quality issues. As discussed throughout this chapter, ensuring that data is clean, or at least clean enough (according to [3], the optimal level of data maintenance is not to achieve perfect data, but only a level at which the costs of the work to clean the data do not exceed the savings on the costs inflicted by poor-quality data), is the result of data governance, which implies an organizational commitment to a continuous quality improvement process.

The glossary in chapter 9 below serves as a crash introduction to the vocabulary of data quality. In particular, we talk about data cleansing (or cleaning), and it is important to clarify its meaning with respect to a crucial aspect of data quality, namely defining the validity of data, i.e., agreeing on the definitions of data elements and their shared value domains, in order to strive for enterprise-wide consistency and accuracy of data elements. This is the role of data stewards, about whom we talked in the previous section. As mentioned in chapter 5, the secret sauce for integrating data from different departments using a dimensional data model is to strive for standardized, conformed dimensions and conformed facts. A lack of conformity of dimensions or facts certainly becomes a data quality problem under our definition. Data cleansing may help in determining and cleansing the dimension attribute values to be shared.
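As a rough sketch of what conforming a shared dimension attribute can look like in practice, the example below maps two departments' local region codes onto one shared value domain and sets aside values that do not fit, leaving them for a data steward to resolve; the department codes and the conformed vocabulary are invented for illustration.

```python
# A minimal sketch of conforming one dimension attribute shared by two departments.
# Department codes and the conformed value domain are invented for illustration.

CONFORMED_REGIONS = {"NORTH", "SOUTH", "EAST", "WEST"}

# Each department's local codes mapped onto the shared (conformed) value domain.
SALES_TO_CONFORMED = {"N": "NORTH", "S": "SOUTH", "E": "EAST", "W": "WEST"}
SUPPORT_TO_CONFORMED = {"north-1": "NORTH", "south-1": "SOUTH", "east-1": "EAST", "west-1": "WEST"}

def conform(rows, mapping, attribute="region"):
    """Rewrite a local attribute onto the conformed value domain, flagging misfits."""
    clean, rejected = [], []
    for row in rows:
        conformed = mapping.get(row.get(attribute))
        if conformed in CONFORMED_REGIONS:
            clean.append({**row, attribute: conformed})
        else:
            rejected.append(row)  # left for the data steward to resolve
    return clean, rejected

sales = [{"order": 1, "region": "N"}, {"order": 2, "region": "X"}]
support = [{"ticket": 9, "region": "south-1"}]

clean, rejected = [], []
for rows, mapping in [(sales, SALES_TO_CONFORMED), (support, SUPPORT_TO_CONFORMED)]:
    ok, bad = conform(rows, mapping)
    clean += ok
    rejected += bad

print(clean)     # rows from both departments now share one region vocabulary
print(rejected)  # [{'order': 2, 'region': 'X'}] -> escalate to the data steward
```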