

Glossary
We start with an explanation of data quality terms.
Data Profiling
is the systematic exploration of source content and analysis of relationships among data elements in the content under examination. Profiling generally delivers statistics about the data that provide insight into its quality and help identify data quality issues. Single columns are profiled to understand the frequency distribution of their values, whether those values are unique, and how each column is used. Embedded value dependencies, such as those modeled by primary key / foreign key (PK/FK) constraints, can be exposed through cross-column analysis. Profiling should be performed several times, and with varying intensity, throughout the data warehouse development process.
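For illustration, a minimal profiling sketch in Python with pandas, run over a hypothetical customer extract (the column names and values are invented for the example):

import pandas as pd

# Hypothetical source extract; in practice this comes from the system under examination.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "country": ["US", "us", "USA", None],
})

# Single-column profile: distinct values, null counts, and value frequencies.
profile = pd.DataFrame({
    "distinct": df.nunique(),
    "nulls": df.isna().sum(),
    "null_pct": df.isna().mean().round(2),
})
print(profile)
print(df["country"].value_counts(dropna=False))

# Cross-column / key analysis: is customer_id a primary key candidate?
print("customer_id unique:", df["customer_id"].is_unique)

A real profiling pass would repeat such checks across all columns and tables, with growing depth as the warehouse design matures.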
Data Conformance
is the process of reaching agreement on common data definitions and on the representation and structure of data elements and values, while
Data Standardization
is the formatting of values into consistent layouts (the two terms are sometimes used interchangeably). Representations and layouts of values are based on industry standards, local standards (e.g., postal standards for addresses), and user-defined business rules (e.g., replacing variants of a term by a standard).
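As an illustrative sketch in Python with pandas (country variants invented for the example), a user-defined rule mapping variants of a term to a standard code might look like this:

import pandas as pd

# Hypothetical variants of the same country value, to be mapped to one standard code.
df = pd.DataFrame({"country": ["US", "usa", " United States ", None]})

# User-defined standardization rule: normalize the layout, then map known variants.
variants = {"US": "US", "USA": "US", "UNITED STATES": "US"}
df["country_std"] = (
    df["country"]
    .str.strip()
    .str.upper()
    .map(variants)   # unmapped variants become NaN and can be routed for review
)
print(df)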
Data Validation
is the execution of the set of rules that ensure the data is “clean”. Some of these rules are defined or implied by data standardization, but they may be more involved than that: they include key uniqueness constraints, PK/FK constraints, and complex business formulas constraining the values of several fields. Each rule returns a Boolean (true or false) indicating whether the data row or element satisfies the criterion. A specific type of validation rule, dealing with duplicate data, is called out in a separate topic below.
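A minimal sketch of rule execution, again in Python with pandas on invented order data (the rules and tolerance are examples, not a prescribed set):

import pandas as pd

df = pd.DataFrame({
    "order_id":   [10, 11, 11],
    "quantity":   [2, -1, 5],
    "unit_price": [9.99, 4.50, 3.00],
    "total":      [19.98, 4.50, 15.00],
})

# Each rule returns True/False per row; False flags a validation failure.
rules = pd.DataFrame({
    "order_id_unique":   ~df["order_id"].duplicated(keep=False),
    "quantity_positive": df["quantity"] > 0,
    "total_formula_ok":  (df["quantity"] * df["unit_price"] - df["total"]).abs() < 0.01,
})
print(rules)
print(df[~rules.all(axis=1)])   # rows failing at least one rule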
Data Cleansing
is the process of fixing dirty data to make it clean(er). This process covers filling in (some) missing values, correcting mistakes in strings, and applying transformations so that values meet the agreed data standardization, integrity constraints, or other business rules that define when the quality of data is sufficient for an organization.
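Continuing in the same style, a small cleansing sketch (name and state values invented) that fills a missing value, fixes string mistakes, and applies a standardizing transformation:

import pandas as pd

df = pd.DataFrame({
    "name":  [" alice ", "BOB", None],
    "state": ["calif.", "NY", "ny"],
})

# Fill a missing value with an agreed default, trim and re-case names,
# and transform state variants into the standard two-letter layout.
df["name"] = df["name"].fillna("UNKNOWN").str.strip().str.title()
df["state"] = df["state"].str.upper().replace({"CALIF.": "CA"})
print(df)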
Data Matching and Survivorship
is the task of identifying and merging (a.k.a. consolidating) records that correspond to the same real-world entities despite different data values and formatting standards. This typically happens when bringing data from different data sources into a target source, but it may also occur within a single data source. This task reduces data duplication and improves data accuracy and consistency in the target source.
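A simplified matching-and-survivorship sketch (hypothetical customer records; real implementations use fuzzier matching than an exact key):

import pandas as pd

# Two records describing the same real-world customer, from different sources.
df = pd.DataFrame({
    "email":      ["A.Smith@example.com", "a.smith@example.com"],
    "name":       ["A. Smith", "Alice Smith"],
    "updated_at": pd.to_datetime(["2016-01-05", "2017-03-20"]),
})

# Match on a normalized key, then apply a survivorship rule:
# keep the most recently updated record in each matched group.
df["match_key"] = df["email"].str.lower()
survivors = (
    df.sort_values("updated_at")
      .drop_duplicates(subset="match_key", keep="last")
)
print(survivors)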
Data Enrichment
is the enhancement of the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes and geographic descriptors).
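For example, appending external demographic and geographic attributes to internal records could be sketched as a simple join (all values invented):

import pandas as pd

# Internal customer data, keyed here by postal code.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "zip": ["10001", "94105"],
})

# External reference data with demographic and geographic descriptors.
demographics = pd.DataFrame({
    "zip": ["10001", "94105"],
    "median_income": [65000, 98000],
    "region": ["Northeast", "West"],
})

# Append the external attributes to the internally held data.
enriched = customers.merge(demographics, on="zip", how="left")
print(enriched)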
The following chart illustrates these steps with a simple example.