
Glossary

We start with an explanation of data quality terms.

Data Profiling

is the systematic exploration of source content and analysis of the relationships among data elements in the content under examination. Profiling generally delivers statistics about the data that provide insight into its quality and help identify data quality issues. Single columns are profiled to understand the frequency distribution of their values, whether those values are unique, and how each column is used. Embedded value dependencies, such as those modeled by primary key / foreign key (PK/FK) constraints, can be exposed by cross-column analysis. Profiling should be performed several times, and with varying intensity, throughout the data warehouse development process.
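To make this concrete, the sketch below profiles single columns with Python and pandas: frequency distribution of values, distinct and unique value checks, and null counts. The table and column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical source extract; in practice this would come from the
# source system under examination.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "country": ["US", "us", "DE", None],
})

# Single-column profile: frequency distribution, uniqueness, null use.
for col in df.columns:
    print(f"--- {col} ---")
    print("value frequencies:\n", df[col].value_counts(dropna=False))
    print("distinct values:", df[col].nunique())
    print("unique (candidate key):", df[col].is_unique)
    print("nulls:", df[col].isna().sum())
```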

Data Conformance

is the process of reaching agreement on common data definitions, representation, and structure of data elements and values, and

Data Standardization

is the formatting of values into consistent layouts (sometimes these terms are used interchangeably). Representations and layouts of values are based on industry standards, local standards (e.g., postal standards for addresses), and user-defined business rules (e.g., replacing variants of a term by a standard).
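As an illustration, the following sketch formats raw country values into a consistent layout using a hypothetical user-defined mapping of term variants to a standard representation.

```python
# Hypothetical user-defined rule set: variants of a term mapped to a
# single standard representation (here, two-letter country codes).
COUNTRY_STANDARD = {
    "us": "US", "usa": "US", "united states": "US",
    "de": "DE", "deu": "DE", "germany": "DE",
}

def standardize_country(value):
    """Format a raw country value into the agreed consistent layout."""
    if value is None:
        return None
    key = value.strip().lower()
    # Fall back to a simple layout rule when no variant mapping applies.
    return COUNTRY_STANDARD.get(key, value.strip().upper())

print(standardize_country(" usa "))    # -> US
print(standardize_country("Germany"))  # -> DE
print(standardize_country("fr"))       # -> FR (layout rule only)
```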

Data Validation

is the execution of a set of rules that ensure data is “clean”. Some of these rules are defined or implied by data standardization, but validation rules can be more involved: they include key uniqueness constraints, PK/FK constraints, and complex business formulas constraining the values of several fields. Each rule returns a Boolean, indicating whether the data row or element satisfies the criteria. A specific type of validation rule dealing with duplicate data is called out in a separate topic below.
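The sketch below shows three such rules, each returning a Boolean per row: a key uniqueness constraint, a PK/FK constraint, and a business formula spanning several fields. The tables and rules are hypothetical examples, not taken from a specific system.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id":    [10, 11, 11],
    "customer_id": [1, 2, 5],
    "net":   [100.0, 50.0, 80.0],
    "tax":   [20.0, 10.0, 15.0],
    "gross": [120.0, 60.0, 99.0],
})
customers = pd.DataFrame({"customer_id": [1, 2, 3]})

# Each rule yields a Boolean per row; True means the row is clean.
key_unique = ~orders["order_id"].duplicated(keep=False)          # key uniqueness
fk_valid = orders["customer_id"].isin(customers["customer_id"])  # PK/FK constraint
formula_ok = (orders["net"] + orders["tax"]) == orders["gross"]  # business formula

print(pd.DataFrame({"key_unique": key_unique,
                    "fk_valid": fk_valid,
                    "formula_ok": formula_ok}))
```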

Data Cleansing

is the process of fixing dirty data to make it clean(er). It covers filling in (some) missing values, correcting mistakes in strings, and transforming values to meet the agreed data standardization, integrity constraints, or other business rules that define when data quality is sufficient for an organization.
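For illustration, the following sketch fills a missing value with an agreed default and corrects a known string mistake; both rules are hypothetical stand-ins for an organization's actual cleansing rules.

```python
import pandas as pd

# Hypothetical dirty extract with a missing value and a typo.
df = pd.DataFrame({
    "city":    ["Pune", "Pune", None, "Puen"],
    "country": ["IN",   None,   "IN", "IN"],
})

# Fill (some) missing values with an agreed default.
df["country"] = df["country"].fillna("IN")

# Correct known string mistakes toward the standardized form.
KNOWN_FIXES = {"Puen": "Pune"}
df["city"] = df["city"].replace(KNOWN_FIXES)

print(df)
```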

Data Matching and Survivorship

is the task of identifying and merging (a.k.a. consolidating) records that correspond to the same real-world entity despite differing data values and formatting standards. This typically happens when bringing data from different data sources into a target store, but it may also occur within a single data source. Matching and survivorship reduce data duplication and improve data accuracy and consistency in the target store.
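The sketch below illustrates one possible approach: records are matched on a crude normalized-name key (production systems typically use fuzzy matching rules), and survivorship keeps the most recently updated record, backfilling its gaps from the older ones. The sources, fields, and rules are hypothetical.

```python
import pandas as pd

records = pd.DataFrame({
    "source":  ["crm", "billing"],
    "name":    ["ACME Corp.", "Acme Corporation"],
    "updated": pd.to_datetime(["2017-01-10", "2017-03-02"]),
    "phone":   ["555-0100", None],
})

# Matching: a crude blocking key built by normalizing the name.
records["match_key"] = (records["name"].str.lower()
                        .str.replace(r"[^a-z0-9]", "", regex=True)
                        .str[:4])

def consolidate(group):
    """Survivorship: keep the newest record, backfill gaps from older ones."""
    ordered = group.sort_values("updated")
    survivor = ordered.iloc[-1]
    for _, row in ordered.iloc[:-1].iterrows():
        survivor = survivor.combine_first(row)
    return survivor

golden = records.groupby("match_key").apply(consolidate)
print(golden)
```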

Data Enrichment

is the enhancement of the value of internally held data by appending related attributes from external sources (for example, consumer demographic attributes and geographic descriptors).
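As a minimal sketch, the following appends geographic descriptors from a hypothetical external reference table to internally held customer data via a left join.

```python
import pandas as pd

# Internally held customer data.
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "zip": ["10001", "94105"],
})

# Hypothetical external reference source with geographic descriptors.
external = pd.DataFrame({
    "zip": ["10001", "94105"],
    "city": ["New York", "San Francisco"],
    "median_income": [64000, 96000],
})

# Enrichment: append related attributes from the external source.
enriched = customers.merge(external, on="zip", how="left")
print(enriched)
```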

The following chart illustrates these steps with a simple example.