

As you define your warehouse support strategy, it is wise to assign resources for data reconciliation shortly after deployment. These analysts require a broad set of skills: they need access to the data warehouse and to the source systems or production reports, as well as a solid understanding of the data transformation process.
6.3 Enhancements to the reference publication
6.3.1 Our own experience
We begin this section with our experience integrating datasets that will become dimensions in the target dimensional schema.
- Matching party data from different sources is fundamental for data integration. By "party data" we mean people names, organization names, addresses, e-mail addresses and phone numbers; the list is not exhaustive. Customer, employee and vendor data fall into this category. It is very frequently the case that party data comes from different data sources, both within and outside the organization (some of which are known to be of dubious quality, such as retail point of sale data [18]), and needs to be merged in the warehouse. There is no common join key that makes this operation easy. What is generally done, therefore, is to parse party data into its component elements, correct mistakes and fill in missing values (for instance, correct mistyped city names, or fill in zip codes if enough address data elements are available), and then perform matching on these parsed-out component elements.
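To make this parse-then-match approach concrete, here is a minimal Python sketch. The field names, the regular-expression parser and the component weights are purely illustrative assumptions; real cleansing tools rely on country-specific parsing grammars, postal reference data and far more robust matching algorithms.

    import re
    from difflib import SequenceMatcher

    def parse_address(raw):
        """Naive parser: split a raw address string into street, city and zip.
        Real tools use country-specific grammars and postal dictionaries."""
        m = re.match(r"(?P<street>.+?),\s*(?P<city>[A-Za-z .'-]+?)\s+(?P<zip>\d{5})\s*$",
                     raw.strip())
        if not m:
            return {"street": raw.strip().lower(), "city": "", "zip": ""}
        return {k: v.strip().lower() for k, v in m.groupdict().items()}

    def similar(a, b):
        """String similarity in [0, 1]."""
        return SequenceMatcher(None, a, b).ratio()

    def match_parties(rec_a, rec_b, threshold=0.85):
        """Match two party records on parsed-out components, not on a join key.
        Weights and threshold are illustrative, not recommendations."""
        addr_a, addr_b = parse_address(rec_a["address"]), parse_address(rec_b["address"])
        name_score = similar(rec_a["name"].lower(), rec_b["name"].lower())
        city_score = similar(addr_a["city"], addr_b["city"])
        zip_score = 1.0 if addr_a["zip"] and addr_a["zip"] == addr_b["zip"] else 0.0
        return 0.5 * name_score + 0.3 * city_score + 0.2 * zip_score >= threshold

    # The same customer arriving from a CRM extract and a point-of-sale feed,
    # with no shared key and a typo in the city name.
    crm = {"name": "Jon Smith",  "address": "12 Main St., Springfield 62704"}
    pos = {"name": "John Smith", "address": "12 Main Street, Springfeild 62704"}
    print(match_parties(crm, pos))

Matching on parsed components rather than on raw strings is what lets the typo in the city name and the abbreviated street form still resolve to the same party.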
- The above discussion on parsing and cleansing leads us to the following observation. Domain-oriented validation and cleansing rules generally work better than generic ones. This has also been observed in a recently published comparative experiment [17]. Party data elements (e.g., City, a component of address data) are an example of such a domain, but this is true of other domains as well. The way this generally works is through identification of columns as belonging to these domains, and through access to dictionaries or knowledge bases that validate the values in the column and/or provide a way to correct frequent mistakes (we have also recently seen a push towards using machine learning techniques; more on this below). The reason why so much effort has been put into detecting, cleansing and matching party data (including, for instance, country- and language-specific variations, which are frequent in industrial-strength data cleansing tools) is our first observation above.
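As an illustration of such a domain-oriented rule, the sketch below validates and corrects values in the City domain against a reference dictionary. The dictionary here is a hand-written stand-in and the similarity cutoff is an assumed value; a real tool would be backed by postal files or a knowledge base and by more sophisticated correction logic (possibly the machine learning techniques mentioned above).

    from difflib import get_close_matches

    # Illustrative reference dictionary; real tools ship large, country-specific
    # reference data (postal files, gazetteers) rather than a hand-written list.
    KNOWN_CITIES = ["springfield", "shelbyville", "capital city"]

    def cleanse_city(value, cutoff=0.8):
        """Domain-oriented rule for the City domain: accept known values,
        correct near-misses against the dictionary, otherwise flag the row."""
        v = value.strip().lower()
        if v in KNOWN_CITIES:
            return v, "valid"
        candidate = get_close_matches(v, KNOWN_CITIES, n=1, cutoff=cutoff)
        if candidate:
            return candidate[0], "corrected"
        return v, "rejected"   # route to manual review by a data steward

    for raw in ["Springfield", "Springfeild", "Gotham"]:
        print(raw, "->", cleanse_city(raw))

A generic rule (for instance, "the value must be alphabetic") would accept a misspelled city; the domain dictionary is what makes correction, rather than mere rejection, possible.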
- Managed master data is great news for analytics projects. In our experience, organizations as a whole are finally starting to get interested in data quality. An area that has gathered a lot of traction recently is master data. As mentioned above, the main reason for this is that transactional systems are based on master data, and MDM provides them with high-quality master data (as explained in chapter 9, MDM systems are consumers of data quality software). Master data managed by MDM systems is great news for analytics projects for two reasons.
1. A lot of work in the cleansing and conforming stage of the ETL system centers on creating exactly this single, conformed version of customers or products, so this work is avoided when the master data repository of the MDM system is sourced into the data warehouse (see the sketch after this list).
2. MDM systems allow solving the quality problem at the source, rather than continuously fixing the same problems over and over again when moving data over to a data warehouse.
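The following sketch illustrates the first point under simplified assumptions: the table and column names (mdm_customer_golden, dim_customer) are hypothetical, and a real dimension load would add surrogate-key management, slowly changing dimension handling and incremental refresh. The point is only that the parsing, matching and survivorship work has already been done in the MDM hub, so the load reduces to a straightforward copy.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Golden records maintained by the (hypothetical) MDM hub.
        CREATE TABLE mdm_customer_golden (
            master_id INTEGER PRIMARY KEY,
            full_name TEXT, city TEXT, country TEXT);
        INSERT INTO mdm_customer_golden VALUES
            (1, 'John Smith', 'Springfield', 'US'),
            (2, 'Acme Corp.', 'Shelbyville', 'US');

        -- Conformed customer dimension in the warehouse.
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
            master_id INTEGER, full_name TEXT, city TEXT, country TEXT);
    """)

    # No cleansing or survivorship here: the MDM hub already produced a single,
    # conformed version of each customer, so the dimension load is a copy.
    conn.execute("""
        INSERT INTO dim_customer (master_id, full_name, city, country)
        SELECT master_id, full_name, city, country FROM mdm_customer_golden
    """)
    print(conn.execute("SELECT * FROM dim_customer").fetchall())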