

As you define your warehouse support strategy, it is wise to assign resources for data reconciliation shortly after deployment. These analysts require a broad set of skills: they need access to the data warehouse and to the source systems or production reports, as well as a solid understanding of the data transformation process.
6.3 Enhancements to the reference publication
6.3.1 Our own experience
We begin this section with our experience integrating datasets that will become dimensions in the target dimensional schema.
- Matching party data from different sources is fundamental for data integration. By "party data" we mean people names, organization names, addresses, e-mail addresses and phone numbers; the list is not exhaustive. Customer, employee and vendor data fall into this category. It is very frequently the case that party data comes from different data sources, both within and outside the organization (some of which are known to be of dubious quality, such as retail point of sale data [18]), and needs to be merged in the warehouse. There is no common join key that makes this operation easy. What is generally done, therefore, is to parse party data into its component elements, correct mistakes and fill in missing values (for instance, correct mistyped city names, or fill in zip codes if enough address data elements are available), and then perform matching on these parsed-out component elements.
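To make this parse-then-match approach concrete, here is a minimal Python sketch. The field names, the regular-expression parser and the component weights are purely illustrative assumptions; real cleansing tools rely on country-specific parsing grammars, postal reference data and far more robust matching algorithms.

    import re
    from difflib import SequenceMatcher

    def parse_address(raw):
        """Naive parser: split a raw address string into street, city and zip.
        Real tools use country-specific grammars and postal dictionaries."""
        m = re.match(r"(?P<street>.+?),\s*(?P<city>[A-Za-z .'-]+?)\s+(?P<zip>\d{5})\s*$",
                     raw.strip())
        if not m:
            return {"street": raw.strip().lower(), "city": "", "zip": ""}
        return {k: v.strip().lower() for k, v in m.groupdict().items()}

    def similar(a, b):
        """String similarity in [0, 1]."""
        return SequenceMatcher(None, a, b).ratio()

    def match_parties(rec_a, rec_b, threshold=0.85):
        """Match two party records on parsed-out components, not on a join key.
        Weights and threshold are illustrative, not recommendations."""
        addr_a, addr_b = parse_address(rec_a["address"]), parse_address(rec_b["address"])
        name_score = similar(rec_a["name"].lower(), rec_b["name"].lower())
        city_score = similar(addr_a["city"], addr_b["city"])
        zip_score = 1.0 if addr_a["zip"] and addr_a["zip"] == addr_b["zip"] else 0.0
        return 0.5 * name_score + 0.3 * city_score + 0.2 * zip_score >= threshold

    # The same customer arriving from a CRM extract and a point-of-sale feed,
    # with no shared key and a typo in the city name.
    crm = {"name": "Jon Smith",  "address": "12 Main St., Springfield 62704"}
    pos = {"name": "John Smith", "address": "12 Main Street, Springfeild 62704"}
    print(match_parties(crm, pos))

Matching on parsed components rather than on raw strings is what lets the typo in the city name and the abbreviated street form still resolve to the same party.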
- The above discussion on parsing and cleansing leads us to the following observation. Domain-oriented validation and cleansing rules generally work better than generic ones. This has also been observed in a recently published comparative experiment [17]. Party data elements (e.g., City, a component of address data) are an example of such a domain, but this is true of other domains as well. The way this generally works is through identification of columns as belonging to these domains, and through access to dictionaries or knowledge bases that validate the values in the column and/or provide a way to correct frequent mistakes (we have also recently seen a push towards using machine learning techniques; more on this below). The reason why so much effort has been put into detecting, cleansing and matching party data (including, for instance, country- and language-specific variations, which are frequent in industrial-strength data cleansing tools) is our first observation above.
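As an illustration of such a domain-oriented rule, the sketch below validates and corrects values in the City domain against a reference dictionary. The dictionary here is a hand-written stand-in and the similarity cutoff is an assumed value; a real tool would be backed by postal files or a knowledge base and by more sophisticated correction logic (possibly the machine learning techniques mentioned above).

    from difflib import get_close_matches

    # Illustrative reference dictionary; real tools ship large, country-specific
    # reference data (postal files, gazetteers) rather than a hand-written list.
    KNOWN_CITIES = ["springfield", "shelbyville", "capital city"]

    def cleanse_city(value, cutoff=0.8):
        """Domain-oriented rule for the City domain: accept known values,
        correct near-misses against the dictionary, otherwise flag the row."""
        v = value.strip().lower()
        if v in KNOWN_CITIES:
            return v, "valid"
        candidate = get_close_matches(v, KNOWN_CITIES, n=1, cutoff=cutoff)
        if candidate:
            return candidate[0], "corrected"
        return v, "rejected"   # route to manual review by a data steward

    for raw in ["Springfield", "Springfeild", "Gotham"]:
        print(raw, "->", cleanse_city(raw))

A generic rule (for instance, "the value must be alphabetic") would accept a misspelled city; the domain dictionary is what makes correction, rather than mere rejection, possible.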
- Managed master data is great news for analytics projects. In our experience, organizations as a whole are finally starting to get interested in data quality. An area that has gathered a lot of traction recently is master data. As mentioned above, the main reason for this is that transactional systems are based on master data, and MDM provides them with high-quality master data (as explained in chapter 9, MDM systems are consumers of data quality software). Master data managed by MDM systems is great news for analytics projects for two reasons.
1. A lot of work in the cleansing and conforming stage of the ETL system centers on creating exactly this single, conformed version of customers or products, so this work is avoided when the master data repository of the MDM system is sourced into the data warehouse (see the sketch after this list).
2. MDM systems allow solving the quality problem at the source, rather than continuously fixing the same problems over and over again when moving data over to a data warehouse.
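The following sketch illustrates the first point under simplified assumptions: the table and column names (mdm_customer_golden, dim_customer) are hypothetical, and a real dimension load would add surrogate-key management, slowly changing dimension handling and incremental refresh. The point is only that the parsing, matching and survivorship work has already been done in the MDM hub, so the load reduces to a straightforward copy.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Golden records maintained by the (hypothetical) MDM hub.
        CREATE TABLE mdm_customer_golden (
            master_id INTEGER PRIMARY KEY,
            full_name TEXT, city TEXT, country TEXT);
        INSERT INTO mdm_customer_golden VALUES
            (1, 'John Smith', 'Springfield', 'US'),
            (2, 'Acme Corp.', 'Shelbyville', 'US');

        -- Conformed customer dimension in the warehouse.
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
            master_id INTEGER, full_name TEXT, city TEXT, country TEXT);
    """)

    # No cleansing or survivorship here: the MDM hub already produced a single,
    # conformed version of each customer, so the dimension load is a copy.
    conn.execute("""
        INSERT INTO dim_customer (master_id, full_name, city, country)
        SELECT master_id, full_name, city, country FROM mdm_customer_golden
    """)
    print(conn.execute("SELECT * FROM dim_customer").fetchall())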