

Finally, it is highly recommended to influence management toward establishing a corporate data governance program ([1], p. 56) committed to a continuous data quality improvement process that transcends departmental-level organizations and works at the enterprise level. Technology is an enabler, but it does not by itself fix quality problems in an organization. In [1], page 381, Kimball proposes an interesting nine-step data governance program template for any organization that wants to address and build data quality as part of its culture (he calls it information governance, but the two terms are basically synonymous). Michael Hammer, in his famous reengineering book [19], points to several case studies where improvements in information technology, and in particular in the quality of data involved in key business processes, were credited as an essential enabler of spectacular gains in productivity in well-known corporations (in his own words: “seemingly small data quality issues are, in reality, important indications of broken business processes”).
6.2.2 Data Quality at Requirements Definition Stage
At this stage, it is recommended to have the data steward take a first “dig into the data” ([1], pp. 95, 99) to better understand the underlying data sources, starting with the primary data source for the project at hand. It is suggested to talk to the owners of the core operational system of the project, as well as to the database administrator and the data modeler. The goal of this data audit at this early stage is to perform a strategic, light assessment of the data to determine its suitability for inclusion in the data warehouse and provide an early go/no-go decision. It is far better to disqualify a data source at this juncture, even if the consequence is a major disappointment for your business sponsor and users, than to come to this realization during the ETL development effort. This exploration journey will be greatly facilitated by a profiling tool, rather than hand coding all your queries (e.g., SELECT DISTINCT on a database column). Profiling should continue as requirements are being gathered.
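To make the hand-coded alternative concrete, here is a minimal profiling sketch in Python. The database file, table, and column names are illustrative assumptions, not part of the original text; a dedicated profiling tool would produce these statistics (and many more) without this manual effort.

```python
import sqlite3

# Hypothetical source database, table, and columns (assumptions for this sketch).
DB_FILE = "source_system.db"
TABLE = "customers"
COLUMNS = ["customer_id", "email", "country_code", "birth_date"]

conn = sqlite3.connect(DB_FILE)
for col in COLUMNS:
    # Row count, null count, and distinct count per column: the most basic
    # profiling statistics (fill level and cardinality).
    total, nulls, distinct = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {col} IS NULL THEN 1 ELSE 0 END), "
        f"COUNT(DISTINCT {col}) FROM {TABLE}"
    ).fetchone()
    print(f"{col}: rows={total}, nulls={nulls}, distinct={distinct}")
conn.close()
```

A column whose distinct count equals the row count is a candidate key; a column with a high null count may be unsuitable for the warehouse without remediation at the source.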
6.2.3 Data Quality at Design Stage
The data quality related activities at this stage are still driven by a deeper, tactical profiling effort of the formal data sources, i.e., those maintained by IT, and of the informal sources, coming from the lines of business or external to the organization. The first step in this process is to understand all the sources that are candidates for populating the target model; the second is to evaluate each of them and determine its quality ([1], p. 307). The outcome ([1], p. 308) includes the following:
— A basic go/no-go decision for each data source
— Data quality issues that must be corrected at the source systems before the project can proceed
— Data quality issues that can be corrected in the ETL processing flow after extraction, sorting out standardization, validation, cleansing, and matching needs
— Unanticipated business rules, hierarchical structures, and foreign key/primary key relationships (see the sketch below)
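As an illustration of the last outcome, the following sketch shows how a tactical profiling pass can surface a broken (or unanticipated) foreign key/primary key relationship between two extracted tables. The staging database and table names are assumptions for the example.

```python
import sqlite3

# Hypothetical staging database holding two extracted source tables
# (orders, customers); all names are assumptions for this sketch.
conn = sqlite3.connect("staging.db")

# Suspected foreign key / primary key relationship: does every order
# reference an existing customer?
orphans = conn.execute(
    """
    SELECT COUNT(*)
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
    """
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(f"orphaned order rows: {orphans} of {total}")

# A non-zero orphan count is a data quality issue to correct at the source,
# or to handle in the ETL flow (e.g., by mapping orphans to an
# "unknown customer" default dimension row).
conn.close()
```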
From this analysis, a decision needs to be taken regarding the best source to populate the dimensional model. The criteria for choosing between two or more possible feeds for the data include accessibility and data accuracy, as explained on page 308 of [1]. Data stewards should also strive to obtain and validate optional data, which source system and data owners are happy to leave unfilled ([1], p. 321).
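A simple sketch of validating such an optional field follows. The table, the email column, and the deliberately naive format check are illustrative assumptions only; production validation rules would be stricter.

```python
import re
import sqlite3

# Hypothetical optional field on a source table (assumptions for this sketch).
# The regex is deliberately naive; real email validation is more involved.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

conn = sqlite3.connect("source_system.db")
rows = [r[0] for r in conn.execute("SELECT email FROM customers")]
conn.close()

filled = [v for v in rows if v is not None and v.strip()]
valid = [v for v in filled if EMAIL_RE.match(v)]

# Fill rate shows how often the optional field is populated at all;
# validity rate shows how trustworthy the populated values are.
print(f"fill rate:  {len(filled)} of {len(rows)} rows")
print(f"valid rate: {len(valid)} of {len(filled)} filled values")
```

Low fill or validity rates are concrete evidence the data steward can bring to source system owners when negotiating for better capture of optional data.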