Step 9: Aggregate Table and OLAP Loads. Aggregation may be managed by technology such as relational materialized views or OLAP cubes, in which case most of the maintenance work is done automatically after the initial load. Incremental processing to refresh the aggregates is possible even when the aggregates are maintained by ETL code (and you can be confident of it when OLAP or materialized view technology is used); however, a major change to a dimension attribute, such as a Type 1 SCD attribute change, may require reprocessing the aggregate table. This is the case no matter how the aggregates are implemented, whether by ETL code or through database technology.
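
As a minimal sketch of why a Type 1 change forces a reprocess (the star schema, the table and column names, and the use of SQLite are illustrative assumptions, not part of any specific design), the aggregate below is rebuilt in a single transaction after the dimension attribute is overwritten:

    import sqlite3

    # Minimal sketch; the star schema, table names and use of SQLite are
    # illustrative assumptions, not part of the design described above.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE sales_fact (product_key INTEGER, amount REAL);
        CREATE TABLE sales_by_category_agg (category TEXT PRIMARY KEY, total_amount REAL);
    """)
    conn.executemany("INSERT INTO product_dim VALUES (?, ?)", [(1, "Indoor"), (2, "Indoor")])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", [(1, 10.0), (2, 5.5)])

    def rebuild_category_aggregate(conn):
        """Full reprocess of the aggregate table, needed after a Type 1 SCD change
        because history was restated in place and the old aggregate rows are stale."""
        with conn:  # one transaction, so readers never see a half-built aggregate
            conn.execute("DELETE FROM sales_by_category_agg")
            conn.execute("""
                INSERT INTO sales_by_category_agg (category, total_amount)
                SELECT d.category, SUM(f.amount)
                FROM sales_fact f JOIN product_dim d ON d.product_key = f.product_key
                GROUP BY d.category""")

    rebuild_category_aggregate(conn)  # initial aggregate load
    conn.execute("UPDATE product_dim SET category = 'Outdoor' WHERE product_key = 2")
    rebuild_category_aggregate(conn)  # the Type 1 change forces a reprocess
    print(conn.execute("SELECT * FROM sales_by_category_agg").fetchall())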

Step 10: ETL System Operation and Automation. Jobs are scheduled for a specific time, or may be triggered based on polling. A clear workflow should be built for steps that require ordering, such as dimension management before facts, aggregate refresh after incremental fact processing, and the OLAP rebuild last.
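
As a hedged sketch of such ordering (the job names and functions are placeholders, not tied to any particular scheduler product), the dependency chain can be expressed as a simple ordered list:

    # Minimal sketch of an ordered ETL workflow; the job functions are
    # placeholders, not tied to any particular scheduler product.
    def load_dimensions():        print("dimensions loaded")
    def load_facts_incremental(): print("incremental facts loaded")
    def refresh_aggregates():     print("aggregates refreshed")
    def rebuild_olap_cube():      print("OLAP cube rebuilt")

    # Order matters: dimensions before facts, aggregate refresh after the
    # incremental fact load, and the OLAP rebuild last.
    workflow = [load_dimensions, load_facts_incremental,
                refresh_aggregates, rebuild_olap_cube]

    for step in workflow:
        step()  # a real scheduler would add retries, logging and alerting here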

Concentrate on ETL testing ([1], p. 455). First, use a small dataset to unit-test both fact table loading and dimension loading before throwing the full set of historic data at it. Direct error records to a file, investigate what went wrong, fix the code, and rerun the test. When your first production load completes with the right record count, don't declare victory too soon: confirm that the data is accurate at the target (again, a profiling tool is your best option). When the initial loads are confirmed accurate, move on to test the incremental load processes ([1], p. 467). This testing is often neglected and very unglamorous, but these test scripts make the difference between a successful ETL project and an unsuccessful one.
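
A minimal unit-test sketch in the pytest style (load_fact, the table layout, and the tiny dataset are hypothetical, chosen only to illustrate the error-file and record-count checks):

    import sqlite3

    # Hypothetical loader under test; in practice this would call the real ETL code.
    def load_fact(conn, rows, error_file):
        loaded = 0
        with open(error_file, "a") as errors:
            for product_key, amount in rows:
                try:
                    conn.execute("INSERT INTO sales_fact VALUES (?, ?)",
                                 (product_key, float(amount)))
                    loaded += 1
                except (ValueError, sqlite3.Error) as exc:
                    errors.write(f"{product_key},{amount},{exc}\n")  # direct error records to a file
        return loaded

    def test_fact_load_small_dataset(tmp_path):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE sales_fact (product_key INTEGER, amount REAL)")
        sample = [(1, "10.0"), (2, "not-a-number"), (3, "5.5")]  # tiny, hand-checked dataset
        loaded = load_fact(conn, sample, tmp_path / "errors.csv")
        # Confirm the record count before trusting the full historic load.
        assert loaded == 2
        assert conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0] == 2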

A point that deserves particular attention is how to handle unpredictable errors, and how to recover from them automatically in a transaction-consistent manner, especially given that these errors typically occur in a "lights-out" environment. Many ETL products provide tools and features that enable you to build a system robust against all possible failures, but doing so is more complicated than it seems.

Alerts on unpredictable errors should be raised and audited.
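
One minimal pattern for this (the etl_audit table and the alert_operations function are hypothetical placeholders for whatever audit and alerting mechanisms the ETL product provides) wraps each step in a transaction, records the failure, and raises an alert:

    import datetime
    import sqlite3
    import traceback

    # Hypothetical lights-out wrapper: an unpredictable error rolls the step back,
    # is recorded in an audit table, and raises an alert before the job stops.
    # The etl_audit table and alert_operations() are illustrative placeholders.
    def alert_operations(message):
        print(f"ALERT: {message}")  # stand-in for e-mail/pager integration

    def run_step(conn, step_name, step_fn):
        try:
            with conn:  # transaction-consistent: the step's work rolls back on any exception
                step_fn(conn)
        except Exception:
            conn.execute(
                "INSERT INTO etl_audit (step, occurred_at, detail) VALUES (?, ?, ?)",
                (step_name, datetime.datetime.now().isoformat(), traceback.format_exc()))
            conn.commit()
            alert_operations(f"ETL step '{step_name}' failed; see etl_audit for details")
            raise  # let the scheduler decide whether to retry or halt the run

    # Setup the wrapper depends on (illustrative):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE etl_audit (step TEXT, occurred_at TEXT, detail TEXT)")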

Once an unpredictable error occurs and the ETL is re-run, the re-run should not result in any data loss or inflation; i.e., the ETL queries or workflows must be coded to handle this. This is very important because the DW is supposed to provide a single version of the truth.
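
One common way to make a re-run safe, sketched here under the assumption that each run is tagged with a batch identifier (table and column names are illustrative), is to delete and reload the affected batch inside a single transaction:

    import sqlite3

    # Minimal sketch of an idempotent, re-runnable incremental load. The batch_id
    # column, the table, and the sample rows are illustrative assumptions.
    def load_batch(conn, batch_id, rows):
        """Replace the batch's rows atomically: re-running the same batch after a
        failure neither loses rows nor inflates counts."""
        with conn:  # commits on success, rolls back on any exception
            conn.execute("DELETE FROM sales_fact WHERE batch_id = ?", (batch_id,))
            conn.executemany(
                "INSERT INTO sales_fact (batch_id, product_key, amount) VALUES (?, ?, ?)",
                [(batch_id, key, amount) for key, amount in rows])

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales_fact (batch_id TEXT, product_key INTEGER, amount REAL)")
    rows = [(1, 10.0), (2, 5.5)]
    load_batch(conn, "2017-06-30", rows)
    load_batch(conn, "2017-06-30", rows)  # re-run after a failure: no loss, no inflation
    assert conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0] == 2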

There are several supporting database tasks necessary for ETL operations, e.g., table space management, periodic index maintenance, and database backups. These tasks should be automated within the ETL environment to allow for end-to-end operations testing (see the next section), preferably by using database system metadata.
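
A small sketch of metadata-driven maintenance (SQLite is used only to keep the example self-contained; a real warehouse would query its own catalog views and run its own maintenance commands):

    import sqlite3

    # Illustrative sketch only: SQLite keeps the example self-contained, but a real
    # warehouse would use its own catalog views and maintenance commands.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales_fact (product_key INTEGER, amount REAL);
        CREATE INDEX ix_sales_product ON sales_fact (product_key);
    """)

    # Enumerate user-created indexes from system metadata instead of hard-coding
    # their names, so new indexes are picked up by the nightly job automatically.
    indexes = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND sql IS NOT NULL").fetchall()
    for (index_name,) in indexes:
        conn.execute(f"REINDEX {index_name}")  # periodic index maintenance step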

Finally, strive to develop and test an operation that is as fully automated as possible for the ETL processes: the ideal ETL operation runs the regular load processes in a lights-out manner, without human intervention. More on this in the section below.

5.3.6 Dimension Data Modeling - Deployment Considerations

The deployment stage should go through the traditional rigors of alpha and beta testing prior to general availability and rollout (and this is true of any subsequent delivery after the initial release of the data warehouse).