WP_Data Management Best Practices

W H I T E P A P E R

For instance, a sales by city aggregate for last month will be wrong when notice of a customer moving two months ago arrives.

www.persistent.com

—

Periodic Snapshot Loader: An ETL fact table builder can be run at the periodicity of a day, week or month,

to capture a snapshot to update this fact table.

—

Accumulating Snapshot Loader: An ETL Fact table builder runs at intervals to gather the state transitions

and summarize them into theAccumulating Snapshot Fact.

—

Subsystem14: Surrogate Key Pipeline: This is the process that exchanges the natural keys of each dimension

by the surrogate keys and handles any referential integrity error. Dimension table processing must be

complete before the fact data enters the surrogate key pipeline, for both the initial (historic) load and the

incremental load.

Mapping tables should be retained in the ETL pipeline to get the corresponding

surrogate keys for natural keys.

The ETL should have logic to insert into the dimension as well as the

mapping tables, in case a new natural key is encountered.

—

Subsystem 15: Multi-Values Dimension Bridge Table Builder:

Construction of the bridge table should

happen during the during the ETL pipeline

. Additional complications arise as these dimensions may also

be SCDType 2.

—

Subsystem 16: LateArriving Data Handler:

The ETL system needs to take care of late arriving data, either

dimension or fact records

. In the former case, if the business can live without facts that have no dimension

data, one strategy is to park aside (into another area), fact data arriving before its dimension data. If reports

with all fact data are needed, the facts could be loaded with dummy dimension placeholder records with their

definitive surrogate key initially. These dimension placeholders are later updated when its data arrives.

Additional consideration is needed when the dimension is slowly changing and updates arrive late, as the

surrogate keys in the fact table point to the wrong dimension data, and for summary (aggregate) tables based

on now obsolete dimension data . If the Facts arrive late, expensive searches are needed to backtrack and

adjust all the summaries, semi-additive facts, as well as locating which dimension surrogate keys were in

effect at that time. During this time window, the reports will not be always accurate.

—

Subsystem 17: Dimension Manager System: The set of dimensions should be tightly controlled by a

Dimension manager interacting with other subsystems managing dimensions (9 to 12, 15, 16) The difficulty

increases with multi-system EDWs and distributed environments where dimension tables are replicated. In

this case,

a best practice is to have this subsystem replicate dimensions to the Fact Provider Systems

(see below) with a version number attached to each dimension row.

The version number is very useful in

case of drill across queries spanning systems, to ensure consistency.

—

Subsystem 18:

Creation of a Fact Provider system

is a best practice, as it is responsible for

replacing/recalculating surrogate keys, accumulating snapshots, and aggregations using some of the

subsystems introduced above, plus subsystem 19. In distributed, drill-across query environments, it receives

conformed dimensions released by the dimensionmanager.

—

Subsystem 19: Aggregate Builder: This applies in general to environments that have not selected OLAP

engines in their technical architecture (see section

as well as the point below).

ETL must make sure

5.3.3

that the aggregates are consistent with the base data

. There are several possible implementations that

range from relying on materialized view DBMS technology, to explicit ETL code implementation of this

subsystem. The Fact Provider System usually triggers the Aggregate Builder based on changes to Fact table

data.