
Subsystem 20: OLAP Cube Builder. This applies to environments that have selected an OLAP engine as their primary user access presentation server in their technical architecture (see section 5.3.3 above); in that case, the relational aggregation tables would not be needed (except in mixed OLAP/relational environments). At the very end of the ETL processing, the OLAP cubes get loaded. However, as mentioned above, they are sensitive to the SCD nature of dimensions: Type 1 and Type 3 SCD changes would cause large-scale reloading/rebuilding of the OLAP data (the same is true of the aggregate tables just described).
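As a rough illustration of this last point, the sketch below makes the cube refresh the final ETL step and chooses a full rebuild whenever Type 1 or Type 3 overwrites occurred during the run. The olap_client object and its process() call are placeholders for whatever processing API the chosen OLAP engine exposes, not a specific product's interface.

```python
# Hypothetical sketch: refreshing an OLAP cube as the very last ETL step.
# olap_client and its process() method are placeholders for the engine's API.

def refresh_cube(olap_client, cube_name: str, had_type1_or_type3_changes: bool) -> None:
    """Refresh one cube after the relational fact/dimension loads have finished."""
    if had_type1_or_type3_changes:
        # Type 1 / Type 3 changes overwrite history, so previously aggregated
        # cells are stale and the cube (or affected partitions) must be rebuilt.
        olap_client.process(cube_name, mode="full_rebuild")
    else:
        # Pure Type 2 changes only add new rows/members, so an incremental
        # process of the latest data is usually sufficient.
        olap_client.process(cube_name, mode="incremental")
```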

Subsystem 21: Data Propagation Manager. The ETL system is now the platform that handles the distribution of the integrated, conformed data of the warehouse to other systems through data extracts (subsuming the former EAI systems). Data from the warehouse feeds external entities (customers, regulatory bodies, partners, vendors) as well as corporate analytical systems used for data mining and algorithmic analysis. Some amount of transformation may be required from the facts and dimensions into the formats required by each target system.
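As a minimal sketch of such an extract, a propagation job might join conformed facts and dimensions and write them in the flat format a downstream system expects. The table and column names (fact_sales, dim_customer, and so on) are illustrative assumptions, not taken from this paper, and the qmark-style parameters assume a DB-API connection such as sqlite3.

```python
# Hypothetical sketch of a data-propagation extract: join conformed facts and
# dimensions, then write the result as a CSV file for a downstream consumer.
import csv

EXTRACT_SQL = """
SELECT d.customer_key, d.customer_name, f.order_date_key, f.sales_amount
FROM   fact_sales f
JOIN   dim_customer d ON d.customer_key = f.customer_key
WHERE  f.order_date_key BETWEEN ? AND ?
"""

def export_extract(conn, date_from: int, date_to: int, path: str) -> int:
    """Run the extract query and write a CSV file; return the row count."""
    cur = conn.execute(EXTRACT_SQL, (date_from, date_to))
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([c[0] for c in cur.description])  # header from cursor metadata
        rows = 0
        for row in cur:
            writer.writerow(row)
            rows += 1
    return rows  # useful as an audit statistic for the propagation run
```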

Below are some best practices related to the development of the dimensional model ([1], Chapter 10):

Use a data profiling tool (or a set of queries that can be used to perform data profiling). Using a data profiling tool helps uncover the actual state of data quality (which is usually not expressed in the database metadata), understand data issues, and sometimes even discover relationships and derivation rules. This is discussed more widely in section 6.2 below. The data profile is an input to the ETL transformations and logic, and guides the data quality effort.
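As a sketch of the "queries used to perform data profiling" idea, a simple per-column profile can gather row counts, null rates, distinct counts and value ranges. Any DB-API connection would do; the table and column names are supplied by the caller, so nothing here is specific to a particular source system.

```python
# Hypothetical sketch of column-level data profiling via plain SQL.

def profile_column(conn, table: str, column: str) -> dict:
    """Return simple profile metrics for one column of a source table."""
    sql = (
        f"SELECT COUNT(*), "
        f"       SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END), "
        f"       COUNT(DISTINCT {column}), "
        f"       MIN({column}), MAX({column}) "
        f"FROM {table}"
    )
    total, nulls, distinct, lo, hi = conn.execute(sql).fetchone()
    return {
        "table": table,
        "column": column,
        "rows": total,
        "null_pct": (nulls or 0) * 100.0 / total if total else 0.0,
        "distinct": distinct,
        "min": lo,
        "max": hi,
    }
```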

Step 5: This checklist of steps to follow for dimension tables covers the initial population of dimension data, including data cleansing and match/consolidation, especially when populating from different sources (this is developed in section 6.2.4 below); Type 1 / Type 2 transformations; validating relationships; surrogate key generation and assignment; techniques for fast loading; and loading the Date and other special dimensions introduced above.
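A minimal sketch of the surrogate key assignment and Type 2 initialisation parts of this step is shown below. The column names (customer_id, effective_date, and so on) and the convention of an open-ended expiry date are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the initial (historical) dimension load: assign
# surrogate keys and initialise the Type 2 housekeeping columns.
from datetime import date
from itertools import count

END_OF_TIME = date(9999, 12, 31)

def initial_dimension_load(source_rows, start_key: int = 1):
    """Yield dimension rows with surrogate keys and open-ended Type 2 dates."""
    next_key = count(start_key)
    for row in source_rows:  # rows are assumed already cleansed and consolidated
        yield {
            "customer_key": next(next_key),      # surrogate key, not the natural key
            "customer_id": row["customer_id"],   # natural (business) key
            "customer_name": row["customer_name"],
            "effective_date": date(1900, 1, 1),  # history starts "at the beginning"
            "expiry_date": END_OF_TIME,          # open-ended: marks the current row
            "is_current": True,
        }
```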

Step 6: Perform the Fact Table Historical (initial) Load. Extracting the right range of facts, with the right date filter criteria, is important to start the fact table loading. Audit statistics with counts and sums gathered in these steps are important for performing validation and verification later. The fact records may contain derived, calculated facts, although in many cases it is more efficient to calculate them in an OLAP cube or a view rather than in the physical table. Loading should be done by disabling constraints, dropping indexes, fast loading, and then re-enabling constraints and re-creating indexes.
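The sketch below illustrates that wrap-around pattern together with the audit counts and sums. The DDL is generic SQL, and the table, index and constraint names are assumptions that would need adapting to the target database and its bulk-load facility.

```python
# Hypothetical sketch of the index/constraint handling around a historical
# fact load, plus the audit statistics used for later validation.

def historical_fact_load(conn, bulk_load_fn) -> dict:
    """Wrap a bulk load with index/constraint handling and return audit stats."""
    conn.execute("ALTER TABLE fact_sales DROP CONSTRAINT fk_sales_customer")
    conn.execute("DROP INDEX ix_fact_sales_date")
    try:
        bulk_load_fn(conn)  # database-specific fast/bulk load supplied by the caller
    finally:
        # Re-create the index and re-enable the constraint even if the load fails.
        conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales(order_date_key)")
        conn.execute(
            "ALTER TABLE fact_sales ADD CONSTRAINT fk_sales_customer "
            "FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key)"
        )
    rows, total = conn.execute(
        "SELECT COUNT(*), SUM(sales_amount) FROM fact_sales"
    ).fetchone()
    return {"row_count": rows, "sales_amount_sum": total}  # audit statistics
```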

Step 7: Dimension Table Incremental Processing has two basic strategies: extract a full load at the source and compute the delta at the target, or extract only new and changed records at the source. The first strategy applies when the source system has no capability to identify new, changed and deleted data; the second strategy should be preferred, as it minimizes data transfer loads and expensive comparison operations. As with initial loads, the dimension data should be cleaned and transformed prior to identifying new and changed rows. Techniques for identifying changed rows for Type 1 and Type 2 SCD processing when the dimension has many rows and columns are presented in [1], pg. 458.
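One common technique of that kind is to hash the attribute values rather than compare wide rows column by column. The sketch below illustrates it under the assumption of two illustrative attribute groups; the specific Type 1 and Type 2 attribute lists are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of "compute the delta at the target" change detection:
# compare hashes of attribute groups instead of comparing every column.
import hashlib

TYPE2_ATTRS = ("customer_segment", "region")  # changes here create a new dimension row
TYPE1_ATTRS = ("customer_name", "email")      # changes here overwrite in place

def row_hash(row: dict, attrs) -> str:
    payload = "|".join(str(row.get(a, "")) for a in attrs)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def classify_change(source_row: dict, target_row: dict | None) -> str:
    """Return 'new', 'type2', 'type1' or 'unchanged' for one natural key."""
    if target_row is None:
        return "new"
    if row_hash(source_row, TYPE2_ATTRS) != row_hash(target_row, TYPE2_ATTRS):
        return "type2"
    if row_hash(source_row, TYPE1_ATTRS) != row_hash(target_row, TYPE1_ATTRS):
        return "type1"
    return "unchanged"
```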

Step 8: Fact Table Incremental Processing. Automatic auditing, error handling (including persistence in an error event schema and verifying quality measures, as mentioned in section 6.2.4 below), and enabling restarts after catastrophic errors in case the incremental load fails, are key in this step. Loading the transactional fact table should be straightforward; however, loading the accumulating snapshot fact table can be complex. Loading can be improved by a trickle load or a parallel batch load operation, as suited to the database.
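A minimal sketch of the auditing and restartability side of this step is shown below. The etl_batch_audit and etl_error_event tables, the fact_sales columns, and the assumption that each extracted batch arrives as (customer_key, order_date_key, sales_amount) tuples are all illustrative, not from the paper.

```python
# Hypothetical sketch of a restartable incremental fact load: a batch audit
# table records progress, and rejected rows are persisted as error events.

def incremental_fact_load(conn, extract_batches, validate_fn):
    """Load fact batches with auditing; resume after the last committed batch."""
    last_done = conn.execute(
        "SELECT COALESCE(MAX(batch_id), 0) FROM etl_batch_audit WHERE status = 'ok'"
    ).fetchone()[0]
    for batch_id, rows in extract_batches:       # rows: (customer_key, date_key, amount) tuples
        if batch_id <= last_done:
            continue                              # already loaded before a previous failure
        good, bad = [], []
        for row in rows:
            (good if validate_fn(row) else bad).append(row)
        conn.executemany(
            "INSERT INTO fact_sales (customer_key, order_date_key, sales_amount) "
            "VALUES (?, ?, ?)", good)
        conn.executemany(
            "INSERT INTO etl_error_event (batch_id, reason, payload) VALUES (?, ?, ?)",
            [(batch_id, "validation_failed", str(r)) for r in bad])
        conn.execute(
            "INSERT INTO etl_batch_audit (batch_id, rows_loaded, status) VALUES (?, ?, 'ok')",
            (batch_id, len(good)))
        conn.commit()                             # checkpoint: this batch is now durable
```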