
Surrogate integer keys are compact and more efficient than alphanumeric natural keys or GUIDs. They result in faster performance due to more efficient joins, smaller indexes, and more fact rows per block ([1], pages 255 and 332).
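As a hedged illustration of this point (the table and column names are hypothetical; SQL Server-style syntax), a 4-byte integer surrogate key keeps both the dimension and the fact compact:

    -- Hypothetical customer dimension: the 4-byte integer surrogate key is the PK;
    -- the alphanumeric natural key is kept only as an attribute.
    CREATE TABLE DimCustomer (
        CustomerKey  INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- 4-byte surrogate key
        CustomerCode VARCHAR(20)       NOT NULL,             -- natural key, attribute only
        CustomerName VARCHAR(100)      NOT NULL
    );

    -- Joining on the 4-byte CustomerKey instead of a 16-byte GUID or a VARCHAR
    -- natural key yields smaller indexes and more fact rows per block.
    CREATE TABLE FactSales (
        DateKey     INT           NOT NULL,
        CustomerKey INT           NOT NULL REFERENCES DimCustomer (CustomerKey),
        SalesAmount DECIMAL(18,2) NOT NULL
    );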

Matrix tables with no measures, or factless fact tables: a table consisting only of the key columns of multiple dimensions can be chosen as a pattern for some subject areas to model m:m relationships.
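A minimal sketch of the factless fact pattern, assuming a hypothetical student/course subject area:

    CREATE TABLE DimStudent (StudentKey INT NOT NULL PRIMARY KEY, StudentName VARCHAR(100));
    CREATE TABLE DimCourse  (CourseKey  INT NOT NULL PRIMARY KEY, CourseName  VARCHAR(100));

    -- Factless fact table: only dimension keys, no measures. Each row simply
    -- records that a given student is enrolled in a given course (an m:m relation).
    CREATE TABLE FactEnrollment (
        StudentKey INT NOT NULL REFERENCES DimStudent (StudentKey),
        CourseKey  INT NOT NULL REFERENCES DimCourse (CourseKey),
        PRIMARY KEY (StudentKey, CourseKey)
    );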

Avoid overly generalized dimensions with key filters, e.g. DimAddress as a generic table with a key to filter on employee, vendor, customer, contract staff, etc. As there is virtually no overlap between these physical entities, such a design just complicates the index structure unnecessarily.

7.2.4 Performance at the implementation (physical design) stage

Avoid multiple columns for hierarchical dimensions ([1], page 268). For example, instead of having product, product category, and line of business as separate columns, we could have a single column named product and indicate the granularity. This results in fewer and simpler indexes.
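A hedged sketch of this pattern (hypothetical names): all hierarchy levels live in one dimension, a granularity indicator says which level a row represents, and the fact carries a single product key column:

    CREATE TABLE DimProduct (
        ProductKey       INT          NOT NULL PRIMARY KEY,
        ProductName      VARCHAR(100) NOT NULL,
        GranularityLevel VARCHAR(20)  NOT NULL -- 'Product', 'Category', or 'LineOfBusiness'
    );

    -- One ProductKey column (and one index) in place of three separate key columns.
    CREATE TABLE FactRevenue (
        ProductKey INT           NOT NULL REFERENCES DimProduct (ProductKey),
        Revenue    DECIMAL(18,2) NOT NULL
    );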

The date dimension should have an integer surrogate key instead of a YYYYMMDD-formatted character key as the PK ([1], page 254): since a CHAR(8) YYYYMMDD key is 8 bytes, we would be wasting 4 bytes of storage per row and per index entry compared to a 4-byte integer.
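A minimal sketch, assuming SQL Server-style syntax and hypothetical column names:

    -- 4-byte integer surrogate key such as 20170131; a CHAR(8) '20170131'
    -- key would cost 8 bytes in every fact row and index entry.
    CREATE TABLE DimDate (
        DateKey      INT      NOT NULL PRIMARY KEY, -- e.g. 20170131
        FullDate     DATE     NOT NULL,
        CalendarYear SMALLINT NOT NULL,
        MonthOfYear  TINYINT  NOT NULL,
        DayOfMonth   TINYINT  NOT NULL
    );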

Design an initial index plan for all dimensions and facts ([1], page 344). Iteratively improve upon it as per reporting requirements and data access patterns.

For good query performance, it is important that key columns use the most efficient data type for join performance, usually an integer, though a fixed-length CHAR might be relatively efficient for low-cardinality columns ([1], page 268).

Section 5.3.5 presented the CDC subsystem to capture the changed source data. Generally speaking, timestamp columns and database transaction log reads have less impact at the source than database trigger mechanisms that generate new rows in logging tables holding the delta records, because the ETL imposes read-only operations, as opposed to writes (triggers) followed by reads (ETL) to compute the delta record set.
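As a hedged sketch of the timestamp-column approach (the source table, its LastModified column, and the ETL control table are all hypothetical; SQL Server-style syntax), the delta extraction reduces to a single read-only range query:

    -- Read-only delta extraction: pull only the rows changed since the last run.
    -- The watermark is read from a hypothetical ETL control table.
    DECLARE @LastExtractTime DATETIME2 =
        (SELECT MAX(ExtractedThrough)
         FROM   etl.ExtractControl
         WHERE  SourceTable = 'Sales.Orders');

    SELECT OrderID, CustomerID, OrderAmount, LastModified
    FROM   Sales.Orders
    WHERE  LastModified > @LastExtractTime; -- no triggers, no writes at the source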

For data movement performance -

- Avoid using transforms that do row-by-row processing, fetching, or loading.
- Utilize ELT transforms to push the transformation work down to the DW.
- Develop and test jobs for parallel processing.
- Test the ETL for near real-time loads, if that is a key requirement.
- Drop indexes before fact loads.
- Avoid fact updates, as they are slow; prefer DELETE-INSERT. This results in more fragmentation of the data, so rebuild the indexes after each load (see the sketch after this list).
- Create stage tables for more flexibility, manageability, and better ETL performance.
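A hedged sketch of the drop-load-rebuild pattern from the list above (all table and index names are hypothetical; SQL Server-style syntax):

    -- 1. Drop the non-clustered index before the bulk fact load.
    DROP INDEX IX_FactSales_DateKey ON FactSales;

    -- 2. DELETE-INSERT instead of slow row-by-row UPDATEs: remove the rows
    --    being reloaded, then bulk-insert the fresh versions from the stage table.
    DELETE FROM FactSales WHERE DateKey = 20170131;

    INSERT INTO FactSales (DateKey, CustomerKey, SalesAmount)
    SELECT DateKey, CustomerKey, SalesAmount
    FROM   StageSales
    WHERE  DateKey = 20170131;

    -- 3. Rebuild the index after the load to remove the fragmentation
    --    the DELETE-INSERT introduced.
    CREATE INDEX IX_FactSales_DateKey ON FactSales (DateKey);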

For reporting/BI tool performance -

- Apply appropriate filters so that queries hit the indexes (see the sketch after this list).
- Avoid report-level calculations, grouping, application of business rules, null checks, etc.
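A hedged illustration of the filtering point (hypothetical names): write filters so the optimizer can actually use the index, and avoid wrapping the indexed column in report-level expressions:

    -- Hits an index on DateKey: a simple, sargable range filter.
    SELECT SUM(SalesAmount) FROM FactSales
    WHERE  DateKey BETWEEN 20170101 AND 20170131;

    -- Likely bypasses the index: the indexed column is wrapped in a function,
    -- so the expression is evaluated row by row at report time.
    SELECT SUM(SalesAmount) FROM FactSales
    WHERE  LEFT(CAST(DateKey AS VARCHAR(8)), 6) = '201701';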

For the Data Warehouse -

- Implement indexes as per data access requirements. Prefer covering indexes (see the sketch below).
- Build a script to rebuild indexes periodically (defragmentation) and collect DB statistics.
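A hedged sketch of both items (hypothetical index and table names; SQL Server-style syntax): a covering index whose INCLUDE columns let a common report query be answered from the index alone, plus the statements a periodic maintenance script would issue:

    -- Covering index: a query that filters on DateKey and reads CustomerKey and
    -- SalesAmount is satisfied entirely from the index, with no table lookups.
    CREATE NONCLUSTERED INDEX IX_FactSales_Date_Covering
        ON FactSales (DateKey)
        INCLUDE (CustomerKey, SalesAmount);

    -- Periodic maintenance: rebuild to defragment, then refresh optimizer statistics.
    ALTER INDEX IX_FactSales_Date_Covering ON FactSales REBUILD;
    UPDATE STATISTICS FactSales;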