WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

5.3.4Dimension DataModeling - Physical

The following are the physical dimensional datamodeling best practices from

. [1]

—

Physical dimension and fact table management

( ,

pages 262 – 271) is covered by subsystems 10 to 15

[1]

described below. The same physical dimension may play different logical roles in a dimensional model

( , [1]

page 262). In this case, the illusion of independent dimension tables can be achieved through database views

or aliases: strive to

name the columns playing the different roles using different names

(e.g., order date

and shipping date)

through views

, but nevertheless build and administer a single physical table.

—

Consider adopting the fully de-normalized dimensional model in cases of databases not so

performant for joins

(HPE Vertica),

or those unable to perform them

(NoSQLs – Cassandra, HBase –see

sectio

above). This may also be applicable where the database join optimizer plan is not stable.

5.2

—

Indexing strategy: There is a good discussion on indexing of Fact Tables and Dimension tables in

pages

[1]

345-50 as part of the “Designing the Physical Database and Planning for Performance”.

Having an index

plan, and then adjusting the indexes as needed

is a best practice, as it is not uncommon to see databases

with 15 indexes per table. The index types (B-Tree, Clustered, Bitmapped, Index Organized Tables), the

optimizations (Star Schema optimization, Cost based optimization and query plans), and the operations

(Staging, Transforming, Loading) all have to be considering when creating the indexing plan.

—

Partitioning: The advantages of partitioning when dealing with loading as well as with query performance are

listed in

pages 359-60, in the section on Physical Storage Structure.

A partitioning plan should be

[1]

drawn up considering performance of operations (load, archive, query), and table maintenance.

mentioned below in section

, partitioning is also useful in achieving multi-tenancy when shared schema

5.4.2

approach is followed, and helps with security (see section

)

7.3.3.2

5.3.5Dimension DataModeling - ETL Considerations

Chapters 9 and 10 of

introduce the most crucial part of any warehousing project, namely ETL. The former

[1]

introduces a series of ETL subsystems (pp. 387 – 404), and the latter some best practices on how to develop

dimension and fact table initial and incremental loading (pp. 437 – 468). In what follows we present some distilled

notes of what we consider best practices on both the subsystems and the development techniques. We start by the

Chapter 9 subsystems.

—

Subsystem 2: Change Data Capture (CDC) system. This is the capability of being able to isolate the relevant

changes to the source data since the last data warehouse load. There are several strategies to compute the

delta record set; the most popular ones are (i) database transaction log read, (ii) trigger-based snapshot

capture, and (iii) timestamp (a.k.a. audit) columns in each source table, with the ETL code finding the delta

record set with time-stamp based queries on these tables. The two first ones are more reliable since the

underlying data is generated by automated means. When timestamp columns are populated by anything other

than automated triggers, you must pay special attention to whether they are a reliable source to indicate

change.

—

Subsystem 9: SCD (Slowly Changing Dimensions) Manager:

Having a specialized/centralized SCD

Manager

that has logic to deal with SCD Type 1 (Overwrite), Type 2 (New row, with older row having a “valid

until date”), Type 3 (new descriptive attribute added to dimension) changes or a Hybrid SCD approach

is a

definite best practice.