

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
5.3.4 Dimension Data Modeling - Physical
The following are the physical dimensional data modeling best practices from [1].

— Physical dimension and fact table management is covered by subsystems 10 to 15, described below ([1], pages 262 – 271). The same physical dimension may play different logical roles in a dimensional model ([1], page 262). In this case, the illusion of independent dimension tables can be achieved through database views or aliases: strive to name the columns playing the different roles using different names through views (e.g., order date and shipping date), but nevertheless build and administer a single physical table.
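The role-playing pattern above can be sketched with views over a single physical date dimension. This is a minimal illustration; the table, view, and column names are our own, not from [1]:

```python
import sqlite3

# One physical date dimension, exposed under two logical roles via views.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (
        date_key   INTEGER PRIMARY KEY,
        full_date  TEXT,
        month_name TEXT
    );

    -- Each view renames the columns for the role it plays, so fact
    -- queries read naturally while only one table is administered.
    CREATE VIEW dim_order_date AS
        SELECT date_key   AS order_date_key,
               full_date  AS order_date,
               month_name AS order_month
        FROM dim_date;

    CREATE VIEW dim_ship_date AS
        SELECT date_key   AS ship_date_key,
               full_date  AS ship_date,
               month_name AS ship_month
        FROM dim_date;

    INSERT INTO dim_date VALUES (20170115, '2017-01-15', 'January');
""")

row = conn.execute("SELECT order_date FROM dim_order_date").fetchone()
print(row[0])  # the same physical row, seen through the order-date role
```

Queries against `dim_order_date` and `dim_ship_date` behave as if two independent dimension tables existed, while only `dim_date` is loaded and maintained.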
— Consider adopting the fully de-normalized dimensional model for databases that do not perform well on joins (HPE Vertica), or that cannot perform them at all (NoSQL stores – Cassandra, HBase – see section 5.2 above). This may also be applicable where the database join optimizer plan is not stable.

— Indexing strategy: There is a good discussion on indexing of fact tables and dimension tables in [1], pages 345 – 350, as part of "Designing the Physical Database and Planning for Performance". Having an index plan, and then adjusting the indexes as needed, is a best practice, as it is not uncommon to see databases with 15 indexes per table. The index types (B-Tree, clustered, bitmapped, index-organized tables), the optimizations (star schema optimization, cost-based optimization and query plans), and the operations (staging, transforming, loading) all have to be considered when creating the indexing plan.
— Partitioning: The advantages of partitioning when dealing with loading as well as with query performance are listed in [1], pages 359 – 360, in the section on Physical Storage Structure. A partitioning plan should be drawn up considering the performance of operations (load, archive, query) and table maintenance. As mentioned below in section 5.4.2, partitioning is also useful in achieving multi-tenancy when the shared schema approach is followed, and helps with security (see section 7.3.3.2).

5.3.5 Dimension Data Modeling - ETL Considerations
Chapters 9 and 10 of [1] introduce the most crucial part of any warehousing project, namely ETL. The former introduces a series of ETL subsystems (pp. 387 – 404), and the latter some best practices on how to develop dimension and fact table initial and incremental loading (pp. 437 – 468). In what follows we present some distilled notes on what we consider best practices for both the subsystems and the development techniques. We start with the Chapter 9 subsystems.
— Subsystem 2: Change Data Capture (CDC) system. This is the capability of isolating the relevant changes to the source data since the last data warehouse load. There are several strategies to compute the delta record set; the most popular ones are (i) database transaction log reads, (ii) trigger-based snapshot capture, and (iii) timestamp (a.k.a. audit) columns in each source table, with the ETL code finding the delta record set through timestamp-based queries on these tables. The first two are more reliable since the underlying data is generated by automated means. When timestamp columns are populated by anything other than automated triggers, you must pay special attention to whether they are a reliable indicator of change.
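Strategy (iii) can be sketched as a high-water-mark query against an audit column. The schema, column names, and dates below are illustrative assumptions:

```python
import sqlite3

# A source table with an audit column maintained by the source system.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customers (
        id            INTEGER PRIMARY KEY,
        name          TEXT,
        last_modified TEXT   -- timestamp (audit) column
    );
    INSERT INTO customers VALUES
        (1, 'Acme',    '2017-03-01 09:00:00'),
        (2, 'Globex',  '2017-03-02 14:30:00'),
        (3, 'Initech', '2017-03-03 08:15:00');
""")

# High-water mark recorded at the end of the previous warehouse load.
last_load_time = "2017-03-02 00:00:00"

# The delta record set: only rows touched since the last load.
delta = src.execute(
    "SELECT id, name FROM customers WHERE last_modified > ?",
    (last_load_time,),
).fetchall()
print(delta)  # [(2, 'Globex'), (3, 'Initech')]
```

After a successful load, the ETL would advance `last_load_time` to the maximum `last_modified` value it processed, so the next run picks up where this one left off.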
— Subsystem 9: SCD (Slowly Changing Dimensions) Manager. Having a specialized/centralized SCD Manager with logic to deal with SCD Type 1 (overwrite), Type 2 (new row, with the older row given a "valid until" date), and Type 3 (new descriptive attribute added to the dimension) changes, or a hybrid SCD approach, is a definite best practice.
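The Type 2 case (expire the current row, insert a new version) can be sketched in a few lines. This is a minimal in-memory illustration; the entity, attribute, and field names are our own assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DimRow:
    natural_key: str
    city: str
    valid_from: str
    valid_until: Optional[str] = None   # None marks the current row

def apply_scd2(rows: list, key: str, new_city: str, load_date: str) -> None:
    """SCD Type 2: close out the current version and append a new row."""
    for row in rows:
        if row.natural_key == key and row.valid_until is None:
            if row.city == new_city:
                return  # no change detected, nothing to do
            row.valid_until = load_date  # give the old row its "valid until" date
    rows.append(DimRow(key, new_city, valid_from=load_date))

dim = [DimRow("CUST-1", "Pune", "2016-01-01")]
apply_scd2(dim, "CUST-1", "Mumbai", "2017-05-01")
print(len(dim), dim[0].valid_until, dim[1].city)  # → 2 2017-05-01 Mumbai
```

A real SCD Manager would extend this with surrogate key assignment, a current-row flag, and Type 1/Type 3 handling per attribute, but the expire-and-insert step above is the core of Type 2.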