

WHITE PAPER
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
—
Surrogate integer keys are tight and more efficient than alphanumeric natural keys or GUIDs. They result in faster performance due to efficient joins, smaller indices, and more fact rows per block ([1], pages 255, 332).
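The key-assignment step above can be sketched as follows. This is a minimal illustration of how an ETL pipeline might map alphanumeric natural keys to dense integer surrogates; the function and key names are hypothetical, not from [1].

```python
# Sketch: assigning compact integer surrogate keys during a dimension load.
# Dense integers join faster and index smaller than 11-char product codes.

def make_key_assigner(start=1):
    """Map each natural key (e.g. an alphanumeric product code) to a
    small, dense integer surrogate key, assigned on first sight."""
    mapping = {}

    def assign(natural_key):
        if natural_key not in mapping:
            mapping[natural_key] = start + len(mapping)
        return mapping[natural_key]

    return assign, mapping

assign, mapping = make_key_assigner()
# Repeated natural keys resolve to the same surrogate:
keys = [assign(nk) for nk in ["PRD-00017-A", "PRD-00042-B", "PRD-00017-A"]]
print(keys)  # [1, 2, 1]
```

In a real ETL tool this lookup is usually a cached dimension-table join rather than an in-memory dict, but the contract is the same: one stable small integer per natural key.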
—
Matrix tables with no measures, or factless fact tables: multiple-column keys that unify a relation between multiple dimensions can be chosen as a pattern for some subject areas to model m:m (many-to-many) relationships.
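A factless fact table can be pictured as rows of dimension keys with no numeric measure. The sketch below uses a hypothetical attendance example (table and key names are illustrative, not from [1]): each row records that a student attended a course on a date, modeling the m:m relationship, and "measuring" it means counting rows.

```python
from collections import Counter

# Sketch: a factless fact table holds only dimension keys, no measures.
attendance_fact = [
    # (date_key, student_key, course_key)
    (20170301, 101, 7),
    (20170301, 102, 7),
    (20170302, 101, 9),
]

# Queries against it count row occurrences, e.g. attendance per course:
per_course = Counter(course_key for _, _, course_key in attendance_fact)
print(per_course[7])  # 2
```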
—
Avoid overly generalized columns with key filters, e.g. DimAddress as a generic table with a key to filter on employee, vendor, customer, contract staff, etc. As there is virtually no overlap among these physical entities, a generic table just complicates the index structure unnecessarily.
7.2.4 Performance at the implementation (physical design) stage
—
Avoid multiple columns for hierarchical dimensions ([1], page 268): e.g., instead of having product, product category, and line of business as separate columns, we could have a single column named product and indicate the granularity. This results in fewer and simpler indices.
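A minimal sketch of that single-column design, with illustrative column and level names (not from [1]): one name column plus a granularity indicator replaces three hierarchy columns, so reports at any grain filter on the indicator alone.

```python
# Sketch: one 'name' column plus a 'level' granularity indicator,
# instead of separate product / category / line-of-business columns.
dim_product = [
    {"product_key": 1, "name": "Checking Plus",    "level": "product"},
    {"product_key": 2, "name": "Deposit Accounts", "level": "category"},
    {"product_key": 3, "name": "Retail Banking",   "level": "line_of_business"},
]

# A report at category grain filters one indexed column instead of three:
categories = [r["name"] for r in dim_product if r["level"] == "category"]
print(categories)  # ['Deposit Accounts']
```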
—
The date dimension should have an integer surrogate key rather than a YYYYMMDD-formatted character or datetime key as the PK ([1], page 254): since the latter is 8 bytes, it would waste 4 bytes of storage per row and per index entry.
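The storage arithmetic can be checked with a small sketch. Even a meaningful YYYYMMDD-style value, encoded as an integer, fits comfortably in a 4-byte INT, versus 8 bytes for a DATETIME or CHAR(8) representation; the function name is illustrative.

```python
from datetime import date

def date_key(d):
    """Encode a calendar date as a YYYYMMDD integer. The result fits
    in a 4-byte signed INT (max ~2.1 billion), versus 8 bytes for a
    DATETIME or CHAR(8) key."""
    return d.year * 10000 + d.month * 100 + d.day

key = date_key(date(2017, 3, 15))
print(key)  # 20170315
```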
—
Design an initial index plan for all dimensions and facts ([1], page 344), and iteratively improve upon it as per reporting requirements or data access patterns.
—
For good query performance, it is important that key columns use the most efficient data type for join performance, usually an integer, though a fixed-length CHAR might be relatively efficient for low-cardinality columns ([1], page 268).
—
Section 5.3.5 presented the CDC subsystem to capture the changed source data. Generally speaking, timestamp columns and database transaction-log reads have less impact at the source than database trigger mechanisms that generate new rows in logging tables holding the delta records, because the ETL imposes read-only operations as opposed to writes (triggers) followed by reads (ETL) to compute the delta record set.
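The timestamp-column variant of CDC can be sketched as follows: the ETL remembers the high-water mark of its previous extract and reads only rows modified after it, leaving the source untouched. Function and column names here are illustrative, not from Section 5.3.5.

```python
# Sketch of timestamp-based CDC: read-only at the source, no triggers
# or logging tables; the ETL tracks a high-water mark between runs.

def extract_delta(source_rows, last_extracted_ts):
    """Return rows modified since the previous run, plus the new
    high-water mark to persist for the next run."""
    delta = [r for r in source_rows if r["modified_ts"] > last_extracted_ts]
    new_mark = max((r["modified_ts"] for r in delta), default=last_extracted_ts)
    return delta, new_mark

rows = [
    {"id": 1, "modified_ts": 100},
    {"id": 2, "modified_ts": 205},
    {"id": 3, "modified_ts": 210},
]
delta, mark = extract_delta(rows, last_extracted_ts=200)
print([r["id"] for r in delta], mark)  # [2, 3] 210
```

Note that this scheme misses hard deletes and relies on the source reliably maintaining the timestamp column, which is why log reading is the stronger option when available.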
—
For data movement performance -
—
Avoid using transforms that do row-by-row processing/fetching/loading.
—
Utilize ELT to push transforms down to the DW
—
Develop and test jobs for parallel processing
—
Test the ETL for near-real-time loads, if that is a key requirement
—
Drop indexes before fact loads
—
Avoid fact updates as they are slow; prefer DELETE-INSERT. This results in more fragmentation of the data, so rebuild the indices after each load.
—
Create stage tables for more flexibility, manageability, and better ETL performance
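The fact-load sequence implied by the bullets above (disable indexes, DELETE-INSERT from stage instead of UPDATE, rebuild indexes) can be sketched as an ordered statement plan. The helper, table names, and SQL Server-style syntax are illustrative assumptions.

```python
# Sketch: ordered fact-load plan - disable indexes before the load,
# DELETE-INSERT from a stage table rather than UPDATE, rebuild after.

def fact_load_plan(fact_table, date_key):
    return [
        f"ALTER INDEX ALL ON {fact_table} DISABLE",
        f"DELETE FROM {fact_table} WHERE date_key = {date_key}",
        f"INSERT INTO {fact_table} "
        f"SELECT * FROM stage_{fact_table} WHERE date_key = {date_key}",
        f"ALTER INDEX ALL ON {fact_table} REBUILD",
    ]

for stmt in fact_load_plan("FactSales", 20170315):
    print(stmt)
```

The DELETE-INSERT pair makes the daily load idempotent: rerunning it for the same date key replaces the partition's rows instead of corrupting them with double inserts.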
—
For reporting/BI tool performance -
—
Apply appropriate filters to hit the indexes
—
Avoid report-level calculations, grouping, application of business rules, null checks, etc.
—
Data Warehouse -
—
Implement indexes as per data access requirements. Prefer to use covering indexes.
—
Build a script to rebuild indexes periodically and collect DB statistics (defragmentation)
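A covering index, as recommended above, includes every column a frequent query touches, so the engine answers the query from the index alone without row lookups. The sketch below shows the idea with illustrative table, column, and index names in SQL Server-style syntax (the `INCLUDE` clause carries the non-key columns).

```python
# Sketch: a covering index for a common date-range sales query.
# All columns the query reads live in the index, so no row lookups occur.
covering_index_ddl = """
CREATE NONCLUSTERED INDEX IX_FactSales_Date_Product
ON FactSales (date_key, product_key)
INCLUDE (sales_amount);
"""

# This query touches only date_key, product_key, and sales_amount,
# so it can be served entirely from the index above:
covered_query = """
SELECT date_key, product_key, SUM(sales_amount)
FROM FactSales
WHERE date_key BETWEEN 20170301 AND 20170331
GROUP BY date_key, product_key;
"""
print("INCLUDE" in covering_index_ddl)  # True
```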