
System deployment in production starts by passing the complete set of tests on a test system that is as similar to the production system as possible. Then, move the software artifacts from the test repository to the production repository. The more automated and parameterized this move can be, the better; and the deployment move itself is something else to test. Finally, perform an automated test on the production system before letting users in. Deployment is much harder on a system already in production. Pay attention to the chain of dependencies: reports depend on views, which depend on data warehouse tables, which depend on ETL data pipelines, which depend on data sources. Changing any of these may break something down the line, so a system that maintains dependencies and allows impacts to be dealt with individually is critical; a minimal sketch of such impact tracking follows.
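To make the dependency chain concrete, the following is a minimal Python sketch of a dependency registry that walks the chain to list every downstream artifact affected by a change. All artifact names are hypothetical, and this is an illustration of the idea rather than a prescribed implementation.

    from collections import defaultdict

    # Hypothetical dependency registry: each artifact maps to the
    # artifacts that depend on it directly.
    DEPENDENTS = defaultdict(list)

    def register(artifact, depends_on):
        # Record that `artifact` sits downstream of each artifact listed.
        for upstream in depends_on:
            DEPENDENTS[upstream].append(artifact)

    def impacted_by(changed):
        # Walk the chain to collect every downstream artifact affected
        # by a change to `changed`.
        impacted, stack = set(), [changed]
        while stack:
            for dependent in DEPENDENTS[stack.pop()]:
                if dependent not in impacted:
                    impacted.add(dependent)
                    stack.append(dependent)
        return impacted

    # The chain from the text: source -> ETL pipeline -> warehouse
    # table -> view -> report.
    register("etl_orders_pipeline", depends_on=["src_orders"])
    register("dw_fact_orders", depends_on=["etl_orders_pipeline"])
    register("v_orders_summary", depends_on=["dw_fact_orders"])
    register("rpt_monthly_orders", depends_on=["v_orders_summary"])

    print(impacted_by("dw_fact_orders"))
    # {'v_orders_summary', 'rpt_monthly_orders'}: only these need
    # review and redeployment when the table changes.

With such a registry in place, a change to any single artifact yields the exact set of downstream items to retest and redeploy, instead of forcing a full-system redeployment.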

Other deployment considerations for dimensional data warehouses are the following:

- From a dimension data perspective, having a single deployment platform for MDM, identity data, Customer 360, and BI/reporting is the recommended approach.
- Storage management is another key component of the deployment, as the growth of data under analysis can affect performance (response times) and may force storage reconfigurations.
- The deployment should facilitate behind-the-scenes housekeeping: ETL metadata, availability, BCP/DR (Business Continuity Planning/Disaster Recovery), performance, security, archival, and staging management, as sketched below.
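Housekeeping of this kind is easiest to operate when driven from a single declarative configuration. The following small Python sketch illustrates the idea; every parameter name is hypothetical, not taken from the reference publication.

    from dataclasses import dataclass

    @dataclass
    class HousekeepingConfig:
        # All parameter names are illustrative.
        etl_metadata_retention_days: int = 90      # prune old ETL run metadata
        drop_staging_after_load: bool = True       # staging management
        archive_partitions_after_months: int = 36  # archival policy
        backup_cron: str = "0 2 * * *"             # nightly backup window (BCP/DR)

    print(HousekeepingConfig())

Centralizing these knobs means the same deployment can be retuned for growing data volumes without code changes.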

5.4 Enhancements to the reference publication

5.4.1 Dimension Modeling in our own experience

A High Level Model Diagram is important for anchoring discussion. The creation of a high-level model diagram (Kimball, figure 7-3) helps all the stakeholders, including business users, in discussions. Diagrams [1] usually elicit better feedback and discussion. This is even more important with distributed teams. The visual anchoring of the conversations, as well as the visual recall provided by the diagram, are of great value. To this effect, the advice about not radically altering the model (or even the relative positions within the high-level diagram) is very relevant.

Automatic Change Data Capture (CDC) mechanisms at the source are key for keeping the data warehouse up to date in the presence of hard deletes at the source. If the source deletes rows physically, only database logs, or database triggers that generate data as a side effect of the deletion, carry the correct operation code (DELETE) that the ETL layer needs to keep the target database up to date. The alternative, comparing full table contents between source and target during incremental loads, is a performance killer when tables are large. Also, as further discussed in the point below, CDC mechanisms based on database transaction logs or triggers can be used for pushing changes in real time to the data warehouse (as opposed to pulling them through queries). Section 7.2.4 compares these mechanisms from a performance point of view.
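As an illustration only, here is a minimal Python sketch (using sqlite3, with an invented table and event shape) of how an ETL layer might apply log- or trigger-generated CDC events, including hard DELETEs, to a target table:

    import sqlite3

    def apply_cdc_events(conn, events):
        # Apply change events, assumed to carry an operation code and the
        # row's primary key, as a log reader or trigger would emit them.
        for ev in events:
            if ev["op"] == "DELETE":
                # A hard delete at the source propagates only because the
                # log or trigger recorded the DELETE with the row's key.
                conn.execute("DELETE FROM customer WHERE id = ?", (ev["id"],))
            else:
                # INSERT or UPDATE: upsert the latest row image.
                conn.execute(
                    "INSERT INTO customer (id, name) VALUES (?, ?) "
                    "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
                    (ev["id"], ev["name"]),
                )
        conn.commit()

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    apply_cdc_events(conn, [
        {"op": "INSERT", "id": 1, "name": "Acme"},
        {"op": "UPDATE", "id": 1, "name": "Acme Corp"},
        {"op": "DELETE", "id": 1},
    ])
    print(conn.execute("SELECT COUNT(*) FROM customer").fetchone())  # (0,)

Note that the loop never consults the source tables: the operation codes alone are enough to keep the target current, which is exactly what makes this approach cheaper than full-table comparison.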

Creation of a “data highway” is often needed. Kimball highlights the need for data traveling at various speeds, in different lanes (a routing sketch follows this list):

- Raw source (immediate): CEP, alerts, fraud
- Real time (seconds): ad selection, sports, stock market, IoT, systems monitoring
- Business activity (minutes): trouble tickets, workflows, mobile app dashboards
- Top line (24 hours): tactical reporting, overview status dashboards
- EDW (daily, periodic, yearly): historical analysis, analytics, all reporting
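As a small illustration, lane selection can be reduced to matching each consumer's latency tolerance against a lane's delivery budget. The lane names and budgets in this Python sketch are illustrative, not prescribed by Kimball:

    # Illustrative data-highway lanes and their delivery budgets in seconds.
    LANES = {
        "raw_source": 0,           # immediate: CEP, alerts, fraud
        "real_time": 1,            # seconds: ad selection, IoT, monitoring
        "business_activity": 60,   # minutes: tickets, workflows, dashboards
        "top_line": 24 * 3600,     # 24 hours: tactical reporting
        "edw": 7 * 24 * 3600,      # daily/periodic: historical analysis
    }

    def pick_lane(max_latency_seconds):
        # Choose the slowest lane whose budget still meets the requirement:
        # slower lanes are cheaper, so prefer the slowest eligible one.
        eligible = [(budget, lane) for lane, budget in LANES.items()
                    if budget <= max_latency_seconds]
        # Fall back to the fastest lane for sub-budget requirements.
        return max(eligible)[1] if eligible else "raw_source"

    print(pick_lane(0.5))   # raw_source: sub-second requirement
    print(pick_lane(300))   # business_activity: minutes are acceptable

Routing each consumer to the slowest lane that still satisfies it keeps the expensive low-latency lanes reserved for the workloads that genuinely need them.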