

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
5.4.2 Cloud Deployments
—
Kimball recommends using the public cloud in the prototyping phase and then moving to the private cloud as things mature, or once the return on investment is proven. Given the capital expenditure involved in provisioning servers and tools, this is a very practical suggestion, after the aspects of data security and PII (personally identifiable information) have been considered. In the cloud context, the physical data model and ETL may be tailored to the infrastructure available: the public cloud may use products such as SnapLogic and Amazon RedShift (or Google SQL and Cloud DataFlow), while the private cloud may have Informatica and Teradata, for instance.
—
Cluster based physical table design. In a cluster of machines (in the private or public cloud), we partition facts across the Data Layer and replicate dimensions across the Query Layer. Very large fact tables can be partitioned across the Data Layer by time, geography, or tenant-id (in a shared-schema approach), or by other partition-elimination (and selective) mechanisms; these fact tables are partitioned across the storage layer. In contrast, dimension tables are usually replicated across the nodes of the cluster to service queries, often in SSD storage or in-memory caches, as a performance-boosting strategy. The key on which the facts are partitioned is of particular importance.
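The layout above can be sketched in a few lines: fact rows are hash-partitioned across data-layer nodes by a chosen partition key, while the dimension table is copied in full to every node. This is a minimal illustration, not any product's implementation; the node count, table names, and the `tenant_id` partition key are all hypothetical.

```python
# Sketch of the cluster table layout: facts hash-partitioned across
# data-layer nodes, dimensions replicated to every node.
from collections import defaultdict

DATA_NODES = 4  # hypothetical number of nodes in the Data Layer

def partition_facts(fact_rows, key):
    """Assign each fact row to one data node by hashing the partition key."""
    shards = defaultdict(list)
    for row in fact_rows:
        shards[hash(row[key]) % DATA_NODES].append(row)
    return shards

def replicate_dimension(dim_rows):
    """Every node receives a full copy of the dimension table."""
    return {node: list(dim_rows) for node in range(DATA_NODES)}

facts = [{"tenant_id": t, "amount": t * 10} for t in range(12)]
dims = [{"tenant_id": t, "name": f"tenant-{t}"} for t in range(12)]

fact_shards = partition_facts(facts, "tenant_id")
dim_copies = replicate_dimension(dims)

# Each fact is placed exactly once; every node holds all dimensions.
assert sum(len(s) for s in fact_shards.values()) == len(facts)
assert all(len(c) == len(dims) for c in dim_copies.values())
```

A query touching one tenant then reads a single fact shard and joins against the local dimension copy, which is why the choice of partition key matters so much.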
—
Physical table design strategies for Cloud based DW products. Cloud based DW products (like Amazon RedShift) force a partitioning of facts with distribution keys and fixed sort orders. HPE Vertica follows the same approach for its projections. In both systems, the query patterns dictate the physical storage.
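To see why a fixed sort order ties physical storage to query patterns, consider block skipping: when rows are stored sorted on a key, the engine can keep min/max metadata per block (Redshift's zone maps work on this principle) and skip blocks that cannot satisfy a range predicate. The sketch below is a simplified model; the block size, column name, and data are illustrative, not any engine's internals.

```python
# Model of sort-key-based block skipping on a range predicate.
BLOCK_SIZE = 4  # hypothetical rows per storage block

def build_blocks(rows, sort_key):
    """Store rows sorted on the sort key, with per-block min/max metadata."""
    rows = sorted(rows, key=lambda r: r[sort_key])
    blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]
    meta = [(b[0][sort_key], b[-1][sort_key]) for b in blocks]
    return blocks, meta

def range_scan(blocks, meta, sort_key, lo, hi):
    """Read only blocks whose [min, max] range overlaps the predicate."""
    hits, blocks_read = [], 0
    for block, (bmin, bmax) in zip(blocks, meta):
        if bmax < lo or bmin > hi:
            continue  # block skipped without being read
        blocks_read += 1
        hits.extend(r for r in block if lo <= r[sort_key] <= hi)
    return hits, blocks_read

rows = [{"sale_day": d} for d in range(16)]
blocks, meta = build_blocks(rows, "sale_day")
hits, read = range_scan(blocks, meta, "sale_day", 5, 7)
assert len(hits) == 3 and read == 1  # one of four blocks actually read
```

If queries filtered on a different column instead, this physical sort order would buy nothing, which is the sense in which query patterns must dictate the storage design.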
—
Cloud ETL comes with challenges related to the points below; having a clear architecture, strategy, and cost estimates helps here.
—
Initial data population and the transfer/bandwidth costs.
—
Dealing with data characteristics (rapidly changing data, streaming data, unchanging data).
—
Tweaking the performance of cloud data warehouse nodes.
—
Bandwidth issues between nodes, and communication costs.
—
Pay-per-use costs associated with transformations.
—
A comprehensive and unified cloud system administration and data lifecycle management strategy.
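Several of the items above are cost lines that can be estimated up front. The sketch below rolls transfer, storage, and pay-per-use transformation charges for an initial data population into one figure; every rate is a hypothetical placeholder, and real numbers must come from the provider's current price sheet.

```python
# Back-of-the-envelope estimate for an initial cloud data population.
# All per-GB rates below are illustrative placeholders, not real prices.

def initial_load_cost(dataset_gb,
                      transfer_rate_per_gb=0.02,    # network transfer in
                      storage_rate_per_gb=0.023,    # first month of storage
                      transform_rate_per_gb=0.05):  # pay-per-use ETL compute
    transfer = dataset_gb * transfer_rate_per_gb
    storage = dataset_gb * storage_rate_per_gb
    transform = dataset_gb * transform_rate_per_gb
    return {"transfer": transfer,
            "storage": storage,
            "transform": transform,
            "total": transfer + storage + transform}

estimate = initial_load_cost(5000)  # a 5 TB initial population
assert round(estimate["total"], 2) == 465.0
```

Even a crude model like this makes the trade-offs visible, e.g. whether shipping transformed data into the cloud is cheaper than transforming it there under pay-per-use pricing.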
—
Cloud deployment of a warehouse gives flexibility in scheduling workloads and in mapping machines to the workload. Compute and Storage can be separated and provisioned independently; Snowflake is an example.