

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
5.4.2 Cloud Deployments
—
Kimball recommends using the public cloud in the prototyping phase and then moving to the private cloud as things mature, or once the return on investment is proven. Given the capital expenditure involved in provisioning servers and tools, this is a very practical suggestion, after the aspects of data security and PII (personally identifiable information) have been considered. In the cloud context, the physical data model and ETL may be tailored to the infrastructure available: the public cloud may use products such as SnapLogic and Amazon RedShift (or Google SQL and Cloud DataFlow), while the private cloud may have Informatica and Teradata, for instance.
—
Cluster based physical table design. In a cluster of machines (in the private or public cloud), we partition facts across the Data Layer and replicate dimensions across the Query Layer. Very large fact tables can be partitioned across the Data Layer by time, geography, or tenant-id (in a shared-schema approach), or by other partition-elimination (and selective) mechanisms; these fact tables are partitioned across the storage layer. In contrast, dimension tables are usually replicated across the nodes of the cluster to service queries, often in SSD storage or in-memory caches, as a performance-boosting strategy. The key on which the facts are partitioned is of particular importance.
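The layout above can be sketched in a few lines: fact rows are hash-partitioned across data-layer nodes by a chosen partition key, while the dimension table is copied in full to every node. This is a minimal illustration, not any product's implementation; the node count, table names, and the `tenant_id` partition key are all hypothetical.

```python
# Sketch of the cluster table layout: facts hash-partitioned across
# data-layer nodes, dimensions replicated to every node.
from collections import defaultdict

DATA_NODES = 4  # hypothetical number of nodes in the Data Layer

def partition_facts(fact_rows, key):
    """Assign each fact row to one data node by hashing the partition key."""
    shards = defaultdict(list)
    for row in fact_rows:
        shards[hash(row[key]) % DATA_NODES].append(row)
    return shards

def replicate_dimension(dim_rows):
    """Every node receives a full copy of the dimension table."""
    return {node: list(dim_rows) for node in range(DATA_NODES)}

facts = [{"tenant_id": t, "amount": t * 10} for t in range(12)]
dims = [{"tenant_id": t, "name": f"tenant-{t}"} for t in range(12)]

fact_shards = partition_facts(facts, "tenant_id")
dim_copies = replicate_dimension(dims)

# Each fact is placed exactly once; every node holds all dimensions.
assert sum(len(s) for s in fact_shards.values()) == len(facts)
assert all(len(c) == len(dims) for c in dim_copies.values())
```

A query touching one tenant then reads a single fact shard and joins against the local dimension copy, which is why the choice of partition key matters so much.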
—
Physical table design strategies for Cloud based DW products. Cloud based DW products (like Amazon RedShift) force a partitioning of facts with distribution keys and fixed sort orders. HPE Vertica follows the same approach for its projections. In both systems, the query patterns dictate the physical storage.
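To see why a fixed sort order ties physical storage to query patterns, consider block skipping: when rows are stored sorted on a key, the engine can keep min/max metadata per block (Redshift's zone maps work on this principle) and skip blocks that cannot satisfy a range predicate. The sketch below is a simplified model; the block size, column name, and data are illustrative, not any engine's internals.

```python
# Model of sort-key-based block skipping on a range predicate.
BLOCK_SIZE = 4  # hypothetical rows per storage block

def build_blocks(rows, sort_key):
    """Store rows sorted on the sort key, with per-block min/max metadata."""
    rows = sorted(rows, key=lambda r: r[sort_key])
    blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]
    meta = [(b[0][sort_key], b[-1][sort_key]) for b in blocks]
    return blocks, meta

def range_scan(blocks, meta, sort_key, lo, hi):
    """Read only blocks whose [min, max] range overlaps the predicate."""
    hits, blocks_read = [], 0
    for block, (bmin, bmax) in zip(blocks, meta):
        if bmax < lo or bmin > hi:
            continue  # block skipped without being read
        blocks_read += 1
        hits.extend(r for r in block if lo <= r[sort_key] <= hi)
    return hits, blocks_read

rows = [{"sale_day": d} for d in range(16)]
blocks, meta = build_blocks(rows, "sale_day")
hits, read = range_scan(blocks, meta, "sale_day", 5, 7)
assert len(hits) == 3 and read == 1  # one of four blocks actually read
```

If queries filtered on a different column instead, this physical sort order would buy nothing, which is the sense in which query patterns must dictate the storage design.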
—
Cloud ETL comes with challenges related to the points below; having a clear architecture, strategy, and cost estimates helps here.
—
Initial data population and the transfer/bandwidth costs.
—
Dealing with data characteristics (rapidly changing data, streaming data, unchanging data).
—
Tweaking the performance of cloud data warehouse nodes.
—
Bandwidth issues between nodes, and communication costs.
—
Pay-per-use costs associated with transformations.
—
A comprehensive and unified cloud system administration and data lifecycle management strategy.
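Several of the items above are cost lines that can be estimated up front. The sketch below rolls transfer, storage, and pay-per-use transformation charges for an initial data population into one figure; every rate is a hypothetical placeholder, and real numbers must come from the provider's current price sheet.

```python
# Back-of-the-envelope estimate for an initial cloud data population.
# All per-GB rates below are illustrative placeholders, not real prices.

def initial_load_cost(dataset_gb,
                      transfer_rate_per_gb=0.02,    # network transfer in
                      storage_rate_per_gb=0.023,    # first month of storage
                      transform_rate_per_gb=0.05):  # pay-per-use ETL compute
    transfer = dataset_gb * transfer_rate_per_gb
    storage = dataset_gb * storage_rate_per_gb
    transform = dataset_gb * transform_rate_per_gb
    return {"transfer": transfer,
            "storage": storage,
            "transform": transform,
            "total": transfer + storage + transform}

estimate = initial_load_cost(5000)  # a 5 TB initial population
assert round(estimate["total"], 2) == 465.0
```

Even a crude model like this makes the trade-offs visible, e.g. whether shipping transformed data into the cloud is cheaper than transforming it there under pay-per-use pricing.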
—
Cloud deployment of a warehouse gives flexibility in scheduling workloads and in mapping machines to the workload. Compute and Storage can be separated and provisioned independently; Snowflake is an example.