

—
Data movements from on-premise-to-cloud or cloud-to-cloud
Understand the data structure, volume, and update frequency. There are many ways to move the data, including secure FTP
and WAN optimization. Some cloud vendors also provide compression for big data that enterprises want to upload
to the cloud.
—
Perform heavy-duty data export, import, or ETL operations at night or during off-peak periods. Also perform
parallel loads to utilize resources optimally.
—
With larger volumes of data, one approach is to put the initial data upload onto disk and ship it to the cloud
provider, who uploads it over their internal network via SFTP; a VPN is then used for ongoing differential loads.
Some vendors also support loading compressed data into the DW from storage services, e.g. loading files in
an S3 bucket into an Amazon Redshift DW.
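As an illustration, a minimal sketch of such a load, assuming an Amazon Redshift cluster, a hypothetical SALES table, and gzip-compressed CSV part files already staged under an S3 prefix (the bucket name and IAM role are placeholders):

    -- Load all gzip-compressed CSV parts under the prefix;
    -- splitting data into multiple files lets Redshift load them in parallel across slices.
    COPY sales
    FROM 's3://example-bucket/initial-load/sales_part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    GZIP;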
—
Another approach is to use private network connection solutions, which allow organizations to connect directly
to cloud providers rather than over connections established on the public Internet. With this kind of
solution, the customer should choose an interconnection provider with broad geographic reach and place its
equipment close to that provider's data centers, from where direct connections to public and/or private cloud
providers (Amazon, Azure, Google, etc.) can be established.
—
Query Performance
—
Frequently monitor and profile queries to identify long-running operations, and run EXPLAIN to identify
any bottlenecks. Outdated system statistics can severely degrade performance, so keep performance-related
system statistics up to date. A nightly or weekly statistics rebuild is usually sufficient; it depends on the
data load frequency. This is recommended to be performed only by designers or advanced users.
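As a sketch, assuming a PostgreSQL-compatible warehouse such as Redshift and hypothetical SALES and CUSTOMER tables:

    -- Inspect the query plan of a long-running report query
    EXPLAIN
    SELECT c.region, SUM(s.amount)
    FROM sales s
    JOIN customer c ON s.cust_id = c.cust_id
    WHERE s.created_date >= '2017-01-01'
    GROUP BY c.region;

    -- Refresh planner statistics after large loads (e.g. as a nightly job)
    ANALYZE sales;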
—
On MPP cloud databases, it is important to partition data in the cluster by appropriately defining distribution
keys and sort keys, so that related data is co-located on a node and kept sorted. For example (a DDL sketch
follows the two cases below),
—
Using CUST_ID as the distribution key with a hash partitioning function will partition the table
according to the value of the hashing function on the CUST_ID column. Associative access to the data
with a specific attribute value can be directed to a single node, and sequential access can be served in
parallel.
—
Using CREATED_DATE as the sort key will sort the data in ascending order, so that the response time of
date-range queries improves substantially.
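A minimal DDL sketch of these two choices, using Amazon Redshift syntax and a hypothetical SALES table:

    CREATE TABLE sales (
        sale_id      BIGINT,
        cust_id      BIGINT,
        amount       DECIMAL(12,2),
        created_date DATE
    )
    DISTSTYLE KEY
    DISTKEY (cust_id)        -- co-locates all rows for a customer on one node
    SORTKEY (created_date);  -- keeps rows ordered for date-range scans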
—
While reading the data, project only the required columns instead of using “SELECT *”, which may not always hit
indexes and can end up doing full table scans.
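For instance (hypothetical columns on the SALES table assumed):

    -- Prefer an explicit column list over SELECT *
    SELECT sale_id, cust_id, amount
    FROM sales
    WHERE created_date >= '2017-01-01';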
—
Apply selective filters before joins: filtering out as much irrelevant data as possible before the join will
significantly boost join performance.
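A sketch of pushing the filter into a subquery ahead of the join (table and column names are illustrative):

    -- Restrict SALES to the relevant date range before joining to CUSTOMER
    SELECT c.region, f.amount
    FROM (
        SELECT cust_id, amount
        FROM sales
        WHERE created_date BETWEEN '2017-01-01' AND '2017-03-31'
    ) f
    JOIN customer c ON c.cust_id = f.cust_id;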
—
If possible, run one query at a time, as multiple concurrent queries will compete for resources and may slow
down overall query execution or the entire cluster.
—
Data transfer from the cloud database to the client: avoid transferring data directly from the DB to the client. Instead,
the projected data can be loaded into a temporary table and exported to a file using command-line utilities.
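A hedged sketch using Amazon Redshift, where UNLOAD writes the projected result set to files in S3 instead of streaming rows to the client (the bucket name and IAM role are placeholders):

    -- Stage only the projected columns in a temporary table
    CREATE TEMP TABLE sales_export AS
    SELECT sale_id, cust_id, amount
    FROM sales
    WHERE created_date >= '2017-01-01';

    -- Export the staged data to S3 as compressed, parallel part files
    UNLOAD ('SELECT * FROM sales_export')
    TO 's3://example-bucket/exports/sales_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    GZIP
    PARALLEL ON;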