

—
Data movements from on-premise-to-cloud or cloud-to-cloud
Understand the data structure, volume, and update frequency. There are many ways to move the data, including secure FTP
and WAN optimization. Some cloud vendors also provide compression for big data that enterprises want to upload
to the cloud.
—
Perform heavy-duty data export, import, or ETL operations at night or during off-peak periods. Also perform
parallel loads to utilize resources optimally.
—
With larger volumes of data, one approach is to put the initial data upload onto disk and ship it to the cloud
provider, who uploads it over their internal network via SFTP; a VPN is then used for ongoing differential loads.
Some vendors also support loading compressed data into the DW from storage services, e.g. loading files in
an S3 bucket into an Amazon Redshift DW.
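As an illustration, a minimal sketch of such a load, assuming an Amazon Redshift cluster, a hypothetical SALES table, and gzip-compressed CSV part files already staged under an S3 prefix (the bucket name and IAM role are placeholders):

    -- Load all gzip-compressed CSV parts under the prefix;
    -- splitting data into multiple files lets Redshift load them in parallel across slices.
    COPY sales
    FROM 's3://example-bucket/initial-load/sales_part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    GZIP;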
—
Another approach is to use private network connection solutions, which allow organizations to connect directly
to cloud providers rather than over connections established on the public Internet. With this kind of
solution, the customer should choose an interconnection provider with broad geographic reach and place its
equipment close to that provider's data centers, from where direct connections to public and/or private cloud
providers (Amazon, Azure, Google, etc.) can be established.
—
Query Performance
—
Frequently monitor and profile queries to identify long-running operations, and run EXPLAIN to identify
any bottlenecks. Outdated system statistics can severely degrade performance, so keep performance-related
system statistics up to date. A nightly or weekly statistics rebuild is usually sufficient; it depends on the
data load frequency. This is recommended to be performed only by designers or advanced users.
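As a sketch, assuming a PostgreSQL-compatible warehouse such as Redshift and hypothetical SALES and CUSTOMER tables:

    -- Inspect the query plan of a long-running report query
    EXPLAIN
    SELECT c.region, SUM(s.amount)
    FROM sales s
    JOIN customer c ON s.cust_id = c.cust_id
    WHERE s.created_date >= '2017-01-01'
    GROUP BY c.region;

    -- Refresh planner statistics after large loads (e.g. as a nightly job)
    ANALYZE sales;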
—
On MPP cloud databases, it is important to partition data in the cluster by appropriately defining distribution
keys and sort keys, so that related data is co-located on a node and kept sorted. For example (a DDL sketch
follows the two cases below),
—
Using CUST_ID as the distribution key with a hash partitioning function will partition the table
according to the value of the hashing function on the CUST_ID column. Associative access to the data
with a specific attribute value can be directed to a single node, and sequential access can be served in
parallel.
—
Using CREATED_DATE as the sort key will sort the data in ascending order, so that the response time of
date-range queries improves substantially.
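A minimal DDL sketch of these two choices, using Amazon Redshift syntax and a hypothetical SALES table:

    CREATE TABLE sales (
        sale_id      BIGINT,
        cust_id      BIGINT,
        amount       DECIMAL(12,2),
        created_date DATE
    )
    DISTSTYLE KEY
    DISTKEY (cust_id)        -- co-locates all rows for a customer on one node
    SORTKEY (created_date);  -- keeps rows ordered for date-range scans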
—
While reading the data, project only the required columns instead of using “SELECT *”, which may not always hit
indexes and can end up doing full table scans.
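For instance (hypothetical columns on the SALES table assumed):

    -- Prefer an explicit column list over SELECT *
    SELECT sale_id, cust_id, amount
    FROM sales
    WHERE created_date >= '2017-01-01';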
—
Apply selective filters before joins: filtering out as much irrelevant data as possible before the join will
significantly boost join performance.
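A sketch of pushing the filter into a subquery ahead of the join (table and column names are illustrative):

    -- Restrict SALES to the relevant date range before joining to CUSTOMER
    SELECT c.region, f.amount
    FROM (
        SELECT cust_id, amount
        FROM sales
        WHERE created_date BETWEEN '2017-01-01' AND '2017-03-31'
    ) f
    JOIN customer c ON c.cust_id = f.cust_id;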
—
If possible, run one query at a time, as multiple concurrent queries will compete for resources and may slow
down overall query execution or the entire cluster.
—
Data transfer from the cloud database to the client: avoid transferring data directly from the DB to the client. Instead,
the projected data can be loaded into a temporary table and exported to a file using command-line utilities.
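A hedged sketch using Amazon Redshift, where UNLOAD writes the projected result set to files in S3 instead of streaming rows to the client (the bucket name and IAM role are placeholders):

    -- Stage only the projected columns in a temporary table
    CREATE TEMP TABLE sales_export AS
    SELECT sale_id, cust_id, amount
    FROM sales
    WHERE created_date >= '2017-01-01';

    -- Export the staged data to S3 as compressed, parallel part files
    UNLOAD ('SELECT * FROM sales_export')
    TO 's3://example-bucket/exports/sales_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
    GZIP
    PARALLEL ON;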