

and driving or building a culture of plan-test-measure-optimize. Documenting the performance success criteria per component of the BI system and prioritizing the reports critical to the business are very important steps and can go a long way in minimizing the potential risks.
— Assess current performance needs as well as future scalability and concurrency requirements (for multi-tenant DWs) and their cost implications, and come up with sizing estimates for space based on factors like row length, number of tables, number of years of history, index structures, de-normalization applied, temp/sort space, number of aggregate tables, number of stage tables, metadata and configuration tables, log/error tables, etc. [1] (pages 340, 360). In general, a DW takes up three to four times as much space as the corresponding OLTP system [1]. A back-of-the-envelope sketch of this sizing exercise follows below.
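As an illustration of the sizing exercise above, here is a minimal Python sketch that multiplies out those factors. All table definitions, row lengths, and overhead multipliers are hypothetical placeholders chosen for the example, not recommendations.

    # Back-of-the-envelope DW space sizing (illustrative numbers only).
    TABLES = {
        # table name: (avg row length in bytes, new rows per year) -- hypothetical
        "fact_sales":   (120, 500_000_000),
        "dim_customer": (300,   2_000_000),
        "dim_product":  (250,     100_000),
    }
    YEARS_OF_HISTORY = 5      # years of history retained
    INDEX_FACTOR     = 0.5    # index structures as a fraction of base data (assumption)
    AGGREGATE_FACTOR = 0.3    # aggregate tables (assumption)
    STAGE_FACTOR     = 0.2    # stage tables (assumption)
    TEMP_SORT_FACTOR = 0.25   # temp/sort working space (assumption)
    META_LOG_GB      = 50     # flat allowance for metadata/config and log/error tables

    GB = 1024 ** 3
    base_gb = sum(row_len * rows_per_year * YEARS_OF_HISTORY
                  for row_len, rows_per_year in TABLES.values()) / GB
    total_gb = base_gb * (1 + INDEX_FACTOR + AGGREGATE_FACTOR
                            + STAGE_FACTOR + TEMP_SORT_FACTOR) + META_LOG_GB
    print(f"base data:      {base_gb:,.0f} GB")
    print(f"total estimate: {total_gb:,.0f} GB")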
7.2.2 Performance at Design and Technical Architecture Stage
— Identify the components in the entire data flow or pipeline, right from source to consumption, which need to be tested or optimized for performance [1] (pages 336, 363). Typically, this includes:
— Source systems: check the source systems involved and their connectivity methods, and strive to find the optimal way to get data in. For instance, does this connector allow batching to increase throughput? Does it allow parallel reads? If not, can it allow bulk reads, and what would be the impact on the source system? If the source system is a message queue, check for data retention or caching knobs. (A batched, parallel-read sketch follows this list.)
— ETL system: choose an ETL tool that allows parallel processing of data without any data loss, duplication (like the potential for duplicate surrogate keys), or other implications in case of error events. Parallel processing at both the job level and the data level is preferred. Job-level parallel processing means multiple jobs can be run by a master ETL process or job without causing any disruptions, utilizing multi-core hardware. On the other hand, for processing huge amounts of data, data-level parallel processing is a must: data should be partitioned logically on certain keys, say a date range, and processed by different threads in parallel. This parallel data processing may also be referred to as grid processing. (A date-range partitioning sketch follows this list.)
— Reporting tool: understand the architecture and the various rendering layers of the reporting and visualization tools to get an understanding of the time spent between layers, from data request to final response. Choose a tool having minimal overhead at the Model-View-Controller and presentation layers (of the order of 100 to 300 milliseconds is reasonably good). (A layer-timing sketch follows this list.)
— Database infrastructure: Section 4.4.1.7 raised the question of choosing between MPP and SMP database technologies for a traditional data warehouse. As this is mainly a question about performance, in this section we present an overview of the public measurements that have been done on a well-known decision support benchmark, namely TPC-H [25], which back up, along with our own experience in the matter, the recommendations given in chapter 4 above. TPC-H is a warehousing benchmark that defines 22 relatively complex queries on top of a very simple database schema (just 8 tables) modeling the activity of a wholesale supplier. Different database sizes can be generated using scaling factors, from 1 GB to 100 TB. Two performance metrics are defined by the benchmark: (i) the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), which reflects several aspects of the capability of the system to process queries in an hour, and (ii) the TPC-H Price/Performance metric, expressed as dollars per QphH@Size (see the final sketch below).
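To make the batching and parallel-read questions in the source-systems item concrete, here is a minimal sketch of a batched, multi-threaded pull from a source. The fetch_batch function, batch size, and row count are hypothetical stand-ins for whatever the real connector exposes (a ranged SQL query, an API page, a queue poll).

    # Sketch: batched, parallel reads from a source system.
    from concurrent.futures import ThreadPoolExecutor

    BATCH_SIZE = 10_000
    TOTAL_ROWS = 1_000_000   # assume the source can report its row count

    def fetch_batch(offset):
        # Placeholder for e.g. a ranged query such as
        # SELECT ... ORDER BY id LIMIT <BATCH_SIZE> OFFSET <offset>.
        end = min(offset + BATCH_SIZE, TOTAL_ROWS)
        return [{"id": i} for i in range(offset, end)]

    offsets = range(0, TOTAL_ROWS, BATCH_SIZE)
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = [row for batch in pool.map(fetch_batch, offsets) for row in batch]
    print(f"fetched {len(rows)} rows in {len(offsets)} batches")

Whether four workers help or hurt depends on the source: parallel reads shift load onto the source system, which is exactly the impact the questions above ask you to assess.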
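The data-level parallelism described in the ETL item can be sketched as follows: partition on a date key, one calendar month per worker. The partitioning helper and the placeholder transform are illustrative; a real ETL tool would supply both.

    # Sketch: data-level parallelism by partitioning on a date range.
    from concurrent.futures import ProcessPoolExecutor
    from datetime import date

    def month_partitions(start, end):
        # Yield (first_day, first_day_of_next_month) half-open ranges.
        current = start.replace(day=1)
        while current < end:
            nxt = (current.replace(year=current.year + 1, month=1)
                   if current.month == 12
                   else current.replace(month=current.month + 1))
            yield current, nxt
            current = nxt

    def process_partition(bounds):
        lo, hi = bounds
        # Placeholder: extract rows WHERE lo <= event_date < hi,
        # transform them, and load them into the warehouse.
        return f"processed [{lo}, {hi})"

    if __name__ == "__main__":
        parts = list(month_partitions(date(2016, 1, 1), date(2017, 1, 1)))
        with ProcessPoolExecutor(max_workers=4) as pool:
            for result in pool.map(process_partition, parts):
                print(result)

Half-open, non-overlapping partitions are what prevent the duplication (e.g. duplicate surrogate keys) that the ETL item warns about: every source row falls into exactly one partition.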
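For the reporting-tool item, measuring where time goes between layers can be as simple as wrapping each layer in a timer. The layer names and the sleep calls below are hypothetical stand-ins for a real request pipeline.

    # Sketch: timing each rendering layer of a report request.
    import time
    from contextlib import contextmanager

    timings = {}

    @contextmanager
    def timed(layer):
        start = time.perf_counter()
        try:
            yield
        finally:
            timings[layer] = (time.perf_counter() - start) * 1000  # ms

    with timed("data request"):
        time.sleep(0.05)   # stand-in for querying the data model
    with timed("controller"):
        time.sleep(0.01)   # stand-in for MVC/controller work
    with timed("presentation"):
        time.sleep(0.02)   # stand-in for rendering the visualization

    for layer, ms in timings.items():
        print(f"{layer:<13} {ms:6.1f} ms")
    print(f"{'total':<13} {sum(timings.values()):6.1f} ms")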
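Finally, the two TPC-H metrics are easy to relate with a little arithmetic: Price/Performance is simply total system cost divided by QphH@Size. The figures below are invented for illustration; actual audited results are published on tpc.org.

    # Sketch: TPC-H Price/Performance from hypothetical results.
    results = [
        # (system label, total system cost in USD, QphH@Size) -- invented numbers
        ("System A (MPP)", 1_200_000, 600_000),
        ("System B (SMP)",   400_000, 150_000),
    ]
    for name, cost_usd, qphh in results:
        print(f"{name}: {cost_usd / qphh:.2f} $/QphH")
    # Higher QphH@Size means more query throughput per hour; lower $/QphH
    # means better value, so the two metrics must be read together.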