

and driving or building a culture of plan-test-measure-optimize. Documenting the performance success criteria per component of the BI system and prioritizing the reports critical to the business are very important steps and can go a long way in minimizing the potential risks.
— Assess current performance needs as well as future scalability and concurrency requirements (for multi-tenant DWs) and their cost implications, and come up with sizing estimates for space based on factors like row length, number of tables, number of years of history, index structures, de-normalization applied, temp/sort space, number of aggregate tables, number of stage tables, metadata and configuration tables, log/error tables, etc. [1] (pages 340, 360). In general, a DW takes up three to four times as much space as the corresponding OLTP system [1]. A back-of-the-envelope sketch of this sizing exercise follows below.
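As an illustration of the sizing exercise above, here is a minimal Python sketch that multiplies out those factors. All table definitions, row lengths, and overhead multipliers are hypothetical placeholders chosen for the example, not recommendations.

    # Back-of-the-envelope DW space sizing (illustrative numbers only).
    TABLES = {
        # table name: (avg row length in bytes, new rows per year) -- hypothetical
        "fact_sales":   (120, 500_000_000),
        "dim_customer": (300,   2_000_000),
        "dim_product":  (250,     100_000),
    }
    YEARS_OF_HISTORY = 5      # years of history retained
    INDEX_FACTOR     = 0.5    # index structures as a fraction of base data (assumption)
    AGGREGATE_FACTOR = 0.3    # aggregate tables (assumption)
    STAGE_FACTOR     = 0.2    # stage tables (assumption)
    TEMP_SORT_FACTOR = 0.25   # temp/sort working space (assumption)
    META_LOG_GB      = 50     # flat allowance for metadata/config and log/error tables

    GB = 1024 ** 3
    base_gb = sum(row_len * rows_per_year * YEARS_OF_HISTORY
                  for row_len, rows_per_year in TABLES.values()) / GB
    total_gb = base_gb * (1 + INDEX_FACTOR + AGGREGATE_FACTOR
                            + STAGE_FACTOR + TEMP_SORT_FACTOR) + META_LOG_GB
    print(f"base data:      {base_gb:,.0f} GB")
    print(f"total estimate: {total_gb:,.0f} GB")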
7.2.2 Performance at Design and Technical Architecture Stage
— Identify the components in the entire data flow or pipeline, right from source to consumption, which need to be tested or optimized for performance [1] (pages 336, 363). Typically, this includes:
— Source systems: check the source systems involved and their connectivity methods, and strive to find the optimal way to get data in. For instance, does this connector allow batching to increase throughput? Does it allow parallel reads? If not, can it allow bulk reads, and what would be the impact on the source system? If the source system is a message queue, check for data retention or caching knobs. (A batched, parallel-read sketch follows this list.)
— ETL system: choose an ETL tool that allows parallel processing of data without any data loss, duplication (like the potential for duplicate surrogate keys), or other implications in case of error events. Parallel processing at both the job level and the data level is preferred. Job-level parallel processing means multiple jobs can be run by a master ETL process or job without causing any disruptions, utilizing multi-core hardware. On the other hand, for processing huge amounts of data, data-level parallel processing is a must: data should be partitioned logically on certain keys, say a date range, and processed by different threads in parallel. This parallel data processing may also be referred to as grid processing. (A date-range partitioning sketch follows this list.)
— Reporting tool: understand the architecture and the various rendering layers of the reporting and visualization tools to get an understanding of the time spent between layers, from data request to final response. Choose a tool having minimal overhead at the Model-View-Controller and presentation layers (of the order of 100 to 300 milliseconds is reasonably good). (A layer-timing sketch follows this list.)
— Database infrastructure: Section 4.4.1.7 raised the question of choosing between MPP and SMP database technologies for a traditional data warehouse. As this is mainly a question about performance, in this section we present an overview of the public measurements that have been done on a well-known decision support benchmark, namely TPC-H [25], which back up, along with our own experience in the matter, the recommendations given in chapter 4 above. TPC-H is a warehousing benchmark that defines 22 relatively complex queries on top of a very simple database schema (just 8 tables) modeling the activity of a wholesale supplier. Different database sizes can be generated using scaling factors, from 1 GB to 100 TB. Two performance metrics are defined by the benchmark: (i) the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), which reflects several aspects of the capability of the system to process queries in an hour, and (ii) the TPC-H Price/Performance metric, expressed as dollars per QphH@Size (see the final sketch below).
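To make the batching and parallel-read questions in the source-systems item concrete, here is a minimal sketch of a batched, multi-threaded pull from a source. The fetch_batch function, batch size, and row count are hypothetical stand-ins for whatever the real connector exposes (a ranged SQL query, an API page, a queue poll).

    # Sketch: batched, parallel reads from a source system.
    from concurrent.futures import ThreadPoolExecutor

    BATCH_SIZE = 10_000
    TOTAL_ROWS = 1_000_000   # assume the source can report its row count

    def fetch_batch(offset):
        # Placeholder for e.g. a ranged query such as
        # SELECT ... ORDER BY id LIMIT <BATCH_SIZE> OFFSET <offset>.
        end = min(offset + BATCH_SIZE, TOTAL_ROWS)
        return [{"id": i} for i in range(offset, end)]

    offsets = range(0, TOTAL_ROWS, BATCH_SIZE)
    with ThreadPoolExecutor(max_workers=4) as pool:
        rows = [row for batch in pool.map(fetch_batch, offsets) for row in batch]
    print(f"fetched {len(rows)} rows in {len(offsets)} batches")

Whether four workers help or hurt depends on the source: parallel reads shift load onto the source system, which is exactly the impact the questions above ask you to assess.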
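The data-level parallelism described in the ETL item can be sketched as follows: partition on a date key, one calendar month per worker. The partitioning helper and the placeholder transform are illustrative; a real ETL tool would supply both.

    # Sketch: data-level parallelism by partitioning on a date range.
    from concurrent.futures import ProcessPoolExecutor
    from datetime import date

    def month_partitions(start, end):
        # Yield (first_day, first_day_of_next_month) half-open ranges.
        current = start.replace(day=1)
        while current < end:
            nxt = (current.replace(year=current.year + 1, month=1)
                   if current.month == 12
                   else current.replace(month=current.month + 1))
            yield current, nxt
            current = nxt

    def process_partition(bounds):
        lo, hi = bounds
        # Placeholder: extract rows WHERE lo <= event_date < hi,
        # transform them, and load them into the warehouse.
        return f"processed [{lo}, {hi})"

    if __name__ == "__main__":
        parts = list(month_partitions(date(2016, 1, 1), date(2017, 1, 1)))
        with ProcessPoolExecutor(max_workers=4) as pool:
            for result in pool.map(process_partition, parts):
                print(result)

Half-open, non-overlapping partitions are what prevent the duplication (e.g. duplicate surrogate keys) that the ETL item warns about: every source row falls into exactly one partition.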
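For the reporting-tool item, measuring where time goes between layers can be as simple as wrapping each layer in a timer. The layer names and the sleep calls below are hypothetical stand-ins for a real request pipeline.

    # Sketch: timing each rendering layer of a report request.
    import time
    from contextlib import contextmanager

    timings = {}

    @contextmanager
    def timed(layer):
        start = time.perf_counter()
        try:
            yield
        finally:
            timings[layer] = (time.perf_counter() - start) * 1000  # ms

    with timed("data request"):
        time.sleep(0.05)   # stand-in for querying the data model
    with timed("controller"):
        time.sleep(0.01)   # stand-in for MVC/controller work
    with timed("presentation"):
        time.sleep(0.02)   # stand-in for rendering the visualization

    for layer, ms in timings.items():
        print(f"{layer:<13} {ms:6.1f} ms")
    print(f"{'total':<13} {sum(timings.values()):6.1f} ms")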
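Finally, the two TPC-H metrics are easy to relate with a little arithmetic: Price/Performance is simply total system cost divided by QphH@Size. The figures below are invented for illustration; actual audited results are published on tpc.org.

    # Sketch: TPC-H Price/Performance from hypothetical results.
    results = [
        # (system label, total system cost in USD, QphH@Size) -- invented numbers
        ("System A (MPP)", 1_200_000, 600_000),
        ("System B (SMP)",   400_000, 150_000),
    ]
    for name, cost_usd, qphh in results:
        print(f"{name}: {cost_usd / qphh:.2f} $/QphH")
    # Higher QphH@Size means more query throughput per hour; lower $/QphH
    # means better value, so the two metrics must be read together.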