WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

Analytic application. Prebuilt data access application that contains powerful analysis algorithms based on domain expertise (e.g. data mining algorithms), in addition to normal database queries.

BI applications. The value-add analytics within the DW/BI system. They include the entire range of data access methods, from ad-hoc queries to standardized reports to analytic applications.

Customer 360. An application that combines customer data from the various (external) touch points that a customer may use to contact an organization with the internal data sources that trace which products they purchase, how they receive service and support, etc., giving a complete picture of how they interact with the organization.
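
As a minimal sketch of the idea, the snippet below merges external touch-point records with internal purchase and support records into one profile per customer. All record layouts, field names and sample values are hypothetical, and the sketch assumes every source already shares a clean `customer_id`, which in practice is often the hardest part.

```python
from collections import defaultdict

# Hypothetical external touch-point records (web, call center, email).
TOUCH_POINTS = [
    {"customer_id": "C1", "channel": "web", "event": "viewed pricing page"},
    {"customer_id": "C1", "channel": "call center", "event": "asked about delivery"},
    {"customer_id": "C2", "channel": "email", "event": "requested a quote"},
]

# Hypothetical internal records (purchases and support tickets).
PURCHASES = [{"customer_id": "C1", "product": "router"}]
SUPPORT = [{"customer_id": "C2", "ticket": "login issue"}]


def customer_360(touch_points, purchases, support):
    """Combine all sources into one profile per customer_id."""
    view = defaultdict(lambda: {"interactions": [], "purchases": [], "support": []})
    for rec in touch_points:
        view[rec["customer_id"]]["interactions"].append((rec["channel"], rec["event"]))
    for rec in purchases:
        view[rec["customer_id"]]["purchases"].append(rec["product"])
    for rec in support:
        view[rec["customer_id"]]["support"].append(rec["ticket"])
    return dict(view)


profiles = customer_360(TOUCH_POINTS, PURCHASES, SUPPORT)
```

Each entry in `profiles` then holds the interactions, purchases and support history of one customer in a single place.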

Data Silos. An enterprise has data silos when data is stored redundantly by different areas of the organization, with each area mandating its own policies and processes. This leads to inconsistent data definitions, formats and data values, which makes it very hard to understand and use key business entities that are common across these silos. The first, classical version of an area generating a data silo corresponded to a local facility or a department within an enterprise. Then, ERP systems were introduced to help alleviate this problem (among several others). However, ERPs only deal with internal company data, and provide only partial management of customer data or supplier data: that is done by other packaged applications such as CRMs and SRMs. These generate today's modern version of data silos.

MPP Databases. The MPP acronym stands for “Massively Parallel Processing”. These databases are best described as providing a SQL interface and a relational database management system (RDBMS) running on a cluster of servers networked together by a high-speed interconnect, where the cluster forms a Shared-Nothing architecture: i.e., each server has its own CPU, memory and disk, which it does not share with any other server in the cluster. Through the database software and high-speed interconnects, the system functions as a whole and can scale as new servers are added to the cluster (this form of extending capacity is known as scale-out). This approach is used by MPP database systems like Teradata, Greenplum, Vertica, Netezza, ParAccel, and others. Why do MPP databases working on a shared-nothing cluster work well for data warehouses? Mainly for two reasons:

1. Relational queries are ideally suited to parallel execution; they are decomposed into uniform (relational algebra)

operations applied to uniform streams of data. By partitioning data across disk storage units attached directly to each

processor, an operator can often be split into many independent operators each working on a part of the data. This

partitioned data and execution gives partitioned parallelism. Each operator produces a new relation, so the operators

can be composed into highly parallel dataflow graphs. By streaming the output of one operator into the input of another

operator, the two operators can work in series giving pipelined parallelism.

2. Shared nothing architectures scale well up to hundreds and even thousands of processors that do not interfere with one another. As we will see below, this does not happen with single machines with parallel processors (SMPs), where there is an interference effect.

A very good introduction to the subject is the classic 1992 paper by David DeWitt and Jim Gray [24], from which we took the two paragraphs above, and which is still strikingly relevant.
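
The two forms of parallelism described in the first reason can be sketched in a few lines. This is an illustration only, not how any of the products named above is implemented: the sample rows, the function names and the use of threads to stand in for cluster nodes are all assumptions, and Python threads do not give real CPU parallelism. Rows are hash-partitioned by key, each partition runs its own scan, filter and aggregate pipeline (generators stream rows from one operator into the next), and the partial results are merged at the end.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sales rows: (region, amount).
ROWS = [("east", 10), ("west", 5), ("east", 7),
        ("north", 3), ("west", 8), ("north", 1)]


def partition(rows, n_nodes):
    """Hash-partition rows by key so each node owns a disjoint slice."""
    parts = [[] for _ in range(n_nodes)]
    for key, amount in rows:
        node = zlib.crc32(key.encode("utf-8")) % n_nodes
        parts[node].append((key, amount))
    return parts


def scan(part):
    """Operator 1: stream rows out of a node's local partition."""
    for row in part:
        yield row


def select_min_amount(rows, threshold):
    """Operator 2: filter rows as they arrive from operator 1 (pipelining)."""
    for key, amount in rows:
        if amount >= threshold:
            yield key, amount


def local_sum(part, threshold):
    """Run the scan -> filter -> aggregate pipeline on one partition."""
    totals = defaultdict(int)
    for key, amount in select_min_amount(scan(part), threshold):
        totals[key] += amount
    return dict(totals)


def parallel_sum(rows, n_nodes=3, threshold=0):
    """Partitioned parallelism: same pipeline on every partition, then merge."""
    parts = partition(rows, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(lambda p: local_sum(p, threshold), parts))
    merged = defaultdict(int)
    for partial in partials:
        for key, total in partial.items():
            merged[key] += total
    return dict(merged)
```

Calling `parallel_sum(ROWS)` gives the per-region totals regardless of how many partitions are used, which is the property that lets such a system add servers without changing the query.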

OLAP, OLAP database, or engine. OLAP stands for “Online Analytical Processing” and is a set of principles that provide a framework for answering multi-dimensional, analytical queries. An OLAP database or engine is one that organizes data natively per a dimensional model, in cubes (as opposed to relational tables), where data (measures) are categorized by dimensions. OLAP cubes are often pre-aggregated across dimensions to answer multi-dimensional, analytical queries swiftly and predictably.
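
To make pre-aggregation concrete, here is a minimal sketch that builds a tiny cube over three dimensions and one measure. The fact rows, the `ALL` marker and the `build_cube` function are hypothetical names chosen for this illustration; real OLAP engines use far more compact storage and usually aggregate only selected combinations.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # marker meaning "aggregated over this dimension"

# Hypothetical fact rows: (product, region, quarter) dimensions and one measure.
FACTS = [
    (("laptop", "east", "Q1"), 100),
    (("laptop", "west", "Q1"), 80),
    (("phone", "east", "Q1"), 50),
    (("phone", "east", "Q2"), 70),
]


def build_cube(facts):
    """Pre-aggregate the measure for every combination of kept dimensions."""
    cube = defaultdict(int)
    for dims, measure in facts:
        n = len(dims)
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                key = tuple(dims[i] if i in kept else ALL for i in range(n))
                cube[key] += measure
    return dict(cube)


cube = build_cube(FACTS)
laptop_total = cube[("laptop", ALL, ALL)]   # all regions, all quarters
east_total = cube[(ALL, "east", ALL)]       # all products, all quarters
grand_total = cube[(ALL, ALL, ALL)]
```

Once the cube is built, each of these analytical questions is a single dictionary lookup instead of a scan over the fact rows, which is why response times are swift and predictable.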

SMP Databases. Traditional databases work well on small to medium database sizes (up to a few tens of terabytes) on Symmetric Multi-Processing (SMP) machines, which are tightly coupled multiprocessor systems where processors can run in parallel, are connected using a common bus, are managed by a single operating system, and share I/O devices and memory resources. SMPs follow the scale-up approach, where additional capacity is obtained by getting a bigger machine. These days, SMPs come with 4 to 64 processors.