WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

We recommend deploying components from the reference architecture that address the following needs across the different data projects in a Big Data setup:

- Data models, structures, types
- Data lifecycle management
- Data transformations, provenance, curation, archiving
- Target use, presentation, visualization
- Cluster size or infrastructure: storage, compute, and network
- Data security at rest and in motion, trusted computational environments

4.4.3.2 Hadoop and MPP

This section addresses the relationship between Hadoop-based big data architectures and MPP databases. The two concepts are indeed related, and there is considerable confusion about how they differ.

Hadoop scales out by adding servers to a Hadoop cluster. MapReduce, the first distributed processing framework to appear in Hadoop, allows (but does not require) all of its computational tasks to run in parallel.
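The point that MapReduce tasks may run in parallel, but do not have to, can be illustrated with a minimal sketch. This is plain Python, not Hadoop's API: the function names (`map_task`, `reduce_counts`, `run_job`) and the sample partitions are hypothetical, and a local thread pool stands in for the nodes of a cluster. Because each map task depends only on its own partition, the job gives the same result whether the tasks run one after another or side by side.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_task(partition):
    """Map step: count words in one partition, independently of the others."""
    return Counter(word for line in partition for word in line.split())


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts."""
    return left + right


def run_job(partitions, workers=1):
    """Run the map tasks sequentially (workers=1) or in a thread pool."""
    if workers == 1:
        partials = [map_task(p) for p in partitions]
    else:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(map_task, partitions))
    return reduce(reduce_counts, partials, Counter())


# Hypothetical input: each inner list stands in for one data block on one node.
partitions = [
    ["hadoop scales out", "mpp scales out"],
    ["hadoop is an ecosystem"],
]

sequential = run_job(partitions, workers=1)
parallel = run_job(partitions, workers=2)
```

Both `sequential` and `parallel` hold the same word counts; only the scheduling of the map tasks differs.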

So, is Hadoop MPP? There is no simple answer. Initially the answer was no, and it is now progressing towards a “maybe”, depending on which Hadoop solution you pick. Hadoop is not a single technology; it is an ecosystem of related projects, so the question needs to be asked at each level of the stack. More recent solutions such as Impala and HAWQ are MPP execution engines that run on top of Hadoop and work with the data stored in HDFS. Spark SQL is another execution engine, one that tries to get the best of both the MapReduce and MPP-over-Hadoop approaches, though it has drawbacks of its own.

As for the question “Should I choose an MPP solution or a Hadoop-based solution?”, the current state of the technology is that these sophisticated Hadoop execution engines can return queries on huge datasets in a reasonable amount of time, but Hadoop cannot be used as a complete replacement for the traditional enterprise data warehouse. On typical data warehousing queries, the difference in response times is still about one order of magnitude, sometimes even more.

4.4.4 Self-service and agility

This section will not delve into self-service technology and products, as these were the focus of a section above.

We will only add that you should involve business users in selecting their tools, especially self-service tools.

Here we consider the impact that agile methodologies can have, from a technical architecture point of view, on the development of a warehousing/analytics solution. Agile methodologies should not be limited to BI; rather, they should be directed at all layers of the data warehouse, particularly the database and ETL design, which typically make or break a data warehouse project. To apply an agile methodology successfully on a data warehouse project, several practical measures can be implemented in concert with the methodology itself. These include: