

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
We recommend deploying the components of the reference architecture that address the following needs across the
different data projects in a Big Data setup:
- Data Models, Structures, Types
- Data Lifecycle Management
- Data transformations, Provenance, Curation, Archiving
- Target use, presentation, visualization
- Cluster size or Infrastructure: Storage, Compute, and Network
- Data security at rest and in motion, trusted computational environments
4.4.3.2 Hadoop and MPP
This section addresses the relationship between Hadoop-based big data architectures and MPP databases, as the two
concepts are indeed related and are quite often confused.
Hadoop scales out: capacity is added by adding servers to a Hadoop cluster. In addition, MapReduce, the first
distributed processing framework to appear on Hadoop, allows (but does not require) all its computational tasks to
run in parallel.
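The "allows but does not require" point can be illustrated with a minimal Python sketch. This is not Hadoop's API: the function names (map_task, shuffle, reduce_task, run_job) and the sample input are hypothetical, and a local thread pool merely stands in for the nodes of a cluster. The sketch assumes a simple word-count job and shows that, because map tasks are independent of each other, the job gives the same result whether they run one by one or concurrently.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map phase: emit (word, 1) pairs for one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce phase: sum the counts for one key."""
    return key, sum(values)


def run_job(splits, parallel=True, workers=4):
    """Run the job; map tasks are independent, so either mode is valid."""
    if parallel:
        # Hypothetical stand-in for cluster nodes: a local thread pool.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            mapped = list(pool.map(map_task, splits))
    else:
        mapped = [map_task(split) for split in splits]
    groups = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in groups.items())


# Illustrative input only; each string plays the role of one input split.
splits = ["hadoop scales out", "mpp scales out too", "hadoop and mpp"]
sequential = run_job(splits, parallel=False)
concurrent = run_job(splits, parallel=True)
```

Both calls produce the same word counts. The execution mode changes only how the map work is scheduled, which is why parallelism in MapReduce is an option left to the framework rather than a property of the programming model.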
Given scale-out clusters and tasks that can run in parallel, the question arises: is Hadoop MPP?
There is no simple answer. The initial answer was no, and it is now progressing towards a “maybe”, depending on
which Hadoop solution you pick. Hadoop is not a single technology but an ecosystem of related projects, so the
question needs to be asked at each level of the stack. Recent solutions such as Impala and HAWQ are MPP execution
engines that run on top of Hadoop and work with the data stored in HDFS. SparkSQL is another execution engine that
tries to get the best of both the MapReduce and the MPP-over-Hadoop approaches, although it has its own drawbacks.
As for the question “Should I choose an MPP solution or a Hadoop-based solution?”, the state of the technology
today is that these sophisticated Hadoop execution engines can return queries on huge datasets in a reasonable
amount of time, but Hadoop cannot be used as a complete replacement for the traditional enterprise data warehouse.
On typical data warehousing queries, we are still talking about response time differences of about one order of
magnitude, sometimes even more.
4.4.4 Self-service and agility
This section will not delve into self-service technology and products, as this was the focus of section 3.3
above. We will only add that you should make sure you involve business users in selecting their tools, especially
those that are self-service.
We consider here the impact that agile methodologies could have, from a technical architecture point of view, on the
development of a warehousing/analytics solution. Agile methodologies should not be limited to BI; rather, they should
be directed at all layers of the data warehouse, particularly the database and ETL design, which typically make or
break a data warehouse project. To successfully apply an agile methodology to a data warehouse project, several
practical measures can be implemented in concert with the methodology itself. These include: