WP_Cloud Analytics - Mapping Requirements to Technology

WHITEPAPER

www.persistent.com

In any case, data architects still need to carefully design a partitioning scheme in the cluster to optimize the most important queries. Another important aspect is right-sizing for concurrent user access. Both Google and Azure have schemes based on concurrency slots, which are units of computational capacity: the more slots you reserve, the more you pay.
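To make slot-based right-sizing concrete, here is a minimal sketch of how reserved capacity could translate into a monthly bill. The function names, the block size of 100 slots, the slots-per-query figure and the price per slot-hour are all hypothetical assumptions chosen for illustration; they are not actual Google or Azure prices or APIs.

```python
import math

HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def slots_to_reserve(concurrent_queries: int, slots_per_query: int,
                     block_size: int = 100) -> int:
    """Round the required slots up to the next purchasable block.

    The block size is a hypothetical value; providers define their own units.
    """
    if concurrent_queries < 0 or slots_per_query < 0 or block_size <= 0:
        raise ValueError("inputs must be non-negative and block_size positive")
    needed = concurrent_queries * slots_per_query
    return math.ceil(needed / block_size) * block_size


def monthly_slot_cost(reserved_slots: int, price_per_slot_hour: float) -> float:
    """Cost of keeping the reserved slots for a whole month."""
    return reserved_slots * price_per_slot_hour * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical workload: each query needs about 30 slots to run comfortably.
    for users in (5, 10, 20):
        reserved = slots_to_reserve(users, slots_per_query=30)
        cost = monthly_slot_cost(reserved, price_per_slot_hour=0.04)
        print(f"{users} concurrent queries -> {reserved} slots, {cost:.2f} per month")
```

The point of the sketch is only the shape of the relationship: the bill grows with the concurrency you provision for, not with the concurrency you actually observe, so over-estimating peak usage is paid for every hour.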

At the intersection of the cloud and big data you now have the fast-growing Hadoop-as-a-service (HaaS) market, which you should consider when volumes are petabyte size, query response time demands are in the average range (several seconds⁷), and you want to rein in costs.

The biggest drivers for HaaS are a reduced need for technical expertise and low upfront costs, the former being more of a driver than with more traditional databases, given Hadoop’s management complexity. Amazon Elastic MapReduce (EMR), the HaaS offering from AWS, is the largest player (described in 10.2.2.9). Microsoft, Google and IBM also have their own offerings, described below. Other vendors include Cloudera, EMC Pivotal, Qubole and Altiscale, now part of SAP. Most of these competitors initially provided "Run It Yourself" deployments: in effect, hosted Hadoop on top of IaaS. Some are starting to offer complete running and management of Hadoop jobs. Managed Hadoop typically provides no real multi-tenancy (no sharing of cluster nodes among tenants), but it does provide elastic, auto-scaling clusters where nodes are added or removed depending on SLAs for jobs estimated in advance by IT users.
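The SLA-driven scaling described above can be illustrated with a small sketch. It assumes a job's work is estimated in node-hours and that the job scales linearly with the number of nodes, which real Hadoop jobs rarely do; the function names and the minimum and maximum cluster sizes are hypothetical and do not correspond to any vendor's API.

```python
import math


def nodes_for_sla(estimated_node_hours: float, sla_hours: float,
                  min_nodes: int = 2, max_nodes: int = 50) -> int:
    """Number of nodes needed to finish the estimated work within the SLA.

    Assumes perfectly linear scaling (an idealization) and clamps the result
    to hypothetical cluster size limits.
    """
    if sla_hours <= 0:
        raise ValueError("sla_hours must be positive")
    if estimated_node_hours < 0:
        raise ValueError("estimated_node_hours must be non-negative")
    needed = math.ceil(estimated_node_hours / sla_hours)
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes: int, target_nodes: int) -> str:
    """Describe the change an auto-scaler would make to reach the target."""
    delta = target_nodes - current_nodes
    if delta > 0:
        return f"add {delta} nodes"
    if delta < 0:
        return f"remove {-delta} nodes"
    return "no change"


if __name__ == "__main__":
    # Hypothetical job: 120 node-hours of work, to be finished within 4 hours.
    target = nodes_for_sla(estimated_node_hours=120, sla_hours=4)
    print(target, scaling_action(current_nodes=10, target_nodes=target))
```

Because the estimate is made in advance by IT users, the quality of the scaling decision depends on the quality of that estimate; a real service would also correct for observed job progress.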

Pricing models and cost - The combined cost of data management tools supported through public IaaS services is generally higher than that of the corresponding public PaaS data management services, so being a single tenant is, well, a bit of a luxury. The running example in section 5.2.2 will give you a more concrete idea but, of course, this depends on your use case and platform provider. Pricing schemes for DWaaS vary widely from one provider to another: some separate storage costs from querying activity costs while others bundle them; some include concurrent usage concepts; some offer prepayment options that lower pay-as-you-go pricing, which makes sense in stable production deployments where you are sure to use the service for a term measured in years. Pricing for HaaS is simpler and typically includes usage-based fees for the service on top of storage costs, the latter being IaaS storage costs, which are cheaper than those that DWaaS providers charge (at least, when fees for storage are visible in the cost structure).
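The trade-off between pay-as-you-go and prepaid pricing can be sketched with simple arithmetic. The model below is an assumption made for illustration: on-demand usage is billed per hour actually used, while a prepaid commitment bills every hour of the term at a discounted rate. The rates, the discount and the function names are hypothetical and do not reflect any provider's actual price list.

```python
def on_demand_cost(hourly_rate: float, hours_used: float) -> float:
    """Pay-as-you-go: you pay only for the hours you actually use."""
    return hourly_rate * hours_used


def prepaid_cost(hourly_rate: float, discount: float, term_hours: float) -> float:
    """Prepaid: every hour of the term is billed, at a discounted rate."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return hourly_rate * (1 - discount) * term_hours


def break_even_hours(discount: float, term_hours: float) -> float:
    """Usage above this many hours makes the prepaid option cheaper."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return (1 - discount) * term_hours


def cheaper_option(hourly_rate: float, discount: float,
                   term_hours: float, hours_used: float) -> str:
    """Compare both models for a given expected usage over the term."""
    prepaid = prepaid_cost(hourly_rate, discount, term_hours)
    on_demand = on_demand_cost(hourly_rate, hours_used)
    return "prepaid" if prepaid < on_demand else "on-demand"


if __name__ == "__main__":
    # Hypothetical figures: 2.0 per hour, 25% prepayment discount, 1000-hour term.
    print(break_even_hours(0.25, 1000))
    print(cheaper_option(2.0, 0.25, 1000, hours_used=900))
    print(cheaper_option(2.0, 0.25, 1000, hours_used=500))
```

Under this model the break-even point depends only on the discount: with a 25% discount, prepayment pays off once the service is used for more than 75% of the term, which is why it suits stable, long-running production deployments.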

e. Cloud Computing ecosystem: Data management and other PaaS services - it almost goes without saying that the ecosystem of a cloud platform provider in terms of tools, services and partners will be crucial in determining the final choice. If your organization has already settled on a specific tool or service, this reduces the choices (it could be argued in this case that the presence of such a tool in the cloud platform ecosystem is a customer requirement). Beyond infrastructure and data management, today’s platforms offer a very wide spectrum of platform technology services, which include data, server and VM migration tools, security services, developer tools, IoT and machine learning services, application services (e.g., API and workflow management), mobile services, media services, cognitive services, monitoring services and more. Of this long list, the most relevant categories might be migration tools, if you have resources to migrate from on-premises to the cloud (for detailed steps in migrating data warehouses to the cloud, please refer to [9]); and security, given that security concerns are a common deterrent for organizations considering both cloud migrations and new cloud deployments.

4.3 The data model for the cloud database

This section is just a refresher on the type of data engine one would use on-premises based on the data and the query category requirements introduced in section 3 (for a longer treatment of the subject, please refer to [15]). The cloud nevertheless does open possibilities in this area: as analytics use cases are becoming more complex (given the appetite for insights on data increasingly available from so many sources), they may necessitate more than a single engine to do the job, and it is much easier and more cost-effective to architect these kinds of “polyglot persistence” solutions using services in the cloud as opposed to installing different engines on premises.

This is on Spark, the fastest Hadoop technology, which is still one or more orders of magnitude slower than MPP data warehouses.

This term refers here to databases, BI and data movement tools and services.