W H I T E P A P E R
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
In any case, data architects still need to carefully design a partitioning scheme for the cluster to optimize
the most important queries. Another important aspect is right-sizing for concurrent user access. Both
Google and Azure have schemes based on concurrency slots, which are units of computational capacity:
the more slots you reserve, the more you pay.
At the intersection of the cloud and big data you now have the fast-growing Hadoop-as-a-service (HaaS)
market, which you should consider when volumes are petabyte size, query response time demands are
in the average range (several seconds; see footnote 7), and you want to rein in costs.
The biggest drivers for HaaS are a reduced need for technical expertise and low upfront costs, the
former being more of a driver than with more traditional databases, given Hadoop’s management
complexity. Amazon Elastic MapReduce (EMR), the HaaS service by AWS, is the largest player
(described in 10.2.2.9). Microsoft, Google and IBM also have their own offerings, described
below. Other vendors include Cloudera, EMC Pivotal, Qubole and Altiscale, now part of SAP. Most of
these competitors initially provided “Run It Yourself” deployments, which were really just hosted
Hadoop on top of IaaS. Some are starting to offer complete running and management of Hadoop jobs.
Managed Hadoop typically provides no real multi-tenancy (no sharing of cluster nodes among
tenants), but provides elastic, auto-scaling clusters where nodes are added or removed depending on
SLAs for jobs estimated in advance by IT users.
d. Pricing models and cost - The combined cost of data management tools (see footnote 8) supported
through public IaaS services is generally higher than that of the corresponding public PaaS data
management services, so being a single tenant is, well, a bit of a luxury. The running example in
section 5.2.2 will give you a more concrete idea but, of course, this depends on your use case and
platform provider. Pricing schemes for DWaaS differ widely from one provider to another: some separate
storage costs from querying activity costs while others bundle them; some include concurrent usage
concepts; some offer prepayment options to lower pay-as-you-go pricing, which makes sense in stable
production deployments where you are sure to use the service for a term measured in years. Pricing for
HaaS is simpler and typically includes usage-based fees for the service on top of storage costs, the
latter being IaaS storage costs, which are cheaper than those that DWaaS providers charge (at least,
when storage fees are visible in the cost structure).
e. Cloud Computing ecosystem: Data management and other PaaS services - It almost goes without saying
that the ecosystem of a cloud platform provider in terms of tools, services and partners will be crucial
in determining the final choice. If your organization has already settled on a specific tool or service, this
reduces the choices (it could be argued in this case that the presence of such a tool in the cloud platform
ecosystem is a customer requirement). Beyond infrastructure and data management, today’s platforms
offer a very wide spectrum of platform technology services, which include data, server and VM migration
tools, security services, developer tools, IoT and machine learning services, application services (e.g.,
API and workflow management), mobile services, media services, cognitive services, monitoring services
and more. Of this long list, the most relevant categories might be migration tools, if you have resources
to migrate from on premises to the cloud (for detailed steps in migrating data warehouses to the cloud,
please refer to [9]), and security, given that it is a common deterrent for organizations considering both
cloud migrations and new cloud deployments.
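To make the pricing structures described under item d more tangible, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the class names, parameters and all rates are hypothetical placeholders, not the price list of any provider. It models a DWaaS plan whose storage fee may be separate or bundled (a storage rate of zero) and which may offer a prepayment discount, and a HaaS plan made of a usage-based service fee on top of IaaS storage costs.

```python
from dataclasses import dataclass


@dataclass
class DwaasPlan:
    """Hypothetical DWaaS price list (placeholder rates, not vendor prices)."""
    compute_per_hour: float        # fee for querying activity, per hour of use
    storage_per_tb_month: float    # 0.0 models storage bundled into compute
    prepay_discount: float = 0.0   # fraction taken off compute for a term commitment

    def monthly_cost(self, hours: float, stored_tb: float, prepaid: bool = False) -> float:
        compute = self.compute_per_hour * hours
        if prepaid:
            # Prepayment lowers the pay-as-you-go compute fee.
            compute *= 1.0 - self.prepay_discount
        return compute + self.storage_per_tb_month * stored_tb


@dataclass
class HaasPlan:
    """Hypothetical HaaS price list: service fee on top of IaaS storage."""
    service_fee_per_node_hour: float
    iaas_storage_per_tb_month: float

    def monthly_cost(self, node_hours: float, stored_tb: float) -> float:
        service = self.service_fee_per_node_hour * node_hours
        return service + self.iaas_storage_per_tb_month * stored_tb


if __name__ == "__main__":
    # All figures below are made up for illustration.
    dwaas = DwaasPlan(compute_per_hour=2.0, storage_per_tb_month=10.0, prepay_discount=0.25)
    haas = HaasPlan(service_fee_per_node_hour=0.5, iaas_storage_per_tb_month=4.0)
    print("DWaaS pay-as-you-go:", dwaas.monthly_cost(hours=100, stored_tb=5))
    print("DWaaS prepaid:      ", dwaas.monthly_cost(hours=100, stored_tb=5, prepaid=True))
    print("HaaS:               ", haas.monthly_cost(node_hours=200, stored_tb=5))
```

Plugging in the actual rates of the shortlisted providers, together with your expected usage hours and stored volume, gives a first-order comparison; it deliberately ignores concurrency-based pricing and data transfer fees.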
4.3 The data model for the cloud database
This section is just a refresher on the type of data engine one would use on premises based on the data and
query category requirements introduced in section 3 (for a longer treatment of the subject, please refer to
[15]). The cloud nevertheless does open possibilities in this area: as analytics use cases become more
complex (given the appetite for insights on data increasingly available from so many sources), they may
require more than a single engine to do the job, and it is much easier and more cost-effective to architect
these kinds of “polyglot persistence” solutions using services in the cloud than to install different engines
on premises.
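As a rough illustration of how a multi-engine solution might assign workloads, the following Python sketch applies the rule of thumb given earlier for HaaS (petabyte-size volumes and response times of several seconds) and falls back to an MPP data warehouse service otherwise. The workload fields, the engine labels and both thresholds are assumptions made for this example; they are not taken from section 3 or from any provider.

```python
from dataclasses import dataclass

# Assumed thresholds, for illustration only.
PETABYTE_TB = 1000.0        # 1 PB expressed in TB (decimal units)
HAAS_MIN_LATENCY_S = 3.0    # "several seconds" read as three seconds or more


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of one analytics workload."""
    name: str
    volume_tb: float        # data volume the queries run over
    max_latency_s: float    # slowest acceptable query response time


def choose_engine(w: Workload) -> str:
    """Pick HaaS only when the volume is petabyte size and the
    response time demand tolerates several seconds; otherwise DWaaS."""
    if w.volume_tb >= PETABYTE_TB and w.max_latency_s >= HAAS_MIN_LATENCY_S:
        return "HaaS"
    return "DWaaS (MPP)"


def plan(workloads: list[Workload]) -> dict[str, str]:
    """Map each workload name to an engine; more than one distinct
    engine in the result means a multi-engine architecture."""
    return {w.name: choose_engine(w) for w in workloads}


if __name__ == "__main__":
    mix = [
        Workload("clickstream exploration", volume_tb=2000, max_latency_s=5.0),
        Workload("executive dashboard", volume_tb=2000, max_latency_s=0.5),
        Workload("finance reporting", volume_tb=20, max_latency_s=5.0),
    ]
    for name, engine in plan(mix).items():
        print(f"{name}: {engine}")
```

A real selection would also weigh the data and query categories from section 3 and the cost considerations discussed above; the point here is only that each workload can be routed to the engine that fits it.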
7 On Spark, the fastest Hadoop technology, which is still about one or more orders of magnitude slower when compared to MPP data warehouses.
8 This term refers here to databases, BI and data movement tools and services.