WHITE PAPER
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
In a nutshell, here are the broad guidelines for choosing a database technology given your needs in the data requirements category and the query requirements category.
• Use traditional SMP relational databases when data is structured; volume, velocity and variety are mild to medium; queries are known and concern mainly operational data with low requirements on aggregations; and you want minimal lag in the availability of the most recent data. If you are concerned that the analytics load will disturb users of your operational application, consider replicating operational data to a separate instance; most replication solutions now work in real time.
• Use MPP relational data warehouses (or in-memory, columnar databases on scale-up servers) when
data is structured, volumes are large but up to several hundred terabytes, velocity and variety are
mild to medium, queries are known in advance and need complex aggregations and calculations, and
workloads serve a low to medium user scale (under a thousand). When queries are multidimensional, require additive, semi-additive or non-additive aggregations (e.g., sums, averages or ratios) over flexible hierarchies, and performance must be predictable, consider using an OLAP engine.
• Use Hadoop (which includes both Hive and its recent variants, and Spark) when your data is very
large, has a lot of variety or comes in different types (e.g., semi-structured web logs or text files, but
also structured data from many sources), arrives at high velocity, and/or when your queries are not (entirely) known in advance. Perhaps the main characteristic of Hadoop engines is that the data precedes its schema, so they can easily accommodate big volumes and big variety. On the other hand, query complexity, performance and high concurrency are not their forte: in a Hadoop-based data lake, you must start by discovering the structure of the data, layering a schema on files and eventually
transforming data before querying. If query complexity and/or performance is a concern, this architecture
still allows optimizations, such as overlaying an OLAP model on Hadoop data with OLAP-as-a-service offerings [10], [11], or exporting key/value data from Hadoop for fast querying with tools such as Cassandra or ElephantDB; the choice depends on the type of queries you might have – of course,
this adds more cost to the overall solution. Hadoop is also a good platform for search engines indexing
all types of data. As for big velocity, you can now use Spark, a more recent engine in the Hadoop ecosystem, which provides streaming capabilities for fast incoming data⁹.
• Use NoSQL systems when your data volumes are large, data velocity matters and may be large, data
structure is not as relevant (it may be structured and even nested –e.g., JSON, XML, but you want to
store it as is), and you need scalability and performance for online writes and simple reads.
NoSQL systems are not meant for complex querying or OLAP style aggregation but rather for operational
systems with simple analytics requirements, or for search-intensive applications. NoSQL databases
trade off stringent consistency requirements (in the sense of the consistency model requirement factor) for availability¹⁰, and are modeled around query patterns for speed and agility (very simple queries or APIs
and updates involving one or a few records accessed by their key).
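The additive vs. non-additive distinction raised for OLAP engines above can be sketched in a few lines of Python (the data and names are illustrative, not from any specific OLAP product): an additive measure like revenue sums correctly at any hierarchy level, while a non-additive measure like a margin ratio must be recomputed from its additive components rather than averaged across rows.

```python
# Toy sales fact rows: (region, revenue, cost).
rows = [
    ("east", 100.0, 80.0),
    ("east", 200.0, 120.0),
    ("west", 300.0, 150.0),
]

# Additive measure: revenue rolls up correctly by simple summation.
total_revenue = sum(rev for _, rev, _ in rows)            # 600.0
total_cost = sum(cost for _, _, cost in rows)             # 350.0

# Non-additive measure: margin ratio. Averaging per-row ratios gives
# the wrong answer; an OLAP engine recomputes the ratio at each level
# from its additive components (revenue and cost).
per_row_ratios = [(rev - cost) / rev for _, rev, cost in rows]
naive_margin = sum(per_row_ratios) / len(per_row_ratios)  # incorrect roll-up
correct_margin = (total_revenue - total_cost) / total_revenue

print(round(naive_margin, 4), round(correct_margin, 4))   # 0.3667 0.4167
```

This is why OLAP engines store aggregation rules per measure instead of blindly summing every column.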
To address consistency under write failures in a more comprehensive way, a data processing architecture, the Lambda architecture [8], has been introduced and is now popular. It combines batch (e.g., Spark) and stream data processing methods (e.g., Storm plus a NoSQL database). It applies to systems that try to give large, geographically dispersed user populations the ability to query, with acceptable latency, continuously updated, very large data volumes (hence, distributed and geo-redundant data is implied), where updates must be visible immediately. The Lambda architecture chooses availability over consistency, arguing that sacrificing availability is bad for business and that at least eventual consistency is needed¹¹.
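The query-time behavior described above can be sketched as a minimal Python example (the view names and contents are illustrative, not from any specific framework): the batch layer periodically recomputes views over the immutable master dataset, the speed layer covers only the recent events the batch layer has not yet absorbed, and a query merges both so that updates are visible immediately.

```python
# Batch view: recomputed every few hours from the full, immutable
# master dataset (e.g., page-view counts per page).
batch_view = {"page_a": 1000, "page_b": 500}

# Speed view: incremental counts for events that arrived since the
# last batch run; any eventual-consistency complexity lives here and
# is bounded, since the next batch run overrides this layer.
speed_view = {"page_a": 12, "page_c": 3}

def query(page: str) -> int:
    """Merge both layers so recent writes are visible immediately."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # 1012: batch result plus recent events
print(query("page_c"))  # 3: a page seen only since the last batch run
```

The design choice to recompute the batch view from scratch is what lets the stream layer stay simple: any error or inconsistency it accumulates survives only until the next batch cycle.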
⁹ As Spark was a latecomer, other popular solutions were crafted before it, such as Kafka, a distributed message broker, and Storm, a scalable, fault-tolerant, real-time analytics system.
¹⁰ There are typically no transaction semantics beyond single-record writes; this is also related to scalability, as discussed in Appendix 1.
¹¹ The CAP theorem [14] states that a database cannot guarantee consistency, availability, and partition tolerance at the same time. But in fact, the choice is really between consistency and availability only when a partition happens; at all other times, no trade-off must be made. Database systems designed with traditional ACID guarantees choose consistency over availability (their designers had LAN scenarios in mind), whereas NoSQL databases, more recent and scale-conscious, choose availability over consistency. The main contribution of the Lambda architecture is to isolate the complexity of maintaining eventual consistency in the stream processing part, which contains only a few hours of activity, as the batch part constantly overrides the stream layer and manages a fault-tolerant, append-only, immutable state.