WHITE PAPER
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
In a nutshell, here are the broad guidelines for choosing a database technology given your needs in the data requirements category and the query requirements category.
• Use traditional SMP relational databases when data is structured; volume, velocity and variety are mild to medium; queries are known and concern mainly operational data with low requirements on aggregations; and you want minimal lag in the availability of the most recent data. If you are concerned that the analytics load will disturb users of your operational application, consider replicating operational data to a separate instance; most replication solutions now work in real time.
• Use MPP relational data warehouses (or in-memory, columnar databases on scale-up servers) when
data is structured, volumes are large but up to several hundred terabytes, velocity and variety are
mild to medium, queries are known in advance and need complex aggregations and calculations, and
workloads serve a low to medium user scale (under a thousand). When queries are multidimensional, require additive, semi-additive or non-additive aggregations (e.g., sums, averages or ratios) over flexible hierarchies, and performance must be predictable, consider using an OLAP engine.
• Use Hadoop (which includes both Hive and its recent variants, and Spark) when your data is very
large, has a lot of variety or comes in different types (e.g., semi-structured web logs or text files, but
also structured data from many sources), arrives at high velocity, and/or when your queries are not (entirely) known in advance. Perhaps the main characteristic of Hadoop engines is that the data precedes its schema, so they can easily accommodate big volumes and big variety. On the other hand, query complexity, performance and high concurrency are not their forte: in a Hadoop-based data lake, you must start by discovering the structure of the data, layering a schema on files and eventually
transforming data before querying. If query complexity and/or performance is a concern, this architecture
still allows optimizations, such as overlaying an OLAP model on Hadoop data with OLAP-as-a-service offerings [10], [11], or exporting key/value data from Hadoop for fast querying with tools such as Cassandra or ElephantDB; the choice depends on the type of queries you might have – of course,
this adds more cost to the overall solution. Hadoop is also a good platform for search engines indexing
all types of data. As for big velocity, you can now use Spark, a more recent engine in the Hadoop ecosystem, which provides streaming capabilities for fast incoming data⁹.
• Use NoSQL systems when your data volumes are large, data velocity matters and may be large, data
structure is not as relevant (it may be structured and even nested –e.g., JSON, XML, but you want to
store it as is), and you need scalability and performance for online writes and simple reads.
NoSQL systems are not meant for complex querying or OLAP style aggregation but rather for operational
systems with simple analytics requirements, or for search-intensive applications. NoSQL databases
trade off stringent consistency requirements (in the sense of the consistency model requirement factor) for availability¹⁰, and are modeled around query patterns for speed and agility (very simple queries or APIs
and updates involving one or a few records accessed by their key).
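The additive vs. non-additive distinction raised for OLAP engines above can be sketched in a few lines of Python (the data and names are illustrative, not from any specific OLAP product): an additive measure like revenue sums correctly at any hierarchy level, while a non-additive measure like a margin ratio must be recomputed from its additive components rather than averaged across rows.

```python
# Toy sales fact rows: (region, revenue, cost).
rows = [
    ("east", 100.0, 80.0),
    ("east", 200.0, 120.0),
    ("west", 300.0, 150.0),
]

# Additive measure: revenue rolls up correctly by simple summation.
total_revenue = sum(rev for _, rev, _ in rows)            # 600.0
total_cost = sum(cost for _, _, cost in rows)             # 350.0

# Non-additive measure: margin ratio. Averaging per-row ratios gives
# the wrong answer; an OLAP engine recomputes the ratio at each level
# from its additive components (revenue and cost).
per_row_ratios = [(rev - cost) / rev for _, rev, cost in rows]
naive_margin = sum(per_row_ratios) / len(per_row_ratios)  # incorrect roll-up
correct_margin = (total_revenue - total_cost) / total_revenue

print(round(naive_margin, 4), round(correct_margin, 4))   # 0.3667 0.4167
```

This is why OLAP engines store aggregation rules per measure instead of blindly summing every column.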
To address consistency under write failures in a more comprehensive way, a data processing architecture, the Lambda architecture [8], has been introduced and is now popular. It combines batch (e.g., Spark) and stream data processing methods (e.g., Storm plus a NoSQL database). It applies to systems that try to give large, geographically dispersed user populations the ability to query, with acceptable latency, continuously updated, very large data volumes (hence, distributed and geo-redundant data is implied), where updates must be visible immediately. The Lambda architecture chooses availability over consistency, arguing that sacrificing availability is bad for business and that at least eventual consistency is needed¹¹.
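The query-time behavior described above can be sketched as a minimal Python example (the view names and contents are illustrative, not from any specific framework): the batch layer periodically recomputes views over the immutable master dataset, the speed layer covers only the recent events the batch layer has not yet absorbed, and a query merges both so that updates are visible immediately.

```python
# Batch view: recomputed every few hours from the full, immutable
# master dataset (e.g., page-view counts per page).
batch_view = {"page_a": 1000, "page_b": 500}

# Speed view: incremental counts for events that arrived since the
# last batch run; any eventual-consistency complexity lives here and
# is bounded, since the next batch run overrides this layer.
speed_view = {"page_a": 12, "page_c": 3}

def query(page: str) -> int:
    """Merge both layers so recent writes are visible immediately."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # 1012: batch result plus recent events
print(query("page_c"))  # 3: a page seen only since the last batch run
```

The design choice to recompute the batch view from scratch is what lets the stream layer stay simple: any error or inconsistency it accumulates survives only until the next batch cycle.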
⁹ As Spark was a latecomer, other popular solutions were crafted before it, such as Kafka, a distributed message broker, and Storm, a scalable, fault-tolerant, real-time analytics system.
¹⁰ There are typically no transaction semantics beyond single-record writes; this is also related to scalability, as discussed in Appendix 1.
¹¹ The CAP theorem [14] states that a database cannot guarantee consistency, availability, and partition tolerance at the same time. But in fact, the choice is really between consistency and availability only when a partition happens; at all other times, no trade-off must be made. Database systems designed with traditional ACID guarantees choose consistency over availability (their designers had LAN scenarios in mind), whereas NoSQL databases, more recent and scale-conscious, choose availability over consistency. The main contribution of the Lambda architecture is to isolate the complexity of maintaining eventual consistency in the stream processing part, which contains only a few hours of activity, as the batch part constantly overrides the stream layer and manages a fault-tolerant, append-only, immutable state.