W H I T E P A P E R
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
As you can see, no single technology can satisfy all requirements. Even two or three requirements, when taken
to their extreme, are difficult to satisfy with a single existing technology and may call for a combination of
different database technologies (the Lambda architecture is in fact an example). Let us illustrate this with a
couple of examples.
a. At Facebook, the analytics group had to provide OLAP-style queries with very low latencies at very
high velocities [12]. They experimented with several technologies, none of which worked, and in the end
had to build a data and query execution engine that worked for them.¹²
b. Even without extreme query performance requirements, the variety of analytics tasks may bend an
architecture towards polyglot persistence, as in the case of Flipkart, an eCommerce company [13]: large
incoming data volumes, data processing at different velocities (both real time and batch), and an analytics
layer requiring ad-hoc analysis, search, machine learning and canned reporting. Their data layer includes
Hadoop (Hive, Spark), Storm, Vertica (an MPP warehouse) and ElasticSearch (see 10 .4.2.1 1 ). If the high
velocity requirement were dropped, Hadoop, Vertica and ElasticSearch would probably still be in the
picture, given the analytics requirements.
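The reasoning in example b can be sketched in a few lines of Python. This is only an illustration: the pairing of each requirement with an engine is our assumption for the sake of the example, not Flipkart's documented design, and the names `ENGINE_FOR_REQUIREMENT` and `engines_needed` are hypothetical.

```python
# Hypothetical mapping from an analytics requirement to the engine serving it.
# The pairing is an illustrative assumption, not a documented design.
ENGINE_FOR_REQUIREMENT = {
    "batch_processing": "Hadoop (Hive, Spark)",
    "machine_learning": "Hadoop (Hive, Spark)",
    "real_time_processing": "Storm",
    "ad_hoc_analysis": "Vertica",
    "canned_reporting": "Vertica",
    "search": "ElasticSearch",
}


def engines_needed(requirements):
    """Return the sorted list of engines needed to cover the requirements."""
    unknown = set(requirements) - set(ENGINE_FOR_REQUIREMENT)
    if unknown:
        raise ValueError(f"no engine mapped for: {sorted(unknown)}")
    return sorted({ENGINE_FOR_REQUIREMENT[r] for r in requirements})


all_requirements = set(ENGINE_FOR_REQUIREMENT)
full_stack = engines_needed(all_requirements)
# Drop the high velocity requirement and see which engines remain.
without_velocity = engines_needed(all_requirements - {"real_time_processing"})

print(full_stack)
print(without_velocity)
```

Under these assumed pairings, removing the real-time requirement removes only Storm; the remaining analytics requirements still call for three different engines.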
4.4 The BI / analytics tools
This is a very important aspect of decision making, being the one that most directly impacts business end users.
The modern cloud database needs to support the breadth of tools that organizations can use to get actionable
results from the data. BI is a good fit for the cloud when the visualization tools are close to where the data is,
which is now the case with cloud analytics. The choice of BI/analytics tool depends on several dimensions.
a. Query types – Traditional BI tools were built for the reporting analytic workload; ad-hoc querying and
OLAP came later and have more “free-form” user experiences and interfaces, sometimes imposing
limitations on the types of queries that may be defined (see section 5.2.4).
b. Performance and scalability – This is an area normally associated with the database / data warehouse
layer, but the analytics layer also contributes to the overall time spent (again, refer to section 5.2.4 below
for an example).
c. Analytic workload – If the requirement is reporting or dashboarding, most cloud platforms provide
solutions, e.g. from SAP, IBM and Oracle, as SaaS services from their own clouds or from Azure and AWS.
For exploration and discovery use cases, look for tools like Tableau, QlikView, etc.; these are mainly
desktop solutions, but they can work with cloud sources and can publish reports and dashboards to the
cloud. For machine learning use cases, cloud service providers offer them as a service, e.g. Amazon ML,
Azure ML, Watson Analytics. Google Analytics offers a complete BI stack in the cloud: not only visual data
discovery, exploration, collaboration and reporting, but also analytic applications for marketing, sales,
service and social platforms. Finally, if the requirement is to build a full-blown solution in a given vertical
industry, then we are talking about embedding analytics capabilities in an application that is to be built
and deployed using PaaS development services and tools.
d. Data integration / data quality – Also referred to as data preparation. It has recently been recognized
that a modern BI toolset should include features to integrate data coming from different sources and to
address the issues that impact data quality: heterogeneous data representations, conventions and
standards, missing values, and duplicated records. The most common way of addressing this is to loosely
couple self-service data preparation tools with BI tools, as will be explained in the next section.
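To make the data preparation issues in item d concrete, here is a minimal Python sketch using only the standard library. Everything in it is hypothetical: the two sources, their field names, the `COUNTRY_CODES` lookup and the two date formats are assumptions chosen to show heterogeneous conventions, missing values and duplicated records in one small example. It is not how any particular data preparation tool works.

```python
from datetime import datetime

# Hypothetical customer records from two sources with different conventions.
crm_rows = [
    {"email": "Ana@Example.com ", "country": "US", "signup": "2017-03-01"},
    {"email": "bob@example.com", "country": "IN", "signup": ""},
]
web_rows = [
    {"email": "ana@example.com", "country": "United States", "signup": "03/01/2017"},
    {"email": "eve@example.com", "country": None, "signup": "04/15/2017"},
]

COUNTRY_CODES = {"united states": "US", "india": "IN"}  # assumed lookup table
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")  # assumed: one format per source


def parse_date(text):
    """Return an ISO date string, or None if missing or unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text or "", fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize(row):
    """Bring one record to a single convention; missing values become None."""
    country = (row.get("country") or "").strip()
    return {
        "email": row["email"].strip().lower(),
        "country": COUNTRY_CODES.get(country.lower(), country.upper() or None),
        "signup": parse_date(row.get("signup")),
    }


def merge_sources(*sources):
    """Keep one record per email, filling missing fields from duplicates."""
    merged = {}
    for rows in sources:
        for row in map(normalize, rows):
            kept = merged.setdefault(row["email"], row)
            for key, value in row.items():
                if kept[key] is None:
                    kept[key] = value
    return list(merged.values())


clean = merge_sources(crm_rows, web_rows)
for record in clean:
    print(record)
```

The two records for the same customer collapse into one once email case, country naming and date format are normalized, while values that no source can supply stay explicitly missing rather than being guessed.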
¹² At the root of the problem, OLAP engines operate on mostly static datasets