

WHITEPAPER
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
—
Apache Kylin [28] is an open source analytics engine designed to provide a SQL interface and OLAP-style
multi-dimensional analysis on large Hadoop datasets. It was originally developed by eBay and is now an
Apache project. Kylin builds MOLAP cubes starting from a tabular star schema warehouse model in Hadoop
Hive (version 1.5 allows Spark and Kafka sources) using a MapReduce engine that produces a cube stored in
HBase. You can then use standard SQL to query the cube as if you were querying the star schema. From the
outside, clients can connect through ODBC/JDBC. Kylin integrates with Tableau, which can generate SQL
queries automatically from its query generator, as well as with Zeppelin, Excel and Microsoft Power BI.
—
Druid [29] is another open source option designed for OLAP queries on fast-arriving event data; it leverages its
own (non-Hadoop) technologies based on a column-oriented storage layout, a distributed, shared-nothing
architecture, and an advanced indexing structure. Druid is suitable for real-time analysis and has good
integration with Kafka (the real-time capability of Kylin is still under development). However, it has its own query
API and language, is limited with respect to joins, does not support SQL and does not support integration with
BI tools.
—
Apache Lens [30] is another open source solution for Hadoop. It is a more traditional ROLAP solution based
on Hive. A cube model can be built on a star or snowflake Hive model with optional aggregated tables, and can
be queried with OLAP cube QL (a logical subset of HiveQL). The system implements aggregate navigation
(see section 4.4.1.9): the input query is rewritten taking into account aggregated tables.
—
Atscale [31] is a commercial solution that turns a Hadoop cluster into a scale-out OLAP server. Like Apache Lens,
it overlays an OLAP model with dimensional hierarchies on top of SQL-on-Hadoop engines, and it builds
and maintains a mix of virtual BI cube definitions and aggregate tables. It supports any BI tool that can talk SQL
or MDX, and works out-of-the-box with the leading SQL-on-Hadoop engines, such as Impala, SparkSQL, or
Hive. It can also run on top of Altiscale data cloud, a Hadoop as a service solution (see section 3.2.3), turning it
into an OLAP as a service solution.
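The aggregate navigation mentioned for Apache Lens (and the aggregate tables maintained by Atscale) can be illustrated with a minimal Python sketch. It is not how either product is implemented: the fact rows, the table names (`fact`, `agg_region_product`, `agg_region`) and the helper functions are hypothetical, and the selection rule (pick the smallest table whose dimensions cover the query) is one simple assumption among several possible strategies.

```python
from collections import defaultdict

# Hypothetical fact rows from a star schema: (region, product, day, amount).
FACT_DIMS = ("region", "product", "day")
FACT_ROWS = [
    ("EU", "book", "2017-01-01", 10.0),
    ("EU", "pen", "2017-01-01", 2.0),
    ("US", "book", "2017-01-02", 7.0),
    ("US", "book", "2017-01-01", 5.0),
]


def build_aggregate(rows, source_dims, keep_dims):
    """Sum the last column (the measure) grouped by the dimensions in keep_dims."""
    idx = [source_dims.index(d) for d in keep_dims]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return [key + (total,) for key, total in totals.items()]


# Catalog of tables the navigator may choose from: name -> (dimensions, rows).
TABLES = {
    "fact": (FACT_DIMS, FACT_ROWS),
    "agg_region_product": (
        ("region", "product"),
        build_aggregate(FACT_ROWS, FACT_DIMS, ("region", "product")),
    ),
    "agg_region": (
        ("region",),
        build_aggregate(FACT_ROWS, FACT_DIMS, ("region",)),
    ),
}


def navigate(query_dims):
    """Return the name of the smallest table whose dimensions cover the query."""
    candidates = [
        (len(rows), name)
        for name, (dims, rows) in TABLES.items()
        if set(query_dims) <= set(dims)
    ]
    return min(candidates)[1]


def total_by(query_dims):
    """Answer 'total amount by query_dims' from the table chosen by navigate()."""
    name = navigate(query_dims)
    dims, rows = TABLES[name]
    return name, sorted(build_aggregate(rows, dims, tuple(query_dims)))


if __name__ == "__main__":
    for dims in (("region",), ("product",), ("day",)):
        print(dims, total_by(dims))
```

In this sketch a query grouped by region is answered from `agg_region`, a query grouped by product is rewritten against `agg_region_product`, and a query grouped by day falls back to the fact table because no aggregate retains that dimension. The rewrite is only valid here because the measure is additive (a sum).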
Query performance, at the cost of (bounded) accuracy.
The Kimball data model has emphasized understandability and performance. However, to cope with the growth
in data, there is a growing need for returning data in finite and predictable time. We are running into barriers
imposed by physics (storage blocks accessed by the query, memory accessed, network round trips made), see [23].
Sampling or micro-querying is now being explored as a solution to returning results in a finite bounded time.
The cost to pay is accuracy. However, advances in research and newer database products (like BlinkDB [27])
point to a minimal loss of accuracy: bounded errors and bounded latency. In these systems, a sample model is
also retained as a representative of the full data. In essence, we now deal with 3 data models (the logical, the
physical, and the sampled).
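The trade-off between accuracy and response time can be sketched in a few lines of Python. This is only an illustration of the general idea, under stated assumptions: it uses plain uniform random sampling with a normal-approximation confidence interval, which is not BlinkDB's actual mechanism, and the function name `approximate_mean` and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of values from a uniform random sample.

    Returns (estimate, half_width). With z=1.96 the true mean is expected
    to lie within estimate +/- half_width about 95% of the time
    (normal approximation, assumed to hold for the sample sizes used).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, half_width


# Hypothetical "full data": synthetic measure values standing in for a fact table.
data_rng = random.Random(42)
full_data = [data_rng.uniform(0, 100) for _ in range(200_000)]

if __name__ == "__main__":
    exact = statistics.fmean(full_data)
    for n in (100, 10_000):
        estimate, half_width = approximate_mean(full_data, n)
        print(f"n={n}: {estimate:.2f} +/- {half_width:.2f} (exact {exact:.2f})")
```

The work done by the query depends on the sample size rather than on the size of the full data, which is what bounds latency, while the half-width of the interval is the error bound reported to the user. Enlarging the sample narrows the bound at the cost of more work.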
5.4.4 NoSQL
With the rise of NoSQL technologies, the data warehousing space has seen many changes. Here we track the
dimensional-modeling practices that we see and use in the field, trying to map dimensional models to
NoSQL models.