WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

—

Apache Kylin [28] is an open source analytics engine designed to provide an SQL interface and OLAP-style multi-dimensional analysis on large Hadoop datasets. It was originally developed by eBay and is now an Apache project. Kylin builds MOLAP cubes starting from a tabular star schema warehouse model in Hadoop Hive (version 1.5 allows Spark and Kafka sources), using a MapReduce engine that produces a cube stored in HBase. You can then use standard SQL to query the cube as if you were querying the star schema. From the outside, clients can connect through ODBC/JDBC; there is an integration with Tableau, whose query generator can produce the SQL queries automatically, as well as with Zeppelin, Excel and Microsoft Power BI.

—

Druid [29] is another open source option, designed for OLAP queries on fast-arriving event data. It relies on its own (non-Hadoop) technologies, based on a column-oriented storage layout, a distributed, shared-nothing architecture, and an advanced indexing structure. Druid is suitable for real-time analysis and has good integration with Kafka (the real-time capability of Kylin is still under development). However, it has its own query API and language, is limited with respect to joins, does not support SQL and does not support integration with BI tools.

—

Apache Lens [30] is another open source solution for Hadoop. It is a more traditional ROLAP solution based on Hive. A cube model can be built on a star or snowflake Hive model with optional aggregated tables, and can be queried with OLAP cube QL (a logical subset of HiveQL). The system implements aggregate navigation (see section 4.4.1.9): the input query is rewritten to take the aggregated tables into account.

—

AtScale [31] is a commercial solution that turns a Hadoop cluster into a scale-out OLAP server. Like Apache Lens, it lets you overlay an OLAP model with dimensional hierarchies on top of SQL-on-Hadoop engines, and it builds and maintains a mix of virtual BI cube definitions and aggregate tables. It supports any BI tool that can talk SQL or MDX, and works out of the box with the leading SQL-on-Hadoop engines, such as Impala, SparkSQL, or Hive. It can also run on top of the Altiscale data cloud, a Hadoop-as-a-service solution (see section 3.2.3), turning it into an OLAP-as-a-service solution.
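The aggregate navigation mentioned for Apache Lens (and the aggregate tables maintained by AtScale) can be illustrated with a minimal sketch. This is not the API or algorithm of either product: the catalog, the table names, the row counts and the functions `choose_table` and `rewrite` are all hypothetical. The sketch assumes an additive measure (so that `SUM` over pre-summed rows is correct) and assumes each aggregate table stores the measure under the same column name as the fact table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    name: str
    columns: frozenset  # dimension columns available in this table
    row_count: int      # rough size, used here as the cost estimate


# Hypothetical catalog: one fact table plus two pre-aggregated tables.
CATALOG = [
    Table("sales_fact", frozenset({"day", "month", "store", "product"}), 1_000_000_000),
    Table("agg_month_store", frozenset({"month", "store"}), 60_000),
    Table("agg_month", frozenset({"month"}), 120),
]


def choose_table(dimensions, catalog=CATALOG):
    """Pick the smallest table that contains every requested dimension."""
    needed = frozenset(dimensions)
    candidates = [t for t in catalog if needed <= t.columns]
    if not candidates:
        raise ValueError(f"no table covers dimensions {sorted(needed)}")
    return min(candidates, key=lambda t: t.row_count)


def rewrite(dimensions, measure, catalog=CATALOG):
    """Rewrite a logical GROUP BY query against the cheapest covering table."""
    if not dimensions:
        raise ValueError("at least one dimension is required in this sketch")
    table = choose_table(dimensions, catalog)
    cols = ", ".join(dimensions)
    # Assumption: the measure is additive, so summing pre-aggregated rows is valid.
    return (f"SELECT {cols}, SUM({measure}) AS {measure} "
            f"FROM {table.name} GROUP BY {cols}")


if __name__ == "__main__":
    print(rewrite(["month", "store"], "amount"))
    print(rewrite(["product"], "amount"))
```

A query grouped by month and store is routed to the small aggregate table, while a query on product falls back to the fact table because no aggregate contains that dimension. Real systems also weigh factors such as data freshness and non-additive measures, which this sketch ignores.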

Query performance, at the cost of (bounded) accuracy. The Kimball data model has emphasized understandability and performance. However, to cope with the growth in data, there is a growing need for returning data in finite and predictable time. We are running into barriers imposed by physics (storage blocks accessed by the query, memory accessed, network round trips made); see [23]. Sampling or micro-querying is now being explored as a way to return results in a finite, bounded time. The cost to pay is accuracy. However, advances in research and newer database products (like BlinkDB [27]) point to a minimal loss of accuracy: bounded errors and bounded latency. In these systems, a sample model is also retained as a representative of the full data. In essence, we now deal with three data models (the logical, the physical, and the sampled).
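The trade-off between accuracy and latency can be sketched with a simple sampled aggregate. The example below is not BlinkDB's method: it uses a plain uniform random sample and a normal-approximation confidence interval. The function name `approximate_mean`, its parameters and the synthetic data are assumptions made for illustration.

```python
import math
import random


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin): the true mean is expected to lie within
    estimate +/- margin at roughly 95% confidence for z = 1.96.
    Assumes 2 <= sample_size <= len(values).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    # Unbiased sample variance (divides by n - 1).
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    margin = z * math.sqrt(variance / sample_size)
    return mean, margin


if __name__ == "__main__":
    # Synthetic stand-in for a large fact-table column.
    population = list(range(1, 100_001))
    estimate, margin = approximate_mean(population, sample_size=1_000)
    exact = sum(population) / len(population)
    print(f"estimate = {estimate:.1f} +/- {margin:.1f}, exact = {exact:.1f}")
```

The work done depends on the sample size rather than on the size of the full data, which is what makes the latency predictable. The margin shrinks with the square root of the sample size, so a tighter error bound is paid for with a larger sample.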

5.4.4 NoSQL

With the rise of NoSQL technologies, the data warehousing space has seen many changes. Here we track the dimension modeling practices that we see and use in the field when mapping dimension models to NoSQL models.