
5.2.3 Data model of the cloud database

As the queries are very well known and the performance requirements are rather stringent, the data model is a traditional dimensional data warehouse model.

PSL had initially decided to build an Operational Data Store (ODS) type of warehouse, using a model very similar to the normalized OLTP model, because (i) the large proportion of history-tracking entities limited how far the data could be de-normalized, and (ii) every field describing each entity was potentially needed for slicing and dicing in ad hoc reports. The initial response times for transactional reports were substantially above the required limits, due to the large number of joins between dimension tables in the normalized schema. Analytical queries also missed their targets.
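
To illustrate the join problem, a single transactional report against the normalized ODS-style model has to walk a chain of joins similar to the following sketch (the table and column names are hypothetical, not the actual schema):

    -- Illustrative only: hypothetical normalized ODS-style tables.
    CREATE TABLE region     (region_id   INT PRIMARY KEY, region_name  VARCHAR(100));
    CREATE TABLE org_unit   (org_unit_id INT PRIMARY KEY, region_id INT, org_name VARCHAR(100));
    CREATE TABLE learner    (learner_id  INT PRIMARY KEY, org_unit_id INT, learner_name VARCHAR(100));
    CREATE TABLE program    (program_id  INT PRIMARY KEY, program_name VARCHAR(100));
    CREATE TABLE status     (status_id   INT PRIMARY KEY, status_name  VARCHAR(50));
    CREATE TABLE enrollment (tenant_id INT, learner_id INT, program_id INT,
                             status_id INT, enrollment_date DATE);

    -- Even a simple transactional report needs a long join chain in this model.
    SELECT l.learner_name, p.program_name, s.status_name, r.region_name, e.enrollment_date
    FROM   enrollment e
    JOIN   learner  l ON l.learner_id  = e.learner_id
    JOIN   program  p ON p.program_id  = e.program_id
    JOIN   status   s ON s.status_id   = e.status_id
    JOIN   org_unit o ON o.org_unit_id = l.org_unit_id
    JOIN   region   r ON r.region_id   = o.region_id
    WHERE  e.tenant_id = 42;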

After relaxing the requirements on history-tracking entities and ad hoc reporting, PSL then proposed to follow a de-normalized dimensional model serving both transactional reports and analytic reports as of a particular effective date (for current-date and time-travel queries). The model follows a shared-schema approach in which tables have a tenant-id column. In addition, PSL proposed building aggregate tables and using the aggregate-awareness feature of reporting tools such as SAP BusinessObjects or IBM Cognos. Further database tuning and optimization, around indexing, statistics management and partitioning on tenant-id, was suggested as well.
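
A minimal sketch of this shared-schema dimensional design is shown below; all table and column names (fact_enrollment, dim_date, dim_course, agg_enrollment_monthly) are illustrative assumptions rather than the actual schema:

    -- Conformed dimensions shared by the fact and aggregate tables.
    CREATE TABLE dim_date   (date_key   INT PRIMARY KEY, calendar_date DATE, calendar_month CHAR(7));
    CREATE TABLE dim_course (course_key INT PRIMARY KEY, course_name VARCHAR(200));

    -- Shared-schema fact table: every row carries the tenant_id, and the table
    -- is partitioned on tenant_id so that each tenant's scans stay small.
    CREATE TABLE fact_enrollment (
        tenant_id   INT NOT NULL,
        date_key    INT NOT NULL,
        learner_key INT NOT NULL,
        course_key  INT NOT NULL,
        enrollments INT NOT NULL
    );

    -- Pre-computed roll-up; an aggregate-aware reporting tool can route
    -- matching queries here instead of to the base fact table.
    CREATE TABLE agg_enrollment_monthly (
        tenant_id      INT NOT NULL,
        calendar_month CHAR(7) NOT NULL,
        course_key     INT NOT NULL,
        enrollments    INT NOT NULL
    );

    -- A typical analytic query; the tenant_id predicate provides the shared-schema isolation.
    SELECT d.calendar_month, c.course_name, SUM(f.enrollments) AS enrollments
    FROM   fact_enrollment f
    JOIN   dim_date   d ON d.date_key   = f.date_key
    JOIN   dim_course c ON c.course_key = f.course_key
    WHERE  f.tenant_id = 42
    GROUP BY d.calendar_month, c.course_name;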

These decisions were accepted by the customer. The performance numbers at the database layer after this change were brought in line with the requirements.

Given the customer's plans to handle unstructured and semi-structured data in the future, PSL also investigated big data stores such as Hadoop with MapReduce and Spark, as well as NoSQL databases. However, moving to a big data stack was considered overkill at this point, so building a dimensional data warehouse appears to be the most appropriate solution.

5.2.4 BI / analytics tools

Jasper Reports is the tool selected by the customer and already in production in the current version of the product; as such, it needs to be managed directly by the customer (no BI-as-a-service option exists for this specific product). Extensive tuning and experimentation was performed to make reports return data within the required times; at some point more time was being spent on the reporting layer than on the database layer (and this is still the case with ad hoc reports, as explained below).

Jasper Reports has a limitation on ad hoc reporting: it cannot use two date columns (StartDate and EndDate) of a table to filter data based on a user-provided effective date or date range, which is the traditional slowly changing dimension (SCD) design, and there are dozens of SCD tables in the data warehouse. This limitation applies only to the ad hoc querying model; canned (operational and analytic) reports work as expected. PSL studied several alternatives and recommended a solution based on views over factless fact tables that store history per date, per learner, across dimension keys, which is not too demanding in terms of extra ETL work. Jasper has another product, Jaspersoft OLAP, that also overcomes this limitation and supports aggregate awareness to decide, per query, whether to use aggregate tables (experimentation is underway on this last point).
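
A minimal sketch of the recommended approach is shown below, with hypothetical object names: the ETL materializes one row per effective date, per learner, per combination of dimension keys, so the StartDate/EndDate filtering of the SCD design is resolved up front and the ad hoc layer only sees a single date column.

    -- Illustrative only: names are assumptions, not the actual schema.
    CREATE TABLE fact_learner_history (
        tenant_id     INT  NOT NULL,
        snapshot_date DATE NOT NULL,   -- one row per effective date per learner
        learner_key   INT  NOT NULL,
        program_key   INT  NOT NULL,
        status_key    INT  NOT NULL
    );
    GO

    -- The view exposed to the ad hoc layer: a single date column to filter on,
    -- which the ad hoc querying model can handle.
    CREATE VIEW v_learner_as_of AS
    SELECT tenant_id, snapshot_date, learner_key, program_key, status_key
    FROM   fact_learner_history;
    GO

    -- An ad hoc query then needs only an equality (or range) filter on one column.
    SELECT learner_key, program_key, status_key
    FROM   v_learner_as_of
    WHERE  tenant_id = 42 AND snapshot_date = '2017-03-31';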

5.2.5 Data movement / ETL tools

On the data movement front, PSL also evaluated the Azure Data Factory service, a public cloud service for populating SQL Server, Azure SQL DW and Azure storage. Transforms can be written in languages such as Hive, Pig, C# or T-SQL (SQL Server's stored procedure language); pipelines are composed of activities by scripting them in JSON (there is no GUI). It is suitable for all kinds of data, not just structured data. Pricing is per data activity used.
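
As a rough illustration of the stored-procedure transform option, the sketch below shows the kind of T-SQL procedure an Azure Data Factory activity could invoke; the staging table, procedure name and load-window parameters are assumptions, and fact_enrollment refers to the illustrative table sketched in section 5.2.3.

    -- Illustrative only: names and parameters are assumptions.
    CREATE TABLE stg_enrollment (
        tenant_id   INT,
        event_date  DATETIME,
        learner_key INT,
        course_key  INT
    );
    GO

    -- Loads one window of staged rows into the warehouse fact table.
    CREATE PROCEDURE dbo.load_fact_enrollment
        @window_start DATETIME,
        @window_end   DATETIME
    AS
    BEGIN
        INSERT INTO fact_enrollment (tenant_id, date_key, learner_key, course_key, enrollments)
        SELECT s.tenant_id,
               CONVERT(INT, CONVERT(CHAR(8), s.event_date, 112)),  -- yyyymmdd surrogate date key
               s.learner_key,
               s.course_key,
               1
        FROM   stg_enrollment s
        WHERE  s.event_date >= @window_start
          AND  s.event_date <  @window_end;
    END;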
