WHITE PAPER
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
5.2.3 Data model of the cloud database
Because the queries are well known in advance and the performance requirements are stringent, the data model is a
traditional dimensional data warehouse model.
PSL had initially decided to build an Operational Data Store (ODS) style warehouse, using a model very similar to
the normalized OLTP model, because (i) the large proportion of history-tracking entities limited how far the data
could be de-normalized, and (ii) every field describing each entity was potentially needed for slicing and dicing
in ad hoc reports. With this normalized schema, however, the initial response times for transactional reports
substantially exceeded the required limits, owing to the large number of joins between dimension tables.
Analytical queries also missed their targets.
After relaxing the requirements on history-tracking entities and ad hoc reporting, PSL proposed a de-normalized
dimensional model to serve both transactional reports and analytic reports as of a particular effective date
(for current-date and time-travel queries). The model follows a shared-schema approach in which tables carry a
tenant-id column. In addition, PSL proposed building aggregate tables and exploiting the aggregate-awareness
feature of reporting tools such as SAP BusinessObjects or IBM Cognos. Further database tuning and optimization
measures, including indexing, statistics management and partitioning on tenant-id, were suggested as well.
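The shared-schema approach can be illustrated with a minimal sketch. The table and column names below (dim_course, fact_enrollment, and so on) are hypothetical, not taken from the actual warehouse; the point is that every table carries a tenant_id column and every query filters on it, which is also what makes partitioning and indexing on tenant-id effective.

```python
import sqlite3

# Two tenants share one schema; tenant_id isolates their rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_course (tenant_id INTEGER, course_key INTEGER, course_name TEXT);
CREATE TABLE fact_enrollment (tenant_id INTEGER, course_key INTEGER, enrollments INTEGER);
INSERT INTO dim_course VALUES (1, 10, 'Math'), (2, 10, 'Math-B');
INSERT INTO fact_enrollment VALUES (1, 10, 25), (2, 10, 40);
""")

# A transactional report for tenant 1 only: the tenant_id predicate on
# both tables mirrors the suggested partitioning/indexing on tenant-id.
cur.execute("""
SELECT d.course_name, f.enrollments
FROM fact_enrollment f
JOIN dim_course d
  ON d.tenant_id = f.tenant_id AND d.course_key = f.course_key
WHERE f.tenant_id = 1
""")
rows = cur.fetchall()
print(rows)  # tenant 2's rows never appear
```

Joining on tenant_id as well as the dimension key is deliberate: it keeps one tenant's facts from ever matching another tenant's dimension rows.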
The customer accepted these decisions, and the performance numbers at the database layer were subsequently
brought in line with the requirements.
Given the customer’s plans to handle unstructured and semi-structured data in the future, PSL also investigated
big data stores such as Hadoop with MapReduce, Spark, and NoSQL databases. However, moving to a big data stack
was judged to be over-engineering at this stage, so building a dimensional data warehouse remains the most
appropriate solution for now.
5.2.4 BI / analytics tools
Jasper Reports is the tool selected by the customer and is in production in the current version of the product; as
such, it must be managed directly by the customer (no BI-as-a-service option exists for this specific product).
Extensive tuning and experimentation were needed to make reports return data within the required times; at some
point more time was being spent on the reporting layer than on the database layer (and this is still the case for
ad hoc reports, as explained below).
Jasper Reports has a limitation in ad hoc reporting: it cannot use two date columns (StartDate and EndDate) of a
table to filter data against a user-provided effective date or date range, which is the traditional slowly
changing dimension (SCD) design, and there are dozens of SCD tables in the data warehouse. This limitation
applies only to the ad hoc querying model; canned (operational and analytic) reports are unaffected. PSL studied
several alternatives and recommended a solution based on views over factless fact tables that store history per
date per learner across dimension keys, which is not too demanding in terms of extra ETL work. Jasper has
another product, Jaspersoft OLAP, that also overcomes this limitation and supports aggregate awareness, deciding
per query whether to use the aggregate tables (experimentation on this last point is underway).
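The contrast between the two designs can be sketched in a few lines. The data and field names below are hypothetical; the sketch shows the interval filter (StartDate <= effective date <= EndDate) that Jasper's ad hoc layer cannot express, next to a factless-fact layout holding one row per learner per date, where an ad hoc tool needs only an equality filter on the snapshot date.

```python
from datetime import date

# SCD-style rows: one row per validity interval (hypothetical data).
scd_rows = [
    {"learner": "A", "dim_key": 1, "start": date(2017, 1, 1), "end": date(2017, 3, 31)},
    {"learner": "A", "dim_key": 2, "start": date(2017, 4, 1), "end": date(9999, 12, 31)},
]

def scd_lookup(rows, effective):
    # The interval filter that the ad hoc layer cannot express.
    return [r["dim_key"] for r in rows if r["start"] <= effective <= r["end"]]

# Factless-fact style: one row per learner per date across dimension keys.
# The extra ETL work is this expansion; querying then needs only equality.
factless = {}
for r in scd_rows:
    d = r["start"]
    while d <= min(r["end"], date(2017, 12, 31)):  # bounded for the sketch
        factless[(r["learner"], d)] = r["dim_key"]
        d = date.fromordinal(d.toordinal() + 1)

effective = date(2017, 2, 15)
assert scd_lookup(scd_rows, effective) == [factless[("A", effective)]]
```

The trade-off is storage and ETL volume (one row per date) in exchange for a filter shape that any ad hoc tool can generate.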
5.2.5 Data movement /ETL tools
On the data movement front, PSL also evaluated Azure Data Factory, a public cloud service that can populate
SQL Server, Azure SQL DW and Azure storage. Transforms can be written in languages such as Hive, Pig, C# or
T-SQL (the stored procedure language for SQL Server); pipelines are composed of activities scripted in JSON code
(there is no GUI). The service is suitable for all kinds of data, not just structured data, and pricing is per
data activity used.
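As an illustration of the JSON-only authoring model, the sketch below builds the general shape of a Data Factory pipeline definition as a Python dict. The pipeline, dataset and activity names are hypothetical, and the snippet is not a deployable definition; it only shows activities chained through input and output datasets.

```python
import json

# Hypothetical pipeline shape: a Copy activity feeding a T-SQL
# stored-procedure transform, chained via named datasets.
pipeline = {
    "name": "CopyStagingToWarehouse",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlob",
                "type": "Copy",
                "inputs": [{"name": "StagingBlobDataset"}],
                "outputs": [{"name": "AzureSqlDwDataset"}],
            },
            {
                "name": "TransformInDw",
                "type": "SqlServerStoredProcedure",  # T-SQL transform step
                "inputs": [{"name": "AzureSqlDwDataset"}],
                "outputs": [{"name": "ReportingDataset"}],
            },
        ]
    },
}

# The JSON that would be submitted to the service (no GUI involved).
print(json.dumps(pipeline, indent=2))
```

Since pricing is per data activity, the two activities above would be billed as two activity runs per pipeline execution.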