

5.4.3 Big Data
In this section, we look at which best practices from Dimensional Modeling can be applied in the Big Data space. Big Data has forced a new way of thinking about:
— A world extending beyond the RDBMS
— Analytics beyond slice/dice/filtering
— How to eliminate high-latency insights
Dimensional thinking can be applied to Big Data. For example, analysis of a Twitter firehose or an IoT application data
stream is valuable only when provided with context, and it is the dimensions that provide that context: Prior Event, Weather,
Who (Person), What (Condition, Topic), Where (Location), and When (Time) are all examples. Dimensionalizing should
be automated to keep pace with the scale and velocity of the data, as in the sketch below.
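
To make this concrete, the following minimal sketch (PySpark is assumed here as one plausible choice for this layer; the paths, table names, and columns are illustrative, not taken from any specific deployment) shows an ingest job that automatically attaches dimensional context to raw IoT events:

# Minimal sketch (PySpark assumed): attaching dimensional context to raw
# events at ingest time. All names below are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dimensionalize").getOrCreate()

events = spark.read.json("hdfs:///raw/iot_events")   # raw event dump
dim_device = spark.table("dim_device")               # Who
dim_location = spark.table("dim_location")           # Where
dim_date = spark.table("dim_date")                   # When

# Each join adds one axis of context; in production these lookups would
# run inside the automated ingest pipeline, not as an ad hoc batch job.
fact_events = (events
    .withColumn("date_key", F.to_date("event_ts"))
    .join(dim_device, "device_id", "left")
    .join(dim_location, "location_id", "left")
    .join(dim_date, "date_key", "left"))

fact_events.write.mode("append").parquet("hdfs:///warehouse/fact_events")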
In Big Data environments, as mentioned in section 3.2.2 above, modeling is done after the data is ingested. This is not
schema-on-write; instead, the schema is modeled on read to suit the data access patterns.
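
As a brief illustration of schema-on-read, this hypothetical PySpark snippet leaves the raw files untouched at ingest and applies a schema, chosen to fit one access pattern, only when the data is read; the field names and paths are assumptions:

# Minimal schema-on-read sketch (PySpark assumed). The raw JSON files are
# stored as-is; this schema is chosen for the query at hand, not enforced
# when the data was written. Names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.appName("schema_on_read").getOrCreate()

reading_schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("temperature", DoubleType()),
])

readings = spark.read.schema(reading_schema).json("hdfs:///raw/sensor_dump")
readings.createOrReplaceTempView("readings")
spark.sql("SELECT sensor_id, avg(temperature) "
          "FROM readings GROUP BY sensor_id").show()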
The diversity of data sources (usually external to, or not controlled by, Enterprise IT) means that conforming the data
by dimensions (keys, definitions, granularity, attributes) is even more critical to the integration pipeline:
— Generation of durable surrogate keys is applicable to Big Data (see the sketch after this list).
— Time variance tracking (via Slowly Changing Dimensions, SCDs) is pertinent.
— Both structured and unstructured data will be integrated; attributes of a medical examination fact table will be
linked to multimedia data (images, scans) and text (physician annotations).
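
The following sketch (again PySpark; the table names, tracked attribute, and key-hashing choice are illustrative assumptions) covers the first two points above: hash-based durable surrogate keys and a Type 2 SCD expire-and-append pass. Writing the merged result back to the dimension is omitted:

# Minimal sketch (PySpark assumed): durable surrogate keys plus
# SCD Type 2 time-variance tracking. All names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("conform_dims").getOrCreate()

incoming = spark.table("staging_customer")   # today's extract
dim = spark.table("dim_customer")            # existing dimension

# Durable surrogate key: hashing the natural key survives re-ingestion
# and source-system key recycling better than a database sequence would.
incoming = incoming.withColumn(
    "customer_sk", F.sha2(F.col("customer_natural_key"), 256))

# SCD Type 2: find rows whose tracked attribute changed, expire the
# current version, and append a new version with an open validity window.
changed = (incoming.alias("n")
    .join(dim.where(F.col("is_current")).alias("o"), "customer_sk")
    .where(F.col("n.address") != F.col("o.address")))

expired = (changed.select("o.*")
    .withColumn("is_current", F.lit(False))
    .withColumn("valid_to", F.current_date()))

new_rows = (changed.select("n.*")
    .withColumn("is_current", F.lit(True))
    .withColumn("valid_from", F.current_date())
    .withColumn("valid_to", F.lit(None).cast("date")))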
Changing role of Data Marts and ODSs (Operational Data Stores).
We have seen how the role of a data mart in an organization's data ecosystem changes over time. Data Marts usually start out as an initial proof-of-concept to gauge
the return on investment of analytics. In some organizations, they supply curated “departmental” data
to the EDW (Enterprise Data Warehouse). With the advent of a Big Data system in the data ecosystem, we now see
Data Marts as a downstream system from the Hadoop Big Data system, playing the role of a “serving
layer” or presentation layer; in some cases, data marts are being done away with altogether. Also, with the “unlimited” storage
provided by Hadoop-based Big Data systems and the relief in terms of database licensing costs, there are instances
where ODSs are now being retired (especially those that have no operational reports built on them). The data quality
checks at the lowest level of transaction granularity are now performed during the ingest phase, or just after ingestion.
As discussed in section 6.3.3.2 below, in the case of data lakes, the data quality checks are done after ingestion.
As mentioned in section 3.2.1 above, the tendency is to perform the data transformations in the Hadoop layer itself
(there is no longer a need for a separate staging database and staging DB servers, or for extensive investments in ETL tools
coupled with expensive database licenses), as sketched below.
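
A minimal sketch of this ELT pattern, with hypothetical paths and quality rules, might look as follows: both the quality gate at the transaction grain and the downstream transformation run in the Hadoop layer, and rejected rows are quarantined for inspection rather than lost:

# Minimal ELT sketch (PySpark assumed): quality checks on (or just after)
# ingest, then in-cluster transformation, replacing a staging database
# plus an external ETL tool. Paths and rules are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt_in_hadoop").getOrCreate()

raw = spark.read.parquet("hdfs:///lake/raw/orders")

# Quality gate at the lowest transaction grain.
valid = raw.where(F.col("order_id").isNotNull() & (F.col("amount") > 0))
rejects = raw.subtract(valid)
rejects.write.mode("append").parquet("hdfs:///lake/quarantine/orders")

# Transformation happens in-cluster: conform, aggregate, publish.
daily = (valid
    .groupBy(F.to_date("order_ts").alias("order_date"), "region")
    .agg(F.sum("amount").alias("total_amount")))
daily.write.mode("overwrite").parquet("hdfs:///warehouse/orders_daily")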
OLAP-style multidimensional analysis directly on big data.
One of the early use cases for Hadoop was to prepare data
for multidimensional analysis: until recently, the data had to be extracted to a traditional downstream data warehouse or
OLAP engine (from players such as Microsoft, Oracle, IBM, SAP, etc.). However, Hadoop and its ecosystem tools are
now increasingly capable of addressing use cases that go well beyond distributed batch processing, and these now
include OLAP-style analytics. Indeed, several recent open-source multidimensional analysis capabilities that run “on
the cluster” are putting pressure on the traditional OLAP engine players.
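
As one illustration, Spark SQL's cube() computes the full aggregate lattice, the same grouping-set combinations a MOLAP engine would precompute, directly on cluster data; the column names in this sketch are assumptions:

# Minimal sketch (PySpark assumed): OLAP-style rollups computed on the
# cluster, with no export to a downstream OLAP engine. Names illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("olap_on_cluster").getOrCreate()

sales = spark.read.parquet("hdfs:///warehouse/fact_sales")

# cube() produces every grouping-set combination of the listed dimensions.
lattice = (sales.cube("region", "product_category", "year")
    .agg(F.sum("revenue").alias("revenue"),
         F.count(F.lit(1)).alias("transactions")))
lattice.show()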