

5.4.3 Big Data
In this section, we look at which best practices from Dimensional Modeling can be applied in the Big Data space. Big Data has forced a new way of thinking about:
— A world extending beyond the RDBMS
— Analytics beyond slice/dice/filtering
— How to eliminate high-latency insights
Dimensional thinking can be applied to Big Data. For example, analysis of a Twitter firehose or an IoT application data
stream is valuable only when provided with context, and it is the dimensions that provide that context: Prior Event, Weather,
Who (Person), What (Condition, Topic), Where (Location), and When (Time) are all examples. Dimensionalizing should
be automated to keep pace with the scale and velocity of the data, as in the sketch below.
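
To make this concrete, the following minimal sketch (PySpark is assumed here as one plausible choice for this layer; the paths, table names, and columns are illustrative, not taken from any specific deployment) shows an ingest job that automatically attaches dimensional context to raw IoT events:

# Minimal sketch (PySpark assumed): attaching dimensional context to raw
# events at ingest time. All names below are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dimensionalize").getOrCreate()

events = spark.read.json("hdfs:///raw/iot_events")   # raw event dump
dim_device = spark.table("dim_device")               # Who
dim_location = spark.table("dim_location")           # Where
dim_date = spark.table("dim_date")                   # When

# Each join adds one axis of context; in production these lookups would
# run inside the automated ingest pipeline, not as an ad hoc batch job.
fact_events = (events
    .withColumn("date_key", F.to_date("event_ts"))
    .join(dim_device, "device_id", "left")
    .join(dim_location, "location_id", "left")
    .join(dim_date, "date_key", "left"))

fact_events.write.mode("append").parquet("hdfs:///warehouse/fact_events")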
In Big Data environments, as mentioned in section 3.2.2 above, modeling is done after the data is ingested. This is not
schema-on-write; instead, the schema is modeled on read to suit the data access patterns.
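
As a brief illustration of schema-on-read, this hypothetical PySpark snippet leaves the raw files untouched at ingest and applies a schema, chosen to fit one access pattern, only when the data is read; the field names and paths are assumptions:

# Minimal schema-on-read sketch (PySpark assumed). The raw JSON files are
# stored as-is; this schema is chosen for the query at hand, not enforced
# when the data was written. Names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.appName("schema_on_read").getOrCreate()

reading_schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("event_ts", TimestampType()),
    StructField("temperature", DoubleType()),
])

readings = spark.read.schema(reading_schema).json("hdfs:///raw/sensor_dump")
readings.createOrReplaceTempView("readings")
spark.sql("SELECT sensor_id, avg(temperature) "
          "FROM readings GROUP BY sensor_id").show()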
The diversity of data sources (usually external to, or not controlled by, Enterprise IT) means that conforming the data
by dimensions (keys, definitions, granularity, attributes) is even more critical to the integration pipeline:
— Generation of durable surrogate keys is applicable to Big Data (see the sketch after this list).
— Time variance tracking (via Slowly Changing Dimensions, SCDs) is pertinent.
— Both structured and unstructured data will be integrated; attributes of a medical examination fact table will be
linked to multimedia data (images, scans) and text (physician annotations).
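
The following sketch (again PySpark; the table names, tracked attribute, and key-hashing choice are illustrative assumptions) covers the first two points above: hash-based durable surrogate keys and a Type 2 SCD expire-and-append pass. Writing the merged result back to the dimension is omitted:

# Minimal sketch (PySpark assumed): durable surrogate keys plus
# SCD Type 2 time-variance tracking. All names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("conform_dims").getOrCreate()

incoming = spark.table("staging_customer")   # today's extract
dim = spark.table("dim_customer")            # existing dimension

# Durable surrogate key: hashing the natural key survives re-ingestion
# and source-system key recycling better than a database sequence would.
incoming = incoming.withColumn(
    "customer_sk", F.sha2(F.col("customer_natural_key"), 256))

# SCD Type 2: find rows whose tracked attribute changed, expire the
# current version, and append a new version with an open validity window.
changed = (incoming.alias("n")
    .join(dim.where(F.col("is_current")).alias("o"), "customer_sk")
    .where(F.col("n.address") != F.col("o.address")))

expired = (changed.select("o.*")
    .withColumn("is_current", F.lit(False))
    .withColumn("valid_to", F.current_date()))

new_rows = (changed.select("n.*")
    .withColumn("is_current", F.lit(True))
    .withColumn("valid_from", F.current_date())
    .withColumn("valid_to", F.lit(None).cast("date")))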
Changing role of Data Marts and ODSs (Operational Data Stores).
We have seen how the role of a data mart in an organization's data ecosystem changes over time. Data Marts usually start out as an initial proof-of-concept to gauge
the return on investment of analytics. In some organizations, they supply curated “departmental” data
to the EDW (Enterprise Data Warehouse). With the advent of a Big Data system in the data ecosystem, we now see
Data Marts as a downstream system from the Hadoop Big Data system, playing the role of a “serving
layer” or presentation layer; in some cases, data marts are being done away with altogether. Also, with the “unlimited” storage
provided by Hadoop-based Big Data systems and the relief in terms of database licensing costs, there are instances
where ODSs are now being retired (especially those that have no operational reports built on them). The data quality
checks at the lowest level of transaction granularity are now performed during the ingest phase, or just after ingestion.
As discussed in section 6.3.3.2 below, in the case of data lakes, the data quality checks are done after ingestion.
As mentioned in section 3.2.1 above, the tendency is to perform the data transformations in the Hadoop layer itself
(there is no longer a need for a separate staging database and staging DB servers, or for extensive investments in ETL tools
coupled with expensive database licenses), as sketched below.
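
A minimal sketch of this ELT pattern, with hypothetical paths and quality rules, might look as follows: both the quality gate at the transaction grain and the downstream transformation run in the Hadoop layer, and rejected rows are quarantined for inspection rather than lost:

# Minimal ELT sketch (PySpark assumed): quality checks on (or just after)
# ingest, then in-cluster transformation, replacing a staging database
# plus an external ETL tool. Paths and rules are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("elt_in_hadoop").getOrCreate()

raw = spark.read.parquet("hdfs:///lake/raw/orders")

# Quality gate at the lowest transaction grain.
valid = raw.where(F.col("order_id").isNotNull() & (F.col("amount") > 0))
rejects = raw.subtract(valid)
rejects.write.mode("append").parquet("hdfs:///lake/quarantine/orders")

# Transformation happens in-cluster: conform, aggregate, publish.
daily = (valid
    .groupBy(F.to_date("order_ts").alias("order_date"), "region")
    .agg(F.sum("amount").alias("total_amount")))
daily.write.mode("overwrite").parquet("hdfs:///warehouse/orders_daily")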
OLAP-style multidimensional analysis directly on big data.
One of the early use cases for Hadoop was to prepare data
for multidimensional analysis: until recently, the data had to be extracted to a traditional downstream data warehouse or
OLAP engine (from players such as Microsoft, Oracle, IBM, SAP, etc.). However, Hadoop and its ecosystem tools are
now increasingly capable of addressing use cases that go well beyond distributed batch processing, and these now
include OLAP-style analytics. Indeed, several recent open-source multidimensional analysis capabilities that run “on
the cluster” are putting pressure on the traditional OLAP engine players.
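
As one illustration, Spark SQL's cube() computes the full aggregate lattice, the same grouping-set combinations a MOLAP engine would precompute, directly on cluster data; the column names in this sketch are assumptions:

# Minimal sketch (PySpark assumed): OLAP-style rollups computed on the
# cluster, with no export to a downstream OLAP engine. Names illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("olap_on_cluster").getOrCreate()

sales = spark.read.parquet("hdfs:///warehouse/fact_sales")

# cube() produces every grouping-set combination of the listed dimensions.
lattice = (sales.cube("region", "product_category", "year")
    .agg(F.sum("revenue").alias("revenue"),
         F.count(F.lit(1)).alias("transactions")))
lattice.show()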