WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

Data Lakes and Data Warehouses

Data lakes are therefore far more flexible alternatives to data warehouses for gathering all of an organization's data that is relevant for analysis, especially as organizations become more outward looking and increase their exposure to cloud and mobile applications. Indeed, data warehouses require upfront modeling, and only good quality data conforming to the model should be loaded. However, as we will see below, this does not mean that data in the lake is not modeled for further analysis, or that its quality can be disregarded: it just means that these two activities can be deferred.
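The idea of deferring modeling and quality checks can be illustrated with a minimal sketch. It assumes a toy "lake" held in memory as a list of raw strings; the record fields (`customer`, `amount`, `channel`) and the function name `read_sales` are hypothetical and chosen only for illustration. Records are stored exactly as produced, and the model and the quality rules are applied only when a consumer reads them.

```python
import json

# Raw events land in the lake exactly as produced: no model, no quality gate.
# Hypothetical sample records; the field names are assumptions.
RAW_LAKE = [
    '{"customer": "acme", "amount": "120.50", "channel": "mobile"}',
    '{"customer": "globex", "amount": "n/a", "channel": "web"}',
    '{"customer": "", "amount": "75", "channel": "cloud"}',
    'not even json',
]


def read_sales(raw_records):
    """Apply a model and quality rules at read time, not at load time."""
    clean, rejected = [], []
    for line in raw_records:
        try:
            record = json.loads(line)
            # Deferred modeling: project the raw record onto a typed row.
            row = {
                "customer": record["customer"].strip(),
                "amount": float(record["amount"]),
                "channel": record["channel"],
            }
            # Deferred quality check: a sale must name its customer.
            if not row["customer"]:
                raise ValueError("missing customer")
        except (ValueError, KeyError, AttributeError, TypeError) as exc:
            rejected.append((line, str(exc)))
            continue
        clean.append(row)
    return clean, rejected


clean_rows, rejected_rows = read_sales(RAW_LAKE)
print(len(clean_rows), "usable rows,", len(rejected_rows), "rejected at read time")
```

A warehouse would instead reject the malformed records at load time. In this sketch nothing is lost on ingestion, and a different consumer could later apply a different model to the same raw records.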

As we already know, data warehouses are optimized for a different purpose: answers to the questions for which they have been designed can be trusted immediately and, because of more mature (but more expensive) engines, they can support fast response times for large numbers of concurrent users.

Data warehouses and data lakes should therefore each be used for the purpose for which they were designed and, as such, they can coexist in an organization's landscape: data from the warehouse can be fed into the lake and, conversely, data from the lake can be a source for the warehouse or for a data mart (a portion of a data warehouse specializing in one business process).

Roles in a Data Lake

A data lake owner (also called a custodian) is a role fulfilled by IT-savvy people, who may or may not belong to the IT department of an organization, and who try to satisfy the needs of personas in two different roles:

—

The data producers: business heads who own the data specific to their business function. Their concern is control: they worry about regulatory compliance, visibility into who uses their data, and privacy and access control, among others.

—

The data consumers, who may be either data scientists or business analysts. They are the ones who discover, explore and visualize data, and ultimately deliver business value to executives in the form of insights. Their concern is flexibility: they want to find relevant data for their use cases quickly, consume good quality data, run several analytics workloads, etc.

Data custodians can use several tools on the data lake to assure data producers that their data is in safe hands and is being governed (controlled for access, quality and lineage, even if these controls are applied at consumption time) and monitored, while giving data consumers full flexibility to discover, explore and visualize data sets and derive valuable insights.
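As a rough illustration of governance applied at consumption time, the sketch below wraps a data set so that every read is checked against the roles the producer allows and is recorded for later review, while derived data sets remember their source. The class `GovernedDataset` and all of its fields are hypothetical; real data lake tooling would provide these controls through its own catalog and security components.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class GovernedDataset:
    """Hypothetical wrapper: access control, usage visibility and lineage."""
    name: str
    owner: str              # the data producer who owns this data
    allowed_roles: set      # roles the producer permits to read it
    rows: list
    lineage: list = field(default_factory=list)     # names of source data sets
    access_log: list = field(default_factory=list)  # who tried to read, and when

    def read(self, user, role):
        """Check access at consumption time and record the attempt."""
        granted = role in self.allowed_roles
        self.access_log.append({
            "user": user,
            "role": role,
            "granted": granted,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if not granted:
            raise PermissionError(f"role {role!r} may not read {self.name!r}")
        return list(self.rows)

    def derive(self, name, transform):
        """Create a derived data set that remembers where it came from."""
        return GovernedDataset(
            name=name,
            owner=self.owner,
            allowed_roles=set(self.allowed_roles),
            rows=[transform(row) for row in self.rows],
            lineage=self.lineage + [self.name],
        )


sales = GovernedDataset("sales", "head-of-sales", {"analyst"}, [{"amount": 10}])
sales.read("ana", "analyst")  # permitted, and logged
doubled = sales.derive("sales_doubled", lambda r: {"amount": r["amount"] * 2})
print(doubled.lineage, len(sales.access_log))
```

The producer keeps control (allowed roles, a log of who used the data, traceable lineage), while the consumer is still free to explore and derive new data sets without an upfront approval step for each one.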

3.2.3 Hadoop as a service

At the intersection of the cloud and big data there is a nascent but fast-growing market: Hadoop-as-a-service, or HaaS. The biggest drivers are a reduced need for technical expertise and low upfront costs.

Amazon Elastic MapReduce (EMR), the HaaS offering from AWS, is the largest player. It provides a Hadoop-based platform for data analysis with S3 as the storage system and EC2 as the compute system. Microsoft Azure HDInsight, Cloudera CDH3, IBM BigInsights for Apache Hadoop for Bluemix and EMC Pivotal HD are the primary competing HaaS services provided by global IT giants. Most of these competitors initially provided a Run It Yourself deployment option and now offer managed options. Altiscale, recently bought by SAP, is a pure-play provider, meaning that it handles the complete running and management of the Hadoop jobs. Another pure-play provider is Qubole.
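To make the split between S3 storage and EC2 compute in EMR concrete, here is a minimal sketch that only builds a cluster request and does not contact AWS. The dictionary keys follow the parameters of the boto3 EMR client's `run_job_flow` call; the bucket name, script path, cluster name, release label and instance types are placeholder assumptions, and `build_emr_request` is a hypothetical helper.

```python
def build_emr_request(bucket, job_script):
    """Return keyword arguments shaped for boto3's EMR `run_job_flow` call.

    All concrete values (names, release label, instance types) are
    illustrative assumptions, not recommendations.
    """
    return {
        "Name": "lake-batch-job",
        "ReleaseLabel": "emr-6.15.0",
        # Storage: logs and job code live in S3, outside the cluster.
        "LogUri": f"s3://{bucket}/logs/",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        # Compute: EC2 instances that exist only for the duration of the job.
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "run-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", f"s3://{bucket}/{job_script}"],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("example-lake-bucket", "jobs/etl.py")
print(request["LogUri"])
```

With boto3 installed and AWS credentials configured, a request like this could be submitted with `boto3.client("emr").run_job_flow(**request)`. Because the data stays in S3, the EC2 cluster can be terminated when the step finishes, which is one source of the low upfront cost mentioned above.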