
W H I T E P A P E R

© 2017 Persistent Systems Ltd. All rights reserved. 16

www.persistent.com

3.2.2 Hadoop for Analytics, or Data Lakes

Data lakes [7] are an emerging approach to extracting and placing big data for analytics, along with corporate data, in a Hadoop cluster on which several components are layered to enable data scientists and business analysts to extract analytical value from the data.

In the original concept expressed in [7], one of the primary motivations of a data lake is not to lose any data that may be relevant for analysis, now or at some point in the future; the idea is to provide as large a pool of data as possible. Data lakes thus store all the data deemed relevant for analysis and ingest it in raw form, without conforming to any overarching data model. So, for instance, when a customer buys a product from the organization's e-commerce web front-end, we want not only the data of the customer's transaction from the underlying database, but also all the logged events that trace the customer's experience during the purchase on the web site. We might also be interested in gathering data from social sites to understand the experience that this customer might have with the product in the future.¹²
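As a minimal sketch of what ingesting in raw form can look like, the Python snippet below lands records from the two sources in the example above (a transaction row and a logged web event) unchanged, wrapping each one in a small envelope instead of mapping it to a shared data model. The helper name `land_raw`, the envelope fields and the directory layout are assumptions made for illustration only; they are not taken from [7] or from any Hadoop API, and a local temporary directory stands in for the cluster's file system.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def land_raw(raw_zone: Path, source: str, payload: str) -> Path:
    """Append one record, exactly as received, to the raw zone.

    Hypothetical helper: the payload is stored as an opaque string, so no
    data model is imposed at ingestion time.
    """
    now = datetime.now(timezone.utc)
    envelope = {
        "id": str(uuid.uuid4()),         # unique id for later traceability
        "source": source,                # where the record came from
        "ingested_at": now.isoformat(),  # when it entered the lake
        "payload": payload,              # untouched original content
    }
    # One directory per source and ingestion day (an assumed layout).
    target_dir = raw_zone / source / now.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.jsonl"
    with target.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(envelope) + "\n")
    return target


# A local temp directory stands in for the Hadoop file system.
raw_zone = Path(tempfile.mkdtemp(prefix="raw_zone_"))

# A structured row from the transactional database ...
order_file = land_raw(
    raw_zone, "orders_db",
    '{"order_id": 1001, "customer_id": 42, "total": 59.90}',
)
# ... and an unstructured web log line, stored side by side without a common schema.
click_file = land_raw(
    raw_zone, "web_clickstream",
    "2017-03-01T10:15:02Z GET /product/77 session=abc123",
)
```

Because the payload is never parsed on the way in, a source whose format changes later does not break ingestion; interpretation is deferred to whoever reads the data.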

Reference [5] is PSL's official “Data Lake story”, explaining our point of view on this concept. What follows summarizes some sections of [5]. To be effective for analytics processing, the Hadoop environment uses a variety of processing tools to semantically discover and to govern the data, including traceability through lineage metadata and tools to improve its overall quality. This makes it possible to find the data needed to address a business process among huge amounts of data, to provide a level of trust in this data, and to make it effectively consumable. Finally, the environment provides tools and exposes APIs for consumers to extract business value through several types of consumer workloads, including traditional BI queries, exploration and machine learning / statistical workloads. The figure below is PSL's data lake reference architecture from [5].

Data Lake - Reference Architecture
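The discovery and lineage ideas described above can be sketched with a tiny in-memory catalog. This is an illustrative assumption, not the architecture of [5] and not the API of any Hadoop governance tool: the class names `DatasetEntry` and `Catalog`, their fields, and the dataset names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    """Hypothetical catalog record for one dataset in the lake."""
    name: str
    description: str
    derived_from: List[str] = field(default_factory=list)  # upstream dataset names
    transformation: str = ""                                # how it was produced


class Catalog:
    """Hypothetical in-memory catalog supporting search and lineage tracing."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Discovery: names of datasets whose description mentions the keyword."""
        kw = keyword.lower()
        return sorted(
            name for name, e in self._entries.items()
            if kw in e.description.lower()
        )

    def lineage(self, name: str) -> List[str]:
        """Governance: every upstream dataset of `name`, nearest first."""
        seen: List[str] = []
        queue = list(self._entries[name].derived_from)
        while queue:
            current = queue.pop(0)  # breadth-first walk over derived_from links
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].derived_from)
        return seen


catalog = Catalog()
catalog.register(DatasetEntry(
    "raw.orders_db", "Raw order transactions from the e-commerce database"))
catalog.register(DatasetEntry(
    "raw.web_clickstream", "Raw web click events logged during purchases"))
catalog.register(DatasetEntry(
    "curated.purchase_journeys", "Purchase events joined with order transactions",
    ["raw.orders_db", "raw.web_clickstream"], "join on session and customer"))
catalog.register(DatasetEntry(
    "mart.conversion_report", "Conversion rates per product for BI",
    ["curated.purchase_journeys"], "aggregate by product"))

found = catalog.search("order")
upstream = catalog.lineage("mart.conversion_report")
```

Here `found` lists the datasets an analyst could discover by keyword, and `upstream` traces a BI report back to the raw sources it depends on, which is the kind of traceability that lets a consumer decide how far to trust a derived dataset.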

¹² Provided we are able to identify him/her as the same customer that bought the product, which may or may not be possible.