WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

3.2.2 Hadoop for Analytics, or Data Lakes

A data lake is an emerging approach to extracting and placing big data for analytics, along with corporate data, in a Hadoop cluster on which several components are layered so that data scientists and business analysts can effectively extract analytical value from the data.

In the original concept expressed i

one of the primary motivations of a data lake is not to lose any data that may be relevant for analysis, now or at some point in the future; the idea is to provide as large a pool of data as possible. Data lakes thus store all the data deemed relevant for analysis and ingest it in raw form, without conforming it to any overarching data model. For instance, when a customer buys a product from the organization's e-commerce web front-end, we want not only the customer's transaction data from the underlying database, but also all the logged events that trace the customer's experience on the web site during the purchase. We might also be interested in gathering data from social sites to understand the experience that this customer might have with the product in the future.
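The raw-ingestion idea above can be illustrated with a minimal sketch. It assumes an in-memory list standing in for the Hadoop cluster, and the source names (`orders_db`, `web_clickstream`), field names, and helper functions are hypothetical, chosen only for illustration. Records are stored verbatim with no shared data model, and a structure is applied only when the data is read for a particular analysis (an approach often called schema-on-read).

```python
import json
from datetime import datetime, timezone


def ingest_raw(lake, source, payload):
    """Append one record to the lake as received, without a shared schema."""
    lake.append({
        "source": source,  # where the record came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw": json.dumps(payload),  # payload kept verbatim as text
    })


def read_with_schema(lake, source, fields):
    """Project only the fields a given analysis needs, at read time."""
    rows = []
    for record in lake:
        if record["source"] != source:
            continue
        payload = json.loads(record["raw"])
        # Fields absent from a raw record come back as None.
        rows.append({name: payload.get(name) for name in fields})
    return rows


# Hypothetical data: one transaction plus the web events around it.
lake = []
ingest_raw(lake, "orders_db", {"customer_id": 42, "sku": "A-100", "amount": 19.99})
ingest_raw(lake, "web_clickstream", {"session": "s1", "customer_id": 42, "event": "add_to_cart"})
ingest_raw(lake, "web_clickstream", {"session": "s1", "customer_id": 42, "event": "checkout"})

events = read_with_schema(lake, "web_clickstream", ["customer_id", "event"])
```

Because nothing is discarded or reshaped at ingestion time, a later analysis can ask for fields that nobody anticipated when the data was collected, at the cost of handling missing or inconsistent fields at read time.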

Reference

is PSL's official “Data Lake story”, explaining our point of view on this concept. What follows

summarizes some sections in

To be effective for analytics processing, the Hadoop environment uses a variety of processing tools to semantically discover and to govern the data, including traceability through lineage metadata and tools to improve its overall quality. This makes it possible to find the data needed for a business process among huge amounts of data, to provide a level of trust in this data, and to make it effectively consumable. Finally, the environment provides tools and exposes APIs for consumers to extract business value through several types of workloads, including traditional BI queries, exploration, and machine learning / statistical workloads. The figure below is PSL's data lake

reference architecture from

Data Lake - Reference Architecture
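As a rough illustration of the discovery and governance ideas described above, the sketch below models a tiny metadata catalog in which each dataset carries descriptive tags and a list of the datasets it was derived from. The `Catalog` and `DatasetEntry` classes, the dataset names, and the tags are all hypothetical; they do not correspond to any specific Hadoop tool or to the components of the reference architecture.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    tags: set
    derived_from: list = field(default_factory=list)  # names of direct parents


class Catalog:
    """Toy metadata catalog: tag-based discovery plus lineage tracing."""

    def __init__(self):
        self.entries = {}

    def register(self, name, tags, derived_from=()):
        self.entries[name] = DatasetEntry(name, set(tags), list(derived_from))

    def discover(self, tag):
        """Return the names of all datasets carrying the given tag."""
        return sorted(n for n, e in self.entries.items() if tag in e.tags)

    def lineage(self, name):
        """Return all upstream datasets, nearest first (breadth-first)."""
        seen, queue = [], list(self.entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self.entries:
                queue.extend(self.entries[parent].derived_from)
        return seen


catalog = Catalog()
catalog.register("orders_raw", {"customer", "sales"})
catalog.register("clickstream_raw", {"customer", "web"})
catalog.register("customer_journey", {"customer", "curated"},
                 derived_from=["orders_raw", "clickstream_raw"])
catalog.register("churn_features", {"ml"}, derived_from=["customer_journey"])

customer_datasets = catalog.discover("customer")
upstream = catalog.lineage("churn_features")
```

Here `discover` stands in for finding the data relevant to a business question, while `lineage` gives a consumer a way to trace a derived dataset back to its raw sources, which is one basis for deciding how far to trust it.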

Provided we are able to identify him/her as the same customer that bought the product, which may or may not be possible.