

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved. 16
www.persistent.com
3.2.2 Hadoop for Analytics, or Data Lakes
Data lakes [7] are an emerging approach to extracting and placing big data for analytics, along with corporate data, in a Hadoop cluster on which several components are layered so that data scientists and business analysts can effectively extract analytical value from the data.
In the original concept expressed in [7], one of the primary motivations of a data lake is not to lose any data that may be relevant for analysis, now or at some point in the future; the idea is to provide as large a pool of data as possible. Data lakes thus store all the data deemed relevant for analysis and ingest it in raw form, without conforming to any overarching data model. So, for instance, when a customer buys a product from the organization's e-commerce web front-end, we want not only the data of the customer's transaction from the underlying database, but also all the logged events that trace the customer's experience during the purchase on the web site. We might also be interested in gathering data from social sites to understand the experience that this customer might have with the product in the future.¹²
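The raw-ingestion idea described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not PSL's implementation: the in-memory list standing in for the Hadoop cluster, the helper names `ingest_raw` and `read_transactions`, the source labels, and the sample records are all hypothetical. The point it shows is that records from different sources are stored untouched, and a structure is imposed only when a specific analysis reads them.

```python
import json
from datetime import datetime, timezone


def ingest_raw(lake, source, payload):
    """Store a record exactly as received, wrapped in minimal envelope metadata."""
    lake.append({
        "source": source,  # hypothetical label for the originating system
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw": payload,  # original text, not parsed and not conformed to any model
    })


def read_transactions(lake):
    """Interpret raw records only at read time, for one specific analysis."""
    for record in lake:
        if record["source"] == "orders_db":
            row = json.loads(record["raw"])
            yield row["customer_id"], row["amount"]


# A plain list stands in for the lake; all sample data below is made up.
lake = []
ingest_raw(lake, "orders_db", '{"customer_id": "c42", "sku": "A-1", "amount": 19.99}')
ingest_raw(lake, "web_log", "2017-03-01T10:00:02Z c42 GET /product/A-1")
ingest_raw(lake, "web_log", "2017-03-01T10:01:40Z c42 POST /checkout")

transactions = list(read_transactions(lake))
```

The web log lines are kept even though no reader for them exists yet, which mirrors the motivation of not losing data that may become relevant later.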
Reference [5] is PSL's official “Data Lake story”, explaining our point of view on this concept. What follows summarizes some sections in [5]. To be effective for analytics processing, the Hadoop environment uses a variety of processing tools to semantically discover and to govern the data, including traceability through lineage metadata and tools to improve the data's overall quality. This makes it possible to find, within huge amounts of data, the data needed for a given business process, while providing a level of trust in this data and making it effectively consumable. Finally, the environment supports tools and exposes APIs for consumers to extract business value through several types of consumer workloads, including traditional BI queries, exploration, and machine learning / statistical workloads. The figure below is PSL's data lake reference architecture from [5].
Data Lake - Reference Architecture
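As a rough illustration of the discovery and lineage ideas summarized in this section, the following Python sketch models a tiny metadata catalog. It is an assumption-laden toy, not the architecture from [5]: the `DatasetEntry` and `Catalog` classes, the tag vocabulary, and the dataset names are all hypothetical. Discovery is reduced to a tag lookup, and lineage to walking the recorded parent datasets.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog record describing one dataset in the lake."""
    name: str
    description: str
    tags: set = field(default_factory=set)
    derived_from: list = field(default_factory=list)  # names of direct parents


class Catalog:
    """Toy metadata catalog: tag-based discovery plus lineage traversal."""

    def __init__(self):
        self.entries = {}

    def register(self, entry):
        self.entries[entry.name] = entry

    def discover(self, tag):
        """Return the names of all datasets carrying the given tag, sorted."""
        return sorted(e.name for e in self.entries.values() if tag in e.tags)

    def lineage(self, name):
        """Return every upstream dataset of `name`, nearest ancestors first."""
        seen, queue = [], list(self.entries[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent not in seen:
                seen.append(parent)
                queue.extend(self.entries[parent].derived_from)
        return seen


# All dataset names and tags below are made up for the example.
catalog = Catalog()
catalog.register(DatasetEntry("raw_orders", "Orders as ingested", {"sales"}))
catalog.register(DatasetEntry("raw_weblogs", "Web events as ingested", {"clickstream"}))
catalog.register(DatasetEntry("clean_orders", "Validated orders",
                              {"sales", "curated"}, ["raw_orders"]))
catalog.register(DatasetEntry("purchase_journeys", "Orders joined with web events",
                              {"sales", "clickstream", "curated"},
                              ["clean_orders", "raw_weblogs"]))

curated = catalog.discover("curated")
upstream = catalog.lineage("purchase_journeys")
```

Recording `derived_from` at registration time is what lets a consumer trace a curated dataset back to its raw sources, which is one way to provide the level of trust the text refers to.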
¹² Provided we are able to identify him/her as the same customer that bought the product, which may or may not be possible.