

A transversal topic is deciding at which point in the analytical lifecycle of a big data repository to worry about improving
data quality. When this comes after datasets have been placed in the repository, after users have profiled a subset of the
data, and after a business opportunity has been detected that may use a well-defined subset of the repository, then we are
talking about data lakes, and we discuss this matter in our next section. At the other end of the spectrum, when data is
cleansed immediately after it has been placed in the repository (or is not stored in raw format, but in a format ready to be
processed), it is generally the case that the questions to be asked of the data are known, and the data is made to
comply with a structured schema managed at both the logical and physical level for optimal performance of these
questions/queries. We are in this case talking about a big data repository that behaves very much like a data
warehouse and for which the best practices from section 6.2 apply.
6.3.3.2 Data quality in a Data Lake
Data lakes were introduced in section 3.2.1 above. A data lake is not just a big data repository such as one built with
Hadoop HDFS: it has added capabilities such as dataset discovery, data quality, security and data lineage to deal with
data at scale. In a data lake, ALL data of ANY type, regardless of quality, that is related to a business process is stored
in raw form: it is not directly consumable. Data modeling, data cleansing and integration of data from several sources,
which are needed to make the data consumable, are done late, if at all: that is, only when the right business
opportunity has been identified and the questions that the analysis will be driving towards have become clear.
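To make this concrete, the following is a minimal sketch of such a late curation step, assuming a PySpark environment on top of HDFS. The paths, column names, target schema and quality rules are purely illustrative; in practice they would be dictated by the business opportunity and the analysis questions at hand.

    # Minimal late-curation sketch (assumed PySpark on HDFS; all names are illustrative).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

    spark = SparkSession.builder.appName("late-curation-sketch").getOrCreate()

    # Target schema, decided only once the analysis questions are known.
    target_schema = StructType([
        StructField("customer_id", StringType(), nullable=False),
        StructField("order_date", DateType(), nullable=False),
        StructField("amount", DoubleType(), nullable=True),
    ])

    # Raw, as-landed data: arbitrary columns, duplicates, inconsistent types.
    raw = spark.read.json("hdfs:///lake/raw/orders/")            # hypothetical location

    curated = (
        raw
        .dropDuplicates(["customer_id", "order_date"])           # remove duplicate records
        .filter(F.col("customer_id").isNotNull())                # enforce key completeness
        .withColumn("order_date", F.to_date("order_date"))       # normalize types
        .withColumn("amount", F.col("amount").cast("double"))
        .select([field.name for field in target_schema.fields])  # conform to the target schema
    )

    # Publish the curated dataset for downstream consumers.
    curated.write.mode("overwrite").parquet("hdfs:///lake/curated/orders/")

A real curation pipeline would in addition record lineage and quality metrics as metadata, so that consumers can later tell curated datasets apart from raw ones.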
Now, when this does occur, the quality of the data becomes a central concern: in order to get any value out of the data,
consumers need to make sure that the data has been checked and processed to improve its quality, so that they can
have confidence in the accuracy of the analytics results. Data from these (henceforth, curated) datasets is then made
to conform to one (or several) structured schemas in preparation for the anticipated analysis questions. Note that this
also means that, if data quality processes are not in place in the data lake, consumers might inadvertently consume
bad-quality data from the lake and derive wrong insights.
This motivates the following choices:
—
The liberal choice: data consumers should be able to tell by themselves the quality of the datasets they
discover before they explore them (through profiling or inspecting quality metadata – see below); a minimal
profiling sketch follows this list.
—
The cautious choice: business analysts should only be able to discover curated datasets (those already
processed for data quality). Data scientists are technical enough to assess and improve the quality of
datasets.
—
The traditional choice: the platform for most business users would not be the data lake, but a data mart or a
data warehouse that is fed with data from the lake. The lake in this case is used by data scientists and some
power business analysts to identify new insights on a collection of related datasets. The data mart can
be populated with curated datasets from the lake; alternatively, the data quality processes are applied by
ETL engines to selected, non-curated datasets feeding the data mart. This choice may be motivated by an
organization's experience with a given ETL tool, or with a BI tool that works only against the data warehouse.
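Returning to the liberal choice, profiling a discovered dataset can be as simple as computing per-column completeness and cardinality statistics before deciding whether to trust it. The sketch below assumes the same hypothetical PySpark environment; the dataset path and columns are again illustrative.

    # Minimal profiling sketch for the liberal choice (assumed PySpark; path is illustrative).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()
    df = spark.read.parquet("hdfs:///lake/raw/web_events/")   # hypothetical discovered dataset

    total = df.count()

    # Per-column null rate: a quick indicator of completeness.
    df.select([
        F.round(F.sum(F.col(c).isNull().cast("int")) / total, 3).alias(c + "_null_rate")
        for c in df.columns
    ]).show(truncate=False)

    # Per-column distinct counts: a rough indicator of key uniqueness and cardinality.
    df.select([F.countDistinct(c).alias(c + "_distinct") for c in df.columns]).show(truncate=False)

Quality metadata attached to curated datasets (for example, the date of the last cleansing run and the rules applied) serves the same purpose without requiring each consumer to re-profile the data.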