

A transversal topic is deciding at which point in the analytical lifecycle of a big data repository to worry about improving
data quality. When this comes after datasets have been placed in the repository, after users have profiled a subset of the
data, and after a business opportunity has been detected that may use a well-defined subset of the repository, then we are
talking about data lakes, and we discuss this matter in our next section. At the other end of the spectrum, when data is
cleansed immediately after it has been placed in the repository (or is not stored in raw format, but in a format ready to be
processed), it is generally the case that the questions to be asked of the data are known, and the data is made to
comply with a structured schema managed at both the logical and physical level for optimal performance of these
questions/queries. We are in this case talking about a big data repository that behaves very much like a data
warehouse and for which the best practices from section 6.2 apply.
6.3.3.2 Data quality in a Data Lake
Data lakes were introduced in section 3.2.1 above. A data lake is not just a big data repository such as one built with
Hadoop HDFS: it has added capabilities such as dataset discovery, data quality, security and data lineage to deal with
data at scale. In a data lake, ALL data of ANY type, regardless of quality, that is related to a business process is stored
in raw form: it is not directly consumable. Data modeling, data cleansing and integration of data from several sources,
which are needed to make the data consumable, are done late, if at all: that is, only when the right business
opportunity has been identified and the questions that the analysis will be driving towards have become clear.
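To make this concrete, the following is a minimal sketch of such a late curation step, assuming a PySpark environment on top of HDFS. The paths, column names, target schema and quality rules are purely illustrative; in practice they would be dictated by the business opportunity and the analysis questions at hand.

    # Minimal late-curation sketch (assumed PySpark on HDFS; all names are illustrative).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

    spark = SparkSession.builder.appName("late-curation-sketch").getOrCreate()

    # Target schema, decided only once the analysis questions are known.
    target_schema = StructType([
        StructField("customer_id", StringType(), nullable=False),
        StructField("order_date", DateType(), nullable=False),
        StructField("amount", DoubleType(), nullable=True),
    ])

    # Raw, as-landed data: arbitrary columns, duplicates, inconsistent types.
    raw = spark.read.json("hdfs:///lake/raw/orders/")            # hypothetical location

    curated = (
        raw
        .dropDuplicates(["customer_id", "order_date"])           # remove duplicate records
        .filter(F.col("customer_id").isNotNull())                # enforce key completeness
        .withColumn("order_date", F.to_date("order_date"))       # normalize types
        .withColumn("amount", F.col("amount").cast("double"))
        .select([field.name for field in target_schema.fields])  # conform to the target schema
    )

    # Publish the curated dataset for downstream consumers.
    curated.write.mode("overwrite").parquet("hdfs:///lake/curated/orders/")

A real curation pipeline would in addition record lineage and quality metrics as metadata, so that consumers can later tell curated datasets apart from raw ones.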
Now, when this does occur, the quality of the data becomes a central concern: in order to get any value out of the data,
consumers need to make sure that the data has been checked and processed to improve its quality, so that they can
have confidence in the accuracy of the analytics results. Data from these (henceforth, curated) datasets is then made
to conform to one (or several) structured schemas in preparation for the anticipated analysis questions. Note that this
also means that, if data quality processes are not in place in the data lake, consumers might inadvertently consume
bad-quality data from the lake and derive wrong insights.
This motivates the following choices:
—
The liberal choice: data consumers should be able to tell by themselves the quality of the datasets they
discover before they explore them (through profiling or inspecting quality metadata – see below); a minimal
profiling sketch follows this list.
—
The cautious choice: business analysts should only be able to discover curated datasets (those already
processed for data quality). Data scientists are technical enough to assess and improve the quality of
datasets.
—
The traditional choice: the platform for most business users would not be the data lake, but a data mart or a
data warehouse that is fed with data from the lake. The lake in this case is used by data scientists and some
power business analysts to identify new insights on a collection of related datasets. The data mart can
be populated with curated datasets from the lake; alternatively, the data quality processes are applied by
ETL engines to selected, non-curated datasets feeding the data mart. This choice may be motivated by an
organization's experience with a given ETL tool, or with a BI tool that works only against the data warehouse.
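Returning to the liberal choice, profiling a discovered dataset can be as simple as computing per-column completeness and cardinality statistics before deciding whether to trust it. The sketch below assumes the same hypothetical PySpark environment; the dataset path and columns are again illustrative.

    # Minimal profiling sketch for the liberal choice (assumed PySpark; path is illustrative).
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("profiling-sketch").getOrCreate()
    df = spark.read.parquet("hdfs:///lake/raw/web_events/")   # hypothetical discovered dataset

    total = df.count()

    # Per-column null rate: a quick indicator of completeness.
    df.select([
        F.round(F.sum(F.col(c).isNull().cast("int")) / total, 3).alias(c + "_null_rate")
        for c in df.columns
    ]).show(truncate=False)

    # Per-column distinct counts: a rough indicator of key uniqueness and cardinality.
    df.select([F.countDistinct(c).alias(c + "_distinct") for c in df.columns]).show(truncate=False)

Quality metadata attached to curated datasets (for example, the date of the last cleansing run and the rules applied) serves the same purpose without requiring each consumer to re-profile the data.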