

All datasets in the lake should have an associated "quality metadata" tag, set by default to an uncertain value.
The data custodian (and eventually the data scientists) can assess the content and the quality of the data by using
profiling tools or machine learning algorithms, and adjust the value of the quality tag accordingly. He/she can then
meet with data producers to agree on standardization of data elements, as well as the minimal business validation
rules that need to be put in place for the data to be deemed consumable.
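For illustration only, a minimal Python sketch of such a quality metadata tag could look as follows; the catalog structure, tag values and function names are assumptions made for the example, not something prescribed in this paper.

from dataclasses import dataclass
from enum import Enum

class Quality(Enum):
    UNCERTAIN = "uncertain"    # default for newly ingested datasets
    ACCEPTABLE = "acceptable"
    CURATED = "curated"

@dataclass
class DatasetEntry:
    name: str
    location: str
    quality: Quality = Quality.UNCERTAIN   # set by default to an uncertain value

catalog: dict[str, DatasetEntry] = {}

def register(name: str, location: str) -> DatasetEntry:
    # Register a newly ingested dataset with the default quality tag.
    entry = DatasetEntry(name, location)
    catalog[name] = entry
    return entry

def set_quality(name: str, quality: Quality) -> None:
    # The data custodian adjusts the tag after profiling or validation.
    catalog[name].quality = quality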
Once this is done, depending on the nature of the application (see the above section), the data custodian can define
the appropriate data quality processes. For applications working with transactional data in the lake, the process would be as follows: (i) the custodian or the data scientist defines the transformation flows needed to cleanse the data directly within the lake, so that it meets the agreed data standardizations and the necessary business rules; (ii) s/he then defines the validation and cleansing rules; and (iii) the data matching and consolidation rules to remove duplicates and improve data accuracy in the lake. But then again, as mentioned in the previous section, clickstream
analysis applications, fraud detection applications or applications based on sensor or social data may call for different
data quality processes.
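As an illustration of the three steps above for transactional data, the following pandas sketch cleanses, validates and consolidates a hypothetical orders dataset; the column names, rules and business key are assumed for the example and would in practice come from the agreement with the data producers.

import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    # (i) transformations to meet the agreed data standardizations
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()
    out["country"] = out["country"].str.upper()
    return out

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # (ii) minimal business validation rules; failing rows are flagged, not dropped
    out = df.copy()
    out["rule_amount_positive"] = out["amount"] > 0
    out["rule_email_present"] = out["email"].notna()
    return out

def consolidate(df: pd.DataFrame) -> pd.DataFrame:
    # (iii) exact-match deduplication on a business key; real matching and
    # consolidation would typically rely on fuzzy or probabilistic record linkage
    return df.sort_values("updated_at").drop_duplicates(
        subset=["customer_id", "order_id"], keep="last")

def transactional_quality_flow(df: pd.DataFrame) -> pd.DataFrame:
    return consolidate(validate(cleanse(df)))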
As with the selection of features for a machine learning task, custodians or data scientists preparing the data may be surprised by the pervasiveness of profiling tasks during the data cleansing stage when working on big data. This is more the case than in traditional data cleansing for populating data warehouses, perhaps because the availability of all the data in the lake prompts more interactive workloads. Also, these days, data preparation on big data sets, including data quality, is becoming possible for less technically skilled people, for instance business analysts: we will look at this topic in section 6.3.4 below.

As datasets become curated, the value of the quality tags can be raised. Curated datasets may get continually refreshed from their original source, so it is appropriate to apply their corresponding validation rules to these datasets, monitoring the results of the validation rules over time and adjusting the quality metadata tags accordingly.
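One possible sketch of such monitoring, reusing the boolean rule columns produced during validation, is shown below; the thresholds and tag values are purely illustrative assumptions.

import pandas as pd

def pass_rate(df: pd.DataFrame, rule_columns: list[str]) -> float:
    # Fraction of rows passing all validation rules for the current refresh.
    return float(df[rule_columns].all(axis=1).mean())

def adjust_quality_tag(history: list[float]) -> str:
    # Map the most recent validation pass rates to a quality tag.
    if not history:
        return "uncertain"
    recent = history[-3:]
    average = sum(recent) / len(recent)
    if average >= 0.98:
        return "curated"
    if average >= 0.90:
        return "acceptable"
    return "uncertain"

# After each refresh of a curated dataset:
# history.append(pass_rate(refreshed_df, ["rule_amount_positive", "rule_email_present"]))
# new_tag = adjust_quality_tag(history)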
6.3.3.3 Data mining & machine learning techniques relevant to quality on big data
Besides being used to solve a wide variety of business problems such as image recognition, product recommendation, and online advertising, machine learning allows very raw forms of data to be used to build a higher-level understanding of the data automatically, which is essential for improving data quality. The following is a very quick summary, by no means exhaustive, of the areas where machine learning has been applied successfully to this end.
Data dependency profiling. Learning dependencies among table columns directly from the data, such as functional and inclusion dependencies among columns, as well as correlations and soft functional dependencies [11], is a popular aspect of data profiling. The general complexity of the problem is O(2^m) for m columns, but algorithms have been maturing for over 20 years now, and new algorithms [12] can realistically be applied to datasets with a few million rows and tens of attributes.
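As a toy illustration only, and not the algorithms of [11] or [12], the following sketch checks single-column functional dependencies directly from the data with pandas; discovery over composite left-hand sides is where the exponential complexity comes from, so the brute-force pairwise search below only scales to modest numbers of columns.

import pandas as pd
from itertools import permutations

def holds_fd(df: pd.DataFrame, lhs: str, rhs: str) -> bool:
    # The dependency lhs -> rhs holds if every lhs value maps to exactly one rhs value.
    return bool((df.groupby(lhs)[rhs].nunique(dropna=False) <= 1).all())

def discover_pairwise_fds(df: pd.DataFrame) -> list[tuple[str, str]]:
    # Brute-force search over single-column dependencies: O(m^2) candidate pairs.
    return [(a, b) for a, b in permutations(df.columns, 2) if holds_fd(df, a, b)]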
Missing values.
One important aspect of data quality is the proportion of missing data values. Many data mining
algorithms handle missing data in a rather ad-hoc way, or simply ignore the problem. Until relatively recently, the
only methods widely available for analyzing incomplete data focused on removing the missing values, either by
ignoring subjects with incomplete information or by substituting plausible values (e.g. mean values) for the missing
items.
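The following small sketch contrasts these two traditional treatments, listwise deletion and mean substitution, on a toy pandas DataFrame with assumed column names.

import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 29, 41, None],
    "income": [52_000, 61_000, None, 75_000, 48_000],
})

# Share of missing values per column, one simple quality indicator
missing_share = df.isna().mean()

# Listwise deletion: ignore subjects with incomplete information
complete_cases = df.dropna()

# Mean imputation: substitute a plausible value for the missing items
imputed = df.fillna(df.mean(numeric_only=True))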