

All datasets in the lake should have an associated "quality metadata" tag, set by default to an uncertain value.
The data custodian (and eventually the data scientists) can assess the content and the quality of the data by using
profiling tools or machine learning algorithms, and adjust the value of the quality tag accordingly. He/she can then
meet with data producers to agree on standardization of data elements, as well as the minimal business validation
rules that need to be put in place for the data to be deemed consumable.
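For illustration only, a minimal Python sketch of such a quality metadata tag could look as follows; the catalog structure, tag values and function names are assumptions made for the example, not something prescribed in this paper.

from dataclasses import dataclass
from enum import Enum

class Quality(Enum):
    UNCERTAIN = "uncertain"    # default for newly ingested datasets
    ACCEPTABLE = "acceptable"
    CURATED = "curated"

@dataclass
class DatasetEntry:
    name: str
    location: str
    quality: Quality = Quality.UNCERTAIN   # set by default to an uncertain value

catalog: dict[str, DatasetEntry] = {}

def register(name: str, location: str) -> DatasetEntry:
    # Register a newly ingested dataset with the default quality tag.
    entry = DatasetEntry(name, location)
    catalog[name] = entry
    return entry

def set_quality(name: str, quality: Quality) -> None:
    # The data custodian adjusts the tag after profiling or validation.
    catalog[name].quality = quality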
Once this is done, depending on the nature of the application (see the above section), the data custodian can define
the appropriate data quality processes. For applications working with transactional data in the lake, the process would be as follows: (i) the custodian or the data scientist defines the transformation flows needed to cleanse the data directly within the lake, so that it meets the agreed data standardizations and the necessary business rules; (ii) s/he then defines the validation and cleansing rules; and (iii) the data matching and consolidation rules to remove duplicates and improve data accuracy in the lake. But then again, as mentioned in the previous section, clickstream
analysis applications, fraud detection applications or applications based on sensor or social data may call for different
data quality processes.
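As an illustration of the three steps above for transactional data, the following pandas sketch cleanses, validates and consolidates a hypothetical orders dataset; the column names, rules and business key are assumed for the example and would in practice come from the agreement with the data producers.

import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    # (i) transformations to meet the agreed data standardizations
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()
    out["country"] = out["country"].str.upper()
    return out

def validate(df: pd.DataFrame) -> pd.DataFrame:
    # (ii) minimal business validation rules; failing rows are flagged, not dropped
    out = df.copy()
    out["rule_amount_positive"] = out["amount"] > 0
    out["rule_email_present"] = out["email"].notna()
    return out

def consolidate(df: pd.DataFrame) -> pd.DataFrame:
    # (iii) exact-match deduplication on a business key; real matching and
    # consolidation would typically rely on fuzzy or probabilistic record linkage
    return df.sort_values("updated_at").drop_duplicates(
        subset=["customer_id", "order_id"], keep="last")

def transactional_quality_flow(df: pd.DataFrame) -> pd.DataFrame:
    return consolidate(validate(cleanse(df)))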
As with the selection of features for a machine learning task, custodians or data scientists preparing the data may be surprised by the pervasiveness of profiling tasks during the data cleansing stage when working on big data. This is more the case than in traditional data cleansing for populating data warehouses, perhaps because the availability of all the data in the lake prompts more interactive workloads. Also, these days, data preparation on big data sets, including data quality, is becoming possible for less technically skilled people, for instance business analysts: we will look at this topic in section 6.3.4 below.

As datasets become curated, the value of the quality tags can be raised. Curated datasets may get continually refreshed from their original source, so it is appropriate to apply their corresponding validation rules to these datasets, monitoring the results of the validation rules over time and adjusting the quality metadata tags accordingly.
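One possible sketch of such monitoring, reusing the boolean rule columns produced during validation, is shown below; the thresholds and tag values are purely illustrative assumptions.

import pandas as pd

def pass_rate(df: pd.DataFrame, rule_columns: list[str]) -> float:
    # Fraction of rows passing all validation rules for the current refresh.
    return float(df[rule_columns].all(axis=1).mean())

def adjust_quality_tag(history: list[float]) -> str:
    # Map the most recent validation pass rates to a quality tag.
    if not history:
        return "uncertain"
    recent = history[-3:]
    average = sum(recent) / len(recent)
    if average >= 0.98:
        return "curated"
    if average >= 0.90:
        return "acceptable"
    return "uncertain"

# After each refresh of a curated dataset:
# history.append(pass_rate(refreshed_df, ["rule_amount_positive", "rule_email_present"]))
# new_tag = adjust_quality_tag(history)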
6.3.3.3 Data mining & machine learning techniques relevant to quality on big data
Besides being used to solve a wide variety of business problems such as image recognition, product recommendation, and online advertising, machine learning allows very raw forms of data to be used to build a higher-level understanding of the data automatically, which is essential for improving data quality. The following is a very quick summary, by no means exhaustive, of the areas where machine learning has been applied successfully to this end.
Data dependency profiling. Learning dependencies among table columns directly from the data, such as functional and inclusion dependencies among columns, as well as correlations and soft functional dependencies [11], is a popular aspect of data profiling. The general complexity of the problem is O(2^m) for m columns, but algorithms have been maturing for over 20 years now, and new algorithms [12] can realistically be applied to datasets with a few million rows and tens of attributes.
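As a toy illustration only, and not the algorithms of [11] or [12], the following sketch checks single-column functional dependencies directly from the data with pandas; discovery over composite left-hand sides is where the exponential complexity comes from, so the brute-force pairwise search below only scales to modest numbers of columns.

import pandas as pd
from itertools import permutations

def holds_fd(df: pd.DataFrame, lhs: str, rhs: str) -> bool:
    # The dependency lhs -> rhs holds if every lhs value maps to exactly one rhs value.
    return bool((df.groupby(lhs)[rhs].nunique(dropna=False) <= 1).all())

def discover_pairwise_fds(df: pd.DataFrame) -> list[tuple[str, str]]:
    # Brute-force search over single-column dependencies: O(m^2) candidate pairs.
    return [(a, b) for a, b in permutations(df.columns, 2) if holds_fd(df, a, b)]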
Missing values.
One important aspect of data quality is the proportion of missing data values. Many data mining
algorithms handle missing data in a rather ad-hoc way, or simply ignore the problem. Until relatively recently, the
only methods widely available for analyzing incomplete data focused on removing the missing values, either by
ignoring subjects with incomplete information or by substituting plausible values (e.g. mean values) for the missing
items.
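The following small sketch contrasts these two traditional treatments, listwise deletion and mean substitution, on a toy pandas DataFrame with assumed column names.

import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 29, 41, None],
    "income": [52_000, 61_000, None, 75_000, 48_000],
})

# Share of missing values per column, one simple quality indicator
missing_share = df.isna().mean()

# Listwise deletion: ignore subjects with incomplete information
complete_cases = df.dropna()

# Mean imputation: substitute a plausible value for the missing items
imputed = df.fillna(df.mean(numeric_only=True))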