


Matching / entity resolution.
Traditional research and practice from the database field emphasizes relatively simple and fast duplicate detection techniques that can be applied to databases with millions of records. Such techniques²⁵ typically rely on domain knowledge (people names, addresses, organization names) or on generic distance metrics to match records, and emphasize efficiency over effectiveness. There is another set of techniques, coming from research in AI and statistics, that aims to develop more sophisticated matching techniques relying on probabilistic models. The development of new classification techniques in the machine learning community prompted the development of supervised learning systems for entity matching. However, this requires a large number of training examples. While it is easy to create a large number of training pairs that are either clearly non-duplicates or clearly duplicates, it is very difficult to generate the ambiguous cases that would help create a highly accurate classifier. Based on this observation, some duplicate detection systems used active learning²⁶ techniques to automatically locate such ambiguous pairs [16].
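To make both ideas concrete, here is a minimal sketch, not taken from the paper, of distance-based pair matching combined with an active-learning-style selection of ambiguous pairs. Python's standard difflib similarity stands in for a real string metric such as Jaro-Winkler, and the records, field names and thresholds are all illustrative assumptions.

```python
# Sketch: distance-based record matching with active-learning-style
# selection of ambiguous pairs. Thresholds, field names and records
# are illustrative assumptions, not values from the paper.
from difflib import SequenceMatcher
from itertools import combinations

def field_similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1] (stand-in for Jaro-Winkler etc.)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_score(r1: dict, r2: dict, fields=("name", "address")) -> float:
    """Average per-field similarity over the compared fields."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)

records = [
    {"id": 1, "name": "John A. Smith", "address": "12 Main Street"},
    {"id": 2, "name": "Jon Smith",     "address": "12 Main St."},
    {"id": 3, "name": "Mary Jones",    "address": "5 Oak Avenue"},
]

MATCH, NON_MATCH = 0.85, 0.45   # assumed decision thresholds
duplicates, ambiguous = [], []
for r1, r2 in combinations(records, 2):
    s = pair_score(r1, r2)
    if s >= MATCH:
        duplicates.append((r1["id"], r2["id"], s))
    elif s > NON_MATCH:
        # Too close to call: exactly the pairs an active learner would
        # send to a human labeler, since labeling them is most informative.
        ambiguous.append((r1["id"], r2["id"], s))

print("duplicates:", duplicates)
print("ask a human to label:", ambiguous)
```

In a real active learner the ambiguous band would come from classifier uncertainty rather than fixed thresholds, and the newly labeled pairs would be fed back into training, but the selection principle is the same.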

Understanding meaning in text and images.
A large number of ML techniques now exist for determining the semantics and meaning of textual data, as well as of images and video. Leveraging these tools and techniques is key to deriving value from unstructured big data for a number of different tasks (text or image classification, entity extraction from text or images, degree of correlation of textual or image content, etc.). Each of these tasks has techniques that are preferred by experts in the field: Naïve Bayes for text classification, neural networks or SVMs for image classification, although experts warn against treating such preferences as hard rules.
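As one illustration of the text-classification case, here is a minimal sketch of Naïve Bayes classification, assuming scikit-learn as the library and an invented toy corpus; the paper itself does not prescribe any implementation.

```python
# Sketch: Naive Bayes text classification with scikit-learn.
# The tiny training corpus and labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "invoice payment due next month",
    "quarterly revenue and profit report",
    "team offsite and holiday party photos",
    "birthday celebration in the office",
]
train_labels = ["finance", "finance", "social", "social"]

# TF-IDF features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["please approve the payment invoice"]))  # -> ['finance']
```

The same pipeline scales unchanged to realistic corpora, which is part of why Naïve Bayes remains a common expert default for text classification.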

6.3.4 Self-Service Tools

As already mentioned in section 3.3, self-service data preparation tools provide many capabilities that go beyond those of most data discovery tools for discovering and combining data and improving its quality. Their main strong points related to data quality improvement are the following:

- Intelligent profiling powered by data mining and machine learning algorithms that visually highlight the structure, distribution, anomalies and repetitive patterns in data (a minimal profiling sketch follows this list).

- Data-driven, interactive user experiences that lower the technical skill barrier for most business users. These systems provide automatic suggestions for cleansing, enriching and matching duplicate records, and for the associated survivorship procedure²⁷ (see the survivorship sketch after this list). They show the results of these operations interactively through dashboards and drill-down capabilities, and allow operations to be undone and redone interactively as well, to make it easier to tune the parameters controlling data quality operations. This is a real change from the technical user experience where, to run for instance matching and survivorship, users must define configuration rules with little or no feedback from example data, define and run jobs that execute asynchronously, and manually inspect the results through data browsing or SQL query execution.
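To illustrate the intelligent-profiling point, here is a minimal sketch, assuming pandas, of the kind of value-pattern profiling such tools automate; the column name and records are invented.

```python
# Sketch: simple value-pattern profiling of a column, in the spirit of
# the "repetitive patterns" highlighted by self-service tools.
# The data and column name are invented for illustration.
import pandas as pd

def value_pattern(v) -> str:
    """Abstract each character: letter -> 'A', digit -> '9', other kept."""
    if pd.isna(v):
        return "<null>"
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c
                   for c in str(v))

df = pd.DataFrame({"phone": ["555-1234", "555-9876", "5551234", None, "call me"]})

profile = df["phone"].map(value_pattern).value_counts()
print(profile)
# The most frequent pattern (999-9999) reveals the column's structure;
# rare patterns ('call me', nulls) are the anomalies a profiling UI
# would visually flag.
```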
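And for the survivorship procedure in the second point, here is a minimal sketch combining the two frequent rules cited in footnote 27, most recent record wins and one source is preferred over another; the sources, dates and priority order are assumptions.

```python
# Sketch: survivorship over a match set, combining the two frequent
# rules from footnote 27: take the most recent record, and prefer
# records from one source over another. All data is illustrative.
from datetime import date

SOURCE_PRIORITY = {"CRM": 0, "legacy": 1}   # assumed: CRM beats legacy

match_set = [
    {"name": "Jon Smith",     "source": "legacy", "updated": date(2016, 3, 1)},
    {"name": "John A. Smith", "source": "CRM",    "updated": date(2016, 3, 1)},
    {"name": "J. Smith",      "source": "legacy", "updated": date(2015, 7, 9)},
]

def survivor(records):
    """Most recent record wins; source priority breaks timestamp ties."""
    return max(records,
               key=lambda r: (r["updated"], -SOURCE_PRIORITY[r["source"]]))

print(survivor(match_set))  # -> the CRM record from 2016-03-01
```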
²⁵ Rule-based approaches can be considered distance-based techniques, where the distance between two records is either 0 or 1.

²⁶ Unlike an “ordinary” learner that is trained using a static training set, an “active” learner actively picks subsets of instances from unlabeled data which, when labeled, will provide the highest information gain to the learner.

²⁷ For instance, the tool may suggest running deduplication on datasets where it has not been run. When deduplication is selected, the tool incorporates in its workflow a suggestion to perform cleansing first if it has not been done, as cleansing makes duplicates easier to eliminate. Survivorship rules also take into account the fields present in the datasets to suggest frequent rules using those fields: take the most recent record in the match set, or take the record from source A as opposed to source B.