

© 2017 Persistent Systems Ltd. All rights reserved.
²⁵ Rule-based approaches can be considered distance-based techniques in which the distance between two records is either 0 or 1.
²⁶ Unlike an “ordinary” learner that is trained using a static training set, an “active” learner actively picks subsets of instances from
unlabeled data which, when labeled, will provide the highest information gain to the learner.
²⁷ For instance, the tool may suggest running deduplication on datasets where it has not yet been run. When deduplication is selected,
the tool adds to its workflow a suggestion to perform cleansing if that has not been done, as cleansing makes duplicates easier to
eliminate. Also, survivorship rules take into account which fields exist in the datasets in order to suggest frequently used rules that
rely on those fields: take the most recent record in the match set, or take the record from source A as opposed to source B.
Matching / entity resolution.
Traditional research and practice in the database field emphasize relatively simple and fast duplicate detection
techniques that can be applied to databases with millions of records. Such techniques typically rely on domain
knowledge (people names, addresses, organization names) or on generic distance metrics²⁵ to match records, and
emphasize efficiency over effectiveness. Another set of techniques, coming from research in AI and statistics, aims to
develop more sophisticated matching methods that rely on probabilistic models. The development of new classification
techniques in the machine learning community prompted the development of supervised learning systems for entity
matching. However, these systems require a large number of training examples. While it is easy to create a large
number of training pairs that are either clearly non-duplicates or clearly duplicates, it is very difficult to generate
ambiguous cases that would help create a highly accurate classifier. Based on this observation, some duplicate
detection systems have used active learning²⁶ techniques to automatically locate such ambiguous pairs [16].
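As a rough illustration of the ideas above, the following Python sketch contrasts a rule-based match (a 0/1 distance) with a generic string similarity, and then sorts candidate pairs into clear duplicates, clear non-duplicates, and ambiguous pairs that would be worth sending to a human for labeling. The record fields, the thresholds (0.5 and 0.92), and the function names are assumptions made for this example. A real active learner would select pairs using the uncertainty of a trained classifier rather than a fixed similarity band.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rule_distance(a: dict, b: dict) -> int:
    """Rule-based match expressed as a 0/1 distance (0 = rule says same entity)."""
    return 0 if a["email"].lower() == b["email"].lower() else 1


def triage(records, low=0.5, high=0.92):
    """Label each candidate pair by comparing name similarity to two thresholds.

    The thresholds are illustrative assumptions, not recommended values.
    """
    labels = {}
    for a, b in combinations(records, 2):
        score = similarity(a["name"], b["name"])
        if score >= high:
            labels[(a["id"], b["id"])] = "duplicate"
        elif score <= low:
            labels[(a["id"], b["id"])] = "non-duplicate"
        else:
            # Neither clearly equal nor clearly different: ask a human to label.
            labels[(a["id"], b["id"])] = "ambiguous"
    return labels


# Hypothetical sample data.
records = [
    {"id": 1, "name": "John Smith", "email": "jsmith@example.com"},
    {"id": 2, "name": "Jon Smith", "email": "JSmith@example.com"},
    {"id": 3, "name": "John Smyth", "email": "john.smyth@example.com"},
    {"id": 4, "name": "Maria Garcia", "email": "mgarcia@example.com"},
]

pair_labels = triage(records)
```

Only the pairs labeled "ambiguous" would be passed to a reviewer, which keeps the labeling effort focused on the cases that a classifier would find hardest.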
Understanding meaning in text and images.
A large number of ML techniques now exist for determining the semantics and meaning of textual data, as well as of
images and video. Leveraging these tools and techniques is key to deriving value from unstructured big data for a
number of different tasks (text or image classification, entity extraction from text or images, degree of correlation
of textual or image content, etc.). Each of these tasks has techniques that are preferred by experts in the field, such
as Naïve Bayes for text classification and neural networks or SVMs for image classification, even though experts warn
against treating such preferences as hard rules.
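To make the text classification case concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The class name, the whitespace tokenization, and the tiny training set are assumptions chosen for illustration. A production system would normally rely on an established ML library and far more training data.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data.
train_docs = [
    "invoice payment overdue",
    "payment received for invoice",
    "server outage reported",
    "database server restarted",
]
train_labels = ["finance", "finance", "it", "it"]
model = NaiveBayesText().fit(train_docs, train_labels)
```

Calling `model.predict("overdue invoice")` returns the label whose word statistics best explain the input, here `"finance"`.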
6.3.4 Self-Service Tools
As already mentioned in section 3.3, self-service data preparation tools provide many capabilities that go beyond
those of most data discovery tools for discovering and combining data and improving its quality. Their main strengths
related to data quality improvement are the following:
—
Intelligent profiling powered by data mining and machine learning algorithms that visually highlight the
structure, distribution, anomalies and repetitive patterns in data.
—
Data-driven, interactive user experiences that lower the technical skill barrier for most business users. These
systems provide automatic suggestions for cleansing, enriching, and matching duplicate records, together with the
associated survivorship procedure²⁷. They show the results of these operations interactively through
dashboards and drill-down capabilities, and allow operations to be undone and redone interactively as well,
which makes it easier to tune the parameters controlling data quality operations. This is a real change from the
technical user experience in which, to run matching and survivorship for instance, users must define
configuration rules with little or no feedback from example data, then define and run jobs that execute
asynchronously, and finally inspect the results manually through data browsing or SQL queries.
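The two survivorship rules mentioned in the footnote on suggestions (take the most recent record in the match set, or prefer the record from source A over source B) can be sketched as follows. The field names `source` and `updated`, the sample records, and the function names are assumptions made for this example and do not describe any particular tool.

```python
from datetime import date


def most_recent(match_set):
    """Survivorship rule: keep the record with the latest update date."""
    return max(match_set, key=lambda r: r["updated"])


def prefer_source(match_set, priority):
    """Survivorship rule: keep the record from the highest-priority source.

    Sources missing from the priority list are ranked last.
    """
    rank = {source: i for i, source in enumerate(priority)}
    return min(match_set, key=lambda r: rank.get(r["source"], len(rank)))


# Hypothetical match set: two records judged to describe the same entity.
match_set = [
    {"id": 1, "source": "B", "updated": date(2016, 5, 1), "phone": "555-0100"},
    {"id": 2, "source": "A", "updated": date(2015, 1, 9), "phone": "555-0199"},
]

survivor_by_date = most_recent(match_set)
survivor_by_source = prefer_source(match_set, priority=["A", "B"])
```

The two rules pick different survivors for this match set, which is why interactive feedback on example data helps when choosing between them.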