

These ad hoc methods, though simple to implement, have serious drawbacks: deletion of the offending records
may bias the results if the subjects who provide complete data are unrepresentative of the entire sample, and
simple mean substitution may seriously dampen relationships among variables (columns). A robust method, multiple imputation (MI) [20], has gained traction among statisticians. In this method, each missing value is replaced by a set of m > 1 plausible values drawn from their predictive distribution. The variation among the m imputations reflects the uncertainty with which the missing values can be predicted from the observed ones. After performing MI there are m apparently complete datasets, each of which can be analyzed by complete-data methods. After performing identical analyses on each of the m datasets, the results (estimates and standard errors) are combined using simple rules to produce overall estimates and standard errors that reflect missing-data uncertainty. It turns out that in many applications, just 3–5 imputations are sufficient to obtain excellent results. General-purpose MI software for incomplete multivariate data is now available; the first four fully developed packages were NORM, which performs multiple imputation under a multivariate normal model; CAT, for multivariate categorical data; MIX, for mixed datasets containing both continuous and categorical variables; and PAN, for multivariate panel or clustered data. http://www.stefvanbuuren.nl/mi/Software.html is a good resource for MI software.
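To make the combining step concrete, here is a minimal sketch in Python (not taken from any of the packages above) that imputes a toy dataset m = 5 times and pools the per-dataset estimates with the commonly used Rubin's rules, where the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. scikit-learn's IterativeImputer stands in for a NORM-style imputation model; the data and the analysis (estimating a column mean) are illustrative assumptions.

    # Minimal multiple-imputation sketch with Rubin's pooling rules.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)
    X[rng.random(200) < 0.2, 0] = np.nan           # 20% missing in column 0

    m = 5                                          # 3-5 imputations usually suffice
    estimates, variances = [], []
    for k in range(m):
        imputer = IterativeImputer(sample_posterior=True, random_state=k)
        completed = imputer.fit_transform(X)       # one "apparently complete" dataset
        col = completed[:, 0]
        estimates.append(col.mean())               # complete-data estimate
        variances.append(col.var(ddof=1) / len(col))  # its squared standard error

    # Rubin's rules: pooled estimate, within- and between-imputation variance.
    q_bar = np.mean(estimates)
    W = np.mean(variances)
    B = np.var(estimates, ddof=1)
    T = W + (1 + 1 / m) * B                        # total variance
    print(f"pooled mean = {q_bar:.3f}, pooled SE = {np.sqrt(T):.3f}")

Note how the pooled standard error exceeds the naive complete-data one whenever the m estimates disagree; that extra spread is exactly the missing-data uncertainty described above.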
From integrity constraints to unsupervised anomaly detection. In traditional (“small data”) systems, integrity
constraints played a major role in encapsulating the knowledge of a given domain. To a certain degree, this is a
successful approach for ensuring the quality of data in a domain when the latter is well understood and static. In the
big data realm, however, big variety introduces so much variability in the data that this approach no longer works
well: it may be the case that subsets of seemingly noncompliant data are actually legitimate usable data: they are
anomalies, but may be acceptable ones.
Recent investigations have reported using machine learning to detect these anomalies, i.e., these integrity constraint violations, from the data. In a first pass, traditional constraints such as functional and inclusion
dependencies are declared (or they can also be learned from the data, as mentioned above), and in a second pass
those violations are re-examined through special unsupervised techniques [14], [15]. A human domain expert can then look at candidate explanations for these supposed glitches and decide to incorporate them as new constraints, or to cleanse the offending data away. For instance, [15] gives an HR database example where, upon a violation of the uniqueness constraint on phone numbers among employees, the learner reveals that the faulty
records correspond to employees with “new hire” status and have been given the same phone number as their
supervisor. After analysis, the HR user ends up accepting this revised constraint (e.g., because s/he knows that
“new hire” is a transient status), which becomes a conditional functional dependency.
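A hypothetical sketch of this two-pass flow is below; the toy table, column names, and the single-attribute explanation search are illustrative assumptions, not the actual algorithms of [14] or [15]. Pass 1 flags records that violate the declared uniqueness constraint on phone numbers; pass 2 looks for an attribute value that covers most of the violating records and can be proposed to the expert as a candidate conditional constraint.

    import pandas as pd

    # Toy HR table: two new hires share their supervisor's phone number.
    employees = pd.DataFrame({
        "emp_id":     [1, 2, 3, 4, 5],
        "status":     ["regular", "new hire", "new hire", "regular", "regular"],
        "supervisor": [None, 4, 4, None, None],
        "phone":      ["555-0101", "555-0199", "555-0199", "555-0199", "555-0102"],
    })

    # Pass 1: records violating the declared uniqueness constraint on phone.
    violations = employees[employees.duplicated("phone", keep=False)]

    # Pass 2: does some attribute value cover most of the violating records?
    THRESHOLD = 0.6  # arbitrary cutoff for this sketch
    for col in ("status",):
        for value, share in violations[col].value_counts(normalize=True).items():
            if share >= THRESHOLD:
                print(f"candidate: phone uniqueness may be relaxed when {col} == {value!r}")

On this toy table the search surfaces status == 'new hire', mirroring the conditional functional dependency the HR user ends up accepting.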
Supervised anomaly detection.
Assessment of the quality of sensor data is an example of a domain where unsupervised outlier detection methods such as the ones described above do not work well in general. The quality of sensor data is indicated by discrete quality flags (normally assigned by domain experts) that express the level of uncertainty associated with a sensor reading, and the acceptable level of uncertainty differs from problem to problem. Supervised classification is thus a feasible alternative, even though it faces a class-imbalance problem: data of dubious quality occurs with very small frequency in a representative set of labelled data. Proposals such as [13] address this through cluster-oriented sampling and by training multiple classifiers to improve the overall classification accuracy.
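A hedged sketch of this idea, loosely in the spirit of [13], follows; the synthetic readings, cluster count, threshold, and classifier choices are all assumptions of this example. The abundant "good" readings are clustered, a few readings are sampled from each cluster so the rare dubious class is not swamped, and several classifiers vote on the final quality flag.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    good = rng.normal(0.0, 1.0, size=(2000, 3))   # abundant "good" readings
    bad = rng.normal(3.0, 1.0, size=(40, 3))      # rare dubious readings

    # Cluster-oriented sampling: draw a few good readings from each cluster so
    # the downsampled majority class still covers its regions of feature space.
    k = 10
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(good)
    sampled = np.vstack([good[clusters == c][:len(bad) // k + 1] for c in range(k)])

    X = np.vstack([sampled, bad])
    y = np.array([0] * len(sampled) + [1] * len(bad))

    # Multiple classifiers combined by majority vote.
    ensemble = VotingClassifier([
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ])
    ensemble.fit(X, y)
    print(ensemble.predict([[0.1, -0.2, 0.3], [3.1, 2.8, 3.0]]))  # expect [0 1]

Sampling per cluster, rather than uniformly, keeps the rebalanced training set representative of the majority class, which is what lets the ensemble learn the rare class without discarding the structure of the common one.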