

These ad hoc methods, though simple to implement, have serious drawbacks: deletion of the offending records may bias the results if the subjects who provide complete data are unrepresentative of the entire sample, and simple mean substitution may seriously dampen relationships among variables (columns). A robust method, multiple imputation (MI) [20], has gained traction among statisticians. In this method, each missing value is replaced by a set of m > 1 plausible values drawn from their predictive distribution. The variation among the m imputations reflects the uncertainty with which the missing values can be predicted from the observed ones. After performing MI there are m apparently complete datasets, each of which can be analyzed by complete-data methods. After performing identical analyses on each of the m datasets, the results (estimates and standard errors) are combined using simple rules to produce overall estimates and standard errors that reflect missing-data uncertainty. It turns out that in many applications, just 3–5 imputations are sufficient to obtain excellent results.
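To make the procedure concrete, here is a minimal sketch of MI with pooled results, assuming scikit-learn's IterativeImputer (drawing each imputation from a predictive distribution via sample_posterior=True) and statsmodels' OLS as the complete-data analysis; the toy data, the regression model, and the choice m = 5 are illustrative, not taken from this paper. The pooling step follows the standard rules commonly attributed to Rubin: the pooled estimate is the average of the m estimates, and the total variance combines the average within-imputation variance with the between-imputation variance.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)
X[rng.random(X.shape) < 0.15] = np.nan          # knock out ~15% of the entries

m = 5                                            # 3-5 imputations are often enough
estimates, variances = [], []
for i in range(m):
    # sample_posterior=True draws each imputation from a predictive distribution,
    # so the m completed datasets differ and carry the imputation uncertainty
    imp = IterativeImputer(sample_posterior=True, random_state=i)
    X_complete = imp.fit_transform(X)
    fit = sm.OLS(y, sm.add_constant(X_complete)).fit()   # complete-data analysis
    estimates.append(fit.params)
    variances.append(fit.bse ** 2)

# Pooling: mean of the estimates; total variance = within + (1 + 1/m) * between
q_bar = np.mean(estimates, axis=0)
u_bar = np.mean(variances, axis=0)
b = np.var(estimates, axis=0, ddof=1)
total_var = u_bar + (1 + 1 / m) * b
print("pooled coefficients:", q_bar)
print("pooled std. errors :", np.sqrt(total_var))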

General-purpose MI software for incomplete multivariate data is now available.²⁴

²⁴ The first four fully developed packages were NORM, which performs multiple imputation under a multivariate normal model; CAT, for multivariate categorical data; MIX, for mixed datasets containing both continuous and categorical variables; and PAN, for multivariate panel or clustered data. http://www.stefvanbuuren.nl/mi/Software.html is a good resource for MI software.

From integrity constraints to unsupervised anomaly detection. In traditional (“small data”) systems, integrity constraints played a major role in encapsulating the knowledge of a given domain. To a certain degree, this is a successful approach for ensuring data quality when the domain is well understood and static. In the big data realm, however, big variety introduces so much variability that this approach no longer works well: subsets of seemingly noncompliant data may actually be legitimate, usable data; they are anomalies, but possibly acceptable ones.

Recent investigations have reported using machine learning to detect these anomalies, i.e., these integrity-constraint violations, from the data. In a first pass, traditional constraints such as functional and inclusion dependencies are declared (or they can also be learned from the data, as mentioned above), and in a second pass the violations are re-examined through special unsupervised techniques [14], [15]. A human domain expert can then look at candidate explanations for these supposed glitches and decide either to incorporate them as new constraints or to cleanse the data away. For instance, an HR database example is given in [15] where, upon a violation of the uniqueness constraint on phone numbers among employees, the learner reveals that the faulty records correspond to employees with “new hire” status who have been given the same phone number as their supervisor. After analysis, the HR user ends up accepting this revised constraint (e.g., because s/he knows that “new hire” is a transient status), which becomes a conditional functional dependency.
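To make the two-pass idea concrete, here is a hedged sketch in Python over a hypothetical HR table; the column names, toy rows, and the “new hire” rule are illustrative stand-ins for the example attributed to [15], not code from that work. Pass one flags violations of a declared uniqueness constraint on phone numbers; pass two inspects the violations for a shared attribute value that a domain expert could promote to a conditional functional dependency.

import pandas as pd

# Hypothetical HR table (toy data for illustration only)
employees = pd.DataFrame({
    "emp_id":     [1, 2, 3, 4, 5],
    "status":     ["regular", "new hire", "new hire", "regular", "regular"],
    "phone":      ["555-0101", "555-0104", "555-0104", "555-0104", "555-0107"],
    "supervisor": [None, 4, 4, None, None],
})

# Pass 1: declared integrity constraint -- phone numbers should be unique.
violations = employees[employees.duplicated("phone", keep=False)]

# Pass 2: look for an explanation of the violations instead of discarding them.
# Here we simply group the offending rows by phone and list their statuses; a
# real system would apply unsupervised techniques (clustering, pattern mining).
explanation = violations.groupby("phone")["status"].agg(list)
print(explanation)
# If most violating rows are "new hire" records sharing their supervisor's
# phone, the expert may accept a conditional functional dependency:
# phone -> emp_id is required to hold only where status != "new hire".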

Supervised anomaly detection. Assessment of sensor-data quality is an example of a domain where unsupervised outlier-detection methods such as those described above do not work well in general. The quality of sensor data is expressed by discrete quality flags (normally assigned by domain experts) that indicate the level of uncertainty associated with a reading, and that level of uncertainty differs depending on the problem under consideration. Supervised classification is thus a feasible alternative, even though it faces the problem that data of dubious quality appears in a representative labelled dataset only with very small frequency. Proposals such as [13] address this imbalance through cluster-oriented sampling and by training multiple classifiers to improve the overall classification accuracy.
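As an illustration of this strategy, here is a minimal sketch, assuming scikit-learn, of cluster-oriented sampling of the majority (“good”) class combined with multiple classifiers whose predictions are averaged; it is an illustrative reading of the approach, not the exact algorithm of [13], and the toy sensor data, classifier choices, and cluster count are assumptions.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 4))                 # toy sensor readings
y = (rng.random(5000) < 0.02).astype(int)      # quality flag: 1 = dubious (rare)

good = np.where(y == 0)[0]
bad = np.where(y == 1)[0]

# Cluster the majority class and sample from each cluster, so the reduced
# training sets stay balanced while still covering the feature space.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X[good])

models = []
for clf in (RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)):
    sampled = np.concatenate([
        rng.choice(good[clusters == c], size=len(bad), replace=True)
        for c in range(10)
    ])
    idx = np.concatenate([sampled, bad])
    clf.fit(X[idx], y[idx])
    models.append(clf)

# Combine the classifiers by averaging their predicted probabilities.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print("readings flagged as dubious:", int((proba > 0.5).sum()))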