

Hadoop HDFS does not support in-place updates, so data quality approaches that rely on updating records in place simply won’t work. One approach is to apply data cleansing to data as it flows into or out of file systems that do not support updates. Applying basic quality checks on the incoming data is recommended, since an incremental batch is typically much smaller than the full data set.
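As an illustration, the following is a minimal PySpark sketch of such a check applied to an incoming increment before it is appended to the file system; the paths, column names, and validity rules are hypothetical and would be adapted to the actual data set.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("incremental-quality-check").getOrCreate()

    # Read only the new increment, not the full data set already stored in HDFS.
    increment = spark.read.parquet("hdfs:///landing/orders/2017-06-01/")  # hypothetical path

    # Basic quality checks: required fields present and amounts in a plausible range.
    is_valid = (col("order_id").isNotNull()
                & col("customer_id").isNotNull()
                & (col("amount") >= 0))

    # Append clean rows to the curated area and quarantine the rest,
    # instead of trying to update files in place.
    increment.filter(is_valid).write.mode("append").parquet("hdfs:///curated/orders/")
    increment.filter(~is_valid).write.mode("append").parquet("hdfs:///quarantine/orders/")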
—
Exceptions to the above include applications that work on transaction data to detect fraud or other kinds of risk. Outlier data or unusual transactions should not be cleansed away. In this case, the adequate treatment is to determine whether the outlier data in fact correspond to anomalies; more on this in section 6.3.3.3 below.
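One way to flag rather than cleanse such records is sketched below using scikit-learn’s IsolationForest; the transaction fields and contamination rate are hypothetical, and this particular outlier detector is only one option among many.

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # Hypothetical transactions; in practice these would be read from the data lake.
    transactions = pd.DataFrame({
        "amount":  [25.0, 30.0, 27.5, 9800.0, 31.0, 26.0],
        "n_items": [1, 2, 1, 40, 2, 1],
    })

    # Flag unusual transactions instead of cleansing them away.
    model = IsolationForest(contamination=0.1, random_state=42)
    transactions["outlier_flag"] = (
        model.fit_predict(transactions[["amount", "n_items"]]) == -1)

    # Downstream fraud analysis decides whether flagged rows are errors or genuine anomalies.
    print(transactions[transactions["outlier_flag"]])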
—
An application may still deal with transactional data, but the high variability of the data it needs to integrate may mean that the domain can no longer be modeled simply through integrity constraints: there may be data that does not comply with the constraints but is nevertheless legitimate, usable data. In this case, cleansing the supposedly faulty data would in fact introduce distortion. Several recent proposals use machine learning to learn constraints from the data, as explained in section 6.3.3.3 below.
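As a deliberately simple illustration of learning constraints from the data rather than declaring them up front, the sketch below derives an accepted value domain from observed frequencies and only flags non-conforming rows; real proposals learn much richer constraints, and the column and threshold used here are hypothetical.

    import pandas as pd

    # Hypothetical integrated records from several sources; the 15% frequency
    # threshold is illustrative only.
    records = pd.DataFrame({
        "country": ["US", "US", "UK", "US", "IN", "US", "UK", "U.S.", "IN", "US"],
    })

    # Learn the accepted domain for "country" from the data itself: values that
    # occur frequently enough are treated as legitimate, even if no hand-written
    # integrity constraint ever listed them.
    freq = records["country"].value_counts(normalize=True)
    learned_domain = set(freq[freq >= 0.15].index)

    # Flag, rather than delete, rows outside the learned domain; a rare spelling
    # such as "U.S." may still be legitimate and needs review.
    records["suspect"] = ~records["country"].isin(learned_domain)
    print(learned_domain)
    print(records[records["suspect"]])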
—
For web applications relying on clickstream data, or for ad placement applications, data that is not 100% correct may not be a crucial problem, especially if the application or analysis is trying to convey a big picture of the situation. For example, when looking for patterns such as the point at which users leave a site, or which path is more likely to result in a purchase, over a massive amount of clickstream data, outliers will most likely not affect the overall conclusion. In this case, pattern chasing is more of an analytics process than a data quality process.²³

²³ Data quality may also be centered on another aspect, namely relevance: choosing from the population generating the click stream only those visitors that are relevant to your analysis (e.g., selecting only actual customers accessing your site).
—
Sensor data is not generated by humans, but it can still be missing, uncertain, or invalid. Applications such as environmental monitoring need indicators of the quality of sensor data. Profiling and/or validating this data to detect deviations from the norm is important, as such deviations may indicate a problem with the sensor infrastructure, sensor misplacement, climate conditions, etc. This is another area where machine learning technology can help.
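A minimal sketch of such profiling is shown below, flagging missing readings and readings that deviate strongly from the recent norm of the stream; the sensor values, window size, and threshold are hypothetical.

    import pandas as pd
    import numpy as np

    # Hypothetical temperature readings from one sensor, at a fixed interval.
    readings = pd.Series([21.1, 21.3, np.nan, 21.2, 35.9, 21.4, 21.2, 21.3])

    # Profile the stream: rolling statistics over the preceding readings
    # describe the "norm" for this sensor.
    baseline = readings.shift(1)
    rolling_mean = baseline.rolling(window=5, min_periods=3).mean()
    rolling_std = baseline.rolling(window=5, min_periods=3).std()

    quality = pd.DataFrame({
        "value": readings,
        "missing": readings.isna(),                                    # gaps in the stream
        "deviant": (readings - rolling_mean).abs() > 3 * rolling_std,  # far from recent norm
    })
    print(quality)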
—
If data mining analysis is likely to be performed downstream on numerical data, missing values in data elements that should not be missing are often problematic. At a minimum, the data quality processing should distinguish between “cannot exist” and “exists but is unknown” and tag the fact data accordingly. This is a familiar problem in the medical and social sciences, where subjects responding to a questionnaire may be unwilling or unable to respond to some items, or may fail to complete sections of the questionnaire for lack of time or interest. For missing values of the kind “exists but is unknown”, it is therefore often better to leave the statistical estimation to a special-purpose data mining application downstream than to reject rows with these missing values, which would introduce distortion in the data. More on this in section 6.3.3.3 below.
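The sketch below illustrates this kind of tagging: missing values are labeled as “cannot exist” or “exists but is unknown” rather than the rows being rejected; the questionnaire columns and the rule used to tell the two cases apart are hypothetical.

    import pandas as pd
    import numpy as np

    # Hypothetical questionnaire facts: "pregnancies" cannot exist for male subjects,
    # while a blank income exists but is unknown.
    facts = pd.DataFrame({
        "subject": [1, 2, 3],
        "sex": ["M", "F", "F"],
        "pregnancies": [np.nan, 2, np.nan],
        "income": [52000, np.nan, 61000],
    })

    # Tag each missing value with a reason instead of rejecting the row.
    facts["pregnancies_status"] = np.where(
        facts["pregnancies"].notna(), "present",
        np.where(facts["sex"] == "M", "cannot_exist", "unknown"))
    facts["income_status"] = np.where(facts["income"].notna(), "present", "unknown")

    # Rows tagged "unknown" are kept and left to a downstream estimation/imputation step.
    print(facts)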
—
Applications on social media data raise data quality considerations at two levels: entity matching, and data quality related to text data.
—
Entity matching, a.k.a. entity resolution, is important because datasets built with data from social sites may contain multiple different references to the same underlying entity (person). This is similar to the problem of data matching and consolidation (see chapter 9), in that entity resolution makes use of attribute similarities to identify potential duplicates; but it may also use the social context, or “who’s connected to whom,” which can provide useful information for qualifying or disqualifying matches.
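The sketch below illustrates how attribute similarity and social context can be combined into a single match score for a candidate pair; the profile fields, weights, and threshold are hypothetical and only meant to show the idea.

    from difflib import SequenceMatcher

    # Hypothetical profiles from two social sites and their connection lists.
    a = {"name": "Jon Smith",  "city": "Pune", "friends": {"amit", "lucia", "wei"}}
    b = {"name": "John Smith", "city": "Pune", "friends": {"amit", "wei", "carlos"}}

    # Attribute similarity: simple string similarity on name and city.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    city_sim = 1.0 if a["city"].lower() == b["city"].lower() else 0.0

    # Social context: overlap of connections (Jaccard similarity of friend sets).
    overlap = len(a["friends"] & b["friends"]) / len(a["friends"] | b["friends"])

    # Combine the evidence; the weights and threshold are illustrative only.
    score = 0.5 * name_sim + 0.2 * city_sim + 0.3 * overlap
    print("candidate match" if score > 0.8 else "no match", round(score, 2))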