

Hadoop HDFS does not support in-place updates, so data quality approaches that rely on updating records in place simply won’t work. One approach is to apply data cleansing to data as it flows into or out of file systems that do not support updates. Applying basic quality checks on the incoming data is recommended, since an incremental batch is typically much smaller than the full data set.
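As an illustration, the following is a minimal PySpark sketch of such a check applied to an incoming increment before it is appended to the file system; the paths, column names, and validity rules are hypothetical and would be adapted to the actual data set.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("incremental-quality-check").getOrCreate()

    # Read only the new increment, not the full data set already stored in HDFS.
    increment = spark.read.parquet("hdfs:///landing/orders/2017-06-01/")  # hypothetical path

    # Basic quality checks: required fields present and amounts in a plausible range.
    is_valid = (col("order_id").isNotNull()
                & col("customer_id").isNotNull()
                & (col("amount") >= 0))

    # Append clean rows to the curated area and quarantine the rest,
    # instead of trying to update files in place.
    increment.filter(is_valid).write.mode("append").parquet("hdfs:///curated/orders/")
    increment.filter(~is_valid).write.mode("append").parquet("hdfs:///quarantine/orders/")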
—
Exceptions to the above include applications that work on transaction data to detect fraud or other kinds of risk. Outlier data or unusual transactions should not be cleansed away. In this case, the adequate treatment is to determine whether the outlier data in fact correspond to anomalies; more on this in section 6.3.3.3 below.
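One way to flag rather than cleanse such records is sketched below using scikit-learn’s IsolationForest; the transaction fields and contamination rate are hypothetical, and this particular outlier detector is only one option among many.

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # Hypothetical transactions; in practice these would be read from the data lake.
    transactions = pd.DataFrame({
        "amount":  [25.0, 30.0, 27.5, 9800.0, 31.0, 26.0],
        "n_items": [1, 2, 1, 40, 2, 1],
    })

    # Flag unusual transactions instead of cleansing them away.
    model = IsolationForest(contamination=0.1, random_state=42)
    transactions["outlier_flag"] = (
        model.fit_predict(transactions[["amount", "n_items"]]) == -1)

    # Downstream fraud analysis decides whether flagged rows are errors or genuine anomalies.
    print(transactions[transactions["outlier_flag"]])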
—
An application may still deal with transactional data, but the high variability of the data it needs to integrate may mean that the domain can no longer be modeled simply through integrity constraints: there may be data that does not comply with the constraints but is nevertheless legitimate, usable data. In this case, cleansing the supposedly faulty data would in fact introduce distortion. Several recent proposals use machine learning to learn constraints from the data, as explained in section 6.3.3.3 below.
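As a deliberately simple illustration of learning constraints from the data rather than declaring them up front, the sketch below derives an accepted value domain from observed frequencies and only flags non-conforming rows; real proposals learn much richer constraints, and the column and threshold used here are hypothetical.

    import pandas as pd

    # Hypothetical integrated records from several sources; the 15% frequency
    # threshold is illustrative only.
    records = pd.DataFrame({
        "country": ["US", "US", "UK", "US", "IN", "US", "UK", "U.S.", "IN", "US"],
    })

    # Learn the accepted domain for "country" from the data itself: values that
    # occur frequently enough are treated as legitimate, even if no hand-written
    # integrity constraint ever listed them.
    freq = records["country"].value_counts(normalize=True)
    learned_domain = set(freq[freq >= 0.15].index)

    # Flag, rather than delete, rows outside the learned domain; a rare spelling
    # such as "U.S." may still be legitimate and needs review.
    records["suspect"] = ~records["country"].isin(learned_domain)
    print(learned_domain)
    print(records[records["suspect"]])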
—
For web applications relying on clickstream data, or for ad placement applications, data that is not 100% correct may not be a crucial problem, especially if the application or analysis is trying to convey a big picture of the situation. For example, when looking for patterns such as the point at which users leave a site, or which path is more likely to result in a purchase, over a massive amount of clickstream data, outliers will most likely not affect the overall conclusion. In this case, pattern chasing is more of an analytics process than a data quality process.²³

²³ Data quality may also be centered on another aspect, namely relevance: choosing from the population generating the click stream only those visitors that are relevant to your analysis (e.g., selecting only actual customers accessing your site).
—
Sensor data is not generated by humans, but it can still be missing, uncertain, or invalid. Applications such as environmental monitoring need indicators of the quality of sensor data. Profiling and/or validating this data to detect deviations from the norm is important, as such deviations may indicate a problem with the sensor infrastructure, sensor misplacement, climate conditions, etc. This is another area where machine learning technology can help.
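A minimal sketch of such profiling is shown below, flagging missing readings and readings that deviate strongly from the recent norm of the stream; the sensor values, window size, and threshold are hypothetical.

    import pandas as pd
    import numpy as np

    # Hypothetical temperature readings from one sensor, at a fixed interval.
    readings = pd.Series([21.1, 21.3, np.nan, 21.2, 35.9, 21.4, 21.2, 21.3])

    # Profile the stream: rolling statistics over the preceding readings
    # describe the "norm" for this sensor.
    baseline = readings.shift(1)
    rolling_mean = baseline.rolling(window=5, min_periods=3).mean()
    rolling_std = baseline.rolling(window=5, min_periods=3).std()

    quality = pd.DataFrame({
        "value": readings,
        "missing": readings.isna(),                                    # gaps in the stream
        "deviant": (readings - rolling_mean).abs() > 3 * rolling_std,  # far from recent norm
    })
    print(quality)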
—
If data mining analysis is likely to be performed downstream on numerical data, missing values in data elements that should not be missing are often problematic. At a minimum, the data quality processing should distinguish between “cannot exist” and “exists but is unknown” and tag the fact data accordingly. This is a familiar problem in the medical and social sciences, where subjects responding to a questionnaire may be unwilling or unable to respond to some items, or may fail to complete sections of the questionnaire for lack of time or interest. For missing values of the kind “exists but is unknown”, it is therefore often better to leave the statistical estimation to a special-purpose data mining application downstream than to reject rows with these missing values, which would introduce distortion in the data. More on this in section 6.3.3.3 below.
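The sketch below illustrates this kind of tagging: missing values are labeled as “cannot exist” or “exists but is unknown” rather than the rows being rejected; the questionnaire columns and the rule used to tell the two cases apart are hypothetical.

    import pandas as pd
    import numpy as np

    # Hypothetical questionnaire facts: "pregnancies" cannot exist for male subjects,
    # while a blank income exists but is unknown.
    facts = pd.DataFrame({
        "subject": [1, 2, 3],
        "sex": ["M", "F", "F"],
        "pregnancies": [np.nan, 2, np.nan],
        "income": [52000, np.nan, 61000],
    })

    # Tag each missing value with a reason instead of rejecting the row.
    facts["pregnancies_status"] = np.where(
        facts["pregnancies"].notna(), "present",
        np.where(facts["sex"] == "M", "cannot_exist", "unknown"))
    facts["income_status"] = np.where(facts["income"].notna(), "present", "unknown")

    # Rows tagged "unknown" are kept and left to a downstream estimation/imputation step.
    print(facts)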
—
Applications on social media data raise data quality considerations at two levels: entity matching, and data quality related to text data.
—
Entity matching, a.k.a. entity resolution, is important because datasets built with data from social sites may contain multiple different references to the same underlying entity (person). This is similar to the problem of data matching and consolidation (see chapter 9), in that entity resolution makes use of attribute similarities to identify potential duplicates; but it may also use the social context, or “who’s connected to whom,” which can provide useful information for qualifying or disqualifying matches.
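The sketch below illustrates how attribute similarity and social context can be combined into a single match score for a candidate pair; the profile fields, weights, and threshold are hypothetical and only meant to show the idea.

    from difflib import SequenceMatcher

    # Hypothetical profiles from two social sites and their connection lists.
    a = {"name": "Jon Smith",  "city": "Pune", "friends": {"amit", "lucia", "wei"}}
    b = {"name": "John Smith", "city": "Pune", "friends": {"amit", "wei", "carlos"}}

    # Attribute similarity: simple string similarity on name and city.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    city_sim = 1.0 if a["city"].lower() == b["city"].lower() else 0.0

    # Social context: overlap of connections (Jaccard similarity of friend sets).
    overlap = len(a["friends"] & b["friends"]) / len(a["friends"] | b["friends"])

    # Combine the evidence; the weights and threshold are illustrative only.
    score = 0.5 * name_sim + 0.2 * city_sim + 0.3 * overlap
    print("candidate match" if score > 0.8 else "no match", round(score, 2))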