These services can be bundled with cloud integration products or cloud-based platforms, or accessed from REST client applications through simple URLs. This is particularly useful for the many applications that need to provide real-time data quality at the point of record entry, locate one or more places or addresses on a map, turn geographic coordinates into a place name or address, or consolidate master data through batch cleansing and matching services.
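As a hedged illustration of the "simple URL" access pattern, the sketch below shows how a REST client might call such an address-verification and geocoding service. The endpoint, parameters, and response fields (api.example-dataquality.com, standardized_address, latitude, longitude) are hypothetical placeholders, not the API of any specific product.

# Minimal sketch of calling a hypothetical address-verification / geocoding
# REST service through a simple URL. Endpoint and field names are
# illustrative placeholders, not a real product API.
import requests

BASE_URL = "https://api.example-dataquality.com/v1"  # hypothetical endpoint

def verify_and_geocode(address: str, api_key: str) -> dict:
    """Validate an address at the point of entry and return its coordinates."""
    response = requests.get(
        f"{BASE_URL}/geocode",
        params={"address": address, "key": api_key},
        timeout=10,
    )
    response.raise_for_status()
    result = response.json()
    # Expected (hypothetical) response shape: standardized address plus lat/long.
    return {
        "standardized_address": result.get("standardized_address"),
        "latitude": result.get("latitude"),
        "longitude": result.get("longitude"),
    }

# Example usage (only meaningful against a real endpoint):
# print(verify_and_geocode("10 Main Street, Pune", "demo-key"))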
6.3.3 Big Data
In this section, we first present the overall role and techniques of data quality in big data repositories and then, more specifically, in data lakes (big data repositories equipped with discovery and governance tools for post-ingestion data quality). We then present a bird's-eye view of machine learning techniques that are becoming popular in big data environments (even though some of them predate the big data era) for assessing quality and helping with the data improvement process.
6.3.3.1 How much data quality is needed for big data?
Traditional data warehouses treated dirty data as something to be avoided through purging, cleansing and/or reconciling, record by record, on the premise that a data warehouse needs strict data quality standards and must avoid double counting. It is by no means clear that this still holds in big data repositories such as Hadoop. For content such as log files, it is neither practical nor worthwhile to impose a structural standard, given the wide variety of data structures and sources involved. Volume and velocity, the other two 'V's of big data, also make this proposition impractical.
As the reader would expect, the value of high-quality big data is driven by the nature of the applications or the type of analytics running on top of the big data repository. In what follows, we examine some analytical applications that may use big data repository content.
— For transactional applications that carry regulatory compliance requirements and/or require strict audit trails, data quality is as important as in a data warehouse. These applications generally refer to transaction data related to key entities such as customers or products, and may leverage traditional data quality technology as long as it scales to meet the needs of massive volume. Attention needs to be paid to the following aspects:
— Given the massive amounts of data involved, moving data into a middleware ETL engine should be avoided. Consider ETL engines that execute within the big data environment (a minimal in-cluster sketch follows this list).
— Consider the experience gathered in optimizing costly data quality algorithms, for instance blocking methods in record matching and de-duplication. We talked about blocking in section 6.3.1 above, and we have recently seen a lot of interest in adapting blocking methods to Hadoop distributed architectures and in performing schema-agnostic blocking (i.e., blocking that does not rely on a common set of attributes, which semi-structured or unstructured data frequently lacks). A schema-agnostic blocking sketch also follows this list.
— Prefer tools that provide high-level GUIs. Programming transforms in lower-level interfaces such as MapReduce or Pig takes longer to develop, and the resulting code is harder to read, reuse, and extend.
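To illustrate the first point above, the following is a minimal sketch, assuming a Spark-based big data environment, of a standardization transform expressed with PySpark DataFrame operations so that the cleansing logic executes inside the cluster rather than pulling records out into a middleware ETL engine. The file paths and column names (customers.csv, customer_id, name, country) are assumptions for illustration only.

# Minimal in-cluster cleansing sketch: the transform runs where the data
# lives instead of moving records into a middleware ETL engine.
# Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("in-cluster-cleansing").getOrCreate()

# Read raw customer records already stored in the cluster (hypothetical path).
raw = spark.read.option("header", "true").csv("hdfs:///landing/customers.csv")

cleansed = (
    raw
    # Trim stray whitespace and normalize case to support matching.
    .withColumn("name", F.trim(F.col("name")))
    .withColumn("country", F.upper(F.trim(F.col("country"))))
    # Drop records that lack the key attribute needed for audit trails.
    .filter(F.col("customer_id").isNotNull())
    # Collapse exact duplicates on the business key.
    .dropDuplicates(["customer_id"])
)

# Write the standardized records back into the repository (hypothetical path).
cleansed.write.mode("overwrite").parquet("hdfs:///curated/customers")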
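The second point, schema-agnostic blocking, is sketched below in simplified, single-machine form: instead of deriving blocking keys from a fixed attribute, every token of every available field contributes a key, so records with different or missing attributes can still land in the same block and be compared. This is an illustration of the idea, not a distributed implementation, and the sample records are placeholders.

# Simplified sketch of schema-agnostic (token) blocking: blocking keys come
# from every token of every available field, so records need not share a
# common schema to end up in the same block.
from collections import defaultdict
from itertools import combinations

def blocking_keys(record: dict) -> set:
    """Tokenize all values, regardless of which attributes are present."""
    tokens = set()
    for value in record.values():
        for token in str(value).lower().split():
            tokens.add(token)
    return tokens

def candidate_pairs(records: list) -> set:
    """Group records by shared token; only pair records within a block."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        for key in blocking_keys(record):
            blocks[key].append(idx)
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(members, 2):
            pairs.add((a, b))
    return pairs

# Hypothetical semi-structured records with differing attributes.
records = [
    {"name": "Acme Corp", "city": "Pune"},
    {"company": "Acme Industries", "country": "India"},
    {"name": "Globex", "city": "Boston"},
]
# Only records sharing at least one token are compared in detail later.
print(candidate_pairs(records))  # {(0, 1)}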