These services can be bundled with cloud integration products or cloud-based platforms, or accessed from REST client applications through simple URLs. This is particularly useful for the many applications that need to provide real-time data quality at the point of record entry, locate one or more places or addresses on a map, turn geographic coordinates into a place name or address, or consolidate master data through batch cleansing and matching services.
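As a hedged illustration of the "simple URL" access pattern, the sketch below shows how a REST client might call such an address-verification and geocoding service. The endpoint, parameters, and response fields (api.example-dataquality.com, standardized_address, latitude, longitude) are hypothetical placeholders, not the API of any specific product.

# Minimal sketch of calling a hypothetical address-verification / geocoding
# REST service through a simple URL. Endpoint and field names are
# illustrative placeholders, not a real product API.
import requests

BASE_URL = "https://api.example-dataquality.com/v1"  # hypothetical endpoint

def verify_and_geocode(address: str, api_key: str) -> dict:
    """Validate an address at the point of entry and return its coordinates."""
    response = requests.get(
        f"{BASE_URL}/geocode",
        params={"address": address, "key": api_key},
        timeout=10,
    )
    response.raise_for_status()
    result = response.json()
    # Expected (hypothetical) response shape: standardized address plus lat/long.
    return {
        "standardized_address": result.get("standardized_address"),
        "latitude": result.get("latitude"),
        "longitude": result.get("longitude"),
    }

# Example usage (only meaningful against a real endpoint):
# print(verify_and_geocode("10 Main Street, Pune", "demo-key"))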
6.3.3 Big Data
In this section, we first present the overall role and techniques of data quality in big data repositories and then, more specifically, in data lakes (big data repositories equipped with discovery and governance tools for post-ingestion data quality). We then present a bird's-eye view of machine learning techniques that are becoming popular in big data environments (even though some of them predate the big data era) for assessing quality and helping with the data improvement process.
6.3.3.1 How much data quality is needed for big data?
Traditional data warehouses treated dirty data as something to be avoided through purging, cleansing and/or reconciling, record by record, on the premise that a data warehouse needs strict data quality standards and must avoid double counting. It is by no means clear that this still holds in big data repositories such as Hadoop. For content such as log files, it is neither practical nor worthwhile to impose a structural standard, given the wide variety of data structures and sources involved. Volume and velocity, the other two 'V's of big data, also make this proposition impractical.
As the reader would expect, the value of high-quality big data is driven by the nature of the applications or the type of analytics running on top of the big data repository. In what follows, we examine some analytical applications that may use big data repository content.
— For transactional applications that carry regulatory compliance requirements and/or require strict audit trails, data quality is as important as in a data warehouse. These applications generally refer to transaction data related to key entities such as customers or products, and may leverage traditional data quality technology as long as it scales to meet the needs of massive volume. Attention needs to be paid to the following aspects:
— Given the massive amounts of data involved, moving data into a middleware ETL engine should be avoided. Consider ETL engines that execute within the big data environment (a minimal in-cluster sketch follows this list).
— Consider the experience gathered in optimizing costly data quality algorithms, for instance blocking methods in record matching and de-duplication. We talked about blocking in section 6.3.1 above, and we have recently seen a lot of interest in adapting blocking methods to Hadoop distributed architectures and in performing schema-agnostic blocking (i.e., blocking that does not rely on a common set of attributes, which semi-structured or unstructured data frequently lacks). A schema-agnostic blocking sketch also follows this list.
— Prefer tools that provide high-level GUIs. Programming transforms in lower-level interfaces such as MapReduce or Pig takes longer to develop, and the resulting code is harder to read, reuse, and extend.
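To illustrate the first point above, the following is a minimal sketch, assuming a Spark-based big data environment, of a standardization transform expressed with PySpark DataFrame operations so that the cleansing logic executes inside the cluster rather than pulling records out into a middleware ETL engine. The file paths and column names (customers.csv, customer_id, name, country) are assumptions for illustration only.

# Minimal in-cluster cleansing sketch: the transform runs where the data
# lives instead of moving records into a middleware ETL engine.
# Paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("in-cluster-cleansing").getOrCreate()

# Read raw customer records already stored in the cluster (hypothetical path).
raw = spark.read.option("header", "true").csv("hdfs:///landing/customers.csv")

cleansed = (
    raw
    # Trim stray whitespace and normalize case to support matching.
    .withColumn("name", F.trim(F.col("name")))
    .withColumn("country", F.upper(F.trim(F.col("country"))))
    # Drop records that lack the key attribute needed for audit trails.
    .filter(F.col("customer_id").isNotNull())
    # Collapse exact duplicates on the business key.
    .dropDuplicates(["customer_id"])
)

# Write the standardized records back into the repository (hypothetical path).
cleansed.write.mode("overwrite").parquet("hdfs:///curated/customers")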
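The second point, schema-agnostic blocking, is sketched below in simplified, single-machine form: instead of deriving blocking keys from a fixed attribute, every token of every available field contributes a key, so records with different or missing attributes can still land in the same block and be compared. This is an illustration of the idea, not a distributed implementation, and the sample records are placeholders.

# Simplified sketch of schema-agnostic (token) blocking: blocking keys come
# from every token of every available field, so records need not share a
# common schema to end up in the same block.
from collections import defaultdict
from itertools import combinations

def blocking_keys(record: dict) -> set:
    """Tokenize all values, regardless of which attributes are present."""
    tokens = set()
    for value in record.values():
        for token in str(value).lower().split():
            tokens.add(token)
    return tokens

def candidate_pairs(records: list) -> set:
    """Group records by shared token; only pair records within a block."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        for key in blocking_keys(record):
            blocks[key].append(idx)
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(members, 2):
            pairs.add((a, b))
    return pairs

# Hypothetical semi-structured records with differing attributes.
records = [
    {"name": "Acme Corp", "city": "Pune"},
    {"company": "Acme Industries", "country": "India"},
    {"name": "Globex", "city": "Boston"},
]
# Only records sharing at least one token are compared in detail later.
print(candidate_pairs(records))  # {(0, 1)}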