

Agile development methodology is now influencing ETL data warehouse development by moving towards script/API-based development at the expense of ETL technical UIs.
The ETL UIs are good for starting out; however, their inadequacies come to the fore very quickly when dealing with performance. On the plus side, the reusability and the APIs provide a quick-start mechanism and a factory approach to Data Integration. The UIs serve to hide the details, but over time they force users (not just power users) to learn the details and the workarounds. Sometimes, the workarounds become the go-to path of usage.
Agile development methodology is now influencing ETL, with the Big Data ETL world moving towards Domain Specific Languages (DSLs) in a big way. We see the emergence of Cascading, Apache Beam (Google Cloud Dataflow) [36], Pig and PySpark. These can be viewed as Software Development Kits (SDK/API) or programming models that express high-level abstractions and logical constructs in a programming language, thus simplifying the overall code. Data flows built with DSLs can be visualized graphically and used for monitoring and troubleshooting the flows: there is no inherent contradiction between API-based development and visual interfaces.
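To illustrate the DSL/SDK style of development, below is a minimal PySpark sketch (the input files and column names are hypothetical, not taken from any specific project): the flow is expressed as a chain of high-level relational operations rather than steps configured in a graphical ETL tool, and the resulting execution plan can still be inspected and monitored graphically in the engine's UI.

```python
# Illustrative PySpark data flow: the ETL logic is expressed as a chain of
# high-level, declarative operations (read, join, filter, aggregate, write).
# All file names and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dsl-style-etl-sketch").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
customers = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Declarative flow: Spark builds a logical plan from these calls, optimizes
# the physical execution, and exposes the plan for monitoring in the Spark UI.
revenue_by_region = (
    orders.join(customers, on="customer_id")
          .filter(F.col("status") == "SHIPPED")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_revenue"))
)

revenue_by_region.write.mode("overwrite").parquet("revenue_by_region.parquet")
```

The same flow could be written with equivalent constructs in Apache Beam or Cascading; the point is the programming-model abstraction, not the particular engine.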
6 Data Quality
6.1 Brief Summary
A few decades ago, data and data management software were not considered assets. Everyone agreed on their importance, but they were not treated as concrete goods. Nowadays, this view has changed so much that the exact opposite view is starting to prevail: not only are data and software an important part of an organization’s assets, but an increasing number of organizations are using data as a competitive advantage. New disciplines have emerged to treat data or, more generally, information (including content such as text documents, images and the like) as an asset: data governance is a new corporate-wide discipline involving the management of people, technology, and processes to deliver information-driven projects on time, providing users with the right information at the right time.
Now, how can organizations evaluate their information to see whether it is “right” and take corrective actions if it is not? First and foremost, by evaluating its quality. When the quality of data is low, the data is either not used (e.g., point-of-sale leads on prospective customers with incomplete contact information), leads to incorrect decisions (e.g., a cost/benefit analysis of product profitability based on inaccurate data), or even causes tangible monetary loss (e.g., erroneously lowering the price of an item, or incurring extra shipping costs because products do not fit into a lorry as their recorded sizes were wrong).
Data quality refers to the level of quality of data. There are multiple definitions of this term. The seminal work by Tom Redman [2], summarized in [3], initially motivated and shaped the discussion. Redman initially identified 27 distinct dimensions within three general categories: data model, data values, and data representation. Over the years these have been refined to a more manageable number. If the ISO 9000:2015 definition of quality is applied, data quality can be defined as the degree to which a set of characteristics of data fulfills requirements for a specific use in areas such as operations, decision making or planning. These generally agreed characteristics are: