

Agile development methodology is now influencing ETL data warehouse development by moving towards script/API-based development at the expense of ETL technical UIs.
The ETL UIs are good for starting out; however, their inadequacies come to the fore very quickly when dealing with performance. On the plus side, the reusability and the APIs provide a quick-start mechanism and a factory approach to Data Integration. The UIs serve to hide the details, but over time they force users (not just power users) to learn the details and the workarounds. Sometimes, the workarounds become the go-to path of usage.
Agile development methodology is now influencing ETL, with the Big Data ETL world moving towards Domain Specific Languages (DSLs) in a big way. We see the emergence of Cascading, Apache Beam (Google Cloud Dataflow) [36], Pig and PySpark. These can be viewed as Software Development Kits (SDK/API) or programming models that express high-level abstractions and logical constructs in a programming language, thus simplifying the overall code. Data flows built with DSLs can be visualized graphically and used for monitoring and troubleshooting the flows: there is no inherent contradiction between API-based development and visual interfaces.
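To illustrate the DSL/SDK style of development, below is a minimal PySpark sketch (the input files and column names are hypothetical, not taken from any specific project): the flow is expressed as a chain of high-level relational operations rather than steps configured in a graphical ETL tool, and the resulting execution plan can still be inspected and monitored graphically in the engine's UI.

```python
# Illustrative PySpark data flow: the ETL logic is expressed as a chain of
# high-level, declarative operations (read, join, filter, aggregate, write).
# All file names and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dsl-style-etl-sketch").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
customers = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Declarative flow: Spark builds a logical plan from these calls, optimizes
# the physical execution, and exposes the plan for monitoring in the Spark UI.
revenue_by_region = (
    orders.join(customers, on="customer_id")
          .filter(F.col("status") == "SHIPPED")
          .groupBy("region")
          .agg(F.sum("amount").alias("total_revenue"))
)

revenue_by_region.write.mode("overwrite").parquet("revenue_by_region.parquet")
```

The same flow could be written with equivalent constructs in Apache Beam or Cascading; the point is the programming-model abstraction, not the particular engine.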
6 Data Quality
6.1 Brief Summary
A few decades ago, data and data management software were not considered assets. Everyone agreed on their importance, but they were not treated as concrete goods. Nowadays, this view has changed so much that the exact opposite view is starting to prevail: not only are data and software an important part of an organization’s assets, but an increasing number of organizations are using data as a competitive advantage. New disciplines have emerged to treat data or, more generally, information (including content such as text documents, images and the like) as an asset: data governance is a new corporate-wide discipline involving the management of people, technology, and processes to deliver information-driven projects on time, providing users with the right information at the right time.
Now, how can organizations evaluate their information to see whether it is “right” and take corrective actions if it is not? First and foremost, by evaluating its quality. When the quality of data is low, the data is either not used (e.g., point-of-sale leads on prospective customers with incomplete contact information), leads to incorrect decisions (e.g., a cost/benefit analysis of product profitability based on inaccurate data), or even causes tangible monetary loss (e.g., erroneously lowering the price of an item, or incurring extra shipping costs because products do not fit into a lorry as their recorded sizes were wrong).
Data quality refers to the level of quality of data. There are multiple definitions of this term. The seminal work by Tom Redman [2], summarized in [3], initially motivated and shaped the discussion. Redman initially identified 27 distinct dimensions within three general categories: data model, data values, and data representation. Over the years these have been refined to a more manageable number. If the ISO 9000:2015 definition of quality is applied, data quality can be defined as the degree to which a set of characteristics of data fulfills requirements for a specific use in areas such as operations, decision making or planning. These generally agreed characteristics are: