

4.4.1.1 ETL: build versus buy
One of the earliest and most crucial decisions you must make is whether to hand-code your ETL system or use an ETL tool. Given the current state of available technology, most organizations should use a vendor-supplied ETL tool. There are several reasons for this:
1. Most ETL tools provide reusable code in the form of ready-to-use transformations (data extraction, selection, projection, joins, aggregation, data cleansing, matching and consolidation, loaders) that you can chain together into a data pipeline (see the sketch after this list).
2. Most tools supplement these transformations with implementations of the subsystems presented in chapter 9 of [1], which we summarize in section 5.3.5 below: in particular profiling, change data capture, the data quality subsystem, slowly changing dimensions, and surrogate key generation.
3. ETL tools generate clearer, self-documenting code, which is therefore easier to maintain and to extend.
4. The maturity of these engines is now such that the performance of the generated ETL code is on par with what can be expected from custom code.
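To make the first point concrete, here is a minimal, hand-coded sketch of the kind of pipeline an ETL tool lets you assemble from ready-made building blocks: extraction, selection, projection, cleansing, aggregation and load, chained together. All names and data are illustrative assumptions, not taken from any particular tool.

    # Minimal hand-coded pipeline sketch, chaining the transformations that an
    # ETL tool provides out of the box. All names and data are illustrative.
    import csv
    from collections import defaultdict
    from io import StringIO

    # Extract: read raw order rows (an in-memory CSV stands in for a source).
    source = StringIO("order_id,customer,amount,status\n"
                      "1,acme,100,shipped\n2,acme,250,shipped\n"
                      "3,beta,75,cancelled\n")
    rows = list(csv.DictReader(source))

    # Select: keep only shipped orders.
    shipped = [r for r in rows if r["status"] == "shipped"]

    # Project and cleanse: keep the needed columns, cast amounts to numbers.
    cleansed = [{"customer": r["customer"], "amount": float(r["amount"])}
                for r in shipped]

    # Aggregate: total amount per customer.
    totals = defaultdict(float)
    for r in cleansed:
        totals[r["customer"]] += r["amount"]

    # Load: print here; a real loader would write to the target warehouse.
    for customer, amount in sorted(totals.items()):
        print(customer, amount)

Every step above is something you would otherwise have to write, test and document by hand; an ETL tool replaces each one with a configurable, reusable component.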
Technical issues aside, you shouldn't go off in a direction that your employees and managers find unfamiliar without seriously considering the decision's long-term implications. You may also be required to use legacy licenses, although in many cases this is a requirement you can live with. Finally, if you do end up buying an ETL tool, be aware that the payback in improved productivity will only arrive after the first iteration, as the tools take as much time to set up and learn as they save in the first round.
4.4.1.2 Which processing pattern: ETL or ELT?
As the order of the letters in the ETL acronym implies, in an ETL tool the transformations needed to integrate and cleanse data are performed before the data is loaded into the target engine, typically in a middleware engine that comes as part of the tool. A popular recent alternative is ELT, which leverages the power of the target system as the place to do the transformation, particularly when there are large amounts of data to load and transform.
This trend started becoming popular with ETL engines a few years ago, driven by performance concerns over growing data volumes: the engines can now operate in ELT mode by generating code that is "pushed down" to execute directly on the target engine, whether the target is a traditional database or data warehouse engine, a cloud data warehouse or a big data Hadoop cluster. Advanced tools such as Informatica PowerCenter, IBM DataStage and SAP Data Services all offer this, so the capability has become mainstream.
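As a rough illustration of the two patterns, the sketch below performs the same aggregation first ETL-style, pulling rows into the middleware process and transforming them there, and then ELT-style, generating a set-based SQL statement that is pushed down to the target engine. An in-memory SQLite database stands in for the target; all table and column names are hypothetical, and real pushdown code generation is considerably more elaborate.

    # ETL versus ELT (pushdown) on the same aggregation. SQLite stands in for
    # the target engine; all names are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                     [("acme", 100.0), ("acme", 250.0), ("beta", 75.0)])
    conn.execute("CREATE TABLE fact_orders (customer TEXT, total REAL)")

    # ETL mode: rows travel to the middleware, are transformed there, and the
    # result is loaded back into the target.
    rows = conn.execute("SELECT customer, amount FROM staging_orders").fetchall()
    totals = {}
    for customer, amount in rows:   # the transformation runs in the middleware
        totals[customer] = totals.get(customer, 0.0) + amount
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?)", totals.items())

    # ELT mode: the tool generates set-based SQL and pushes it down, so the
    # data never leaves the target engine.
    conn.execute("DELETE FROM fact_orders")  # reset for the demo
    conn.execute("INSERT INTO fact_orders (customer, total) "
                 "SELECT customer, SUM(amount) FROM staging_orders "
                 "GROUP BY customer")
    print(conn.execute("SELECT * FROM fact_orders ORDER BY customer").fetchall())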
We have also witnessed the emergence of a market for ELT add-ins to database engines. The difference from the ETL tool category is that there is no middleware ETL engine; these products are purely a code generation play on the target engine. In this last category, the main value-add beyond performance is an environment that helps manage the complexity of transformations, at a reduced cost. Talend Open Studio, Attunity Compose and WhereScape RED are examples of this market category. Finally, we observe that the ELT trend has also become popular with traditional databases: both Oracle and SAP now have data integration and cleansing options working in ELT mode on their flagship databases. These vendors advocate that fast loading is not just about flipping some parameters around, and that transformation is not just simple SQL code generation: complex data pipelines have to be mapped to complex scripts and/or stored procedures to perform effectively, so they believe they are best positioned to develop these products.
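To see why these vendors argue that ELT is more than simple SQL code generation, consider a Type 2 slowly changing dimension update, one of the subsystems mentioned in section 4.4.1.1. Even this routine operation maps to several coordinated, set-based statements on the target, the kind of logic that would typically live in a generated script or stored procedure. The sketch below uses an in-memory SQLite database as a stand-in target; all names are illustrative.

    # A Type 2 slowly changing dimension update expressed in ELT style: several
    # coordinated statements run on the target engine. SQLite stands in for the
    # target; all names are illustrative.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_customer (
            sk INTEGER PRIMARY KEY,         -- surrogate key
            customer_id TEXT, city TEXT,
            current_flag INTEGER DEFAULT 1  -- 1 = current version of the row
        );
        INSERT INTO dim_customer (customer_id, city) VALUES ('acme', 'Pune');
        CREATE TABLE staging_customer (customer_id TEXT, city TEXT);
        INSERT INTO staging_customer VALUES ('acme', 'Mumbai');  -- city changed
    """)

    # Step 1: expire the current version of every row whose attributes changed.
    conn.execute("""
        UPDATE dim_customer SET current_flag = 0
        WHERE current_flag = 1 AND EXISTS (
            SELECT 1 FROM staging_customer s
            WHERE s.customer_id = dim_customer.customer_id
              AND s.city <> dim_customer.city)
    """)

    # Step 2: insert a new current version for every staging row that no longer
    # has one; the autoincrementing key doubles as surrogate key generation.
    conn.execute("""
        INSERT INTO dim_customer (customer_id, city)
        SELECT s.customer_id, s.city FROM staging_customer s
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_customer d
            WHERE d.customer_id = s.customer_id AND d.current_flag = 1)
    """)

    # The old row is preserved with current_flag = 0; the new row is current.
    print(conn.execute("SELECT * FROM dim_customer ORDER BY sk").fetchall())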