

4.4.1.1 ETL: build versus buy
One of the earliest and most crucial decisions you must make is whether to hand-code your ETL system or use an ETL tool. Given the current state of available technology, most organizations should use a vendor-supplied ETL tool. There are several reasons for this:
1. Most ETL tools provide reusable code in the form of ready-to-use transformations (data extraction, selection, projection, joins, aggregation, data cleansing, matching and consolidation, loaders) that you can chain together into a data pipeline (see the sketch after this list).
2. Most tools supplement these transformations with implementations of the subsystems presented in chapter 9 of [1], which we summarize in section 5.3.5 below: in particular profiling, change data capture, the data quality subsystem, slowly changing dimensions, and surrogate key generation.
3. ETL tools generate clearer, self-documenting code, which is therefore easier to maintain and to extend.
4. The maturity of these engines is now such that the performance of the generated ETL code is on par with what can be expected from custom code.
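To make the first point concrete, here is a minimal, hand-coded sketch of the kind of pipeline an ETL tool lets you assemble from ready-made building blocks: extraction, selection, projection, cleansing, aggregation and load, chained together. All names and data are illustrative assumptions, not taken from any particular tool.

    # Minimal hand-coded pipeline sketch, chaining the transformations that an
    # ETL tool provides out of the box. All names and data are illustrative.
    import csv
    from collections import defaultdict
    from io import StringIO

    # Extract: read raw order rows (an in-memory CSV stands in for a source).
    source = StringIO("order_id,customer,amount,status\n"
                      "1,acme,100,shipped\n2,acme,250,shipped\n"
                      "3,beta,75,cancelled\n")
    rows = list(csv.DictReader(source))

    # Select: keep only shipped orders.
    shipped = [r for r in rows if r["status"] == "shipped"]

    # Project and cleanse: keep the needed columns, cast amounts to numbers.
    cleansed = [{"customer": r["customer"], "amount": float(r["amount"])}
                for r in shipped]

    # Aggregate: total amount per customer.
    totals = defaultdict(float)
    for r in cleansed:
        totals[r["customer"]] += r["amount"]

    # Load: print here; a real loader would write to the target warehouse.
    for customer, amount in sorted(totals.items()):
        print(customer, amount)

Every step above is something you would otherwise have to write, test and document by hand; an ETL tool replaces each one with a configurable, reusable component.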
Technical issues aside, you shouldn't go off in a direction that your employees and managers find unfamiliar without seriously considering the decision's long-term implications. You may also be required to use legacy licenses, although in many cases this is a requirement you can live with. Finally, if you do end up buying an ETL tool, be aware that the payback in improved productivity will only arrive after the first iteration, as the tools take as much time to set up and learn as they save in the first round.
4.4.1.2 Which processing pattern: ETL or ELT?
As the order of the letters in the ETL acronym implies, in an ETL tool the transformations needed to integrate and cleanse data are performed before the data is loaded into the target engine, typically in a middleware engine that comes as part of the tool. A popular recent alternative is ELT, which leverages the power of the target system as the place to do the transformation, particularly when there are large amounts of data to load and transform.
This trend started becoming popular with ETL engines a few years ago, driven by performance concerns over growing data volumes: the engines can now operate in ELT mode by generating code that is "pushed down" to execute directly on the target engine, whether the target is a traditional database or data warehouse engine, a cloud data warehouse or a big data Hadoop cluster. Advanced tools such as Informatica PowerCenter, IBM DataStage and SAP Data Services all offer this, so the capability has become mainstream.
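As a rough illustration of the two patterns, the sketch below performs the same aggregation first ETL-style, pulling rows into the middleware process and transforming them there, and then ELT-style, generating a set-based SQL statement that is pushed down to the target engine. An in-memory SQLite database stands in for the target; all table and column names are hypothetical, and real pushdown code generation is considerably more elaborate.

    # ETL versus ELT (pushdown) on the same aggregation. SQLite stands in for
    # the target engine; all names are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging_orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO staging_orders VALUES (?, ?)",
                     [("acme", 100.0), ("acme", 250.0), ("beta", 75.0)])
    conn.execute("CREATE TABLE fact_orders (customer TEXT, total REAL)")

    # ETL mode: rows travel to the middleware, are transformed there, and the
    # result is loaded back into the target.
    rows = conn.execute("SELECT customer, amount FROM staging_orders").fetchall()
    totals = {}
    for customer, amount in rows:   # the transformation runs in the middleware
        totals[customer] = totals.get(customer, 0.0) + amount
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?)", totals.items())

    # ELT mode: the tool generates set-based SQL and pushes it down, so the
    # data never leaves the target engine.
    conn.execute("DELETE FROM fact_orders")  # reset for the demo
    conn.execute("INSERT INTO fact_orders (customer, total) "
                 "SELECT customer, SUM(amount) FROM staging_orders "
                 "GROUP BY customer")
    print(conn.execute("SELECT * FROM fact_orders ORDER BY customer").fetchall())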
We have also witnessed the emergence of a market for ELT add-ins to database engines. The difference from the ETL tool category is that there is no middleware ETL engine; these products are purely a code generation play on the target engine. In this last category, the main value-add beyond performance is an environment that helps manage the complexity of transformations, at a reduced cost. Talend Open Studio, Attunity Compose and WhereScape RED are examples of this market category. Finally, we observe that the ELT trend has also become popular with traditional databases: both Oracle and SAP now have data integration and cleansing options working in ELT mode on their flagship databases. These vendors advocate that fast loading is not just about flipping some parameters around, and that transformation is not just simple SQL code generation: complex data pipelines have to be mapped to complex scripts and/or stored procedures to perform effectively, so they believe they are best positioned to develop these products.
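To see why these vendors argue that ELT is more than simple SQL code generation, consider a Type 2 slowly changing dimension update, one of the subsystems mentioned in section 4.4.1.1. Even this routine operation maps to several coordinated, set-based statements on the target, the kind of logic that would typically live in a generated script or stored procedure. The sketch below uses an in-memory SQLite database as a stand-in target; all names are illustrative.

    # A Type 2 slowly changing dimension update expressed in ELT style: several
    # coordinated statements run on the target engine. SQLite stands in for the
    # target; all names are illustrative.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_customer (
            sk INTEGER PRIMARY KEY,         -- surrogate key
            customer_id TEXT, city TEXT,
            current_flag INTEGER DEFAULT 1  -- 1 = current version of the row
        );
        INSERT INTO dim_customer (customer_id, city) VALUES ('acme', 'Pune');
        CREATE TABLE staging_customer (customer_id TEXT, city TEXT);
        INSERT INTO staging_customer VALUES ('acme', 'Mumbai');  -- city changed
    """)

    # Step 1: expire the current version of every row whose attributes changed.
    conn.execute("""
        UPDATE dim_customer SET current_flag = 0
        WHERE current_flag = 1 AND EXISTS (
            SELECT 1 FROM staging_customer s
            WHERE s.customer_id = dim_customer.customer_id
              AND s.city <> dim_customer.city)
    """)

    # Step 2: insert a new current version for every staging row that no longer
    # has one; the autoincrementing key doubles as surrogate key generation.
    conn.execute("""
        INSERT INTO dim_customer (customer_id, city)
        SELECT s.customer_id, s.city FROM staging_customer s
        WHERE NOT EXISTS (
            SELECT 1 FROM dim_customer d
            WHERE d.customer_id = s.customer_id AND d.current_flag = 1)
    """)

    # The old row is preserved with current_flag = 0; the new row is current.
    print(conn.execute("SELECT * FROM dim_customer ORDER BY sk").fetchall())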