

So, is ELT better than ETL? Not necessarily. There are scenarios where the ETL approach can provide better
performance. For instance, if you are streaming data in real time (from messages, database changes or sensor data),
and your data pipelines have multiple steps that can be processed in main memory without requiring any reading from
(e.g., data lookups) or writing to (e.g., generation of history records) the database, the time to process the data as it
moves through the pipeline is small relative to the time required to read from or write to disk. Even complex calculations
performed in memory take a negligible performance hit. Furthermore, if the data can be split without affecting the
computation, parallel execution of pipelines, which is possible in many ETL tools, may yield further significant
performance gains. Another scenario that works better with ETL is when there are several databases, data
warehouses or data marts to be fed after the transformations, or when the transformations are so complex that they
cannot be translated into ELT code.
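To make the in-memory pipeline idea concrete, the following sketch (in Python, with illustrative record fields and transformations that are not from this whitepaper) streams records through purely in-memory steps and writes only once at the end, splitting the data by a key so that independent partitions could be processed in parallel:

from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List


def extract() -> Iterable[Dict]:
    # Stand-in for a streaming source (messages, change events, sensor readings).
    for i in range(9):
        yield {"sensor_id": i % 3, "reading_c": 20.0 + i}


def transform(record: Dict) -> Dict:
    # Every step runs in main memory: a unit conversion and a derived flag only,
    # with no lookups against and no writes to the database.
    record["reading_f"] = record["reading_c"] * 9 / 5 + 32
    record["out_of_range"] = record["reading_f"] > 80.0
    return record


def run_partition(partition: List[Dict]) -> List[Dict]:
    # Partitions are independent, so they can be processed in parallel; an ETL
    # tool (or a process pool) would run them as truly concurrent pipelines.
    return [transform(r) for r in partition]


def load(records: Iterable[Dict]) -> None:
    # The only write in the whole pipeline; replace print with a bulk insert.
    for r in records:
        print("LOAD", r)


if __name__ == "__main__":
    records = list(extract())
    # Split by sensor_id: the computation never crosses partitions.
    partitions = [[r for r in records if r["sensor_id"] == k] for k in range(3)]
    with ThreadPoolExecutor() as pool:
        for result in pool.map(run_partition, partitions):
            load(result)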
On the other hand, all things being equal, ELT has become a valid and important architectural pattern in data
integration for two reasons: (i) it is preferable to have fewer moving parts: by reducing the data movement through
processing engines you may get better performance, and this also translates into cost and risk advantages in
terms of money, development time and maintenance; and (ii) using all the horsepower of your data engine to load
and process your data also helps with performance.
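As a contrast to the previous sketch, here is a minimal ELT sketch (Python with SQLite standing in for the target engine; table and column names are illustrative assumptions): the raw data is loaded untouched and the transformation then runs as SQL inside the database, so the engine itself does the heavy lifting:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount REAL, country TEXT)")

# Load: push the extracted rows straight into the target, with no transformation yet.
rows = [(1, 10.0, "us"), (2, 25.5, "de"), (3, 7.25, "us")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# Transform: a set-based SQL statement executed by the database engine itself,
# not by a separate ETL server.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT order_id,
           ROUND(amount, 2) AS amount,
           UPPER(country)   AS country_code
    FROM raw_orders
    """
)

for row in conn.execute("SELECT * FROM orders"):
    print(row)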
A related decision is the following:
4.4.1.3 Should we stage the data during ETL or not?
As mentioned in the point above, ETL tools can establish a direct connection to the source database, extract and
stream the data through the ETL tool to apply any required transformation in memory, and finally write it, only once, into
the target data warehouse table. However, despite the performance advantages, this may not be the best approach.
There are several reasons an organization might decide to physically stage the data (i.e., write it to disk) during
the ETL process, as illustrated by the sketch after the list below:
— The organization may need to stage the data immediately after extract for archival purposes, possibly to
meet compliance and audit requirements.
— A recovery/restart point is desired in the event the ETL job fails in midstream, potentially due to a break in the
connection between the source and the ETL environment.
— Long-running ETL processes may hold open a connection to the source system that creates problems with database
locks and stresses the transaction system.
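A minimal sketch of such a staged flow (Python; file names and record fields are illustrative assumptions, not taken from this whitepaper) could look as follows, with the raw extract persisted to disk before any transformation so that downstream steps can be rerun from the staged copy:

import csv
import os

STAGING_FILE = "staging/customers_extract.csv"


def extract_to_staging() -> None:
    # Persist the raw extract to disk immediately; the source connection can be
    # closed right after this step, keeping locks on the transaction system short,
    # and the file doubles as an archival copy for compliance and audit needs.
    os.makedirs("staging", exist_ok=True)
    raw_rows = [("1", "Ada", "ada@example.com"), ("2", "Grace", "grace@example.com")]
    with open(STAGING_FILE, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "name", "email"])
        writer.writerows(raw_rows)


def transform_and_load() -> None:
    # If this step fails, it can be rerun from the staged file without touching
    # the source system again.
    with open(STAGING_FILE, newline="") as f:
        for row in csv.DictReader(f):
            record = {"customer_id": int(row["customer_id"]), "name": row["name"].upper()}
            print("LOAD", record)


if __name__ == "__main__":
    if not os.path.exists(STAGING_FILE):
        extract_to_staging()
    transform_and_load()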
4.4.1.4 Which Change Data Capture (CDC) mechanisms should we choose?
CDC is the capability of being able to isolate the relevant changes to the source data since the last data warehouse
load. The most relevant CDC mechanisms are reviewed in [1], pages 376-378, and summarized in section 5.3.5
below. Finding the most comprehensive strategy can be elusive and you must clearly evaluate your strategy for each data
source. The good news is that many databases provide CDC capabilities out of the box. Our experience with
them, which we justify in more detail in sections 5.4.1 and 7.2.4, is that using the CDC capabilities from the source
databases is worth the effort.
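As one simple illustration, and only as a sketch under assumed table and column names, the audit-column (timestamp high-water-mark) flavor of CDC can be expressed as an incremental extract query; log- or trigger-based CDC offered by the source database would replace this query where available:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "2017-01-01T10:00:00"),
        (2, "Grace", "2017-01-03T09:30:00"),
        (3, "Edsger", "2017-01-05T14:15:00"),
    ],
)

# High-water mark recorded by the previous warehouse load.
last_load_watermark = "2017-01-02T00:00:00"

# Capture only the rows changed since the last load.
changed = conn.execute(
    "SELECT id, name, updated_at FROM customers "
    "WHERE updated_at > ? ORDER BY updated_at",
    (last_load_watermark,),
).fetchall()

for row in changed:
    print("CHANGED", row)

# The watermark for the next run is the latest timestamp just processed.
if changed:
    last_load_watermark = changed[-1][2]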