
So, is ELT better than ETL? Not necessarily. There are scenarios where the ETL approach can provide better performance. For instance, if you are streaming data in real time (from messages, database changes or sensor data), and your data pipelines have multiple steps that can be processed in main memory without requiring any reading (e.g., data lookups) or writing (e.g., no generation of history records) to the database, the time to process the data while it moves through is small relative to the time required to read from or write to disk. Even complex calculations performed in memory incur a negligible performance hit. Furthermore, if the data can be split without affecting the computation, parallel execution of pipelines, which is possible in many ETL tools, may yield further significant performance gains. Another scenario that works better with ETL is when there are several databases, data warehouses or data marts to be fed after the transformations, or the transformations are so complex that they cannot be translated into ELT code.
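
To make the in-memory scenario concrete, here is a minimal Python sketch, not taken from any specific tool: the record fields and step names are hypothetical. It chains transformation steps entirely in memory and runs independent partitions in parallel, mirroring the parallel-pipeline capability many ETL tools offer:

# A minimal sketch of an in-memory, multi-step ETL pipeline over data
# that can be partitioned independently; field names are hypothetical.
from multiprocessing import Pool

def clean(record):
    # Step 1: normalize a field entirely in memory (no database round trip).
    record["name"] = record["name"].strip().lower()
    return record

def enrich(record):
    # Step 2: derive a new field from existing ones, again in memory.
    record["total"] = record["qty"] * record["unit_price"]
    return record

def run_pipeline(partition):
    # Chain the in-memory steps over one independent partition of the data.
    return [enrich(clean(r)) for r in partition]

if __name__ == "__main__":
    partitions = [
        [{"name": " Alice ", "qty": 2, "unit_price": 9.5}],
        [{"name": "BOB", "qty": 1, "unit_price": 4.0}],
    ]
    # Because partitions are independent, they can run in parallel.
    with Pool() as pool:
        results = pool.map(run_pipeline, partitions)
    print(results)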

On the other hand, all things being equal, the reason why ELT has become a valid and important architectural pattern in data integration is twofold: (i) it is preferable to have fewer moving parts: by reducing the data movement through processing engines you may get better performance, and this also translates into cost and risk advantages in terms of money, development time and maintenance; and (ii) using all the horsepower of your data engine to load and process your data will also help with performance.
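
As an illustration of the ELT pattern, the following sketch loads raw rows first and then pushes the transformation down to the engine as SQL. This is a hedged example, not a prescribed implementation: sqlite3 stands in for a real warehouse engine, and the table and column names are assumptions.

# A minimal ELT sketch: load raw data once, then let the engine transform it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (qty INTEGER, unit_price REAL)")

# Load: move the data once, untransformed, into the engine.
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [(2, 9.5), (1, 4.0)])

# Transform: push the computation down to the engine instead of moving
# the data back out through a separate processing tier.
conn.execute("""
    CREATE TABLE sales AS
    SELECT qty, unit_price, qty * unit_price AS total
    FROM raw_sales
""")

print(conn.execute("SELECT * FROM sales").fetchall())
conn.close()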

A related decision is the following:

4.4.1.3 Should we stage the data during ETL or not?

As mentioned in the point above, ETL tools can establish a direct connection to the source database, extract and stream the data through the ETL tool to apply any required transformation in memory, and finally write it, only once, into the target data warehouse table. However, despite the performance advantages, this may not be the best approach. There are several reasons an organization might decide to physically stage the data (i.e., write it to disk) during the ETL process (a sketch of a staged extract follows the list):

- The organization may need to stage the data immediately after extract for archival purposes, possibly to meet compliance and audit requirements.

- A recovery/restart point is desired in the event the ETL job fails in midstream, potentially due to a break in the connection between the source and ETL environment.

- Long-running ETL processes may hold open a connection to the source system that creates problems with database locks and stresses the transaction system.
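
The sketch below illustrates the staging idea under assumed file paths and formats (all hypothetical): the raw extract is written to disk once, giving both an archival copy and a restart point, before any transformation runs.

# A minimal sketch of staging the extract to disk before transformation.
import csv, os

STAGE_FILE = "stage/extract_batch.csv"  # hypothetical staging path

def extract_to_stage(rows):
    # Persist the raw extract once; compliance/audit can keep this file.
    os.makedirs(os.path.dirname(STAGE_FILE), exist_ok=True)
    with open(STAGE_FILE, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def transform_from_stage():
    # On failure, rerun from the staged file instead of reconnecting to
    # the source system (and holding its locks) a second time.
    with open(STAGE_FILE, newline="") as f:
        return [[cell.strip().lower() for cell in row]
                for row in csv.reader(f)]

extract_to_stage([[" Alice ", "2"], ["BOB", "1"]])
print(transform_from_stage())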

4.4.1.4 Which Change Data Capture (CDC) mechanisms should we choose?

CDC is the capability of being able to isolate the relevant changes to the source data since the last data warehouse load. The most relevant CDC mechanisms are reviewed in [1], pages 376-378, and summarized in section 5.3.5 below.

Finding the most comprehensive strategy can be elusive and you must clearly evaluate your strategy for each data source. The good news is that many databases provide CDC capabilities out of the box. Our experience with them, which we justify in more detail in sections 5.4.1 and 7.2.4, is that using the CDC capabilities from the source databases is worth the effort.
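
As one concrete illustration, here is a timestamp-based "high watermark" sketch, which is only one of the mechanisms the referenced sections cover and not necessarily the one they recommend. sqlite3 stands in for the source database, and the table and column names are assumptions.

# A minimal high-watermark CDC sketch: isolate rows changed since the
# last warehouse load by filtering on a last-updated timestamp column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-01T10:00:00"),
                  (2, "2024-01-02T09:30:00")])

def extract_changes(last_watermark):
    # Only the rows touched after the stored watermark are extracted.
    return conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,)).fetchall()

# A load whose watermark is end-of-day Jan 1 picks up only order 2.
print(extract_changes("2024-01-01T23:59:59"))
conn.close()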