

3.2.1 Hadoop for ETL workloads
When the transformations required by the analytics project involve the execution of complex SQL queries and
procedures, and/or when, for performance reasons, execution must take place in a target repository that is not Hadoop
but, e.g., an MPP relational database, then Hadoop is clearly an unlikely candidate: to start with, ANSI-standard SQL
and other relational technologies are not fully supported on Hadoop today. This situation may change in the future,
given Hadoop's recent improvements in this respect (e.g., the introduction of declarative technology in Spark 2.0;
see the query engines below), and if 3rd-party tools make it easy to develop native Hadoop ETL workloads and
provide high-level data management abstractions. For instance, capabilities such as Change Data Capture (CDC),
merging UPDATEs and capturing history are much easier to achieve in relational databases.
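To give a flavour of what this declarative technology looks like on Hadoop, the sketch below runs a SQL aggregation over Parquet data on HDFS through Spark 2.0's SparkSession API. It is only a minimal illustration; the paths, table name and column names are hypothetical, not taken from any particular deployment.

```python
from pyspark.sql import SparkSession

# Minimal sketch of Spark 2.0's declarative (SQL/DataFrame) interface over HDFS data.
# Paths and column names below are hypothetical.
spark = SparkSession.builder.appName("sql-on-hadoop-sketch").getOrCreate()

orders = spark.read.parquet("hdfs:///warehouse/orders")   # assumed Parquet dataset
orders.createOrReplaceTempView("orders")

daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

daily_revenue.write.mode("overwrite").parquet("hdfs:///warehouse/daily_revenue")
```

This covers straightforward aggregations well; procedural SQL and UPDATE/MERGE-heavy logic, as noted above, remains better served by a relational engine.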
At the other end of the spectrum, some ETL jobs may need only basic relational capabilities, or even none at all, e.g.,
computation expressed in MapReduce (MR). ETL jobs that perform simple aggregations and derived computations at
massive scale are well suited to the Hadoop framework, and they can be developed for a fraction of the cost of a
high-end ETL tool. Indeed, those were Hadoop's origins: MR was designed to optimize simple computations such as
click sums and page views on massive data.
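As a concrete example of this kind of workload, the sketch below counts page views per URL in the classic mapper/reducer style, written for Hadoop Streaming. The script name and the assumption that the requested URL sits in the seventh field of each log line are hypothetical.

```python
#!/usr/bin/env python
# pageviews.py -- minimal Hadoop Streaming sketch: page views per URL.
# Run the same file as "pageviews.py map" for the mapper and "pageviews.py reduce"
# for the reducer. The log layout (URL in field 7) is an assumption.
import sys

def mapper():
    # Emit one (url, 1) pair per request line read from stdin.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) >= 7:
            print("%s\t1" % fields[6])

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so a running
    # total per URL is sufficient.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t", 1)
        if url != current_url and current_url is not None:
            print("%s\t%d" % (current_url, count))
            count = 0
        current_url = url
        count += int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Such a script is typically submitted with the hadoop-streaming JAR using its -mapper, -reducer, -input and -output options.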
Today, numerous extensions to the Hadoop framework exist both for moving data and for querying and managing
data.
— For moving data, the new tools below include visual interfaces to create, manage, monitor and troubleshoot
data pipelines, a welcome change from the initial set of tools around Hadoop, which only offered programmatic
interfaces and were hard to use and debug.
   — Apache NiFi [32] is a platform for automating the movement of data between disparate systems. It
supports directed graphs for data routing, transformation, and system mediation logic via a web interface.
Workflows can be monitored, and data provenance features track data from source to destination.
   — Apache Storm [33] adds fast and reliable real-time data processing capabilities to Hadoop. It provides
Complex Event Processing capabilities, processing and joining of data from multiple streams, and rolling-window
aggregation functionality.
   — Kafka [34], also from the Apache foundation, is a fast, scalable, durable, and fault-tolerant publish-subscribe
messaging system. Kafka is often used in place of traditional message brokers, such as those based on JMS and
AMQP, because of its higher throughput, reliability and replication. A minimal publish/subscribe sketch is given
after this list.
   — Airflow [35], from Airbnb, is an open source platform to programmatically author, schedule and monitor
data pipelines as directed acyclic graphs (DAGs) of tasks. The rich GUI makes it easy to visualize production
pipelines, monitor progress, and troubleshoot issues. A minimal DAG definition is sketched after this list.
— For querying and managing data: Spark, Impala, Hive/Tez, and Presto. All of these engines have
improved dramatically over the past year. Impala and Presto continue to lead in BI-type queries, while Spark
leads performance-wise in large analytics queries. Hive/Tez with the new LLAP (Live Long and Process)
feature has made impressive gains across the board and is now close to the other engines.
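The following sketch illustrates the publish/subscribe pattern mentioned for Kafka above, using the third-party kafka-python client; the broker address, topic name and consumer group are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a click event to an assumed "clickstream" topic on a local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", key=b"user-123", value=b'{"page": "/home"}')
producer.flush()

# Subscribe to the same topic as part of an assumed "etl-loader" consumer group
# and read events from the beginning of the retained log.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="etl-loader",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.key, message.value)
```

Because the log is durable and replicated, several such consumers (e.g., an ETL loader and a real-time dashboard) can read the same stream independently.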
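Similarly, the sketch below shows what a small Airflow DAG looks like, in the Airflow 1.x style current when this paper was written; the DAG name, the HDFS paths and the Spark job it calls are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical nightly pipeline: stage raw click logs, then aggregate them with Spark.
dag = DAG(
    dag_id="nightly_clickstream_etl",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

stage_logs = BashOperator(
    task_id="stage_logs",
    bash_command="hdfs dfs -cp /landing/clicks /staging/clicks",
    dag=dag,
)

aggregate_views = BashOperator(
    task_id="aggregate_views",
    bash_command="spark-submit aggregate_views.py /staging/clicks /warehouse/page_views",
    dag=dag,
)

stage_logs >> aggregate_views  # aggregate_views runs only after stage_logs succeeds
```

The GUI mentioned above then shows each run of this DAG task by task, which is what makes monitoring and troubleshooting production pipelines straightforward.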
Finally, using high-end ETL tools or specialized ELT add-ins for Hadoop remains an option, especially if they are
already present in the organization, because, as mentioned above, these systems are now able to push down the
processing of high-level flows into Hadoop, once again providing development productivity and administrative control.