

3.2.1 Hadoop for ETL workloads
When the transformations required by the analytics project involve the execution of complex SQL queries and
procedures, and/or when, for performance reasons, execution must take place in a target repository that is not Hadoop
but, e.g., an MPP relational database, then Hadoop is clearly an unlikely candidate: to start with, ANSI-standard SQL
and other relational technologies are not fully supported on Hadoop today. This situation may change in the future,
given Hadoop's recent improvements in this respect (e.g., the introduction of declarative technology in Spark 2.0;
see the query engines below), and if 3rd-party tools make it easy to develop native Hadoop ETL workloads and
provide high-level data management abstractions. For instance, capabilities such as Change Data Capture (CDC),
merging UPDATEs and capturing history are much easier to achieve in relational databases.
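To give a flavour of what this declarative technology looks like on Hadoop, the sketch below runs a SQL aggregation over Parquet data on HDFS through Spark 2.0's SparkSession API. It is only a minimal illustration; the paths, table name and column names are hypothetical, not taken from any particular deployment.

```python
from pyspark.sql import SparkSession

# Minimal sketch of Spark 2.0's declarative (SQL/DataFrame) interface over HDFS data.
# Paths and column names below are hypothetical.
spark = SparkSession.builder.appName("sql-on-hadoop-sketch").getOrCreate()

orders = spark.read.parquet("hdfs:///warehouse/orders")   # assumed Parquet dataset
orders.createOrReplaceTempView("orders")

daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

daily_revenue.write.mode("overwrite").parquet("hdfs:///warehouse/daily_revenue")
```

This covers straightforward aggregations well; procedural SQL and UPDATE/MERGE-heavy logic, as noted above, remains better served by a relational engine.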
At the other end of the spectrum, some ETL jobs may need only basic relational capabilities, or even none at all, e.g.,
computation expressed in MapReduce (MR). ETL jobs that perform simple aggregations and derived computations at
massive scale are well suited to the Hadoop framework, and they can be developed for a fraction of the cost of a
high-end ETL tool. Indeed, those were Hadoop's origins: MR was designed to optimize simple computations such as
click sums and page views on massive data.
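As a concrete example of this kind of workload, the sketch below counts page views per URL in the classic mapper/reducer style, written for Hadoop Streaming. The script name and the assumption that the requested URL sits in the seventh field of each log line are hypothetical.

```python
#!/usr/bin/env python
# pageviews.py -- minimal Hadoop Streaming sketch: page views per URL.
# Run the same file as "pageviews.py map" for the mapper and "pageviews.py reduce"
# for the reducer. The log layout (URL in field 7) is an assumption.
import sys

def mapper():
    # Emit one (url, 1) pair per request line read from stdin.
    for line in sys.stdin:
        fields = line.split()
        if len(fields) >= 7:
            print("%s\t1" % fields[6])

def reducer():
    # Hadoop Streaming delivers mapper output sorted by key, so a running
    # total per URL is sufficient.
    current_url, count = None, 0
    for line in sys.stdin:
        url, value = line.rstrip("\n").split("\t", 1)
        if url != current_url and current_url is not None:
            print("%s\t%d" % (current_url, count))
            count = 0
        current_url = url
        count += int(value)
    if current_url is not None:
        print("%s\t%d" % (current_url, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Such a script is typically submitted with the hadoop-streaming JAR using its -mapper, -reducer, -input and -output options.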
Today, numerous extensions to the Hadoop framework exist both for moving data and for querying and managing
data.
— For moving data, the new tools below include visual interfaces to create, manage, monitor and troubleshoot
data pipelines, a welcome change from the initial set of tools around Hadoop, which only offered programmatic
interfaces and were hard to use and debug.
   — Apache NiFi [32] is a platform for automating the movement of data between disparate systems. It
supports directed graphs for data routing, transformation, and system mediation logic via a web interface.
Workflows can be monitored, and data provenance features track data from source to destination.
   — Apache Storm [33] adds fast and reliable real-time data processing capabilities to Hadoop. It provides
Complex Event Processing capabilities, processing and joining of data from multiple streams, and rolling-window
aggregation functionality.
   — Kafka [34], also from the Apache foundation, is a fast, scalable, durable, and fault-tolerant publish-subscribe
messaging system. Kafka is often used in place of traditional message brokers, such as those based on JMS and
AMQP, because of its higher throughput, reliability and replication. A minimal publish/subscribe sketch is given
after this list.
   — Airflow [35], from Airbnb, is an open source platform to programmatically author, schedule and monitor
data pipelines as directed acyclic graphs (DAGs) of tasks. The rich GUI makes it easy to visualize production
pipelines, monitor progress, and troubleshoot issues. A minimal DAG definition is sketched after this list.
— For querying and managing data: Spark, Impala, Hive/Tez, and Presto. All of these engines have
improved dramatically over the past year. Impala and Presto continue to lead in BI-type queries, while Spark
leads performance-wise in large analytics queries. Hive/Tez with the new LLAP (Live Long and Process)
feature has made impressive gains across the board and is now close to the other engines.
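The following sketch illustrates the publish/subscribe pattern mentioned for Kafka above, using the third-party kafka-python client; the broker address, topic name and consumer group are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a click event to an assumed "clickstream" topic on a local broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", key=b"user-123", value=b'{"page": "/home"}')
producer.flush()

# Subscribe to the same topic as part of an assumed "etl-loader" consumer group
# and read events from the beginning of the retained log.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    group_id="etl-loader",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.key, message.value)
```

Because the log is durable and replicated, several such consumers (e.g., an ETL loader and a real-time dashboard) can read the same stream independently.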
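Similarly, the sketch below shows what a small Airflow DAG looks like, in the Airflow 1.x style current when this paper was written; the DAG name, the HDFS paths and the Spark job it calls are hypothetical.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Hypothetical nightly pipeline: stage raw click logs, then aggregate them with Spark.
dag = DAG(
    dag_id="nightly_clickstream_etl",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

stage_logs = BashOperator(
    task_id="stage_logs",
    bash_command="hdfs dfs -cp /landing/clicks /staging/clicks",
    dag=dag,
)

aggregate_views = BashOperator(
    task_id="aggregate_views",
    bash_command="spark-submit aggregate_views.py /staging/clicks /warehouse/page_views",
    dag=dag,
)

stage_logs >> aggregate_views  # aggregate_views runs only after stage_logs succeeds
```

The GUI mentioned above then shows each run of this DAG task by task, which is what makes monitoring and troubleshooting production pipelines straightforward.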
Finally, using high-end ETL tools or specialized ELT add-ins for Hadoop remains an option, especially if they are
already present in the organization, because, as mentioned above, these systems are now able to push down the
processing of high-level flows into Hadoop, once again providing development productivity and administrative control.