
WHITE PAPER

© 2017 Persistent Systems Ltd. All rights reserved.

www.persistent.com

Included below are some best practices for tuning Apache Pig queries:

Explicitly declare types for columns, to avoid lazy evaluation and type conversions downstream

Run Pig jobs in Apache Tez mode – for an optimized job flow, edge semantics, and container reuse

Use an optimal ratio of mappers to reducers (roughly 4:1), based on the size of the input files; e.g. if the input file size is 1.280 GB and the Hadoop data split size is 128 MB, use 10 mappers and 2 reducers

Reorder JOINs to reduce I/O – join with the small or filtered set first

Project out columns early and often

Prefer DISTINCT over GROUP BY/GENERATE

Push up FILTER – filter early, filter often

Push up LIMIT – reduce the number of rows

Push down FLATTEN

Enable LZO compression
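Several of the practices above can be seen in a single script. The following is an illustrative Pig Latin sketch, not taken from the paper; the relation names, paths, and field schemas are assumptions made for the example:

```pig
-- Hypothetical sales data; paths and fields are illustrative only.
sales = LOAD '/data/sales' USING PigStorage(',')
        AS (id:long, region:chararray, amount:double, ts:long);  -- explicit types
recent = FILTER sales BY ts >= 1483228800L;                      -- filter early
slim   = FOREACH recent GENERATE region, amount;                 -- project columns early
regions = LOAD '/data/regions' USING PigStorage(',')
          AS (region:chararray, name:chararray);                 -- small lookup relation
-- Fragment-replicated join: relations after the first are loaded into memory,
-- so list the small set last.
joined = JOIN slim BY region, regions BY region USING 'replicated';
STORE joined INTO '/data/joined' USING PigStorage(',');
```

Run the script with `pig -x tez script.pig` to execute it on the Tez engine rather than classic MapReduce.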

Newer frameworks are available to perform interactive data analysis over data on Hadoop, also referred to as SQL-on-Hadoop. These frameworks include:

Hive running on the Apache Tez engine, a DAG execution engine with a cost-based optimizer that reduces wait time.

Presto, from Facebook, a stage-pipelining engine with in-memory data transfer and reduced or no I/O.

Impala, from Cloudera, an MPP execution engine on top of Hadoop, working with the data stored in HDFS.

SparkSQL, an in-memory execution engine that tries to combine the best of the MapReduce and MPP-over-Hadoop approaches.

These engines do not leverage the MapReduce paradigm (except for SparkSQL) but instead use in-memory, distributed computation to speed up queries. For a query workload calculating MAX, MIN, and AVG (Query-1) and STANDARD DEVIATION (Query-2) over time-series data with 16.5 million records residing on Hadoop, here are the numbers from a 5-node cluster.
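The two benchmark query workloads can be sketched in HiveQL. The table and column names below are assumptions for illustration, not from the paper:

```sql
-- Query-1: basic aggregates over the time-series readings
SELECT MAX(reading), MIN(reading), AVG(reading)
FROM sensor_readings;

-- Query-2: population standard deviation
SELECT STDDEV_POP(reading)
FROM sensor_readings;
```

The same SQL runs largely unchanged on Hive-on-Tez, Presto, Impala, and SparkSQL, which is what makes a like-for-like engine comparison possible.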

Apache HBase, which also runs on top of the Hadoop data nodes, provides fast data inserts and reads with very low latency for locating relevant rows. However, this engine is not recommended for analytics workloads.

Some of the HBase performance tuning best practices are:

Control the number of HFiles

HBase performs major compactions on a time-based schedule; tune the interval

Tune MemStore/cache sizing; increase the JVM heap size for the region servers

Many cores are good, as HBase is CPU intensive

OS – disable swap, or the JVM will respond poorly and the OS will throw OOM errors

Tune JVM parameters such as -XX:SurvivorRatio=4 and -XX:MaxTenuringThreshold=0

Increase the write buffer (hbase.client.write.buffer = 8388608) for better write performance

Enable Bloom filters, to save trips to disk and improve read latencies
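The compaction-interval and write-buffer settings above live in hbase-site.xml. The fragment below is a minimal sketch; the values are illustrative starting points, not tuned recommendations, and should be adjusted per workload:

```xml
<!-- hbase-site.xml fragment (illustrative values) -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <!-- interval between time-based major compactions, in ms (7 days here) -->
  <value>604800000</value>
</property>
<property>
  <name>hbase.client.write.buffer</name>
  <!-- 8 MB client-side write buffer, as suggested above -->
  <value>8388608</value>
</property>
```

JVM flags such as -XX:SurvivorRatio=4 are passed to the region servers separately, via HBASE_OPTS in hbase-env.sh.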