
WHITE PAPER

© 2017 Persistent Systems Ltd. All rights reserved.

www.persistent.com

Included below are some best practices for tuning Apache Pig queries:

Explicitly declare types for columns, to avoid lazy evaluation and type conversions downstream

Run Pig jobs in Apache Tez mode – for an optimized job flow, edge semantics, and container reuse

Use an optimal ratio of mappers to reducers (roughly 4:1), based on the size of the input files; e.g. if the input file size is 1.280 GB and the Hadoop data split size is 128 MB, use 10 mappers and 2 reducers

Reorder JOINs to reduce I/O – join with the small or filtered set first

Project out columns early and often

Prefer DISTINCT over GROUP BY/GENERATE

Push up FILTER – filter early, filter often

Push up LIMIT – reduce the number of rows

Push down FLATTEN

Enable LZO compression
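Several of the practices above can be seen in a single script. The following is an illustrative Pig Latin sketch, not taken from the paper; the relation names, paths, and field schemas are assumptions made for the example:

```pig
-- Hypothetical sales data; paths and fields are illustrative only.
sales = LOAD '/data/sales' USING PigStorage(',')
        AS (id:long, region:chararray, amount:double, ts:long);  -- explicit types
recent = FILTER sales BY ts >= 1483228800L;                      -- filter early
slim   = FOREACH recent GENERATE region, amount;                 -- project columns early
regions = LOAD '/data/regions' USING PigStorage(',')
          AS (region:chararray, name:chararray);                 -- small lookup relation
-- Fragment-replicated join: relations after the first are loaded into memory,
-- so list the small set last.
joined = JOIN slim BY region, regions BY region USING 'replicated';
STORE joined INTO '/data/joined' USING PigStorage(',');
```

Run the script with `pig -x tez script.pig` to execute it on the Tez engine rather than classic MapReduce.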

Newer frameworks are available to perform interactive data analysis over data on Hadoop, also referred to as SQL-on-Hadoop. These frameworks include:

Hive running on the Apache Tez engine, a DAG execution engine with a cost-based optimizer that reduces wait time.

Presto, from Facebook, a stage-pipelining engine with in-memory data transfer and reduced or no I/O.

Impala, from Cloudera, an MPP execution engine on top of Hadoop, working with the data stored in HDFS.

SparkSQL, an in-memory execution engine that tries to combine the best of the MapReduce and MPP-over-Hadoop approaches.

These engines do not leverage the MapReduce paradigm (except for SparkSQL) but instead use in-memory, distributed computation to speed up queries. For a query workload calculating MAX, MIN, and AVG (Query-1) and STANDARD DEVIATION (Query-2) over time-series data with 16.5 million records residing on Hadoop, here are the numbers from a 5-node cluster.
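The two benchmark query workloads can be sketched in HiveQL. The table and column names below are assumptions for illustration, not from the paper:

```sql
-- Query-1: basic aggregates over the time-series readings
SELECT MAX(reading), MIN(reading), AVG(reading)
FROM sensor_readings;

-- Query-2: population standard deviation
SELECT STDDEV_POP(reading)
FROM sensor_readings;
```

The same SQL runs largely unchanged on Hive-on-Tez, Presto, Impala, and SparkSQL, which is what makes a like-for-like engine comparison possible.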

Apache HBase, which also runs on top of the Hadoop data nodes, provides fast data inserts and reads with very low latency for locating relevant rows. However, this engine is not recommended for analytics workloads.

Some of the HBase performance tuning best practices are:

Control the number of HFiles

HBase performs major compactions on a time-based schedule; tune the interval

Tune MemStore/cache sizing; increase the JVM heap size for the region servers

Many cores are good, as HBase is CPU intensive

OS – disable swap, or the JVM will respond poorly and the OS will throw OOM errors

Tune JVM parameters such as -XX:SurvivorRatio=4 and -XX:MaxTenuringThreshold=0

Increase the write buffer (hbase.client.write.buffer = 8388608) for better write performance

Enable Bloom filters, to save trips to disk and improve read latencies
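The compaction-interval and write-buffer settings above live in hbase-site.xml. The fragment below is a minimal sketch; the values are illustrative starting points, not tuned recommendations, and should be adjusted per workload:

```xml
<!-- hbase-site.xml fragment (illustrative values) -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <!-- interval between time-based major compactions, in ms (7 days here) -->
  <value>604800000</value>
</property>
<property>
  <name>hbase.client.write.buffer</name>
  <!-- 8 MB client-side write buffer, as suggested above -->
  <value>8388608</value>
</property>
```

JVM flags such as -XX:SurvivorRatio=4 are passed to the region servers separately, via HBASE_OPTS in hbase-env.sh.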