

—
Improving performance of computation engines like Spark
—
Recent versions of Spark provide DataFrame and Spark SQL structures that are much faster than the RDD APIs [26], because optimizations are applied as late as possible, and even across functions, through predicate push-down and/or column pruning (illustrated in the first sketch after this list).
—
Tests have revealed that in a clustered environment, improving network performance does not contribute much to overall cluster performance; jobs become more bottlenecked by I/O and CPU.
—
Poor data storage choices can degrade performance drastically: files that are too small may require opening too many file handles, while files that are too large require more splits (64 MB to 1 GB per file may be optimal); choose compression formats that minimize compressed file size; and use efficient data interchange formats (Avro and Parquet are common ones). A storage-layout sketch follows this list.
—
ETL data processing – Apply filtering, cleansing, pruning, conforming, matching, joining, and diagnosing at the earliest possible touch points in the data pipeline (sketched in code after this list).
—
Streaming applications – Make use of continuous query language (CQL) or SQL-like structures provided by some of the libraries to perform computations on smaller chunks of the data, the results of which can be consolidated by other processes downstream (see the streaming sketch after this list).
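
To illustrate the DataFrame point above, here is a minimal Scala sketch; the dataset path and the column names (order_date, customer_id, amount) are hypothetical, and the exact plan depends on the Spark version and data source. Because the filter and projection are declared rather than executed eagerly, Catalyst can push the predicate and the column selection down into the Parquet scan, which an equivalent hand-written RDD pipeline would not get automatically.

    import org.apache.spark.sql.SparkSession

    object PushdownSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("pushdown-sketch")
          .getOrCreate()

        // Hypothetical dataset and columns, for illustration only.
        val orders = spark.read.parquet("/data/orders")

        // The filter and projection are declarative; Catalyst applies
        // predicate push-down and column pruning when it builds the
        // physical plan, so only the needed rows and columns are read.
        val recentTotals = orders
          .filter(orders("order_date") >= "2017-01-01")
          .select("customer_id", "amount")

        // explain() shows PushedFilters and the pruned ReadSchema
        // in the Parquet scan node of the physical plan.
        recentTotals.explain()

        spark.stop()
      }
    }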
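
For the storage-layout item, the following sketch compacts many small files into fewer compressed Parquet files; the paths, the partition count of 16, and the Snappy codec are illustrative assumptions, and the right partition count depends on the actual data volume so that each output file lands roughly in the 64 MB to 1 GB range mentioned above.

    import org.apache.spark.sql.SparkSession

    object StorageLayoutSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("storage-layout-sketch")
          .getOrCreate()

        // Hypothetical input: a directory of many small JSON files.
        val events = spark.read.json("/data/raw/events")

        // Coalesce to a handful of partitions so each output file is
        // reasonably sized, and write a compressed, splittable format.
        events
          .coalesce(16)
          .write
          .option("compression", "snappy") // minimize on-disk size
          .parquet("/data/curated/events")

        spark.stop()
      }
    }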
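
For the ETL item, a sketch with hypothetical tables and columns that applies filtering, cleansing, and pruning at the first touch point, before the more expensive join and the final write:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, trim, upper}

    object EarlyFilterEtlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("early-filter-etl-sketch")
          .getOrCreate()

        // Hypothetical source datasets.
        val rawCustomers = spark.read.parquet("/data/raw/customers")
        val rawOrders    = spark.read.parquet("/data/raw/orders")

        // Cleanse, prune, and filter each source as early as possible
        // so the join only sees the rows and columns it needs.
        val customers = rawCustomers
          .filter(col("country") === "US")
          .select(col("customer_id"), upper(trim(col("name"))).as("name"))

        val orders = rawOrders
          .filter(col("status") =!= "CANCELLED")
          .select("customer_id", "order_id", "amount")

        // Join and write only the conformed, reduced datasets.
        customers
          .join(orders, Seq("customer_id"))
          .write
          .parquet("/data/curated/customer_orders")

        spark.stop()
      }
    }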
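
For the streaming item, a sketch using Spark Structured Streaming as one possible library; the socket source, the port, and the one-minute window are illustrative assumptions. It runs a SQL-like windowed aggregation over small chunks of the stream, and the partial results it emits could be consolidated by a downstream process.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, window}

    object StreamingWindowSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("streaming-window-sketch")
          .getOrCreate()

        // Hypothetical socket source; includeTimestamp adds an arrival
        // timestamp column alongside the raw "value" column.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .option("includeTimestamp", true)
          .load()

        // SQL-like computation on small chunks: count records per
        // one-minute window instead of over the whole stream.
        val perWindowCounts = lines
          .groupBy(window(col("timestamp"), "1 minute"))
          .count()

        // Each micro-batch emits updated window counts; a downstream
        // process can consolidate these partial results.
        val query = perWindowCounts.writeStream
          .outputMode("update")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }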
7.3.4.2 Security
The use of big data technology offers such great opportunities in terms of cost reduction that organizations are looking
to store and process both sensitive and non-sensitive data in big data repositories. However, big data presents
several challenges to data security. Security and privacy challenges are magnified by the velocity, volume, and variety
of big data.
Volume
- Managing massive amounts of data increases the risk and magnitude of a potential data breach. Sensitive
personal data and confidential data can be exposed, violating compliance and data-security regulations. Many
existing enterprise security products are licensed by user or by CPU. This model makes sense in traditional systems
but is problematic in distributed big data systems because of the larger user base.
Variety
- Understanding what data you have (or don't have) in any dataset is made more difficult when you must
manage and enforce data access and usage of unstructured data from a variety of sources.
Velocity
- The faster data arrives, the more difficult backup and restore operations become. Throughput is also
challenging, because traditional security tools are designed and optimized for traditional system architectures and
may not be equipped for the throughput required to keep up with big data velocity rates.
Security drives business value by enabling safe handling of regulated data, but new technologies are required: as
noted above, traditional security tools may not deliver the throughput needed to keep up with big data. Although
existing and emerging security capabilities apply to big data, many organizations lack the tools to protect, monitor,
and manage access to data processed at high rates and in a variety of formats.
The approach to securing big data varies by platform: Hadoop distributed file systems, NoSQL databases, and search
services.