

—
Improving performance of computation engines like Spark
—
Recent versions of Spark provide DataFrame and Spark SQL structures that are much faster than the RDD APIs [26], because optimizations are applied as late as possible, and even across functions, through predicate push-down and/or column pruning (illustrated in the first sketch after this list).
—
Tests have revealed that in a clustered environment, improving network performance does not contribute much to overall cluster performance; jobs become more bottlenecked by I/O and CPU.
—
Poor data storage choices can degrade performance drastically: files that are too small may require opening too many file handles, while files that are too large require more splits (64 MB to 1 GB per file may be optimal); choose compression formats that minimize compressed file size; and use efficient data interchange formats (Avro and Parquet are common ones). A storage-layout sketch follows this list.
—
ETL data processing – Apply filtering, cleansing, pruning, conforming, matching, joining, and diagnosing at the earliest possible touch points in the data pipeline (sketched in code after this list).
—
Streaming applications – Make use of continuous query language (CQL) or SQL-like structures provided by some of the libraries to perform computations on smaller chunks of the data, the results of which can be consolidated by other processes downstream (see the streaming sketch after this list).
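
To illustrate the DataFrame point above, here is a minimal Scala sketch; the dataset path and the column names (order_date, customer_id, amount) are hypothetical, and the exact plan depends on the Spark version and data source. Because the filter and projection are declared rather than executed eagerly, Catalyst can push the predicate and the column selection down into the Parquet scan, which an equivalent hand-written RDD pipeline would not get automatically.

    import org.apache.spark.sql.SparkSession

    object PushdownSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("pushdown-sketch")
          .getOrCreate()

        // Hypothetical dataset and columns, for illustration only.
        val orders = spark.read.parquet("/data/orders")

        // The filter and projection are declarative; Catalyst applies
        // predicate push-down and column pruning when it builds the
        // physical plan, so only the needed rows and columns are read.
        val recentTotals = orders
          .filter(orders("order_date") >= "2017-01-01")
          .select("customer_id", "amount")

        // explain() shows PushedFilters and the pruned ReadSchema
        // in the Parquet scan node of the physical plan.
        recentTotals.explain()

        spark.stop()
      }
    }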
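
For the storage-layout item, the following sketch compacts many small files into fewer compressed Parquet files; the paths, the partition count of 16, and the Snappy codec are illustrative assumptions, and the right partition count depends on the actual data volume so that each output file lands roughly in the 64 MB to 1 GB range mentioned above.

    import org.apache.spark.sql.SparkSession

    object StorageLayoutSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("storage-layout-sketch")
          .getOrCreate()

        // Hypothetical input: a directory of many small JSON files.
        val events = spark.read.json("/data/raw/events")

        // Coalesce to a handful of partitions so each output file is
        // reasonably sized, and write a compressed, splittable format.
        events
          .coalesce(16)
          .write
          .option("compression", "snappy") // minimize on-disk size
          .parquet("/data/curated/events")

        spark.stop()
      }
    }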
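
For the ETL item, a sketch with hypothetical tables and columns that applies filtering, cleansing, and pruning at the first touch point, before the more expensive join and the final write:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, trim, upper}

    object EarlyFilterEtlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("early-filter-etl-sketch")
          .getOrCreate()

        // Hypothetical source datasets.
        val rawCustomers = spark.read.parquet("/data/raw/customers")
        val rawOrders    = spark.read.parquet("/data/raw/orders")

        // Cleanse, prune, and filter each source as early as possible
        // so the join only sees the rows and columns it needs.
        val customers = rawCustomers
          .filter(col("country") === "US")
          .select(col("customer_id"), upper(trim(col("name"))).as("name"))

        val orders = rawOrders
          .filter(col("status") =!= "CANCELLED")
          .select("customer_id", "order_id", "amount")

        // Join and write only the conformed, reduced datasets.
        customers
          .join(orders, Seq("customer_id"))
          .write
          .parquet("/data/curated/customer_orders")

        spark.stop()
      }
    }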
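
For the streaming item, a sketch using Spark Structured Streaming as one possible library; the socket source, the port, and the one-minute window are illustrative assumptions. It runs a SQL-like windowed aggregation over small chunks of the stream, and the partial results it emits could be consolidated by a downstream process.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, window}

    object StreamingWindowSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("streaming-window-sketch")
          .getOrCreate()

        // Hypothetical socket source; includeTimestamp adds an arrival
        // timestamp column alongside the raw "value" column.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .option("includeTimestamp", true)
          .load()

        // SQL-like computation on small chunks: count records per
        // one-minute window instead of over the whole stream.
        val perWindowCounts = lines
          .groupBy(window(col("timestamp"), "1 minute"))
          .count()

        // Each micro-batch emits updated window counts; a downstream
        // process can consolidate these partial results.
        val query = perWindowCounts.writeStream
          .outputMode("update")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }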
7.3.4.2 Security
The use of big data technology offers such great opportunities in terms of cost reduction that organizations are looking
to store and process both sensitive and non-sensitive data in big data repositories. However, big data presents
several challenges to data security. Security and privacy challenges are magnified by the velocity, volume, and variety
of big data.
Volume
- Managing massive amounts of data increases the risk and magnitude of a potential data breach. Sensitive
personal data and confidential data can be exposed, violating compliance and data-security regulations. Many
existing enterprise security products are licensed by user or by CPU. This model makes sense in traditional systems
but is problematic in distributed big data systems because of the larger user base.
Variety
- Understanding what data you have (or don't have) in any dataset is made more difficult when you must
manage and enforce data access and usage of unstructured data from a variety of sources.
Velocity
- The faster data arrives, the more difficult backup and restore operations become. Throughput is also
challenging, because traditional security tools are designed and optimized for traditional system architectures and
may not be equipped for the throughput required to keep up with big data velocity rates.
Security drives business value by enabling safe handling of regulated data, but new technologies are required: as
noted above, traditional security tools may not deliver the throughput needed to keep up with big data. Although
existing and emerging security capabilities apply to big data, many organizations lack the tools to protect, monitor,
and manage access to data processed at high rates and in a variety of formats.
The approach to securing big data varies by platform: Hadoop distributed file systems, NoSQL databases, and search
services.