

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
1. Hadoop is a framework for processing computationally intensive workloads – often batch processes – with
large amounts of structured and unstructured data. The distributed nature of the technology broadens the
attack surface and makes the platform vulnerable, underscoring the need for automated, validated, and
consistent application of security capabilities.
At the infrastructure level, Hadoop files can be protected at the network layer with a firewall that permits client
access to the Hadoop NameNode only; direct client communication with DataNodes is blocked. Data-at-rest
encryption protects sensitive data and helps keep the disks out of audit scope. It also ensures that no
readable residual data remains when data is removed or copied, or when the disks are decommissioned.
Hadoop cluster management security involves delegation tokens and managing their lifetime and renewal.
Data security controls outside Hadoop can be applied to:
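As a sketch of how data-at-rest encryption is commonly enabled in HDFS, the commands below create an encryption key and a transparent-encryption zone; the key name and path are illustrative, and a Hadoop Key Management Server (KMS) is assumed to be configured:

```shell
# Create an encryption key in the Hadoop KMS (key name is illustrative)
hadoop key create projectkey

# Mark a directory as an encryption zone: files written under it are
# encrypted transparently, so the disks hold only ciphertext
hdfs crypto -createZone -keyName projectkey -path /data/secure
```

Because encryption and decryption happen in the HDFS client, decommissioned disks retain only ciphertext, which supports the residual-data guarantee described above.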
—
Data inbound to Hadoop: Sensitive data such as personally identifiable information (PII) can be
excluded from Hadoop altogether, or encrypted before ingestion, as mentioned above.
—
Data retrieved from Hadoop: Data in Hadoop should inherit all data warehouse controls that
enforce standard SQL security and data auditing capabilities. Most computation engines, along with
NoSQL databases such as Cassandra, Couchbase, or HBase and search services such as Elasticsearch,
provide row-level and column-level security through simple add-ons like Apache Ranger or Shield. Apache
Accumulo (a NoSQL database) goes further, allowing data access to be controlled at the cell level within a
column family.
—
Data discovery: Identifying whether sensitive data is present in Hadoop and where it is located, then
triggering appropriate data protection measures such as data masking, tokenization, or encryption.
Compared to structured data, discovering, locating, classifying, and protecting unstructured data is
much more challenging. To mitigate the risk of exposing the entire data set to a small group of people,
the data can be organized into projects or high-level directories in Hadoop, with access granted only to
specific entities or columns. These users or data analysts can then run semi-automated tools or custom
scripts to gather statistics; once the data is discovered, profiled, and analyzed, more granular access
controls can be granted.
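The discovery-and-protection step above can be sketched as a small scan-and-mask pass. The regular expressions and mask token here are illustrative placeholders for the far richer rule sets and context checks that real discovery tools apply:

```python
import re

# Hypothetical patterns for two common PII categories; production tools
# use much larger dictionaries plus validation logic (checksums, context).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def discover_pii(text: str) -> set:
    """Return the set of PII categories detected in a chunk of text."""
    return {name for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)}

def mask_pii(text: str) -> str:
    """Replace every detected PII value with a fixed mask token."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub("***MASKED***", text)
    return text
```

In practice a job like this would run over sampled file contents per project directory, and its findings would drive which directories receive tokenization, encryption, or tighter access controls.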
2. NoSQL databases (such as Cassandra) form a distributed columnar data platform designed for scalability and
performance, often for online transaction processing. Because Cassandra nodes do not have distinct roles,
securing a cluster requires a well-planned, layered approach to evaluating and applying security controls.
3. Search services – With the rapid adoption of search-based tools for interactive and context-based
searches, it is becoming easier to store, search, and analyze data. These distributed search engines make it
easy to protect the data with a username and password, while simplifying the security architecture.
Advanced security features such as encryption, role-based access control, IP filtering, and auditing are also
available. Moreover, these tools support plug-in-based architectures for authentication and authorization
and are easily extensible.
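As one concrete illustration of the username-and-password protection mentioned above, the helper below builds the HTTP Basic Authorization header that search services such as Elasticsearch accept on their REST APIs; the credentials shown are placeholders:

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header.

    The scheme is simply base64("user:password") prefixed with "Basic ";
    it must always travel over TLS, since base64 is encoding, not encryption.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

# Placeholder credentials; a real deployment would load these from a
# secrets store, never from source code.
header = basic_auth_header("search_reader", "s3cret")
```

The resulting string would be sent as the `Authorization` header on each search request; role-based access control on the server side then determines which indices and fields that account may read.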