

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
1. Hadoop is a framework for processing computationally intensive workloads – often batch processes – with
large amounts of structured and unstructured data. The distributed nature of the technology broadens the
attack surface and makes the platform vulnerable, underscoring the need for automated, validated, and
consistent application of security capabilities.
At the infrastructure level, Hadoop files can be protected at the network layer with a firewall that permits client
access to the Hadoop NameNode only; direct client communication with DataNodes is blocked. Data-at-rest
encryption protects sensitive data and helps keep the disks out of audit scope. It also ensures that no
readable residual data remains when data is removed or copied, or when the disks are decommissioned.
Hadoop cluster management security involves delegation tokens and managing their lifetime and renewal.
Data security controls outside Hadoop can be applied to:
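As a sketch of how data-at-rest encryption is commonly enabled in HDFS, the commands below create an encryption key and a transparent-encryption zone; the key name and path are illustrative, and a Hadoop Key Management Server (KMS) is assumed to be configured:

```shell
# Create an encryption key in the Hadoop KMS (key name is illustrative)
hadoop key create projectkey

# Mark a directory as an encryption zone: files written under it are
# encrypted transparently, so the disks hold only ciphertext
hdfs crypto -createZone -keyName projectkey -path /data/secure
```

Because encryption and decryption happen in the HDFS client, decommissioned disks retain only ciphertext, which supports the residual-data guarantee described above.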
—
Data inbound to Hadoop: Sensitive data such as personally identifiable information (PII) can be
excluded from Hadoop altogether, or encrypted before ingestion, as mentioned above.
—
Data retrieved from Hadoop: Data in Hadoop should inherit all data warehouse controls that
enforce standard SQL security and data auditing capabilities. Most computation engines, along with
NoSQL databases such as Cassandra, Couchbase, or HBase and search services such as Elasticsearch,
provide row-level and column-level security through simple add-ons like Apache Ranger or Shield. Apache
Accumulo (a NoSQL database) goes further, allowing data access to be controlled at the cell level within a
column family.
—
Data discovery: Identifying whether sensitive data is present in Hadoop and where it is located, then
triggering appropriate data protection measures such as data masking, tokenization, or encryption.
Compared to structured data, discovering, locating, classifying, and protecting unstructured data is
much more challenging. To mitigate the risk of exposing the entire data set to a small group of people,
the data can be organized into projects or high-level directories in Hadoop, with access granted only to
specific entities or columns. These users or data analysts can then run semi-automated tools or custom
scripts to gather statistics; once the data is discovered, profiled, and analyzed, more granular access
controls can be granted.
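The discovery-and-protection step above can be sketched as a small scan-and-mask pass. The regular expressions and mask token here are illustrative placeholders for the far richer rule sets and context checks that real discovery tools apply:

```python
import re

# Hypothetical patterns for two common PII categories; production tools
# use much larger dictionaries plus validation logic (checksums, context).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def discover_pii(text: str) -> set:
    """Return the set of PII categories detected in a chunk of text."""
    return {name for name, pattern in PII_PATTERNS.items()
            if pattern.search(text)}

def mask_pii(text: str) -> str:
    """Replace every detected PII value with a fixed mask token."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub("***MASKED***", text)
    return text
```

In practice a job like this would run over sampled file contents per project directory, and its findings would drive which directories receive tokenization, encryption, or tighter access controls.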
2. NoSQL databases (such as Cassandra) form a distributed columnar data platform designed for scalability and
performance, often for online transaction processing. Because Cassandra nodes do not have distinct roles,
securing a cluster requires a well-planned, layered approach to evaluating and applying security controls.
3. Search services – With the rapid adoption of search-based tools for interactive and context-based
searches, it is becoming easier to store, search, and analyze data. These distributed search engines make it
easy to protect the data with a username and password, while simplifying the security architecture.
Advanced security features such as encryption, role-based access control, IP filtering, and auditing are also
available. Moreover, these tools support plug-in-based architectures for authentication and authorization
and are easily extensible.
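As one concrete illustration of the username-and-password protection mentioned above, the helper below builds the HTTP Basic Authorization header that search services such as Elasticsearch accept on their REST APIs; the credentials shown are placeholders:

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header.

    The scheme is simply base64("user:password") prefixed with "Basic ";
    it must always travel over TLS, since base64 is encoding, not encryption.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

# Placeholder credentials; a real deployment would load these from a
# secrets store, never from source code.
header = basic_auth_header("search_reader", "s3cret")
```

The resulting string would be sent as the `Authorization` header on each search request; role-based access control on the server side then determines which indices and fields that account may read.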