

WHITE PAPER
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
—
Volume challenge.
It is expensive to warehouse all the data sources an organization wants to integrate,
especially when the queries to be asked of the data are not known in advance. Even where queries are
known, and assuming the sources can be aligned in a consistent schema, query execution may exhibit high
latency on database sizes of hundreds of terabytes or petabytes if the internal architecture is not distributed
and/or massively parallel. Existing techniques for data quality problems such as duplicate records are
likewise strained at these volumes for performance reasons.
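The strain on duplicate detection at volume can be made concrete. The sketch below, with made-up records and a deliberately toy similarity rule, contrasts naive pairwise matching, which is O(n²) and infeasible at hundreds of terabytes, with blocking on a cheap key, the standard way such workloads are tamed:

```python
from itertools import combinations
from collections import defaultdict

# Illustrative records only; field names and values are made up.
records = [
    {"id": 1, "name": "Acme Corp", "zip": "10001"},
    {"id": 2, "name": "ACME Corporation", "zip": "10001"},
    {"id": 3, "name": "Globex", "zip": "60601"},
]

def is_duplicate(a, b):
    # Toy similarity rule: same first four letters of the name, case-insensitive.
    return a["name"][:4].lower() == b["name"][:4].lower()

# Naive approach: compare every pair of records -- O(n^2) comparisons.
naive_pairs = [(a["id"], b["id"])
               for a, b in combinations(records, 2)
               if is_duplicate(a, b)]

# Blocking: only compare records that share a cheap blocking key (here,
# the zip code), drastically shrinking the number of comparisons.
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)
blocked_pairs = [(a["id"], b["id"])
                 for block in blocks.values()
                 for a, b in combinations(block, 2)
                 if is_duplicate(a, b)]

print(naive_pairs)    # [(1, 2)]
print(blocked_pairs)  # [(1, 2)]
```

Both approaches find the same duplicate pair here, but the blocked version compares only records within each block, which is what makes matching tractable at scale.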
—
Velocity challenge.
Many source systems organizations want to integrate provide rapidly changing data, for instance
sensor data or stock prices. Depending on the rate of change, it may be unfeasible to capture rapid data
changes in a timely fashion, particularly if analysts and BI applications need low query latencies on the
incoming data. Moreover, data quality monitoring in such environments needs new approaches that are
only just starting to be explored.
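One common way to cope with rapid change is to trade per-record loading for micro-batches. The following minimal sketch (class and values are hypothetical, not from any specific product) buffers fast-arriving readings and flushes them downstream in batches:

```python
from collections import deque

class MicroBatcher:
    """Buffers rapidly arriving readings and flushes them in batches,
    a common pattern when per-record loading cannot keep up with the
    arrival rate."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.buffer = deque()
        self.flushed = []  # stands in for the downstream store

    def ingest(self, reading):
        self.buffer.append(reading)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Drain whatever has accumulated into one batch.
        if self.buffer:
            self.flushed.append(list(self.buffer))
            self.buffer.clear()

batcher = MicroBatcher(batch_size=3)
for price in [101.2, 101.3, 101.1, 101.5, 101.4]:  # e.g. stock ticks
    batcher.ingest(price)
batcher.flush()  # drain the remainder
print(batcher.flushed)  # [[101.2, 101.3, 101.1], [101.5, 101.4]]
```

The trade-off is latency: readings sit in the buffer until a batch fills or is flushed, which is exactly the tension with analysts who want low query latency on incoming data.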
Conventional RDBMSs and SQL simply cannot store or analyze this wide range of use cases and cope with the
volume, velocity and variety challenges posed by big data. In response to these challenges, two types of systems10
have emerged: NoSQL and Hadoop.
Hadoop is one of the most interesting and promising new technologies related to big data. Historically, the first use
case for Hadoop was the storage and processing of semi-structured web log files stored in plain text. Since then,
case for Hadoop was the storage and processing of semi-structured web log files stored in plain text. Since then,
many more types of datasets relevant for analysis are being stored into Hadoop clusters. Although big data has
passed the top of the Gartner's Hype Cycle, and based on the amount of attention that big data technology is
receiving, one might think that adoption and deployment is already pervasive. The truth is that this technology is still in
a relatively early stage of development. For instance, only recently did Apache Spark and Apache Flink, the new
distributed engines of the Hadoop infrastructure, adopt declarative technology that makes complex querying both easy
and fast. On the other hand, preparing big data for effective analysis beyond descriptive BI and into predictive BI
(insights about what will happen) and prescriptive BI (what actions should be taken given these insights) requires
highly developed skills of a special group of people – data scientists. These skills go beyond traditional data
management and include advanced statistics and machine learning. Most companies are still trying to find the talent
and the products to effectively manage big data in order to get tangible business benefits.
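The declarative style that engines such as Spark SQL adopted can be illustrated without a cluster. The sketch below uses Python's built-in sqlite3 rather than Spark itself, and the table and values are invented, but it shows the core idea: state *what* result is wanted and let the engine plan execution, replacing a hand-written grouping loop:

```python
import sqlite3

# Declarative querying illustrated with sqlite3 (not Spark SQL itself);
# the 'clicks' table and its rows are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("alice", "home"), ("alice", "cart"), ("bob", "home")],
)

# One declarative statement expresses the aggregation; the engine decides
# how to scan, group, and sort.
rows = conn.execute(
    "SELECT user, COUNT(*) AS n FROM clicks GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```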
What about NoSQL systems? These systems also store and manipulate big data, but they are not meant for
complex querying or OLAP-style aggregation; rather, they target operational systems or search-intensive applications.
NoSQL databases are largely schema-less, trading stringent consistency requirements for speed and agility, with
data models shaped by the expected query patterns. Like Hadoop, they store their data across multiple processing
nodes, and often across multiple servers, but interactions occur through very simple queries or APIs, with updates
involving one or a few records accessed by their key11.
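The key-based access pattern described above can be sketched in a few lines. This toy class is not any real NoSQL API; it only mimics the shape of the interaction, where reads and writes address one row by its key and there are no joins or ad-hoc aggregations:

```python
class ToyKeyValueStore:
    """Minimal sketch of the access pattern NoSQL stores expose:
    put/get on a row key. In a real system like HBase, the key would
    also route the row to one of many distributed region servers."""

    def __init__(self):
        self.rows = {}

    def put(self, key, column, value):
        # Write a single column of a single row, addressed by key.
        self.rows.setdefault(key, {})[column] = value

    def get(self, key):
        # Read back one row by its key; no scans, no joins.
        return self.rows.get(key)

store = ToyKeyValueStore()
store.put("user#42", "name", "Alice")
store.put("user#42", "last_login", "2017-03-01")
print(store.get("user#42"))
# {'name': 'Alice', 'last_login': '2017-03-01'}
```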
We have witnessed two main uses of Hadoop in integration projects: (i) as the area where raw data reflecting both
entities and events arrive from operational processes, and where they are transformed for integration and sent down
the chain to the target repository, i.e., where analytics workloads are run, and (ii) as the target repository itself. As we
will see below, the Hadoop repository may play these two roles simultaneously, but then it will need other components
to make this viable. We analyze these two use cases below.
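Use case (i), Hadoop as the landing and transformation area, can be sketched schematically. The record layout, field names, and transformation below are illustrative only; in practice the transformation would be a distributed job and the target would be a separate analytics repository:

```python
# Raw operational records land as-is (use case i): inconsistent field
# names, whitespace, amounts stored as strings. Values are made up.
raw_events = [
    {"CustName": " Acme Corp ", "amt": "120.50"},
    {"CustName": "Globex",      "amt": "80.00"},
]

def transform(event):
    # Conform each record to the target schema before it is sent
    # down the chain: trimmed names, numeric amounts.
    return {"customer": event["CustName"].strip(),
            "amount": float(event["amt"])}

# Stands in for the target repository where analytics workloads run
# (use case ii).
target_repository = [transform(e) for e in raw_events]

print(target_repository)
# [{'customer': 'Acme Corp', 'amount': 120.5},
#  {'customer': 'Globex', 'amount': 80.0}]
```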
10
In all fairness, relational database vendors also tried offering an alternative: extending their type systems through
user-defined data types and functions that process unstructured and semi-structured data within the DBMS inner loop.
However, these systems have not gained much traction, mainly due to the volume challenge, which needs to be tamed
through massive distribution and parallel computation.
11
HBase, a NoSQL system from the Apache Foundation, is actually layered on top of Hadoop HDFS.