

WHITE PAPER
© 2017 Persistent Systems Ltd. All rights reserved.
www.persistent.com
—
Volume challenge.
It is expensive to warehouse all the data sources an organization wants to integrate,
especially when the queries to be asked of the data are not known in advance. Even where queries are
known, and assuming the sources can be aligned in a consistent schema, query execution may exhibit high
latency on database sizes of hundreds of terabytes or petabytes if the internal architecture is not distributed
and/or massively parallel. Existing techniques for data quality problems such as duplicate records are
likewise strained at these volumes for performance reasons.
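The strain on duplicate detection at volume can be made concrete. The sketch below, with made-up records and a deliberately toy similarity rule, contrasts naive pairwise matching, which is O(n²) and infeasible at hundreds of terabytes, with blocking on a cheap key, the standard way such workloads are tamed:

```python
from itertools import combinations
from collections import defaultdict

# Illustrative records only; field names and values are made up.
records = [
    {"id": 1, "name": "Acme Corp", "zip": "10001"},
    {"id": 2, "name": "ACME Corporation", "zip": "10001"},
    {"id": 3, "name": "Globex", "zip": "60601"},
]

def is_duplicate(a, b):
    # Toy similarity rule: same first four letters of the name, case-insensitive.
    return a["name"][:4].lower() == b["name"][:4].lower()

# Naive approach: compare every pair of records -- O(n^2) comparisons.
naive_pairs = [(a["id"], b["id"])
               for a, b in combinations(records, 2)
               if is_duplicate(a, b)]

# Blocking: only compare records that share a cheap blocking key (here,
# the zip code), drastically shrinking the number of comparisons.
blocks = defaultdict(list)
for r in records:
    blocks[r["zip"]].append(r)
blocked_pairs = [(a["id"], b["id"])
                 for block in blocks.values()
                 for a, b in combinations(block, 2)
                 if is_duplicate(a, b)]

print(naive_pairs)    # [(1, 2)]
print(blocked_pairs)  # [(1, 2)]
```

Both approaches find the same duplicate pair here, but the blocked version compares only records within each block, which is what makes matching tractable at scale.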
—
Velocity challenge.
Many source systems organizations want to integrate provide rapidly changing data, for instance
sensor data or stock prices. Depending on the rate of change, it may be unfeasible to capture rapid data
changes in a timely fashion, particularly if analysts and BI applications need low query latencies on the
incoming data. Moreover, data quality monitoring in such environments needs new approaches that are
only just starting to be explored.
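One common way to cope with rapid change is to trade per-record loading for micro-batches. The following minimal sketch (class and values are hypothetical, not from any specific product) buffers fast-arriving readings and flushes them downstream in batches:

```python
from collections import deque

class MicroBatcher:
    """Buffers rapidly arriving readings and flushes them in batches,
    a common pattern when per-record loading cannot keep up with the
    arrival rate."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.buffer = deque()
        self.flushed = []  # stands in for the downstream store

    def ingest(self, reading):
        self.buffer.append(reading)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Drain whatever has accumulated into one batch.
        if self.buffer:
            self.flushed.append(list(self.buffer))
            self.buffer.clear()

batcher = MicroBatcher(batch_size=3)
for price in [101.2, 101.3, 101.1, 101.5, 101.4]:  # e.g. stock ticks
    batcher.ingest(price)
batcher.flush()  # drain the remainder
print(batcher.flushed)  # [[101.2, 101.3, 101.1], [101.5, 101.4]]
```

The trade-off is latency: readings sit in the buffer until a batch fills or is flushed, which is exactly the tension with analysts who want low query latency on incoming data.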
Conventional RDBMSs and SQL simply cannot store or analyze this wide range of use cases and cope with the
volume, velocity and variety challenges posed by big data. In response to these challenges, two types of systems10
have emerged: NoSQL and Hadoop.
Hadoop is one of the most interesting and promising new technologies related to big data. Historically, the first use
case for Hadoop was the storage and processing of semi-structured web log files stored in plain text. Since then,
case for Hadoop was the storage and processing of semi-structured web log files stored in plain text. Since then,
many more types of datasets relevant for analysis are being stored into Hadoop clusters. Although big data has
passed the top of the Gartner's Hype Cycle, and based on the amount of attention that big data technology is
receiving, one might think that adoption and deployment is already pervasive. The truth is that this technology is still in
a relatively early stage of development. For instance, only recently did Apache Spark and Apache Flink, the new
distributed engines of the Hadoop infrastructure, adopt declarative technology that makes complex querying both easy
and fast. On the other hand, preparing big data for effective analysis beyond descriptive BI and into predictive BI
(insights about what will happen) and prescriptive BI (what actions should be taken given these insights) requires
highly developed skills of a special group of people – data scientists. These skills go beyond traditional data
management and include advanced statistics and machine learning. Most companies are still trying to find the talent
and the products to effectively manage big data in order to get tangible business benefits.
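The declarative style that engines such as Spark SQL adopted can be illustrated without a cluster. The sketch below uses Python's built-in sqlite3 rather than Spark itself, and the table and values are invented, but it shows the core idea: state *what* result is wanted and let the engine plan execution, replacing a hand-written grouping loop:

```python
import sqlite3

# Declarative querying illustrated with sqlite3 (not Spark SQL itself);
# the 'clicks' table and its rows are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("alice", "home"), ("alice", "cart"), ("bob", "home")],
)

# One declarative statement expresses the aggregation; the engine decides
# how to scan, group, and sort.
rows = conn.execute(
    "SELECT user, COUNT(*) AS n FROM clicks GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```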
What about NoSQL systems? These systems also store and manipulate big data, but they are not meant for
complex querying or OLAP-style aggregation; rather, they target operational systems or search-intensive applications.
NoSQL databases are largely schema-less, trading stringent consistency requirements for speed and agility, with
data models shaped by the expected query patterns. Like Hadoop, they store their data across multiple processing
nodes, and often across multiple servers, but interactions occur through very simple queries or APIs, with updates
involving one or a few records accessed by their key11.
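The key-based access pattern described above can be sketched in a few lines. This toy class is not any real NoSQL API; it only mimics the shape of the interaction, where reads and writes address one row by its key and there are no joins or ad-hoc aggregations:

```python
class ToyKeyValueStore:
    """Minimal sketch of the access pattern NoSQL stores expose:
    put/get on a row key. In a real system like HBase, the key would
    also route the row to one of many distributed region servers."""

    def __init__(self):
        self.rows = {}

    def put(self, key, column, value):
        # Write a single column of a single row, addressed by key.
        self.rows.setdefault(key, {})[column] = value

    def get(self, key):
        # Read back one row by its key; no scans, no joins.
        return self.rows.get(key)

store = ToyKeyValueStore()
store.put("user#42", "name", "Alice")
store.put("user#42", "last_login", "2017-03-01")
print(store.get("user#42"))
# {'name': 'Alice', 'last_login': '2017-03-01'}
```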
We have witnessed two main uses of Hadoop in integration projects: (i) as the area where raw data reflecting both
entities and events arrive from operational processes, and where they are transformed for integration and sent down
the chain to the target repository, i.e., where analytics workloads are run, and (ii) as the target repository itself. As we
will see below, the Hadoop repository may play these two roles simultaneously, but then it will need other components
to make this viable. We analyze these two use cases below.
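Use case (i), Hadoop as the landing and transformation area, can be sketched schematically. The record layout, field names, and transformation below are illustrative only; in practice the transformation would be a distributed job and the target would be a separate analytics repository:

```python
# Raw operational records land as-is (use case i): inconsistent field
# names, whitespace, amounts stored as strings. Values are made up.
raw_events = [
    {"CustName": " Acme Corp ", "amt": "120.50"},
    {"CustName": "Globex",      "amt": "80.00"},
]

def transform(event):
    # Conform each record to the target schema before it is sent
    # down the chain: trimmed names, numeric amounts.
    return {"customer": event["CustName"].strip(),
            "amount": float(event["amt"])}

# Stands in for the target repository where analytics workloads run
# (use case ii).
target_repository = [transform(e) for e in raw_events]

print(target_repository)
# [{'customer': 'Acme Corp', 'amount': 120.5},
#  {'customer': 'Globex', 'amount': 80.0}]
```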
10
In all fairness, relational database vendors also tried offering an alternative: extending their type systems through
user-defined data types and functions that process unstructured and semi-structured data within the DBMS inner loop.
However, these systems have not gained much traction, mainly due to the volume challenge, which needs to be tamed
through massive distribution and parallel computation.
11
HBase, a NoSQL system from the Apache Foundation, is actually layered on top of Hadoop HDFS.