

— Strong Data Governance
— Trusted data through high data quality, data enrichment, integrity, and interoperability
— Full transparency on data movements through tracking and auditing
— Data security and compliance
— End-to-end monitoring and support
Our experience is that the best architecture concept to handle this list of requirements is a Data Integration platform, preferably built with technology from cloud data integration vendors⁹, that avoids point-to-point interfaces between applications. So rather than connecting applications A and B directly, they talk through this middleware platform, which could run on premise or on a private cloud. In theory, this platform could also run in a public cloud; however, if data exchanged through these platforms must comply with regulations, or if customers choose to store data in transit (some good reasons for doing that are mentioned in section 6.3.2.1 below), public cloud integration providers may be seen as a risky bet for these customers.
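To make the contrast with point-to-point interfaces concrete, here is a minimal Python sketch of a hub-and-spoke integration layer. All names (IntegrationHub, publish, subscribe, the topic string) are hypothetical illustrations, not any vendor's actual API; the point is simply that applications A and B only ever know the hub, never each other, and that every data movement passes through one place.

```python
from collections import defaultdict
from typing import Callable

class IntegrationHub:
    """Minimal hub-and-spoke middleware sketch: applications publish
    messages to named topics and subscribe to topics, so no application
    holds a direct reference to any other application."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self._audit_log: list[tuple[str, dict]] = []  # every movement, recorded centrally

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # One central chokepoint: log the movement, then fan out to subscribers.
        self._audit_log.append((topic, message))
        for handler in self._subscribers[topic]:
            handler(message)

# Application B registers interest; application A publishes. Neither knows
# the other's address, only the hub and the topic name.
hub = IntegrationHub()
hub.subscribe("customer.updated", lambda msg: print("B received:", msg))
hub.publish("customer.updated", {"id": 42, "name": "Acme Corp"})
```

Because all traffic crosses the hub, requirements from the list above such as tracking and auditing fall out of the architecture itself, rather than having to be rebuilt for every pair of connected applications.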
3.2 Big Data
The Big Data era is the direct consequence of our ability to generate and collect digital data at an unprecedented scale and, at the same time, of our desire to analyze this data and extract value from it to make data-driven decisions, thereby altering all aspects of society. Abundant data from the internet, from sensors and smartphones, and from corporate databases can be analyzed to detect previously unknown patterns and insights, enabling smarter, data-driven decision-making in every field. Since the value of data increases dramatically when it can be linked and integrated with other data to create a unified representation, integrating big data becomes critical to realizing the promise of Big Data.
A word of caution is in order, though, as the term “Big Data Integration” may mean at least two different things. It was initially coined by the 2013 publication [10], and meant, for each domain such as medical records, astronomy, or genomics, literally mining the web and attempting to integrate its thousands of sources per domain. Here we are using this term in our data warehousing setting, which is about (i) building a large, central data repository for multi-structured data from various data sources, including data of the traditional sort such as those generated by applications within an organization, but also of the “big data sort”, i.e., internal web logs or document files, and external data generated by web sites such as social networks, or by sensor machines; and (ii) providing search, querying and analytics capabilities on top of this repository, for application use cases such as customer churn analysis, fraud detection (for instance, in financials), cohort group discovery (for instance, in healthcare), and the like.
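As a toy illustration of points (i) and (ii), the following sketch loads both structured rows and semi-structured JSON log lines into one central store and runs a churn-style query across them. The flat list standing in for the repository and all field names are assumptions made for illustration; a real platform would use a data lake or warehouse engine.

```python
import json

# A toy "central repository": every record, whatever its source, is kept
# as a dict tagged with its origin.
repository: list[dict] = []

def ingest_structured(source: str, rows: list[dict]) -> None:
    # Traditional, structured input: rows from an application database.
    for row in rows:
        repository.append({"source": source, **row})

def ingest_json_log(source: str, raw_lines: list[str]) -> None:
    # Semi-structured input: parse each JSON line, keep whatever fields exist.
    for line in raw_lines:
        repository.append({"source": source, **json.loads(line)})

# (i) Build the repository from heterogeneous sources.
ingest_structured("crm_db", [{"customer_id": 42, "plan": "gold"}])
ingest_json_log("web_logs", ['{"customer_id": 42, "event": "login_failed"}'])

# (ii) Query across sources, e.g. for churn analysis: gold-plan customers
# that also show failed logins in the web logs.
gold = {r.get("customer_id") for r in repository if r.get("plan") == "gold"}
at_risk = [r for r in repository
           if r.get("event") == "login_failed" and r.get("customer_id") in gold]
print(at_risk)
```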
This being said, some of the volume, velocity and variety challenges described in [10] still remain in our warehousing setting, albeit to a lesser degree:
— Variety Challenge. Data sources, even within the same domain, remain extremely heterogeneous, both at the schema level regarding how they structure their data and at the instance level regarding how they describe the same real-world entity, exhibiting considerable variety even for substantially similar entities. This impacts the entire integration effort. Variety at the local schema level means that it is difficult to come up with a target schema that aligns all the sources, which is the very first step of an integration project. As for detecting and resolving duplicates, techniques applicable to structured data have to evolve to cope with unstructured or semi-structured data such as tweets or social posts, as shown in section 6.3.3.1.
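To make the duplicate-resolution point concrete, here is a minimal sketch of fuzzy matching between a structured CRM name and a mention scraped from a social post. The normalization rules and the 0.85 similarity threshold are illustrative assumptions, not a production entity-resolution method.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Crude canonicalization: lowercase, strip punctuation and legal suffixes.
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    suffixes = {"inc", "ltd", "corp", "corporation", "co"}
    return " ".join(w for w in cleaned.split() if w not in suffixes)

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    # Fuzzy match on normalized names; the threshold is an illustrative guess.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# A structured CRM record vs. a mention pulled from a social post:
print(same_entity("Acme Corporation", "ACME Corp!"))  # True: likely the same entity
```

Exact-match joins that work on clean, structured keys fail on such inputs, which is why instance-level variety forces the fuzzier, similarity-based techniques the bullet above refers to.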
⁹ This is in line with the observation that the borders between application and data integration are blurring, as data integration vendors position themselves as a one-stop shop for all integration needs, thanks to their strong governance functionality.