

— Strong Data Governance
— Trusted data through high data quality, data enrichment, integrity, and interoperability
— Full transparency on data movements through tracking and auditing
— Data security and compliance
— End-to-end monitoring and support
Our experience is that the best architecture concept to handle this list of requirements is a Data Integration platform, preferably built with technology from cloud data integration vendors⁹, that avoids point-to-point interfaces between applications. So rather than connecting applications A and B directly, they talk through this middleware platform, which could run on premise or on a private cloud. In theory, this platform could also run in a public cloud; however, if data exchanged through these platforms must comply with regulations, or if customers choose to store data in transit (some good reasons for doing that are mentioned in section 6.3.2.1 below), public cloud integration providers may be seen as a risky bet for these customers.
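To make the contrast with point-to-point interfaces concrete, here is a minimal Python sketch of a hub-and-spoke integration layer. All names (IntegrationHub, publish, subscribe, the topic string) are hypothetical illustrations, not any vendor's actual API; the point is simply that applications A and B only ever know the hub, never each other, and that every data movement passes through one place.

```python
from collections import defaultdict
from typing import Callable

class IntegrationHub:
    """Minimal hub-and-spoke middleware sketch: applications publish
    messages to named topics and subscribe to topics, so no application
    holds a direct reference to any other application."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self._audit_log: list[tuple[str, dict]] = []  # every movement, recorded centrally

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # One central chokepoint: log the movement, then fan out to subscribers.
        self._audit_log.append((topic, message))
        for handler in self._subscribers[topic]:
            handler(message)

# Application B registers interest; application A publishes. Neither knows
# the other's address, only the hub and the topic name.
hub = IntegrationHub()
hub.subscribe("customer.updated", lambda msg: print("B received:", msg))
hub.publish("customer.updated", {"id": 42, "name": "Acme Corp"})
```

Because all traffic crosses the hub, requirements from the list above such as tracking and auditing fall out of the architecture itself, rather than having to be rebuilt for every pair of connected applications.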
3.2 Big Data
The Big Data era is the direct consequence of our ability to generate and collect digital data at an unprecedented scale and, at the same time, of our desire to analyze this data and extract value from it to make data-driven decisions, thereby altering all aspects of society. Abundant data from the internet, from sensors and smartphones, and from corporate databases can be analyzed to detect previously unknown patterns and insights, enabling smarter, data-driven decision-making in every field. Since the value of data increases dramatically when it can be linked and integrated with other data to create a unified representation, integrating big data becomes critical to realizing the promise of Big Data.
A word of caution is in order, though, as the term “Big Data Integration” may mean at least two different things. It was initially coined by the 2013 publication [10], and meant, for each domain such as medical records, astronomy, or genomics, literally mining the web and attempting to integrate its thousands of sources per domain. Here we are using this term in our data warehousing setting, which is about (i) building a large, central data repository for multi-structured data from various data sources, including data of the traditional sort such as those generated by applications within an organization, but also of the “big data sort”, i.e., internal web logs or document files, and external data generated by web sites such as social networks, or by sensor machines; and (ii) providing search, querying and analytics capabilities on top of this repository, for application use cases such as customer churn analysis, fraud detection (for instance, in financials), cohort group discovery (for instance, in healthcare), and the like.
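As a toy illustration of points (i) and (ii), the following sketch loads both structured rows and semi-structured JSON log lines into one central store and runs a churn-style query across them. The flat list standing in for the repository and all field names are assumptions made for illustration; a real platform would use a data lake or warehouse engine.

```python
import json

# A toy "central repository": every record, whatever its source, is kept
# as a dict tagged with its origin.
repository: list[dict] = []

def ingest_structured(source: str, rows: list[dict]) -> None:
    # Traditional, structured input: rows from an application database.
    for row in rows:
        repository.append({"source": source, **row})

def ingest_json_log(source: str, raw_lines: list[str]) -> None:
    # Semi-structured input: parse each JSON line, keep whatever fields exist.
    for line in raw_lines:
        repository.append({"source": source, **json.loads(line)})

# (i) Build the repository from heterogeneous sources.
ingest_structured("crm_db", [{"customer_id": 42, "plan": "gold"}])
ingest_json_log("web_logs", ['{"customer_id": 42, "event": "login_failed"}'])

# (ii) Query across sources, e.g. for churn analysis: gold-plan customers
# that also show failed logins in the web logs.
gold = {r.get("customer_id") for r in repository if r.get("plan") == "gold"}
at_risk = [r for r in repository
           if r.get("event") == "login_failed" and r.get("customer_id") in gold]
print(at_risk)
```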
This being said, some of the volume, velocity and variety challenges described in [10] still remain in our warehousing setting, albeit to a lesser degree:
— Variety Challenge. Data sources, even within the same domain, remain extremely heterogeneous, both at the schema level regarding how they structure their data and at the instance level regarding how they describe the same real-world entity, exhibiting considerable variety even for substantially similar entities. This impacts the entire integration effort. Variety at the local schema level means that it is difficult to come up with a target schema that aligns all the sources, which is the very first step of an integration project. As for detecting and resolving duplicates, techniques applicable to structured data have to evolve to cope with unstructured or semi-structured data such as tweets or social posts, as shown in section 6.3.3.1.
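To make the duplicate-resolution point concrete, here is a minimal sketch of fuzzy matching between a structured CRM name and a mention scraped from a social post. The normalization rules and the 0.85 similarity threshold are illustrative assumptions, not a production entity-resolution method.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Crude canonicalization: lowercase, strip punctuation and legal suffixes.
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    suffixes = {"inc", "ltd", "corp", "corporation", "co"}
    return " ".join(w for w in cleaned.split() if w not in suffixes)

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    # Fuzzy match on normalized names; the threshold is an illustrative guess.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# A structured CRM record vs. a mention pulled from a social post:
print(same_entity("Acme Corporation", "ACME Corp!"))  # True: likely the same entity
```

Exact-match joins that work on clean, structured keys fail on such inputs, which is why instance-level variety forces the fuzzier, similarity-based techniques the bullet above refers to.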
⁹ This is in line with the observation that the borders between application and data integration are blurring, as data integration vendors position themselves as a one-stop shop for all integration needs, thanks to their strong governance functionality.