
WHITE PAPER

© 2017 Persistent Systems Ltd. All rights reserved.

www.persistent.com

Data integration and data quality vendors (Informatica, Talend, IBM, Pentaho), emerging players (SnapLogic, Iron ETL) and cloud providers (AWS with Data Pipeline and Kinesis, Microsoft with Azure Data Factory, and Google with Cloud Dataflow and Cloud Dataproc) now offer cloud-based ETL tools. The public cloud services additionally support dynamic tenant provisioning, elastic scaling, a web-based administration and monitoring GUI, and pay-per-use billing. Again, cost and ease of use are big factors driving adoption. License-based (on-premises or private cloud) data integration tools can also connect on-premises and cloud applications; they are costlier and require additional hardware, but pose fewer risks to customers in use cases where security and/or regulatory compliance are needed. In both types of deployment, these tools generally provide services that enable the development, execution, and governance of bidirectional integration flows between on-premises and cloud-based systems, or between cloud-to-cloud processes and services.

However, the big challenge for data in the cloud revolves around moving and integrating that data. In our experience, moving on-premises data to the cloud is problematic in the following areas:

Large number of data sources.

Connectivity to a large number of sources is needed. Even though connectivity is the most mature part of established vendor products (on-premises versions have included web-service-based source connectivity for a while now), the explosion of data sources is such that no list of prepackaged connectors will cover most cases. Because of this, providers offer development kits that enable customers to create custom adapters for unsupported systems.
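The custom-adapter idea can be sketched as follows. This is a minimal illustration, assuming a hypothetical connector SDK in which each adapter implements a connect/read contract; none of these class or method names come from any vendor's actual development kit.

```python
# Minimal sketch of a custom source adapter for an unsupported system.
# The SourceAdapter contract and all names below are illustrative
# assumptions, not any vendor's real SDK API.
from abc import ABC, abstractmethod
from typing import Iterator

class SourceAdapter(ABC):
    """Contract a custom connector would implement."""

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def read_records(self) -> Iterator[dict]: ...

class CsvOverHttpAdapter(SourceAdapter):
    """Example adapter for a non-prepackaged, HTTP-exposed CSV feed."""

    def __init__(self, url: str):
        self.url = url
        self._lines: list = []

    def connect(self) -> None:
        # Real code would fetch self.url; stubbed here for illustration.
        self._lines = ["id,name", "1,alice", "2,bob"]

    def read_records(self) -> Iterator[dict]:
        header = self._lines[0].split(",")
        for line in self._lines[1:]:
            yield dict(zip(header, line.split(",")))
```

Once such an adapter satisfies the SDK's contract, the ETL engine can treat the custom source exactly like a prepackaged connector.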

Large data volumes.

Moving data over the internet is sometimes a stumbling block, for several reasons.

- Performance. The public internet is slow. A possible approach is to use private network connection solutions, which allow organizations to connect directly to cloud providers rather than through connections established over the public internet. See section 7.2.2 for more details.

- Shared resources in a public cloud. Heavy-duty data loads or ETL operations can disturb other tenants, so they should be run at night or during off-peak periods.

Leading vendors like Amazon have taken a big leap in this space and provide batch cloud data transfer services such as the Amazon Snowball appliance and bulk import to S3 storage.
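The off-peak scheduling point above can be sketched as a simple guard around heavy loads. The window boundaries here are illustrative assumptions; a real deployment would pick them per tenant and time zone.

```python
# Sketch of gating a heavy bulk load to an overnight off-peak window.
# Window boundaries (10 pm to 6 am) are illustrative assumptions.
from datetime import datetime, time

OFF_PEAK_START = time(22, 0)   # 10 pm
OFF_PEAK_END = time(6, 0)      # 6 am the next day

def in_off_peak_window(now: datetime) -> bool:
    """True when `now` falls in the overnight off-peak window."""
    t = now.time()
    # The window wraps past midnight, so it is a union of two intervals.
    return t >= OFF_PEAK_START or t < OFF_PEAK_END

def maybe_run_bulk_load(now: datetime, load) -> bool:
    """Run the heavy load only off-peak; return whether it ran."""
    if in_off_peak_window(now):
        load()
        return True
    return False
```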

Legacy transforms.

Reusing existing transformation (the T of ETL) code on data being moved to the cloud is an issue when the ETL tool used on-premises is not the same as the tool retained for the cloud. In our experience, vendors' tools frequently change when the system landscape changes, and there is no interoperability standard; even within the same vendor, the on-premises tool and the cloud tool are not always fully interoperable. Indeed, cloud tools do not always offer the same transformations as on-premises tools; some of the bells and whistles of traditional on-premises tools, in terms of richness of transformations, may not have made it to the cloud tool.
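One way to mitigate this interoperability gap is to keep transformation logic portable: express the "T" as plain functions over generic records rather than in a tool's proprietary mapping language, so the same code can back either an on-premises or a cloud pipeline. The sketch below assumes this approach; the field names and steps are illustrative.

```python
# Sketch of portable transformation logic kept outside any ETL tool.
# Field names ("country", "ssn", "dob") are illustrative assumptions.
def normalize_country(record: dict) -> dict:
    """Map common spellings of a country to one canonical code."""
    mapping = {"USA": "US", "U.S.": "US", "United States": "US"}
    out = dict(record)
    out["country"] = mapping.get(record.get("country"), record.get("country"))
    return out

def drop_pii(record: dict) -> dict:
    """Remove sensitive fields before the data leaves the premises."""
    return {k: v for k, v in record.items() if k not in {"ssn", "dob"}}

def transform(records):
    """Compose the individual steps into one reusable pipeline."""
    for r in records:
        yield drop_pii(normalize_country(r))
```

Either deployment then only needs a thin wrapper to invoke `transform`, instead of re-implementing each mapping in the target tool.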

Firewalls.

Security issues arise when cloud providers integrate with on-premises systems by requiring IT administrators to open an external communications port. This creates a huge security hole, which is why some cloud integration providers have devised ways to tunnel through corporate firewalls without requiring administrators to open external ports.
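The tunneling pattern described above typically relies on the on-premises side initiating an outbound connection and pulling work from the cloud, so no inbound port is ever exposed. The sketch below simulates that with an in-memory queue standing in for the outbound channel; the broker, agent, and message shapes are illustrative assumptions, not any vendor's protocol.

```python
# Sketch of the outbound-only integration pattern: an on-premises agent
# dials out to the cloud side and pulls jobs, so no inbound firewall
# port is opened. The in-memory queue stands in for the outbound
# connection; all names here are illustrative assumptions.
from queue import Queue, Empty

class CloudBroker:
    """Stands in for the provider endpoint the agent dials out to."""

    def __init__(self):
        self._work = Queue()
        self.results = []

    def enqueue(self, job: dict) -> None:
        self._work.put(job)

    def try_dequeue(self):
        """Hand one pending job to a polling agent, or None if idle."""
        try:
            return self._work.get_nowait()
        except Empty:
            return None

class OnPremAgent:
    """Polls the broker over its outbound link and runs jobs locally."""

    def __init__(self, broker: CloudBroker):
        self.broker = broker

    def poll_once(self) -> bool:
        job = self.broker.try_dequeue()
        if job is None:
            return False
        # Execute against local systems, then push the result back out
        # over the same outbound connection.
        self.broker.results.append({"job": job["id"], "status": "done"})
        return True
```

Because every connection originates inside the corporate network, the firewall only has to permit outbound HTTPS, which most already do.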