W H I T E P A P E R
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
c. Data volume – One of the biggest challenges arises when you have a huge volume of data available on premise and want to move it to the cloud. Migration tools and fast network connections are available; however, if the volume is measured in hundreds of terabytes, then customers need to rely on a physical, hardware-based solution to transfer the data. Indeed, even at 100 Mbps¹⁴, transferring 1 TB takes about a day. Most cloud providers, including AWS and Azure, provide such a mechanism. With these solutions, transferring 100 TB can be done in a matter of a few days, including shipping time, as opposed to several months.
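The arithmetic behind these estimates can be sketched in a few lines; this is a minimal back-of-envelope model that assumes the line rate is the only bottleneck and ignores protocol overhead and contention:

```python
def transfer_days(volume_tb: float, link_mbps: float) -> float:
    """Days needed to move `volume_tb` terabytes over a `link_mbps` link."""
    bits = volume_tb * 1e12 * 8          # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # divide by link rate in bits/second
    return seconds / 86400               # seconds -> days

print(f"1 TB at 100 Mbps:   {transfer_days(1, 100):.2f} days")   # ~0.93 days
print(f"100 TB at 100 Mbps: {transfer_days(100, 100):.0f} days") # ~93 days
```

At 100 Mbps, 100 TB would tie up the link for roughly three months, which is why physical shipment wins at that scale.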
d. Data velocity – Today, business and IoT applications generate huge amounts of data per second, and businesses may require that data to be consumed and analyzed in real time, as shown in the previous sections. Each cloud provider has its own data movement tool (Amazon Data Pipeline, Azure Data Factory, Google Cloud Dataflow and IBM Data Connect), but if the expected velocity and throughput are high, open source solutions such as Kafka, Spark and Storm are available on these cloud platforms.
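To illustrate the kind of real-time computation these streaming engines perform, here is a minimal, pure-Python sketch of a tumbling-window count over an event stream. Kafka, Spark and Storm provide distributed, fault-tolerant versions of this pattern; the event format below is hypothetical:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per sensor within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, sensor_id) pairs, the
    kind of record an IoT pipeline might pull off a message bus.
    """
    counts = defaultdict(int)
    for ts, sensor in events:
        # Align each event to the start of its window, e.g. 61 -> window 60.
        window_start = int(ts // window_seconds) * window_seconds
        counts[(window_start, sensor)] += 1
    return dict(counts)

stream = [(0, "s1"), (10, "s1"), (59, "s2"), (61, "s1"), (130, "s2")]
print(tumbling_window_counts(stream))
# {(0, 's1'): 2, (0, 's2'): 1, (60, 's1'): 1, (120, 's2'): 1}
```

A streaming engine adds what this sketch lacks: partitioning across machines, back-pressure, and recovery when a worker fails.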
e. Data variety – In traditional data warehouses the number of enterprise data sources is typically low (in the tens), so the processing that transforms incoming data into the target data structures is highly optimized: it uses sophisticated techniques such as change data capture at the sources and slowly changing dimensions to keep history, and it relies on the stability of the source data structures. A similar pattern exists with cloud data warehouses.
However, as explained in section 4.3 above, the medium and large variety of data available for analysis has driven the use of Hadoop data lakes, both as a service and managed by the customer on the provider's infrastructure. It has also increased demand for a data movement style which takes data from these various sources, loads it into Hadoop without necessarily imposing a predefined schema, and defers transformation to a later point in time for further analysis – an ELT style, where the transformation is done through Hadoop programming. In this case, the data movement/ETL tool you pick specializes in extraction, so you should make sure it connects to the large variety of data sources you need, or at least provides a development kit that enables customers to create custom adapters for unsupported systems. If change data capture or stability of source structures is important, we recommend picking a data integration tool (e.g., Informatica) that supports it directly on top of Hadoop, as this topic is more mature with data integration tools than it is with Hadoop.
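The ELT style can be sketched in a few lines: raw records land first, schema-free, and a later transformation step extracts only the fields an analysis needs. This is an illustrative sketch with hypothetical field names; in practice the load target would be HDFS or cloud object storage and the transform step a Spark or MapReduce job:

```python
import json

def load_raw(records, landing_zone):
    """Load: append raw JSON lines as-is, with no predefined schema.

    `landing_zone` is an in-memory stand-in for a data lake directory.
    """
    for rec in records:
        landing_zone.append(json.dumps(rec))

def transform_orders(landing_zone):
    """Transform (later, on demand): project the fields one analysis needs,
    tolerating records that do not match this shape."""
    out = []
    for line in landing_zone:
        rec = json.loads(line)
        if "order_id" in rec and "amount" in rec:
            out.append({"order_id": rec["order_id"],
                        "amount": float(rec["amount"])})
    return out

zone = []
load_raw([{"order_id": 1, "amount": "9.99", "extra": "kept in the raw zone"},
          {"click": "ignored by this particular transform"}], zone)
print(transform_orders(zone))  # [{'order_id': 1, 'amount': 9.99}]
```

The key point is the deferral: the second record is not rejected at load time, because other analyses may later define their own transforms over the same raw data.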
f. Internal technical skills – Traditionally, developing ETL processes required heavy, intensive coding. As time passes these processes become difficult to maintain and extend, often because the original coders have moved on and new developers do not understand the code. If, on top of all the changes involved in moving analytics to the cloud, a change of ETL tool is forced upon the team, this may prove to be too much; in this case, an option is to keep using the same tool on the cloud infrastructure environment to increase the IT team's comfort level.
On the other hand, self-service, cloud-ready tools have recently entered the data integration and data
governance field, allowing business analysts to develop their models with data-driven visual GUIs,
without explicit coding. These tools complement and integrate with self-service BI tools and are becoming realistic alternatives for business units and departments to process data themselves, with little or no IT assistance, while keeping some degree of IT governance.
4.6 Additional PaaS services
Apart from data management services, there are many other PaaS services available which can be used for your
end to end processing.
Writing custom code that responds to a variety of events is addressed through event-driven functions-as-a-service (e.g., Microsoft Azure Functions, AWS Lambda or Google Cloud Functions). They differ in terms of the languages they support; see section 6 for a comparison. The main advantage is that, once deployed as a function-as-a-service, this custom code becomes automatically scalable and secured.
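As a minimal sketch, an AWS Lambda handler in Python is just a function with the standard `(event, context)` signature; the event field and response shape shown here are illustrative (the actual payload depends on the service that triggers the function):

```python
import json

def handler(event, context):
    """Entry point the Lambda runtime invokes for each event.

    `event` carries the trigger payload (shape depends on the event source);
    `context` exposes runtime metadata (request id, remaining time, ...).
    """
    name = event.get("name", "world")  # hypothetical field, for illustration
    body = {"message": f"Hello, {name}!"}
    # API Gateway-style response: a status code plus a JSON-serialized body.
    return {"statusCode": 200, "body": json.dumps(body)}

# Local smoke test; in the cloud, the platform calls handler() for you.
print(handler({"name": "Persistent"}, None))
```

Scaling, patching and request routing are handled by the platform, which is exactly the "automatically scalable and secured" property noted above.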
14 Which is a fast connection, twice as fast as today's fiber optic cables.