W H I T E P A P E R
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
c. Data volume – One of the biggest challenges arises when you have a huge volume of data available on premise and want to move it to the cloud. Migration tools and fast network connections are available; however, if the volume is measured in hundreds of terabytes, then customers need to rely on a physical, hardware-based solution to transfer the data. Indeed, even at 100 Mbps¹⁴, transferring 1 TB takes about a day. Most cloud providers, including AWS and Azure, provide such a mechanism. With these solutions, transferring 100 TB can be done in a matter of a few days, including shipping time, as opposed to several months.
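The arithmetic behind these estimates can be sketched in a few lines; this is a minimal back-of-envelope model that assumes the line rate is the only bottleneck and ignores protocol overhead and contention:

```python
def transfer_days(volume_tb: float, link_mbps: float) -> float:
    """Days needed to move `volume_tb` terabytes over a `link_mbps` link."""
    bits = volume_tb * 1e12 * 8          # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6)   # divide by link rate in bits/second
    return seconds / 86400               # seconds -> days

print(f"1 TB at 100 Mbps:   {transfer_days(1, 100):.2f} days")   # ~0.93 days
print(f"100 TB at 100 Mbps: {transfer_days(100, 100):.0f} days") # ~93 days
```

At 100 Mbps, 100 TB would tie up the link for roughly three months, which is why physical shipment wins at that scale.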
d. Data velocity – Today, business and IoT applications generate huge amounts of data per second, and businesses may require that data to be consumed and analyzed in real time, as shown in the previous sections. Each cloud provider has its own data movement tool (Amazon Data Pipeline, Azure Data Factory, Google Cloud Dataflow and IBM Data Connect), but if the expected velocity and throughput are high, open source solutions such as Kafka, Spark and Storm are available on these cloud platforms.
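To illustrate the kind of real-time computation these streaming engines perform, here is a minimal, pure-Python sketch of a tumbling-window count over an event stream. Kafka, Spark and Storm provide distributed, fault-tolerant versions of this pattern; the event format below is hypothetical:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per sensor within fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, sensor_id) pairs, the
    kind of record an IoT pipeline might pull off a message bus.
    """
    counts = defaultdict(int)
    for ts, sensor in events:
        # Align each event to the start of its window, e.g. 61 -> window 60.
        window_start = int(ts // window_seconds) * window_seconds
        counts[(window_start, sensor)] += 1
    return dict(counts)

stream = [(0, "s1"), (10, "s1"), (59, "s2"), (61, "s1"), (130, "s2")]
print(tumbling_window_counts(stream))
# {(0, 's1'): 2, (0, 's2'): 1, (60, 's1'): 1, (120, 's2'): 1}
```

A streaming engine adds what this sketch lacks: partitioning across machines, back-pressure, and recovery when a worker fails.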
e. Data variety – In traditional data warehouses the number of enterprise data sources is typically low (in the tens), so the processing that transforms incoming data into the target data structures is highly optimized: it uses sophisticated techniques such as change data capture at the sources and slowly changing dimensions to keep history, and it relies on the stability of the source data structures. A similar pattern exists with cloud data warehouses.
However, as explained in section 4.3 above, the medium and large variety of data available for analysis has driven the use of Hadoop data lakes, both as a service and managed by the customer on the provider's infrastructure. It has also increased demand for a data movement style which takes data from these various sources, loads it into Hadoop without necessarily imposing a predefined schema, and defers transformation to a later point in time for further analysis – an ELT style, where the transformation is done through Hadoop programming. In this case, the data movement/ETL tool you pick specializes in extraction, so you should make sure it connects to the large variety of data sources you need, or at least provides a development kit that enables customers to create custom adapters for unsupported systems. If change data capture or stability of source structures is important, we recommend picking a data integration tool (e.g., Informatica) that supports it directly on top of Hadoop, as this topic is more mature with data integration tools than it is with Hadoop.
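The ELT style can be sketched in a few lines: raw records land first, schema-free, and a later transformation step extracts only the fields an analysis needs. This is an illustrative sketch with hypothetical field names; in practice the load target would be HDFS or cloud object storage and the transform step a Spark or MapReduce job:

```python
import json

def load_raw(records, landing_zone):
    """Load: append raw JSON lines as-is, with no predefined schema.

    `landing_zone` is an in-memory stand-in for a data lake directory.
    """
    for rec in records:
        landing_zone.append(json.dumps(rec))

def transform_orders(landing_zone):
    """Transform (later, on demand): project the fields one analysis needs,
    tolerating records that do not match this shape."""
    out = []
    for line in landing_zone:
        rec = json.loads(line)
        if "order_id" in rec and "amount" in rec:
            out.append({"order_id": rec["order_id"],
                        "amount": float(rec["amount"])})
    return out

zone = []
load_raw([{"order_id": 1, "amount": "9.99", "extra": "kept in the raw zone"},
          {"click": "ignored by this particular transform"}], zone)
print(transform_orders(zone))  # [{'order_id': 1, 'amount': 9.99}]
```

The key point is the deferral: the second record is not rejected at load time, because other analyses may later define their own transforms over the same raw data.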
f. Internal technical skills – Traditionally, developing ETL processes required heavy, intensive coding. As time passes these processes become difficult to maintain and extend, often because the original coders have moved on and new developers do not understand the code. If, on top of all the changes involved in moving analytics to the cloud, a change of ETL tool is forced upon the team, this may prove to be too much; in this case, an option is to keep using the same tool on the cloud infrastructure environment to increase the IT team's comfort level.
On the other hand, self-service, cloud-ready tools have recently entered the data integration and data
governance field, allowing business analysts to develop their models with data-driven visual GUIs,
without explicit coding. These tools complement and integrate with self-service BI tools and are becoming realistic alternatives for business units and departments to process data themselves, with little or no IT assistance, while keeping some degree of IT governance.
4.6 Additional PaaS services
Apart from data management services, there are many other PaaS services available which can be used for your
end to end processing.
Writing custom code that responds to a variety of events is addressed through event-driven functions-as-a-service (e.g., Microsoft Azure Functions, AWS Lambda or Google Cloud Functions). They differ in terms of the languages they support; see section 6 for a comparison. The main advantage is that, once deployed as a function-as-a-service, this custom code becomes automatically scalable and secured.
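As a minimal sketch, an AWS Lambda handler in Python is just a function with the standard `(event, context)` signature; the event field and response shape shown here are illustrative (the actual payload depends on the service that triggers the function):

```python
import json

def handler(event, context):
    """Entry point the Lambda runtime invokes for each event.

    `event` carries the trigger payload (shape depends on the event source);
    `context` exposes runtime metadata (request id, remaining time, ...).
    """
    name = event.get("name", "world")  # hypothetical field, for illustration
    body = {"message": f"Hello, {name}!"}
    # API Gateway-style response: a status code plus a JSON-serialized body.
    return {"statusCode": 200, "body": json.dumps(body)}

# Local smoke test; in the cloud, the platform calls handler() for you.
print(handler({"name": "Persistent"}, None))
```

Scaling, patching and request routing are handled by the platform, which is exactly the "automatically scalable and secured" property noted above.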
14 Which is a fast connection, twice as fast as today's fiber optic cables.