W H I T E P A P E R
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
e. End user skills – A growing demand for easy-to-use tools that access trusted data in the cloud has
created a shift in the BI market towards governed self-service. Organizations can now broaden
access to analytic insight and remain competitive without asking their business users to improve their
technical skills: business users can analyze data without necessarily having to write SQL queries, much as
they did with Excel, but now through more powerful tools such as Power BI, Tableau, and Qlik Sense.
At the other end, traditional BI product suites are more complex to implement and require dedicated IT
resources with developer skills. As mentioned above, most are available as SaaS services
and can also be used in single-tenant mode, installed from the cloud provider marketplace directly on
top of the provider's infrastructure. The platform also needs to support a new breed of users, data scientists,
who run experiments with the data, develop predictive analytics and ML models, and assist in real-time
decision-making.
4.5 Data movement / ETL tools
Once you have decided on a data model for your cloud database, you also need to decide how to transform and load
data from one or more sources into it. Integration and data movement were identified in [1] as the second
leading obstacle to cloud adoption, after security, which points to the critical importance of full-featured data
integration tools for the cloud. For this reason, you might need to evaluate data movement tools before making the
final choice of cloud platform provider.
a. Data integration and data quality - Data needs to be integrated and processed for quality either when
it is written to the cloud data warehouse schema or to a simpler NoSQL schema, or at a later point in time,
in a data lake. Make sure your transformation needs are covered, whatever your data requirements
might be. Possible pitfalls of PaaS data movement / ETL include (i) reusing legacy transforms: this is
generally not supported, as the tool that was used on premises is not the same as the tool retained for
the cloud (see footnote 13); (ii) the non-availability of quality-specific transforms such as cleansing and
de-duplication, which are generally present in more mature on-premise tools; (iii) processing data at high
velocities (see below): typical transformations on high-velocity data may include joining data from multiple
streams and rolling-window aggregation; and (iv) the lack of comprehensive data lifecycle
management and administration capabilities.
We believe that development productivity remains a serious obstacle in the cloud, as with on-premise
ETL. Self-service data integration tools such as Trifacta and Alteryx (see last point below) are a possible
path for mitigating this problem.
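To make the transforms named above more concrete, here is a minimal sketch in plain Python of cleansing, de-duplication and a rolling-window aggregation over a time-ordered stream. It is not taken from any particular ETL product: the `Event` record, the function names and the choice of `(ts, key)` as the duplicate identity are all assumptions made for illustration, and the window is computed over all keys together rather than per key.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical record shape used only for this illustration."""
    ts: float     # event time in seconds
    key: str      # e.g. a customer identifier
    value: float


def cleanse(raw):
    """Trim and lowercase the key; drop records whose key is missing or blank."""
    for ts, key, value in raw:
        key = (key or "").strip().lower()
        if key:
            yield Event(float(ts), key, float(value))


def deduplicate(events):
    """Keep only the first occurrence of each (ts, key) pair (assumed identity)."""
    seen = set()
    for event in events:
        ident = (event.ts, event.key)
        if ident not in seen:
            seen.add(ident)
            yield event


def rolling_sum(events, window_seconds):
    """Yield (event, sum of values in the trailing window) for time-ordered events.

    The window covers events with ts in (event.ts - window_seconds, event.ts];
    window_seconds is assumed to be positive.
    """
    window = deque()
    total = 0.0
    for event in events:
        window.append(event)
        total += event.value
        # Evict events that have fallen out of the trailing window.
        while window[0].ts <= event.ts - window_seconds:
            total -= window.popleft().value
        yield event, total


raw = [(0, " A ", 1), (0, "a", 1), (5, "a", 2), (12, "a", 4), (13, None, 9)]
sums = [total for _, total in rolling_sum(deduplicate(cleanse(raw)), 10)]
```

In this sketch the second record is dropped as a duplicate once the key has been cleansed, and the last record is dropped because it has no key. A real pipeline would also have to handle late or out-of-order events, which this sketch does not attempt.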
b. The choice of service level when managing your data movement tool – As with cloud databases,
data movement tools offer a deployment choice between IaaS and PaaS, so this can be seen
as part of Resource management.
IaaS deployment of traditional ETL tools is a way to avoid the DI/DQ pitfalls enumerated in the previous
point, as these tools are still more mature than PaaS data movement tools. Internal technical skills, analyzed
separately below, may also weigh in on the final choice. On the other hand, PaaS data movement
requires less administration, management and setup than traditional ETL deployed on IaaS. As with their
cloud database platform service counterparts, the availability and scalability of PaaS data movement tools are
taken care of by the PaaS provider: this matters with large data volumes (see more on this requirement below).
PaaS data movement tools are also much more likely to outperform third-party ETL tools, for instance by taking
advantage of parallelism in data transfers to the internal nodes of a target MPP data warehouse cluster,
something that the JDBC-based connections of non-native ETL tools will have a hard time doing (especially if
they run outside the cloud provider).
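The parallelism argument can be sketched as follows. This is only an illustration of the idea of splitting a load into slices and sending them concurrently, one per target node, instead of pushing every row through a single connection. It is not any provider's API: `load_slice` is a hypothetical stand-in for a node-level bulk load, and the hash-based partitioning is an assumption about how rows might be distributed.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_slices, key_index=0):
    """Assign each row to a slice using a stable hash of its distribution key."""
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        digest = zlib.crc32(str(row[key_index]).encode("utf-8"))
        slices[digest % num_slices].append(row)
    return slices


def load_slice(slice_id, rows):
    """Hypothetical stand-in for a bulk load into one node; returns rows loaded."""
    return slice_id, len(rows)


def parallel_load(rows, num_slices):
    """Load all slices concurrently, one worker per slice."""
    slices = partition(rows, num_slices)
    with ThreadPoolExecutor(max_workers=num_slices) as pool:
        results = pool.map(load_slice, range(num_slices), slices)
        return dict(results)  # {slice_id: rows_loaded}


rows = [(i, f"name{i}") for i in range(100)]
loaded = parallel_load(rows, 4)
```

A tool that funnels all rows through one JDBC-style connection corresponds to `num_slices=1` here. The potential gain comes from each slice travelling over its own path to a different node, which the sketch models only structurally, without measuring anything.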
13 Even within the same vendor, we have found that the on-premises tool and the cloud tool are not always fully interoperable. In this case, one option is to deploy the
on-premise tool on IaaS; another is to use the on-premise tool from your premises if it supports connectivity to the selected cloud database (this requires IT administrators to open
an external communications port, something that administrators don’t readily allow).