WHITE PAPER
www.persistent.com
© 2017 Persistent Systems Ltd. All rights reserved.
On the plus side:
a. The elastic resources of Azure Data Factory will address scalability concerns with respect to the volume of
data to be moved to the cloud and transformed.
b. Flexible ETL and faster turnaround are also expected, even though debugging facilities appear to be
limited.
On the negative side, some limits and constraints of the product would be hit in this project; for instance, the
maximum number of pipelines is 100, and the maximum number of fields per database object is also 100.
Pricing follows the “pay as you go” model (see
https://azure.microsoft.com/en-us/pricing/details/data-factory/) and is a function of the activities being used. It depends on factors such as (i) the frequency of
activities, (ii) where the activities run (between cloud data stores, or with an on-premises store involved, priced
at a fraction of a dollar per hour), (iii) whether a pipeline is active or not, and (iv) whether an activity is being
re-run. As an example, if it takes 2 hours a day to move data from an on-premises SQL Server database to
Azure blob storage, the total comes out to $7.50 per month. Again, cost should generally compare
favorably.
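The $7.50 figure can be reconstructed with simple arithmetic. The per-hour rate below is an assumption chosen to reproduce the quoted total, not a current published Azure price (actual rates vary by region and change over time):

```python
# Rough monthly-cost estimate for the example above: a pipeline that
# spends 2 hours per day copying data from an on-premises SQL Server
# database to Azure blob storage.

HOURS_PER_DAY = 2
DAYS_PER_MONTH = 30
RATE_PER_HOUR = 0.125  # assumed on-premises data-movement rate, USD/hour

monthly_cost = HOURS_PER_DAY * DAYS_PER_MONTH * RATE_PER_HOUR
print(f"${monthly_cost:.2f} per month")  # $7.50 per month
```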
As transferring hundreds of terabytes over the internet is almost infeasible (a 100 TB transfer over a dedicated 100
Mbps connection takes over 100 days), the solution would be to use Azure’s Import/Export service. This service
allows Azure customers to securely transfer large amounts of data to Azure blob storage by shipping hard disk
drives to an Azure data center.
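The transfer-time estimate follows from the link rate alone. A minimal sketch, ignoring protocol overhead and retransmissions (which would push real-world totals past the figure cited above):

```python
# Back-of-the-envelope transfer time for 100 TB over a dedicated
# 100 Mbps link at theoretical line rate.

TB = 10**12              # bytes per terabyte (decimal)
data_bits = 100 * TB * 8 # 100 TB expressed in bits
link_bps = 100 * 10**6   # 100 Mbps

seconds = data_bits / link_bps
days = seconds / 86400
print(f"{days:.1f} days")  # ~92.6 days at line rate; >100 with overhead
```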
For now, the customer has decided to go with the standard alternative, namely, to use SSIS, deployed on
Azure VMs (as opposed to on-premises). Some months into the project, as with many ETL tools, development
productivity started becoming a serious obstacle, in line with common experience.18 Productivity achieved
with ETL tools is related to (i) the feature set of the tool itself (which keeps improving over time), (ii) the
skill of the person using the tool, (iii) the way in which the tool is used for the project at hand, and (iv) the amount
of tools and templates available to speed up the development process.
PSL tackled the last point and demonstrated that over 75% of the tables in the warehouse could be
produced with parameterized T-SQL stored-procedure templates. Template parameters include schema names,
table names, stored procedure names, field names, and values. Recent versions of SSIS come with predefined
templates for many common tasks, as well as the ability to create a custom template for a complex script that
you must create frequently. Once parameters in a T-SQL template have been replaced, the resulting stored
procedure can be called from SSIS, which is therefore used as an orchestration engine and to implement the
remaining transformations that cannot be templatized.
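The substitution step can be sketched as follows. The template text and parameter names (schema, procedure, table, and field names) are hypothetical illustrations, not PSL's actual templates:

```python
# Minimal sketch of template-driven stored-procedure generation.
# The T-SQL template and all names below are made up for illustration.
from string import Template

PROC_TEMPLATE = Template("""\
CREATE PROCEDURE $schema.$proc_name
AS
BEGIN
    INSERT INTO $schema.$target_table ($field_list)
    SELECT $field_list
    FROM $schema.$source_table;
END;
""")

def render_proc(schema, proc_name, target_table, source_table, fields):
    """Substitute schema, table, procedure, and field parameters
    into the T-SQL template."""
    return PROC_TEMPLATE.substitute(
        schema=schema,
        proc_name=proc_name,
        target_table=target_table,
        source_table=source_table,
        field_list=", ".join(fields),
    )

sql = render_proc("dw", "usp_load_dim_customer",
                  "dim_customer", "stg_customer",
                  ["customer_id", "name", "region"])
print(sql)
```

Once generated and deployed, such a procedure would be invoked from SSIS (for instance via an Execute SQL Task), leaving SSIS to orchestrate and to implement only the transformations that cannot be templatized.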
On the gains front:
• PSL found that 40% of the code could be reused, and that development times could be trimmed by about
33%.
• Also, there is an improvement in ETL performance, as T-SQL stored procedures execute natively on the
database engine.
On the drawbacks side:
• Debugging and maintainability would be harder; in particular, restart, checkpointing, and error-event
handling are available at the procedure level, not the line level.
• Portability suffers if the DBMS engine changes (e.g., if and when moving to Azure SQL DW).
• Logging and error handling would need to be implemented in the procedures as well.
The customer accepted the drawbacks in exchange for productivity and performance.
18 ETL development costs typically amount to around 70% of the total cost of a warehousing project.