
WHITEPAPER

www.persistent.com

© 2017 Persistent Systems Ltd. All rights reserved.

On the plus side:

a. The elastic resources of Azure Data Factory address scalability concerns regarding the volume of data to be moved to the cloud and transformed.

b. Flexible ETL and faster turnaround are also expected, even though debugging facilities appear to be limited.

On the negative side, some limits and constraints of the product would be hit in this project; for instance, the maximum number of pipelines is 100, and the maximum number of fields per database object is also 100.

Pricing follows the “pay as you go” model (see https://azure.microsoft.com/en-us/pricing/details/data-factory/) and is a function of the activities being used. It depends on factors such as (i) the frequency of activities, (ii) where the activities run (between cloud data stores, or with an on-premises store involved; the pricing is a fraction of a dollar per hour), (iii) whether a pipeline is active or not, and (iv) whether an activity is being re-run. As an example, if it takes 2 hours a day to move data from an on-premises SQL Server database to Azure Blob storage, the total comes to about $7.50 per month. Again, cost should generally compare favorably.
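The $7.50 figure can be reproduced with simple arithmetic. The hourly rate below is an illustrative assumption chosen to match the example in the text, not a quoted Azure price; consult the pricing page for current rates.

```python
# Back-of-the-envelope monthly cost for a daily Data Factory copy activity.
# RATE_PER_HOUR is an assumed figure for on-premises data movement,
# picked to reproduce the example above; real rates vary over time.
HOURS_PER_DAY = 2        # daily copy window, from the example
DAYS_PER_MONTH = 30
RATE_PER_HOUR = 0.125    # assumed $/hour (a "fraction of a dollar per hour")

monthly_hours = HOURS_PER_DAY * DAYS_PER_MONTH
monthly_cost = monthly_hours * RATE_PER_HOUR
print(f"${monthly_cost:.2f} per month")  # $7.50 per month
```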

As transferring hundreds of terabytes over the internet is hardly feasible (a 100 TB transfer over a dedicated 100 Mbps connection takes over 100 days), the solution would be to use Azure’s Import/Export service. This service allows Azure customers to securely transfer large amounts of data to Azure Blob storage by shipping hard disk drives to an Azure data center.
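The transfer-time estimate above follows from the link capacity alone. The calculation below assumes decimal units (1 TB = 10^12 bytes) and a fully saturated link; at line rate the transfer already takes about 93 days, and protocol overhead, retransmissions, and shared bandwidth push the real figure past 100 days.

```python
# Ideal transfer time for 100 TB over a dedicated 100 Mbps link.
# Assumes decimal units and 100% link utilization; real-world
# overhead makes the actual duration longer.
data_bits = 100 * 10**12 * 8   # 100 TB expressed in bits
link_bps = 100 * 10**6         # 100 Mbps in bits per second

seconds = data_bits / link_bps
days = seconds / 86400         # 86,400 seconds per day
print(f"{days:.1f} days at line rate")  # 92.6 days at line rate
```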

For now, the customer has decided to go with the standard alternative, namely, to use SSIS deployed on Azure VMs (as opposed to on-premises). Some months into the project, as with many ETL tools, development productivity started becoming a serious obstacle, in line with common experience [18].

Productivity gains from ETL tools are related to (i) the feature set of the tool itself (which keeps improving over time), (ii) the skill of the person using the tool, (iii) the way in which the tool is used for the project at hand, and (iv) the number of tools and templates available to speed up the development process.

PSL tackled the last point and was able to demonstrate that over 75% of the tables in the warehouse could be produced with parameterized T-SQL stored-procedure templates. Template parameters include schema names, table names, stored procedure names, field names, and values. Recent versions of SSIS come with predefined templates for many common tasks, as well as the ability to create a custom template for a complex script that must be created frequently. Once the parameters in a T-SQL template have been replaced, the resulting stored procedure can be called from SSIS, which is therefore used as an orchestration engine and to implement the remaining transformations that cannot be templatized.
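The templating approach amounts to string substitution: a stored-procedure template with placeholders for schema, table, procedure, and field names is instantiated once per target table. The sketch below illustrates the idea; the template text and parameter names are hypothetical simplifications, not PSL’s actual templates.

```python
# Minimal sketch of T-SQL template instantiation. The template body and
# parameter names are hypothetical illustrations of the technique; the
# actual project templates are more elaborate.
from string import Template

PROC_TEMPLATE = Template("""\
CREATE PROCEDURE $schema_name.$proc_name AS
BEGIN
    INSERT INTO $schema_name.$target_table ($field_list)
    SELECT $field_list FROM staging.$source_table;
END
""")

def instantiate(params: dict) -> str:
    """Replace template parameters to produce a deployable procedure."""
    return PROC_TEMPLATE.substitute(params)

sql = instantiate({
    "schema_name": "dw",
    "proc_name": "load_dim_customer",
    "target_table": "dim_customer",
    "source_table": "customer",
    "field_list": "customer_id, name, region",
})
print(sql)
```

Each generated procedure is then deployed to the database and invoked from SSIS (e.g., via an Execute SQL task), leaving SSIS to orchestrate and to handle the transformations that cannot be templatized.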

On the gains front:

• PSL found that 40% of the code could be reused, and that development times could be trimmed by about

33%.

• Also, there is an improvement in ETL performance, as T-SQL stored procedures execute natively on the

database engine.

On the drawbacks side:

• Debugging and maintainability would be harder; in particular, restart, checkpointing, and error-event handling are available at the procedure level, not the line level

• Portability suffers if the DBMS engine changes (e.g., if and when moving to Azure SQL DW)

• Logging and error handling would need to be implemented in the procedures as well

The customer accepted the drawbacks in exchange for productivity and performance.

[18] ETL development costs typically amount to around 70% of the total cost of a warehousing project.
