
W H I T E P A P E R

© 2017 Persistent Systems Ltd. All rights reserved.

www.persistent.com

1. Introduction and Scope of this document

Under the joint sponsorship of Sid Chatterjee, Head of the Corporate CTO organization, and Sameer Dixit, Head of the Analytics practice in the Services Unit of Persistent Systems Ltd (PSL), Mataprasad Agrawal and Shirish Joshi, Senior analytics architects, and Fernando Velez, Chief Data Technologist in the Corporate CTO organization, are striving to identify and document best data management practices for projects within the Analytics practice. These projects provide analytical reporting and dashboarding based on structured and/or ad hoc queries, as well as predictive analysis, across various technologies within the analytics space.

This effort addresses a gap, as there is currently no documented process or set of best practices in this space. The intended audience for this document is IT professionals who need to know the details of building and managing a data platform for analytics applications: this includes data architects, designers, developers, database administrators / dev-ops teams and managers. It will be of most use to professionals who have already had some exposure to data management, data warehousing and business intelligence. We do provide, as a minimum, a glossary in chapter 9 to explain the intended meanings of concepts and acronyms, notably those in italics.

By data management we mean the data acquisition, data integration and data governance activities needed to build a dedicated environment consolidating data from a growing number of sources for Business Intelligence, i.e., for analytics projects. The last part of the previous sentence is important, as there are other data management endeavors with a different final goal, such as creating master data for an organization. We are leaving data management for Master Data Management (MDM) out of the initial scope, but will point out some of the commonalities with analytics data management along the way.

The presence of a dedicated target environment assumes that the data from multiple heterogeneous sources are integrated in advance and stored in this environment. There is an alternative architecture, called data federation, which leaves the data in place and queries it live from a target environment running only analytics tools. However, experience shows that this works only in limited environments; in its general setting, poor query performance may be a showstopper and, in any case, it does not provide important capabilities such as maintaining data history and improving data quality.
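To make the point about data history concrete, here is a minimal sketch, not taken from any project described in this document. It uses Python's standard sqlite3 module with two in-memory databases standing in for a source system and the dedicated environment. The table names, the `load_snapshot` helper and the sample rows are all hypothetical, chosen only for illustration. A federation-style live query sees only the current state of the source, whereas the dedicated environment, loaded periodically, retains earlier values.

```python
import sqlite3

# Hypothetical source system: holds only the current state of each customer.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT)")
source.execute("INSERT INTO customer VALUES (1, 'Pune')")

# Hypothetical dedicated environment: keeps one row per customer per load.
target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE customer_history (id INTEGER, city TEXT, load_date TEXT)"
)


def load_snapshot(load_date):
    """Copy the current source rows into the target, stamped with the load date."""
    rows = source.execute("SELECT id, city FROM customer").fetchall()
    target.executemany(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        [(cid, city, load_date) for cid, city in rows],
    )


load_snapshot("2017-01-01")
# The source overwrites the old value; it keeps no history of its own.
source.execute("UPDATE customer SET city = 'Nagpur' WHERE id = 1")
load_snapshot("2017-02-01")

# Federation-style live query: only the current value is visible.
live = source.execute("SELECT city FROM customer WHERE id = 1").fetchall()

# Query against the dedicated environment: both values are retained.
history = target.execute(
    "SELECT city, load_date FROM customer_history WHERE id = 1 ORDER BY load_date"
).fetchall()

print(live)
print(history)
```

Full snapshots are only one simple way to keep history and are used here for brevity; a real platform would more likely capture changes incrementally.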

This dedicated environment may be characterized along several dimensions:

a. Type: It may be either a datamart, a data warehouse or a data lake.

b. Deployment: On premises or in the cloud.

c. Industry: It may serve any vertical industry.

This best practices document covers all cases from the dimensions above. It is also mostly tool agnostic.

2. Document Plan

As can be appreciated, the scope defined above is quite daunting; besides, there already exist publications addressing most of these topics in detail. Therefore, rather than embarking on a long journey to write yet another book on data integration and data governance, we decided to pursue the following strategy for this document:

1. Pick a widely-known reference publication,

2. Summarize the most relevant points of the publication on selected topics, and

3. Enhance it in areas not covered by the publication, drawing on our own experience.