

W H I T E P A P E R
© 2017 Persistent Systems Ltd. All rights reserved. 5
www.persistent.com
1. Introduction and Scope of this document
Under the joint sponsorship of Sid Chatterjee, Head of the Corporate CTO organization, and Sameer Dixit, Head of the
Analytics practice in the Services Unit of Persistent Systems Ltd (PSL), Mataprasad Agrawal and Shirish Joshi, Senior
Analytics Architects, and Fernando Velez, Chief Data Technologist in the Corporate CTO organization, are striving to
identify and document best data management practices for projects within the Analytics practice. These projects provide
analytical reporting and dashboarding based on structured and/or ad hoc queries, as well as predictive analysis, across
various technologies within the analytics space.
This effort addresses a gap, as there is currently no documented process or set of best practices in this space.
The intended audience for this document is IT professionals who need to know the details of building
and managing a data platform for analytics applications: this includes data architects, designers, developers,
database administrators / dev-ops teams and managers. It will be of most use to a professional who has already had
some exposure to data management, data warehousing and business intelligence. We do provide, as a minimum, a
glossary in chapter 9 to explain the intended meanings of concepts and acronyms, notably those in italics.
By data management we mean data acquisition, data integration and data governance activities to build a dedicated
environment consolidating data from a growing number of sources for Business Intelligence, i.e., for analytics
projects. The last part of the previous sentence is important, as there are other data management endeavors with a
different final goal, such as creating master data for an organization. We are leaving data management for
Master Data Management (MDM) out of the initial scope, but will point out some of the commonalities with analytics
data management along the way.
The presence of a dedicated, target environment assumes that the data from multiple heterogeneous sources are
integrated in advance and are stored in this environment. There is an alternative architecture, called data federation,
which is to leave the data in place and query it live from the target environment running only analytics tools. However,
experience shows that this works only in limited environments; in the general setting, poor query performance may be a
showstopper and, in any case, federation does not provide important capabilities such as maintaining data history and
improving data quality.
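The contrast between the two architectures can be illustrated with a minimal sketch. It is not taken from any specific product: the two in-memory SQLite databases standing in for sources, the table names customer and customer_history, the function names and the sample rows are all hypothetical, chosen only to show why a dedicated environment can keep history while a federated query sees only the current state of each source.

```python
import sqlite3

def make_source(rows):
    """Create a hypothetical operational source holding current customer cities."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    return conn

def federated_query(sources):
    """Data federation: query every source live and combine the rows at query time."""
    result = []
    for conn in sources:
        result.extend(
            conn.execute("SELECT id, city FROM customer ORDER BY id").fetchall()
        )
    return result

def load_snapshot(target, sources, load_date):
    """Dedicated environment: copy source rows into the target, stamped with a load date."""
    for conn in sources:
        rows = conn.execute("SELECT id, city FROM customer").fetchall()
        target.executemany(
            "INSERT INTO customer_history VALUES (?, ?, ?)",
            [(cid, city, load_date) for cid, city in rows],
        )

# Two hypothetical sources (sample data is illustrative only).
crm = make_source([(1, "Pune")])
billing = make_source([(2, "Nagpur")])

# The dedicated target environment accumulates dated snapshots.
target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE customer_history (id INTEGER, city TEXT, load_date TEXT)"
)

load_snapshot(target, [crm, billing], "2017-01-01")
# The source overwrites the old value; it keeps no history of its own.
crm.execute("UPDATE customer SET city = 'Mumbai' WHERE id = 1")
load_snapshot(target, [crm, billing], "2017-02-01")

# Federation sees only the current state of each source.
live = federated_query([crm, billing])

# The dedicated environment still holds both values for customer 1.
history = target.execute(
    "SELECT city, load_date FROM customer_history WHERE id = 1 ORDER BY load_date"
).fetchall()

print("federated view:", live)
print("history in target:", history)
```

In this sketch the federated view loses the earlier city as soon as the source is updated, whereas the target environment retains one row per load. A real platform would add incremental loading, data quality checks and proper slowly changing dimension handling, none of which is shown here.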
This dedicated environment may be characterized along several dimensions:
a. Type: It may be either a datamart, a data warehouse or a data lake.
b. Deployment: On premises or in the cloud.
c. Industry: It may serve any vertical industry.
This best practices document covers all cases along the dimensions above. It is also mostly tool-agnostic.
2. Document Plan
As can be appreciated, the scope defined above is quite daunting; besides, there already exist publications
addressing most of these topics in detail. Therefore, rather than embarking on a long journey to write yet another
book on data integration and data governance, we decided to pursue the following strategy for this document:
1. Pick a widely-known reference publication,
2. Summarize the most relevant points of the publication on selected topics, and
3. Enhance it in areas not covered by the publication, drawing on our own experience.