WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

Variations of the data highway concept include the Lambda architecture (speed layer, batch layer and serving layer) and the Kappa architecture (streaming layer and serving layer). We see a demand for similar information delivery. To satisfy this demand, either a CDC (change data capture) mechanism based on database transactions, such as Oracle GoldenGate, or a framework such as Apache Spark is used. GoldenGate or Apache Storm fits into the existing enterprise landscape to create the “higher speed lanes” without affecting other existing operations (such as batch Informatica or DataStage jobs). Apache Spark has the advantage that it operates on both streaming and batch data.
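To make the layer split concrete, the following is a minimal sketch of how a Lambda-style serving layer could combine a precomputed batch view with the speed layer's recent delta. It is an illustration only: the function names (`aggregate`, `serve`), the event fields and the sample data are assumptions made for this example and do not come from any of the products named above.

```python
from collections import defaultdict


def aggregate(events):
    """Sum amounts per account; stands in for both the batch and speed layer logic."""
    totals = defaultdict(float)
    for event in events:
        totals[event["account"]] += event["amount"]
    return dict(totals)


def serve(batch_view, speed_view):
    """Serving layer: merge the batch view with the not-yet-batched recent delta."""
    merged = dict(batch_view)
    for account, amount in speed_view.items():
        merged[account] = merged.get(account, 0.0) + amount
    return merged


# Hypothetical events already covered by the last batch run.
historical = [{"account": "A", "amount": 100.0}, {"account": "B", "amount": 40.0}]
# Hypothetical events that arrived after the batch run (speed layer only).
recent = [{"account": "A", "amount": 5.0}, {"account": "C", "amount": 7.0}]

result = serve(aggregate(historical), aggregate(recent))
print(result)
```

In a Kappa-style design the batch view would disappear and the same aggregation would run over a single replayable stream.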

—

Create a detailed “interface agreement” for feed-based extraction.

The data modeler needs to understand, profile and document every data source. A large portion of data sources are feeds coming from the SoRs (Systems of Record). In large financial institutions or airline companies (with a lot of legacy systems), these are often mainframes. For each system that supplies a feed, it is important to create a common understanding, documented in an “interface agreement” that both producer and consumer agree on. In large enterprises, the other Lines of Business or business entities that supply data to the warehouse are almost like external organizations. The agreement is jointly created by the data modeler and the ETL architect.

The agreement would detail the following information:

—

Format of the data (fixed width, delimited with a specified separator character, or block-style layout)

—

Field names and expected data characteristics for fields (domains, ranges)

—

Incremental or full feed

—

File encoding (ASCII/Unicode, EBCDIC)

—

Transmission mechanism (Secure FTP, shared folder) and transmission frequency

—

Re-transmission policy

—

File naming convention (with overwrite-if-duplicate logic)

—

Any markers to indicate that the file is ready for consumption (for example, a polling agent can poll for a filename.complete file in the shared folder)

—

Any markers to indicate that no data is available for that time period (for example, a 0-byte file is made available, or a filename.empty file is created in the shared folder)

—

Any actions to be taken once file is consumed (archiving)
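As an illustration of how such an agreement can be made machine-checkable, the sketch below records a few of the items above in a small structure and shows how a polling agent might interpret the marker files. Everything named here is hypothetical: the `InterfaceAgreement` fields, the `feed_state` and `out_of_range` helpers, the sample file name, and the convention that markers are named by appending `.complete` or `.empty` to the data file name are assumptions for this example, not part of any specific tool.

```python
import tempfile
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Dict, List, Optional, Tuple


class FeedState(Enum):
    NOT_READY = "not ready"
    EMPTY = "no data for this period"
    READY = "ready to consume"


@dataclass
class InterfaceAgreement:
    feed_name: str
    file_format: str            # "fixed-width", "delimited" or "block"
    separator: Optional[str]    # only used for delimited feeds
    encoding: str               # e.g. "ascii", "utf-8", or "cp037" (an EBCDIC code page)
    incremental: bool           # True for an incremental feed, False for a full feed
    transmission: str           # e.g. "sftp" or "shared folder"
    frequency: str              # e.g. "daily"
    ranges: Dict[str, Tuple[float, float]]  # field name -> agreed (min, max)


def feed_state(folder: Path, data_file: str) -> FeedState:
    """Interpret marker files (assumed naming: <data_file>.empty / .complete)."""
    if (folder / (data_file + ".empty")).exists():
        return FeedState.EMPTY
    if (folder / (data_file + ".complete")).exists() and (folder / data_file).exists():
        return FeedState.READY
    return FeedState.NOT_READY


def out_of_range(agreement: InterfaceAgreement, record: Dict[str, float]) -> List[str]:
    """Return the agreed fields that are missing or outside their agreed range."""
    bad = []
    for name, (low, high) in agreement.ranges.items():
        value = record.get(name)
        if value is None or not (low <= value <= high):
            bad.append(name)
    return bad


# Hypothetical agreement for a daily, pipe-delimited "accounts" feed.
agreement = InterfaceAgreement(
    feed_name="accounts", file_format="delimited", separator="|",
    encoding="utf-8", incremental=True, transmission="shared folder",
    frequency="daily", ranges={"balance": (0.0, 1_000_000.0)},
)

with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp)
    name = "accounts_20240101.dat"  # hypothetical file name
    state_before = feed_state(shared, name)
    (shared / name).write_text("A|100\n", encoding=agreement.encoding)
    (shared / (name + ".complete")).touch()  # producer signals the file is complete
    state_after = feed_state(shared, name)

violations = out_of_range(agreement, {"balance": -5.0})
```

Keeping the agreement in a structured form like this lets the ETL job reject a feed that violates the agreed encoding, ranges or marker protocol before any data is loaded.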

—

Use modeling tools to the maximum, but only until they hit their limitations.

Some observations about the usage of modeling tools:

—

In the organizations we have worked with, the data modeler's preferred diagramming tool is CA ERwin, MS Visio, or Embarcadero/Idera ER/Studio.

—

When supported by the tool, these organizations have a repository of data models, and well defined

(tool-enforced) naming conventions and practices.

—

The reverse-engineering capabilities, which generate physical models from database schemas, are largely unused, partly because of version issues and partly for fear of “messing” with the original design/layout.

—

Very often, the tool versions used are the older versions