WP_Data Management Best Practices

W H I T E P A P E R

www.persistent.com

—

Multidimensional schema to NoSQL - Column Family.

The basis for the logical model in NoSQL systems is

unchanged. However, the physical model for column family based systems like Cassandra and HBase will

differ. With these systems, usually a single table for the Fact will contain a de-normalized structure. The

Dimensions attributes become part of column families of the Fact table. This is the way to achieve join-based

filtering for these databases. In addition to being present in the de-normalized structure, the dimensions are

also separately retained as tables. Performance of equality driven queries on dimensions and range based

queries on dimensions need to be carefully evaluated.

This approach works well in case of few dimensions.

A NoSQL system is applicable in multiple business

domains like network access, perimeter/physical security, and data related to audio-video analysis.

—

Multidimensional schema to NoSQL - Document Oriented.

In case of document oriented NoSQL

databases, new “dimensional attributes” (who, why, what, where, when) get added as JSON elements, to the

main “document” fact table. The “document” fact table stores the “how” (many, much, often). We don’t see

document databases playing the role of a data warehouse system, as often.

5.4.5 Self-Service and Agility

Self Service Dimension Modeling and ETL seem a little distant currently. Some of the difficult problems in this field

relate to inferring a relationship, and then cascading maintenance of updates or deletes automatically. Tracking

history and propagating changes, or merging keys related to entities (and their associated measures and results) are

all difficult to achieve. In section

we show an example of the limitations of today’s self-service data preparation

6.3.4

tools when it comes to maintaining relationships across datasets (e.g., referential integrity). When data is deleted or

updated, the relationships with other datasets typically need to bemaintained by hand.

Agile software development methods can be applied to dimensional modeling in the following areas:

—

During Sprint Planning.

The Bus Matrix is an effective framework to help come up with the backlog and

plan the sprints.

Clusters of the common dimensions (X axis of the bus matrix), or dimensions in order of

importance (in terms of priority/criticality of business processes) could be a guiding principle for selecting

activities per sprint. Make sure you involve business users. In addition, experience has shown that the first

iteration should deliver the maximum number of key dimensions and facts. This ensures that the designs for

the data model and BI semantic layer are validated with actual data and reports early, minimizing refactoring

later in the project.

—

During Development.

Agile BI/DW & ETL development can take place provided the Data Modeler, Data

Architect and ETL Architect all work in separate sprints

, with the modeler working at least 1-2 sprints in

advance of the ETL Architect. Some of this is because the Modeling activity involves a lot of discussions and

meetings, it is not always possible to complete this activity in the first few days of the sprint, and it is better to

dedicate a full sprint advantage to modeling. The Data Architect must (in conjunction with the Data Modeler)

design the physical schema and carry out some scripting/tests to validate the structures. The ETL Architect

can then take over in the following sprint.

—

After Deployment:

—

In our experience, the work carried out after deployment largely relates to the performance tuning of the

presentation layer to address issues related to the BI layer.

—

Other than the performance, the addition of newer attributes is the other big set of changes, post

deployment.

—

Major requirements about historization or changes to the bus matrix can be driven the same way in a

series of development sprints.