c. Techniques like dataflow parallelism across unrelated dimensions were used within ETL jobs to speed up data loading. This did pose problems; for example, in-database surrogate key generation (SQL Server IDENTITY columns or Oracle sequences) could not be used with the parallel flows. However, this was resolved with tMap (a join component in Talend) and global variables, as sketched below.
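
A minimal sketch of the idea behind that fix: draw surrogate keys from a single process-wide counter (which is what a tMap expression backed by a Talend global variable amounts to) instead of relying on IDENTITY columns or sequences. The class and its seeding logic are hypothetical, not Talend's actual code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: parallel-safe surrogate key generation outside the
// database, analogous to a tMap expression reading a Talend global variable.
// The counter would be seeded from MAX(surrogate_key) in the target dimension.
public class SurrogateKeyGenerator {
    private final AtomicLong counter;

    public SurrogateKeyGenerator(long lastKeyInTarget) {
        this.counter = new AtomicLong(lastKeyInTarget);
    }

    // Safe to call from parallel dataflow branches; every call yields a unique key.
    public long nextKey() {
        return counter.incrementAndGet();
    }
}
```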
d. Load processing of the FACT data chunks was done in parallel, as sketched below.
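
A minimal Java sketch of that parallel chunk loading, assuming the chunks are independent; loadChunk is a hypothetical stand-in for the actual per-chunk load subjob.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFactLoad {
    public static void main(String[] args) throws InterruptedException {
        List<String> chunks = List.of("fact_chunk_01", "fact_chunk_02", "fact_chunk_03");
        ExecutorService pool = Executors.newFixedThreadPool(4); // e.g., one worker per core
        for (String chunk : chunks) {
            pool.submit(() -> loadChunk(chunk)); // each chunk loads independently
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS); // wait for all chunks to finish
    }

    // Placeholder for the real bulk insert of one FACT chunk.
    static void loadChunk(String chunk) {
        System.out.println("loading " + chunk);
    }
}
```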
e. Data was sourced with Talend's JDBC components in batches, with a database fetch size of 50,000 to 100,000 rows, to optimize processing time (see the sketch below).
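
For illustration, a sketch of what that batched extraction amounts to at the JDBC level; the connection details and query are placeholders, and the project used Talend's JDBC components rather than hand-written code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BatchedExtract {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:sqlserver://src-host;databaseName=SRC", "etl_user", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(50_000); // fetch 50,000-100,000 rows per round trip
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM source_orders")) {
                while (rs.next()) {
                    // transform each row and hand it to the load stage
                }
            }
        }
    }
}
```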
f. Stored procedure-based ETL was profiled query by query using database query profilers, and optimizations were applied such as loading data into temporary staging tables, dropping and rebuilding indexes post-load, and loading data in chunks of 200,000 to 250,000 records (sketched below).
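
A sketch of that load pattern under stated assumptions (SQL Server target; hypothetical table, index, and column names): disable the index, insert in chunks with periodic commits, then rebuild once at the end.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class ChunkedFactLoad {
    static final int CHUNK_SIZE = 200_000; // 200,000-250,000 records per chunk

    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret")) {
            try (Statement s = dw.createStatement()) {
                // Drop (disable) the nonclustered index before the heavy insert.
                s.execute("ALTER INDEX ix_fact_sales_date ON fact_sales DISABLE");
            }
            dw.setAutoCommit(false);
            try (PreparedStatement ins = dw.prepareStatement(
                    "INSERT INTO fact_sales (date_key, product_key, amount) VALUES (?, ?, ?)")) {
                int inBatch = 0;
                for (Object[] row : stagedRows()) { // rows landed earlier in the staging table
                    ins.setInt(1, (Integer) row[0]);
                    ins.setInt(2, (Integer) row[1]);
                    ins.setBigDecimal(3, (BigDecimal) row[2]);
                    ins.addBatch();
                    if (++inBatch == CHUNK_SIZE) { // flush one chunk and commit
                        ins.executeBatch();
                        dw.commit();
                        inBatch = 0;
                    }
                }
                ins.executeBatch(); // flush the final partial chunk
                dw.commit();
            }
            try (Statement s = dw.createStatement()) {
                // Rebuild the index once after the load instead of row by row.
                s.execute("ALTER INDEX ix_fact_sales_date ON fact_sales REBUILD");
            }
        }
    }

    // Placeholder for reading rows out of the temporary staging table.
    static List<Object[]> stagedRows() { return List.of(); }
}
```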
g. Data pages and indexes became fragmented due to continuous loads (inserts/updates), so database statistics and indexes were periodically rebuilt using scripts. A post-load index rebuild job was run in parallel across the processors available on the machine, as sketched below.
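
A sketch of such a maintenance job, assuming a SQL Server target; MAXDOP spreads each rebuild across the available processors, and the object names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class PostLoadMaintenance {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // Defragment pages touched by the continuous inserts/updates.
            s.execute("ALTER INDEX ALL ON fact_sales REBUILD WITH (MAXDOP = 8)");
            // Refresh optimizer statistics so query plans stay accurate.
            s.execute("UPDATE STATISTICS fact_sales WITH FULLSCAN");
        }
    }
}
```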
h. FACT and dimension tables were created on separate file groups to speed up disk I/O for dimensional query performance (illustrated in the DDL sketch after item j).
i. A dedicated server was recommended for ETL, as Talend is a memory-centric engine, with the JVM heap space and other factors tuned accordingly.
j. The data model was designed so that each business area/process was modeled as a single star schema, as sketched below.
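
A hypothetical DDL sketch combining items h and j: one business process modeled as a single star, with dimension and fact tables placed on separate filegroups (FG_DIM and FG_FACT, assumed to exist). All names and columns are illustrative only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class StarSchemaDdl {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // Dimensions on their own filegroup (item h).
            s.execute("CREATE TABLE dim_date (date_key INT PRIMARY KEY, calendar_date DATE) ON FG_DIM");
            s.execute("CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name VARCHAR(100)) ON FG_DIM");
            // One fact table per business process, keyed only by its dimensions (item j).
            s.execute("CREATE TABLE fact_sales (" +
                      "  date_key INT REFERENCES dim_date(date_key)," +
                      "  product_key INT REFERENCES dim_product(product_key)," +
                      "  amount DECIMAL(18,2)) ON FG_FACT");
        }
    }
}
```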
k. For large fact tables, partitioning was used with an appropriate index scheme to cover the report queries (see the sketch below).
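
A sketch of that partitioning idea on SQL Server (boundary values and names are illustrative): a monthly range partition function and scheme, plus a partition-aligned index so date-bounded report queries touch only a few partitions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class FactPartitioning {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // Monthly ranges over an integer date key such as 20170115.
            s.execute("CREATE PARTITION FUNCTION pf_month (INT) " +
                      "AS RANGE RIGHT FOR VALUES (20170101, 20170201, 20170301)");
            s.execute("CREATE PARTITION SCHEME ps_month " +
                      "AS PARTITION pf_month ALL TO ([PRIMARY])");
            // Partition-aligned clustered index covering the report queries.
            s.execute("CREATE CLUSTERED INDEX cx_fact_sales " +
                      "ON fact_sales (date_key) ON ps_month (date_key)");
        }
    }
}
```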
l. PSL also proved that the system scales linearly with additional hardware as the mixed query workload was doubled.
m. Eventually, the team followed a "Monitor-Analyze-Optimize" approach blended into sprint cycles, collecting system and DB statistics such as query counts, average SQL executions per second, average execution time per query, maximum execution time, average IOPS, and average active sessions, to identify whether the system was deteriorating over time (see the sketch below).
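
A sketch of the kind of statistics pull this involves on SQL Server, reading the sys.dm_exec_query_stats DMV (elapsed times there are reported in microseconds); the connection details are placeholders, and the exact metrics the project tracked are those listed above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryStatsMonitor {
    public static void main(String[] args) throws SQLException {
        String sql =
            "SELECT TOP 20 execution_count, " +
            "       total_elapsed_time / execution_count AS avg_elapsed_us, " +
            "       max_elapsed_time, total_logical_reads " +
            "FROM sys.dm_exec_query_stats " +
            "ORDER BY total_elapsed_time DESC";
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "monitor_user", "secret");
             Statement s = dw.createStatement();
             ResultSet rs = s.executeQuery(sql)) {
            while (rs.next()) {
                // Log the snapshot; comparing snapshots across sprints reveals drift.
                System.out.printf("execs=%d avg_us=%d max_us=%d reads=%d%n",
                    rs.getLong(1), rs.getLong(2), rs.getLong(3), rs.getLong(4));
            }
        }
    }
}
```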
n. Data Visualizations – MicroStrategy (9.x) was used to view dashboards and reports along with ad hoc views. Report queries were optimized using a de-normalized dimensional model and aggregate tables (sketched below). For a few dashboards, MicroStrategy's in-memory cube was used to speed up dashboards on mobile devices.
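
As an illustration of the aggregate-table optimization, a sketch that pre-computes a monthly rollup so dashboard queries avoid scanning the detailed fact table; the table names and the integer date-key convention are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class BuildAggregateTable {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // date_key / 100 turns 20170115 into the month bucket 201701.
            s.execute("SELECT date_key / 100 AS month_key, product_key, " +
                      "       SUM(amount) AS total_amount " +
                      "INTO agg_sales_month " +
                      "FROM fact_sales " +
                      "GROUP BY date_key / 100, product_key");
        }
    }
}
```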
3. BI Performance – School Information Management System
Requirements/Challenges
a. PSL was tasked with developing a cloud-based analytics solution for school information management.
b. The reporting requirements were a mix of transactional and analytical reporting. There were a large number of entities with 1:m relationships that needed to be tracked for each change (Slowly Changing Dimensions). In general, there were 50+ dimension tables by which data had to be sliced and diced, along with pre-defined date ranges. A sketch of the change-tracking pattern follows.
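
A minimal sketch of Type 2 change tracking under assumed names (dim_student, stg_student, and a tracked grade_level attribute): expire the current dimension row when a tracked attribute changes, then insert the new version as the current row.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class Scd2Load {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=SIS_DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // 1. Expire current rows whose tracked attribute changed in the staging feed.
            s.executeUpdate(
                "UPDATE d SET d.valid_to = GETDATE(), d.is_current = 0 " +
                "FROM dim_student d JOIN stg_student st ON st.student_id = d.student_id " +
                "WHERE d.is_current = 1 AND d.grade_level <> st.grade_level");
            // 2. Insert a fresh current row for every student lacking one
            //    (covers both changed and brand-new students).
            s.executeUpdate(
                "INSERT INTO dim_student (student_id, grade_level, valid_from, valid_to, is_current) " +
                "SELECT st.student_id, st.grade_level, GETDATE(), NULL, 1 " +
                "FROM stg_student st " +
                "WHERE NOT EXISTS (SELECT 1 FROM dim_student d " +
                "                  WHERE d.student_id = st.student_id AND d.is_current = 1)");
        }
    }
}
```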