c. Techniques like dataflow parallelism across unrelated dimensions were used within ETL jobs to speed up data loading. This did pose problems; for example, in-database surrogate key generation (SQL Server IDENTITY columns or Oracle sequences) could not be used with the parallel flows. However, this was resolved with tMap (a join component in Talend) and global variables, as sketched below.
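
A minimal sketch of the idea behind that fix: draw surrogate keys from a single process-wide counter (which is what a tMap expression backed by a Talend global variable amounts to) instead of relying on IDENTITY columns or sequences. The class and its seeding logic are hypothetical, not Talend's actual code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: parallel-safe surrogate key generation outside the
// database, analogous to a tMap expression reading a Talend global variable.
// The counter would be seeded from MAX(surrogate_key) in the target dimension.
public class SurrogateKeyGenerator {
    private final AtomicLong counter;

    public SurrogateKeyGenerator(long lastKeyInTarget) {
        this.counter = new AtomicLong(lastKeyInTarget);
    }

    // Safe to call from parallel dataflow branches; every call yields a unique key.
    public long nextKey() {
        return counter.incrementAndGet();
    }
}
```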
d. Load processing of the FACT data chunks was done in parallel, as sketched below.
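
A minimal Java sketch of that parallel chunk loading, assuming the chunks are independent; loadChunk is a hypothetical stand-in for the actual per-chunk load subjob.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelFactLoad {
    public static void main(String[] args) throws InterruptedException {
        List<String> chunks = List.of("fact_chunk_01", "fact_chunk_02", "fact_chunk_03");
        ExecutorService pool = Executors.newFixedThreadPool(4); // e.g., one worker per core
        for (String chunk : chunks) {
            pool.submit(() -> loadChunk(chunk)); // each chunk loads independently
        }
        pool.shutdown();
        pool.awaitTermination(2, TimeUnit.HOURS); // wait for all chunks to finish
    }

    // Placeholder for the real bulk insert of one FACT chunk.
    static void loadChunk(String chunk) {
        System.out.println("loading " + chunk);
    }
}
```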
e. Data was sourced with Talend's JDBC components in batches, with a database fetch size of 50,000 to 100,000 rows, to optimize processing time (see the sketch below).
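
For illustration, a sketch of what that batched extraction amounts to at the JDBC level; the connection details and query are placeholders, and the project used Talend's JDBC components rather than hand-written code.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class BatchedExtract {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:sqlserver://src-host;databaseName=SRC", "etl_user", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(50_000); // fetch 50,000-100,000 rows per round trip
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM source_orders")) {
                while (rs.next()) {
                    // transform each row and hand it to the load stage
                }
            }
        }
    }
}
```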
f. Stored procedure-based ETL was profiled query by query using database query profilers, and optimizations were applied such as loading data into temporary staging tables, dropping and rebuilding indexes post-load, and loading data in chunks of 200,000 to 250,000 records (sketched below).
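
A sketch of that load pattern under stated assumptions (SQL Server target; hypothetical table, index, and column names): disable the index, insert in chunks with periodic commits, then rebuild once at the end.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class ChunkedFactLoad {
    static final int CHUNK_SIZE = 200_000; // 200,000-250,000 records per chunk

    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret")) {
            try (Statement s = dw.createStatement()) {
                // Drop (disable) the nonclustered index before the heavy insert.
                s.execute("ALTER INDEX ix_fact_sales_date ON fact_sales DISABLE");
            }
            dw.setAutoCommit(false);
            try (PreparedStatement ins = dw.prepareStatement(
                    "INSERT INTO fact_sales (date_key, product_key, amount) VALUES (?, ?, ?)")) {
                int inBatch = 0;
                for (Object[] row : stagedRows()) { // rows landed earlier in the staging table
                    ins.setInt(1, (Integer) row[0]);
                    ins.setInt(2, (Integer) row[1]);
                    ins.setBigDecimal(3, (BigDecimal) row[2]);
                    ins.addBatch();
                    if (++inBatch == CHUNK_SIZE) { // flush one chunk and commit
                        ins.executeBatch();
                        dw.commit();
                        inBatch = 0;
                    }
                }
                ins.executeBatch(); // flush the final partial chunk
                dw.commit();
            }
            try (Statement s = dw.createStatement()) {
                // Rebuild the index once after the load instead of row by row.
                s.execute("ALTER INDEX ix_fact_sales_date ON fact_sales REBUILD");
            }
        }
    }

    // Placeholder for reading rows out of the temporary staging table.
    static List<Object[]> stagedRows() { return List.of(); }
}
```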
g. Data pages and indexes became fragmented due to continuous loads (inserts/updates), so database statistics and indexes were periodically rebuilt using scripts. A post-load index rebuild job was run in parallel across the processors available on the machine, as sketched below.
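
A sketch of such a maintenance job, assuming a SQL Server target; MAXDOP spreads each rebuild across the available processors, and the object names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class PostLoadMaintenance {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // Defragment pages touched by the continuous inserts/updates.
            s.execute("ALTER INDEX ALL ON fact_sales REBUILD WITH (MAXDOP = 8)");
            // Refresh optimizer statistics so query plans stay accurate.
            s.execute("UPDATE STATISTICS fact_sales WITH FULLSCAN");
        }
    }
}
```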
h. FACT and dimension tables were created on separate file groups to speed up disk I/O for dimensional query performance (illustrated in the DDL sketch after item j).
i. A dedicated server was recommended for ETL, as Talend is a memory-centric engine, with the JVM heap space and other factors tuned accordingly.
j. The data model was designed so that each business area/process was modeled as a single star schema, as sketched below.
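
A hypothetical DDL sketch combining items h and j: one business process modeled as a single star, with dimension and fact tables placed on separate filegroups (FG_DIM and FG_FACT, assumed to exist). All names and columns are illustrative only.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class StarSchemaDdl {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // Dimensions on their own filegroup (item h).
            s.execute("CREATE TABLE dim_date (date_key INT PRIMARY KEY, calendar_date DATE) ON FG_DIM");
            s.execute("CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name VARCHAR(100)) ON FG_DIM");
            // One fact table per business process, keyed only by its dimensions (item j).
            s.execute("CREATE TABLE fact_sales (" +
                      "  date_key INT REFERENCES dim_date(date_key)," +
                      "  product_key INT REFERENCES dim_product(product_key)," +
                      "  amount DECIMAL(18,2)) ON FG_FACT");
        }
    }
}
```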
k. For large fact tables, partitioning was used with an appropriate index scheme to cover the report queries (see the sketch below).
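
A sketch of that partitioning idea on SQL Server (boundary values and names are illustrative): a monthly range partition function and scheme, plus a partition-aligned index so date-bounded report queries touch only a few partitions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class FactPartitioning {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // Monthly ranges over an integer date key such as 20170115.
            s.execute("CREATE PARTITION FUNCTION pf_month (INT) " +
                      "AS RANGE RIGHT FOR VALUES (20170101, 20170201, 20170301)");
            s.execute("CREATE PARTITION SCHEME ps_month " +
                      "AS PARTITION pf_month ALL TO ([PRIMARY])");
            // Partition-aligned clustered index covering the report queries.
            s.execute("CREATE CLUSTERED INDEX cx_fact_sales " +
                      "ON fact_sales (date_key) ON ps_month (date_key)");
        }
    }
}
```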
l. PSL also proved that the system scales linearly with additional hardware as the mixed query workload was doubled.
m. Eventually, the team followed a "Monitor-Analyze-Optimize" approach blended into sprint cycles, collecting system and DB statistics such as query counts, average SQL executions per second, average execution time per query, maximum execution time, average IOPS, and average active sessions, to identify whether the system was deteriorating over time (see the sketch below).
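
A sketch of the kind of statistics pull this involves on SQL Server, reading the sys.dm_exec_query_stats DMV (elapsed times there are reported in microseconds); the connection details are placeholders, and the exact metrics the project tracked are those listed above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryStatsMonitor {
    public static void main(String[] args) throws SQLException {
        String sql =
            "SELECT TOP 20 execution_count, " +
            "       total_elapsed_time / execution_count AS avg_elapsed_us, " +
            "       max_elapsed_time, total_logical_reads " +
            "FROM sys.dm_exec_query_stats " +
            "ORDER BY total_elapsed_time DESC";
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "monitor_user", "secret");
             Statement s = dw.createStatement();
             ResultSet rs = s.executeQuery(sql)) {
            while (rs.next()) {
                // Log the snapshot; comparing snapshots across sprints reveals drift.
                System.out.printf("execs=%d avg_us=%d max_us=%d reads=%d%n",
                    rs.getLong(1), rs.getLong(2), rs.getLong(3), rs.getLong(4));
            }
        }
    }
}
```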
n. Data Visualizations – MicroStrategy (9.x) was used to view dashboards and reports along with ad hoc views. Report queries were optimized using a de-normalized dimensional model and aggregate tables (sketched below). For a few dashboards, MicroStrategy's in-memory cube was used to speed up dashboards on mobile devices.
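
As an illustration of the aggregate-table optimization, a sketch that pre-computes a monthly rollup so dashboard queries avoid scanning the detailed fact table; the table names and the integer date-key convention are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class BuildAggregateTable {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // date_key / 100 turns 20170115 into the month bucket 201701.
            s.execute("SELECT date_key / 100 AS month_key, product_key, " +
                      "       SUM(amount) AS total_amount " +
                      "INTO agg_sales_month " +
                      "FROM fact_sales " +
                      "GROUP BY date_key / 100, product_key");
        }
    }
}
```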
3. BI Performance – School Information Management System
Requirements/Challenges
a. PSL was tasked with developing a cloud-based analytics solution for school information management.
b. The reporting requirements were a mix of transactional and analytical reporting. There were a large number of entities with 1:m relationships that needed to be tracked for each change (Slowly Changing Dimensions). In general, there were 50+ dimension tables by which data had to be sliced and diced, along with pre-defined date ranges. A sketch of the change-tracking pattern follows.
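
A minimal sketch of Type 2 change tracking under assumed names (dim_student, stg_student, and a tracked grade_level attribute): expire the current dimension row when a tracked attribute changes, then insert the new version as the current row.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class Scd2Load {
    public static void main(String[] args) throws SQLException {
        try (Connection dw = DriverManager.getConnection(
                 "jdbc:sqlserver://dw-host;databaseName=SIS_DW", "etl_user", "secret");
             Statement s = dw.createStatement()) {
            // 1. Expire current rows whose tracked attribute changed in the staging feed.
            s.executeUpdate(
                "UPDATE d SET d.valid_to = GETDATE(), d.is_current = 0 " +
                "FROM dim_student d JOIN stg_student st ON st.student_id = d.student_id " +
                "WHERE d.is_current = 1 AND d.grade_level <> st.grade_level");
            // 2. Insert a fresh current row for every student lacking one
            //    (covers both changed and brand-new students).
            s.executeUpdate(
                "INSERT INTO dim_student (student_id, grade_level, valid_from, valid_to, is_current) " +
                "SELECT st.student_id, st.grade_level, GETDATE(), NULL, 1 " +
                "FROM stg_student st " +
                "WHERE NOT EXISTS (SELECT 1 FROM dim_student d " +
                "                  WHERE d.student_id = st.student_id AND d.is_current = 1)");
        }
    }
}
```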