(v13) Overview of tuning example
This page applies to Harlequin v13.1r0 and later, and to Harlequin Core but not Harlequin MultiRIP.
The measurements were taken on a machine with two Intel® Xeon® Silver 4110 processors (10 cores per processor), 64 GiB of memory, L1 cache = 1.3 MiB, L2 cache = 20 MiB, L3 cache = 27.5 MiB. Hyper-Threading was turned off. There is no coordination between the RIPs because each one runs standalone in its own sandpit.
It is a common experience that each individual RIP “takes longer” to process the same page range as the number of RIPs is increased, despite having “plenty of cores and memory”.
Initial investigation immediately ruled out disk I/O and interprocess mutex contention as bottlenecks, using the “Performance Monitor” physical disk counters and by observing all OS events from the RIP via “Process Monitor” (from Sysinternals).
To understand why this is, we used Intel VTune Amplifier to inspect a single RIP instance in setups using one RIP, four RIPs and eight RIPs, with a CPU sampling interval of 5 ms. Before this analysis, we tuned the RIP to use both the minimal optimal band height and the number of render threads that let the job page range complete in the shortest time. Although BandHeight would likely remain optimal across all jobs on this machine due to the size of the L2 cache, the optimal number of render threads is likely to depend on the complexity of the job being processed.
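The “takes longer” effect across the one, four and eight RIP setups can be reproduced with a small timing harness. The following is a minimal sketch, not the tooling used for the measurements on this page: the function names `time_one` and `time_concurrent` are hypothetical, and the command line is a placeholder that you would replace with your own RIP invocation. It starts N copies of a command at the same time and reports the wall-clock time of each copy.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def time_one(cmd):
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def time_concurrent(cmd, instances):
    """Launch `instances` copies of cmd together; return one elapsed time per copy."""
    # One worker thread per process, so all copies are started without queuing.
    with ThreadPoolExecutor(max_workers=instances) as pool:
        return list(pool.map(time_one, [cmd] * instances))


if __name__ == "__main__":
    # Placeholder workload (assumption): substitute the real RIP command line
    # and job here. This stand-in only exercises the harness itself.
    cmd = [sys.executable, "-c", "sum(range(200000))"]
    for n in (1, 4, 8):
        times = time_concurrent(cmd, n)
        print(f"{n} instance(s): mean {sum(times) / n:.3f} s, "
              f"max {max(times):.3f} s")
```

If the mean per-instance time rises as the instance count goes from one to four to eight, with disk I/O and mutex contention already ruled out, that is the behaviour this page sets out to explain.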