Row-parallelization of entire TRD hit reconstruction.
This MR builds upon !1783 (merged) and pushes the changes implemented therein even further: Now the entire chain "cluster finding -> hit finding -> hit merging" is parallelized by row index.
The hit merging step, which was formerly done module by module, is now done between neighboring rows, such that first all rows with an even index attempt to merge with the next row, then all rows with an odd index repeat this step. This is fully implemented for TRD 2D and a placeholder class is supplied where the same can be done for 1D in the future.
This procedure is much more efficient than merging over an entire module, in particular if the extra time-sort step that fixes the bug reported in !1783 (merged) is applied (the fix is supplied here, but is currently commented out). The changes introduce a small amount of overhead, which causes a small performance loss when running on a single core, but the resulting algorithm scales much better with the number of cores.
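For illustration, here is a minimal sketch of the even/odd row-merge scheme described above. It is not the code in this MR: the `Hit` struct, the `Mergeable` criterion and all function names are hypothetical stand-ins. The point it shows is that within one pass every row pair is disjoint from the others, so the inner loop is the natural place to run in parallel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal hit type; NOT the actual TRD hit class.
struct Hit {
  int col;      // pad column
  double time;  // hit time
  bool merged;  // true once the hit was absorbed by a hit in the row below
};

using Row = std::vector<Hit>;

// Hypothetical merge criterion: same column and close in time.
inline bool Mergeable(const Hit& a, const Hit& b, double dt)
{
  return a.col == b.col && std::fabs(a.time - b.time) < dt;
}

// Try to merge hits of one row with hits of the next row.
// Only these two rows are touched.
inline void MergeRowPair(Row& lower, Row& upper, double dt)
{
  for (auto& a : lower) {
    if (a.merged) continue;  // already absorbed in an earlier pass
    for (auto& b : upper) {
      if (b.merged) continue;
      if (Mergeable(a, b, dt)) {
        b.merged = true;  // 'a' absorbs 'b' (a real merge would also update 'a')
        break;
      }
    }
  }
}

// Pass 0: rows 0,2,4,... merge with the next row.
// Pass 1: rows 1,3,5,... merge with the next row.
inline void MergeEvenOdd(std::vector<Row>& rows, double dt)
{
  assert(dt >= 0.0);
  for (std::size_t parity = 0; parity < 2; ++parity) {
    // Iterations of this loop work on disjoint row pairs, so they could be
    // distributed over threads (the threading mechanism is left open here).
    for (std::size_t r = parity; r + 1 < rows.size(); r += 2) {
      MergeRowPair(rows[r], rows[r + 1], dt);
    }
  }
}
```

In this sketch a hit that was absorbed in the first pass is skipped in the second one, so a cluster spanning three rows ends up as two hits. That is an assumption of the sketch, chosen to mirror the limitation on three-row merges noted further down.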
In addition, the following changes to the online TRD hit reconstruction (1D and 2D) were applied:
- Separated TRD hit merging from hit finding.
- Moved HitFactory2D class to dedicated file.
- Cleaned up TRD reconstruction classes and fixed several warnings.
Note that the hit-merge step, by construction, now misses possible three-row merges. If this is relevant, an extension to three-row merging can easily be implemented.
A review is requested from @a.bercuci, but @apuntke, @p.kaehler and @dschledt should also take a look. In principle this is now at a stage where a thorough investigation of the correctness of the algorithm by the TRD teams (1D and 2D) is desirable.
@a.bercuci: We can probably now think about adding support for a more complicated fitting procedure. It is conceivable that this can even be done online.
@s.zharko @se.gorbunov: The segmentation of the partitioned vector produced by trd::Hitfind will be changed if this is merged. The partitions are now groups of hits with the same row index. The module address is supplied for each block.
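To make the new segmentation concrete, here is a rough sketch of the layout as I would describe it to downstream code: one partition per row, each carrying the address of the module it belongs to. The class `RowPartitionedHits` and the `HitStub` type are hypothetical and only illustrate the idea; this is not the interface of the actual partitioned vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a TRD hit.
struct HitStub {
  int row;
  double time;
};

// Hypothetical container: flat hit storage plus per-partition offsets and
// module addresses. One partition corresponds to one row of one module, so
// consecutive partitions may carry the same module address.
class RowPartitionedHits {
 public:
  void AddPartition(std::uint32_t moduleAddress, const std::vector<HitStub>& hits)
  {
    fData.insert(fData.end(), hits.begin(), hits.end());
    fOffsets.push_back(fData.size());
    fAddresses.push_back(moduleAddress);
  }

  std::size_t NPartitions() const { return fAddresses.size(); }

  // Module address of partition i.
  std::uint32_t Address(std::size_t i) const
  {
    assert(i < NPartitions());
    return fAddresses[i];
  }

  // Number of hits in partition i.
  std::size_t Size(std::size_t i) const
  {
    assert(i < NPartitions());
    return fOffsets[i + 1] - fOffsets[i];
  }

  // Pointer to the first hit of partition i.
  const HitStub* Begin(std::size_t i) const
  {
    assert(i < NPartitions());
    return fData.data() + fOffsets[i];
  }

 private:
  std::vector<HitStub> fData;
  std::vector<std::size_t> fOffsets = {0};  // fOffsets[i] = start of partition i
  std::vector<std::uint32_t> fAddresses;
};
```

Code that previously assumed one partition per module would, under this layout, have to group consecutive partitions with the same address to recover the per-module view.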
Also of interest to @p.-a.loizeau @fweig.
Update:
Added the option to parallelize the hit merge step by module index, while keeping row-wise parallelization of the previous steps (cluster building and hit finding). The choice is controlled by a preprocessor flag.
If sorting of the hit mergers' input data is enabled, I find a non-trivial difference of around 1 percent in the number of output hits between the two methods. The number of hits is lower with module-wise merging, suggesting that large clusters spanning multiple rows play a role. On the other hand, the runtime on 8 cores is roughly cut in half when switching to row-wise parallelization. So it ultimately comes down to priorities. I have nothing against including both variants in the final code.
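For readers unfamiliar with this kind of compile-time switch, the selection between the two merge variants works roughly like the sketch below. The macro name `TRD_MERGE_BY_MODULE` and the function `MergeMode` are made up for illustration; the flag actually used in the code has a different name.

```cpp
#include <string>

// Hypothetical flag name. Uncomment (or pass -DTRD_MERGE_BY_MODULE to the
// compiler) to select module-wise merging.
// #define TRD_MERGE_BY_MODULE

// Reports which merge variant was compiled in.
inline std::string MergeMode()
{
#ifdef TRD_MERGE_BY_MODULE
  return "module";  // merge step parallelized by module index
#else
  return "row";     // merge step parallelized by row index
#endif
}
```

Because the choice is made by the preprocessor, only one variant is compiled into a given build, and switching between them requires recompiling.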