Draft: Parallel version of TofHitFinder

Closed Dominik Smith requested to merge d.smith/cbmroot:TofHitFinderParallel into master

This is an attempt at modifying the class CbmTaskTofHitFinder from !955 (merged) to enable parallelization across RPCs using OpenMP.

Merge request reports

Merge request pipeline #23311 failed for 35e19443

Approval is optional

Closed by Dominik Smith 1 year ago (Nov 14, 2023 12:10pm UTC)

Activity

  • Dominik Smith requested review from @j.decuveland

  • assigned to @d.smith

  • Dominik Smith added 4 commits

    • ca0451d4 - 1 commit from branch computing:master
    • c97953e4 - New TOF hitfinder class HitFinderTof and corresponding CbmTaskTofHitFinder.
    • 58f6cb4f - Switched to using iterators in HitFinderTof.cxx. Fixed a bug in calibration....
    • 7b4658e6 - First draft of parallelized TofHitFinder.

    • Author Maintainer

      @j.decuveland can you please take a look at this? What I tried to do here is apply OpenMP to the code from !955 (merged) in a similar fashion to what you did for CbmTaskUnpack. The crucial changes are in CbmTaskTofHitFinder::BuildClusters().

      The code works and reproduces the output from the other MR. Unfortunately, setting the number of threads to more than one via OMP_NUM_THREADS actually makes the execution slower (the result remains correct).

      I'm guessing that this has something to do with this: https://docs.oracle.com/cd/E19205-01/819-5270/aewcz/index.html

      Do you have an idea how this can be done correctly?

      Also of interest to: @p.-a.loizeau @f.uhlig @v.friese

    • I found no obvious issues from looking at the code. However, there are two constructs that can generally be troublesome: schedule(dynamic) and critical.

      The first, schedule(dynamic), causes code to be generated that distributes the work items to the worker threads at run time, which can add significant overhead. It is necessary only if a static division of labour yields very imbalanced results; in most cases, static is the recommended choice. This relates to the question of granularity: which portion of work are we distributing, and how many supermodules and RPCs are there? Dynamic scheduling can be especially wasteful when applied over many iterations. Here, we are distributing the RPCs but processing the SMs sequentially. I think it makes sense to combine both loops into one, especially since the combined index is already computed explicitly in the code: for (uint32_t rpcIdx = 0; rpcIdx < iNbSm * iNbRpc; rpcIdx++). My suggestion: combine the two for loops, use schedule(static), and see if the run time changes (see the sketch below).
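
      A minimal sketch of what I mean (processRpc is a hypothetical stand-in for the per-RPC cluster finding; iNbSm and iNbRpc as in the current code):

        #include <cstdint>
        #include <vector>

        // Hypothetical stand-in for the per-RPC cluster finding.
        void processRpc(std::vector<int>& digis) { /* ... cluster the digis of one RPC ... */ }

        // One combined, statically scheduled loop over all (SM, RPC) pairs of one SM type.
        // The flat index rpcIdx == smId * iNbRpc + rpcId matches the existing storage layout.
        void processSmType(std::vector<std::vector<int>>& storDigi, uint32_t iNbSm, uint32_t iNbRpc)
        {
        #pragma omp parallel for schedule(static)
          for (uint32_t rpcIdx = 0; rpcIdx < iNbSm * iNbRpc; rpcIdx++) {
            processRpc(storDigi[rpcIdx]);
          }
        }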

      The second, critical, should always be a last resort, as it requires all threads to wait and synchronize. As a side note: a minimal change to use a single omp critical would only require putting the two statements into one block ({...}). However, I suspect it would not improve the performance by much.

      Here, I suspect that the insert might actually take significant time, so that a large portion of the code is in fact sequential. This could be studied by adding a timing counter to the computation and insert parts of each loop and comparing the results.

      I think a better solution would be to have each thread copy its local vector into the shared output vector individually, in parallel. This requires just one synchronization point, where the output vector is sized appropriately to hold all contributions and a vector of start indexes (prefixes) is built, so that every thread knows where to copy its data. Written like this, almost everything can be done in parallel, so I expect a significant improvement in this part. If you like, I can write you a small example of what I mean; a minimal sketch follows.
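
      A minimal sketch of the merge pattern, with a generic item type standing in for the hits (names are placeholders, not the actual hit-finder code):

        #include <omp.h>

        #include <algorithm>
        #include <vector>

        std::vector<int> output;  // shared output vector

        void parallelMerge(int nItems)
        {
          std::vector<size_t> prefix;
        #pragma omp parallel
          {
            const int ithread  = omp_get_thread_num();
            const int nthreads = omp_get_num_threads();
        #pragma omp single
            prefix.resize(nthreads + 1, 0);  // implicit barrier after single

            std::vector<int> local;  // thread-local results
        #pragma omp for schedule(static) nowait
            for (int i = 0; i < nItems; i++) {
              local.push_back(i);  // stand-in for the actual per-item work
            }
            prefix[ithread + 1] = local.size();
        #pragma omp barrier

        #pragma omp single
            {
              // Build start indexes: prefix[t] = where thread t's data begins.
              for (int t = 1; t <= nthreads; t++) prefix[t] += prefix[t - 1];
              output.resize(prefix[nthreads]);
            }  // implicit barrier: output is sized before any thread copies

            // Every thread copies its local results into its own slice, in parallel.
            std::copy(local.begin(), local.end(), output.begin() + prefix[ithread]);
          }
        }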

    • Author Maintainer

      Ok, I tested the suggested changes to the schedule and the loop. For each case, I also tested the same code without the merge step. Here are the results:

      old
      
      OMP_NUM_THREADS=1 ./run_qa.sh 
      Time  / TS           : 7.39 ms
      OMP_NUM_THREADS=4 ./run_qa.sh 
      Time  / TS           : 6.32 ms
      OMP_NUM_THREADS=8 ./run_qa.sh 
      Time  / TS           : 5.52 ms
      
      old without merge
      
      OMP_NUM_THREADS=1 ./run_qa.sh
      Time  / TS           : 5.63 ms
      OMP_NUM_THREADS=4 ./run_qa.sh
      Time  / TS           : 3.65 ms
      OMP_NUM_THREADS=8 ./run_qa.sh
      Time  / TS           : 4.19 ms
      
      new
      
      OMP_NUM_THREADS=1 ./run_qa.sh 
      Time  / TS           : 7.16 ms
      OMP_NUM_THREADS=4 ./run_qa.sh 
      Time  / TS           : 6.73 ms
      OMP_NUM_THREADS=8 ./run_qa.sh
      Time  / TS           : 5.30 ms
      
      new without merge
      
      OMP_NUM_THREADS=1 ./run_qa.sh
      Time  / TS           : 5.58 ms
      OMP_NUM_THREADS=4 ./run_qa.sh
      Time  / TS           : 3.61 ms
      OMP_NUM_THREADS=8 ./run_qa.sh
      Time  / TS           : 3.22 ms

      Changing the schedule and the loop doesn't seem to have a big effect. The merge step, however, is important, and its relative cost seems to grow with the number of threads. This points to large overhead in this step. I will try your solution.

      The total runtime of the recently optimized single-threaded implementation from !955 (merged) is ~5.5 ms. This includes writing to output. This makes me wonder whether OMP parallelization is feasible for the TOF hitfinder. Even if we take the best result here, 3.61 ms for 4 cores, and ignore the fact that we are still missing the merge step, this is still a pretty bad "per core" scaling. This is probably due to the large contribution of the distribution step (where digis are sorted by rpc) which probably cannot be parallelized efficiently.

      Edited by Dominik Smith
    • This is with mCBM data and only a few TS, right?
      I am not sure the data are really balanced "per RPC/per SM" with the geometry of this experiment, as we measure only one side of the collision and some detectors were placed closer to the beam pipe for detector tests.

      So it could be that you are limited by a single data block with the highest occupancy (which is then itself processed sequentially). If this is possible (no clue, I last worked with OMP ~10-15 years ago), I would try to check the per-thread throughput as mean time per data unit (digis or hits, depending on whether you consider input or output to be limiting); something like the sketch below.
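
      A minimal sketch of such a per-thread measurement (digisPerRpc is a hypothetical per-RPC input; omp_get_wtime() is standard OpenMP):

        #include <omp.h>

        #include <cstdio>
        #include <vector>

        // Accumulate busy time and processed digis per thread,
        // then print the mean time per digi for each thread.
        void measureThroughput(const std::vector<std::vector<int>>& digisPerRpc)
        {
          const int nthreads = omp_get_max_threads();
          std::vector<double> busy(nthreads, 0.);
          std::vector<long> nDigis(nthreads, 0);

        #pragma omp parallel for schedule(static)
          for (size_t rpc = 0; rpc < digisPerRpc.size(); rpc++) {
            const int tid   = omp_get_thread_num();
            const double t0 = omp_get_wtime();
            // ... process digisPerRpc[rpc] here ...
            nDigis[tid] += digisPerRpc[rpc].size();
            busy[tid] += omp_get_wtime() - t0;
          }
          for (int t = 0; t < nthreads; t++) {
            if (nDigis[t] > 0) std::printf("thread %d: %.2f us/digi\n", t, 1e6 * busy[t] / nDigis[t]);
          }
        }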

    • Author Maintainer

      These tests were run on simulated data with the "sis100_electron" setup. I will redo the tests on real data. Maybe the overhead shrinks in relation when the number of events is large.

    • Author Maintainer

      @p.-a.loizeau @j.decuveland I just added some diagnostic output to the Init() stage, where the number of SM types is printed to the screen, followed by the number of SMs of each type and the number of RPCs per SM. Here is what I get:

      [INFO] #Sm Types: 15
      [INFO] Sm Type 0 #Sm: 62 #Rpc 5
      [INFO] Sm Type 1 #Sm: 32 #Rpc 5
      [INFO] Sm Type 2 #Sm: 8 #Rpc 5
      [INFO] Sm Type 3 #Sm: 100 #Rpc 5
      [INFO] Sm Type 4 #Sm: 16 #Rpc 5
      [INFO] Sm Type 5 #Sm: 1 #Rpc 30
      [INFO] Sm Type 6 #Sm: 1 #Rpc 24
      [INFO] Sm Type 7 #Sm: 2 #Rpc 27
      [INFO] Sm Type 8 #Sm: 1 #Rpc 18
      [INFO] Sm Type 9 #Sm: 1 #Rpc 24
      [INFO] Sm Type 10 #Sm: 1 #Rpc 30
      [INFO] Sm Type 11 #Sm: 1 #Rpc 24
      [INFO] Sm Type 12 #Sm: 2 #Rpc 27
      [INFO] Sm Type 13 #Sm: 1 #Rpc 18
      [INFO] Sm Type 14 #Sm: 1 #Rpc 24

      These numbers are taken from the CbmTofDigiBdfPar object. I also checked that a non-zero number of digis is produced for each of the unique RPCs. This appears to be the case, although the distribution is not even.

      Since the parallel execution currently sits within the loop over SM types, I guess it is not surprising that there is a lot of overhead. I will try to parallelize over SM types as well (see the sketch below).
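
      One way to do this (a sketch using the fDigiBdfPar accessors from the current code) is to build a flat list of all (SM type, SM*RPC) index pairs once, and run a single parallel for over it:

        // Flatten the (SM type, SM, RPC) hierarchy into one index range,
        // so that a single parallel for covers all RPCs of all SM types.
        const uint32_t nTypes = fDigiBdfPar->GetNbSmTypes();
        std::vector<std::pair<uint32_t, uint32_t>> flatIdx;  // (smType, smId * nbRpc + rpcId)
        for (uint32_t t = 0; t < nTypes; t++) {
          const uint32_t n = fDigiBdfPar->GetNbSm(t) * fDigiBdfPar->GetNbRpc(t);
          for (uint32_t i = 0; i < n; i++) {
            flatIdx.emplace_back(t, i);
          }
        }
        #pragma omp parallel for schedule(static)
        for (size_t k = 0; k < flatIdx.size(); k++) {
          const auto& [t, i] = flatIdx[k];
          // ... run the cluster finder on fStorDigiExp[t][i] and fStorDigiInd[t][i] ...
        }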

    • Author Maintainer

      The loop over SM types is now included in the 'parallel for' statement. Unfortunately, it doesn't seem to change much.

  • Dear @f.uhlig, @v.friese, @p.-a.loizeau,

    you have been identified as code owner of at least one file which was changed with this merge request.

    Please check the changes and approve them or request changes.

  • Dominik Smith mentioned in merge request !955 (merged)

  • Dominik Smith added 7 commits

    • 7b4658e6...d3069b66 - 3 commits from branch computing:master
    • df4893c9 - New TOF hitfinder class HitFinderTof and corresponding CbmTaskTofHitFinder.
    • 8c52f8fd - Switched to using iterators in HitFinderTof.cxx. Fixed a bug in calibration....
    • 209b7a11 - Refactored HitFinderTof. Two micro-optimizations of CbmTaskTofHitFinder which...
    • d2b8fe6e - First draft of parallelized TofHitFinder.

  • Author Maintainer

    Updated to reflect the recent changes of !955 (merged)

    I profiled the code with Valgrind. Unfortunately, this didn't provide much insight: it only shows that a large fraction of the runtime goes to libgomp.so, without any further detail. So we apparently lose a lot to generic parallelization overhead.

  • Valgrind is a powerful tool, but I really recommend perf as the primary tool if you want to understand where time is spent on a modern machine. It is a little harder to wield, unfortunately, but since it reads hardware performance-counter registers, the program flow is not affected and all effects such as cache and memory timing are included.

    If you want to see where exactly time is spent within a library such as libgomp, you can use a version of that library compiled with debug symbols.
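
    For example (a typical perf workflow; the level of detail depends on available debug symbols):

      # record with call graphs, then browse the annotated profile
      perf record -g ./run_qa.sh
      perf report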

  • May I invite you to have a look at this example code and results? It includes a working example implementation of the parallel vector merge.

    I wanted to try this out, since we seem to have this problem (combining many vectors efficiently into one) in several places in our current local reconstruction code.

  • Dominik Smith added 10 commits

    • d2b8fe6e...ba0d449c - 5 commits from branch computing:master
    • 6341f860 - New TOF hitfinder class HitFinderTof and corresponding CbmTaskTofHitFinder.
    • 682381a7 - Switched to using iterators in HitFinderTof.cxx. Fixed a bug in calibration....
    • 0c4fe3c6 - Refactored HitFinderTof. Two micro-optimizations of CbmTaskTofHitFinder which...
    • 3eff75a4 - First draft of parallelized TofHitFinder.
    • c8df959b - CbmTaskTofHitFinder: Minor changes to loop ordering etc.

  • Author Maintainer

    I just included the improvements to the scheduling and the loop order suggested above.

  • Author Maintainer

    @j.decuveland Ok, here is a version that uses the parallel vector merge. Unfortunately, first tests don't suggest that it is faster. I will take a look at the data tomorrow, as @p.-a.loizeau suggested.

    pair<int32_t, int32_t> CbmTaskTofHitFinder::BuildClusters(CbmEvent* event)
    {
      // --- MC Event info (input file, entry number, start time)
      int32_t iInputNr  = 0;
      int32_t iEventNr  = 0;
      double dEventTime = 0.;
      GetEventInfo(iInputNr, iEventNr, dEventTime);
    
      // Local variables
      int32_t nDigis = 0;
      int32_t nHits  = 0;
    
      nDigis = (event ? event->GetNofData(ECbmDataType::kTofDigi) : fDigiMan->GetNofDigis(ECbmModuleId::kTof));
      LOG(debug) << "Number of TOF digis: " << nDigis;
    
      // Loop over the digi array and store the digis in separate vectors for each RPC module
      for (int32_t iDigi = 0; iDigi < nDigis; iDigi++) {
        const uint32_t digiIndex = (event ? event->GetIndex(ECbmDataType::kTofDigi, iDigi) : iDigi);
        const CbmTofDigi* pDigi  = fDigiMan->Get<CbmTofDigi>(digiIndex);
        assert(pDigi);
    
        // These are doubles in the digi class
        const int32_t smType = CbmTofAddress::GetSmType(pDigi->GetAddress());
        const int32_t smId   = CbmTofAddress::GetSmId(pDigi->GetAddress());
        const int32_t rpcId  = CbmTofAddress::GetRpcId(pDigi->GetAddress());
        const int32_t numRpc = fDigiBdfPar->GetNbRpc(smType);
        fStorDigiExp[smType][smId * numRpc + rpcId].push_back(*pDigi);
        fStorDigiInd[smType][smId * numRpc + rpcId].push_back(digiIndex);
      }
    
      const uint32_t iNbSmTypes = fDigiBdfPar->GetNbSmTypes();
    
      for (uint32_t iSmType = 0; iSmType < iNbSmTypes; iSmType++) {
        const uint32_t iNbSm  = fDigiBdfPar->GetNbSm(iSmType);
        const uint32_t iNbRpc = fDigiBdfPar->GetNbRpc(iSmType);
    
        size_t hitsSize    = fTofHits->size();
        size_t matchesSize = fTofDigiMatch->size();
    
        std::vector<size_t> prefix;
    #pragma omp parallel
        {
          int ithread  = omp_get_thread_num();
          int nthreads = omp_get_num_threads();
    #pragma omp single
          prefix.resize(nthreads + 1);
    
          //Store hits and match
          std::vector<CbmTofHit> hits;
          std::vector<CbmMatch> matches;
    
    #pragma omp for schedule(static) nowait
          for (uint32_t rpcIdx = 0; rpcIdx < iNbRpc * iNbSm; rpcIdx++) {
    
            //read digis
            std::vector<CbmTofDigi>& digiExp = fStorDigiExp[iSmType][rpcIdx];
            std::vector<int32_t>& digiInd    = fStorDigiInd[iSmType][rpcIdx];
    
            //call cluster finder
            std::vector<TofCluster> clusters = fAlgo[iSmType][rpcIdx](digiExp, digiInd);
    
            for (auto const& cluster : clusters) {
              TVector3 hitpos = TVector3(cluster.globalPos.X(), cluster.globalPos.Y(), cluster.globalPos.Z());
              TVector3 hiterr = TVector3(cluster.globalErr.X(), cluster.globalErr.Y(), cluster.globalErr.Z());
              //Fix: nHits will not be a sequence when running in parallel
              hits.emplace_back(cluster.detId, hitpos, hiterr, nHits, cluster.weightedTime, cluster.weightedTimeErr, 0, 0);
              nHits++;
    
              CbmMatch digiMatch;
              for (uint32_t i = 0; i < cluster.vDigiIndRef.size(); i++) {
                digiMatch.AddLink(CbmLink(0., cluster.vDigiIndRef.at(i), iEventNr, iInputNr));
              }
              matches.push_back(digiMatch);
            }
            digiExp.clear();
            digiInd.clear();
          }
          prefix[ithread + 1] = hits.size();
    #pragma omp barrier
    
    #pragma omp single
          {
            for (int i = 1; i < (nthreads + 1); i++) {
              prefix[i] += prefix[i - 1];
            }
            fTofHits->resize(hitsSize + prefix[nthreads]);
            fTofDigiMatch->resize(matchesSize + prefix[nthreads]);
          }
          std::copy(hits.begin(), hits.end(), fTofHits->begin() + hitsSize + prefix[ithread]);
          std::copy(matches.begin(), matches.end(), fTofDigiMatch->begin() + matchesSize + prefix[ithread]);
    
          // To do: Fix this when the rest works
          // const int32_t hitIndex = fTofHitsColl->GetEntriesFast();
          //if (event) event->AddData(ECbmDataType::kTofHit, hitIndex);
        }
      }
    
      return std::make_pair(nDigis, nHits);
    }
  • Dominik Smith added 1 commit

    • 1e34fc79 - CbmTaskTofHitFinder: Included loop over sm types in 'parallel for' statement.

  • Dominik Smith added 1 commit

    • 899af624 - CbmTaskTofHitFinder: Included vector merge in parallelization.

    • Author Maintainer

      Ok, I now also included the parallel vector merge into the version that includes the SM type loop in the parallelization.

      We do get a speedup over the single-threaded version from !955 (merged) when running the new version with multiple cores. Unfortunately, the result is not quite what I hoped for. In my latest test, which runs on a timeslice with 250 events, we barely shave off 25% of the runtime when running the new version with 8 cores (on a single core, the new version clearly loses).

      However, I haven't managed to run the hit finder on real data yet. Maybe things look better when the number of digis is a few orders of magnitude higher.

      Edited by Dominik Smith
    • Seeing the mildly underwhelming results, I am wondering whether we are starting out in the right place when it comes to parallelization.

      Does it really make sense to distribute the individual fractions of an event instead of processing the events in parallel?

    • While limited parallel speedup can come from newly introduced overhead, the main cause usually lies in the non-parallelized (sequential) parts of the code.

      If the parallelization overhead is small, the parallel version configured to run with a single thread should not be much slower than the original sequential version. You report that this is not the case. Do we understand this difference?

      On the other hand, if we can reduce the run time by only some 25 %, this would mean that almost 3/4 of our code is not yet executed in parallel. I feel this is something that we should (and I think we can) figure out.
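
      (A back-of-the-envelope check with Amdahl's law: with parallel fraction $p$ and $N$ threads, the speedup is $S(N) = 1/\big((1-p) + p/N\big)$. A 25 % runtime reduction at $N = 8$ means $S = 1/0.75 \approx 1.33$; solving $(1-p) + p/8 = 0.75$ gives $p = 2/7 \approx 0.29$, i.e. roughly 70 % of the runtime is still sequential.)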

    • Author Maintainer

      I think we are currently bottlenecked by the for loop following the comment

      // Loop over the digi array and store the digis in separate vectors for each RPC module

      I can do some time measurements to quantify this, but it seems likely.

      We need a way to sort the digis by RPC while preserving the time order. A straightforward parallel for would create a race condition, but there may be a clever way to do this.

    • Author Maintainer

      One option would be a two-step parallel sort, where we sort by unique RPC id first and by timestamp second.

      @p.-a.loizeau Is my statement correct, that we must preserve the time sortedness of the input digis?

    • Author Maintainer

      We can probably do some tricks with std::sort. I will think about this; a first idea is sketched below.
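
      For instance (a sketch with a hypothetical DigiRef type): std::stable_sort preserves the relative order of equal keys, so if the input is already time-sorted, a stable sort by RPC id alone keeps the digis time-ordered within each RPC:

        #include <algorithm>
        #include <cstdint>
        #include <vector>

        struct DigiRef {
          uint32_t rpcId;  // unique RPC id of the digi
          double time;     // digi timestamp
          int32_t index;   // index into the original digi array
        };

        // Bucket digis by RPC while preserving their time order.
        // With C++17, std::stable_sort(std::execution::par, ...) could run this in parallel.
        void sortByRpcPreservingTime(std::vector<DigiRef>& refs)
        {
          std::stable_sort(refs.begin(), refs.end(),
                           [](const DigiRef& a, const DigiRef& b) { return a.rpcId < b.rpcId; });
        }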

    • @p.-a.loizeau Is my statement correct, that we must preserve the time sortedness of the input digis?

      I have forgotten a bit about the algo organization, but I think so: otherwise we cannot escape early to the next cluster as soon as we find a digi outside the matching [time, space] range. So I suspect that removing the time-sortedness requirement would force a complete scan of the local vector for each cluster, "just in case the last digi fits".

127 // Input variables
128 CbmDigiManager* fDigiMan = nullptr;
129 TClonesArray* fEvents = nullptr;
130
131 // Output variables
132 std::vector<CbmTofHit>* fTofHits = nullptr;
133 std::vector<CbmMatch>* fTofDigiMatch = nullptr;
134
135 // Hit finder algo
136 std::map<uint32_t, std::map<uint32_t, cbm::algo::HitFinderTof>> fAlgo = {}; //[nbType][nbSm*nbRpc]
137 std::vector<cbm::algo::HitFinderTof*> fAlgoPtr; //[rpcUnique]
138
139 // Intermediate storage variables
140 std::vector<std::vector<std::vector<CbmTofDigi>>> fStorDigiExp; //[nbType][nbSm*nbRpc][nDigis]
141 std::vector<std::vector<std::vector<int32_t>>> fStorDigiInd; //[nbType][nbSm*nbRpc][nDigis]
142
  • Comment on lines +140 to +142
    Author Maintainer

    @p.-a.loizeau @v.friese @j.decuveland What is your opinion on this construction?

    I wonder whether we should switch to the following:

      std::vector<std::vector<std::vector<CbmTofDigi>*>> fStorDigiExp;  
      std::vector<std::vector<std::vector<int32_t>*>> fStorDigiInd;     

    or

      std::vector<std::vector<std::vector<CbmTofDigi>*>*> fStorDigiExp;  
      std::vector<std::vector<std::vector<int32_t>*>*> fStorDigiInd;     

    The original construction forces the inner vectors of each level to be stored consecutively in memory. This restriction is unnecessary and might make it harder for multiple threads to run efficiently.

  • Author Maintainer

    I'm guessing that the upper variant would be sufficient. Only the number of digis changes, which may lead to resizings.

  • Indeed, the first two dimensions are determined once at the startup/init stage and should not change in normal operation. If they do change, it will probably happen only between runs (configuration change/re-init), where the cost of possibly moving them is not relevant (and in these cases we may even just destroy and recreate the algo objects).

  • I appreciate your efforts in understanding (multi-threaded) parallelisation for this application case. However, I feel that we should leave this for a future step. Two reasons for that:

    1. The intention of this MR was to bring the TOF reconstruction into a form where it can be optimised (also, e.g., by third parties) and where it fits into our so-far developed scheme. I think this is achieved. Further optimisations can be done in a separate MR.

    2. It seems the bottlenecks are in the data I/O (sorting input digis and output hits w.r.t. time and/or detector module). We must here consider the interface to the previous step (unpacking or event building) and the subsequent step (global tracking). Just one vector for all TOF digis / hits is probably not the best solution.

    Specifically to Jan's remark:

    Seeing the mildly underwhelming results, I am wondering whether we are starting out in the right place when it comes to parallelization. Does it really make sense to distribute the individual fractions of an event instead of processing the events in parallel?

    Off the cuff, my statement would be: if we have established events on digi level, we can execute the entire reconstruction (including TOF) and trigger analysis in parallel between events. In this case, no further concurrency within one event is needed (it should even be avoided, in order not to compete for resources). We still need to sort w.r.t. modules, but the loop over modules can be just sequential.

    On the other hand, if we process the entire timeslice (without prior event definition), there are enough TOF data in each module, such that the parallelization between modules will pay off.

    It is still not clear whether a meaningful event definition at digi level is possible also at the highest interaction rates. So we should be compatible with both options (event or full timeslice). But if we are to study parallelisation, this should be done not within one event in event-by-event mode, but on the full timeslice. Multithreading should thus be optional, switchable through a flag or similar.

    So, I would recommend to proceed as follows:

    1. Within this MR, show the application to timeslice processing. In particular, does the execution time scale with the amount of input data in the timeslice? Does parallelisation between modules help in this operation mode?

    2. Having shown that, we should finish this MR and leave further optimization to a next step. More urgent, I think, is to make this algorithm fit for real mCBM data, for which, as I understand, not the TofSimpleClusterizer (the model you started from) but another algorithm (TofEventClusterizer) is being used. What are the additional features of the latter, and how can we get them into this new implementation? This should be investigated and the necessary modifications made in a separate MR.

    3. Once we have a complete chain running (from unpacking to complete reconstruction), we should identify the most time consuming pieces and decide where to put efforts for further optimization and parallelization.

  • Sorry. My previous comment was meant for the other MR !955 (merged). Nevertheless, maybe not too bad to have it in here, too.
