
cbmreco: Add support to store results to file

Merged Felix Weiglhofer requested to merge fweig/cbmroot:cbmreco-archive into master
All threads resolved!

Adds rudimentary support to store reconstruction results via cbmreco:

  • Adds a RecoIO class that contains the results from one pass of the reconstruction (called IO as I plan to use it as a possible input to reconstruction, in addition to timeslices)
  • Adds an Archive class to store one or more reconstructed timeslices to file
  • Adds an --output flag to specify the output file, and a --write-per-timeslice flag to write results once per timeslice

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Just two minor remarks. I'll put my main issue in a separate comment. ;-)

  • Thanks for the MR, this is clear code and there is nothing to complain about from a formal point of view.

    Architecturally, however, let's try and sync our ideas.

    From your code, I would summarize the structure as such:

    output_file:
      [Archive][Archive][Archive]...
    
    class Archive:
      vector<fles::SubsystemIdentifier> fDetectors
      vector<RecoIO> fTimesliceResults
    
    class RecoIO:
      vector<sts::Digi> fStsDigis

    This raises several questions for me:

    • Why does an output file consist of a stream of objects named "Archive"? I would naively consider the whole file an "Archive" for several "Timeslice Results".
    • Should we now include a descriptor/identifier object as the first object in the archive stream so that we can include metadata about the archive file and the type of objects contained? That would also allow us to distinguish between several subtypes of archive files.
    • Where would the next detector's results go, say fTofDigis? Logically, they would need to be added to RecoIO, alongside the fStsDigis. But then, why is there a vector of RecoIO structs in the Archive object?
    • What purpose does the fDetectors vector serve? E.g., can the fTofDigis vector not just be empty if Tof data is not included for whatever reason?
    • Shouldn't we include the monitoring output structures of the algorithms in the output object?
    • Where in this structure would we add metadata and additional information that is not contained in vectors of Digis?

    What I envisioned is more like this structure:

    output_file:
      [ArchiveDescriptor][TimesliceResults][TimesliceResults][TimesliceResults]...
    
    TimesliceResults (a.k.a. CbmDigiTimeslice):   // Results per timeslice
      TimesliceDescriptor fDesc                // Timeslice descriptor (metadata)
      MonitoringSummaries fMonitoringSummaries // Monitoring results
      DigiData fData                           // Digi-level data results
    
    DigiData (a.k.a. CbmDigiData)
      StsDigiData fSts;    // STS data
      TofDigiData fTof;    // TOF data

    This is basically the structure that is already present (with Cbm... prefix) in the cbmroot repository in a first version created by @v.friese, and for the archive file format in ArchiveDescriptor.hpp and surrounding flesnet files.

    I understand that it might make sense to revise and extend some of the internal structures (such as StsDigiData).

    However, this MR seems to me like it goes in a different direction and aims for a significantly different output structure. Would it make sense to meet in a small circle to discuss the different ideas?
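    The layout sketched above could be written down in C++ roughly as follows. This is only an illustration of the nesting; the payload types here are placeholders, where the real code would use fles::TimesliceDescriptor, the Cbm... digi types, and so on:

```cpp
#include <cstdint>
#include <vector>

// Placeholder payload types; the real project types (fles::TimesliceDescriptor,
// CbmStsDigi, ...) would be used instead.
struct TimesliceDescriptor { uint64_t index = 0; };
struct MonitoringSummaries { /* per-algorithm monitoring results */ };
struct StsDigiData { std::vector<uint32_t> digis; };  // stand-in for STS digis
struct TofDigiData { std::vector<uint32_t> digis; };  // stand-in for TOF digis

// DigiData (a.k.a. CbmDigiData): digi-level results of all detectors.
struct DigiData {
  StsDigiData fSts;  // STS data
  TofDigiData fTof;  // TOF data
};

// TimesliceResults (a.k.a. CbmDigiTimeslice): results per timeslice.
struct TimesliceResults {
  TimesliceDescriptor fDesc;                 // timeslice metadata
  MonitoringSummaries fMonitoringSummaries;  // monitoring results
  DigiData fData;                            // digi-level data results
};
```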

    • Resolved by Felix Weiglhofer

      Another aspect which I think deserves attention is a kind of schema evolution for the format of the contained data. For instance, Felix proposes to store a plain vector of digis. We know that a single vector for detector digis will not be the final word, so this is subject to evolution. This evolution is foreseen in CbmStsDigiData, which currently wraps a single std::vector, but with the intention of being able to change the representation later on.

      How do we catch this evolution when reading files? ROOT with its automatic streamers is nice for that, but schema evolution can surely also be implemented in the BOOST streamer.

  • Plus, maybe most importantly: in the end, we do not intend to store all digis from a timeslice - we want to select events within and store only digis within the selected events.

    What we have right now is

    • CbmDigiTimeslice: for all digis within the timeslice. Contains the timeslice descriptor as metadata, as described by Jan above.
    • CbmDigiEvent: for digis in an event. Contains some basic event metadata alongside.

    Both differ only in the metadata and carry the same container CbmDigiData.

    Edited by Volker Friese
  • Felix Weiglhofer added 5 commits


    • f94aa51e - 1 commit from branch computing:master
    • 1dfdf71d - algo: Add archive class for reconstruction results.
    • 9218cbba - Options.cxx: Add flag for output-file.
    • 84ef75ea - algo::Options: Rename write-per-ts flag to split-output-per-ts.
    • 6db49d56 - algo::Archive: Implement %n wildcard in filenames to store TS id in filename.

    Compare with previous version
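    The %n wildcard mentioned in the commit messages could work along these lines. This is only a guess at the semantics from the commit message ("store TS id in filename"), not the actual implementation in algo::Archive:

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: replace every "%n" in the output filename pattern
// with the timeslice id, so each timeslice gets its own file.
std::string ExpandFilename(std::string pattern, uint64_t tsId) {
  const std::string wildcard = "%n";
  const std::string id = std::to_string(tsId);
  std::size_t pos = 0;
  while ((pos = pattern.find(wildcard, pos)) != std::string::npos) {
    pattern.replace(pos, wildcard.size(), id);
    pos += id.size();  // continue after the inserted id
  }
  return pattern;
}
```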

  • added 1 commit

    • e9d7b150 - algo::Archive: Implement %n wildcard in filenames to store TS id in filename.

    Compare with previous version

  • added 1 commit

    • adcfd20f - algo::Archive: Implement %n wildcard in filenames to store TS id in filename.

    Compare with previous version

  • added 1 commit

    • 1df1d7a1 - algo::Archive: Implement %n wildcard in filenames to store TS id in filename.

    Compare with previous version

  • The current archive structure is in no way intended to be final. I should have made that clearer. I don't think we need to do schema evolution before the data challenge. So this is just intended to be a starting point to evolve from.

    CbmDigiData and CbmTimesliceData aren't currently available in algo, but it should be easy to add them via OnlineData. With that I can change the scheme to Jan's proposal. The ArchiveDescriptor would go into Archive. RecoIO should contain DigiData (and eventually also hold Hits and Tracks).

  • When we did these data / container classes last year, I understood DigiTimeslice not as an object going to file, but as a transient one: the result of unpacking and the input to reconstruction. What should be stored are the selected DigiEvents. I think it is good, though, to at least have the possibility to also store DigiTimeslice for debugging purposes.

    One thing to consider: when the final result is an array of DigiEvents, shall we store it flat or sorted into timeslices? In the former option, the reference to the timeslice metadata would be less straightforward. The latter solution would look like, e.g.,

    TimesliceResults (a.k.a. CbmDigiTimeslice):   // Results per timeslice
      TimesliceDescriptor fDesc                // Timeslice descriptor (metadata)
      MonitoringSummaries fMonitoringSummaries // Monitoring results
      DigiEvents fEvents                       // Selected events with digi-level data
    
    DigiEvent (a.k.a. CbmDigiEvent)
      Event_metadata fEventDesc;    // event information, e.g., trigger time, event number, trigger type...
      DigiData fData;               // Digi-level data results
    
    DigiData (a.k.a. CbmDigiData)
      StsDigiData fSts;    // STS data
      TofDigiData fTof;    // TOF data

    This comes at the expense that, in the analysis, one needs a double loop (first over timeslices, then over the events within each timeslice). This is how it is organised in cbmreco_fairrun, where a timeslice is an entry in the ROOT tree and the events are an array within.

    The classes CbmDigiEvent, CbmDigiTimeslice and CbmDigiData are simple, non-ROOT and, I think, suitable to be used in algo (for GPU I cannot speak). They are in core/data so they can also be used from offline code. I think we can leave that until the DC and think about renaming / moving afterwards.
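    The double loop described above can be sketched as follows. The struct names mirror the pseudocode; real code would use CbmDigiTimeslice / CbmDigiEvent / CbmDigiData:

```cpp
#include <cstddef>
#include <vector>

// Placeholder types mirroring the per-timeslice layout sketched above.
struct DigiData { std::vector<int> stsDigis; };
struct DigiEvent { DigiData data; };
struct DigiTimeslice { std::vector<DigiEvent> events; };

// The double loop in the analysis: first over timeslices,
// then over the events within each timeslice.
std::size_t CountStsDigis(const std::vector<DigiTimeslice>& timeslices) {
  std::size_t n = 0;
  for (const auto& ts : timeslices)    // outer loop: timeslice entries
    for (const auto& ev : ts.events)   // inner loop: events in the timeslice
      n += ev.data.stsDigis.size();
  return n;
}
```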

    Edited by Volker Friese
  • @fweig:

    The ArchiveDescriptor would go into Archive.

    How does that work, given the structure:

    output_file:
      [Archive][Archive][Archive]...

    My understanding is: Boost serialization always reads full objects from the stream, one object (and all its constituents) at a time. If this object is not of the specified type, an exception is thrown.

    And the ArchiveDescriptor is meant to describe the Archive (file), not the Timeslice - there is a different descriptor for that. Also, it shall be used to distinguish between Archive (file) types so that the right types of objects can be read from the file.

    RecoIO should contain DigiData

    The 'RecoIO' structure looks like it is supposed to be used per detector, but 'DigiData' contains the data of all detectors.

    What is OnlineData? Did I miss that?

    BTW: What is the use case for per-timeslice output files? Speed of random access? Or do we rather want a mechanism similar to the one in flesnet, splitting large files by size for manageability?

    @v.friese: I have no clear opinion if the digis should be stored in a structure per event or rather in a flat array with index tables.

    I do, however, have an opinion on the second question, whether data should be archived per timeslice: I think we should do that, at least for now. Some structure will also be helpful for the offline analysis, which will probably be distributed to many nodes in a similar way. And at least for Boost serialization it is essential to have objects of manageable size when reading from a file.

    Edited by Jan de Cuveland
  • Felix Weiglhofer added 94 commits


    • 1df1d7a1...404e37d5 - 87 commits from branch computing:master
    • 70dac77b - algo: Add archive class for reconstruction results.
    • 16565fca - Options.cxx: Add flag for output-file.
    • 7a1cb68a - algo::Options: Rename write-per-ts flag to split-output-per-ts.
    • 600f1a7b - algo::Archive: Implement %n wildcard in filenames to store TS id in filename.
    • d8ebeb78 - algo::Options: Rename output-types flag to avoid nameclash with output.
    • 42b8dc5f - algo: Added descriptors to Archive and RecoIO.
    • c2656612 - algo: Rename RecoIO to RecoResults.

    Compare with previous version

  • Felix Weiglhofer added 22 commits


    • c2656612...2cd24535 - 15 commits from branch computing:master
    • 07013a66 - algo: Add archive class for reconstruction results.
    • 2881a099 - Options.cxx: Add flag for output-file.
    • 334c4655 - algo::Options: Rename write-per-ts flag to split-output-per-ts.
    • 97b6e770 - algo::Archive: Implement %n wildcard in filenames to store TS id in filename.
    • 554d97a1 - algo::Options: Rename output-types flag to avoid nameclash with output.
    • 75ed473e - algo: Added descriptors to Archive and RecoIO.
    • 665197fd - algo: Rename RecoIO to RecoResults.

    Compare with previous version

  • I've updated the MR:

    • Renamed RecoIO to RecoResults
    • Archive and RecoResults now both contain a descriptor
    • For Archive the descriptor is modeled after fles::ArchiveDescriptor (Should be merged in the future?)
    • RecoResultsDescriptor currently contains the Options and Parameters used by the reconstruction.
  • Merging this now. We'll need to revisit the storage format after the Data Challenge.

  • Felix Weiglhofer added 20 commits


    • 665197fd...3767a708 - 13 commits from branch computing:master
    • 932d9bc7 - algo: Add archive class for reconstruction results.
    • 81202236 - Options.cxx: Add flag for output-file.
    • 4bfcc332 - algo::Options: Rename write-per-ts flag to split-output-per-ts.
    • 90bd8f08 - algo::Archive: Implement %n wildcard in filenames to store TS id in filename.
    • de584cf5 - algo::Options: Rename output-types flag to avoid nameclash with output.
    • a85b92bb - algo: Added descriptors to Archive and RecoIO.
    • 6cbfa7cb - algo: Rename RecoIO to RecoResults.

    Compare with previous version

  • Felix Weiglhofer resolved all threads


  • Felix Weiglhofer enabled an automatic merge when the pipeline for 6cbfa7cb succeeds

