Commit 318436bf authored by Pierre-Alain Loizeau

[MQ] Sbatch scripts fixes following investigation of problems seen online in Au runs

    - Change HWM for Sink process input to 1 as this applies to each link to an Evt builder individually and not to the sum of all as expected (solves huge bufferings)
    - Do not re-order the TS in the Sink by default (leads to buffering if one branch is really slow)
    - Add option for DigiEvent I/O between builders and Sink
    - Reduce HWM at input and output of Unpackers and Event builders to 1
    - Add bash and sbatch scripts for testing replay of a run on the mFLES cluster (single source node + single processing node)
    - Add information about online MQ problems in Gold run and these fixes
parent 89523023
1 merge request: !882 Changes from mCBM 2022 prod to mCBM MQ devices and execution
......@@ -326,4 +326,38 @@ check loop behind identical to the latter.
disk
1. Something fishy is happening with the ZMQ buffering: even without re-ordering and missing TS insertion, the memory
   usage of the sink increases up to `180 GB`, which is far more than expected with the HWM of 2 messages at input
   - Now understood: this was caused by wrong usage of the FairMQ channel option `rcvBufSize` in the fan-in case, where it
     sets the limit for each link and not for all links together as originally expected.
   - After setting it to 1, the "unassigned memory size" is limited to approx. `NBranch * max TS size` as expected
   - Memory usage of the Unpackers and Event builders and the number of TS in flight are also reduced by setting all HWM to 1
1. The plots generated by the sink for the buffer monitoring and processed TS/Event counting have messed up scales
   - Fixed after review of the way the plots are filled
1. Processing of TS in the Sink is far slower than expected: around 2 TS/10 s instead of the expected 20 TS/10 s
   - Partially linked to writing to a single disk
   - Partially linked to extraction of selected data in the Sink itself
   - Improved by adding an option to make DigiEvent in the Event builder and transmit these, giving ~4-6 TS/10 s in the Sink
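
The per-link behaviour of `rcvBufSize` described in the first item can be illustrated with a minimal sketch of the worst-case number of messages queued at the Sink input. It assumes one link per branch and ignores sender-side and kernel buffers; the function name `max_buffered_msgs` and the branch count are hypothetical and not part of the scripts.

```shell
#!/bin/bash
# Worst-case number of messages queued at a fan-in input when the
# receive HWM (rcvBufSize) applies to each incoming link separately.
# Arguments: <number of links (branches)> <rcvBufSize per link>
max_buffered_msgs() {
  local nb_links=$1
  local hwm=$2
  echo $(( nb_links * hwm ))
}

_nbbranch=10  # hypothetical number of branches, for illustration only

# Previous setting rcvBufSize=$_nbbranch: the bound grows with the square of the branch count
echo "rcvBufSize=\$_nbbranch: up to $(max_buffered_msgs "$_nbbranch" "$_nbbranch") messages"
# New setting rcvBufSize=1: the bound is one message per branch
echo "rcvBufSize=1: up to $(max_buffered_msgs "$_nbbranch" 1) messages"
```

Multiplying the second bound by the maximum TS size gives the `NBranch * max TS size` limit quoted above.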
# Replay testing scripts
These scripts allow replaying a run from one of the archiver nodes with a single processing node.
They consist of:
- create_log_folder_dev.sbatch
- mq_processing_node_dev.sbatch
- replay.sbatch
- start_topology_dev.sh
This version first starts a full topology with `<Nb branches>` branches on `en13`, writing a single set of files to
`/storage/${_Disk}/mcbm2022/data/<Run Id>_<Trigger Set>_end13.digi_events[_FileIdx].root`,
then starts a replay of all files for a given `<Run Id>` from `node8` and connects it to the MQ topology using the
Infiniband network.
The replay is done at a rate of around `2 TS/s`, which is slightly more than what one would expect for a single
processing node in a `2 TS builder + 4 processing nodes` configuration.
It expects 4 parameters in the following order:
- the `<Run Id>`, as reported by flesctl
- the `<Number of branches to be started per node>`, leading to a total parallel capability of `4 x n` timeslices
- the `<Trigger set>` in the range `[0-14]`, with `[0-6]` corresponding to the trigger settings tested by N. Herrmann
and `[7-14]` those used for development by P.-A. Loizeau
- the `<Disk index>` in the range `[0-8]`, with currently only indices `6` and `7` being valid for `en13` (HDDs were
moved around)
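
As a sketch of how the four parameters and their ranges fit together, the following helper checks a candidate argument list before anything is submitted. The function name `check_replay_args` is hypothetical, the run ID in the example call is a made-up value, and the sketch assumes that all four parameters are non-negative integers; it follows the ranges listed above rather than the checks done in the scripts themselves.

```shell
#!/bin/bash
# Hypothetical helper: validates <Run Id> <Nb branches> <Trigger set> <Disk index>
# against the ranges documented above. Returns 0 if they look valid, 1 otherwise.
check_replay_args() {
  if [ $# -ne 4 ]; then
    echo 'Expected: <Run Id> <Nb branches> <Trigger set> <Disk index>' >&2
    return 1
  fi

  # Assumption: all four parameters are non-negative integers
  local value
  for value in "$@"; do
    case $value in
      ''|*[!0-9]*)
        echo "Not a non-negative integer: '$value'" >&2
        return 1
        ;;
    esac
  done

  local nb_branch=$2 trigg_set=$3 disk=$4

  # At least one branch is needed
  if [ "$nb_branch" -lt 1 ]; then
    echo 'At least one branch is needed' >&2
    return 1
  fi
  # Trigger sets are documented in the range [0-14]
  if [ "$trigg_set" -gt 14 ]; then
    echo 'Trigger set must be in [0-14]' >&2
    return 1
  fi
  # Only disk indices 6 and 7 are currently documented as valid on en13
  if [ "$disk" -ne 6 ] && [ "$disk" -ne 7 ]; then
    echo 'Disk index must currently be 6 or 7 on en13' >&2
    return 1
  fi
  return 0
}

# Example call; the run ID 2391 is a made-up value
check_replay_args 2391 10 5 6 && echo 'Parameters look valid'
```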
#!/bin/bash
mkdir -p /storage/6/mcbm2022/online_logs/$1
......@@ -458,7 +458,7 @@ EVTSINK+=" --id evtsink1"
EVTSINK+=" --severity info"
# EVTSINK+=" --severity debug"
EVTSINK+=" --StoreFullTs 0"
-# EVTSINK+=" --BypassConsecutiveTs true"
+EVTSINK+=" --BypassConsecutiveTs true"
EVTSINK+=" --WriteMissingTs false"
EVTSINK+=" --DisableCompression true"
EVTSINK+=" --TreeFileMaxSize 4000000000"
......@@ -472,7 +472,7 @@ EVTSINK+=" --PubFreqTs $_pubfreqts"
EVTSINK+=" --PubTimeMin $_pubminsec"
EVTSINK+=" --PubTimeMax $_pubmaxsec"
EVTSINK+=" --EvtNameIn events"
-EVTSINK+=" --channel-config name=events,type=pull,method=bind,transport=zeromq,rcvBufSize=$_nbbranch,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
+EVTSINK+=" --channel-config name=events,type=pull,method=bind,transport=zeromq,rcvBufSize=1,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=missedts,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11006,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=commands,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11007,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=histogram-in,type=pub,method=connect,transport=zeromq,address=tcp://${_histServHost}:11666,rateLogging=$_ratelog"
......@@ -520,7 +520,7 @@ while (( _iBranch < _nbbranch )); do
# fi
UNPACKER+=" --TsNameOut unpts$_iBranch"
UNPACKER+=" --channel-config name=ts-request,type=req,method=connect,transport=zeromq,address=tcp://127.0.0.1:11555,rateLogging=$_ratelog"
-UNPACKER+=" --channel-config name=unpts$_iBranch,type=push,method=bind,transport=zeromq,sndBufSize=2,address=tcp://127.0.0.1:$_iPort,rateLogging=$_ratelog"
+UNPACKER+=" --channel-config name=unpts$_iBranch,type=push,method=bind,transport=zeromq,sndBufSize=1,address=tcp://127.0.0.1:$_iPort,rateLogging=$_ratelog"
# UNPACKER+=" --channel-config name=commands,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11007"
UNPACKER+=" --channel-config name=parameters,type=req,method=connect,transport=zeromq,address=tcp://${_parServHost}:11005,rateLogging=0"
UNPACKER+=" --channel-config name=histogram-in,type=pub,method=connect,transport=zeromq,address=tcp://${_histServHost}:11666,rateLogging=$_ratelog"
......@@ -579,8 +579,8 @@ while (( _iBranch < _nbbranch )); do
EVTBUILDER+=" --TsNameIn unpts$_iBranch"
EVTBUILDER+=" --EvtNameOut events"
# EVTBUILDER+=" --DoNotSend true"
-EVTBUILDER+=" --channel-config name=unpts$_iBranch,type=pull,method=connect,transport=zeromq,rcvBufSize=2,address=tcp://127.0.0.1:$_iPort,rateLogging=$_ratelog"
-EVTBUILDER+=" --channel-config name=events,type=push,method=connect,transport=zeromq,sndBufSize=2,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
+EVTBUILDER+=" --channel-config name=unpts$_iBranch,type=pull,method=connect,transport=zeromq,rcvBufSize=1,address=tcp://127.0.0.1:$_iPort,rateLogging=$_ratelog"
+EVTBUILDER+=" --channel-config name=events,type=push,method=connect,transport=zeromq,sndBufSize=1,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
# EVTBUILDER+=" --channel-config name=commands,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11007"
EVTBUILDER+=" --channel-config name=parameters,type=req,method=connect,transport=zeromq,address=tcp://${_parServHost}:11005,rateLogging=0"
EVTBUILDER+=" --channel-config name=histogram-in,type=pub,method=connect,transport=zeromq,address=tcp://${_histServHost}:11666,rateLogging=$_ratelog"
......
......@@ -452,7 +452,7 @@ case $SLURM_ARRAY_TASK_ID in
EVTSINK+=" --PubTimeMin $_pubminsec"
EVTSINK+=" --PubTimeMax $_pubmaxsec"
EVTSINK+=" --EvtNameIn events"
-EVTSINK+=" --channel-config name=events,type=pull,method=bind,transport=zeromq,rcvBufSize=$_nbbranch,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
+EVTSINK+=" --channel-config name=events,type=pull,method=bind,transport=zeromq,rcvBufSize=1,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=missedts,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11006,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=commands,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11007,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=histogram-in,type=pub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11666,rateLogging=$_ratelog"
......@@ -498,7 +498,7 @@ case $SLURM_ARRAY_TASK_ID in
UNPACKER+=" --TsNameOut unpts$_iBranch"
UNPACKER+=" --channel-config name=ts-request,type=req,method=connect,transport=zeromq,address=tcp://127.0.0.1:11555,rateLogging=$_ratelog"
UNPACKER+=" --channel-config name=parameters,type=req,method=connect,transport=zeromq,address=tcp://127.0.0.1:11005,rateLogging=0"
-UNPACKER+=" --channel-config name=unpts$_iBranch,type=push,method=bind,transport=zeromq,sndBufSize=2,address=tcp://127.0.0.1:$_iPort,rateLogging=$_ratelog"
+UNPACKER+=" --channel-config name=unpts$_iBranch,type=push,method=bind,transport=zeromq,sndBufSize=1,address=tcp://127.0.0.1:$_iPort,rateLogging=$_ratelog"
# UNPACKER+=" --channel-config name=commands,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11007"
UNPACKER+=" --channel-config name=histogram-in,type=pub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11666,rateLogging=$_ratelog"
UNPACKER+=" --transport zeromq"
......@@ -554,8 +554,8 @@ case $SLURM_ARRAY_TASK_ID in
EVTBUILDER+=" --TsNameIn unpts$_iBranch"
EVTBUILDER+=" --EvtNameOut events"
EVTBUILDER+=" --DoNotSend true"
-EVTBUILDER+=" --channel-config name=unpts$_iBranch,type=pull,method=connect,transport=zeromq,rcvBufSize=2,address=tcp://127.0.0.1:$_iPort,rateLogging=$_ratelog"
-EVTBUILDER+=" --channel-config name=events,type=push,method=connect,transport=zeromq,sndBufSize=2,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
+EVTBUILDER+=" --channel-config name=unpts$_iBranch,type=pull,method=connect,transport=zeromq,rcvBufSize=1,address=tcp://127.0.0.1:$_iPort,rateLogging=$_ratelog"
+EVTBUILDER+=" --channel-config name=events,type=push,method=connect,transport=zeromq,sndBufSize=1,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
# EVTBUILDER+=" --channel-config name=commands,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11007"
EVTBUILDER+=" --channel-config name=parameters,type=req,method=connect,transport=zeromq,address=tcp://127.0.0.1:11005,rateLogging=0"
EVTBUILDER+=" --channel-config name=histogram-in,type=pub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11666,rateLogging=$_ratelog"
......
This diff is collapsed.
......@@ -86,7 +86,7 @@ EVTSINK+=" --PubFreqTs $_pubfreqts"
EVTSINK+=" --PubTimeMin $_pubminsec"
EVTSINK+=" --PubTimeMax $_pubmaxsec"
EVTSINK+=" --EvtNameIn events"
-EVTSINK+=" --channel-config name=events,type=pull,method=bind,transport=zeromq,rcvBufSize=$_nbbranch,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
+EVTSINK+=" --channel-config name=events,type=pull,method=bind,transport=zeromq,rcvBufSize=1,address=tcp://127.0.0.1:11556,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=missedts,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11006,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=commands,type=sub,method=connect,transport=zeromq,address=tcp://127.0.0.1:11007,rateLogging=$_ratelog"
EVTSINK+=" --channel-config name=histogram-in,type=pub,method=connect,transport=zeromq,address=tcp://${_histServHost}:11666,rateLogging=$_ratelog"
......
#!/bin/bash
#SBATCH -J Replay
# Copyright (C) 2022 Facility for Antiproton and Ion Research in Europe, Darmstadt
# SPDX-License-Identifier: GPL-3.0-only
# author: Pierre-Alain Loizeau [committer]
if [ $# -ge 2 ]; then
_run_id=$1
_port=$2
else
echo 'Missing parameters. Only following pattern allowed:'
echo 'replay.sbatch <Run Id> <port>'
# 'return' is only valid in a function or sourced script; a batch script has to exit
exit 1
fi
Filename1=/storage/1/data/${_run_id}_*_1_*.tsa
Filename2=/storage/2/data/${_run_id}_*_2_*.tsa
Filename3=/storage/3/data/${_run_id}_*_3_*.tsa
Filename4=/storage/4/data/${_run_id}_*_4_*.tsa
Filename5=/storage/5/data/${_run_id}_*_5_*.tsa
Filename6=/storage/7/data/${_run_id}_*_6_*.tsa
Filename7=/storage/8/data/${_run_id}_*_7_*.tsa
Filename8=/storage/6/data/${_run_id}_*_8_*.tsa
Filename9=/storage/6/data/${_run_id}_*_9_*.tsa
HOSTNAME=`hostname`
HOST=${HOSTNAME:4:2}
# Force cast to base 10 to avoid wrong octal base assumption
HOSTCLEAN=$((10#${HOST}))
## => Hostname and port for replay toward virgo
#hostnameIB="cbmfles"$HOSTNAME
#ipaddrIB=`dig $hostnameIB.gsi.de +short`
#Port=$((5550 + $HOSTCLEAN))
## => Hostname and port for replay within mFLES (could also be set to "*")
ipaddrIB=*
LogFile=/home/loizeau/rep_${HOSTNAME}_${_run_id}.log
echo "${ipaddrIB}" "${_port}" "${Filename1};${Filename2};${Filename3};${Filename4};${Filename5};${Filename6};${Filename7};${Filename8};${Filename9}"
tsclient -i file:"${Filename1};${Filename2};${Filename3};${Filename4};${Filename5};${Filename6};${Filename7};${Filename8};${Filename9}"? -P "tcp://${ipaddrIB}:${_port}" --publish-hwm 100 --rate-limit 2 &> $LogFile # ~real readout
#!/bin/bash
# Copyright (C) 2022 Facility for Antiproton and Ion Research in Europe, Darmstadt
# SPDX-License-Identifier: GPL-3.0-only
# author: Pierre-Alain Loizeau [committer]
if [ $# -eq 4 ]; then
_run_id=$1
_nbbranch=$2
_TriggSet=$3
_Disk=$4
if [ ${_nbbranch} -eq 0 ]; then
echo 'Nb branches cannot be 0! At least one branch is needed!'
return -1
fi
if [ ${_Disk} -lt 0 ] || [ ${_Disk} -gt 7 ]; then
echo 'Disk index on the en13 nodes can only be in [0-7]!'
return -1
fi
else
echo 'Missing parameters. Only following pattern allowed:'
echo 'start_topology_dev.sh <Run Id> <Nb // branches> <Trigger set> <Storage disk index>'
return -1
fi
((_nbjobs = 4 + $_nbbranch*2 ))
#_log_folder="/local/mcbm2022/online_logs/${_run_id}"
_log_folder="/storage/6/mcbm2022/online_logs/${_run_id}"
_log_config="-D ${_log_folder} -o ${_run_id}_%A_%a.out.log -e ${_run_id}_%A_%a.err.log"
# Create the log folders
sbatch -w en13 create_log_folder_dev.sbatch ${_run_id}
sleep 2
# Online ports
#sbatch -w en13 ${_log_config} mq_processing_node.sbatch ${_run_id} ${_nbbranch} ${_TriggSet} ${_Disk} node8ib2:5560
# Replay ports
sbatch -w en13 ${_log_config} mq_processing_node_dev.sbatch ${_run_id} ${_nbbranch} ${_TriggSet} ${_Disk} node8ib2:5557
sleep 10
# Replay job
sbatch -w node8 replay.sbatch ${_run_id} 5557