DUNE-DAQ Error States With Targeted Commands: A Deep Dive
Hey guys! Today we're diving into a persistent, pesky issue in DUNE-DAQ: targeted (addressed) commands and their tendency to land us in error states. Let's break down what's happening, why it might be happening, and how we can fix it. Think of this as our roadmap to smoother operations!
The Problem: Error States with Targeted Commands
So, what's the deal? We've been seeing some undesirable behavior when using targeted commands in DUNE-DAQ. To illustrate, here's a real-world example from a recent nightly build. Imagine running a series of commands like this:
drunc-unified-shell > boot
pawel status
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller     │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.36:36719 │
│ df-controller       │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.36:43161 │
│ df-01               │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.36:49339 │
│ dfo-01              │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.36:38799 │
│ tp-stream-writer    │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.36:51079 │
│ hsi-fake-controller │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.36:40285 │
│ hsi-fake-01         │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.36:51791 │
│ hsi-fake-to-tc-app  │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.36:40851 │
│ ru-controller       │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.36:38461 │
│ ru-01               │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.36:49273 │
│ trg-controller      │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.36:38725 │
│ mlt                 │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.36:40571 │
│ tc-maker-1          │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.36:53261 │
└─────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴───────────────────────────┘
Waiting on tree initialisation... ━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6% 0:01:06
[2025/10/14 10:27:28] INFO commands.py:72 unified_shell.boot: Booted successfully
drunc-unified-shell > conf --target trg-controller
[2025/10/14 10:30:35] INFO shell_utils.py:436 controller.shell_utils: Running transition 'conf' on controller 'root-controller', targeting: 'trg-controller'
pawel status
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                ┃ Info ┃ State      ┃ Substate   ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller     │      │ initial    │ initial    │ No       │ Yes      │ grpc://10.73.136.36:36719 │
│ df-controller       │      │ initial    │ initial    │ No       │ Yes      │ grpc://10.73.136.36:43161 │
│ df-01               │      │ initial    │ idle       │ No       │ Yes      │ rest://10.73.136.36:49339 │
│ dfo-01              │      │ initial    │ idle       │ No       │ Yes      │ rest://10.73.136.36:38799 │
│ tp-stream-writer    │      │ initial    │ idle       │ No       │ Yes      │ rest://10.73.136.36:51079 │
│ hsi-fake-controller │      │ initial    │ initial    │ No       │ Yes      │ grpc://10.73.136.36:40285 │
│ hsi-fake-01         │      │ initial    │ idle       │ No       │ Yes      │ rest://10.73.136.36:51791 │
│ hsi-fake-to-tc-app  │      │ initial    │ idle       │ No       │ Yes      │ rest://10.73.136.36:40851 │
│ ru-controller       │      │ initial    │ initial    │ No       │ Yes      │ grpc://10.73.136.36:38461 │
│ ru-01               │      │ initial    │ idle       │ No       │ Yes      │ rest://10.73.136.36:49273 │
│ trg-controller      │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38725 │
│ mlt                 │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40571 │
│ tc-maker-1          │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:53261 │
└─────────────────────┴──────┴────────────┴────────────┴──────────┴──────────┴───────────────────────────┘
Waiting for conf to complete... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
AttributeError: 'NoneType' object has no attribute 'flag'
drunc-unified-shell > conf
[2025/10/14 10:30:44] INFO shell_utils.py:436 controller.shell_utils: Running transition 'conf' on controller 'root-controller', targeting: 'root-controller'
pawel status
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                ┃ Info ┃ State      ┃ Substate   ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller     │      │ configured │ configured │ Yes      │ Yes      │ grpc://10.73.136.36:36719 │
│ df-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:43161 │
│ df-01               │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:49339 │
│ dfo-01              │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:38799 │
│ tp-stream-writer    │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:51079 │
│ hsi-fake-controller │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:40285 │
│ hsi-fake-01         │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:51791 │
│ hsi-fake-to-tc-app  │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40851 │
│ ru-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38461 │
│ ru-01               │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:49273 │
│ trg-controller      │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38725 │
│ mlt                 │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40571 │
│ tc-maker-1          │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:53261 │
└─────────────────────┴──────┴────────────┴────────────┴──────────┴──────────┴───────────────────────────┘
Waiting for conf to complete... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
conf execution report
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                ┃ Command execution     ┃ FSM transition            ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller     │ Executed Successfully │ Fsm Invalid Transition    │
│ trg-controller      │ Executed Successfully │ Fsm Invalid Transition    │
│ hsi-fake-controller │ Executed Successfully │ Fsm Executed Successfully │
│ hsi-fake-01         │ Executed Successfully │ Fsm Executed Successfully │
│ hsi-fake-to-tc-app  │ Executed Successfully │ Fsm Executed Successfully │
│ ru-controller       │ Executed Successfully │ Fsm Executed Successfully │
│ ru-01               │ Executed Successfully │ Fsm Executed Successfully │
│ df-controller       │ Executed Successfully │ Fsm Executed Successfully │
│ dfo-01              │ Executed Successfully │ Fsm Executed Successfully │
│ df-01               │ Executed Successfully │ Fsm Executed Successfully │
│ tp-stream-writer    │ Executed Successfully │ Fsm Executed Successfully │
└─────────────────────┴───────────────────────┴───────────────────────────┘
pawel status
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                ┃ Info ┃ State      ┃ Substate   ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller     │      │ configured │ configured │ Yes      │ Yes      │ grpc://10.73.136.36:36719 │
│ df-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:43161 │
│ df-01               │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:49339 │
│ dfo-01              │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:38799 │
│ tp-stream-writer    │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:51079 │
│ hsi-fake-controller │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:40285 │
│ hsi-fake-01         │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:51791 │
│ hsi-fake-to-tc-app  │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40851 │
│ ru-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38461 │
│ ru-01               │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:49273 │
│ trg-controller      │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38725 │
│ mlt                 │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40571 │
│ tc-maker-1          │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:53261 │
└─────────────────────┴──────┴────────────┴────────────┴──────────┴──────────┴───────────────────────────┘
[2025/10/14 10:30:44] ERROR shell_utils.py:157 utils.ShellContext: FSM is in error (state: "configured"
sub_state: "configured"
in_error: true
included: true
), not currently accepting new commands.
drunc-unified-shell > status
pawel status
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                ┃ Info ┃ State      ┃ Substate   ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller     │      │ configured │ configured │ Yes      │ Yes      │ grpc://10.73.136.36:36719 │
│ df-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:43161 │
│ df-01               │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:49339 │
│ dfo-01              │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:38799 │
│ tp-stream-writer    │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:51079 │
│ hsi-fake-controller │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:40285 │
│ hsi-fake-01         │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:51791 │
│ hsi-fake-to-tc-app  │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40851 │
│ ru-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38461 │
│ ru-01               │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:49273 │
│ trg-controller      │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38725 │
│ mlt                 │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40571 │
│ tc-maker-1          │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:53261 │
└─────────────────────┴──────┴────────────┴────────────┴──────────┴──────────┴───────────────────────────┘
[2025/10/14 10:30:58] ERROR shell_utils.py:157 utils.ShellContext: FSM is in error (state: "configured"
sub_state: "configured"
in_error: true
included: true
), not currently accepting new commands.
drunc-unified-shell > recompute-status
pawel status
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                ┃ Info ┃ State      ┃ Substate   ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller     │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:36719 │
│ df-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:43161 │
│ df-01               │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:49339 │
│ dfo-01              │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:38799 │
│ tp-stream-writer    │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:51079 │
│ hsi-fake-controller │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:40285 │
│ hsi-fake-01         │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:51791 │
│ hsi-fake-to-tc-app  │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40851 │
│ ru-controller       │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38461 │
│ ru-01               │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:49273 │
│ trg-controller      │      │ configured │ configured │ No       │ Yes      │ grpc://10.73.136.36:38725 │
│ mlt                 │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:40571 │
│ tc-maker-1          │      │ configured │ idle       │ No       │ Yes      │ rest://10.73.136.36:53261 │
└─────────────────────┴──────┴────────────┴────────────┴──────────┴──────────┴───────────────────────────┘
[2025/10/14 10:31:02] INFO shell_utils.py:168 utils.ShellContext: Current FSM status is configured. Available transitions are start, scrap. Available sequence commands are shutdown, start-run.
See how things get a little wonky? The targeted conf does configure trg-controller and its applications, but the shell prints AttributeError: 'NoneType' object has no attribute 'flag' while waiting for the command to complete. The follow-up global conf then leaves root-controller flagged as in error, and the shell refuses new commands until recompute-status is run. Definitely not what we want to see.
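To make that AttributeError less mysterious, here is a minimal Python sketch of how a message of exactly this shape tends to arise. The names `Response`, `find_response`, `flag_unguarded`, and `flag_guarded` are hypothetical and are not taken from the drunc code base; the only point is that a lookup returning `None`, followed by an unguarded `.flag` access, produces this error, while an explicit check turns it into a reportable condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """Hypothetical stand-in for a reply that carries a status flag."""
    name: str
    flag: str


def find_response(responses: list[Response], name: str) -> Optional[Response]:
    """Return the reply for `name`, or None if that node never replied."""
    for response in responses:
        if response.name == name:
            return response
    return None


def flag_unguarded(responses: list[Response], name: str) -> str:
    # Raises AttributeError when find_response() returns None.
    return find_response(responses, name).flag


def flag_guarded(responses: list[Response], name: str) -> str:
    # Treat a missing reply as an explicit condition instead of crashing.
    response = find_response(responses, name)
    if response is None:
        return "NO_RESPONSE"
    return response.flag


# Only the targeted controller replied in this toy scenario.
replies = [Response("trg-controller", "EXECUTED_SUCCESSFULLY")]
```

Calling `flag_unguarded(replies, "root-controller")` raises the same kind of AttributeError seen in the log, whereas `flag_guarded` returns a sentinel value the caller can act on. Whether the real bug follows this pattern is something only the traceback can confirm.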
Why is this happening?
That's the million-dollar question, isn't it? To really nail this down, we need to dig into the underlying mechanisms that govern how addressed commands interact with the system's state management. It's like being a detective, piecing together clues to solve the mystery.
Here's what we need to figure out:
- Why do we enter this error state in the first place? What's the specific trigger? Is it a timing issue, a misconfiguration, or something else entirely?
- What should the system do when it enters an error state? Should it halt? Should it attempt to recover? What's the ideal behavior?
- How can we recover from this error state? Are there specific commands or procedures we can use to get back on track?
Understanding these points is crucial for building a more robust and reliable system. It's not just about fixing the symptom; it's about understanding the root cause.
Breaking Down the Error
Let's dissect this a bit further. When we run conf --target trg-controller, we're essentially telling the system to configure the trg-controller and its associated components, moving them from the initial state to the configured state. The status table shows that this transition does happen, yet the shell raises the AttributeError while waiting for conf to complete. That suggests the problem lies in how the result of a targeted command is tracked or reported, rather than in the transition itself.
When we subsequently run conf without a target, we're telling the system to configure everything. This is where things get even more interesting. The execution report shows Fsm Invalid Transition for root-controller and trg-controller, and root-controller ends up in an error state, even though every other controller and application listed in the report configured successfully. This suggests a cascading effect, where a rejected transition in one part of the tree can impact other parts.
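The cascade can be illustrated with a deliberately simplified toy model in Python. Everything here is an assumption made for illustration: the `Node` class, the rule that a node already in the configured state rejects conf without forwarding it, and the rule that a parent is flagged when any direct child rejects the command. It is not how drunc is implemented and it does not reproduce every detail of the execution report; it only shows how a partially configured tree can leave the parent configured and in error at the same time.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy controller node with a state, an error flag, and children."""
    name: str
    state: str = "initial"
    in_error: bool = False
    children: list["Node"] = field(default_factory=list)


def conf(node: Node) -> bool:
    """Toy 'conf': return True if this node accepted the transition."""
    if node.state != "initial":
        # Already configured: reject, and (in this toy model) do not
        # forward the command to the node's children.
        return False
    results = [conf(child) for child in node.children]
    node.state = "configured"
    # Assumption: a parent is flagged when any direct child rejects conf.
    node.in_error = not all(results)
    return True


trg = Node("trg-controller", children=[Node("mlt"), Node("tc-maker-1")])
df = Node("df-controller", children=[Node("df-01")])
root = Node("root-controller", children=[df, trg])

conf(trg)   # targeted conf: only the trg subtree is configured
conf(root)  # global conf: trg rejects it, so root is flagged
print(root.state, root.in_error, trg.state, trg.in_error)
```

In this model the root ends up configured with its error flag set, while trg-controller itself stays clean, which is the same combination the status table shows.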
The status table (headed pawel status in the output above) is our window into the soul of the system, showing the state and substate of each component. Looking at the output, we see that the root-controller is in the configured state but also flagged as In error: Yes. This is a contradictory state, and it highlights the inconsistency we're dealing with.
It's also important to understand the Finite State Machine (FSM) transitions within the system. An FSM is a mathematical model of computation that describes the behavior of a system with a finite number of states. In our case, each controller has an FSM that governs its transitions between states like initial, configured, running, etc. The execution report mentions "Fsm Invalid Transition," which tells us that a transition was requested that is not allowed from the current state (here, conf on a controller that was already configured), indicating a potential flaw in the FSM logic or the command execution sequence.
The Importance of Error Handling
Think of error handling like the emergency brakes in a car. You hope you never have to use them, but you're sure glad they're there when you need them. In a complex system like DUNE-DAQ, robust error handling is essential for maintaining stability and preventing catastrophic failures.
When an error occurs, the system should:
- Detect the error: This seems obvious, but it's not always straightforward. The system needs to be able to identify when something has gone wrong.
- Log the error: Detailed logs are invaluable for debugging. They provide a historical record of what happened, which can help us trace the root cause.
- Handle the error: This is where things get interesting. What should the system do in response to the error? Should it retry the operation? Should it revert to a previous state? Should it simply halt?
- Recover from the error: Ideally, the system should be able to recover from errors automatically. This might involve restarting a component, reconfiguring the system, or taking other corrective actions.
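The four steps above can be sketched in a few lines of Python. This is an illustration of the pattern only, under stated assumptions: `run_transition`, `broken_conf`, and `working_conf` are hypothetical names, a node is modelled as a plain dictionary, and "recovery" here simply means rolling back to the last known-good snapshot, which may or may not be the right policy for DUNE-DAQ.

```python
import logging
from typing import Callable

logger = logging.getLogger("transition_sketch")


def run_transition(node: dict, command: str, apply: Callable[[dict], None]) -> bool:
    """Apply a transition to `node`; on failure, log it and roll back."""
    snapshot = dict(node)  # last known-good state, kept for recovery
    try:
        apply(node)
    except Exception as exc:  # 1. detect
        # 2. log with enough context to trace the failure later
        logger.error("%s failed on %s: %r", command, node["name"], exc)
        # 3. handle and 4. recover: restore the snapshot
        node.clear()
        node.update(snapshot)
        return False
    return True


def broken_conf(node: dict) -> None:
    """Hypothetical transition that fails after a partial update."""
    node["state"] = "configured"
    raise AttributeError("'NoneType' object has no attribute 'flag'")


def working_conf(node: dict) -> None:
    """Hypothetical transition that succeeds."""
    node["state"] = "configured"


node = {"name": "trg-controller", "state": "initial", "in_error": False}
first_attempt = run_transition(node, "conf", broken_conf)
state_after_failure = node["state"]
second_attempt = run_transition(node, "conf", working_conf)
```

The design choice worth noting is that the failed attempt leaves the node exactly as it was, so the caller never sees a half-applied transition and can retry safely.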
Our current situation highlights a gap in our error handling strategy. We're detecting the error (the AttributeError and the FSM invalid transition), but we're not handling it gracefully. The system ends up in an inconsistent state, and the only way out shown in the session above is a manual recompute-status.
Testing Suggestions: Let's Break Things (So We Can Fix Them)
Okay, so we know we have a problem. But how do we go about fixing it? The first step is to reproduce the error consistently. This allows us to experiment with different solutions and verify that they actually work.
The testing suggestion provided in the initial description is spot on: we need to replicate the sequence of commands that led to the error. This involves running drunc-unified-shell, booting the system, targeting the trg-controller with the conf command, and then running a global conf. This sequence seems to reliably trigger the error state.
But we can't stop there! We need to expand our testing efforts to cover a wider range of scenarios. This might involve:
- Varying the target: Try targeting different controllers with the conf command. Does the error only occur with trg-controller, or does it happen with others as well?
- Introducing delays: Sometimes, timing issues can cause errors. Try adding delays between commands to see if that makes a difference.
- Simulating failures: What happens if a component fails during configuration? Can we simulate this to test our error handling?
- Using different configurations: Try different system configurations. Does the error occur in all configurations, or is it specific to certain setups?
The more we can break the system in a controlled environment, the better we'll understand its weaknesses and the more robust our fixes will be.
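One way to keep such a test campaign organised is to generate the command sequences from a small matrix. The Python sketch below only builds the sequences; how they are fed to drunc-unified-shell is left open. The function `build_scenario` and the pause values are hypothetical, while the controller names and the boot, conf, and status commands are taken from the session shown earlier.

```python
from itertools import product

# Controller names as they appear in the status table.
TARGETS = ["df-controller", "hsi-fake-controller", "ru-controller", "trg-controller"]
# Pauses (seconds) between the targeted and the global conf; arbitrary values.
PAUSES_S = [0.0, 5.0]


def build_scenario(target: str, pause_s: float) -> list[tuple[str, float]]:
    """Return (command, pause_after_seconds) pairs for one reproduction attempt."""
    return [
        ("boot", 0.0),
        (f"conf --target {target}", pause_s),
        ("conf", 0.0),
        ("status", 0.0),
    ]


# One scenario per (target, pause) combination.
SCENARIOS = {
    (target, pause_s): build_scenario(target, pause_s)
    for target, pause_s in product(TARGETS, PAUSES_S)
}

for (target, pause_s), steps in SCENARIOS.items():
    commands = " -> ".join(command for command, _ in steps)
    print(f"target={target} pause={pause_s}s: {commands}")
```

Recording the final status table for each scenario would then show directly whether the error depends on the target, on timing, or on neither.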
The Need for Comprehensive Testing
This brings us to a crucial point: the need for a comprehensive testing strategy. The original description rightly points out that we are