Troubleshooting
Overlay network
The tree-based overlay network interconnects brokers of the system instance. The current status of the overlay subtree at any rank can be shown with:
$ flux overlay status -r RANK
The possible status values are:
- Full
Node is online and no children are in partial, offline, degraded, or lost state.
- Partial
Node is online, and some children are in partial or offline state; no children are in degraded or lost state.
- Degraded
Node is online, and some children are in degraded or lost state.
- Lost
Node has gone missing, from the parent perspective.
- Offline
Node has not yet joined the instance, or has been cleanly shut down.
Note that the RANK argument is where the request will be sent, not necessarily the rank whose status is of interest. Parents track the status of their children, so a good approach when something is wrong to start with rank 0 (the default). The following options can be used to ask rank 0 for a detailed listing:
$ flux overlay status
0 fluke62: degraded
├─ 1 fluke63: full
│ ├─ 3 fluke65: full
│ │ ├─ 7 fluke70: full
│ │ └─ 8 fluke71: full
│ └─ 4 fluke67: full
│ ├─ 9 fluke72: full
│ └─ 10 fluke73: full
└─ 2 fluke64: degraded
├─ 5 fluke68: full
│ ├─ 11 fluke74: full
│ └─ 12 fluke75: full
└─ 6 fluke69: degraded
├─ 13 fluke76: full
└─ 14 fluke77: lost
To determine if a broker is reachable from the current rank, use:
$ flux ping RANK
A broker that is not responding but is not shown as lost or offline
by flux overlay status may be forcibly detached from the overlay
network with:
$ flux overlay disconnect RANK
However, before doing that, it may be useful to see if a broker acting as a router to that node is actually the problem. The overlay parent of RANK may be listed with
$ flux overlay parentof RANK
Using flux ping and flux overlay parentof iteratively, one should
be able to isolate the problem rank.
See also flux-overlay(1), flux-ping(1).
Systemd journal
Flux brokers log information to standard error, which is normally captured by the systemd journal. It may be useful to look at this log when diagnosing a problem on a particular node:
$ journalctl -u flux
Sep 14 09:53:12 sun1 systemd[1]: Starting Flux message broker...
Sep 14 09:53:12 sun1 systemd[1]: Started Flux message broker.
Sep 14 09:53:12 sun1 flux[23182]: broker.info[2]: start: none->join 0.0162958s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: parent-ready: join->init 41.8603s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: rc1.0: running /etc/flux/rc1.d/01-enclosing-instance
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: rc1.0: /bin/sh -c /etc/flux/rc1 Exited (rc=0) 0.4s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: rc1-success: init->quorum 0.414207s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: quorum-full: quorum->run 9.3847e-05s
Broker log buffer
The rank 0 broker accumulates log information for the full instance in a circular buffer. For some problems, it may be useful to view this log:
$ sudo flux dmesg -H |tail
[May02 14:51] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[ +0.039371] sched-fluxion-qmanager[0]: alloc success (queue=debug id=184120855100391424)
[ +0.816587] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[ +0.857458] sched-fluxion-qmanager[0]: alloc success (queue=debug id=184120868807376896)
[ +1.364430] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[ +6.361275] job-ingest[0]: job-frobnicator[0]: inactivity timeout
[ +6.367837] job-ingest[0]: job-validator[0]: inactivity timeout
[ +24.778929] job-exec[0]: exec aborted: id=184120855100391424
[ +24.779019] job-exec[0]: exec_kill: 184120855100391424: signal 15
[ +24.779557] job-exec[0]: exec aborted: id=184120868807376896
[ +24.779632] job-exec[0]: exec_kill: 184120868807376896: signal 15
[ +24.779910] sched-fluxion-qmanager[0]: alloc canceled (id=184120878001291264 queue=debug)
[ +25.155578] job-list[0]: purged 1 inactive jobs
[ +25.162650] job-manager[0]: purged 1 inactive jobs
[ +25.512050] sched-fluxion-qmanager[0]: free succeeded (queue=debug id=184120855100391424)
[ +25.647542] sched-fluxion-qmanager[0]: free succeeded (queue=debug id=184120868807376896)
[ +27.155103] job-list[0]: purged 2 inactive jobs
[ +27.159820] job-manager[0]: purged 2 inactive jobs