Troubleshooting

Overlay network

The tree-based overlay network interconnects brokers of the system instance. The current status of the overlay subtree at any rank can be shown with:

$ flux overlay status -r RANK

The possible status values are:

Full

Node is online and no children are in partial, offline, degraded, or lost state.

Partial

Node is online, and some children are in partial or offline state; no children are in degraded or lost state.

Degraded

Node is online, and some children are in degraded or lost state.

Lost

Node has gone missing, from its parent's perspective.

Offline

Node has not yet joined the instance, or has been cleanly shut down.
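The way the three online states follow from the states of a node's children can be written as a small rule. The function below is only an illustrative sketch: `subtree_status` is a hypothetical name, not part of Flux, and it assumes the node itself is online.

```shell
# Hypothetical helper (not a Flux command): derive an online node's subtree
# status from the statuses of its children, following the definitions above.
subtree_status() {
    result=full
    for child in "$@"; do
        case "$child" in
            degraded|lost)
                # Any degraded or lost child makes the subtree degraded.
                echo degraded
                return 0
                ;;
            partial|offline)
                # Partial, unless a later child turns out degraded or lost.
                result=partial
                ;;
        esac
    done
    echo "$result"
}

subtree_status full full        # prints "full"
subtree_status full offline     # prints "partial"
subtree_status partial lost     # prints "degraded"
```

Degraded takes precedence over partial, which is why a single lost leaf is reported as degraded at every ancestor up to rank 0.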

Note that the RANK argument is where the request will be sent, not necessarily the rank whose status is of interest. Parents track the status of their children, so a good approach when something is wrong is to start with rank 0 (the default). The following command asks rank 0 for a detailed listing:

$ flux overlay status
0 fluke62: degraded
├─ 1 fluke63: full
│  ├─ 3 fluke65: full
│  │  ├─ 7 fluke70: full
│  │  └─ 8 fluke71: full
│  └─ 4 fluke67: full
│     ├─ 9 fluke72: full
│     └─ 10 fluke73: full
└─ 2 fluke64: degraded
   ├─ 5 fluke68: full
   │  ├─ 11 fluke74: full
   │  └─ 12 fluke75: full
   └─ 6 fluke69: degraded
      ├─ 13 fluke76: full
      └─ 14 fluke77: lost
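On a large system it can help to reduce the listing to the brokers that are not full. The filter below is a sketch that assumes the output format shown above (tree-drawing prefix, then "RANK HOSTNAME: STATUS"); `list_unhealthy` is a hypothetical name, and other versions or options of flux overlay status may print additional fields.

```shell
# Hypothetical helper: read a "flux overlay status" listing on stdin and print
# "RANK HOSTNAME STATUS" for every broker whose status is not "full".
list_unhealthy() {
    # Drop the tree-drawing prefix (everything before the rank), then keep
    # rows whose third field is a status other than "full".
    sed 's/^[^0-9]*//' |
        awk '$3 != "" && $3 != "full" { sub(/:$/, "", $2); print $1, $2, $3 }'
}

# Demonstration using lines taken from the listing above.
list_unhealthy <<'EOF'
0 fluke62: degraded
├─ 1 fluke63: full
└─ 2 fluke64: degraded
   └─ 6 fluke69: degraded
      ├─ 13 fluke76: full
      └─ 14 fluke77: lost
EOF
# prints:
# 0 fluke62 degraded
# 2 fluke64 degraded
# 6 fluke69 degraded
# 14 fluke77 lost
```

On a live system the same function would be used as `flux overlay status | list_unhealthy`.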

To determine if a broker is reachable from the current rank, use:

$ flux ping RANK

A broker that is not responding but is not shown as lost or offline by flux overlay status may be forcibly detached from the overlay network with:

$ flux overlay disconnect RANK

However, before doing that, it may be useful to check whether a broker acting as a router to that node is actually the problem. The overlay parent of RANK can be listed with:

$ flux overlay parentof RANK

Using flux ping and flux overlay parentof iteratively, one should be able to isolate the problem rank.
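The iteration described above can be sketched as a loop that climbs from the suspect rank toward rank 0 and remembers the rank closest to the root that does not answer. This is an illustration under assumptions, not a supported tool: `find_unresponsive` is a hypothetical name, the loop assumes flux overlay parentof prints just the parent's rank on standard output, and it relies on timeout(1) from coreutils with an arbitrary 5 second limit because a ping to a hung broker may never return.

```shell
# Hypothetical helper: starting at RANK, ping each broker on the path to
# rank 0 and print the unresponsive rank closest to the root. Returns 1 and
# prints nothing if every broker on the path answered.
find_unresponsive() {
    rank=$1
    culprit=
    while :; do
        # Send a single ping; timeout(1) bounds the wait (5s is arbitrary).
        if ! timeout 5 flux ping --count=1 "$rank" >/dev/null 2>&1; then
            culprit=$rank
        fi
        [ "$rank" -eq 0 ] && break
        # Assumes parentof prints only the parent's rank; stop on any error.
        rank=$(flux overlay parentof "$rank") || break
    done
    [ -n "$culprit" ] && echo "$culprit"
}

# Example (requires a running Flux instance):
#   find_unresponsive 14
```

If a router is the real problem, every rank beneath it also fails to answer, so the rank printed is the one worth investigating before disconnecting anything.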

See also flux-overlay(1), flux-ping(1).

Systemd journal

Flux brokers log information to standard error, which is normally captured by the systemd journal. It may be useful to look at this log when diagnosing a problem on a particular node:

$ journalctl -u flux
Sep 14 09:53:12 sun1 systemd[1]: Starting Flux message broker...
Sep 14 09:53:12 sun1 systemd[1]: Started Flux message broker.
Sep 14 09:53:12 sun1 flux[23182]: broker.info[2]: start: none->join 0.0162958s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: parent-ready: join->init 41.8603s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: rc1.0: running /etc/flux/rc1.d/01-enclosing-instance
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: rc1.0: /bin/sh -c /etc/flux/rc1 Exited (rc=0) 0.4s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: rc1-success: init->quorum 0.414207s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: quorum-full: quorum->run 9.3847e-05s
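The state transition lines in this log show how long the broker spent in each state, which can reveal, for example, a long wait for the parent during join. The filter below is a sketch that assumes the message format shown above; `broker_transitions` is a hypothetical name.

```shell
# Hypothetical helper: read journal output on stdin and print one line per
# broker state transition as "EVENT FROM TO SECONDS".
broker_transitions() {
    # Match "broker.info[N]: EVENT: FROM->TO TIMEs"; other lines are dropped.
    sed -n 's/^.*broker\.info\[[0-9]*\]: \([a-z0-9-]*\): \([a-z0-9]*\)->\([a-z0-9]*\) \([0-9.e+-]*\)s$/\1 \2 \3 \4/p'
}

# Demonstration using lines taken from the log above.
broker_transitions <<'EOF'
Sep 14 09:53:12 sun1 flux[23182]: broker.info[2]: start: none->join 0.0162958s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: parent-ready: join->init 41.8603s
Sep 14 09:53:54 sun1 flux[23182]: broker.info[2]: rc1.0: running /etc/flux/rc1.d/01-enclosing-instance
EOF
# prints:
# start none join 0.0162958
# parent-ready join init 41.8603
```

On a live node the same function would be used as `journalctl -u flux | broker_transitions`.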

Broker log buffer

The rank 0 broker accumulates log information for the full instance in a circular buffer. For some problems, it may be useful to view this log:

$ sudo flux dmesg -H | tail

[May02 14:51] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[  +0.039371] sched-fluxion-qmanager[0]: alloc success (queue=debug id=184120855100391424)
[  +0.816587] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[  +0.857458] sched-fluxion-qmanager[0]: alloc success (queue=debug id=184120868807376896)
[  +1.364430] sched-fluxion-qmanager[0]: feasibility_request_cb: feasibility succeeded
[  +6.361275] job-ingest[0]: job-frobnicator[0]: inactivity timeout
[  +6.367837] job-ingest[0]: job-validator[0]: inactivity timeout
[ +24.778929] job-exec[0]: exec aborted: id=184120855100391424
[ +24.779019] job-exec[0]: exec_kill: 184120855100391424: signal 15
[ +24.779557] job-exec[0]: exec aborted: id=184120868807376896
[ +24.779632] job-exec[0]: exec_kill: 184120868807376896: signal 15
[ +24.779910] sched-fluxion-qmanager[0]: alloc canceled (id=184120878001291264 queue=debug)
[ +25.155578] job-list[0]: purged 1 inactive jobs
[ +25.162650] job-manager[0]: purged 1 inactive jobs
[ +25.512050] sched-fluxion-qmanager[0]: free succeeded (queue=debug id=184120855100391424)
[ +25.647542] sched-fluxion-qmanager[0]: free succeeded (queue=debug id=184120868807376896)
[ +27.155103] job-list[0]: purged 2 inactive jobs
[ +27.159820] job-manager[0]: purged 2 inactive jobs