Troubleshooting Guide
This guide gives a quick overview of commands and strategies which might be useful when troubleshooting Flux jobs. It is organized by stage in the job lifecycle.
Job Submission
If jobs cannot be submitted to Flux, check the following:
Verify the target queue or queues are enabled with flux queue status or flux queue list:
$ flux queue status pdebug
pdebug: Job submission is enabled
pdebug: Scheduling is started
Check the job-ingest module stats to ensure the module is loaded and that there is no backlog of requests for the job frobnicator or validator pipelines, if configured. The job-ingest module is loaded on every node, so a problem that appears only when submitting from a particular node may indicate an ingest issue on that node:
# flux module stats job-ingest | jq .pipeline
{
"frobnicator": {
"running": 1,
"requests": 2357,
"errors": 3,
"trash": 0,
"backlog": 0,
"pids": [
3280386,
0,
0,
0
]
},
"validator": {
"running": 1,
"requests": 2354,
"errors": 5,
"trash": 0,
"backlog": 0,
"pids": [
3280387,
0,
0,
0
]
}
}
If there is a large errors count, check flux dmesg -H (or journalctl -u flux in a system instance) for errors.
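For example, a quick filter of the broker log can be a starting point (the exact message text depends on the configured validator plugins, so the grep pattern below is only illustrative):
# flux dmesg -H | grep -i ingest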
If there is a nonzero backlog count, check for stuck frobnicator or validator processes. These will be children of the flux-broker and can be found via pstree(1). For example, for a system instance:
# pstree -Tplu flux
flux-broker-114(42368)─┬─python3(3282327)
└─python3(3282328)
If validator or frobnicator processes appear to be stuck, they may need to be killed manually (after collecting any available debug information).
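For example, using one of the PIDs from the pstree output above, capture whatever debug information is possible and then terminate the stuck worker (py-spy is an optional third-party tool that may not be installed):
# py-spy dump --pid 3282328
# kill 3282328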
If job submission is failing after ingest, a jobtap plugin may be involved. Jobtap plugins may further validate a job after it is processed by job-ingest.
Loaded jobtap plugins are listed with flux jobtap list. To include builtin plugins use flux jobtap list -a. Some jobtap plugins support a query with flux jobtap query <name>.so. Plugins may be temporarily removed with flux jobtap remove <name>.so if they are suspect.
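For example, a sketch of listing, querying, and temporarily removing a suspect plugin (site-validate.so is a hypothetical plugin name):
# flux jobtap list
site-validate.so
# flux jobtap query site-validate.so | jq .
# flux jobtap remove site-validate.so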
Job Dependencies
After a job is ingested, a validate event transitions the job to the DEPEND state. If the job remains in the DEPEND state, then an unsatisfied dependency has been placed on the job. Outstanding dependencies are typically available in the output of flux-jobs(1). More information may be obtained by examining the eventlog for dependency-add events, and/or by querying the jobtap plugin responsible for the dependency, e.g.
# flux jobtap query .dependency-after | jq .dependencies[0]
{
"id": 328688799268861952,
"depid": 328688808429221888,
"type": "after-finish",
"description": "after-finish=fmFckQ8UWjZ"
}
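The dependency events can also be pulled directly from the job eventlog (the event shown below is illustrative):
# flux job eventlog -H JOBID | grep dependency
[  +0.031850] dependency-add description="after-finish=fmFckQ8UWjZ"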
Dependencies may be added to a job for numerous reasons, including holding the job in the DEPEND state until some setup is complete. Consult the documentation for the specific component that placed the dependency for further details.
If an expected dependency is not added to a job, then ensure the associated plugin has not been removed. For example, if a job submitted with flux submit -N1 --begin-time=+1h COMMAND does not stay in the DEPEND state for 1 hour, then ensure the .begin-time builtin jobtap plugin is loaded:
# flux jobtap list -a | grep begin-time
Note
Without -a, flux jobtap list suppresses the output of builtin plugins, whose names always start with a single ".".
If the plugin has somehow been removed, try reloading it:
# flux jobtap load .begin-time
# flux jobtap list -a | grep begin-time
.begin-time
Job Prioritization
After all dependencies are resolved and a depend event is emitted, the job transitions to the PRIORITY state. In this state, the job is assigned an initial priority.
If jobs are stuck in the PRIORITY state, then the currently loaded priority plugin may not be able to assign a priority. Check with the provider of the priority plugin for more details.
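As a starting point, check whether a priority event was ever posted for the job, and confirm which jobtap plugins (including the priority plugin, whose name varies by site) are loaded:
# flux job eventlog -H JOBID | grep priority
# flux jobtap list -a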
Job Scheduling
Once a job receives an initial priority, it transitions to the SCHED state. In this state the job manager sends an allocation request to the scheduler, which will reply when the job has been assigned resources. The number of outstanding alloc requests can be viewed with flux queue status -v:
$ flux queue status -v
[snip]
0 alloc requests queued
88 alloc requests pending to scheduler
181 running jobs
If jobs are stuck in the SCHED state, obvious things to check (illustrated below) are:
A scheduler is loaded: flux module list | grep sched
The associated queue is not stopped: flux queue list or flux queue status QUEUE
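The loaded scheduler module name depends on the configuration (for example, sched-simple by default, or sched-fluxion-qmanager and sched-fluxion-resource with Fluxion):
$ flux module list | grep sched
$ flux queue list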
It can be challenging to determine why a particular job is not being scheduled if queues are started and the scheduler is loaded. Some things to check include:
Does the job have a specific constraint for resources that are not currently available?
# flux job info JOBID jobspec | jq .attributes.system.constraints
{
"and": [
{
"hostlist": [
"host1071"
]
},
{
"properties": [
"pbatch"
]
}
]
}
If host1071 is down, then this job can't currently be scheduled.
Is the job held, or does it have a low priority?
$ flux jobs -o {priority} fmctRr2YQ8f
PRI
0
A priority of 0 typically indicates the job is held (urgency 0).
If these do not yield any information, it may be useful to consult the troubleshooting guide of the current scheduler module.
Job Prolog
After the scheduler responds to an alloc request, an alloc event is posted to the job eventlog:
# flux job eventlog -H fmctRr2YQ8f | grep alloc
[ +0.122636] alloc
At this time, one or more prolog-start events may be posted to the eventlog by jobtap plugins. These events prevent the job manager from sending the start request to the job execution system until a corresponding prolog-finish event is emitted.
Once all prolog-finish events have been posted, the start request is sent, and a start event is posted to the job eventlog when the job execution system has launched all the job shells for the job.
# flux job eventlog -H fmctRr2YQ8f
[snip]
[ +0.122930] prolog-start description="job-manager.prolog"
[ +4.288898] prolog-finish description="job-manager.prolog" status=0
[ +4.328720] start
The job-manager.prolog is managed by the perilog.so jobtap plugin, which is responsible for invoking the per-node job prolog when configured (see flux-config-job-manager(5)). Other plugins may post prolog-start events to prevent the job from starting while they perform some kind of job prolog action.
If the prolog-finish event for the job-manager.prolog is not posted in a timely manner, debug information can be obtained directly from the perilog.so plugin:
# flux jobtap query perilog.so | jq .procs
{
"ƒ22EmWZk1mkb": {
"name": "prolog",
"state": "running",
"total": 4,
"active": 1,
"active_ranks": "0",
"remaining_time": 55
}
}
The above shows that a prolog for job ƒ22EmWZk1mkb is currently running. It was executed on a total of 4 broker ranks and is still active on 1 rank, rank 0. The prolog will time out in 55 seconds.
In this case there may be an issue on rank 0, since the prolog has already completed on the other involved ranks. Log into the host associated with broker rank 0 and list the processes in the flux-prolog@JOBID unit to see the process tree of the prolog:
# flux overlay lookup 0
host1
# ssh host1
# systemd-cgls -u flux-prolog@f22EqDxY5rrK.service
Unit flux-prolog@f22EqDxY5rrK.service
├─2056298 /bin/sh /etc/flux/system/prolog
├─2056301 /bin/sh /etc/flux/system/prolog.d/doit.sh
└─2056302 hang
Since the prolog and epilog are executed as ephemeral systemd units, the output from these scripts can be obtained from journalctl -u flux-prolog@*.
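For example, on the affected node (the unit glob should be quoted to prevent shell expansion):
# journalctl -u 'flux-prolog@*' --since "1 hour ago"
# journalctl -u flux-prolog@f22EqDxY5rrK.service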
Job Execution
Once all prolog-start events have a corresponding prolog-finish event, the job manager sends a start request to the job execution system. The job-exec module launches job shells (via the IMP on a multi-user system). Once all shells have started, a start event is posted to the main job eventlog.
When the start event is delayed, the job-exec module can be queried for the job to get some detail:
# flux module stats job-exec | jq .jobs.fmenpX365oV
{
"implementation": "bulk-exec",
"ns": "job-331656486798360576",
"critical_ranks": "0-3",
"multiuser": 1,
"has_namespace": 1,
"exception_in_progress": 0,
"started": 1,
"running": 0,
"finalizing": 0,
"kill_timeout": 5.0,
"kill_count": 0,
"kill_shell_count": 0,
"total_shells": 4,
"active_shells": 3,
"active_ranks": "8-10"
}
This output shows that there should be a total of 4 job shells (total_shells), of which only 3 are active (active_shells) on ranks 8, 9, and 10 (active_ranks). In this case, the missing rank should be investigated (check the output of flux jobs -no {ranks} JOBID for the expected ranks, as shown below). Note also that the job-exec module has marked the job shells as started but not yet running. This situation is highly unlikely but demonstrative. The active_ranks key is much more useful when jobs are stuck exiting in CLEANUP state due to a stuck job shell.
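For example, the set of ranks assigned to the job can be compared against active_ranks above (the output shown is illustrative):
$ flux jobs -no {ranks} fmenpX365oV
8-11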
The exec eventlog may also be useful when jobs appear to be stuck at launch:
# flux job eventlog -Hp exec fmenpX365oV
[Apr02 18:48] init
[ +0.021699] starting
[ +0.346878] shell.init service="28220-shell-fmenpX365oV" leader-rank=1046 size=1
[ +0.405225] shell.start taskmap={"version":1,"map":[[0,1,1,1]]}
The above output shows a normal exec eventlog. The job-exec module writes the init and starting events to the eventlog. The job shell writes the shell.init event after the first shell barrier has been reached. The shell.start event indicates all job shells have started all job tasks.
If the shell.init event is not posted, then one or more shells may be slow to start or may otherwise not be reaching the first barrier. Eventually, the job execution system will time out the barrier and drain the affected nodes with a message:
job JOBID start timeout: possible node hang
The job will have a fatal exception raised of the form:
start barrier timeout waiting for N/M nodes (ranks X-Y)
Check the affected ranks for issues.
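The hostnames for the affected ranks can be found with flux overlay lookup, and any nodes drained as a result (along with the drain reason) are listed by flux resource drain; the rank shown below is illustrative:
# flux overlay lookup 1046
# flux resource drain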
Job Cleanup
When a job appears to be stuck in CLEANUP state, first check for the finish event in the job eventlog:
# flux job eventlog -H JOBID | grep finish
If there is no finish event, then the job-exec module thinks there are still active job shells. Use flux module stats job-exec to find the active_ranks:
# flux module stats job-exec | jq .jobs.<JOBID>
{
"implementation": "bulk-exec",
"ns": "job-331656486798360576",
"critical_ranks": "0-3",
"multiuser": 1,
"has_namespace": 1,
"exception_in_progress": 1,
"started": 1,
"running": 1,
"finalizing": 0,
"kill_timeout": 5.0,
"kill_count": 1,
"kill_shell_count": 0,
"total_shells": 4,
"active_shells": 1,
"active_ranks": "8"
}
In the output above, 1 (active_shells) of the 4 (total_shells) job shells is still active, on rank 8 (active_ranks). The hostname for rank 8 may be obtained via flux overlay lookup 8.
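For example (the hostname shown is illustrative), look up the node and check for a lingering flux-shell process there:
# flux overlay lookup 8
host9
# ssh host9 ps -ef | grep flux-shell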
If there is a finish event posted to the job eventlog, then look for any epilog-start event without a corresponding epilog-finish. Consult the documentation of the affected epilog action for further debugging. For a job-manager.epilog, use flux jobtap query perilog.so to determine the state of the epilog. Check for any unexpected active processes on the ranks listed in the active_ranks key.
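The same perilog.so query shown for the prolog above also reports epilog state, for example:
# flux jobtap query perilog.so | jq .procs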
Housekeeping
After a job posts the clean event to the job eventlog, resources are released to the job manager, which then starts housekeeping on those resources if configured.
Nodes in housekeeping are displayed in the output of flux resource status:
$ flux resource status -s housekeeping
STATE UP NNODES NODELIST
housekeeping ✔ 8 tuolumne[1174,1747-1748,1751-1753,1775-1776]
More information is available with flux housekeeping list:
$ flux housekeeping list
JOBID NNODES #ACTIVE RUNTIME NODELIST
fmkHggkEBMy 4 1 53.92s tuolumne[1174,1747-1748,1751]
fmkHiMPjMsm 4 4 16.14s tuolumne[1752-1753,1775-1776]
See flux-housekeeping(1) for details.
Like the prolog and epilog, housekeeping is executed in a per-job systemd transient unit of the form flux-housekeeping@JOBID. Use systemd-cgls to list processes and journalctl -u flux-housekeeping@JOBID to debug housekeeping scripts.
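For example, using the first job from the flux housekeeping list output above, on one of the nodes still running housekeeping:
# systemd-cgls -u flux-housekeeping@fmkHggkEBMy.service
# journalctl -u flux-housekeeping@fmkHggkEBMy.service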