Troubleshooting Guide
This guide gives a quick overview of commands and strategies which might be useful when troubleshooting Flux jobs. It is organized by stage in the job lifecycle.
Job Submission
If jobs cannot be submitted to Flux, check the following:
Verify the target queue or queues are enabled with flux queue status or flux queue list:
$ flux queue status pdebug
pdebug: Job submission is enabled
pdebug: Scheduling is started
Check the job-ingest module stats to ensure the module is loaded and that there is no backlog of requests for the job frobnicator or validator pipelines, if configured. The job-ingest module is loaded on every node, so a problem that appears only when submitting from a particular node may indicate an ingest issue on that node:
# flux module stats job-ingest | jq .pipeline
{
"frobnicator": {
"running": 1,
"requests": 2357,
"errors": 3,
"trash": 0,
"backlog": 0,
"pids": [
3280386,
0,
0,
0
]
},
"validator": {
"running": 1,
"requests": 2354,
"errors": 5,
"trash": 0,
"backlog": 0,
"pids": [
3280387,
0,
0,
0
]
}
}
If there is a large errors count, check flux dmesg -H (or journalctl -u flux in a system instance) for errors.
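For example, a quick filter of the broker log can be a starting point (the exact message text depends on the configured validator plugins, so the grep pattern below is only illustrative):
# flux dmesg -H | grep -i ingest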
If there is a nonzero backlog count, check for stuck frobnicator or validator processes. These will be children of the flux-broker and can be found via pstree(1). For example, for a system instance:
# pstree -Tplu flux
flux-broker-114(42368)─┬─python3(3282327)
└─python3(3282328)
If validator or frobnicator processes appear to be stuck, they may need to be killed manually (after collecting any available debug information).
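For example, using one of the PIDs from the pstree output above, capture whatever debug information is possible and then terminate the stuck worker (py-spy is an optional third-party tool that may not be installed):
# py-spy dump --pid 3282328
# kill 3282328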
If job submission is failing after ingest, a jobtap plugin may be involved. Jobtap plugins may further validate a job after it is processed by job-ingest.
Loaded jobtap plugins are listed with flux jobtap list. To include builtin plugins use flux jobtap list -a. Some jobtap plugins support a query with flux jobtap query <name>.so. Plugins may be temporarily removed with flux jobtap remove <name>.so if they are suspect.
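For example, a sketch of listing, querying, and temporarily removing a suspect plugin (site-validate.so is a hypothetical plugin name):
# flux jobtap list
site-validate.so
# flux jobtap query site-validate.so | jq .
# flux jobtap remove site-validate.so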
Job Dependencies
After a job is ingested, a validate event transitions the job to the DEPEND state. If the job remains in the DEPEND state, then an unsatisfied dependency has been placed on the job. Outstanding dependencies are typically available in the output of flux-jobs(1). More information may be obtained by examining the eventlog for dependency-add events, and/or by querying the jobtap plugin responsible for the dependency, e.g.
# flux jobtap query .dependency-after | jq .dependencies[0]
{
"id": 328688799268861952,
"depid": 328688808429221888,
"type": "after-finish",
"description": "after-finish=fmFckQ8UWjZ"
}
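The dependency events can also be pulled directly from the job eventlog (the event shown below is illustrative):
# flux job eventlog -H JOBID | grep dependency
[  +0.031850] dependency-add description="after-finish=fmFckQ8UWjZ"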
Dependencies may be added to a job for numerous reasons, including holding the job in the DEPEND state until some setup is complete. Consult the documentation for the specific component that placed the dependency for further details.
If an expected dependency is not added to a job, then ensure the associated plugin has not been removed. For example, if a job submitted with flux submit -N1 --begin-time=+1h COMMAND does not stay in the DEPEND state for 1 hour, then ensure the .begin-time builtin jobtap plugin is loaded:
# flux jobtap list -a | grep begin-time
Note
Without -a, flux jobtap list suppresses the output of builtin plugins, whose names always start with a single ".".
If the plugin has somehow been removed, try reloading it:
# flux jobtap load .begin-time
# flux jobtap list -a | grep begin-time
.begin-time
Job Prioritization
After all dependencies are resolved and a depend event is emitted, the job transitions to the PRIORITY state. In this state, the job is assigned an initial priority.
If jobs are stuck in the PRIORITY state, then the currently loaded priority plugin may not be able to assign a priority. Check with the provider of the priority plugin for more details.
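As a starting point, check whether a priority event was ever posted for the job, and confirm which jobtap plugins (including the priority plugin, whose name varies by site) are loaded:
# flux job eventlog -H JOBID | grep priority
# flux jobtap list -a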
Job Scheduling
Once a job receives an initial priority, it transitions to the SCHED state. In this state the job manager sends an allocation request to the scheduler, which will reply when the job has been assigned resources. The number of outstanding alloc requests can be viewed with flux queue status -v:
$ flux queue status -v
[snip]
0 alloc requests queued
88 alloc requests pending to scheduler
181 running jobs
If jobs are stuck in the SCHED state, obvious things to check (illustrated below) are:
A scheduler is loaded: flux module list | grep sched
The associated queue is not stopped: flux queue list or flux queue status QUEUE
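The loaded scheduler module name depends on the configuration (for example, sched-simple by default, or sched-fluxion-qmanager and sched-fluxion-resource with Fluxion):
$ flux module list | grep sched
$ flux queue list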
It can be challenging to determine why a particular job is not being scheduled if queues are started and the scheduler is loaded. Some things to check include:
Does the job have a specific constraint for resources that are not currently available?
# flux job info JOBID jobspec | jq .attributes.system.constraints
{
"and": [
{
"hostlist": [
"host1071"
]
},
{
"properties": [
"pbatch"
]
}
]
}
If host1071 is down, then this job can't currently be scheduled.
Is the job held, or does it have a low priority?
$ flux jobs -o {priority} fmctRr2YQ8f
PRI
0
A priority of 0 typically indicates the job is held (urgency 0).
If these do not yield any information, it may be useful to consult the troubleshooting guide of the current scheduler module.
Job Prolog
After the scheduler responds to an alloc request, an alloc event is posted to the job eventlog:
# flux job eventlog -H fmctRr2YQ8f | grep alloc
[ +0.122636] alloc
At this time, one or more prolog-start events may be posted to the eventlog by jobtap plugins. These events prevent the job manager from sending the start request to the job execution system until a corresponding prolog-finish event is emitted.
Once all prolog-finish events have been posted, the start request is sent, and a start event is posted to the job eventlog when the job execution system has launched all the job shells for the job.
# flux job eventlog -H fmctRr2YQ8f
[snip]
[ +0.122930] prolog-start description="job-manager.prolog"
[ +4.288898] prolog-finish description="job-manager.prolog" status=0
[ +4.328720] start
The job-manager.prolog is managed by the perilog.so jobtap plugin, which is responsible for invoking the per-node job prolog when configured (see flux-config-job-manager(5)). Other plugins may post prolog-start events to prevent the job from starting while they perform some kind of job prolog action.
If the prolog-finish event for the job-manager.prolog is not posted in a timely manner, debug information can be obtained directly from the perilog.so plugin:
# flux jobtap query perilog.so | jq .procs
{
"ƒ22EmWZk1mkb": {
"name": "prolog",
"state": "running",
"total": 4,
"active": 1,
"active_ranks": "0",
"remaining_time": 55
}
}
The above shows that a prolog for job ƒ22EmWZk1mkb is currently running. It was executed on a total of 4 broker ranks and is still active on 1 rank, rank 0. The prolog will time out in 55 seconds.
In this case there may be an issue on rank 0, since the prolog has already completed on the other involved ranks. Log into the host associated with broker rank 0 and list the processes in the flux-prolog@JOBID unit to see the process tree of the prolog:
# flux overlay lookup 0
host1
# ssh host1
# systemd-cgls -u flux-prolog@f22EqDxY5rrK.service
Unit flux-prolog@f22EqDxY5rrK.service
├─2056298 /bin/sh /etc/flux/system/prolog
├─2056301 /bin/sh /etc/flux/system/prolog.d/doit.sh
└─2056302 hang
Since the prolog and epilog are executed as ephemeral systemd units, the output from these scripts can be obtained from journalctl -u flux-prolog@*.
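For example, on the affected node (the unit glob should be quoted to prevent shell expansion):
# journalctl -u 'flux-prolog@*' --since "1 hour ago"
# journalctl -u flux-prolog@f22EqDxY5rrK.service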
Job Execution
Once all prolog-start events have a corresponding prolog-finish event, the job manager sends a start request to the job execution system. The job-exec module launches job shells (via the IMP on a multi-user system). Once all shells have started, a start event is posted to the main job eventlog.
When the start event is delayed, the job-exec module can be queried for the job to get some detail:
# flux module stats job-exec | jq .jobs.fmenpX365oV
{
"implementation": "bulk-exec",
"ns": "job-331656486798360576",
"critical_ranks": "0-3",
"multiuser": 1,
"has_namespace": 1,
"exception_in_progress": 0,
"started": 1,
"running": 0,
"finalizing": 0,
"kill_timeout": 5.0,
"kill_count": 0,
"kill_shell_count": 0,
"total_shells": 4,
"active_shells": 3,
"active_ranks": "8-10"
}
This output shows that there should be a total of 4 job shells (total_shells), of which only 3 are active (active_shells) on ranks 8, 9, and 10 (active_ranks). In this case, the missing rank should be investigated (check the output of flux jobs -no {ranks} JOBID for the expected ranks, as shown below). Note also that the job-exec module has marked the job shells as started but not yet running. This situation is highly unlikely but demonstrative. The active_ranks key is much more useful when jobs are stuck exiting in CLEANUP state due to a stuck job shell.
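For example, the set of ranks assigned to the job can be compared against active_ranks above (the output shown is illustrative):
$ flux jobs -no {ranks} fmenpX365oV
8-11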
The exec eventlog may also be useful when jobs appear to be stuck at launch:
# flux job eventlog -Hp exec fmenpX365oV
[Apr02 18:48] init
[ +0.021699] starting
[ +0.346878] shell.init service="28220-shell-fmenpX365oV" leader-rank=1046 size=1
[ +0.405225] shell.start taskmap={"version":1,"map":[[0,1,1,1]]}
The above output shows a normal exec eventlog. The job-exec module writes the init and starting events to the eventlog. The job shell writes the shell.init event after the first shell barrier has been reached. The shell.start event indicates all job shells have started all job tasks.
If the shell.init event is not posted, then one or more shells may be slow to start or may otherwise not be reaching the first barrier. Eventually, the job execution system will time out the barrier and drain the affected nodes with a message:
job JOBID start timeout: possible node hang
The job will have a fatal exception raised of the form:
start barrier timeout waiting for N/M nodes (ranks X-Y)
Check the affected ranks for issues.
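The hostnames for the affected ranks can be found with flux overlay lookup, and any nodes drained as a result (along with the drain reason) are listed by flux resource drain; the rank shown below is illustrative:
# flux overlay lookup 1046
# flux resource drain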
Job Cleanup
When a job appears to be stuck in CLEANUP state, first check for the finish event in the job eventlog:
# flux job eventlog -H JOBID | grep finish
If there is no finish event, then the job-exec module thinks there are still active job shells. Use flux module stats job-exec to find the active_ranks:
# flux module stats job-exec | jq .jobs.<JOBID>
{
"implementation": "bulk-exec",
"ns": "job-331656486798360576",
"critical_ranks": "0-3",
"multiuser": 1,
"has_namespace": 1,
"exception_in_progress": 1,
"started": 1,
"running": 1,
"finalizing": 0,
"kill_timeout": 5.0,
"kill_count": 1,
"kill_shell_count": 0,
"total_shells": 4,
"active_shells": 1,
"active_ranks": "8"
}
In the output above, 1 (active_shells) of the 4 (total_shells) job shells is still active, on rank 8 (active_ranks). The hostname for rank 8 may be obtained via flux overlay lookup 8.
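For example (the hostname shown is illustrative), look up the node and check for a lingering flux-shell process there:
# flux overlay lookup 8
host9
# ssh host9 ps -ef | grep flux-shell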
If there is a finish event posted to the job eventlog, then look for any epilog-start event without a corresponding epilog-finish. Consult the documentation of the affected epilog action for further debugging. For a job-manager.epilog, use flux jobtap query perilog.so to determine the state of the epilog. Check for any unexpected active processes on the ranks listed in the active_ranks key.
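The same perilog.so query shown for the prolog above also reports epilog state, for example:
# flux jobtap query perilog.so | jq .procs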
Housekeeping
After a job posts the clean event to the job eventlog, resources are released to the job manager, which then starts housekeeping on those resources if configured.
Nodes in housekeeping are displayed in the output of flux resource status:
$ flux resource status -s housekeeping
STATE UP NNODES NODELIST
housekeeping ✔ 8 tuolumne[1174,1747-1748,1751-1753,1775-1776]
More information is available with flux housekeeping list:
$ flux housekeeping list
JOBID NNODES #ACTIVE RUNTIME NODELIST
fmkHggkEBMy 4 1 53.92s tuolumne[1174,1747-1748,1751]
fmkHiMPjMsm 4 4 16.14s tuolumne[1752-1753,1775-1776]
See flux-housekeeping(1) for details.
Like the prolog and epilog, housekeeping is executed in a per-job systemd transient unit of the form flux-housekeeping@JOBID. Use systemd-cgls to list processes and journalctl -u flux-housekeeping@JOBID to debug housekeeping scripts.
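For example, using the first job from the flux housekeeping list output above, on one of the nodes still running housekeeping:
# systemd-cgls -u flux-housekeeping@fmkHggkEBMy.service
# journalctl -u flux-housekeeping@fmkHggkEBMy.service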