flux-housekeeping(1)

SYNOPSIS

flux housekeeping list [-n] [-o FORMAT]
flux housekeeping kill [--all] [-j JOBID] [-t HOSTS|RANKS] [-s SIGNUM]

DESCRIPTION

The housekeeping service provides similar functionality to a job epilog, with a few advantages

  • Housekeeping runs after the job, which is then allowed to exit CLEANUP state and become inactive once resources are released.

  • While housekeeping is running, the scheduler still thinks resources are allocated to the job, and will not allocate resources to other jobs.

  • Housekeeping supports partial release of resources back to the scheduler, such that a subset of stuck nodes do not hold up other nodes from being returned to service.

The flux housekeeping command is used to interact with the housekeeping service. It supports listing the resources currently executing housekeeping actions and a command to forcibly terminate actions on a per-job or per-node basis.

In a Flux system instance, housekeeping is configured by default to run as a one-shot systemd.unit(5). See TROUBLESHOOTING below.

COMMANDS

list

flux housekeeping list lists active housekeeping tasks by jobid.

-i, --include=TARGETS

Filter results to only include resources matching TARGETS, which may be specified either as an idset of broker ranks or a list of hosts in hostlist form. It is not an error to specify ranks or hosts that do not exist.

-o, --format=FORMAT

Customize the output format (See the OUTPUT FORMAT section below).

-n, --no-header

Suppress header from output.

kill

flux housekeeping kill can be used to terminate active housekeeping tasks. Housekeeping may be terminated by jobid, a set of targets such as broker ranks or hostnames, or all housekeeping may be terminated via the --all option.

-s, --signal=SIGNUM

Send signal SIGNUM instead of SIGTERM.

-t, --targets=RANK|HOSTS

Target a specific set of ranks or hosts.

-j, --jobid=JOBID

Target a specific job by JOBID. Without --targets this will kill all housekeeping tasks for the specified job.

--all

Target all housekeeping tasks for all jobs.

OUTPUT FORMAT

The --format option can be used to specify an output format using Python's string format syntax or a defined format by name. For a list of built-in and configured formats use -o help.

The following field names can be specified for flux housekeeping list:

id

The jobid that triggered housekeeping

runtime

The time since this housekeeping task started

nnodes

A synonym for allocated.nnodes

ranks

A synonym for allocated.ranks

nodelist

A synonym for allocated.nodelist

allocated.nnodes

The number of nodes still allocated to this housekeeping task.

allocated.ranks

The list of broker ranks still allocated to this housekeeping task.

allocated.ranks

The list of nodes still allocated to this housekeeping task.

pending.nnodes

The number of nodes that still need to complete housekeeping.

pending.ranks

The list of broker ranks that still need to complete housekeeping.

pending.ranks

The list of nodes that still need to complete housekeeping.

TROUBLESHOOTING

In a Flux system instance, housekeeping is configured by default to run as a systemd.unit(5) named flux-housekeeping@JOBID.

systemctl(1) can show the status of housekeeping units running on the local node:

$ systemctl status flux-housekeeping@*

journalctl(1) shows standard output and error of a housekeeping run:

$ journalctl -u flux-housekeeping@f4aTGTz2SN3

When housekeeping fails, the systemd unit script drains the failing nodes with the reason obtained from systemd. For example, housekeeping runs that failed due to a nonzero exit code are distinguished from those that were aborted early due to a signal. In addition, a failure message is logged to Flux and can be accessed with flux-dmesg(1).

When housekeeping hangs, no automated action is taken by Flux. Sending housekeeping a signal with flux housekeeping kill causes systemctl stop to be run on the housekeeping unit. Generally, it is best to let systemd take over from there. Its default action is to send SIGTERM to all processes in the control group, then SIGKILL if any processes have not terminated after a 90s delay.

Note

On systems with scheduler configurations that permit jobs to share nodes, multiple housekeeping units may execute concurrently on a single node. Housekeeping scripts must be crafted with that in mind on such systems.

CAVEATS

The flux-housekeeping@ systemd unit is responsible for draining nodes when housekeeping fails. Therefore if the system is configured to bypass the systemd unit file, or if housekeeping is misconfigured such that the the systemd unit file is not started, this draining does not occur.

RESOURCES

Flux: http://flux-framework.org

Flux RFC: https://flux-framework.readthedocs.io/projects/flux-rfc

Issue Tracker: https://github.com/flux-framework/flux-core/issues

SEE ALSO

flux-config-job-manager(5)