flux-config-rabbit(5)

DESCRIPTION

A Flux system instance must be configured before it can interact with HPE rabbit software. No configuration is necessary in Flux sub-instances.

COMPONENTS

Jobtap Plugin

For a Flux system instance to allocate rabbit storage, the flux-jobtap-dws(1) plugin must be loaded in the leader broker of the Flux system instance. The plugin can be loaded in a config file like so:

[job-manager]
plugins = [
  { load = "dws-jobtap.so", conf = { epilog-timeout = 0.0 }}
]

Systemd Service

The flux-coral2-dws systemd service must also be started on the same node as the rank 0 broker of the system instance. The flux user must have a kubeconfig file in its home directory granting it read and write access to, at a minimum, Storages, Workflows, Servers, and Computes resources (all of which are defined by dataworkflowservices). Instructions for granting Flux the minimum necessary permissions by setting up role-based access control are available here.

Fluxion Configuration

The Fluxion scheduler must be configured to recognize rabbit resources. This can be done by generating a file describing the rabbit layout for the cluster and then running flux dws2jgf like so:

flux rabbitmapping > /tmp/rabbitmapping.json
flux dws2jgf [--no-validate] --from-config /etc/flux/system/conf.d/resource.toml --only-sched /tmp/rabbitmapping.json

The output (which may be large) must be saved to a file, and the resource.scheduling config key must be set to the path of that file (see flux-config-resource(5)).
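
As a rough sketch of that last step, the fragment below assumes the flux dws2jgf output was saved to /etc/flux/system/rabbit-jgf.json. That file name is purely hypothetical; only the resource.scheduling key itself comes from the text above, and it would normally sit alongside whatever entries the [resource] table already contains.

```toml
[resource]
# Hypothetical path: wherever the `flux dws2jgf` output was saved
scheduling = "/etc/flux/system/rabbit-jgf.json"
```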

To allow Fluxion to restart when using this JGF (JSON Graph Format) file, Fluxion must be configured to use a match-format of rv1 instead of the otherwise recommended default of rv1_nosched.

For example, in a config file:

[sched-fluxion-resource]
match-format = "rv1"

Prolog/Epilog Scripts

Prolog and epilog scripts, provided by the flux-coral2 package, run automatically during those phases of a job. The prolog script starts the nnf-clientmount service and the epilog script stops it. The prolog script also holds the job in the prolog phase until the rabbit file systems have been mounted.

Shell Plugin

A dws_environment shell plugin, responsible for managing the rabbit environment presented to applications, is loaded automatically for each job.

KEYS

The rabbit config table captures site-general policies and options for Flux's interactions with the rabbits. The following keys are valid:

mapping (string): (required) Path to rabbitmapping file for the cluster, as generated by flux-rabbitmapping(1).
kubeconfig (string): (optional) Path to kubeconfig file for Flux to use, ideally with restricted permissions. This can be left undefined if the file is placed at the path ~flux/.kube/config (assuming the flux user is the instance owner).
tc_timeout (integer): (optional) Time in seconds to tolerate a workflow stuck in TransientCondition state before killing the associated job. Defaults to 10 seconds.
teardown_after (float): (optional) Maximum time for a workflow to be in either PostRun or DataOut state before it is moved to Teardown. If unset or negative, allow the workflow to stay in those states indefinitely. See also the epilog-timeout option to flux-jobtap-dws(1), which is similar but takes more drastic action. It may be useful to set the teardown_after timeout to something smaller than the epilog-timeout, to give the NNF software time to clean up before the epilog-timeout takes effect.
setup_timeout (float): (optional) Maximum time for a workflow to be in the Setup state before the job is canceled. If unset or negative, do not set a timer.
prerun_timeout (float): (optional) Maximum time for a workflow to be in the PreRun state before the job is canceled. If unset or negative, do not set a timer.
postrun_timeout (float): (optional) Maximum time for a workflow to be in the PostRun state before it is moved to Teardown. If unset or negative, do not set a timer. If both postrun_timeout and teardown_after are set, postrun_timeout should be set to a smaller number.
drain_compute_nodes (boolean): (optional) Whether to automatically drain compute nodes that lose PCIe connection with their rabbit. Defaults to true.
save_datamovements (integer): (optional) Number of nnfdatamovement resources to save to each job's KVS. Saving them may be useful for debugging, but saving too many may degrade performance. Defaults to 0.
restrict_persistent_creation (boolean): (optional) Restrict the creation of persistent file systems to the instance owner (in most cases the flux user).
prolog_timeout (FSD): (optional) Maximum time in Flux Standard Duration format to wait for the dws_environment event in the prolog script.
policy.maximums (table): (optional) The maximum filesystem capacity per node, in GiB, that users may request. Leave undefined for no limit. See below for an example.
presets (table): (optional) Defines preset #DW strings. Presets can save users time by letting them run, for instance, flux alloc -N1 -S dw=NAME rather than flux alloc -N1 -S "dw=#DW jobdw ..." See below for an example.
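
The timeout keys above interact with one another and with the epilog-timeout option of flux-jobtap-dws(1). The fragment below is a minimal sketch of one consistent ordering (postrun_timeout below teardown_after, which is in turn below epilog-timeout). All values are illustrative assumptions rather than recommendations, prolog_timeout is assumed to be written as an FSD string, and the required mapping key and the other rabbit keys are omitted for brevity.

```toml
[rabbit]
# Cancel the job if the workflow stays in Setup or PreRun too long
setup_timeout = 600.0
prerun_timeout = 900.0
# Move the workflow to Teardown after this long in PostRun ...
postrun_timeout = 1800.0
# ... or after this long in either PostRun or DataOut
teardown_after = 4800.0
# Flux Standard Duration; assumed here to be given as a string
prolog_timeout = "10m"

[job-manager]
plugins = [
  # epilog-timeout is kept larger than teardown_after
  { load = "dws-jobtap.so", conf = { epilog-timeout = 5400.0 }}
]
```

Keeping each timeout smaller than the next, more drastic one gives the NNF software a chance to clean up before the job itself is acted upon.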

EXAMPLE

[rabbit]

# mapping is required; this path is only an example
mapping = "/etc/flux/system/rabbitmapping"
kubeconfig = "/var/flux/.kube/config"
tc_timeout = 600
drain_compute_nodes = true
save_datamovements = 5
restrict_persistent_creation = true
teardown_after = 4800.0

# maximum filesystem capacity per node, in GiB
[rabbit.policy.maximums]
xfs = 1024
gfs2 = 2048
raw = 4096
lustre = 1024

# defines preset #DW strings
[rabbit.presets]

small_xfs = "#DW jobdw type=xfs capacity=100GiB name=smallxfs"
large_lustre = "#DW jobdw type=lustre capacity=50TiB name=largelustre"


[job-manager]
plugins = [
  { load = "dws-jobtap.so", conf = { epilog-timeout = 5400.0 }}
]


[sched-fluxion-resource]
match-format = "rv1"