Flux with Rabbits

How to Allocate Rabbit Storage

Request rabbit storage allocations for a job by setting the .attributes.system.dw field in a jobspec to a string containing one or more DW directives. A JSON list of DW directives is also accepted, but cannot be provided on the command line. (It is however possible when constructing jobspecs with Flux's Python bindings, or when creating jobspecs directly.)

On the command line, set the .attributes.system.dw field by passing flags like -S dw="my_string" or --setattr=dw="my_string".
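The list form can be illustrated without a Flux installation by editing a jobspec that has already been loaded as a Python dictionary. This is only a minimal sketch: `set_dw` is a hypothetical helper, not part of Flux, and the jobspec in the usage lines is a stripped-down placeholder rather than a complete or valid jobspec.

```python
import json


def set_dw(jobspec, directives):
    """Store DW directives at .attributes.system.dw of a jobspec dictionary.

    `directives` may be a single string or a list of strings.
    """
    system = jobspec.setdefault("attributes", {}).setdefault("system", {})
    system["dw"] = directives
    return jobspec


# Placeholder jobspec: real jobspecs also carry resources, tasks, and a version.
jobspec = {"attributes": {"system": {"duration": 0}}}
set_dw(
    jobspec,
    [
        "#DW jobdw type=xfs capacity=1TiB name=xfsproject",
        "#DW jobdw type=lustre capacity=10GiB name=lustreproject",
    ],
)
print(json.dumps(jobspec["attributes"], indent=2))
```

Existing keys under `attributes.system` are left untouched; only the `dw` key is added or overwritten.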

The simplest way to use rabbits is to use preset DW strings that administrators may have configured on a system. For more flexibility, it may be desirable to write custom DW strings.

Preset DW Directives

Running

flux config get rabbit.presets | jq .

will display a mapping of the presets that have been configured on the current system. The keys of the mapping are the strings that you can pass to -S dw=..., and the values are the actual DW directives that those strings correspond to.

Preset Example

Suppose flux config get rabbit.presets | jq . returns the following:

{
  "xfs_small": "#DW jobdw type=xfs capacity=256GiB name=xfssmall",
  "xfs_large": "#DW jobdw type=xfs capacity=1TiB name=xfslarge",
  "lustre_large": "#DW jobdw type=lustre capacity=10TiB name=lustrelarge"
}

Then you should be able to add the flag -S dw=lustre_large to flux run/batch/alloc/submit and have 10 TiB of Lustre storage accessible to your job at the path given by the $DW_JOB_lustrelarge environment variable. Similarly, -S dw=xfs_small and -S dw=xfs_large would give your job 256 GiB and 1 TiB of XFS storage at $DW_JOB_xfssmall and $DW_JOB_xfslarge, respectively.
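The relationship between a preset key, its directive, and the resulting environment variable can be sketched in a few lines of Python. The JSON below is based on the preset example above; `env_var_for` is a hypothetical helper that assumes the variable name is always `DW_JOB_` followed by the directive's `name=` argument, as in the examples in this document.

```python
import json
import re

# Sample presets based on the example above (a hypothetical system).
PRESETS_JSON = """
{
  "xfs_small": "#DW jobdw type=xfs capacity=256GiB name=xfssmall",
  "xfs_large": "#DW jobdw type=xfs capacity=1TiB name=xfslarge",
  "lustre_large": "#DW jobdw type=lustre capacity=10TiB name=lustrelarge"
}
"""


def env_var_for(directive):
    """Return the DW_JOB_<name> variable implied by a jobdw directive."""
    match = re.search(r"\bname=(\S+)", directive)
    if match is None:
        raise ValueError(f"no name= argument in {directive!r}")
    return "DW_JOB_" + match.group(1)


presets = json.loads(PRESETS_JSON)
for preset, directive in presets.items():
    print(f"-S dw={preset} -> ${env_var_for(directive)}")
```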

Custom DW Directives

Writing custom DW directives instead of using preset strings allows for much greater flexibility.

DW directives are strings that start with #DW. Directives that begin with #DW jobdw request storage that lasts for the lifetime of the associated Flux job. Directives that begin with #DW copy_in and #DW copy_out describe data movement to and from the rabbits, respectively.

Full documentation of the DW directives and their arguments is available here.

The usage with Flux is most easily understood by example.

Examples of custom jobdw directives

Requesting a 10 GiB XFS file system per compute node on the command line:

$ flux alloc -N2 -S dw="#DW jobdw type=xfs capacity=10GiB name=project1"
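When directives are generated by a script rather than typed by hand, a small formatter helps keep them well formed. The sketch below is an illustration only: `jobdw` is a hypothetical helper covering just the three arguments used in this document (type, capacity, and name), and it does not check the values against what a particular system supports.

```python
def jobdw(fs_type, capacity, name):
    """Format a #DW jobdw directive from its type, capacity, and name."""
    for value in (fs_type, capacity, name):
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"argument must be non-empty with no whitespace: {value!r}")
    return f"#DW jobdw type={fs_type} capacity={capacity} name={name}"


directive = jobdw("xfs", "10GiB", "project1")
# Argument list equivalent to the flux alloc command shown above.
argv = ["flux", "alloc", "-N2", "-S", f"dw={directive}"]
print(argv)
```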

Requesting both XFS and Lustre file systems in a batch script:

#!/bin/bash

#FLUX: -N 2
#FLUX: -q pdebug
#FLUX: -S dw="""
#FLUX: #DW jobdw type=xfs capacity=1TiB name=xfsproject
#FLUX: #DW jobdw type=lustre capacity=10GiB name=lustreproject
#FLUX: """

echo "Hello World!" > $DW_JOB_lustreproject/world.txt

flux submit -N2 -n2 /bin/bash -c "echo 'Hello World!' > $DW_JOB_xfsproject/world.txt"

Data Movement Examples

Warning

When writing copy_in and copy_out directives on the command line, be careful to always escape the $ character when writing DW_JOB_[name] variables. Otherwise your shell will expand them. This warning does not apply to batch scripts.

Requesting a 10 GiB XFS file system per compute node on the command line with data movement both to and from the rabbits (the source directory is assumed to exist):

$ flux alloc -N2 -S dw="#DW jobdw type=xfs capacity=10GiB name=project1
#DW copy_in source=/p/lustre1/$USER/dir_in destination=\$DW_JOB_project1/
#DW copy_out source=\$DW_JOB_project1/ destination=/p/lustre1/$USER/dir_out/"
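The escaping concern comes from the shell, so it also disappears when a program launches Flux with an argument list instead of a shell command line. The sketch below shows this with Python's subprocess module. A small child process stands in for flux alloc and simply prints the argument it receives, so the example runs without Flux; the `someuser` fallback is a placeholder.

```python
import os
import subprocess
import sys

user = os.environ.get("USER", "someuser")  # expanded here on purpose
dw = "\n".join(
    [
        "#DW jobdw type=xfs capacity=10GiB name=project1",
        f"#DW copy_in source=/p/lustre1/{user}/dir_in destination=$DW_JOB_project1/",
        f"#DW copy_out source=$DW_JOB_project1/ destination=/p/lustre1/{user}/dir_out/",
    ]
)

# Stand-in for `flux alloc`: a child process that prints the argument it received.
echo_arg = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
received = subprocess.run(
    echo_arg + [f"dw={dw}"], capture_output=True, text=True, check=True
).stdout
print(received)
```

Because no shell is involved, `$DW_JOB_project1` reaches the child process literally and needs no backslash.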

Requesting a Lustre file system, with data movement out from the rabbits, in a batch script:

#!/bin/bash

#FLUX: -N 2
#FLUX: -q pdebug
#FLUX: -S dw="""
#FLUX: #DW jobdw type=lustre capacity=100GiB name=lustreproject
#FLUX: #DW copy_out source=$DW_JOB_lustreproject destination=/p/lustre1/$USER/lustreproject_results
#FLUX: """

echo "Hello World!" > $DW_JOB_lustreproject/world.txt

Enabling Rabbit Fault Tolerance

Imagine you submit a ten-thousand-node Flux job that requests a rabbit file system. The job sits in the queue for a long time waiting to be scheduled. Finally your job is assigned resources, but then a single node fails to mount its file system, and the job fails before it even starts running. You might wish that Flux had just ignored the single node failure and proceeded with the remaining 9,999 nodes.

The .attributes.system.dw_failure_tolerance field in a jobspec can help you in cases like the one just described. Set the field to a positive integer N, and Flux will allow up to N nodes in your job to fail to create or to access rabbit file systems. For example, flux alloc [OPTIONS] -S dw_failure_tolerance=16 would allow the loss of up to 16 nodes from your job due to rabbit-related failures.

Unlike jobs that use XFS or GFS2 rabbit file systems, jobs that use ephemeral Lustre rabbit file systems cannot tolerate failures during file system creation. If any rabbit fails to create its Lustre targets, the whole job will fail. However, ephemeral Lustre rabbit jobs can still tolerate failed mounts.

If this attribute is set, and a job proceeds through some rabbit failures, the nodes that are missing file systems will be drained when the allocation is granted. The nodes may still be undrained and used at the user's discretion.
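For scripts that assemble Flux command lines, the tolerance can be validated before submission. The sketch below is an illustration under stated assumptions: both function names are hypothetical, and the "worst case" arithmetic simply subtracts the tolerance from the requested node count, following the description above.

```python
def failure_tolerance_option(max_failed_nodes):
    """Return the --setattr option for dw_failure_tolerance.

    Rejects anything other than a positive integer.
    """
    if isinstance(max_failed_nodes, bool) or not isinstance(max_failed_nodes, int):
        raise TypeError("dw_failure_tolerance must be an integer")
    if max_failed_nodes <= 0:
        raise ValueError("dw_failure_tolerance must be positive")
    return f"--setattr=dw_failure_tolerance={max_failed_nodes}"


def worst_case_nodes(requested_nodes, max_failed_nodes):
    """Fewest nodes left with working rabbit storage if the job still runs."""
    return requested_nodes - max_failed_nodes


print(failure_tolerance_option(16))
print(worst_case_nodes(10000, 16))
```

For the ten-thousand-node job in the example, a tolerance of 16 means the job could start with as few as 9,984 nodes that have working rabbit file systems.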

Fetching Rabbit Information

The flux-getrabbit(1) command can be used to look up the rabbits used by a job, as well as which rabbits have PCIe links to which compute nodes and vice versa.

For example, to list the rabbits used by a job:

$ flux getrabbit -j $JOBID
rabbit[1001,1003]
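The output uses a compact hostlist notation. If a script needs the individual rabbit names, a simplified expansion can be sketched as below. `expand_hostlist` is a hypothetical helper that handles only a single prefix with one bracket group of numbers and ranges; Flux's own hostlist handling is more general and should be preferred where it is available.

```python
import re


def expand_hostlist(hostlist):
    """Expand a simple hostlist such as 'rabbit[1001,1003-1005]' into names.

    Handles one optional bracket group with comma-separated numbers or ranges.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", hostlist)
    if match is None:
        return [hostlist]  # a single host with no brackets
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        high = high or low
        width = len(low)  # preserve zero padding, e.g. 007
        for number in range(int(low), int(high) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


print(expand_hostlist("rabbit[1001,1003]"))
```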

Additional Attributes of Rabbit Jobs

All rabbit jobs have some extra data stored on them to help with debugging and to help account for time spent on various stages.

Timing Attributes

The timing attributes a rabbit job may have are, in order:

rabbit_proposal_timing: time it takes for DWS to process the job's #DW strings and provide a breakdown of the resources required to Flux.
rabbit_setup_timing: time it takes to create the job's file systems on the rabbits chosen by Flux.
rabbit_datain_timing: time it takes to move data from Lustre to the rabbits. If no copy_in directives were provided, this state should be very fast.
rabbit_prerun_timing: time it takes to mount rabbit file systems on compute nodes.
rabbit_postrun_timing: time it takes to unmount rabbit file systems from compute nodes.
rabbit_dataout_timing: time it takes to move data from the rabbits to Lustre. If no copy_out directives were provided, this state should be very fast.
rabbit_teardown_timing: time it takes to destroy the rabbit file system and clean up.

A job may skip to teardown if an exception occurs. For example, a job may have only proposal, setup, datain, and teardown timings if the rabbit file systems fail to mount on the compute nodes. To fetch the timing for a state, for example prerun, run

flux job info ${jobid} rabbit_prerun_timing

If the job does not have the timing for a state, for instance because it has not completed the state yet, expect to see an error like flux-job: No such file or directory.
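To gather every available timing at once, the lookup can be wrapped in a loop that skips states the job has not recorded. The following is a sketch under assumptions: it assumes that flux job info exits with a nonzero status when the key is missing, it treats each value as an opaque string because the value format is not described here, and the job ID and values in the demonstration are invented placeholders.

```python
import subprocess

# State names in the order listed above.
STATES = ["proposal", "setup", "datain", "prerun", "postrun", "dataout", "teardown"]


def flux_job_info(jobid, key):
    """Run `flux job info`; return its output, or None if the command fails."""
    result = subprocess.run(
        ["flux", "job", "info", str(jobid), key], capture_output=True, text=True
    )
    return result.stdout.strip() if result.returncode == 0 else None


def collect_timings(jobid, fetch=flux_job_info):
    """Return {state: raw value} for every timing attribute the job has."""
    timings = {}
    for state in STATES:
        value = fetch(jobid, f"rabbit_{state}_timing")
        if value is not None:
            timings[state] = value
    return timings


# Stand-in fetcher for demonstration; the values are invented placeholders.
fake = {"rabbit_proposal_timing": "1.5", "rabbit_setup_timing": "20.0"}
print(collect_timings("f123", fetch=lambda jobid, key: fake.get(key)))
```

On a system with Flux installed, calling `collect_timings(jobid)` without the `fetch` argument would use the real command.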

Debugging Attributes

All rabbit jobs also have a rabbit_workflow attribute that stores high-level but technical information about the status of the rabbit job. Fetch the data (which is in JSON format) with flux job info ${jobid} rabbit_workflow, potentially piping it to jq in order to pretty-print it.

It may be useful to check whether there is an error message set on the workflow, which can be singled out with

flux job info ${jobid} rabbit_workflow | jq .status.message

If that is still unhelpful, try displaying more information:

flux job info ${jobid} rabbit_workflow | jq .status
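The same lookup can be done in Python when jq is not available or when the message feeds into further scripting. This is a minimal sketch: `workflow_message` is a hypothetical helper, and the sample JSON is invented for illustration; real rabbit_workflow data contains many more fields.

```python
import json


def workflow_message(workflow_json):
    """Return .status.message from rabbit_workflow JSON, or None if absent or empty."""
    workflow = json.loads(workflow_json)
    status = workflow.get("status") or {}
    return status.get("message") or None


# Invented sample input; only the status.message path matters here.
sample = '{"status": {"message": "example error text"}}'
print(workflow_message(sample))
```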

In addition, rabbit jobs may have an attribute storing a small collection of information about data movement. Fetch it with

flux job info ${jobid} rabbit_datamovements | jq .

Container Attributes

If a job launched a rabbit container with a #DW container directive, once the job is complete it will have an additional attribute rabbit_container_log storing the tail of the logs of one of the containers. Unfortunately, due to size limitations of Flux's KVS, the complete logs cannot be stored.

flux job info ${jobid} rabbit_container_log | less

Node Distribution

El Cap Rabbit systems have one rabbit per chassis. The --coral2-chassis flag to flux-batch(1), flux-alloc(1), flux-run(1), and flux-submit(1) may therefore be useful in controlling the allocation of rabbits, especially for node-local file systems like XFS.

The flag is only available on systems with the flux-coral2 package installed. On any such system, run flux alloc --help=coral2-chassis for documentation.