CORAL2: Flux on Cray Shasta

LLNL, LBNL, and ORNL systems such as Tioga, Perlmutter, El Capitan, and Frontier all use the HPE Cray Shasta platform, which requires an additional component, the flux-coral2 package, to integrate completely with Flux.

Note

Flux on CORAL2 is under active development. This document assumes flux-core >= 0.49.0, flux-sched >= 0.27.0, and flux-coral2 >= 0.4.1.

Getting Flux

At LLNL, Flux is part of the operating system and runs as the native resource manager on Cray Shasta systems. At other sites, Flux can be launched as a parallel job by the native resource manager, if desired.

If the minimum versions of the Flux components are not already available at your site, consider building flux-core and flux-sched manually, then building flux-coral2 with the same prefix.
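
For example, the following sketch builds all three projects from source under a single prefix. The prefix /opt/flux and the autotools-style configure invocations are assumptions for illustration; check each project’s README for the build system and options of the release you are using.

$ export PKG_CONFIG_PATH=/opt/flux/lib/pkgconfig:$PKG_CONFIG_PATH
$ (cd flux-core && ./configure --prefix=/opt/flux && make -j 8 install)
$ (cd flux-sched && ./configure --prefix=/opt/flux && make -j 8 install)
$ (cd flux-coral2 && ./configure --prefix=/opt/flux && make -j 8 install)

Setting PKG_CONFIG_PATH first lets the flux-sched and flux-coral2 builds locate the flux-core installation under the shared prefix.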

Cray MPICH

Cray ships a variant of MPICH as the supported MPI library for the Slingshot interconnect. There are two options for running parallel programs compiled with Cray MPICH under Flux.

One is to set LD_LIBRARY_PATH so that the MPI executable finds Flux’s libpmi2.so before any other, e.g.

$ flux run --env=LD_LIBRARY_PATH=/usr/lib64/flux:$LD_LIBRARY_PATH -n2 -opmi=simple ./hello
foWQCPHAu6f: completed MPI_Init in 1.050s.  There are 2 tasks
foWQCPHAu6f: completed first barrier in 0.005s
foWQCPHAu6f: completed MPI_Finalize in 0.001s

The other is to use Cray PMI, which requires the cray-pals plugin from the flux-coral2 package, e.g.

$ flux run -n2 -opmi=cray-pals ./hello
foP9Jyw5kjq: completed MPI_Init in 0.051s.  There are 2 tasks
foP9Jyw5kjq: completed first barrier in 0.006s
foP9Jyw5kjq: completed MPI_Finalize in 0.002s

Cray PMI comes with additional complications; see below.

Sites that want to make cray-pals available by default, so users don’t have to specify -opmi=cray-pals, may add the following lines near the top of the flux shell’s initrc.lua, before any call to load plugins:

if shell.options['pmi'] == nil then
    -- the user did not specify -opmi; offer cray-pals first, then simple
    shell.options['pmi'] = 'cray-pals,simple'
end

This alters the system default of pmi=simple, applies to all Flux instances, and has no effect if the user specifies -opmi= on the command line. Note that simple is still the preferred way to bootstrap Flux itself, so it should be retained in the default list.

Cray PMI Complications with Flux

Cray PMI requires Flux to allocate two unique network port numbers to each multi-node job and communicate them via PMI_CONTROL_PORT in the job environment. The Cray PMI library uses these ports to establish temporary network connections to exchange interconnect endpoints. Two jobs sharing a node must not be allocated overlapping port numbers, or the jobs may fail.
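
For example, the variable can be inspected from inside a multi-node job. The port values and the formatting of the output below are illustrative, not authoritative:

$ flux run -N2 -n2 -opmi=cray-pals printenv PMI_CONTROL_PORT
11999,11998
11999,11998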

flux-coral2 supplies the cray_pals_port_distributor plugin to allocate a unique pair of ports per job. However, each Flux instance has an independent allocator for the same port range, so a complication arises when multiple Flux instances share a node’s port space. Therefore, when Cray PMI is in use, Flux instances must not share nodes.

Batch jobs are fully independent Flux instances, so to minimize the possibility of their handing out duplicate ports, it is recommended to configure node-exclusive scheduling for the top-level resource manager on these systems. This leaves no opportunity for conflicting port numbers to be assigned among the top-level batch jobs. It doesn’t protect against batch jobs scheduling Flux sub-instances that conflict, however.
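
For example, with the Fluxion scheduler, node-exclusive allocation can be selected through the match policy. This is a sketch assuming a Fluxion-based system instance; consult the Flux Administrator’s Guide for the policy appropriate to your site:

[sched-fluxion-resource]
match-policy = "hinodex"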

The port allocator defaults to using a pool of 1000 ports. This places an upper limit of 500 on the number of concurrently executing multi-node jobs per Flux instance. The system limit is much higher since each batch job is an independent Flux instance that can run many jobs. Also, single node jobs do not consume a port pair and are not subject to this limit.

Troubleshooting Cray MPICH

If Flux jobs that use Cray MPICH end up as a collection of singletons, or fail in MPI_Init(), that is usually a sign that something is wrong in the PMI bootstrapping environment. When this happens it may be useful to:

  • Add -o pmi=NAME[,NAME,...] to control which PMI implementations the flux shell offers to jobs (e.g. simple, cray-pals, pmix).

  • Add -o verbose=2 to request the shell to print tracing info from the PMI implementations.

  • Launch flux-pmi as a parallel program to test PMI in isolation.

$ flux run -n2 --label-io flux pmi -v --method=libpmi2 barrier
1: libpmi2: using /opt/cray/pe/lib64/libpmi2.so (cray quirks enabled)
0: libpmi2: using /opt/cray/pe/lib64/libpmi2.so (cray quirks enabled)
1: libpmi2: initialize: rank=1 size=2 name=kvs_348520130306638848: success
0: libpmi2: initialize: rank=0 size=2 name=kvs_348520130306638848: success
0: fovUPqZ5dwM: completed pmi barrier on 2 tasks in 0.000s.
1: libpmi2: barrier: success
0: libpmi2: barrier: success
1: libpmi2: barrier: success
0: libpmi2: barrier: success
1: libpmi2: finalize: success
0: libpmi2: finalize: success

Configuring Flux with Rabbits

For a Flux system instance to allocate rabbit storage, the dws-jobtap.so plugin must be loaded. The plugin can be loaded in a config file like so:

[job-manager]
plugins = [
  { load = "dws-jobtap.so" }
]
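
The plugin can also be loaded into a running instance with flux-jobtap(1), which may be convenient for testing before committing the configuration change (this assumes the plugin is installed in the jobtap plugin search path):

$ flux jobtap load dws-jobtap.so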

Also, the flux-coral2-dws systemd service must be started on the same node as the rank 0 broker of the system instance (i.e. the management node). The flux user must have a kubeconfig file in its home directory granting it read and write access to the Kubernetes cluster.
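
For example, assuming the flux-coral2 package installed a systemd unit named flux-coral2-dws, the service can be enabled and started with:

$ sudo systemctl enable --now flux-coral2-dws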
