Flux CORAL-2 Administration
This supplements the Flux Administrator's Guide with specifics for CORAL-2 systems.
Background
The CORAL-2 systems at Livermore are running a variant of the TOSS operating system based on Red Hat Enterprise Linux rather than the SuSE based distribution normally provided by HPE.
Installing Software Packages
Besides the base required packages, install the following.
- flux-coral2
Plugins for running Cray MPICH, managing the slingshot interconnect, and managing rabbit storage
Overlay Network Configuration
Experience siting El Capitan yields these recommendations:
The system instance should use the management ethernet for communication between Flux brokers, while user instances may use the Slingshot network.
The system overlay network should be configured with a flat topology.
A small amount of tuning helps performance at this scale and overlay fanout.
The following configuration snippet summarizes the above:
[bootstrap]
curve_cert = "/etc/flux/system/curve.cert"
default_port = 8050
default_bind = "tcp://en0:%p"
default_connect = "tcp://e%h:%p"
hosts = [
{ host = "elcap1", bind = "tcp://192.168.64.1:%p", connect = "tcp://eelcap1:%p" },
{ host = "elcap[201-896,1001-12136]" },
]
[tbon]
torpid_max = "5m"
tcp_user_timeout = "2m"
zmq_io_threads = 4
child_rcvhwm = 10
Enabling Cray MPI
For convenience, the flux-shell-cray-pals(1) plugin should be loaded
in all Flux instances. Edit /etc/flux/shell/initrc.lua
to contain:
if shell.options['pmi'] == nil then
shell.options['pmi'] = 'cray-pals,simple'
end
The cray_pals_port_distributor.so
jobtap plugin, required by the above,
is loaded automatically via /etc/flux/rc1.d/01-coral2-rc
.
Rabbits
To configure Flux with rabbits, see Configuring Flux with Rabbits.
Stuck Rabbit Jobs
Rabbit jobs may sometimes become stuck in the CLEANUP
state, while they
wait for kubernetes to report that rabbit file systems have unmounted and
cleaned up.
In flux-coral2
version 0.22.0
and greater, Flux can be configured to
end the epilog after a timeout (see Configuring Flux with Rabbits). To remove the
epilog manually without waiting for the timeout, run
flux job raise --type=dws-epilog-timeout $JOBID
.
If the rabbit job is still stuck in the dws-epilog
action, or if the version
of flux-coral2
is less than 0.22.0
,
# see what nodes still have mounts, if any, and potentially drain them
kubectl get clientmounts -A -l "dataworkflowservices.github.io/workflow.name=
fluxjob-$(flux job id $JOBID)" | grep Mounted
# see what rabbits still have allocations, if any, and potentially disable
# them.
kubectl get servers -A -l "dataworkflowservices.github.io/workflow.name=
fluxjob-$(flux job id $JOBID)" -o json | jq .status.allocationSets
# remove the epilog action
flux post-job-event $JOBID epilog-finish name=dws-epilog
The above assumes you have read access to certain kubernetes resources. On LC
machines, the administrator kubeconfig is usually kept at
/etc/kubernetes/admin.conf
. To use it,
export KUBECONFIG=/etc/kubernetes/admin.conf
.