CORAL2: Flux on Cray Shasta¶
LLNL, LBNL, and ORNL systems such as Tioga, Perlmutter, El Capitan, and Frontier all use the HPE Cray Shasta platform, which requires an additional component, flux-coral2, to integrate completely with Flux.
Note
Flux on CORAL2 is under active development. This document assumes flux-core >= 0.49.0, flux-sched >= 0.27.0, and flux-coral2 >= 0.4.1.
Getting Flux¶
At LLNL, Flux is part of the operating system and runs as the native resource manager on Cray Shasta systems. At other sites, Flux can be launched as a parallel job by the native resource manager, if desired.
If the minimum versions of the Flux components are not already available at your site, consider building flux-core and flux-sched manually, then building flux-coral2 with the same prefix.
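A minimal build sketch appears below. The prefix, parallelism, and configure invocations are illustrative placeholders (flux-sched in particular may use a different build system), so consult each project's own build documentation for the authoritative steps.
# Sketch only: install flux-core, flux-sched, and flux-coral2 under one prefix.
# The prefix and build commands are placeholders; see each project's docs.
export PREFIX=$HOME/software/flux
export PKG_CONFIG_PATH=$PREFIX/lib/pkgconfig:$PKG_CONFIG_PATH   # let later builds find flux-core

(cd flux-core   && ./configure --prefix=$PREFIX && make -j 8 && make install)
(cd flux-sched  && ./configure --prefix=$PREFIX && make -j 8 && make install)  # or the project's CMake equivalent
(cd flux-coral2 && ./configure --prefix=$PREFIX && make -j 8 && make install)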
Cray MPICH¶
Cray ships a variant of MPICH as the supported MPI library for the Slingshot interconnect. There are two options for running parallel programs compiled with Cray MPICH under Flux.
One is to set LD_LIBRARY_PATH so that the MPI executable finds Flux's libpmi2.so before any other, e.g.
$ flux run --env=LD_LIBRARY_PATH=/usr/lib64/flux:$LD_LIBRARY_PATH -n2 -opmi=simple ./hello
foWQCPHAu6f: completed MPI_Init in 1.050s. There are 2 tasks
foWQCPHAu6f: completed first barrier in 0.005s
foWQCPHAu6f: completed MPI_Finalize in 0.001s
The other is to use Cray PMI, which requires the cray-pals plugin from the flux-coral2 package, e.g.
$ flux run -n2 -opmi=cray-pals ./hello
foP9Jyw5kjq: completed MPI_Init in 0.051s. There are 2 tasks
foP9Jyw5kjq: completed first barrier in 0.006s
foP9Jyw5kjq: completed MPI_Finalize in 0.002s
Cray PMI comes with additional complications - see below.
Sites that want to make cray-pals available by default, so users don't have to specify -opmi=cray-pals, may add the following lines near the top of the flux shell's initrc.lua, before any call to load plugins:
if shell.options['pmi'] == nil then
shell.options['pmi'] = 'cray-pals,simple'
end
This alters the system default of pmi=simple, applies to all Flux instances, and has no effect if the user specifies -opmi= on the command line. Note that simple is still the preferred way to bootstrap Flux itself, so it should be retained in the default.
Cray PMI Complications with Flux¶
Cray PMI requires Flux to allocate two unique network port numbers to each
multi-node job and communicate them via PMI_CONTROL_PORT
in the job
environment. The Cray PMI library uses these ports to establish temporary
network connections to exchange interconnect endpoints. Two jobs sharing
a node must not be allocated overlapping port numbers, or the jobs may fail.
flux-coral2 supplies the cray_pals_port_distributor
plugin to allocate
a unique pair of ports per job. However, each Flux instance has an
independent allocator for the same port range, so a complication arises
when multiple Flux instances are sharing a node’s port space. Therefore,
when Cray PMI is in use, Flux instances must not share nodes.
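To check that the port distributor is active, one option (a sketch, assuming the flux-coral2 shell plugins are installed) is to print PMI_CONTROL_PORT from the environment of a multi-node job:
$ flux run -N2 -n2 -o pmi=cray-pals --label-io printenv PMI_CONTROL_PORT
An empty result suggests the plugin is not loaded or is not allocating ports for the job.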
Batch jobs are fully independent Flux instances, so to minimize the possibility of their handing out duplicate ports, it is recommended to configure node-exclusive scheduling for the top-level resource manager on these systems. This leaves no opportunity for conflicting port numbers to be assigned among the top-level batch jobs. It does not, however, protect against batch jobs launching Flux sub-instances that conflict.
The port allocator defaults to using a pool of 1000 ports. This places an upper limit of 500 on the number of concurrently executing multi-node jobs per Flux instance. The system limit is much higher since each batch job is an independent Flux instance that can run many jobs. Also, single node jobs do not consume a port pair and are not subject to this limit.
Troubleshooting Cray MPICH¶
If Flux jobs that use Cray MPICH end up as a collection of singletons, or fail in MPI_Init(), that is usually a sign that something is wrong in the PMI bootstrapping environment. When this happens it may be useful to:
- Add -o pmi=NAME[,NAME,...] to control which PMI implementations are offered by the flux shell to jobs (e.g. simple, cray-pals, pmix).
- Add -o verbose=2 to request the shell to print tracing info from the PMI implementations.
- Launch flux-pmi as a parallel program to test PMI in isolation, e.g.
$ flux run -n2 --label-io flux pmi -v --method=libpmi2 barrier
1: libpmi2: using /opt/cray/pe/lib64/libpmi2.so (cray quirks enabled)
0: libpmi2: using /opt/cray/pe/lib64/libpmi2.so (cray quirks enabled)
1: libpmi2: initialize: rank=1 size=2 name=kvs_348520130306638848: success
0: libpmi2: initialize: rank=0 size=2 name=kvs_348520130306638848: success
0: fovUPqZ5dwM: completed pmi barrier on 2 tasks in 0.000s.
1: libpmi2: barrier: success
0: libpmi2: barrier: success
1: libpmi2: barrier: success
0: libpmi2: barrier: success
1: libpmi2: finalize: success
0: libpmi2: finalize: success
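The shell options above can also be combined when reproducing a failure. For example (a sketch, with ./hello standing in for the affected MPI program), the following requests PMI tracing for a two-node run bootstrapped with Cray PMI:
$ flux run -N2 -n2 -o pmi=cray-pals -o verbose=2 ./hello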