CORAL: Flux launched by IBM LSF

The LLNL and ORNL CORAL systems Lassen, Sierra, and Summit are pre-exascale supercomputers built by IBM. They run a specialized software stack that requires additional components to integrate properly with Flux. These components are provided as Lmod modules on all three systems.

To set up your environment to use these modules on the LLNL systems Lassen and Sierra, run:

module use /usr/tce/modulefiles/Core # if not already in use
module use /usr/global/tools/flux/blueos_3_ppc64le_ib/modulefiles

If you are using the ORNL system Summit, run:

module use /sw/summit/modulefiles/ums/gen007flux/linux-rhel8-ppc64le/Core
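
On either system, you can confirm that the Flux modules are now visible with Lmod's listing (exact module names and versions vary by system):

module avail flux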

Launching Flux

You can load the latest Flux-team managed installation on LLNL and ORNL CORAL machines using:

module load flux
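
To verify that the module loaded correctly, you can check the reported version:

flux version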

Note

If you are using an installation of Flux that is not provided by the Flux team and that is configured without --enable-pmix-bootstrap (e.g., a Spack-installed Flux), launching it on CORAL systems requires a shim layer to provide PMI on top of the PMIx interface provided by the CORAL system launcher jsrun. To load this shim alongside your own Flux installation, run module load pmi-shim.

We also suggest that you launch Flux using jsrun with the following arguments:

jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n ${NUM_NODES} --bind=none --smpiargs="-disable_gpu_hooks" flux start

The ${NUM_NODES} variable is the number of nodes that you want to launch the Flux instance across. The remaining arguments ensure that all on-node resources are available to Flux for scheduling.
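
For example, a two-node launch that runs flux resource list as the instance's initial program might look like this (the instance shuts down when its initial program exits):

NUM_NODES=2
jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n ${NUM_NODES} --bind=none --smpiargs="-disable_gpu_hooks" flux start flux resource list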

Note

If you are using the pmi-shim module mentioned above, you will need to set PMIX_MCA_gds="^ds12,ds21" in your environment before calling jsrun. The PMIX_MCA_gds environment variable works around a bug in OpenPMIx that causes a hang when using the PMI compatibility shim.
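
For example, export the variable before launching Flux:

export PMIX_MCA_gds="^ds12,ds21"
jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n ${NUM_NODES} --bind=none --smpiargs="-disable_gpu_hooks" flux start

The variable can also be set inline on the jsrun command line, as in the batch script example later on this page.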

Launching Spectrum MPI within Flux

If you want to run MPI applications compiled with Spectrum MPI under Flux, one additional step is required: you must enable Flux’s Spectrum MPI plugin when running the binary. From the CLI, this looks like:

flux run -o mpi=spectrum my_mpi_binary

From the Python API, this looks like:

#!/usr/bin/env python3

import os
import flux
from flux import job

# Connect to the enclosing Flux instance
fh = flux.Flux()

# Build a jobspec for the MPI binary and inherit the current environment
jobspec = job.JobspecV1.from_command(['my_mpi_binary'])
jobspec.environment = dict(os.environ)
# Enable Flux's Spectrum MPI plugin for this job
jobspec.setattr_shell_option('mpi', 'spectrum')

jobid = job.submit(fh, jobspec)
print(jobid)
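
If your script must block until the job finishes, one approach is to mark the job waitable at submission and collect its result with job.wait. This sketch continues the script above and assumes a flux-core release whose Python bindings provide the waitable keyword and job.wait:

# Submit as waitable so the result can be collected with job.wait()
jobid = job.submit(fh, jobspec, waitable=True)
result = job.wait(fh, jobid)
if result.success:
    print(f"{result.jobid} completed successfully")
else:
    print(f"{result.jobid} failed: {result.errstr}")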

Scheduling GPUs

On all systems, Flux relies on hwloc to auto-detect the on-node resources available for scheduling. The hwloc that Flux is linked against must be configured with --enable-cuda for Flux to be able to detect NVIDIA GPUs.

The LLNL and ORNL CORAL flux modules automatically load an hwloc built against a system-provided CUDA.

On any system, you can test whether the hwloc that Flux is linked against is CUDA-enabled by running:

$ flux start flux resource list
      STATE NNODES   NCORES    NGPUS
       free      1       40        4
  allocated      0        0        0
       down      0        0        0

If the number of free GPUs is 0, then the hwloc that Flux is linked against is not CUDA-enabled.

In addition, refer to flux-run(1) or flux-submit(1) to run or submit an MPI job with a specific CPU/GPU set and affinity using their shell options. For example, to run a job with 4 MPI tasks, each bound to 10 CPU cores and 1 GPU, on a single compute node:

flux run -N 1 -n 4 -c 10 -g 1 -o mpi=spectrum -o cpu-affinity=per-task -o gpu-affinity=per-task my_mpi_binary
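
The same request can be expressed through the Python API shown earlier. The following is a sketch that assumes the resource keywords of JobspecV1.from_command (num_tasks, num_nodes, cores_per_task, gpus_per_task) provided by recent flux-core Python bindings; my_mpi_binary is again a placeholder:

#!/usr/bin/env python3

import os
import flux
from flux import job

fh = flux.Flux()

# 4 tasks on 1 node, each with 10 cores and 1 GPU
jobspec = job.JobspecV1.from_command(
    ['my_mpi_binary'], num_tasks=4, num_nodes=1,
    cores_per_task=10, gpus_per_task=1)
jobspec.environment = dict(os.environ)
jobspec.setattr_shell_option('mpi', 'spectrum')
# Bind each task to its own cores and GPU
jobspec.setattr_shell_option('cpu-affinity', 'per-task')
jobspec.setattr_shell_option('gpu-affinity', 'per-task')

jobid = job.submit(fh, jobspec)
print(jobid)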

Launching Flux Interactively

CORAL systems use IBM’s Spectrum LSF scheduler which, unlike Slurm, does not support running jsrun commands in a pseudo-terminal. This limits users’ ability to run Flux sessions interactively.

A workaround is to submit a script that runs Flux as a batch job for some amount of time, and then to connect to that Flux session remotely by resolving its URI. Below is an example script that creates a two-node Flux session in the debug queue, which will run for two hours.

#!/bin/bash
#BSUB -q pdebug
#BSUB -W 120
#BSUB -nnodes 2
#BSUB -J fluxsesh

module use /usr/tce/modulefiles/Core
module use /usr/global/tools/flux/blueos_3_ppc64le_ib/modulefiles
module load pmi-shim

PMIX_MCA_gds="^ds12,ds21" jsrun -a 1 -c ALL_CPUS -g ALL_GPUS -n 2 --bind=none --smpiargs="-disable_gpu_hooks" flux start sleep inf

When this script is submitted, the system will print out a job ID. You can check the status of the job with bjobs:

[elvis@lassen709:~]$ bsub < flux_session_submit.sh
Job <3750480> is submitted to queue <pdebug>.
...
[elvis@lassen709:~]$ bjobs
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
3750480    elvis   RUN   pdebug     lassen709   1*launch_ho fluxsesh   Jul 25 12:44
                                                80*debug_hosts

Once the job starts to run, you can resolve the URI and connect to the session remotely.

[elvis@lassen709:~]$ flux uri --remote lsf:3750480
ssh://lassen32/var/tmp/flux-aXQh0w/local-0

[elvis@lassen709:~]$ flux proxy ssh://lassen32/var/tmp/flux-aXQh0w/local-0
[elvis@lassen709:~]$ flux resource list
     STATE NNODES   NCORES    NGPUS NODELIST
      free      2       80        0 lassen[32,34]
 allocated      0        0        0
      down      0        0        0
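
Once attached via flux proxy, you can interact with the session exactly as if you were inside it. For example, a quick sanity check (the hostnames reflect this allocation):

[elvis@lassen709:~]$ flux run -N 2 -n 2 hostname
lassen32
lassen34

Exiting the proxy’s shell detaches from the session without terminating it; the Flux instance keeps running until the LSF job ends.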

Note

In order to connect to remote sessions via flux proxy, SSH keys must be configured. To set up SSH keys on Livermore Computing resources, see this Confluence article (login required).

