Slingshot Interconnect
CORAL-2 systems use the HPE Cray Slingshot Ethernet-compliant interconnect. Slingshot consists of Rosetta switches and Cassini NICs connected in a dragonfly topology.
The HPE-provided Slingshot software includes the Slingshot Fabric Manager and Slingshot Host Software products. The Slingshot Host Software is open source and includes:
| Component | Description |
|---|---|
| kfabric and kfi_cxi provider | The kernel API 1 for Slingshot. Example user: Lustre kfilnd 2. |
| cxi_ss1 | The kernel device driver 3 for the NIC. |
| libcxi | The user space API 4 for the NIC device driver. |
| libfabric and fi_cxi provider | The user space API 5 for Slingshot. Example user: Cray MPICH. |
Although Flux was made capable of launching Cray MPICH applications that use the interconnect early on without any specific Slingshot support, advanced capabilities such as VNI tagging, NIC resource management, and hardware collective offload require specialized Slingshot support from Flux, described here.
VNI Tagging
VNI tagging is a mechanism for isolating application RDMA traffic, enforced at the NIC level. A VNI tag is an integer value stamped on each RDMA message. A user may only send or receive messages with a given VNI tag if the local NIC driver has allocated a CXI service (a privileged operation) that grants that user access to that VNI. The user presents the CXI service id to the driver when requesting direct access to hardware resources for RDMA.
VNI tagging support in Flux is implemented as a two-step process. First, the system allocates a set of unique VNIs to the job. Then a CXI service is created on each NIC that authorizes the job to send and receive messages using the reserved VNIs.
Note
When Flux VNI tagging is not enabled, applications fall back to a default CXI service that grants them access to two default VNIs (1 and 10). Use of these shared VNIs makes applications vulnerable to message injection attacks from other users. Now that Flux supports VNI tagging, the default CXI service should be disabled as recommended by HPE 8.
VNI Reservation
VNI numbers range from 0 to 65535 with 1 and 10 reserved as defaults. Flux may be configured to use any range, typically 1024-65535.
Flux reserves unique VNIs (typically one) for each job at the Flux system instance level when it enters RUN state. Since only the Flux system instance can spawn jobs as other users, allocating each system instance job a unique VNI effectively isolates users from each other. Jobs launched in Flux sub-instances share the parent job's VNI and have no isolation from each other.
Note
In contrast, Slurm 7 allocates a block of VNIs to each job and isolates job steps from one another. This confers no security advantage but does allow NIC resources to be managed among steps. Implementing this in Flux is possible but not currently planned due to the extra challenges presented by the lack of coupling within the Flux instance hierarchy and lack of privileged capability in Flux sub-instances.
When a job enters CLEANUP state, all CXI service allocations for the job's reserved VNIs are destroyed and the job's VNI reservation is released and may be reused.
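The reservation lifecycle can be illustrated with a minimal Python sketch. It assumes a pool that hands out VNIs round-robin and skips any that are still reserved. The class name, method names, and job identifiers are hypothetical and do not reflect the actual plugin code.

```python
class VniPool:
    """Hypothetical round-robin pool of VNI numbers (illustration only)."""

    def __init__(self, first=1024, last=65535):
        self.first = first
        self.last = last
        self.next = first      # next candidate VNI, advanced on every attempt
        self.in_use = set()    # VNIs that are currently reserved
        self.reserved = {}     # jobid -> list of reserved VNIs

    def reserve(self, jobid, count=1):
        """Reserve count unique VNIs for jobid, e.g. when it enters RUN."""
        if jobid in self.reserved:
            return self.reserved[jobid]
        size = self.last - self.first + 1
        if count > size - len(self.in_use):
            raise RuntimeError("VNI pool exhausted")
        vnis = []
        while len(vnis) < count:
            vni = self.next
            # Advance the cursor, wrapping at the end of the configured range.
            self.next = self.first if vni == self.last else vni + 1
            if vni not in self.in_use:
                self.in_use.add(vni)
                vnis.append(vni)
        self.reserved[jobid] = vnis
        return vnis

    def release(self, jobid):
        """Release the reservation, e.g. when the job enters CLEANUP."""
        for vni in self.reserved.pop(jobid, []):
            self.in_use.discard(vni)


pool = VniPool(1024, 1027)
print(pool.reserve("job-a"))   # first VNI in the range
print(pool.reserve("job-b"))
pool.release("job-a")
print(pool.reserve("job-c"))   # continues forward instead of reusing job-a's VNI
```

Because the cursor only moves forward, a released VNI is not handed out again until the allocator has cycled through the rest of the range.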
CXI Service Allocation
When a job in the Flux system instance enters RUN state, the VNI reservation is retrieved and CXI services are allocated on each node, one for each NIC. The CXI service authorizes only the job owner to use the job's reserved VNI. The job shell then passes the CXI service information to libfabric 6 via these environment variables:
- SLINGSHOT_VNIS
Comma-separated list of VNI numbers the job can use.
- SLINGSHOT_DEVICES
Comma-separated list of local NICs the job can use. Flux always assigns all available NICs. Note that since nodes may have different numbers of operational NICs, this environment variable may have different values on different nodes of the job.
- SLINGSHOT_SVC_IDS
Comma-separated list of CXI service IDs the job can use, corresponding to the SLINGSHOT_DEVICES list. Note that since CXI services are allocated through the local NIC, this environment variable may have different values on different nodes of the job.
Example:
SLINGSHOT_VNIS=4034
SLINGSHOT_DEVICES=cxi0,cxi1,cxi2,cxi3
SLINGSHOT_SVC_IDS=11,11,12,11
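As an illustration of how the three variables relate, the following Python sketch parses them into a list of VNIs and a mapping from NIC device to CXI service ID. The function name is hypothetical. libfabric performs its own parsing, so this is only a reading aid.

```python
def parse_slingshot_env(env):
    """Return (vnis, {device: service_id}) from a Slingshot job environment."""

    def split(name):
        # Treat an unset or empty variable as an empty list.
        return [item for item in env.get(name, "").split(",") if item]

    vnis = [int(v) for v in split("SLINGSHOT_VNIS")]
    devices = split("SLINGSHOT_DEVICES")
    svc_ids = [int(s) for s in split("SLINGSHOT_SVC_IDS")]
    # The service ID list is positional: entry i belongs to device i.
    if len(devices) != len(svc_ids):
        raise ValueError("SLINGSHOT_DEVICES and SLINGSHOT_SVC_IDS differ in length")
    return vnis, dict(zip(devices, svc_ids))


example = {
    "SLINGSHOT_VNIS": "4034",
    "SLINGSHOT_DEVICES": "cxi0,cxi1,cxi2,cxi3",
    "SLINGSHOT_SVC_IDS": "11,11,12,11",
}
vnis, services = parse_slingshot_env(example)
print(vnis)      # [4034]
print(services)  # {'cxi0': 11, 'cxi1': 11, 'cxi2': 12, 'cxi3': 11}
```

In a running job the same function could be called with `os.environ`.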
When the job is a Flux instance, these environment variables are captured on each node so that the sub-instance can pass them through to its jobs, and so on if there are more Flux instance levels.
When the system instance job enters CLEANUP state, all CXI services that were created for the job are destroyed.
Exceptional Conditions
Rarely, CXI service destruction may need to be retried for up to several minutes while the NIC attempts to complete network operations on behalf of the CXI service user. Rather than delay the job from completing CLEANUP state and releasing its resources, Flux times out the initial destruction quickly and retries in housekeeping, after the job has entered INACTIVE state. The implementation must prevent these VNIs from being reused before destruction is successful.
Failure to reserve VNIs causes a fatal job exception to be raised. For jobs that do not require Slingshot, VNI reservation can be disabled as a job submission option.
Failure to allocate a CXI service for a reservation causes a fatal job exception to be raised.
Failure to destroy a lingering CXI service in housekeeping drains the node.
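The quick-timeout-then-retry policy can be sketched in Python as follows. The `destroy` callable stands in for whatever operation removes a CXI service and is hypothetical, as are the function name and parameters. The real logic lives in the Flux epilog and housekeeping tooling.

```python
import time


def destroy_with_retry(destroy, timeout, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call destroy() until it returns True or timeout seconds elapse.

    Returns True on success and False if the deadline passed first.
    """
    deadline = clock() + timeout
    while True:
        if destroy():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Hypothetical NIC that stays busy for the first two attempts.
attempts = iter([False, False, True])
busy_nic = lambda: next(attempts)

# Epilog: a zero timeout means one attempt, so the job is not held up.
if not destroy_with_retry(busy_nic, timeout=0):
    # Housekeeping: keep retrying for longer; on failure the node is drained.
    ok = destroy_with_retry(busy_nic, timeout=300, interval=0)
    print("destroyed" if ok else "drain node")
```

Passing `clock` and `sleep` as parameters keeps the sketch testable without real delays.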
Instance Restart
When a Flux system instance restarts, jobs may continue to use VNIs that were allocated before the restart. The pool allocation state is persisted in the KVS across Flux restarts.
Running under Slurm
Inherited VNI reservations and CXI services work the same in a Flux sub-instance, regardless of whether it was launched by Flux or Slurm. VNI tagging should thus work the same on El Capitan, which runs only Flux, and on Frontier, when Flux is used as a portable workflow layer under Slurm.
Traffic Classes
Slingshot users can request that messages use a quality of service profile
or traffic class. For example, Cray MPICH users can use
MPI_Info_set() on the traffic_class key to assign one to an MPI
communicator. The available Slingshot traffic classes are described
by Kandalla et al. 9 as follows:
- TC_BEST_EFFORT
The Best Effort traffic class is the default shared traffic class and provides each application a "fair share" of networking resources within the same class.
- TC_LOW_LATENCY
The Low Latency traffic class is best suited for applications that are vulnerable to the performance of small message collective operations. Such latency sensitive operations are given a higher priority in the network and this allows applications to benefit from lower latency and potentially lower jitter due to variability in network round trip times. However this traffic class is also associated with a specific bandwidth cap.
- TC_DEDICATED_ACCESS
The Dedicated Access traffic class allows network packets issued by the communications library to benefit from a guaranteed bandwidth allocation. This traffic class is ideally used for highly specialized users and very high priority jobs that run on production systems.
- TC_BULK_DATA
The Bulk Data traffic class allows for the system fabric to isolate I/O traffic from every other type of traffic in the fabric.
The list of traffic classes allowed by the CXI service determines
whether a user request would be honored by the Cassini device driver.
Currently, Flux allows TC_BEST_EFFORT and TC_LOW_LATENCY.
This is reflected in the job environment:
- SLINGSHOT_TCS
Bitmask of allowed traffic classes. The bit encoding is DEDICATED_ACCESS(1), LOW_LATENCY(2), BULK_DATA(4), BEST_EFFORT(8). This environment variable is interpreted by Cray MPICH.
Example:
SLINGSHOT_TCS=0x0a
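The example value can be checked against the bit encoding with a short Python sketch. The enum and function names are hypothetical. Only the bit values come from the text above.

```python
from enum import IntFlag


class TrafficClass(IntFlag):
    """Bit encoding of SLINGSHOT_TCS as listed above."""
    DEDICATED_ACCESS = 1
    LOW_LATENCY = 2
    BULK_DATA = 4
    BEST_EFFORT = 8


def parse_tcs(value):
    """Decode a SLINGSHOT_TCS string; base 0 accepts decimal or 0x-prefixed hex."""
    return TrafficClass(int(value, 0))


allowed = parse_tcs("0x0a")
print(TrafficClass.LOW_LATENCY in allowed)   # True
print(TrafficClass.BEST_EFFORT in allowed)   # True
print(TrafficClass.BULK_DATA in allowed)     # False

# Encoding the two classes Flux currently allows gives the example value back.
mask = TrafficClass.BEST_EFFORT | TrafficClass.LOW_LATENCY
print(format(int(mask), "#04x"))             # 0x0a
```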
NIC Resource Management
Some Cassini NIC resources can be managed using CXI services, so that each user sharing the NIC can be guaranteed a minimum quantity needed to make progress and is prevented from starving out other users.
Each resource can be assigned a reserved and a maximum quantity in the CXI service. A user of a CXI service is guaranteed to be able to obtain the reserved quantity of a resource, but cannot exceed the maximum quantity. HPE recommends the following values for each job, with the maximum quantity fixed and the reserved quantity scaled by the expected number of task ranks within the job that will share the CXI service on the node. Since the Flux system instance that creates the CXI service doesn't know how many task ranks will be launched on the node by Flux sub-instances, it uses ncores, the number of allocated cores, to calculate the reserved quantities instead.
| Resource | Description | Reserved | Maximum |
|---|---|---|---|
| TXQs | Transmit command queues | 2*ncores | 2048 |
| TGQs | Target command queues | 1*ncores | 1024 |
| EQs | Event queues | 2*ncores | 2047 |
| CTs | Counters | 1*ncores | 2047 |
| TLEs | Trigger list entries | 1*ncores | 1*ncores (special case) |
| PTEs | Portal table entries | 6*ncores | 2048 |
| LEs | List entries | 16*ncores | 16384 |
| ACs | Addressing contexts | 2*ncores | 1022 |
When Flux creates the CXI service, if insufficient NIC resources are available to fulfill the above quantities, the request is scaled back to fit what is available and a warning message is printed.
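The table and the scale-back rule can be sketched in Python as follows. The function name and the `available` argument are hypothetical. The sketch also makes two assumptions that the text does not state: the reserved quantity is capped at the maximum, and scaling back means capping the reserved quantity at what the NIC reports as available.

```python
# (reserved per core, maximum) from the table above. A maximum of None marks
# the TLE special case, where the maximum equals the reserved quantity.
LIMITS = {
    "txqs": (2, 2048),
    "tgqs": (1, 1024),
    "eqs": (2, 2047),
    "cts": (1, 2047),
    "tles": (1, None),
    "ptes": (6, 2048),
    "les": (16, 16384),
    "acs": (2, 1022),
}


def cxi_limits(ncores, available=None):
    """Return {resource: (reserved, maximum)} for a job with ncores cores."""
    available = available or {}
    result = {}
    for name, (per_core, maximum) in LIMITS.items():
        reserved = per_core * ncores
        if maximum is not None:
            # Assumption: a reservation never exceeds the fixed maximum.
            reserved = min(reserved, maximum)
        if name in available:
            # Assumption: scale back to whatever the NIC still has free.
            reserved = min(reserved, available[name])
        if maximum is None:
            maximum = reserved
        result[name] = (reserved, maximum)
    return result


limits = cxi_limits(96)
print(limits["les"])    # (1536, 16384)
print(limits["tles"])   # (96, 96)
print(cxi_limits(96, available={"les": 1000})["les"])   # (1000, 16384)
```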
As noted above, a potential issue arises from Flux not subdividing CXI services for jobs run in Flux sub-instances, such as batch jobs. Although the batch job is constrained to its NIC resource allocation, jobs within it competing for local NIC resources have no protection from each other.
Hardware Collective Offload
Warning
As of Jan 2026, the hardware collective offload design discussed below is not yet implemented.
Slingshot implements hardware collective offload for barrier, broadcast (small payload), reduce, and allreduce MPI operations that may benefit large applications. Enabling them requires Flux and the user's application to interact with the Slingshot fabric manager to reserve multicast addresses and instantiate multicast trees to fit each job.
Multicast address reservations for eligible jobs are allocated and released (only) by the Flux system instance through fabric manager requests.
Multicast trees are instantiated and destroyed within a reservation by libfabric-enabled applications, using environment variables set by Flux, communicating directly with the fabric manager. These applications may run at any Flux instance level.
Multicast Address Reservation
The Flux system instance leader broker logs in to the fabric manager at
startup using credentials that are only available to the flux system user.
It then makes multicast address reservations for each eligible job that
enters the RUN state according to the system instance configuration.
Configurable parameters include:
- The number of multicast addresses to reserve for each job (independent of job size).
- The minimum job size required for automatic reservation. If not set, users must explicitly request hardware collectives for their job.
Note
HPE recommends 10 that the number of multicast addresses per job be calculated as follows. If \(M\) is the total available addresses for hardware collectives, \(S\) is the system size, \(s\) is the minimum job size, and \(j\) is the number of jobs expected to be sharing nodes, then the number of addresses per job is \((M / (S / s)) / j\).
Using \(M = 4086\), \(j = 1\), and \(s = 64\), a system the size of El Capitan with \(S = 11136\) would reserve \((4086 / (11136 / 64)) / 1 = 23\) multicast addresses per job.
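The same arithmetic can be written as a small Python function. The function and parameter names are hypothetical. Rounding down to a whole number of addresses is an assumption that matches the worked example, where 23.48 becomes 23.

```python
import math


def mcast_addrs_per_job(total_addrs, system_size, min_job_size, jobs_per_node=1):
    """Evaluate (M / (S / s)) / j and round down to whole addresses."""
    max_jobs = system_size / min_job_size          # S / s
    return math.floor((total_addrs / max_jobs) / jobs_per_node)


# El Capitan-sized example: M = 4086, S = 11136, s = 64, j = 1
print(mcast_addrs_per_job(4086, 11136, 64, 1))    # 23
```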
The fabric manager returns a job sessionToken for each reservation
that allows the bearer to connect to the fabric manager and create or destroy
multicast trees within the job's reservation. The job sessionToken
becomes part of the address reservation and is set in the job's environment
for use by libfabric and Flux sub-instances. Note that specific multicast
addresses are not part of the reservation.
When the job enters the CLEANUP state, the Flux system instance requests that the fabric manager destroy any remaining multicast trees and release the address reservation.
Multicast Tree Instantiation
Multicast trees are instantiated by libfabric using the following information set in the environment:
- FI_CXI_HWCOLL_MIN_NODES
The configured minimum job size.
- FI_CXI_HWCOLL_ADDRS_PER_JOB
The configured number of multicast addresses allocated to each job.
- FI_CXI_COLL_JOB_ID
The jobID string associated with the multicast address reservation. The reservation is inherited from the enclosing Flux instance and may not refer to the current job.
- FI_CXI_COLL_MCAST_TOKEN
The sessionToken string associated with the multicast address reservation.
- FI_CXI_COLL_FABRIC_MGR_URL
The fully qualified URL of the fabric manager.
- FI_CXI_COLL_JOB_STEP_ID
A string identifier associated with the current job, that is unique within the multicast address reservation. For example, the job id path 11 of the current job.
When the job is a Flux instance, all environment variables but the last are captured so they can be passed through to its jobs, and so on if there are more Flux levels.
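The pass-through rule can be sketched in Python as follows. Since this design is not yet implemented, the function name and all values shown are placeholders. Only the variable names come from the list above.

```python
# Variables that describe the inherited reservation and are passed through
# unchanged at every Flux instance level.
INHERITED = (
    "FI_CXI_HWCOLL_MIN_NODES",
    "FI_CXI_HWCOLL_ADDRS_PER_JOB",
    "FI_CXI_COLL_JOB_ID",
    "FI_CXI_COLL_MCAST_TOKEN",
    "FI_CXI_COLL_FABRIC_MGR_URL",
)


def hwcoll_env(captured, job_step_id):
    """Build the collective-offload environment for one job.

    captured is the environment saved when the enclosing instance started;
    job_step_id identifies the current job within the reservation.
    """
    env = {name: captured[name] for name in INHERITED if name in captured}
    if env:
        # Only the step ID is set afresh for each job.
        env["FI_CXI_COLL_JOB_STEP_ID"] = job_step_id
    return env


captured = {
    "FI_CXI_HWCOLL_MIN_NODES": "64",
    "FI_CXI_HWCOLL_ADDRS_PER_JOB": "23",
    "FI_CXI_COLL_JOB_ID": "placeholder-job-id",
    "FI_CXI_COLL_MCAST_TOKEN": "placeholder-token",
    "FI_CXI_COLL_FABRIC_MGR_URL": "https://fabric-manager.example",
    "FI_CXI_COLL_JOB_STEP_ID": "step-id-of-enclosing-job",
}
print(hwcoll_env(captured, "step-id-of-current-job"))
```

With no captured reservation the function returns an empty dictionary, which corresponds to a job that falls back to unassisted collectives.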
Multicast Tree Cleanup
Although the libfabric-enabled application instantiates multicast trees and destroys them on exit, cleanup can be missed if the application aborts. Multicast trees that are left behind will be cleaned up by the Flux system instance when the reservation is released, but until then, other sub-instance jobs may be unable to instantiate multicast trees if the reservation is used up by aborted jobs.
To resolve this, when a job enters CLEANUP state at any Flux instance level,
Flux connects to the fabric manager using the sessionToken and
deletes all multicast addresses within the reservation that are associated
with the job identifier that was used for FI_CXI_COLL_JOB_STEP_ID.
Exception Handling
If the system instance leader broker's connection to the fabric manager is interrupted, fabric manager operations are paused while the system instance reconnects.
Since jobs can trivially fall back to the unassisted collectives implementation, reservation requests to the fabric manager that take too long may be timed out quickly and treated as a non-fatal error by the job.
Requests by the system instance leader broker to the fabric manager to release reservations for jobs in CLEANUP state execute asynchronously so the job's transition to INACTIVE is not delayed by a slow fabric manager. If the fabric manager connection is lost, on reconnect, any reservations for INACTIVE jobs are discovered and released.
Requests by Flux sub-instances to the fabric manager to clean up multicast trees
using the sessionToken are also asynchronous, under timeout, and
treated as non-fatal to the job.
Flux Instance Restart
Upon restart, the Flux system instance reloads reservation state from the KVS that was saved at shutdown. It then re-connects to the fabric manager. If an active reservation has disappeared from the fabric manager, a fatal job exception is raised. Any reservations for INACTIVE jobs are discovered and released.
Running under Slurm
Inherited multicast address reservations and multicast tree cleanup
using the sessionToken work the same in a Flux sub-instance,
regardless of whether it was launched by Flux or Slurm.
Implementation
Phase I: Interfacing with the NIC
The first phase of implementation covers VNI tagging, traffic classes, and NIC resource management. Several Flux components work together to implement this phase:
- jobtap plugin
The cray-slingshot jobtap plugin is loaded only in the Flux system instance. It manages a configurable pool of VNI numbers and creates VNI reservations for jobs when they enter the RUN state. Reservations are posted to the job eventlog as a cray-slingshot event, e.g.
{ "timestamp": 1751386927.8443174, "name": "cray-slingshot", "context": { "reservation": { "vnis": [ 1024 ] } } }
Jobs that do not use Slingshot can specify the cray-slingshot=off shell option to suppress the reservation. Jobs that want more than one VNI may use cray-slingshot.vnicount=N to request up to four, which is the maximum that may be associated with one CXI service.
The plugin releases reservations when the job enters CLEANUP state. To keep the initial implementation simple, yet make it unlikely that a VNI could be reused before CXI services are cleaned up, VNI numbers are allocated round-robin from the pool.
When the instance restarts, the jobtap plugin recovers the state of the VNI pool as the job eventlogs are replayed.
- prolog
CXI service allocation is a root-only operation. In the Flux system instance, the job prolog invokes flux-slingshot prolog on each allocated node, which retrieves the reservation from the job eventlog and allocates CXI services on each Slingshot NIC. The CXI services restrict access to the job owner, VNIs to the reserved VNI numbers, traffic classes to TC_BEST_EFFORT and TC_LOW_LATENCY, and resources to the HPE recommended values scaled by ncores.
- shell plugin
The cray-slingshot shell plugin is responsible for setting up the environment to enable libfabric-enabled applications to use Slingshot. There are three modes:
In the Flux system instance, it fetches the reservation from the job eventlog, then finds matching CXI services on each Slingshot NIC.
In a Flux sub-instance, it asks the broker on the local node for the Slingshot environment variables to pass along to the job.
If there is no reservation and no inheritable environment, it clears the Slingshot environment so that libfabric-enabled applications will try to use the default CXI service.
- epilog
CXI service destruction is a root-only operation. In the Flux system instance, the job epilog invokes flux-slingshot epilog on each allocated node, which retrieves the reservation from the job eventlog and destroys matching CXI services on each Slingshot NIC. Failure to remove a matching CXI service at this phase results in a warning but does not cause the job to fail.
- housekeeping
Housekeeping again invokes flux-slingshot epilog, but with an option to retry CXI service destruction for a longer period of time. If CXI services cannot be destroyed, the node is drained.
In addition to cleaning up after the current job, housekeeping may invoke flux-slingshot clean --all to remove all user CXI services from the node. This may be useful as an extra precaution on node-scheduled systems like El Capitan, in case CXI services were left on the NIC by another job that Flux was unable to clean up, for example if Flux was not running.
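The prolog, shell plugin, and epilog all start by retrieving the reservation from the job eventlog. As a rough illustration, the Python sketch below extracts the VNI list from eventlog text, assuming one JSON object per line and the cray-slingshot event layout shown under the jobtap plugin. The function name is hypothetical and the second event in the sample is included only as filler.

```python
import json


def reservation_vnis(eventlog_text):
    """Return the reserved VNIs from a job eventlog, or None if there are none."""
    for line in eventlog_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("name") == "cray-slingshot":
            return event["context"]["reservation"]["vnis"]
    return None


sample = "\n".join([
    '{"timestamp": 1751386927.8443174, "name": "cray-slingshot", '
    '"context": {"reservation": {"vnis": [1024]}}}',
    '{"timestamp": 1751386927.9, "name": "start"}',
])
print(reservation_vnis(sample))   # [1024]
```

A job submitted with the reservation suppressed has no such event, in which case the function returns None.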
Example 1
An MPI program is run directly in the Flux system instance:
$ flux run -N2 ./mpi-hello
The jobtap plugin on rank 0 sees the transition of the job to RUN state and posts a VNI reservation to the job eventlog before the start event.
The prolog script calls flux-slingshot prolog, which fetches the reservation and creates matching CXI services on each Slingshot NIC.
The job shell plugin fetches the reservation, then queries each Slingshot NIC for matching CXI services. The job tasks are launched with the appropriate Slingshot environment.
The job transitions to CLEANUP state.
The epilog script calls flux-slingshot epilog, which fetches the reservation and destroys matching CXI services on each Slingshot NIC. Any retries are deferred to housekeeping so the job's resources can be returned to the scheduler.
In parallel with the previous step, the jobtap plugin on rank 0 releases the reservation.
The housekeeping script calls flux-slingshot epilog --timeout=5m, which fetches the reservation and destroys matching CXI services on each Slingshot NIC. Recalcitrant CXI services cause the node to be drained after the timeout expires.
Example 2
An MPI program is run in a batch sub-instance:
$ flux batch -N2 --wrap flux run -N2 ./mpi-hello
The sequence in Example 1 is followed, except the "job" is a Flux sub-instance. Within the sub-instance the following occurs:
At startup, the broker module on each rank reads the Slingshot environment variables from the local broker environment and creates an inherited reservation that will be re-used for all jobs.
That's it. The broker module acts independently on each rank using only the information that it received from the local environment. There is no prolog or housekeeping.
Phase II: Interfacing with the Fabric Manager
The second phase of implementation enables hardware collective offload, building on the infrastructure created in Phase I.
TODO
References
- 1
- 2
- 3
- 4
- 5
- 6
- 7
Slurm Slingshot Plugin Design, internal HPE document, received May 2025.
- 8
Section 6.3, HPE Slingshot Operations Guide 2.1.3 S-9000, Aug 2024
- 9
Section III.B, Designing the HPE Cray Message Passing Toolkit Software Stack for HPE Cray EX Supercomputers, Kandalla et al., CUG, 2023
- 10
Slurm Slingshot Collectives Design, internal HPE document, received May 2025.
- 11
flux-core github issue #6876: need unique identifier for jobs run at any level on a system