.. _slingshot_interconnect:

######################
Slingshot Interconnect
######################

:term:`CORAL-2` systems use the HPE Cray :term:`Slingshot` Ethernet-compliant
interconnect.  Slingshot consists of :term:`Rosetta` switches and
:term:`Cassini` NICs connected in a :term:`dragonfly` topology.

The HPE-provided Slingshot software includes the *Slingshot Fabric Manager*
and *Slingshot Host Software* products.  The *Slingshot Host Software* is
open source and includes:

.. list-table::
   :header-rows: 0

   * - kfabric and kfi_cxi provider
     - The kernel API [#kfabric]_ for Slingshot.  Example user: Lustre
       kfilnd [#horn2023]_.

   * - cxi_ss1
     - The kernel device driver [#cxi-driver]_ for the NIC.

   * - libcxi
     - The user space API [#libcxi]_ for the NIC device driver.

   * - libfabric and fi_cxi provider
     - The user space API [#libfabric]_ for Slingshot.  Example user:
       Cray :term:`MPICH`.

Flux could launch Cray MPICH applications that use the interconnect early on,
without any specific Slingshot support.  However, advanced capabilities such
as :term:`VNI` tagging, NIC resource management, and hardware collective
offload require specialized Slingshot support from Flux, described here.

***********
VNI Tagging
***********

VNI tagging is a mechanism for isolating application RDMA traffic, enforced
at the NIC level.  A VNI tag is an integer value stamped on each RDMA message.
A user may only send or receive messages with a given VNI tag if the local
NIC driver has allocated a :term:`CXI service` (a privileged operation) that
grants that user access to that VNI.  The user presents the CXI service ID to
the driver when requesting direct access to hardware resources for RDMA.

VNI tagging support in Flux is implemented as a two-step process.  First, the
system allocates a set of unique VNIs to the job.  Then a CXI service is
created on each NIC that authorizes the job to send and receive messages
using the reserved VNIs.

.. note::
   When Flux VNI tagging is not enabled, applications fall back to a default
   CXI service that grants them access to two default VNIs (1 and 10).  Use
   of these shared VNIs makes applications vulnerable to message injection
   attacks from other users.  Now that Flux supports VNI tagging, the default
   CXI service should be disabled as recommended by HPE [#ssops2024]_.

VNI Reservation
===============

VNI numbers range from 0 to 65535 with 1 and 10 reserved as defaults.
Flux may be configured to use any range, typically 1024-65535.

Flux reserves unique VNIs (typically one) for each job at the Flux system
instance level when the job enters RUN state.  Since only the Flux system
instance can spawn jobs as other users, allocating each system instance job
a unique VNI effectively isolates users from each other.  Jobs launched in
Flux sub-instances share the parent job's VNI and have no isolation from
each other.

.. note::
   In contrast, Slurm [#slurmplug]_ allocates a block of VNIs to each job
   and isolates job steps from one another.  This confers no security
   advantage but does allow NIC resources to be managed among steps.
   Implementing this in Flux is possible but not currently planned due to the
   extra challenges presented by the lack of coupling within the Flux instance
   hierarchy and lack of privileged capability in Flux sub-instances.

When a job enters CLEANUP state, all CXI service allocations for the job's
reserved VNIs are destroyed and the job's VNI reservation is released and
may be reused.

CXI Service Allocation
======================

When a job in the Flux system instance enters RUN state, the VNI reservation
is retrieved and CXI services are allocated on each node, one for each NIC.
The CXI service authorizes only the job owner to use the job's reserved VNI.

The job shell then passes the CXI service information to libfabric [#fi_cxi]_
via these environment variables:

.. envvar:: SLINGSHOT_VNIS

   Comma-separated list of VNI numbers the job can use.

.. envvar:: SLINGSHOT_DEVICES

   Comma-separated list of local NICs the job can use.  Flux always assigns
   all available NICs.  Note that since nodes may have different numbers of
   operational NICs, this environment variable may have different values on
   different nodes of the job.

.. envvar:: SLINGSHOT_SVC_IDS

   Comma-separated list of CXI service IDs the job can use, corresponding to
   the :envvar:`SLINGSHOT_DEVICES` list.  Note that since CXI services are
   allocated through the local NIC, this environment variable may have
   different values on different nodes of the job.

Example::

   SLINGSHOT_VNIS=4034
   SLINGSHOT_DEVICES=cxi0,cxi1,cxi2,cxi3
   SLINGSHOT_SVC_IDS=11,11,12,11

When the job is a Flux instance, these environment variables are captured
on each node so that the sub-instance can pass them through to its jobs,
and so on if there are more Flux instance levels.

When the system instance job enters CLEANUP state, all CXI services that
were created for the job are destroyed.

Exceptional Conditions
======================

Rarely, CXI service destruction may need to be retried for up to several
minutes while the NIC attempts to complete network operations on behalf of
the CXI service user.  Rather than delay the job from completing CLEANUP
state and releasing its resources, Flux times out the initial destruction
quickly and retries in housekeeping, after the job has entered INACTIVE
state.  The implementation must prevent these VNIs from being reused before
destruction is successful.

Failure to reserve a VNI causes a fatal job exception to be raised.
For jobs that do not require Slingshot, VNI reservation can be disabled
as a job submission option.

Failure to allocate a CXI service for a reservation causes a fatal job
exception to be raised.

Failure to destroy a lingering CXI service in housekeeping drains the node.

Instance Restart
================

When a Flux system instance restarts, jobs may continue to use VNIs that
were allocated before the restart.  The pool allocation state is persisted
in the KVS across Flux restarts.

Running under Slurm
===================

Inherited VNI reservations and CXI services work the same in a Flux
sub-instance, regardless of whether it was launched by Flux or Slurm.
VNI tagging should thus work the same on *El Capitan*, which runs only Flux,
and *Summit*, when Flux is used as a portable workflow layer under Slurm.

***************
Traffic Classes
***************

Slingshot users can request that messages use a quality of service profile
or :term:`traffic class`.  For example, Cray MPICH users can use
:func:`MPI_Info_set` on the ``traffic_class`` key to assign one to an
MPI communicator.

The available Slingshot traffic classes are described by Kandalla et al.
[#kandalla2023]_ as follows:

TC_BEST_EFFORT
   The Best Effort traffic class is the default shared traffic class and
   provides each application a "fair share" of networking resources within
   the same class.

TC_LOW_LATENCY
   The Low Latency traffic class is best suited for applications that are
   vulnerable to the performance of small message collective operations.
   Such latency sensitive operations are given a higher priority in the
   network and this allows applications to benefit from lower latency and
   potentially lower jitter due to variability in network round trip times.
   However this traffic class is also associated with a specific bandwidth
   cap.

TC_DEDICATED_ACCESS
   The Dedicated Access traffic class allows network packets issued by the
   communications library to benefit from a guaranteed bandwidth allocation.
   This traffic class is ideally used for highly specialized users and very
   high priority jobs that run on production systems.

TC_BULK_DATA
   The Bulk Data traffic class allows for the system fabric to isolate I/O
   traffic from every other type of traffic in the fabric.

The list of traffic classes allowed by the CXI service determines whether a
user request would be honored by the Cassini device driver.  Currently, Flux
allows :const:`TC_BEST_EFFORT` and :const:`TC_LOW_LATENCY`.  This is
reflected in the job environment:

.. envvar:: SLINGSHOT_TCS

   Bitmask of allowed traffic classes.  The bit encoding is
   :const:`DEDICATED_ACCESS` (1), :const:`LOW_LATENCY` (2),
   :const:`BULK_DATA` (4), :const:`BEST_EFFORT` (8).  This environment
   variable is interpreted by Cray MPICH.

Example::

   SLINGSHOT_TCS=0x0a

***********************
NIC Resource Management
***********************

Some Cassini NIC resources can be managed using CXI services, so that each
user sharing the NIC can be guaranteed a minimum quantity needed to make
progress and is prevented from starving out other users.

Each resource can be assigned a *reserved* and a *maximum* quantity in the
CXI service.  A user of a CXI service is guaranteed to be able to obtain the
*reserved* quantity of a resource, but cannot exceed the *maximum* quantity.

HPE recommends the following values for each job, with the maximum quantity
fixed and the reserved quantity scaled by the expected number of task ranks
within the job that will share the CXI service on the node.  Since the Flux
system instance that creates the CXI service doesn't know how many task
ranks will be launched on the node by Flux sub-instances, it uses *ncores*,
the number of allocated cores, to calculate the reserved quantities instead.

.. list-table::
   :header-rows: 1

   * - Resource
     - Description
     - Reserved
     - Maximum

   * - TXQs
     - Transmit command queues
     - 2*ncores
     - 2048

   * - TGQs
     - Target command queues
     - 1*ncores
     - 1024

   * - EQs
     - Event queues
     - 2*ncores
     - 2047

   * - CTs
     - Counters
     - 1*ncores
     - 2047

   * - TLEs
     - Trigger list entries
     - 1*ncores
     - 1*ncores (special case)

   * - PTEs
     - Portal table entries
     - 6*ncores
     - 2048

   * - LEs
     - List entries
     - 16*ncores
     - 16384

   * - ACs
     - Addressing contexts
     - 2*ncores
     - 1022

When Flux creates the CXI service, if insufficient NIC resources are
available to fulfill the above quantities, the request is scaled back to
fit what is available and a warning message is printed.

As noted above, a potential issue arises from Flux not subdividing CXI
services for jobs run in Flux sub-instances, such as batch jobs.  Although
the batch job is constrained to its NIC resource allocation, jobs within it
competing for local NIC resources have no protection from each other.

***************************
Hardware Collective Offload
***************************

.. warning::
   As of Jan 2026, the hardware collective offload design discussed below
   is not yet implemented.

Slingshot implements hardware collective offload for *barrier*, *broadcast*
(small payload), *reduce*, and *allreduce* MPI operations that may benefit
large applications.  Enabling them requires Flux and the user's application
to interact with the Slingshot :term:`fabric manager` to reserve multicast
addresses and instantiate multicast trees to fit each job.

Multicast address reservations for eligible jobs are allocated and released
(only) by the Flux system instance through fabric manager requests.

Multicast trees are instantiated and destroyed within a reservation by
libfabric-enabled applications, using environment variables set by Flux,
communicating directly with the fabric manager.  These applications may
run at any Flux instance level.

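
As a rough illustration of the reserved and maximum quantities tabulated
under NIC Resource Management, the following Python sketch computes the
per-resource limits that would be requested for a given *ncores*.  The names
``RESOURCE_LIMITS`` and ``cxi_limits`` are hypothetical and are not part of
Flux.  The sketch only reproduces the arithmetic in the table and does not
model the scale-back that occurs when the NIC has insufficient free
resources.

.. code:: python

   # Per-core multiplier and fixed maximum for each Cassini resource,
   # copied from the table in the NIC Resource Management section.
   # A maximum of None marks the TLE special case (maximum == reserved).
   RESOURCE_LIMITS = {
       "TXQs": (2, 2048),
       "TGQs": (1, 1024),
       "EQs": (2, 2047),
       "CTs": (1, 2047),
       "TLEs": (1, None),
       "PTEs": (6, 2048),
       "LEs": (16, 16384),
       "ACs": (2, 1022),
   }


   def cxi_limits(ncores):
       """Return {resource: {"res": reserved, "max": maximum}} for ncores.

       Hypothetical helper: reserved = multiplier * ncores, maximum is the
       fixed table value, except for TLEs where it equals the reserved value.
       """
       limits = {}
       for name, (per_core, maximum) in RESOURCE_LIMITS.items():
           reserved = per_core * ncores
           if maximum is None:
               maximum = reserved
           limits[name] = {"res": reserved, "max": maximum}
       return limits


   if __name__ == "__main__":
       for name, quantity in cxi_limits(96).items():
           print(name, quantity["res"], quantity["max"])

For a node with 96 allocated cores, this yields 192 reserved TXQs against a
maximum of 2048, and 96 TLEs for both the reserved and maximum quantity.
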
Multicast Address Reservation
=============================

The Flux system instance leader broker logs in to the fabric manager at
startup using credentials that are only available to the ``flux`` system
user.  It then makes multicast address reservations for each eligible job
that enters the RUN state according to the system instance configuration.
Configurable parameters include:

- The number of multicast addresses to reserve for each job (it is not
  dependent on job size).

- The minimum job size required for automatic reservation.  If not set,
  users must explicitly request to enable hardware collectives for their job.

.. note::
   HPE recommends [#slurmcoll]_ that the number of multicast addresses per
   job be calculated as follows.  If :math:`M` is the total available
   addresses for hardware collectives, :math:`S` is the system size,
   :math:`s` is the minimum job size, and :math:`j` is the number of jobs
   expected to be sharing nodes, then the number of addresses per job is
   :math:`(M / (S / s)) / j`.

   Using :math:`M = 4086`, :math:`j = 1`, and :math:`s = 64`, a system the
   size of *El Capitan* with :math:`S = 11136` gives
   :math:`(4086 / (11136 / 64)) / 1 = 4086 / 174 \approx 23.5`, which rounds
   down to 23 multicast addresses per job.

The fabric manager returns a job :class:`sessionToken` for each reservation
that allows the bearer to connect to the fabric manager and create or destroy
multicast trees within the job's reservation.  The job :class:`sessionToken`
becomes part of the address reservation and is set in the job's environment
for use by libfabric and Flux sub-instances.  Note that specific multicast
addresses are not part of the reservation.

When the job enters the CLEANUP state, the Flux system instance requests that
the fabric manager destroy any remaining multicast trees and release the
address reservation.

Multicast Tree Instantiation
============================

Multicast trees are instantiated by libfabric using the following information
set in the environment:

.. envvar:: FI_CXI_HWCOLL_MIN_NODES

   The configured minimum job size.

.. envvar:: FI_CXI_HWCOLL_ADDRS_PER_JOB

   The configured number of multicast addresses allocated to each job.

.. envvar:: FI_CXI_COLL_JOB_ID

   The :class:`jobID` *string* associated with the multicast address
   reservation.  The reservation is inherited from the enclosing Flux
   instance and may not refer to the current job.

.. envvar:: FI_CXI_COLL_MCAST_TOKEN

   The :class:`sessionToken` *string* associated with the multicast address
   reservation.

.. envvar:: FI_CXI_COLL_FABRIC_MGR_URL

   The fully qualified URL of the fabric manager.

.. envvar:: FI_CXI_COLL_JOB_STEP_ID

   A *string* identifier associated with the *current* job, that is unique
   within the multicast address reservation.  For example, the job id path
   [#jobidpath]_ of the current job.

When the job is a Flux instance, all environment variables but the last are
captured so they can be passed through to its jobs, and so on if there are
more Flux levels.

Multicast Tree Cleanup
======================

Although the libfabric-enabled application instantiates multicast trees and
destroys them on exit, cleanup can be missed if the application aborts.
Multicast trees that are left behind will be cleaned up by the Flux system
instance when the reservation is released, but until then, other sub-instance
jobs may be unable to instantiate multicast trees if the reservation is used
up by aborted jobs.

To resolve this, when a job enters CLEANUP state at *any* Flux instance
level, Flux connects to the fabric manager using the :class:`sessionToken`
and deletes all multicast addresses within the reservation that are
associated with the job identifier that was used for
:envvar:`FI_CXI_COLL_JOB_STEP_ID`.

Exception Handling
==================

If the system instance leader broker's connection to the fabric manager is
interrupted, fabric manager operations are paused while the system instance
reconnects.

Since jobs can trivially fall back to the unassisted collectives
implementation, reservation requests to the fabric manager that take too long
may be timed out quickly and treated as a non-fatal error by the job.

Requests by the system instance leader broker to the fabric manager to
release reservations for jobs in CLEANUP state execute asynchronously so the
job's transition to INACTIVE is not delayed by a slow fabric manager.  If the
fabric manager connection is lost, on reconnect, any reservations for
INACTIVE jobs are discovered and released.

Requests by Flux sub-instances to the fabric manager to release reservations
using the :class:`sessionToken` are also asynchronous, under timeout, and
treated as non-fatal to the job.

Flux Instance Restart
=====================

Upon restart, the Flux system instance reloads reservation state from the
KVS that was saved at shutdown.  It then re-connects to the fabric manager.
If an active reservation has disappeared from the fabric manager, a fatal
job exception is raised.  Any reservations for INACTIVE jobs are discovered
and released.

Running under Slurm
===================

Inherited multicast address reservations and multicast tree cleanup using
the :class:`sessionToken` work the same in a Flux sub-instance, regardless
of whether it was launched by Flux or Slurm.

**************
Implementation
**************

Phase I: Interfacing with the NIC
=================================

The first phase of implementation covers VNI tagging, traffic classes, and
NIC resource management.  Several Flux components work together to implement
this phase:

jobtap plugin
   The cray-slingshot jobtap plugin is loaded only in the Flux system
   instance.  It manages a configurable pool of VNI numbers and creates VNI
   reservations for jobs when they enter the RUN state.  Reservations are
   posted to the job eventlog as a ``cray-slingshot`` event, e.g.

   .. code:: json

      {
        "timestamp": 1751386927.8443174,
        "name": "cray-slingshot",
        "context": {
          "reservation": {
            "vnis": [
              1024
            ]
          }
        }
      }

   Jobs that do not use Slingshot can specify the ``cray-slingshot=off``
   shell option to suppress the reservation.  Jobs that want more than one
   VNI may use ``cray-slingshot.vnicount=N`` to request up to four, which is
   the maximum that may be associated with one CXI service.

   The plugin releases reservations when the job enters CLEANUP state.
   To keep the initial implementation simple, yet make it unlikely that a
   VNI could be reused before CXI services are cleaned up, VNI numbers are
   allocated round-robin from the pool.

   When the instance restarts, the jobtap plugin recovers the state of the
   VNI pool as the job eventlogs are replayed.

prolog
   CXI service allocation is a root-only operation.  In the Flux system
   instance, the job prolog invokes :option:`flux-slingshot prolog` on each
   allocated node, which retrieves the reservation from the job eventlog and
   allocates CXI services on each Slingshot NIC.  The CXI services restrict
   access to the job owner, VNIs to the reserved VNI numbers, traffic classes
   to :const:`TC_BEST_EFFORT` and :const:`TC_LOW_LATENCY`, and resources to
   the HPE recommended values scaled by *ncores*.

shell plugin
   The cray-slingshot shell plugin is responsible for setting up the
   environment to enable libfabric-enabled applications to use Slingshot.
   There are three modes:

   #. In the Flux system instance, it fetches the reservation from the job
      eventlog, then finds matching CXI services on each Slingshot NIC.

   #. In a Flux sub-instance, it asks the broker on the local node for the
      Slingshot environment variables to pass along to the job.

   #. If there is no reservation and no inheritable environment, it clears
      the Slingshot environment so that libfabric-enabled applications will
      try to use the default CXI service.

epilog
   CXI service destruction is a root-only operation.  In the Flux system
   instance, the job epilog invokes :option:`flux-slingshot epilog` on each
   allocated node, which retrieves the reservation from the job eventlog and
   destroys matching CXI services on each Slingshot NIC.  Failure to remove a
   matching CXI service at this phase results in a warning but does not cause
   the job to fail.

housekeeping
   Housekeeping again invokes :option:`flux-slingshot epilog`, but with an
   option to retry CXI service destruction for a longer period of time.
   If CXI services cannot be destroyed, the node is drained.

   In addition to cleaning up after the current job, housekeeping may invoke
   :option:`flux-slingshot clean --all` to remove all user CXI services from
   the node.  This may be useful as an extra precaution on node-scheduled
   systems like *El Capitan*, in case CXI services were left on the NIC by
   another job that Flux was unable to clean up, for example if Flux was not
   running.

Example 1
---------

An MPI program is run directly in the Flux system instance:

.. code::

   $ flux run -N2 ./mpi-hello

#. The jobtap plugin on rank 0 sees the transition of the job to RUN state
   and posts a VNI reservation to the job eventlog before the ``start``
   event.

#. The prolog script calls :option:`flux-slingshot prolog`, which fetches
   the reservation and creates matching CXI services on each Slingshot NIC.

#. The job shell plugin fetches the reservation, then queries each Slingshot
   NIC for matching CXI services.  The job tasks are launched with the
   appropriate Slingshot environment.

#. The job transitions to CLEANUP state.

#. The epilog script calls :option:`flux-slingshot epilog`, which fetches
   the reservation and destroys matching CXI services on each Slingshot NIC.
   Any retries are deferred to housekeeping so the job's resources can be
   returned to the scheduler.

#. In parallel with the previous step, the jobtap plugin on rank 0 releases
   the reservation.

#. The housekeeping script calls
   :option:`flux-slingshot epilog --timeout=5m`, which fetches the
   reservation and destroys matching CXI services on each Slingshot NIC.
   Recalcitrant CXI services cause the node to be drained after the timeout
   expires.

Example 2
---------

An MPI program is run in a batch sub-instance:

.. code::

   $ flux batch -N2 --wrap flux run -N2 ./mpi-hello

The sequence in Example 1 is followed, except the "job" is a Flux
sub-instance.  Within the sub-instance the following occurs:

#. At startup, the shell plugin on each rank reads the Slingshot environment
   variables from the local broker environment and creates an inherited
   reservation that will be re-used for all jobs.

That's it.  The broker module acts independently on each rank using only the
information that it received from the local environment.  There is no prolog
or housekeeping.

Phase II: Interfacing with the Fabric Manager
=============================================

The second phase of implementation enables hardware collective offload,
building on the infrastructure created in Phase I.

TODO

.. rubric:: References

.. [#kfabric] https://github.com/HewlettPackard/shs-kfabric

.. [#horn2023] Kfabric Lustre Network Driver, Horn et al., CUG, May 2023

.. [#cxi-driver] https://github.com/HewlettPackard/shs-cxi-driver

.. [#libcxi] https://github.com/HewlettPackard/shs-libcxi

.. [#libfabric] https://github.com/HewlettPackard/shs-libfabric

.. [#fi_cxi] fi_cxi(7) - The fi_cxi fabric provider

.. [#slurmplug] *Slurm Slingshot Plugin Design*, internal HPE document,
   received May 2025.

.. [#ssops2024] Section 6.3, HPE Slingshot Operations Guide 2.1.3 S-9000,
   Aug 2024

.. [#kandalla2023] Section III.B, Designing the HPE Cray Message Passing
   Toolkit Software Stack for HPE Cray EX Supercomputers, Kandalla et al.,
   CUG, 2023

.. [#slurmcoll] *Slurm Slingshot Collectives Design*, internal HPE document,
   received May 2025.

.. [#jobidpath] flux-core github issue #6876: need unique identifier for
   jobs run at any level on a system