.. github display
   GitHub is NOT the preferred viewer for this file. Please visit
   https://flux-framework.rtfd.io/projects/flux-rfc/en/latest/spec_49.html

49/TreePool Resource Set Extension
###################################

This specification defines the format of the scheduling key used by the
TreePool scheduler to encode sub-node topology in RFC 20 *R* version 1
objects.

.. list-table::
  :widths: 25 75

  * - **Name**
    - github.com/flux-framework/rfc/spec_49.rst
  * - **Editor**
    - Jim Garlick <garlick@llnl.gov>
  * - **State**
    - raw

Language
********

.. include:: common/language.rst

Related Standards
*****************

- :doc:`spec_14`
- :doc:`spec_20`
- :doc:`spec_22`
- :doc:`spec_40`
- :doc:`spec_46`

Background
**********

RFC 20 defines a scheduling key in *R* version 1 for scheduler-specific
extensions.  RFC 40 uses this key for the Fluxion graph scheduler.  This RFC
defines a complementary format for the TreePool scheduler.

The base Rv1 format is intentionally flat: it lists nodes and their resource
counts but carries no topology.  Fluxion (RFC 40) represents topology as an
arbitrary graph, enabling rich scheduling at the cost of complexity and large
*R* objects.  This format occupies the middle ground: it encodes topology as
trees rather than arbitrary graphs, which is sufficient for simple
locality-aware scheduling at the node level, while keeping *R* objects compact and
easy to generate.

This RFC extends the base Rv1 object with sub-node topology describing how
cores, GPUs, memory, and storage are grouped within a node (e.g. by NUMA
domain or socket), enabling affinity-aware allocation of co-located resources.
Super-node grouping (rack, chassis) is not covered by this specification.

Implementation
**************

.. describe:: scheduling

  (*dictionary*, OPTIONAL) The scheduling key as defined in RFC 20 SHALL
  contain the following keys when this specification is used:

  .. describe:: writer

    (*string*, REQUIRED) SHALL be set to "TreePool".  Schedulers use this
    value to select the TreePool pool class automatically.

  .. describe:: children

    (*array of object*, OPTIONAL) Sub-node topology.  Each object SHALL
    contain exactly two structural keys:

    .. describe:: ranks

      (*string*, REQUIRED) An RFC 22 idset string identifying the broker
      ranks that share the described node topology.

    .. describe:: topo

      (*object*, REQUIRED) The node topology object.  This is a recursive
      structure of locality domains.  Each locality domain is represented as
      a key whose value is an array containing one object per instance of
      that domain; each object MAY contain resource keys and nested locality
      domains.  Only levels meaningful to the topology need be included;
      levels that carry no useful grouping information MAY be omitted.

      Defined locality names within *topo*:

      .. describe:: socket

        A processor package.  On multi-socket systems this groups the cores
        and GPUs physically attached to one socket.

      .. describe:: numa

        A NUMA memory domain.

      Other locality names MAY be used for site-specific topologies.

      Locality names defined in *topo* MAY be used as resource vertex types
      in RFC 14 jobspecs.  When such a vertex is marked exclusive with no
      child resources, the TreePool scheduler SHALL allocate all resources
      within that topology domain; per-domain resource counts are determined
      from *topo* at scheduling time.

      The following resource keys MAY appear in a locality domain object or
      directly in *topo* (at node scope) to describe resources local to that
      domain:

      .. describe:: cores

        (*string*) An RFC 22 idset string of node-local core IDs.

      .. describe:: gpus

        (*string*) An RFC 22 idset string of node-local GPU IDs.

      .. describe:: memory

        (*integer*) GiB of memory local to this domain.

      .. describe:: storage

        (*array of object*) Storage local to this domain.  This key SHOULD
        appear only at node scope (directly in *topo*).  Each object SHALL
        contain:

        .. describe:: path

          (*string*, REQUIRED) Mount point or logical identifier for the
          storage.

        .. describe:: capacity

          (*number*, REQUIRED) Storage capacity, expressed in *unit*.

        .. describe:: unit

          (*string*, OPTIONAL) Unit string as defined in RFC 14
          (e.g. "GiB", "TiB").  Default is "GiB".

    Ranks with identical node topology MAY share a single *children* entry.
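
The recursive *topo* structure described above can be traversed with a short
routine.  The following is a minimal sketch, not part of this specification:
the names ``decode_idset`` and ``walk_topo`` are hypothetical, and the idset
decoder handles only plain comma-separated IDs and ranges.  It assumes that
any key other than the four resource keys is a locality name.

.. code-block:: python

   RESOURCE_KEYS = {"cores", "gpus", "memory", "storage"}


   def decode_idset(s):
       """Decode a simple idset such as "0-3,8" into a sorted list of ints.

       Simplified for illustration: only plain IDs and ranges are handled.
       """
       ids = set()
       for part in s.split(","):
           if not part:
               continue
           lo, _, hi = part.partition("-")
           ids.update(range(int(lo), int(hi or lo) + 1))
       return sorted(ids)


   def walk_topo(domain, path=()):
       """Yield (path, resources) for each domain that carries resource keys.

       path is a tuple of (locality_name, index) pairs; the empty tuple
       denotes node scope.
       """
       resources = {k: v for k, v in domain.items() if k in RESOURCE_KEYS}
       if resources:
           yield path, resources
       for name, instances in domain.items():
           if name in RESOURCE_KEYS:
               continue
           for index, child in enumerate(instances):
               yield from walk_topo(child, path + ((name, index),))


   # Example: two sockets plus node-scope storage.
   topo = {
       "storage": [{"path": "/l/ssd", "capacity": 200, "unit": "GiB"}],
       "socket": [
           {"cores": "0-23", "gpus": "0", "memory": 125},
           {"cores": "24-47", "gpus": "1", "memory": 126},
       ],
   }
   for path, resources in walk_topo(topo):
       ncores = len(decode_idset(resources.get("cores", "")))
       print(path, sorted(resources), ncores)

A traversal of this kind is one way per-domain resource counts could be
derived from *topo* when an exclusive locality vertex is requested.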

Key Summary
===========

.. list-table::
  :widths: 15 20 25 40
  :header-rows: 1

  * - Key
    - Key Type
    - HWLOC Type
    - Description
  * - *topo*
    - structural key
    - HWLOC_OBJ_MACHINE
    - Node topology object; peer of *ranks* in each children entry
  * - *socket*
    - locality name
    - HWLOC_OBJ_PACKAGE
    - A processor package
  * - *numa*
    - locality name
    - HWLOC_OBJ_NUMANODE
    - A NUMA memory domain
  * - *memory*
    - resource key
    - HWLOC_OBJ_NUMANODE
    - GiB of memory local to this domain
  * - *cores*
    - resource key
    - HWLOC_OBJ_CORE
    - core IDs
  * - *gpus*
    - resource key
    - HWLOC_OBJ_OSDEV_GPU
    - GPU IDs
  * - *storage*
    - resource key
    -
    - Storage device or mount point (node scope only)

Other locality names MAY be used for site-specific topologies.

Allocated R
===========

When resources are allocated, the scheduling key of the allocated *R* SHALL
be updated as follows and stored in the KVS job schema with parent-relative
rank values:

- *children* entries: the *ranks* value is trimmed to the intersection with
  the allocated rank set.  Entries with no allocated ranks are removed.

The *topo* structure within each surviving *children* entry is carried through
unchanged; it describes the full physical topology of the node, not the
allocated subset.  The Rv1 *R_lite* ``children`` idsets remain the
authoritative record of which cores and GPUs are allocated.  The *memory* and
*storage* keys are capacity hints that inform scheduler placement decisions
but are not tracked per-job in the Rv1 record.  The scheduling key serves as
a topology hint for the sub-instance scheduler, enabling affinity-aware
placement within the allocated node set.
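
The trimming rule above can be sketched as follows.  This is an illustrative
example rather than a reference implementation: ``decode_idset``,
``encode_idset``, and ``trim_children`` are hypothetical names, and the idset
helpers support only plain comma-separated IDs and ranges.

.. code-block:: python

   def decode_idset(s):
       """Decode a simple idset such as "0-3,8" into a sorted list of ints."""
       ids = set()
       for part in s.split(","):
           if not part:
               continue
           lo, _, hi = part.partition("-")
           ids.update(range(int(lo), int(hi or lo) + 1))
       return sorted(ids)


   def encode_idset(ids):
       """Encode integers as a simple idset string, collapsing runs to ranges."""
       parts = []
       start = prev = None
       for i in sorted(set(ids)):
           if start is not None and i == prev + 1:
               prev = i
               continue
           if start is not None:
               parts.append(str(start) if start == prev else f"{start}-{prev}")
           start = prev = i
       if start is not None:
           parts.append(str(start) if start == prev else f"{start}-{prev}")
       return ",".join(parts)


   def trim_children(scheduling, allocated_ranks):
       """Return a copy of the scheduling key trimmed to the allocated ranks.

       Each entry's ranks are intersected with the allocated rank set,
       entries left with no ranks are dropped, and topo is carried through
       unchanged.
       """
       allocated = set(decode_idset(allocated_ranks))
       result = dict(scheduling)
       if "children" in scheduling:
           trimmed = []
           for entry in scheduling["children"]:
               keep = allocated & set(decode_idset(entry["ranks"]))
               if keep:
                   trimmed.append(
                       {"ranks": encode_idset(keep), "topo": entry["topo"]}
                   )
           result["children"] = trimmed
       return result

The rank values in the result are still parent-relative, as required for the
copy stored in the KVS.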

Sub-instance Initialization
----------------------------

When allocated *R* is used to initialize a sub-instance scheduler, rank
values in *children* entries SHALL be normalized to zero-origin, mapping
each parent rank to its zero-based position within the allocated rank set
ordered numerically.  This normalization mirrors the transformation already
applied to the Rv1 *R_lite* ranks during sub-instance startup.
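
As an illustration of the zero-origin mapping, the sketch below renumbers the
ranks of each *children* entry by position within the allocated rank set.
The function names are hypothetical, and the idset helpers handle only plain
comma-separated IDs and ranges.

.. code-block:: python

   def decode_idset(s):
       """Decode a simple idset such as "0-3,8" into a sorted list of ints."""
       ids = set()
       for part in s.split(","):
           if not part:
               continue
           lo, _, hi = part.partition("-")
           ids.update(range(int(lo), int(hi or lo) + 1))
       return sorted(ids)


   def encode_idset(ids):
       """Encode integers as a simple idset string, collapsing runs to ranges."""
       parts = []
       start = prev = None
       for i in sorted(set(ids)):
           if start is not None and i == prev + 1:
               prev = i
               continue
           if start is not None:
               parts.append(str(start) if start == prev else f"{start}-{prev}")
           start = prev = i
       if start is not None:
           parts.append(str(start) if start == prev else f"{start}-{prev}")
       return ",".join(parts)


   def normalize_children(children, allocated_ranks):
       """Map parent ranks to their zero-based position in the allocated set.

       Assumes every rank listed in children is part of the allocated set,
       that is, the entries have already been trimmed.
       """
       position = {
           rank: index
           for index, rank in enumerate(decode_idset(allocated_ranks))
       }
       return [
           {
               "ranks": encode_idset(
                   position[r] for r in decode_idset(entry["ranks"])
               ),
               "topo": entry["topo"],
           }
           for entry in children
       ]

For example, with allocated ranks ``3,7,9``, an entry whose *ranks* value is
``7,9`` is rewritten to ``1-2``.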

Test Vectors
************

The following examples define two representative cluster topologies and
illustrate how resource requests are fulfilled by the TreePool scheduler.

Test vector tables map RFC 46 job shapes to the expected Rv1 *R_lite*
allocation.  Allocations are cumulative: each row assumes all prior
rows in the table are already in place.  Best-fit node selection is assumed:
among the nodes that can satisfy a request, the one with the fewest free
cores is preferred.

Cluster A: Xeon/Nvidia Cluster
==============================

A 16-node cluster modeled on a dual-socket Intel Xeon system (e.g. Dell
PowerEdge XE9680) with Sub-NUMA Clustering (SNC4) enabled.  Each node has
two sockets, each containing four NUMA domains.  Each NUMA domain has 15
cores and one NVIDIA GPU, for a total of 120 cores and 8 GPUs per node.
Each node also has a 2 TiB node-local NVMe device.

.. code-block:: json

   {
     "writer": "TreePool",
     "children": [
       {
         "ranks": "0-15",
         "topo": {
           "storage": [{"path": "/mnt/nvme", "capacity": 2, "unit": "TiB"}],
           "socket": [
             {"numa": [
               {"cores": "0-14",    "gpus": "0"},
               {"cores": "15-29",   "gpus": "1"},
               {"cores": "30-44",   "gpus": "2"},
               {"cores": "45-59",   "gpus": "3"}
             ]},
             {"numa": [
               {"cores": "60-74",   "gpus": "4"},
               {"cores": "75-89",   "gpus": "5"},
               {"cores": "90-104",  "gpus": "6"},
               {"cores": "105-119", "gpus": "7"}
             ]}
           ]
         }
       }
     ]
   }
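
Because the topology is regular, a scheduling key like the one above is
straightforward to generate.  The following sketch is illustrative only:
``make_scheduling_key`` and its parameters are hypothetical, and it assumes
that cores are numbered contiguously by NUMA domain and that GPU *n* is
attached to NUMA domain *n*, as in Cluster A.

.. code-block:: python

   def make_scheduling_key(nnodes=16, sockets=2, numas_per_socket=4,
                           cores_per_numa=15):
       """Build a TreePool scheduling key for a homogeneous cluster."""
       topo = {
           "storage": [{"path": "/mnt/nvme", "capacity": 2, "unit": "TiB"}],
           "socket": [],
       }
       numa_id = 0
       for _ in range(sockets):
           numas = []
           for _ in range(numas_per_socket):
               first = numa_id * cores_per_numa
               last = first + cores_per_numa - 1
               # One GPU per NUMA domain, numbered like the domain itself.
               numas.append({"cores": f"{first}-{last}", "gpus": str(numa_id)})
               numa_id += 1
           topo["socket"].append({"numa": numas})
       return {
           "writer": "TreePool",
           # All ranks share one topology, so a single entry suffices.
           "children": [{"ranks": f"0-{nnodes - 1}", "topo": topo}],
       }

With its default arguments the function returns a dictionary equivalent to
the JSON object shown above.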

.. list-table::
  :widths: 5 38 57
  :header-rows: 1

  * - ID
    - Job Shape
    - *R_lite* entry
  * - TVA1
    - slot=1/node=1/[core=8;gpu=1]
    - {"rank": "0", "children": {"core": "0-7", "gpu": "0"}}
  * - TVA2
    - slot=1/node=1/[core=8;gpu=1]
    - {"rank": "0", "children": {"core": "15-22", "gpu": "1"}}
  * - TVA3
    - node/slot=8/[core=15;gpu=1]
    - {"rank": "1", "children": {"core": "0-119", "gpu": "0-7"}}
  * - TVA4
    - slot=4/node/core=120
    - {"rank": "2-5", "children": {"core": "0-119"}}
  * - TVA5
    - slot=1/node=1/core=4
    - {"rank": "0", "children": {"core": "8-11"}}
  * - TVA6
    - slot=1/node=1/[core=15;gpu=1]
    - {"rank": "0", "children": {"core": "30-44", "gpu": "2"}}
  * - TVA7
    - node/slot=6/[core=15;gpu=1]
    - {"rank": "6", "children": {"core": "0-89", "gpu": "0-5"}}
  * - TVA8
    - slot=1/node=1/[core=60;gpu=4]
    - {"rank": "0", "children": {"core": "60-119", "gpu": "4-7"}}
  * - TVA9
    - slot=1/numa{x}
    - {"rank": "0", "children": {"core": "45-59", "gpu": "3"}}
  * - TVA10
    - slot=1/socket{x}
    - {"rank": "7", "children": {"core": "0-59", "gpu": "0-3"}}
  * - TVA11
    - slot=1/node{x}
    - {"rank": "8", "children": {"core": "0-119", "gpu": "0-7"}}

- **TVA2**: After TVA1, NUMA 0 has 7 free cores but GPU 0 is gone.  A slot
  requires cores and GPU from the same NUMA domain, so the allocator skips
  NUMA 0 and takes cores 15–22 and GPU 1 from NUMA 1 on the same node.
- **TVA3**: Rank 0 has only 6 full NUMA domains after TVA1–2.  The 8-slot job
  needs 8, so the allocator skips rank 0 and uses rank 1.
- **TVA4**: Rank 0 is fragmented; rank 1 was fully consumed by TVA3.
- **TVA5**: CPU-only best-fit selects rank 0 (104 free cores) over a fresh node
  (120 free cores); within rank 0 the first NUMA with ≥ 4 free cores is
  NUMA 0 (cores 8–14), yielding cores 8–11.
- **TVA6**: TVA5 reduced NUMA 0 to 3 free cores; NUMA 1 has no free GPU.  The
  first NUMA on rank 0 with both 15 free cores and a free GPU is NUMA 2
  (cores 30–44, GPU 2).
- **TVA7**: After TVA6, rank 0 has only 5 intact NUMAs (3–7); the 6-slot
  request needs 6, so it advances to rank 6.
- **TVA8**: A 60-core/4-GPU slot fits exactly one socket (4 NUMAs × 15 cores).
  Rank 0's socket 1 (NUMAs 4–7, cores 60–119, GPUs 4–7) is still intact and
  beats a fully-free node (85 vs. 120 free cores) under best-fit scoring.
- **TVA9**: After TVA8, rank 0 has 25 free cores; NUMA 3 (cores 45–59, GPU 3)
  is the only intact NUMA domain remaining on rank 0.  Best-fit selects
  rank 0 over rank 6 (30 free cores) and fresh nodes (120 free cores); the
  exclusive NUMA slot claims all resources in that domain.
- **TVA10**: After TVA9, rank 0 has 10 free cores but no intact NUMA or socket.
  Rank 6 has two intact NUMAs (6–7), but socket 0 is fully allocated and
  socket 1 is only half free (NUMAs 4–5 are in use), so neither socket is
  intact.  Rank 7 is the first node with an intact socket; socket 0
  (NUMAs 0–3, cores 0–59, GPUs 0–3) is selected.
- **TVA11**: After TVA10, rank 7 is partially used (socket 0 claimed); ranks
  0–7 all have some resources allocated so none qualifies as fully free.
  Rank 8 is the first fully-free node.  No core constraint is specified;
  node-exclusive allocation claims all 120 cores and 8 GPUs.
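
The free-core and intact-NUMA figures quoted in the notes above can be checked
by replaying the rank 0 allocations from the table.  The sketch below is
bookkeeping only, not a scheduler; all names are hypothetical, and a NUMA
domain is treated as intact when none of its cores and not its GPU have been
allocated.

.. code-block:: python

   CORES_PER_NODE = 120
   # NUMA domain n owns cores 15n .. 15n+14 and GPU n.
   NUMA_CORES = [set(range(n * 15, (n + 1) * 15)) for n in range(8)]


   def span(lo, hi):
       """Return the set of core IDs lo..hi inclusive."""
       return set(range(lo, hi + 1))


   # Allocations placed on rank 0, in table order: (ID, cores, GPUs).
   RANK0_ALLOCATIONS = [
       ("TVA1", span(0, 7), {0}),
       ("TVA2", span(15, 22), {1}),
       ("TVA5", span(8, 11), set()),
       ("TVA6", span(30, 44), {2}),
       ("TVA8", span(60, 119), {4, 5, 6, 7}),
       ("TVA9", span(45, 59), {3}),
   ]


   def replay(allocations):
       """Return {ID: (free core count, intact NUMA list)} after each step."""
       used_cores, used_gpus = set(), set()
       history = {}
       for name, cores, gpus in allocations:
           used_cores |= cores
           used_gpus |= gpus
           intact = [
               n for n, domain in enumerate(NUMA_CORES)
               if not (domain & used_cores) and n not in used_gpus
           ]
           history[name] = (CORES_PER_NODE - len(used_cores), intact)
       return history


   for name, (free, intact) in replay(RANK0_ALLOCATIONS).items():
       print(name, free, intact)

The state recorded after each step is the state seen by the next request; for
instance, the entry for TVA6 gives the 85 free cores and five intact NUMA
domains cited for TVA7 and TVA8.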

Cluster B: HPE Cray EX Cluster
==============================

A 1152-node HPE Cray EX system (ranks 0–1151) in which each node is equipped
with four AMD Instinct MI300A APUs.  Each MI300A presents to the OS as a CPU
package with one NUMA node of HBM, 24 cores, and one GPU, and is described by
one *socket* entry in *topo*.  Memory varies slightly due to firmware
reservation (package 0: 125 GiB, packages 1–3: 126 GiB each).  A 200 GiB
per-node burst-buffer allocation is pre-mounted at ``/l/ssd``.

.. code-block:: json

   {
     "writer": "TreePool",
     "children": [
       {
         "ranks": "0-1151",
         "topo": {
           "storage": [{"path": "/l/ssd", "capacity": 200, "unit": "GiB"}],
           "socket": [
             {"cores": "0-23",  "gpus": "0", "memory": 125},
             {"cores": "24-47", "gpus": "1", "memory": 126},
             {"cores": "48-71", "gpus": "2", "memory": 126},
             {"cores": "72-95", "gpus": "3", "memory": 126}
           ]
         }
       }
     ]
   }

.. list-table::
  :widths: 5 38 57
  :header-rows: 1

  * - ID
    - Job Shape
    - *R_lite* entry
  * - TVB1
    - slot=1/node=1/[core=24;gpu=1]
    - {"rank": "0", "children": {"core": "0-23", "gpu": "0"}}
  * - TVB2
    - slot=1/node=1/[core=24;gpu=1]
    - {"rank": "0", "children": {"core": "24-47", "gpu": "1"}}
  * - TVB3
    - node/slot=4/[core=24;gpu=1]
    - {"rank": "1", "children": {"core": "0-95", "gpu": "0-3"}}
  * - TVB4
    - slot=4/node/[core=96;gpu=4]
    - {"rank": "2-5", "children": {"core": "0-95", "gpu": "0-3"}}
  * - TVB5
    - slot=1/node=1/core=8
    - {"rank": "0", "children": {"core": "48-55"}}
  * - TVB6
    - slot=1/node=1/[core=24;gpu=1]
    - {"rank": "0", "children": {"core": "72-95", "gpu": "3"}}
  * - TVB7
    - node/slot=3/[core=24;gpu=1]
    - {"rank": "6", "children": {"core": "0-71", "gpu": "0-2"}}
  * - TVB8
    - slot=1/socket{x}
    - {"rank": "6", "children": {"core": "72-95", "gpu": "3"}}
  * - TVB9
    - slot=1/node{x}
    - {"rank": "7", "children": {"core": "0-95", "gpu": "0-3"}}

- **TVB2**: After TVB1, package 0 is fully consumed.  The second slot takes
  package 1 on the same node.
- **TVB3**: Rank 0 has only 2 free packages after TVB1–2.  The 4-slot job needs
  4, so the allocator skips rank 0 and uses rank 1.
- **TVB4**: Rank 0 is fragmented; rank 1 was fully consumed by TVB3.
- **TVB5**: CPU-only best-fit selects rank 0 (48 free cores) over a fresh node
  (96 free cores); the first free package is package 2 (cores 48–71),
  yielding cores 48–55.
- **TVB6**: TVB5 consumed 8 cores from package 2, leaving only 16 free there.
  Package 3 remains intact with 24 cores and GPU 3.
- **TVB7**: After TVB6, rank 0 has no intact packages; the 3-slot request
  advances to rank 6.
- **TVB8**: After TVB7, rank 6 has one intact socket remaining (socket 3,
  cores 72–95 and GPU 3).  Best-fit selects rank 6 (24 free cores) over a
  fully-free node (96 free cores); the exclusive socket slot claims all
  resources in that domain.
- **TVB9**: After TVB8, rank 6 is fully consumed (all four sockets taken by
  TVB7 and TVB8).  Rank 7 is the first fully-free node.  No core constraint
  is specified; node-exclusive allocation claims all 96 cores and 4 GPUs.
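
The node selection reasoning used in these notes can be sketched as a small
function.  This is an assumption-laden illustration, not the TreePool
algorithm: ``best_fit`` is a hypothetical name, nodes are scored by free core
count only, and ties are broken by lowest rank, which is consistent with the
tables but not stated normatively in this document.

.. code-block:: python

   def best_fit(nodes, feasible):
       """Pick the feasible rank with the fewest free cores.

       nodes maps rank -> {"free_cores": int, "intact_sockets": int}.
       feasible is a predicate on one node's state.  Returns None when no
       node can satisfy the request.
       """
       candidates = [rank for rank, node in nodes.items() if feasible(node)]
       if not candidates:
           return None
       # Fewest free cores first; lowest rank breaks ties (an assumption).
       return min(candidates, key=lambda r: (nodes[r]["free_cores"], r))


   # Cluster B state just before TVB8 (ranks beyond 7 omitted for brevity).
   nodes = {
       0: {"free_cores": 16, "intact_sockets": 0},
       1: {"free_cores": 0, "intact_sockets": 0},
       2: {"free_cores": 0, "intact_sockets": 0},
       3: {"free_cores": 0, "intact_sockets": 0},
       4: {"free_cores": 0, "intact_sockets": 0},
       5: {"free_cores": 0, "intact_sockets": 0},
       6: {"free_cores": 24, "intact_sockets": 1},
       7: {"free_cores": 96, "intact_sockets": 4},
   }

   # TVB8 requests one exclusive socket, so a node needs an intact socket.
   chosen = best_fit(nodes, lambda node: node["intact_sockets"] >= 1)
   print(chosen)

Rank 0 has fewer free cores than rank 6 but no intact socket, so it is
filtered out before scoring; this is why feasibility is checked first.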