20/Resource Set Specification Version 1

This specification defines the version 1 format of the resource-set representation, or R for short.

Name	github.com/flux-framework/rfc/spec_20.rst
Editor	Jim Garlick <garlick@llnl.gov>
State	raw

Language

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Overview

This specification describes a JSON object R used to represent sets of specific resources. R is for representing concrete resources like “cores 2-5 of node 9”. It is distinct from the jobspec resources section (RFC 14), which is for abstract resource requirements like “one node with four cores”.

The following Flux subsystems must handle R:

resource module: A Flux instance has a resource inventory which it obtains from configuration, through allocation from the enclosing instance, or via dynamic probing. The end result is expressed as R. The resources in R are also monitored at the rank level for availability.
scheduler: The scheduler obtains the resource inventory at initialization and fulfills job requests by allocating R subsets to them. When jobs terminate their R subsets are returned to the scheduler and become available for fulfilling other job requests.
job manager: The job manager tracks R so it can pass it to the scheduler and execution subsystems as the job transitions through job states. In addition, R is made available to jobtap plugins which extend the job manager’s function. Finally, when R is updated, for example to extend the duration of a running job, the job manager coordinates R updates across subsystems.
execution: The job execution system uses R to determine where to launch Flux shells.
shell: The job shell uses R to determine where to launch tasks. Shell plugins may use R for various purposes such as setting core and GPU affinity.

Design Goals

R is designed with the following goals:

Identify the specific resources where a job’s tasks are to be launched.
Identify the specific resources managed by a Flux instance.
Be suitable for inclusion in a job’s post-mortem record, and useful for answering forensic questions like “did my job run on the node that failed?”.
Handle nodes, cores, and GPUs simply to ease the initial Flux implementation.
Allow schedulers to add proprietary enhancements that are ignored by the rest of Flux.
Don’t explode in size on large clusters/jobs.
Handle resource properties as described in RFC 31.
Build towards the general resource model of RFC 4.

Implementation

Note

In Flux documentation, the terms “node rank” and “execution target” are sometimes used interchangeably to refer to the Flux broker rank used to launch tasks on a given resource set. When Flux launches a new Flux instance on a subset of resources, the new broker ranks do not match those of the enclosing instance, and the inherited R must be re-ranked before it can be brought into inventory. Therefore, to avoid implying that a node rank uniquely identifies a physical node across instances, execution target is preferred in this document.

R Format

R SHALL be defined as a JSON dictionary with the following keys:

version

(integer, REQUIRED) The R specification version.

For this specification the value is always 1.

execution

(dictionary, REQUIRED) The resource set. It SHALL have the following keys:

R_lite

(array of dictionary, REQUIRED) A list that identifies one or more execution targets and the specific cores and GPUs they control. The list entries need not appear in any particular order. Each entry SHALL have the following keys:

rank: (string, REQUIRED) An RFC 22 idset representing one or more execution targets.

children

(dictionary, REQUIRED) The specific resources controlled by the execution targets in rank. It SHALL have the following keys:

core: (string, REQUIRED) An RFC 22 idset representing one or more logical CPU cores IDs.

gpu: (string, OPTIONAL) An RFC 22 idset representing one or more logical GPU IDs.

nodelist

(array of string, REQUIRED) A list of hostnames corresponding to the execution targets in R_lite.

Each entry SHALL be either a single hostname or an RFC 29 hostlist.

The order of hostnames MUST correspond to the sorted list of execution target ranks in R_lite so that they can be mapped one to one. However, the number of entries in each array need not be the same. For example, nodelist MAY contain one hostlist entry for all the execution targets spread over multiple R_lite entries.
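The one-to-one mapping between sorted execution target ranks and nodelist hostnames can be sketched as follows. This is a minimal illustration, not a reference implementation: the helper names (expand_ranges, expand_idset, expand_hostlist, rank_to_host) are hypothetical, and the parsers assume only the simplest forms of RFC 22 idsets ("0-2,5") and RFC 29 hostlists ("node[186-189]" with a single bracket group and no zero padding).

```python
import re


def expand_ranges(text):
    """Expand "0-2,5" into [0, 1, 2, 5], keeping the order given."""
    ids = []
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def expand_idset(idset):
    """Expand a simple idset into a sorted list of unique integers."""
    return sorted(set(expand_ranges(idset)))


def expand_hostlist(entry):
    """Expand "node[186-188]" into hostnames; plain hostnames pass through.

    Assumption: at most one bracket group and no zero-padded indices.
    """
    m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", entry)
    if m is None:
        return [entry]
    prefix, ranges, suffix = m.groups()
    return [f"{prefix}{i}{suffix}" for i in expand_ranges(ranges)]


def rank_to_host(R):
    """Map each execution target rank in R_lite to its hostname."""
    execution = R["execution"]
    # Collect ranks from every R_lite entry, then sort them.
    ranks = sorted(
        {r for entry in execution["R_lite"] for r in expand_idset(entry["rank"])}
    )
    # Flatten nodelist entries in the order they appear.
    hosts = [h for entry in execution["nodelist"] for h in expand_hostlist(entry)]
    if len(ranks) != len(hosts):
        raise ValueError("nodelist does not map one to one onto R_lite ranks")
    return dict(zip(ranks, hosts))
```

Note that the ranks are sorted before pairing, while the nodelist order is preserved, which mirrors the rule that hostname order corresponds to the sorted rank list regardless of how R_lite entries are arranged.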

nslots

(integer, OPTIONAL) The total number of slots in an allocation from the scheduler.

This key will not be present if R is not a scheduler allocation, e.g. for an instance resource inventory. A conforming scheduler SHALL include it in its allocations and, if present, it MUST be greater than 0.

properties

(dictionary of string, OPTIONAL) Each key maps a single property name to an RFC 22 idset string. The idset string SHALL represent a set of execution targets. A given target MAY appear in multiple property mappings. Property names SHALL be valid UTF-8, and MUST NOT contain the following illegal characters:

! & ' " ^ ` | ( )

Additionally, the @ character is reserved for scheduler-specific property use. In this case, the literal property SHALL still apply to the defined execution target ranks, but the scheduler MAY use the suffix after @ to apply the property to children resources of the execution target or for another scheduler-specific purpose. For example, the property amd-mi50@gpu SHALL apply to the defined execution target ranks, but a scheduler MAY use the gpu suffix to perform scheduling optimization for GPUs of the corresponding ranks. This MAY result in both amd-mi50@gpu and amd-mi50 being valid properties for resources in the instance.
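The property naming rules above can be illustrated with a short check. This is a sketch under stated assumptions: check_property_name is a hypothetical helper, not part of any Flux API, and splitting at the first @ is one possible reading of "the suffix after @".

```python
# The nine characters this specification forbids in property names.
ILLEGAL_CHARS = set("!&'\"^`|()")


def check_property_name(name):
    """Validate a property name and split off any scheduler-specific suffix.

    Returns (base, suffix), where suffix is None when no "@" is present.
    Raises ValueError if the name breaks the rules.
    """
    # A Python str with lone surrogates cannot be encoded as UTF-8;
    # UnicodeEncodeError is a subclass of ValueError.
    name.encode("utf-8")
    bad = sorted(set(name) & ILLEGAL_CHARS)
    if bad:
        raise ValueError(f"illegal characters in property name: {bad}")
    # Assumption: the suffix is everything after the first "@".
    base, sep, suffix = name.partition("@")
    return base, (suffix if sep else None)
```

For the example in the text, check_property_name("amd-mi50@gpu") returns ("amd-mi50", "gpu"). The full literal name, including the suffix, remains the property that applies to the execution target ranks.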

starttime

(number, OPTIONAL) The start time at which the resource set is valid.

A value of 0 SHALL be interpreted as “unset”.

The value SHALL be expressed as the number of seconds elapsed since the Unix Epoch (1970-01-01 00:00:00 UTC) with optional microsecond precision.

If starttime is unset, then the resource set has no specified start time and is valid beginning at any time up to expiration.

expiration

(number, OPTIONAL) The end or expiration time of the resource set, after which it becomes invalid.

A value of 0 SHALL be interpreted as “unset”.

The value SHALL be expressed as the number of seconds elapsed since the Unix Epoch (1970-01-01 00:00:00 UTC) with optional microsecond precision.

If starttime is also set, expiration MUST be greater than starttime.

If expiration is unset, the resource set has no specified end time.
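The starttime and expiration rules can be summarized in a small validity check. This is a sketch only: is_valid_at is a hypothetical helper, a missing key is treated the same as a value of 0, and treating the exact expiration instant as still valid is an assumption, since the text only says the set becomes invalid after expiration.

```python
def is_valid_at(execution, when):
    """Return True if the resource set is valid at Unix time `when`.

    `execution` is the "execution" dictionary of R. A missing key or a
    value of 0 means "unset".
    """
    start = execution.get("starttime", 0)
    end = execution.get("expiration", 0)
    if start and end and end <= start:
        raise ValueError("expiration MUST be greater than starttime")
    if start and when < start:
        return False  # not yet valid
    if end and when > end:
        return False  # expired (boundary handling is an assumption)
    return True
```

With both keys unset, the function returns True for any time, matching the case of a resource set with no specified start or end.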

scheduling

(dictionary, OPTIONAL) Scheduler-specific resource data with the following keys:

writer: (string, OPTIONAL) If provided, a URI whose scheme identifies the scheduler that produced the scheduling key. Remaining URI components SHALL be interpreted by the named scheduler. If not provided, a value of fluxion SHALL be assumed.

Remaining scheduler-specific keys SHOULD be ignored by other Flux components.

When used, scheduling SHALL ride along on the resource acquisition protocol (RFC 28) and resource allocation protocol (RFC 27) so that it may be included in static configuration, allocated to jobs, and passed down a Flux instance hierarchy.

Linkage to specific resources in R_lite SHOULD use hostnames rather than execution targets since the scheduler-agnostic re-ranking of R that occurs when a new Flux instance is started cannot do the same for the opaque scheduling key.
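As an illustration of the REQUIRED keys defined in this section, the following sketch checks the basic shape of an R object that has already been decoded from JSON. The function name check_r and its messages are hypothetical. It does not parse idsets or hostlists, does not inspect the optional properties, starttime, expiration, or scheduling keys, and treating an empty R_lite array as an error is an interpretation of "one or more execution targets".

```python
def check_r(R):
    """Return a list of problems found in R; an empty list means none found."""
    problems = []
    if R.get("version") != 1:
        problems.append("version: expected 1")
    execution = R.get("execution")
    if not isinstance(execution, dict):
        problems.append("execution: missing or not a dictionary")
        return problems

    r_lite = execution.get("R_lite")
    if not isinstance(r_lite, list) or not r_lite:
        problems.append("R_lite: missing or empty")
        r_lite = []
    for i, entry in enumerate(r_lite):
        if not isinstance(entry, dict):
            problems.append(f"R_lite[{i}]: not a dictionary")
            continue
        if not isinstance(entry.get("rank"), str):
            problems.append(f"R_lite[{i}].rank: missing or not a string")
        children = entry.get("children")
        if not isinstance(children, dict):
            problems.append(f"R_lite[{i}].children: missing or not a dictionary")
        elif not isinstance(children.get("core"), str):
            problems.append(f"R_lite[{i}].children.core: missing or not a string")

    nodelist = execution.get("nodelist")
    if not isinstance(nodelist, list) or not all(
        isinstance(h, str) for h in nodelist
    ):
        problems.append("nodelist: missing or not an array of strings")

    # nslots is OPTIONAL, but if present it must be greater than 0.
    nslots = execution.get("nslots")
    if nslots is not None and (not isinstance(nslots, int) or nslots <= 0):
        problems.append("nslots: must be an integer greater than 0")
    return problems
```

Returning a list of problems rather than raising on the first one is a design choice for this sketch; it lets a caller report every structural issue in a single pass.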

Example R

The following is an example of a version 1 resource specification. It describes a resource set with ranks 19 through 22, which correspond to the nodes node186 through node189. Each node contains 48 cores (0-47) and 8 GPUs (0-7). There are 32 slots in total, giving 8 slots per node, which results in 1 GPU and 6 cores per slot. The starttime and expiration indicate that the resources were valid for 30 minutes on February 16, 2023.

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "19-22",
        "children": {
          "core": "0-47",
          "gpu": "0-7"
        }
      }
    ],
    "nodelist": [
      "node[186-189]"
    ],
    "nslots": 32,
    "starttime": 1676560542,
    "expiration": 1676562342
  }
}
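The figures quoted for this example can be reproduced with a few lines of arithmetic. The variable names below are illustrative only, and an even split of cores and GPUs across slots is assumed, as in the description of the example.

```python
from datetime import datetime, timezone

# Values taken from the example R above.
nodes = 4            # ranks 19-22
cores_per_node = 48  # core idset "0-47"
gpus_per_node = 8    # gpu idset "0-7"
nslots = 32
starttime = 1676560542
expiration = 1676562342

slots_per_node = nslots // nodes                    # 8
cores_per_slot = cores_per_node // slots_per_node   # 6
gpus_per_slot = gpus_per_node // slots_per_node     # 1

# Convert the Unix timestamps to UTC and compute the validity window.
start_utc = datetime.fromtimestamp(starttime, tz=timezone.utc)
duration_minutes = (expiration - starttime) / 60    # 30.0

print(slots_per_node, cores_per_slot, gpus_per_slot)
print(start_utc.isoformat(), duration_minutes)
# prints:
# 8 6 1
# 2023-02-16T15:15:42+00:00 30.0
```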