27/Flux Resource Allocation Protocol Version 1

This specification describes Version 1 of the Flux Resource Allocation Protocol implemented by the job manager and a compliant Flux scheduler.

Name	github.com/flux-framework/rfc/spec_27.rst
Editor	Jim Garlick <garlick@llnl.gov>
State	raw

Language

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Background

The Flux job manager’s role is managing the queue of pending job requests and transitioning jobs through the job states defined in RFC 21, actively initiating actions in each state. A Flux scheduler’s role is passively fulfilling resource allocation requests that the job manager makes on behalf of jobs.

A scheduler implementation registers the generic service name sched and provides several well known service methods. The job manager requests resources from the scheduler with a sched.alloc request when a job enters SCHED state. It releases resources with one or more sched.free requests after a job enters CLEANUP state.

The simplest imaginable scheduler satisfies sched.alloc requests in order until it is out of resources, then blocks until sched.free requests release enough resources to satisfy the next sched.alloc request. More complex schedulers consider multiple sched.alloc requests and satisfy them out of order to prioritize or balance measures of success such as resource utilization or fairness.

Abstract resource allocation requests are expressed as a jobspec object (RFC 14). Concrete resources assignments are expressed a an R object (RFC 20). These objects are stored in the KVS per the job schema (RFC 16).

This RFC describes the RPC messages outlined above. It also describes the initialization messages used to establish parameters for scheduler operation and identify resources that are already allocated at scheduler startup. It does not cover the mechanism by which a scheduler discovers the initial inventory of resources.

Design Criteria

Support multiple scheduler implementations, minimizing repeated code in schedulers.
Allow the maximum number of outstanding allocation requests sent by the job manager to be controlled by the scheduler.
Allow the scheduler module to be reloaded with recovery of resource allocations of running jobs.
Allow the scheduler module to abort (return from its module thread unexpectedly) without impacting running work.
Send allocation requests to scheduler in priority, submit time order.
Inform scheduler of job priority and submit time so it can reorder requests internally, combining these factors with others.
Support job cancellation in SCHED state.
Support job priority change in SCHED state.
The resource allocation protocol should not present obstacles to scaling to O(1M) jobs in SCHED state.
The protocol should not inhibit scaling job throughput to O(100) jobs per second.
Capture scheduler specific job annotations for display by the job listing tool (e.g. start time estimates).
Allow the expiration time of a resource allocation to be adjusted.
Detect unsatisfiable job requests at submission time.

Implementation

To escape scalability limitations of the Flux “tag pool”, sched.alloc and sched.free RPCs use the job ID to match requests and responses, and set the RFC 6 matchtag message field to zero. It follows that:

The job ID MUST appear in the sched.alloc and sched.free request and response message payloads.
There SHALL NOT be more than one sched.alloc or sched.free request in flight for each job, since otherwise a request could not be uniquely matched to a response using the job ID.
The errnum field in sched.alloc and sched.free response messages MUST be set to zero, even if the response indicates an error. Otherwise, the message payload could not include the job ID since RFC 6 defines the payload of an error response to be optional error text.
The job manager SHALL treat a conventional Flux error response to sched.alloc or sched.free with a nonzero errnum field as a scheduler fatality, and SHALL not send further requests to the scheduler until it receives a new job-manager.sched-ready request (see Finalization below).

The other RPCs behave conventionally.

A detailed description of these RPCs follows.

Hello

Before any other RPCs are sent to the job manager, the scheduler SHALL send a request to job-manager.sched-hello with the FLUX_MSGFLAG_STREAMING flag set. The request payload SHALL either be empty or consist of a JSON object with the following OPTIONAL keys:

partial-ok: (boolean) The scheduler SHALL set this flag to true if it can handle the free key in hello responses. That is, it can process jobs with partially released resource sets. If this key is missing it SHALL be interpreted as false.

The job manager SHALL send one response message for each job with allocated resources. Each response payload SHALL consist of a JSON object with the following REQUIRED keys:

id: (integer) job ID
priority: (integer) priority in the range of 0 through 4294967295
userid: (integer) job owner
t_submit: (double) job submission time

and the following OPTIONAL key:

free: (string) An RFC 22 idset representing the ranks (execution targets) of this job’s resource set that have already been freed. If this key is omitted, the scheduler SHALL assume the empty set.

Example:

{
  "id": 1552593348,
  "priority": 43444,
  "userid": 5588,
  "t_submit": 1552593348.073045,
  "free":"1-16,18",
}

For each job response, the scheduler SHALL mark its assigned resources allocated internally. It MAY look up R in the KVS by job ID according to the job schema (RFC 16).

The scheduler SHALL wait for an error response with ENODATA set, indicating the stream of responses has completed (RFC 6).

If an error response other than ENODATA is returned to the job-manager.sched-hello request, the scheduler SHALL log the error and exit its module thread.

Ready

Once the scheduler has processed the job-manager.sched-hello handshake, it SHALL notify the job manager that it is ready to accept allocation requests by sending a request to job-manager.sched-ready.

The request payload SHALL consist of a JSON object with the following REQUIRED key:

mode: (string) selected concurrency mode

The mode string SHALL be one of the following:

unlimited: The job manager SHALL send a sched.alloc request for all jobs in SCHED state, with no limit on concurrency.
limited: The job manager SHALL limit the number of concurrent sched.alloc requests to value specified by the limit key (described below).

The following key is REQUIRED for limited mode only:

limit: (integer) The number of concurrent sched.alloc requests that can be sent. limit can be in the range of 1 to 2147483647.

Example:

{"mode":"limited","limit":42}

The response payload is a JSON object with the following REQUIRED keys:

count: (integer) current queue depth

After responding to the job-manager.sched-ready request, the job manager MAY immediately begin sending sched.alloc and sched.free requests.

If an error response is returned to the job-manager.sched-ready request, the scheduler SHALL log the error and exit its module thread.

Alloc

The job manager SHALL send a sched.alloc request when a job enters SCHED state, and concurrency criteria established by the initialization handshake are met. The request payload consists of a JSON object with the following REQUIRED keys:

id: (integer) job ID
priority: (integer) priority in the range of 0 through 4294967295
userid: (integer) job owner
jobspec: (object) jobspec object (RFC 14)

Example:

{
  "id": 1552593348,
  "priority": 53444,
  "userid": 5588,
  "jobspec": {
    "resources": [
      {
        "type": "slot",
        "count": 1,
        "with": [{"type": "core", "count": 1}], "label": "task"
      }
    ],
    "tasks": [
      {
        "command": ["/bin//true"],
        "slot": "task",
        "count": {"per_slot": 1}
      }
    ],
    "attributes": {
      "system": {
        "duration": 0,
        "cwd": "/home/user/project",
      }
    },
    "version": 1
  }
}

The jobspec sent with sched.alloc MAY have its environment section redacted to reduce its size, since the environment is not needed by the scheduler. Should it be needed, the full jobspec SHALL be stable in the KVS per the job schema (RFC 16) when the sched.alloc request is received.

The response payload is a JSON object with the following REQUIRED keys:

id: (integer) job ID
type: (integer) response type in the range of 0 through 3

There are four response types:

SUCCESS (0): Resources have been allocated
ANNOTATE (1): The scheduler wishes to annotate the job (see below)
DENY (2): The job cannot be scheduled
CANCEL (3): The alloc request was canceled by a sched.cancel request (see below).

The alloc request MAY receive multiple responses.

Alloc Success

If resources can be allocated, the scheduler SHALL ensure that R has been successfully committed to the KVS per the job schema (RFC 16) before responding.

In addition to the above REQUIRED keys, the SUCCESS response includes the OPTIONAL key:

annotations: (object) key value pairs

Example:

{
  "id": 1552593348,
  "type": 0,
  "annotations": {
    "sched": {
      "resource_summary":"rank[0-1]/core0"
    }
  }
}

If present, the job manager SHALL update the job’s annotation dictionary as described in the next section. The scheduler MAY delete annotations such as sched.t_estimate that are not relevant now that the allocation request has been satisfied.

The job manager posts an alloc event in response to the successful allocation of resources. A snapshot of job’s annotation dictionary, after the above update, is included in the alloc event context per RFC 21, thus preserving it in job record when the allocation is successful.

After the SUCCESS response, the sched.alloc request is complete and may be retired by the job manager and scheduler.

Alloc Annotate

While a job is in SCHED state, the scheduler MAY send multiple ANNOTATE type responses to the sched.alloc request to update scheduler-defined information for display by the job listing tool.

In addition to the above REQUIRED keys, the ANNOTATE response includes the REQUIRED key:

annotations: (object) key value pairs

The job manager SHALL maintain a dictionary of annotations for each job.

Each ANNOTATE response and the SUCCESS response (if it contains annotations) SHALL update the dictionary according to the following rules:

If a key exists and is a dictionary, and the new value is a dictionary, the rules below SHALL be applied to the dictionary recursively.
If a key exists, its value SHALL be replaced with the new value.
If a key exists and the new value is JSON null, the key SHALL be removed.
If a key does not exist, the key SHALL be added with the new value.

The key MAY be one of the following:

sched: (dictionary) dictionary object containing scheduler specific annotations
sched.t_estimate: (double) estimated absolute start time in seconds since UNIX epoch
sched.reason_pending: (string) human readable reason job is pending
sched.resource_summary: (string) human readable overview of assigned resources
sched.queue: (string) human readable identification of job queue
user: (dictionary) dictionary object containing user specific annotations

A scheduler MAY define additional sched keys as needed.

A value MAY be any valid JSON value.

Example:

{
  "id": 1552593348,
  "type": 1,
  "annotations": {
    "sched": {
      "t_estimate": 593016000.0,
      "reason_pending": "requested GPUs are unavailable"
    }
  }
}

Annotations SHALL be considered volatile until a SUCCESS response is received to the sched.alloc request, as described in Alloc Success above. Annotations SHALL be discarded by the job manager if the allocation fails.

Alloc Deny

If the resource request can never be fulfilled, the scheduler SHALL respond to the sched.alloc with a DENY type response.

In addition to the above REQUIRED Keys, the DENY response includes the OPTIONAL key:

note: (string) the reason why the allocation cannot ever be granted

Example:

{
  "id": 1552593348,
  "type": 2,
  "note": "more nodes requested than configured"
}

If present, the note SHALL be added to the exception event context generated by the job manager when processing the allocation failure.

After the DENY response, the sched.alloc request is complete and may be retired by the job manager and scheduler.

Alloc Cancel

When the scheduler receives a sched.cancel request for a job (see below), it SHALL respond to the corresponding sched.alloc request with response type CANCEL. Only the REQUIRED keys above are allowed in a CANCEL response.

Example:

{
  "id": 1552593348,
  "type": 3
}

After the CANCEL response, the sched.alloc request is complete and may be retired by the job manager and scheduler.

Cancel

The job manager may cancel a pending sched.alloc request by sending a request to sched.cancel with payload consisting of a JSON object with the following REQUIRED key:

id: (integer) job ID

Example:

{
  "id": 1552593348
}

The scheduler SHALL NOT respond directly to the sched.cancel request. Instead, if a sched.alloc request is pending for the specified job, it SHALL respond to the sched.alloc request with a CANCEL response as described above. If the specified job does not have a pending sched.alloc request, the request SHALL be ignored by the scheduler.

Note that receipt of a sched.cancel does not necessarily indicate that the job is canceled. For example, the job manager may cancel all outstanding sched.alloc requests in response to the queue being administratively disabled, or to make room for higher priority jobs in single mode.

Prioritize

When jobs with outstanding sched.alloc requests are re-prioritized, the job manager notifies the scheduler by sending a sched.prioritize request. The request payload consists of a JSON object with the following REQUIRED key:

jobs: (array) list of [id, priority] tuples

Each tuple SHALL consist of a two element array, containing:

[0]: (integer) job ID
[1]: (integer) priority in the range of 0 through 4294967295

Example:

{
  "jobs":[
    [49056579584, 444],
    [57428410368, 298],
    [63988301824, 343205],
    [69675778048, 99]
  ]
}

Job IDs which cannot be correlated to a pending sched.alloc request may be safely ignored.

No response is sent to the sched.prioritize request.

Note

A job manager priority plugin MAY initiate a priority update of many jobs at once. The job manager captures these updates in a single sched.prioritize request.

Expiration

The job manager MAY request an adjustment to the expiration time of an existing allocation by sending a sched.expiration request. The request payload consists of a JSON object with the following REQUIRED keys:

id: (integer) job ID
expiration: (integer) the proposed new expiration time, in seconds since the Unix Epoch (1970-01-01 UTC). It MAY reduce or extend the current expiration time.

The response consists of an empty payload on success.

The request MAY fail, for example if:

The job ID is invalid or does not currently have an allocation.
The new expiration time would invalidate an advance reservation.
The new expiration is less than the instance lifetime.
The scheduler does not implement sched.expiration.

Note

The job-manager SHALL interpret an ENOSYS error response to sched.expiration as “not implemented” and process the new expiration time as though the request were successful. Schedulers that require accurate expiration times SHOULD implement this RPC to avoid making schedules that are based on outdated information. Note that enforcement of expiration times is the responsibility of the execution system, not the scheduler.

Free

The job manager SHALL send one or more sched.free requests to release allocated resources to the scheduler. The request payload consists of a JSON object with the following REQUIRED keys:

id: (integer) job ID
R: (object) RFC 20 resource set from which the scheduling key MAY be omitted. This resource set MAY be a subset of the original allocation.

and the following OPTIONAL key:

final: (boolean) If present and true, this free request is the final one for the job. If absent or false, more free requests are forthcoming for the job.

Example:

{
  "id": 1552593348,
  "R": {
    "version": 1,
    "execution": {
      "R_lite": [
        { "rank": "0", "children": { "core": "0-3" } }
      ],
      "nodelist": [ "test0" ],
      "starttime": 1710076092,
      "expiration": 1710076122
    }
  }
  "final":true
}

Upon receipt of a sched.free request, the scheduler SHOULD mark the designated resources as available for reuse.

When a job’s resources are released with multiple sched.free requests, the scheduler MAY assume that the resources associated with a given rank in the original allocation are never split over multiple free requests.

There is no response.

Finalization

If the job manager receives a conventional Flux error response to a sched.alloc or sched.free request, it SHALL log the error and suspend scheduling operations. This ensures that, if the scheduler is not loaded, and the broker responds with an ENOSYS error on its behalf, the job manager behaves appropriately.

Similarly, if the job manager receives a disconnect request from the scheduler, it SHALL suspend scheduler operations.

Operations MAY resume if the scheduler re-establishes itself with the job-manager.sched-hello and job-manager.sched-ready handshakes.

Exceptions

When a job encounters a fatal exception, the job manager transitions it to CLEANUP state.

Upon the job entering CLEANUP state, the job manager sends a sched.cancel request on its behalf if the job has an outstanding sched.alloc request. If the job is holding resources when it enters CLEANUP, the job manager sends a sched.free request.

If the scheduler is monitoring job exceptions, it SHOULD NOT react in ways that might conflict with the job manager’s actions.

Feasibility

A scheduler or other entity MAY register a generic feasibility service name through which unsatisfiable jobs may be detected at job submission.

The feasibility service MAY be registered on one or more nodes to distribute the load of feasibility checks. The job ingest validator (the main user of the feasibility service) runs on all ranks and issues the feasibility.check RPC with FLUX_NODEID_ANY. The request is routed to a local feasibility service if available, or an upstream one per RFC 3.

To determine the feasibility of a job request, a feasibility.check request is sent with the following REQUIRED key:

jobspec: (object) The jobspec for which feasibility should be checked.

If the included jobspec could not ever be satisfied, even if all resources were available and ready, then the feasibility.check service SHALL respond with a error response including a human readable error string.

The response SHALL consist of an empty payload on success.

27/Flux Resource Allocation Protocol Version 1

Language

Related Standards

Background

Design Criteria

Implementation

Hello

Ready

Alloc

Alloc Success

Alloc Annotate

Alloc Deny

Alloc Cancel

Cancel

Prioritize

Expiration

Free

Finalization

Exceptions

Feasibility