25/Job Specification Version 1

A domain specific language based on YAML is defined to express the resource requirements and other attributes of one or more programs submitted to a Flux instance for execution. This RFC describes version 1 of jobspec, which represents a request to run exactly one program. Version 1 is a simplified form of the canonical jobspec format described in RFC 14.

Name

github.com/flux-framework/rfc/spec_25.rst

Editor

Stephen Herbein <herbein1@llnl.gov>

State

raw

Language

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.

Goals

  • Express the resource requirements of a program to the scheduler.

  • Allow resource requirements to be expressed simply in terms of Nodes, CPUs, and GPUs.

  • Express program attributes such as arguments, run time, and task layout, to be considered by the execution service.

Overview

This RFC describes the version 1 form of “jobspec”, a domain specific language based on YAML [1]. Version 1 of jobspec SHALL consist of a single YAML document representing a reusable request to run exactly one program. Hereafter, “jobspec” refers to the version 1 form, and “canonical jobspec” refers to the canonical form described in RFC 14.

Jobspec Language Definition

A jobspec V1 YAML document SHALL consist of a dictionary defining the resources, tasks and other attributes of a single program. The dictionary MUST contain the keys resources, tasks, attributes, and version.

Each of the listed jobspec keys SHALL meet the form and requirements listed in detail in the sections below. For reference, a ruleset for compliant jobspec V1 is provided in the Schema section below.
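As an illustration of the top-level requirement, the following minimal Python sketch reports which required keys are missing from a jobspec that has already been parsed into a dictionary. The helper name `missing_top_level_keys` is hypothetical and is not part of any Flux API.

```python
# The four keys that every jobspec V1 dictionary must contain.
REQUIRED_KEYS = {"resources", "tasks", "attributes", "version"}


def missing_top_level_keys(jobspec):
    """Return the sorted names of required top-level keys absent from jobspec.

    Hypothetical helper for illustration only.
    """
    if not isinstance(jobspec, dict):
        raise TypeError("a jobspec must be a dictionary")
    return sorted(REQUIRED_KEYS - set(jobspec))


# A partial document that lacks two of the required keys.
partial = {"version": 1, "resources": []}
print(missing_top_level_keys(partial))  # prints ['attributes', 'tasks']
```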

Resources

The value of the resources key SHALL be a strict list which MUST define either node or slot as the first and only resource. Each list element SHALL represent a resource vertex (described below).

A resource vertex SHALL contain only the following keys:

  • type

  • count

  • unit

  • with

  • label

A node type resource vertex MAY also contain the following optional key:

  • exclusive

The definitions of unit, with, exclusive, and label SHALL match those found in RFC 14. The type and count keys are redefined and simplified as follows:

type

The type key for a resource SHALL indicate the type of resource to be matched. In V1, only four resource types are valid: node, slot, core, and gpu. The slot type is described in the Reserved Resource Types section of RFC 14.

count

The count key SHALL indicate the desired number of resources matching the current vertex. The count SHALL be a single integer value representing a fixed count.
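The simplified type and count rules can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the vertex is an already-parsed dictionary, the helper name `check_type_and_count` is hypothetical, and the lower bound of 1 on count is taken from the Schema section rather than from the prose above.

```python
# The only resource types that are valid in jobspec V1.
VALID_TYPES = ("node", "slot", "core", "gpu")


def check_type_and_count(vertex):
    """Return a list of problems found in a vertex's type and count keys."""
    problems = []
    if vertex.get("type") not in VALID_TYPES:
        problems.append("type must be one of: " + ", ".join(VALID_TYPES))
    count = vertex.get("count")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(count, bool) or not isinstance(count, int):
        problems.append("count must be a single integer")
    elif count < 1:
        # The minimum of 1 mirrors the Schema section of this RFC.
        problems.append("count must be at least 1")
    return problems


print(check_type_and_count({"type": "core", "count": 2}))  # prints []
```

A count given as a range dictionary, which is not a single integer, is reported as a problem by this sketch.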

V1-Specific Resource Graph Restrictions

In V1, the resources list MUST contain exactly one element, which MUST be either node or slot. Additionally, the resource graph MUST contain the core type.

In V1, there are also restrictions on which resources can have out edges to other resources. Specifically, a node can have an out edge to a slot, and a slot can have an out edge to a core. A slot that has an out edge to a core can optionally also have an out edge to a gpu. Therefore, the complete enumeration of valid resource graphs in V1 is:

  • slot>core

  • node>slot>core

  • slot>(core,gpu)

  • node>slot>(core,gpu)
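Because the set of valid graphs is small, a checker can render a resources list in the notation used above and compare it against the four permitted shapes. The sketch below is illustrative only: `graph_shape` and `is_valid_v1_graph` are hypothetical names, and sorting sibling vertices (so that the order of core and gpu does not matter) is an assumption of this example.

```python
# The four resource graph shapes permitted in jobspec V1.
VALID_SHAPES = {
    "slot>core",
    "node>slot>core",
    "slot>(core,gpu)",
    "node>slot>(core,gpu)",
}


def graph_shape(vertices):
    """Render a list of resource vertices as a string such as 'node>slot>core'."""
    rendered = []
    for vertex in vertices:
        text = str(vertex.get("type"))
        children = vertex.get("with")
        if children:
            text += ">" + graph_shape(children)
        rendered.append(text)
    if len(rendered) == 1:
        return rendered[0]
    # Assumption: sibling order is not significant, so siblings are sorted.
    return "(" + ",".join(sorted(rendered)) + ")"


def is_valid_v1_graph(resources):
    """Return True if the resources list matches one of the valid V1 shapes."""
    return graph_shape(resources) in VALID_SHAPES


resources = [
    {"type": "node", "count": 4, "with": [
        {"type": "slot", "count": 1, "label": "default", "with": [
            {"type": "core", "count": 2},
        ]},
    ]},
]
print(graph_shape(resources))        # prints node>slot>core
print(is_valid_v1_graph(resources))  # prints True
```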

Tasks

The value of the tasks key SHALL be a strict list which MUST define exactly one task. The list element SHALL be a dictionary representing a task to run as part of the program. A task descriptor SHALL contain the following keys, whose definitions SHALL match those provided in RFC 14:

  • command

  • slot

  • count

    • per_slot

    • total
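The task rules can be sketched as a small Python check over an already-parsed jobspec dictionary. The helper names `slot_labels` and `check_tasks` are hypothetical. The check that the task's slot value names a labelled slot in resources is an assumption based on how the examples in this document pair `slot:` with `label:`; RFC 14 remains the authoritative definition.

```python
def slot_labels(vertices):
    """Collect the labels of all slot vertices in a resources list."""
    labels = []
    for vertex in vertices:
        if vertex.get("type") == "slot" and "label" in vertex:
            labels.append(vertex["label"])
        labels.extend(slot_labels(vertex.get("with", [])))
    return labels


def check_tasks(jobspec):
    """Return a list of problems found in the tasks section of a jobspec."""
    tasks = jobspec.get("tasks")
    if not isinstance(tasks, list) or len(tasks) != 1:
        return ["tasks must be a list containing exactly one task"]
    task = tasks[0]
    if not isinstance(task, dict):
        return ["the task must be a dictionary"]
    problems = [
        "task is missing key: " + key
        for key in ("command", "slot", "count")
        if key not in task
    ]
    # Assumption: the slot value must match a slot label in resources.
    labels = slot_labels(jobspec.get("resources", []))
    if "slot" in task and task["slot"] not in labels:
        problems.append("slot does not match any slot label in resources")
    return problems
```

For example, a task with `slot: default` passes only if some slot vertex in resources carries `label: default`.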

Attributes

The attributes key SHALL be a dictionary of dictionaries. The attributes dictionary MUST contain the system key and MAY contain the user key. Common system keys are listed below, and their definitions can be found in RFC 14. Values MAY have any valid YAML type.

  • user

  • system

    • duration

    • environment

    • cwd

    • queue

    • dependencies

    • constraints

Most system attributes are optional, but the duration attribute is required in jobspec V1.
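The attribute requirements can be illustrated with a short Python sketch over an already-parsed attributes dictionary. The helper name `check_attributes` is hypothetical, and the requirement that duration be a non-negative number is taken from the Schema section of this RFC rather than from the prose above.

```python
def check_attributes(attributes):
    """Return a list of problems found in a jobspec V1 attributes dictionary."""
    if not isinstance(attributes, dict):
        return ["attributes must be a dictionary"]
    system = attributes.get("system")
    if not isinstance(system, dict):
        return ["attributes must contain a system dictionary"]
    problems = []
    if "user" in attributes and not isinstance(attributes["user"], dict):
        problems.append("user must be a dictionary")
    if "duration" not in system:
        problems.append("system.duration is required in jobspec V1")
    else:
        duration = system["duration"]
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
        if not is_number or duration < 0:
            # The non-negative number rule mirrors the Schema section.
            problems.append("system.duration must be a non-negative number")
    return problems


print(check_attributes({"system": {"duration": 3600.0, "cwd": "/home/flux"}}))  # prints []
```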

Example Jobspec

Under the description above, the following is an example of a fully compliant version 1 jobspec. The example declares a request for 4 nodes, each with 1 task slot consisting of 2 cores, for a total of 4 task slots. A single copy of the command app will be run in each task slot, for a total of 4 tasks.

version: 1
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core
            count: 2
tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
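The slot and task totals quoted for this example can be reproduced by multiplying counts down the resource graph. The following Python sketch assumes the jobspec has been parsed into a dictionary; `total_slots` and `total_tasks` are hypothetical helper names, and only the keys needed for the arithmetic are included.

```python
def total_slots(vertices, multiplier=1):
    """Multiply counts down the graph until a slot vertex is reached."""
    for vertex in vertices:
        total = multiplier * vertex["count"]
        if vertex["type"] == "slot":
            return total
        found = total_slots(vertex.get("with", []), total)
        if found:
            return found
    return 0


def total_tasks(jobspec):
    """Return the number of tasks implied by the single task's count key."""
    count = jobspec["tasks"][0]["count"]
    if "total" in count:
        return count["total"]
    return count["per_slot"] * total_slots(jobspec["resources"])


# The example jobspec above, reduced to the keys used by this sketch.
example = {
    "resources": [
        {"type": "node", "count": 4, "with": [
            {"type": "slot", "count": 1, "label": "default", "with": [
                {"type": "core", "count": 2},
            ]},
        ]},
    ],
    "tasks": [{"command": ["app"], "slot": "default", "count": {"per_slot": 1}}],
}
print(total_slots(example["resources"]))  # prints 4
print(total_tasks(example))               # prints 4
```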

Basic Use Cases

To implement basic resource manager functionality, the following use cases SHALL be supported by the jobspec:

Section 1: Node-level Requests

The following “node-level” requests are all requests to start an instance, i.e. run a single copy of flux start per allocated node. Many of these requests are similar to existing resource manager batch job submission or allocation requests, i.e. equivalent to oarsub, qsub, and salloc.

Use Case 1.1

Request nodes outside of a slot

Specific Example

Request 4 nodes, each with 1 slot

Existing Equivalents

Slurm

salloc -N4

PBS

qsub -l nodes=4

Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
    with:
    - type: slot
      count: 1
      label: default
      with:
        - type: core
          count: 1
tasks:
  - command: [ "flux", "start" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
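A node-level request of this shape can also be generated programmatically. The Python sketch below builds the same structure as the YAML above for an arbitrary node count. The function name `node_level_jobspec` and its parameters are hypothetical, the optional cwd and environment attributes are omitted for brevity, and the result is printed as JSON purely for inspection.

```python
import json


def node_level_jobspec(nnodes, duration, cores_per_slot=1):
    """Build a jobspec V1 dictionary that runs one 'flux start' per node.

    Hypothetical helper for illustration; not part of any Flux API.
    """
    return {
        "version": 1,
        "resources": [
            {"type": "node", "count": nnodes, "with": [
                {"type": "slot", "count": 1, "label": "default", "with": [
                    {"type": "core", "count": cores_per_slot},
                ]},
            ]},
        ],
        "tasks": [
            {"command": ["flux", "start"], "slot": "default",
             "count": {"per_slot": 1}},
        ],
        # duration is the one system attribute that V1 requires.
        "attributes": {"system": {"duration": duration}},
    }


print(json.dumps(node_level_jobspec(4, 3600.0), indent=2))
```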

Section 2: General Requests

The following use cases are more general and include more complex slot placement and task counts.

Use Case 2.1

Run N tasks across M nodes, unequal distribution

Specific Example

Run 5 copies of hostname across 4 nodes, default distribution

Existing Equivalents

Slurm

srun -n5 -N4 hostname

Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: myslot
        with:
          - type: core
            count: 1
tasks:
  - command: [ "hostname" ]
    slot: myslot
    count:
      total: 5
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"

Use Case 2.2

Run N tasks, Require M cores per task

Specific Example

Run 10 copies of myapp, require 2 cores per copy, for a total of 20 cores

Existing Equivalents

Slurm

srun -n10 -c 2 myapp

Jobspec YAML
version: 1
resources:
  - type: slot
    label: default
    count: 10
    with:
      - type: core
        count: 2
tasks:
  - command: [ "myapp" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"

Use Case 2.3

Run N tasks, Require M cores and J gpus per task

Specific Example

Run 10 copies of myapp, require 2 cores and 1 gpu per copy, for a total of 20 cores and 10 gpus

Jobspec YAML
version: 1
resources:
  - type: slot
    count: 10
    label: default
    with:
      - type: core
        count: 2
      - type: gpu
        count: 1
tasks:
  - command: [ "myapp" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"

Use Case 2.4

Run N tasks across M nodes, each task with 1 core and 1 gpu

Specific Example

Run 16 copies of myapp across 4 nodes, each copy with 1 core and 1 gpu

Existing Equivalents

Slurm

srun -n16 -N4 --gpus-per-task=1 myapp

Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
    with:
    - type: slot
      count: 4
      label: default
      with:
        - type: core
          count: 1
        - type: gpu
          count: 1
tasks:
  - command: [ "myapp" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
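The aggregate resources implied by a request such as Use Case 2.4 follow from multiplying each vertex count by the counts of its ancestors. The Python sketch below applies that arithmetic to the resources section above; `resource_totals` is a hypothetical helper name, and the resources are written as an already-parsed list.

```python
def resource_totals(vertices, multiplier=1, totals=None):
    """Return a dict mapping each resource type to its total requested count."""
    if totals is None:
        totals = {}
    for vertex in vertices:
        # A vertex's total is its own count times the totals of its ancestors.
        total = multiplier * vertex["count"]
        totals[vertex["type"]] = totals.get(vertex["type"], 0) + total
        resource_totals(vertex.get("with", []), total, totals)
    return totals


# The resources section of Use Case 2.4.
resources = [
    {"type": "node", "count": 4, "with": [
        {"type": "slot", "count": 4, "label": "default", "with": [
            {"type": "core", "count": 1},
            {"type": "gpu", "count": 1},
        ]},
    ]},
]
print(resource_totals(resources))
# prints {'node': 4, 'slot': 16, 'core': 16, 'gpu': 16}
```

With per_slot set to 1, the 16 slots correspond to the 16 copies of myapp in the specific example.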

Schema

A jobspec conforming to version 1 of the language definition SHALL adhere to the following ruleset, described using JSON Schema [2].

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://github.com/flux-framework/rfc/tree/master/data/spec_24/schema.json",
  "title": "jobspec-01",

  "description": "Flux jobspec version 1",

  "definitions": {
    "intranode_resource_vertex": {
      "description": "schema for resource vertices within a node, cannot have child vertices",
      "type": "object",
      "required": ["type", "count"],
      "properties": {
        "type": { "enum": ["core", "gpu"]},
        "count": { "type": "integer", "minimum" : 1 },
        "unit": { "type": "string" }
      },
      "additionalProperties": false
    },
    "node_vertex": {
      "description": "schema for the node resource vertex",
      "type": "object",
      "required": ["type", "count", "with"],
      "properties": {
        "type": { "enum" : ["node"] },
        "count": { "type": "integer", "minimum" : 1 },
        "unit": { "type": "string" },
        "with": {
          "type": "array",
          "minItems": 1,
          "maxItems": 1,
          "items": {
            "oneOf": [
              {"$ref": "#/definitions/slot_vertex"}
            ]
          }
        }
      },
      "additionalProperties": false
    },
    "slot_vertex": {
      "description": "special slot resource type - label assigns to task slot",
      "type": "object",
      "required": ["type", "count", "with", "label"],
      "properties": {
        "type": { "enum" : ["slot"] },
        "count": { "type": "integer", "minimum" : 1 },
        "unit": { "type": "string" },
        "label": { "type": "string" },
        "exclusive": { "type": "boolean" },
        "with": {
          "type": "array",
          "minItems": 1,
          "maxItems": 2,
          "items": {
            "oneOf": [
              {"$ref": "#/definitions/intranode_resource_vertex"}
            ]
          }
        }
      },
      "additionalProperties": false
    }
  },
  "type": "object",
  "required": ["version", "resources", "attributes", "tasks"],
  "properties": {
    "version": {
      "description": "the jobspec version",
      "type": "integer",
      "enum": [1]
    },
    "resources": {
      "description": "requested resources",
      "type": "array",
      "minItems": 1,
      "maxItems": 1,
      "items": {
        "oneOf": [
          { "$ref": "#/definitions/node_vertex" },
          { "$ref": "#/definitions/slot_vertex" }
        ]
      }
    },
    "attributes": {
      "description": "system and user attributes",
      "type": ["object", "null"],
      "properties": {
        "system": {
          "type": "object",
          "properties": {
            "duration": { "type": "number", "minimum": 0 },
            "cwd": { "type": "string" },
            "environment": { "type": "object" }
          }
        },
        "user": {
          "type": "object"
        }
      },
      "additionalProperties": false
    },
    "tasks": {
      "description": "task configuration",
      "type": "array",
      "maxItems": 1,
      "items": {
        "type": "object",
        "required": ["command", "slot", "count" ],
        "properties": {
          "command": {
            "type": ["string", "array"],
            "minItems": 1,
            "items": { "type": "string" }
          },
          "slot": { "type": "string" },
          "count": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "per_slot": { "type": "integer", "minimum" : 1 },
              "total": { "type": "integer", "minimum" : 1 }
            }
          }
        },
        "additionalProperties": false
      }
    }
  }
}

References