25/Job Specification Version 1
A domain specific language based on YAML is defined to express the resource requirements and other attributes of one or more programs submitted to a Flux instance for execution. This RFC describes the version 1 of jobspec, which represents a request to run exactly one program. This version is a simplified version of the canonical jobspec format described in RFC 14.
Name: github.com/flux-framework/rfc/spec_25.rst
Editor: Stephen Herbein <herbein1@llnl.gov>
State: raw
Language
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Goals
Express the resource requirements of a program to the scheduler.
Allow resource requirements to be expressed simply in terms of Nodes, CPUs, and GPUs.
Express program attributes, such as arguments, run time, and task layout, to be considered by the execution service.
Overview
This RFC describes the version 1 form of “jobspec”, a domain specific language based on YAML [1]. The version 1 of jobspec SHALL consist of a single YAML document representing a reusable request to run exactly one program. Hereafter, “jobspec” refers to the version 1 form, and “canonical jobspec” refers to the canonical form described in RFC 14.
Jobspec Language Definition
A jobspec V1 YAML document SHALL consist of a dictionary defining the resources, tasks and other attributes of a single program. The dictionary MUST contain the keys resources, tasks, attributes, and version.
Each of the listed jobspec keys SHALL meet the form and requirements listed in detail in the sections below. For reference, a ruleset for compliant jobspec V1 is provided in the Schema section below.
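As an illustration of the top-level requirement above, the following minimal Python sketch reports which required keys are missing from a jobspec that has already been parsed into a dictionary. The function name missing_top_level_keys is hypothetical and not part of any Flux API.

```python
# Hypothetical helper: checks only the four top-level keys named in the text.
REQUIRED_KEYS = {"resources", "tasks", "attributes", "version"}


def missing_top_level_keys(jobspec):
    """Return a sorted list of required top-level keys absent from jobspec."""
    if not isinstance(jobspec, dict):
        raise TypeError("a jobspec V1 document must be a dictionary")
    return sorted(REQUIRED_KEYS - set(jobspec))


# A document with only two of the four required keys.
partial = {"version": 1, "resources": []}
print(missing_top_level_keys(partial))
```

The check is intentionally shallow: it says nothing about whether the value of each key is well formed.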
Resources
The value of the resources key SHALL be a strict list which MUST define either node or slot as the first and only resource. Each list element SHALL represent a resource vertex (described below).
A resource vertex SHALL contain only the following keys:
- type
- count
- unit
- with
- label
A slot type resource vertex MAY also contain the following optional keys:
- exclusive
The definitions of unit, with, exclusive and label SHALL match those found in RFC14. The others are redefined and simplified to mean the following:
- type
  The type key for a resource SHALL indicate the type of resource to be matched. In V1, only four resource types are valid: [node, slot, core, and gpu]. slot types are described in the Reserved Resource Types section of RFC 14.
- count
  The count key SHALL indicate the desired number of resources matching the current vertex. The count SHALL be a single integer value representing a fixed count.
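The per-vertex rules above can be sketched as a small Python check. This is an illustration only: vertex_errors is a hypothetical name, and the lower bound of 1 on count is taken from the Schema section rather than from the prose definition.

```python
# Hypothetical helper illustrating the resource vertex rules; not a Flux API.
ALLOWED_KEYS = {"type", "count", "unit", "with", "label", "exclusive"}
VALID_TYPES = {"node", "slot", "core", "gpu"}


def vertex_errors(vertex):
    """Return a list of human-readable problems found in one resource vertex."""
    errors = []
    unknown = set(vertex) - ALLOWED_KEYS
    if unknown:
        errors.append(f"unknown keys: {sorted(unknown)}")
    if vertex.get("type") not in VALID_TYPES:
        errors.append(f"invalid type: {vertex.get('type')!r}")
    count = vertex.get("count")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(count, bool) or not isinstance(count, int) or count < 1:
        errors.append("count must be a single integer >= 1")
    return errors


print(vertex_errors({"type": "core", "count": 2}))
print(vertex_errors({"type": "socket", "count": 1.5}))
```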
V1-Specific Resource Graph Restrictions
In V1, the resources list MUST contain exactly one element, which MUST be either node or slot. Additionally, the resource graph MUST contain the core type.
In V1, there are also restrictions on which resources can have out edges to other resources. Specifically, a node can have an out edge to a slot, and a slot can have an out edge to a core. If a slot has an out edge to a core, it can also, optionally, have an out edge to a gpu as well. Therefore, the complete enumeration of valid resource graphs in V1 is:
slot>core
node>slot>core
slot>(core,gpu)
node>slot>(core,gpu)
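Because the set of valid graphs is small, one way to check a resources list is to render it in the notation used above and test membership. The following Python sketch does that; the names shape and is_valid_graph are hypothetical, and sorting sibling vertices so that (gpu,core) is treated the same as (core,gpu) is an assumption.

```python
# The four graph shapes enumerated in the text.
VALID_SHAPES = {
    "slot>core",
    "node>slot>core",
    "slot>(core,gpu)",
    "node>slot>(core,gpu)",
}


def shape(vertices):
    """Render sibling vertices in the notation above, e.g. 'node>slot>(core,gpu)'."""
    parts = []
    for vertex in vertices:
        children = vertex.get("with")
        suffix = ">" + shape(children) if children else ""
        parts.append(vertex["type"] + suffix)
    if len(parts) == 1:
        return parts[0]
    # Assumption: sibling order is not significant, so sort before joining.
    return "(" + ",".join(sorted(parts)) + ")"


def is_valid_graph(resources):
    """True if the resources list matches one of the four V1 shapes."""
    return shape(resources) in VALID_SHAPES


resources = [{
    "type": "node", "count": 4,
    "with": [{
        "type": "slot", "count": 4, "label": "default",
        "with": [{"type": "core", "count": 1}, {"type": "gpu", "count": 1}],
    }],
}]
print(shape(resources), is_valid_graph(resources))
```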
Tasks
The value of the tasks key SHALL be a strict list which MUST define exactly one task. The list element SHALL be a dictionary representing a task to run as part of the program. A task descriptor SHALL contain the following keys, whose definitions SHALL match those provided in RFC14:
- command
- slot
- count
  - per_slot
  - total
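A minimal Python sketch of the task rules follows. The name task_errors is hypothetical. The check that the task's slot value matches a slot label from the resources section is an assumption based on the examples in this document, where the two always agree; the restriction of count to the per_slot and total keys follows the Schema section.

```python
# Hypothetical helper illustrating the task rules; not a Flux API.
def task_errors(tasks, slot_labels):
    """Return a list of problems found in the value of the tasks key."""
    if not isinstance(tasks, list) or len(tasks) != 1:
        return ["tasks must be a list with exactly one element"]
    task = tasks[0]
    errors = [f"missing key: {key}"
              for key in ("command", "slot", "count") if key not in task]
    count = task.get("count", {})
    if not isinstance(count, dict):
        errors.append("count must be a dictionary")
    elif set(count) - {"per_slot", "total"}:
        errors.append("count may only contain per_slot and total")
    # Assumption: the task's slot names a slot label defined under resources.
    if "slot" in task and task["slot"] not in slot_labels:
        errors.append(f"unknown slot label: {task['slot']!r}")
    return errors


tasks = [{"command": ["app"], "slot": "default", "count": {"per_slot": 1}}]
print(task_errors(tasks, {"default"}))
print(task_errors(tasks, {"myslot"}))
```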
Attributes
The value of the attributes key SHALL be a dictionary of dictionaries. The attributes dictionary MUST contain the system key and MAY contain the user key. Common system keys are listed below, and their definitions can be found in RFC14. Values MAY have any valid YAML type.
- user
- system
  - duration
  - preemptible-after
  - environment
  - cwd
  - queue
  - dependencies
  - constraints
Most system attributes are optional, but the duration attribute is required in jobspec V1.
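The attribute requirements can be sketched in Python as follows. The name attribute_errors is hypothetical, and the requirement that duration be a non-negative number is taken from the Schema section rather than from the prose above.

```python
# Hypothetical helper illustrating the attributes rules; not a Flux API.
def attribute_errors(attributes):
    """Return a list of problems found in the value of the attributes key."""
    if not isinstance(attributes, dict):
        return ["attributes must be a dictionary"]
    errors = []
    system = attributes.get("system")
    if not isinstance(system, dict):
        errors.append("attributes must contain a system dictionary")
    else:
        duration = system.get("duration")
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if (isinstance(duration, bool)
                or not isinstance(duration, (int, float))
                or duration < 0):
            errors.append("system.duration is required and must be a number >= 0")
    if "user" in attributes and not isinstance(attributes["user"], dict):
        errors.append("user must be a dictionary")
    return errors


print(attribute_errors({"system": {"duration": 3600.0, "cwd": "/home/flux"}}))
print(attribute_errors({"system": {"cwd": "/home/flux"}}))
```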
Example Jobspec
Under the description above, the following is an example of a fully compliant version 1 jobspec. The example below declares a request for 4 “nodes”, each with 1 task slot consisting of 2 cores, for a total of 4 task slots. A single copy of the command app will be run on each task slot for a total of 4 tasks.
version: 1
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core
            count: 2
tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
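The totals quoted for this example can be reproduced with a short Python sketch. It assumes the jobspec has already been parsed into a dictionary (written out by hand below), that a slot count nested under a node is a per-node count as in the prose above, and that the hypothetical totals helper only needs to handle the V1 graph shapes.

```python
# The example jobspec, transcribed as a Python dictionary.
example = {
    "version": 1,
    "resources": [{
        "type": "node", "count": 4,
        "with": [{
            "type": "slot", "count": 1, "label": "default",
            "with": [{"type": "core", "count": 2}],
        }],
    }],
    "tasks": [{"command": ["app"], "slot": "default", "count": {"per_slot": 1}}],
    "attributes": {"system": {"duration": 3600.0, "cwd": "/home/flux",
                              "environment": {"HOME": "/home/flux"}}},
}


def totals(jobspec):
    """Return (task slots, cores, tasks) requested by a V1 jobspec."""
    top = jobspec["resources"][0]
    if top["type"] == "node":
        slot = top["with"][0]
        nslots = top["count"] * slot["count"]  # slot count is per node
    else:
        slot = top
        nslots = slot["count"]
    cores_per_slot = sum(v["count"] for v in slot["with"] if v["type"] == "core")
    count = jobspec["tasks"][0]["count"]
    ntasks = count["total"] if "total" in count else nslots * count["per_slot"]
    return nslots, nslots * cores_per_slot, ntasks


print(totals(example))  # (4, 8, 4): 4 task slots, 8 cores, 4 tasks
```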
Basic Use Cases
To implement basic resource manager functionality, the following use cases SHALL be supported by the jobspec:
Section 1: Node-level Requests
The following “node-level” requests are all requests to start an instance, i.e. run a single copy of flux start per allocated node. Many of these requests are similar to existing resource manager batch job submission or allocation requests, i.e. equivalent to oarsub, qsub, and salloc.
- Use Case 1.1
  Request nodes outside of a slot
- Specific Example
  Request 4 nodes, each with 1 slot
- Existing Equivalents
  Slurm: salloc -N4
  PBS: qsub -l nodes=4
- Jobspec YAML

  version: 1
  resources:
    - type: node
      count: 4
      with:
        - type: slot
          count: 1
          label: default
          with:
            - type: core
              count: 1
  tasks:
    - command: [ "flux", "start" ]
      slot: default
      count:
        per_slot: 1
  attributes:
    system:
      duration: 3600.
      cwd: "/home/flux"
      environment:
        HOME: "/home/flux"
Section 2: General Requests
The following use cases are more general and include more complex slot placement and task counts.
- Use Case 2.1
  Run N tasks across M nodes, unequal distribution
- Specific Example
  Run 5 copies of hostname across 4 nodes, default distribution
- Existing Equivalents
  Slurm: srun -n5 -N4 hostname
- Jobspec YAML

  version: 1
  resources:
    - type: node
      count: 4
      with:
        - type: slot
          count: 1
          label: myslot
          with:
            - type: core
              count: 1
  tasks:
    - command: [ "hostname" ]
      slot: myslot
      count:
        total: 5
  attributes:
    system:
      duration: 3600.
      cwd: "/home/flux"
      environment:
        HOME: "/home/flux"
- Use Case 2.2
  Run N tasks, require M cores per task
- Specific Example
  Run 10 copies of myapp, require 2 cores per copy, for a total of 20 cores
- Existing Equivalents
  Slurm: srun -n10 -c 2 myapp
- Jobspec YAML

  version: 1
  resources:
    - type: slot
      label: default
      count: 10
      with:
        - type: core
          count: 2
  tasks:
    - command: [ "myapp" ]
      slot: default
      count:
        per_slot: 1
  attributes:
    system:
      duration: 3600.
      cwd: "/home/flux"
      environment:
        HOME: "/home/flux"
- Use Case 2.3
  Run N tasks, require M cores and J gpus per task
- Specific Example
  Run 10 copies of myapp, require 2 cores and 1 gpu per copy, for a total of 20 cores and 10 gpus
- Jobspec YAML

  version: 1
  resources:
    - type: slot
      count: 10
      label: default
      with:
        - type: core
          count: 2
        - type: gpu
          count: 1
  tasks:
    - command: [ "myapp" ]
      slot: default
      count:
        per_slot: 1
  attributes:
    system:
      duration: 3600.
      cwd: "/home/flux"
      environment:
        HOME: "/home/flux"
- Use Case 2.4
  Run N tasks across M nodes, each task with 1 core and 1 gpu
- Specific Example
  Run 16 copies of myapp across 4 nodes, each copy with 1 core and 1 gpu
- Existing Equivalents
  Slurm: srun -n16 -N4 --gpus-per-task=1 myapp
- Jobspec YAML

  version: 1
  resources:
    - type: node
      count: 4
      with:
        - type: slot
          count: 4
          label: default
          with:
            - type: core
              count: 1
            - type: gpu
              count: 1
  tasks:
    - command: [ "myapp" ]
      slot: default
      count:
        per_slot: 1
  attributes:
    system:
      duration: 3600.
      cwd: "/home/flux"
      environment:
        HOME: "/home/flux"
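The general use cases in this section differ only in a handful of numbers, so they can be generated from one function. The Python sketch below is illustrative: make_jobspec and its parameters are hypothetical names, the slot label is fixed to default, and the cwd and environment attributes shown in the examples are omitted for brevity.

```python
# Hypothetical builder for the general use cases; not a Flux API.
def make_jobspec(command, slots, cores_per_slot=1, gpus_per_slot=0,
                 nodes=None, total_tasks=None, duration=3600.0):
    """Build a V1 jobspec dictionary.

    slots is a per-node count when nodes is given, otherwise an overall count.
    """
    slot_children = [{"type": "core", "count": cores_per_slot}]
    if gpus_per_slot > 0:
        slot_children.append({"type": "gpu", "count": gpus_per_slot})
    slot = {"type": "slot", "count": slots, "label": "default",
            "with": slot_children}
    if nodes is None:
        resources = [slot]
    else:
        resources = [{"type": "node", "count": nodes, "with": [slot]}]
    # One task per slot unless an explicit total is requested.
    count = {"per_slot": 1} if total_tasks is None else {"total": total_tasks}
    return {
        "version": 1,
        "resources": resources,
        "tasks": [{"command": list(command), "slot": "default", "count": count}],
        "attributes": {"system": {"duration": duration}},
    }


use_case_2_1 = make_jobspec(["hostname"], slots=1, nodes=4, total_tasks=5)
use_case_2_2 = make_jobspec(["myapp"], slots=10, cores_per_slot=2)
use_case_2_4 = make_jobspec(["myapp"], slots=4, gpus_per_slot=1, nodes=4)
```

Serializing any of these dictionaries to YAML should yield a document with the same structure as the corresponding use case, apart from the omitted attributes and the slot label in Use Case 2.1.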
Schema
A jobspec conforming to version 1 of the language definition SHALL adhere to the following ruleset, described using JSON Schema [2].
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://github.com/flux-framework/rfc/tree/master/data/spec_24/schema.json",
  "title": "jobspec-01",
  "description": "Flux jobspec version 1",
  "definitions": {
    "intranode_resource_vertex": {
      "description": "schema for resource vertices within a node, cannot have child vertices",
      "type": "object",
      "required": ["type", "count"],
      "properties": {
        "type": { "enum": ["core", "gpu"] },
        "count": { "type": "integer", "minimum": 1 },
        "unit": { "type": "string" }
      },
      "additionalProperties": false
    },
    "node_vertex": {
      "description": "schema for the node resource vertex",
      "type": "object",
      "required": ["type", "count", "with"],
      "properties": {
        "type": { "enum": ["node"] },
        "count": { "type": "integer", "minimum": 1 },
        "unit": { "type": "string" },
        "with": {
          "type": "array",
          "minItems": 1,
          "maxItems": 1,
          "items": {
            "oneOf": [
              { "$ref": "#/definitions/slot_vertex" }
            ]
          }
        }
      },
      "additionalProperties": false
    },
    "slot_vertex": {
      "description": "special slot resource type - label assigns to task slot",
      "type": "object",
      "required": ["type", "count", "with", "label"],
      "properties": {
        "type": { "enum": ["slot"] },
        "count": { "type": "integer", "minimum": 1 },
        "unit": { "type": "string" },
        "label": { "type": "string" },
        "exclusive": { "type": "boolean" },
        "with": {
          "type": "array",
          "minItems": 1,
          "maxItems": 2,
          "items": {
            "oneOf": [
              { "$ref": "#/definitions/intranode_resource_vertex" }
            ]
          }
        }
      },
      "additionalProperties": false
    }
  },
  "type": "object",
  "required": ["version", "resources", "attributes", "tasks"],
  "properties": {
    "version": {
      "description": "the jobspec version",
      "type": "integer",
      "enum": [1]
    },
    "resources": {
      "description": "requested resources",
      "type": "array",
      "minItems": 1,
      "maxItems": 1,
      "items": {
        "oneOf": [
          { "$ref": "#/definitions/node_vertex" },
          { "$ref": "#/definitions/slot_vertex" }
        ]
      }
    },
    "attributes": {
      "description": "system and user attributes",
      "type": ["object", "null"],
      "properties": {
        "system": {
          "type": "object",
          "properties": {
            "duration": { "type": "number", "minimum": 0 },
            "cwd": { "type": "string" },
            "environment": { "type": "object" }
          }
        },
        "user": {
          "type": "object"
        }
      },
      "additionalProperties": false
    },
    "tasks": {
      "description": "task configuration",
      "type": "array",
      "maxItems": 1,
      "items": {
        "type": "object",
        "required": ["command", "slot", "count"],
        "properties": {
          "command": {
            "type": ["string", "array"],
            "minItems": 1,
            "items": { "type": "string" }
          },
          "slot": { "type": "string" },
          "count": {
            "type": "object",
            "additionalProperties": false,
            "properties": {
              "per_slot": { "type": "integer", "minimum": 1 },
              "total": { "type": "integer", "minimum": 1 }
            }
          }
        },
        "additionalProperties": false
      }
    }
  }
}