25/Job Specification Version 1
A domain specific language based on YAML is defined to express the resource requirements and other attributes of one or more programs submitted to a Flux instance for execution. This RFC describes version 1 of jobspec, which represents a request to run exactly one program. Version 1 is a simplified form of the canonical jobspec format described in RFC 14.
Name: github.com/flux-framework/rfc/spec_25.rst
Editor: Stephen Herbein <herbein1@llnl.gov>
State: raw
Language
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119.
Goals
Express the resource requirements of a program to the scheduler.
Allow resource requirements to be expressed simply in terms of Nodes, CPUs, and GPUs.
Express program attributes such as arguments, run time, and task layout, to be considered by the execution service.
Overview
This RFC describes the version 1 form of “jobspec”, a domain specific language based on YAML [1]. Version 1 of jobspec SHALL consist of a single YAML document representing a reusable request to run exactly one program. Hereafter, “jobspec” refers to the version 1 form, and “canonical jobspec” refers to the canonical form described in RFC 14.
Jobspec Language Definition
A jobspec V1 YAML document SHALL consist of a dictionary
defining the resources, tasks and other attributes of a single
program. The dictionary MUST contain the keys resources, tasks,
attributes, and version.
Each of the listed jobspec keys SHALL meet the form and requirements listed in detail in the sections below. For reference, a ruleset for compliant jobspec V1 is provided in the Schema section below.
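As an illustration of the top-level requirement, the following minimal Python sketch reports which required keys are absent from a jobspec that has already been loaded into a dictionary. The function name `missing_top_level_keys` is hypothetical and is not part of any Flux API; only the four key names come from this RFC.

```python
# Required top-level keys, as listed in the Jobspec Language Definition.
REQUIRED_KEYS = {"resources", "tasks", "attributes", "version"}


def missing_top_level_keys(jobspec):
    """Return the sorted list of required top-level keys absent from jobspec.

    Hypothetical helper for illustration; it checks key presence only and
    does not validate the values stored under each key.
    """
    if not isinstance(jobspec, dict):
        raise TypeError("a jobspec V1 document must be a dictionary")
    return sorted(REQUIRED_KEYS - set(jobspec))


partial = {"version": 1, "resources": [], "tasks": []}
print(missing_top_level_keys(partial))  # ['attributes']
```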
Resources
The value of the resources key SHALL be a strict list which MUST define either
node or slot as the first and only resource. Each list element SHALL represent a
resource vertex (described below).
A resource vertex SHALL contain only the following keys:
type
count
unit
with
label
A slot type resource vertex MAY also contain the following optional key:
exclusive
The definitions of unit, with, exclusive and label SHALL match
those found in RFC 14. The others are redefined and simplified to mean the
following:
- type
The type key for a resource SHALL indicate the type of resource to be matched. In V1, only four resource types are valid: [node, slot, core, and gpu]. slot types are described in the Reserved Resource Types section of RFC 14.
- count
The count key SHALL indicate the desired number of resources matching the current vertex. The count SHALL be a single integer value representing a fixed count.
V1-Specific Resource Graph Restrictions
In V1, the resources list MUST contain exactly one element, which MUST be
either node or slot. Additionally, the resource graph MUST contain the
core type.
In V1, there are also restrictions on which resources can have out edges to
other resources. Specifically, a node can have an out edge to a slot, and a
slot can have an out edge to a core. If a slot has an out edge to a
core, it can also, optionally, have an out edge to a gpu as
well. Therefore, the complete enumeration of valid resource graphs in V1 is:
slot>core
node>slot>core
slot>(core,gpu)
node>slot>(core,gpu)
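The enumeration above can be checked mechanically. The sketch below renders a resource vertex in the same `a>b>(c,d)` notation and compares it with the four valid shapes. The helper names `graph_shape` and `is_valid_v1_graph` are hypothetical, and sorting sibling types is an assumption made here so that the order of core and gpu under a slot does not matter; this RFC does not state whether order is significant.

```python
# The four valid V1 resource graphs, in the notation used by this RFC.
VALID_SHAPES = {
    "slot>core",
    "node>slot>core",
    "slot>(core,gpu)",
    "node>slot>(core,gpu)",
}


def graph_shape(vertex):
    """Render a resource vertex and its children as e.g. 'node>slot>core'."""
    children = vertex.get("with", [])
    if not children:
        return vertex["type"]
    # Assumption: sibling order is not significant, so sort the child shapes.
    parts = sorted(graph_shape(child) for child in children)
    below = parts[0] if len(parts) == 1 else "(" + ",".join(parts) + ")"
    return vertex["type"] + ">" + below


def is_valid_v1_graph(resources):
    """True if the resources list has one element with a valid V1 shape."""
    return len(resources) == 1 and graph_shape(resources[0]) in VALID_SHAPES


resources = [
    {"type": "node", "count": 4, "with": [
        {"type": "slot", "count": 1, "label": "default", "with": [
            {"type": "core", "count": 2},
        ]},
    ]},
]
print(graph_shape(resources[0]))  # node>slot>core
```

This checks only the shape of the graph; counts, labels, and other keys are outside its scope.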
Tasks
The tasks key SHALL be a strict list which MUST define exactly one task.
The list element SHALL be a dictionary representing a task to run as part of
the program. A task descriptor SHALL contain the following keys, whose
definitions SHALL match those provided in RFC 14, with the
restriction in V1 that the count key does not support the per_resource
key:
command
slot
count
The count key SHALL contain exactly one of the following keys:
per_slot
total
The definitions of these keys SHALL match those provided in
RFC 14, with the following restrictions in V1: if per_slot
is used, its value MUST be one, and if total is used its value MUST be less
than or equal to the count key of the associated slot and greater than or
equal to the number of allocated nodes (if no node resource vertex is
explicitly given in the jobspec, then this minimum value will depend on the
instance resource configuration and/or scheduler used).
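The structural part of these restrictions can be sketched as a small check on the task count dictionary. The function name `check_task_count` is hypothetical. The sketch deliberately does not test how total relates to the slot count or to the number of allocated nodes, since that depends on the rest of the jobspec and on the instance; the lower bound of 1 on total is taken from the Schema section.

```python
def check_task_count(count):
    """Return a list of problems found in a task 'count' dictionary.

    Hypothetical helper: it covers only the key set, the per_slot value,
    and the schema minimum for total.
    """
    keys = set(count)
    if len(keys) != 1 or not keys <= {"per_slot", "total"}:
        # This also rejects per_resource, which V1 does not support.
        return ["count must contain exactly one of per_slot or total"]
    problems = []
    if "per_slot" in count and count["per_slot"] != 1:
        problems.append("per_slot must be 1 in V1")
    if "total" in count:
        total = count["total"]
        if not isinstance(total, int) or total < 1:
            problems.append("total must be a positive integer")
    return problems


print(check_task_count({"per_slot": 1}))  # []
print(check_task_count({"per_slot": 2}))  # ['per_slot must be 1 in V1']
```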
Attributes
The attributes key SHALL be a dictionary of dictionaries. The attributes
dictionary MUST contain the system key and MAY contain the user key.
Common system keys are listed below, and their definitions can be found in
RFC 14. Values MAY have any valid YAML type.
user
system
  duration
  preemptible-after
  environment
  cwd
  queue
  dependencies
  constraints
Most system attributes are optional, but the duration attribute is required in
jobspec V1.
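A minimal sketch of these attribute rules follows. The function name `check_attributes` is hypothetical. The rejection of keys other than system and user, and the requirement that duration be a non-negative number, are taken from the Schema section rather than from the prose above.

```python
def check_attributes(attributes):
    """Return a list of problems found in a jobspec V1 attributes dictionary.

    Hypothetical helper for illustration only.
    """
    if not isinstance(attributes, dict) or not isinstance(attributes.get("system"), dict):
        return ["attributes must be a dictionary containing a system dictionary"]
    problems = []
    extra = sorted(set(attributes) - {"system", "user"})
    if extra:
        problems.append("unexpected attributes keys: " + ", ".join(extra))
    duration = attributes["system"].get("duration")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
    if not is_number or duration < 0:
        problems.append("system.duration is required and must be a non-negative number")
    return problems


print(check_attributes({"system": {"duration": 3600.0, "cwd": "/home/flux"}}))  # []
```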
Example Jobspec
Under the description above, the following is an example of a fully compliant
version 1 jobspec. The example declares a request for 4 nodes, each with
1 task slot consisting of 2 cores, for a total of 4 task slots. A single
copy of the command app will be run on each task slot, for a total of
4 tasks.
version: 1
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core
            count: 2
tasks:
  - command: [ "app" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
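The totals quoted for this example can be checked by multiplying counts down the resource graph. The sketch below writes the same jobspec as a Python dictionary and accumulates a total per resource type; the function name `totals` is hypothetical.

```python
example = {
    "version": 1,
    "resources": [
        {"type": "node", "count": 4, "with": [
            {"type": "slot", "count": 1, "label": "default", "with": [
                {"type": "core", "count": 2},
            ]},
        ]},
    ],
    "tasks": [
        {"command": ["app"], "slot": "default", "count": {"per_slot": 1}},
    ],
    "attributes": {
        "system": {
            "duration": 3600.0,
            "cwd": "/home/flux",
            "environment": {"HOME": "/home/flux"},
        },
    },
}


def totals(jobspec):
    """Multiply counts down the resource graph and return per-type totals."""
    result = {}

    def walk(vertex, multiplier):
        n = multiplier * vertex["count"]
        result[vertex["type"]] = result.get(vertex["type"], 0) + n
        for child in vertex.get("with", []):
            walk(child, n)

    for vertex in jobspec["resources"]:
        walk(vertex, 1)
    return result


print(totals(example))  # {'node': 4, 'slot': 4, 'core': 8}
```

With per_slot set to 1, the number of tasks equals the slot total, which is 4 here.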
Basic Use Cases
To implement basic resource manager functionality, the following use cases SHALL be supported by the jobspec:
Section 1: Node-level Requests
The following “node-level” requests are all requests to start an instance,
i.e. run a single copy of flux start per allocated node. Many of these
requests are similar to existing resource manager batch job submission or
allocation requests, i.e. equivalent to oarsub, qsub, and salloc.
- Use Case 1.1
Request nodes outside of a slot
- Specific Example
Request 4 nodes, each with 1 slot
- Existing Equivalents
Slurm
salloc -N4
PBS
qsub -l nodes=4
- Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: default
        with:
          - type: core
            count: 1
tasks:
  - command: [ "flux", "start" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
Section 2: General Requests
The following use cases are more general and include more complex slot placement and task counts.
- Use Case 2.1
Run N tasks across M nodes, unequal distribution
- Specific Example
Run 5 copies of hostname across 4 nodes, default distribution
- Existing Equivalents
Slurm
srun -n5 -N4 hostname
- Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 1
        label: myslot
        with:
          - type: core
            count: 1
tasks:
  - command: [ "hostname" ]
    slot: myslot
    count:
      total: 5
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
- Use Case 2.2
Run N tasks, Require M cores per task
- Specific Example
Run 10 copies of myapp, require 2 cores per copy, for a total of 20 cores
- Existing Equivalents
Slurm
srun -n10 -c 2 myapp
- Jobspec YAML
version: 1
resources:
  - type: slot
    label: default
    count: 10
    with:
      - type: core
        count: 2
tasks:
  - command: [ "myapp" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
- Use Case 2.3
Run N tasks, Require M cores and J gpus per task
- Specific Example
Run 10 copies of myapp, require 2 cores and 1 gpu per copy, for a total of 20 cores and 10 gpus
- Jobspec YAML
version: 1
resources:
  - type: slot
    count: 10
    label: default
    with:
      - type: core
        count: 2
      - type: gpu
        count: 1
tasks:
  - command: [ "myapp" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
- Use Case 2.4
Run N tasks across M nodes, each task with 1 core and 1 gpu
- Specific Example
Run 16 copies of myapp across 4 nodes, each copy with 1 core and 1 gpu
- Existing Equivalents
Slurm
srun -n16 -N4 --gpus-per-task=1 myapp
- Jobspec YAML
version: 1
resources:
  - type: node
    count: 4
    with:
      - type: slot
        count: 4
        label: default
        with:
          - type: core
            count: 1
          - type: gpu
            count: 1
tasks:
  - command: [ "myapp" ]
    slot: default
    count:
      per_slot: 1
attributes:
  system:
    duration: 3600.
    cwd: "/home/flux"
    environment:
      HOME: "/home/flux"
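The use cases in this section that run one task per slot follow a common pattern, which the sketch below captures as a small generator. The function `build_jobspec` and its parameters are hypothetical and are not a Flux API. It assumes one task per slot and an even division of slots across nodes, so it does not cover the unequal distribution of Use Case 2.1. The result is printed as JSON for inspection only.

```python
import json


def build_jobspec(command, num_slots, cores_per_slot=1, gpus_per_slot=0,
                  num_nodes=None, duration=3600.0):
    """Build a V1 jobspec dictionary with one task per slot.

    Hypothetical helper: num_slots is the total number of slots, and
    num_nodes, if given, must divide it evenly.
    """
    children = [{"type": "core", "count": cores_per_slot}]
    if gpus_per_slot:
        children.append({"type": "gpu", "count": gpus_per_slot})
    slot = {"type": "slot", "count": num_slots, "label": "default",
            "with": children}
    if num_nodes is None:
        resources = [slot]
    else:
        if num_slots % num_nodes:
            raise ValueError("num_slots must divide evenly across num_nodes")
        # Under a node vertex, the slot count is per node.
        slot["count"] = num_slots // num_nodes
        resources = [{"type": "node", "count": num_nodes, "with": [slot]}]
    return {
        "version": 1,
        "resources": resources,
        "tasks": [{"command": list(command), "slot": "default",
                   "count": {"per_slot": 1}}],
        "attributes": {"system": {"duration": duration}},
    }


# Use Case 2.4: 16 copies of myapp across 4 nodes, 1 core and 1 gpu each.
spec = build_jobspec(["myapp"], 16, gpus_per_slot=1, num_nodes=4)
print(json.dumps(spec["resources"], indent=2))
```

Calling `build_jobspec(["myapp"], 10, cores_per_slot=2)` produces the slot-only resource graph of Use Case 2.2.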
Schema
A jobspec conforming to version 1 of the language definition SHALL adhere to the following ruleset, described using JSON Schema [2].
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "http://github.com/flux-framework/rfc/tree/master/data/spec_24/schema.json",
  "title": "jobspec-01",
  "description": "Flux jobspec version 1",
  "definitions": {
    "intranode_resource_vertex": {
      "description": "schema for resource vertices within a node, cannot have child vertices",
      "type": "object",
      "required": ["type", "count"],
      "properties": {
        "type": { "enum": ["core", "gpu"]},
        "count": { "type": "integer", "minimum" : 1 },
        "unit": { "type": "string" }
      },
      "additionalProperties": false
    },
    "node_vertex": {
      "description": "schema for the node resource vertex",
      "type": "object",
      "required": ["type", "count", "with"],
      "properties": {
        "type": { "const" : "node" },
        "count": { "type": "integer", "minimum" : 1 },
        "unit": { "type": "string" },
        "with": {
          "type": "array",
          "minItems": 1,
          "maxItems": 1,
          "items": {
            "oneOf": [
              {"$ref": "#/definitions/slot_vertex"}
            ]
          }
        }
      },
      "additionalProperties": false
    },
    "slot_vertex": {
      "description": "special slot resource type - label assigns to task slot",
      "type": "object",
      "required": ["type", "count", "with", "label"],
      "properties": {
        "type": { "const" : "slot" },
        "count": { "type": "integer", "minimum" : 1 },
        "unit": { "type": "string" },
        "label": { "type": "string" },
        "exclusive": { "type": "boolean" },
        "with": {
          "type": "array",
          "minItems": 1,
          "maxItems": 2,
          "items": {
            "oneOf": [
              {"$ref": "#/definitions/intranode_resource_vertex"}
            ]
          }
        }
      },
      "additionalProperties": false
    }
  },
  "type": "object",
  "required": ["version", "resources", "attributes", "tasks"],
  "properties": {
    "version": {
      "description": "the jobspec version",
      "type": "integer",
      "const": 1
    },
    "resources": {
      "description": "requested resources",
      "type": "array",
      "minItems": 1,
      "maxItems": 1,
      "items": {
        "oneOf": [
          { "$ref": "#/definitions/node_vertex" },
          { "$ref": "#/definitions/slot_vertex" }
        ]
      }
    },
    "attributes": {
      "description": "system and user attributes",
      "type": ["object", "null"],
      "properties": {
        "system": {
          "type": "object",
          "properties": {
            "duration": { "type": "number", "minimum": 0 },
            "cwd": { "type": "string" },
            "environment": { "type": "object" }
          }
        },
        "user": {
          "type": "object"
        }
      },
      "additionalProperties": false
    },
    "tasks": {
      "description": "task configuration",
      "type": "array",
      "maxItems": 1,
      "items": {
        "type": "object",
        "required": ["command", "slot", "count"],
        "properties": {
          "command": {
            "type": ["string", "array"],
            "minItems": 1,
            "items": { "type": "string" }
          },
          "slot": { "type": "string" },
          "count": {
            "type": "object",
            "minProperties": 1,
            "maxProperties": 1,
            "additionalProperties": false,
            "properties": {
              "per_slot": { "type": "integer", "const" : 1 },
              "total": { "type": "integer", "minimum" : 1 }
            }
          }
        },
        "additionalProperties": false
      }
    }
  }
}