flux-shell-options(7)

DESCRIPTION

On startup, flux shell examines the jobspec for shell-specific options under the attributes.system.shell.options key. These options control shell behavior and features including I/O redirection, CPU/GPU affinity, signal handling, plugin configuration, and more.

Shell options may be set via flux submit -o, --setopt=OPT option, or explicitly added to the jobspec by other means.

Options may be simple boolean switches (e.g., verbose) or may take arguments. Since jobspec is a JSON document, shell options can accept complex JSON objects as arguments, enabling flexible runtime configuration.

Option Format

Options specified without a value get the default value of 1:

$ flux run -o verbose myapp
$ flux run -o pty myapp

To specify a boolean value use true or false explicitly:

$ flux run -o bool-option=true

Options with simple scalar arguments use = syntax:

$ flux run -o verbose=2 myapp
$ flux run -o exit-timeout=5m myapp

Options with object arguments specify JSON:

$ flux run -o 'cpu-affinity={"verbose":true}' myapp

Note: Most options documented here use simplified syntax. See individual option descriptions for object-based configuration when available.

Command-Line Convenience Options

Many shell options have corresponding command-line options in the Flux submission commands (flux-run(1), flux-submit(1), flux-batch(1), flux-alloc(1)) that provide convenient syntax for common use cases. When available, these command-line options should be preferred over setting shell options directly with -o.

For example, use:

$ flux submit --signal=SIGUSR1@60s myapp

rather than:

$ flux submit -o signal.signum=10 -o signal.timeleft=60 myapp

Throughout this manual, cross-references to command-line options indicate when a more convenient interface is available.

CORE OPTIONS

Shell Configuration

verbose[=INT]

Set the shell verbosity to INT. Higher values increase verbosity:

0: Errors only (default)
1: Informational messages
2: Debug messages including periodic resource monitoring

$ flux run -o verbose=2 myapp

initrc=FILE

Load flux shell initrc.lua file from FILE instead of the default system initrc path ($sysconfdir/flux/shell/initrc.lua).

Warning: This completely replaces the system initrc, potentially bypassing default plugin loading. Use userrc instead to extend the system configuration.

$ flux run -o initrc=/custom/initrc.lua myapp

userrc=FILE

Load an additional initrc.lua file after the system initrc. This is the recommended way to customize shell behavior without bypassing system defaults.

For details of the initrc file format and available functions, see flux-shell-initrc(5).

$ flux run -o userrc=$HOME/.flux/shell-initrc.lua myapp

PROCESS MANAGEMENT

nosetpgrp

Disable the use of setpgrp(2) to launch each job task in its own process group.

By default, the shell places each task in its own process group to ensure signals can be delivered independently. With nosetpgrp, tasks remain in the shell's process group, meaning signals will only be delivered to direct children of the shell.

This option is rarely needed but may be useful for debugging or when working with tools that expect a specific process group structure.

stop-tasks-in-exec

Stop tasks in exec() using PTRACE_TRACEME. This causes each task to stop immediately before execve(2), allowing a debugger to attach.

Used internally by debugging tools. Users should not need to set this option directly.

oom.adjust=VALUE

Adjust each task's OOM (Out Of Memory) score to influence Linux's OOM killer behavior when system memory is critically low.

Value range: -1000 to 1000
1000: Maximize probability of being killed (prefer killing this job)
0: Default system behavior
-1000: Minimize probability of being killed (requires privilege)

Setting negative values typically requires CAP_SYS_RESOURCE capability.

For more information, refer to oom_score_adj in proc(5).

$ flux run -o oom.adjust=500 myapp

rlimit

A dictionary of soft process resource limits to apply before starting tasks. Resource limits are specified by lowercase name without the RLIMIT_ prefix, with integer values.

Common limits include: core, cpu, data, fsize, nofile, nproc, stack.

See setrlimit(2) for available limits and flux-submit --rlimit for command-line syntax.

EXIT BEHAVIOR

exit-timeout=VALUE

When the first task in a job exits, a timer starts with duration specified by VALUE (in Flux Standard Duration format). If the timer expires before all tasks complete, a fatal exception is raised.

Default: 30s for normal parallel jobs (flux-submit(1), flux-run(1))
Default: none for Flux instances (flux-batch(1), flux-alloc(1))

Valid VALUE formats: - FSD string: 30s, 5m, 1h - none: Disable timeout completely

This timeout helps detect hung jobs where some tasks fail to exit properly.

$ flux run -o exit-timeout=5m myapp
$ flux run -o exit-timeout=none myapp

exit-on-error

Raise a fatal job exception immediately if the first task to exit was signaled or exited with a nonzero status.

Without this option, the shell waits for exit-timeout to expire or all tasks to exit before exiting. With this option, the job fails fast on the first error.

Useful for parallel workflows where one task failure invalidates the entire computation.

$ flux run -o exit-on-error myapp

SIGNAL HANDLING

signal=OPTION

Configure delivery of warning signal before job time limit expiration.

This option is most easily set via flux-submit --signal, which provides a convenient syntax for common use cases. --signal option should be preferred over setting this shell option directly.

signal.signum=NUMBER: Signal number to send before time limit expiration. Must be used with signal.timeleft.

signal.timeleft=TIME: Send signal.signum this amount of time before job expiration. TIME as an integer number of seconds or a string in Flux Standard Duration format.

CPU AFFINITY

cpu-affinity=OPT

Control CPU affinity binding for tasks. If unspecified, defaults to on (bind each task to all allocated cores).

OPT may be:

on

Bind each task to the full set of cores allocated to the job. This is the default and prevents tasks from being scheduled on cores outside the job allocation.

off

Disable CPU affinity completely. Tasks may float across all system cores. Useful when using external affinity tools like mpibind.

$ flux run -o cpu-affinity=off myapp

per-task

Divide allocated cores evenly among local tasks. Each task is bound to its own subset of cores. If there are more tasks than cores, tasks share cores as evenly as possible.

$ flux run -n 4 -o cpu-affinity=per-task myapp

For a 4-task job on a node with 8 cores, each task receives 2 cores.

map:LIST

Explicitly specify CPU binding for each task using a semicolon-delimited list. Each entry can use hwloc(7) list, bitmask, or taskset format.

See hwlocality_bitmap(3) for format details.

$ flux run -n 3 -o 'cpu-affinity=map:0-1;2-3;4-5' myapp

Task 0 binds to cores 0-1, task 1 to cores 2-3, task 2 to cores 4-5.

verbose

Log the cpuset assigned to the shell and each task. Must be combined with other options using comma separation, and must appear first.

$ flux run -o cpu-affinity=verbose,per-task myapp

dry-run

Print cpusets without actually applying affinity bindings. Implies verbose. Useful for testing affinity configurations. Must appear first when combined with other options.

$ flux run -o cpu-affinity=dry-run,per-task myapp

GPU AFFINITY

gpu-affinity=OPT

Control GPU device visibility via CUDA_VISIBLE_DEVICES. If unspecified, defaults to on (each task sees all GPUs allocated to the job).

OPT may be:

on

Set CUDA_VISIBLE_DEVICES to include all GPUs allocated to the job. All tasks see the same GPU set.

$ flux run -o gpu-affinity=on myapp

off

Disable the gpu-affinity plugin. CUDA_VISIBLE_DEVICES will not be set by the shell.

$ flux run -o gpu-affinity=off myapp

per-task

Divide allocated GPUs evenly among local tasks. Each task's CUDA_VISIBLE_DEVICES includes only its assigned GPUs. If there are more tasks than GPUs, tasks share GPUs as evenly as possible.

$ flux run -n 4 -o gpu-affinity=per-task myapp

map:LIST

Explicitly specify GPU assignment for each task using a semicolon-delimited list. Format is the same as cpu-affinity=map:LIST.

$ flux run -n 2 -o 'gpu-affinity=map:0;1' myapp

Task 0 sees GPU 0, task 1 sees GPU 1.

INPUT/OUTPUT

pty

Allocate a pseudo-terminal (pty) to all task ranks. Output is captured to the same location as stdout.

Equivalent to: pty.ranks=all pty.capture

$ flux run -o pty myapp

pty.ranks=OPT

Specify which task ranks should have a pty allocated. OPT may be:

An RFC 22 IDset (e.g., 0-3,5)
A single integer rank
The string all for all ranks

$ flux run -o pty.ranks=0-3 myapp

pty.capture: Capture pty output to the KVS alongside stdout. This is the default unless pty.interactive is set.

pty.interactive

Enable an interactive pty on rank 0, suitable for use with flux job attach(7).

By default, only rank 0 gets a pty and output is not captured. Override these defaults with additional pty options:

$ flux run -o pty.interactive -o pty.capture myapp

output.{stdout,stderr}.type=TYPE

Set output destination for stdout/stderr. TYPE may be:

kvs: Store output in the KVS (default). Retrieved via flux job attach(7).
term: Write directly to terminal (bypasses KVS).
file: Write to a file. Requires output.<stream>.path to be set.

If only output.stdout.type is set, it applies to both streams.

output.{stdout,stderr}.path=PATH

Set file path for stdout/stderr when output.<stream>.type=file.

Supports mustache templates for dynamic paths. See MUSTACHE TEMPLATES in flux-submit(1) for full documentation.

output.limit=SIZE

Limit KVS output to SIZE bytes per stream. Once exceeded, output is truncated.

SIZE format: number with optional SI suffix (k, K, M, G)
Maximum: 1G
Default: 10M (multi-user instance), 1G (single-user instance)
Ignored for file output

$ flux run -o output.limit=50M myapp

output.mode=MODE

Set file opening mode when writing output to files. MODE may be:

truncate: Overwrite existing files (default).
append: Append to existing files.

output.{stdout,stderr}.buffer.type=[none|line]: Set buffer type for stdout or stderr to line buffered or unbuffered (none). The default is line-buffered for stdout and unbuffered for stderr. See also the flux-submit --unbuffered option.

output.client.{lwm,hwm}=N

Configure flow control for output aggregation on the leader shell.

Flow control prevents unbounded memory growth when tasks produce output faster than it can be consumed. The shell uses a credit-based protocol:

Each shell starts with credits equal to hwm
When credits drop to lwm, request more from leader
At zero credits, output handling stops (tasks may block)
When credits arrive, output handling resumes

Default values: lwm=100, hwm=1000

These defaults are suitable for most cases. Adjust if experiencing output stalls or excessive memory use.

$ flux run -o output.client.lwm=50 -o output.client.hwm=500 myapp

input.stdin.type=TYPE

Set input source for stdin. TYPE may be:

service: Read stdin from the shell's stdin service (default for interactive jobs).
file: Read stdin from a file.

TASK MAPPING

taskmap

Request custom task-to-node mapping. This option is an object with required key scheme and optional key value.

The shell invokes taskmap.scheme plugin callbacks to generate the mapping. If value is set, it's passed to the plugin.

Built-in schemes: block, cyclic, hostfile, manual

See flux-submit --taskmap for command-line syntax and flux-shell-plugins(7) for implementing custom taskmap plugins.

$ flux submit --taskmap=cyclic myapp

PMI CONFIGURATION

pmi=LIST

Specify comma-separated list of PMI implementations to enable. Default: simple

Available implementations:

simple (alias: pmi1, pmi2): The simple PMI-1 wire protocol. Passes an open file descriptor via PMI_FD. Required for Flux's libpmi.so and libpmi2.so. Preferred when Flux launches Flux (e.g., flux-batch(1)).
off: Disable PMI completely.
cray-pals: External plugin from flux-coral2.
pmix: Provided via external plugin from the flux-pmix. project

pmi-simple.nomap

Skip pre-populating flux.taskmap and PMI_process_mapping keys in the simple PMI implementation.

Reduces PMI setup overhead when these keys are not needed.

pmi-simple.exchange.k=N

Configure PMI key exchange to use a virtual tree with fanout N. Default: 2

Higher fanout reduces exchange time but increases message size and leader load.

STAGE-IN

The stage-in feature copies files from archived content into the job's temporary directory before task execution. Files must be previously archived using flux-archive(1).

stage-in

Enable stage-in. Copy files to the directory referenced by FLUX_JOB_TMPDIR that were previously archived with flux-archive(1).

$ flux run -o stage-in myapp

stage-in.names=LIST

Comma-separated list of archive names to extract. If no names are specified, main is assumed.

$ flux run -o stage-in.names=main,data myapp

stage-in.pattern=PATTERN

Filter extracted files using a glob(7) pattern.

$ flux run -o 'stage-in.pattern=*.dat' myapp

stage-in.destination=[SCOPE:]PATH

Extract to PATH instead of FLUX_JOB_TMPDIR.

SCOPE may be:

local: Local file system (default). Extraction occurs on all nodes.
global: Global file system. Extraction occurs only on first node.

Warning

When using custom destinations, you must handle cleanup yourself. FLUX_JOB_TMPDIR is automatically cleaned up and guaranteed to be unique.

$ flux run -o 'stage-in.destination=global:/scratch/job-data' myapp

HWLOC CONFIGURATION

hwloc.xmlfile

Export the job shell's hwloc XML topology to a file and set HWLOC_XMLFILE for all tasks.

This allows tasks to query the hardware topology without scanning the system directly. Also unsets HWLOC_COMPONENTS which may interfere with HWLOC_XMLFILE.

$ flux run -o hwloc.xmlfile myapp

hwloc.restrict

Restrict the exported hwloc XML to only resources allocated to the job.

Must be used with hwloc.xmlfile. Without this option, the full node topology is exported.

$ flux run -o hwloc.xmlfile -o hwloc.restrict myapp

MONITORING

sysmon

Enable system resource monitoring. Logs peak memory usage and CPU load average for each shell at job completion.

With verbose=2, also traces memory and CPU usage periodically during execution.

$ flux run -o sysmon myapp

sysmon.period=FSD

Set monitoring sample period in Flux Standard Duration format. Default: follows Flux heartbeat (typically 2 seconds)

$ flux run -o sysmon -o sysmon.period=5s myapp

REXEC OPTIONS

rexec-shutdown-timeout=FSD

Set timeout for processes launched via the rexec server to exit gracefully after all tasks complete.

When all tasks exit, rexec processes receive SIGTERM. If they don't exit within this timeout, they receive SIGKILL.

Default: 60s
$ flux run -o rexec-shutdown-timeout=30s myapp

rexec.sign-required

Require RFC 42 request signatures on all requests to the rexec subprocess server. By default, signatures are only required when the job owner differs from the instance owner (i.e., the shell is running as a guest user). This option forces signature verification in all cases, which is primarily useful for testing the signing code paths in a single-user Flux instance.

$ flux submit -o rexec.sign-required myapp

EXTERNAL PLUGIN OPTIONS

External job shell plugins may define additional options. Refer to plugin-specific documentation for details. Plugin options typically use a namespace prefix matching the plugin name (e.g., myplugin.enabled).

See flux-shell-plugins(7) for plugin development and configuration details.

RESOURCES

Flux: http://flux-framework.org

Flux RFC: https://flux-framework.readthedocs.io/projects/flux-rfc

Issue Tracker: https://github.com/flux-framework/flux-core/issues