flux-config-exec(5)

DESCRIPTION

The exec system is highly configurable. If configuring a Flux system instance for the first time, it may be helpful to consult the Flux Administrator's Guide (see RESOURCES) and start with a simple configuration. See also EXAMPLES below.

The Flux system instance job-exec service requires additional configuration via the exec table, for example to enlist the services of a setuid helper to launch jobs as guests.

The exec table may contain the following keys:

KEYS

imp

(optional) Set the path to the IMP (Independent Minister of Privilege) helper program, as described in RFC 15, so that jobs may be launched with the credentials of the guest user that submitted them. If unset, only jobs submitted by the instance owner may be executed.

service

(optional) Set the remote subprocess service name. (Default: rexec). Note that systemd.enable must be set to true if sdexec is configured. See flux-config-systemd(5).

service-override

(optional) Allow service to be overridden on a per-job basis with --setattr system.exec.bulkexec.service=NAME. (Default: false).

job-shell

(optional) Override the compiled-in default job shell path.

shell-exit-timeout

(optional) Time to wait after the leader shell (rank 0) exits normally before raising a fatal exception if other shells remain active. Set to "none" to disable. The default is "30s".

This keeps a job from hanging when a remote shell or wrapper fails to exit after the leader terminates. When the timeout expires, the resulting exception begins the job termination sequence, cleaning up the presumably stuck shells.

Note

The timeout does not apply if the job already has a fatal exception (cancel, timeout, etc) when rank 0 exits. In that case the normal termination sequence already ensures that remaining shells exit.

kill-timeout

(optional) The amount of time in Flux Standard Duration (FSD) to wait after SIGTERM is sent to a job before sending SIGKILL. FSD is a human-readable time format supporting units like "5s" (seconds), "2m" (minutes), "1h" (hours), "3d" (days). See RFC 23 for complete specification. The default is "5s" (5 seconds). See JOB TERMINATION below for details.
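As a rough illustration of how the duration strings above map to seconds, here is a minimal sketch. The helper name fsd_to_seconds is hypothetical, it handles only the four suffixes named above, and it assumes a bare number means seconds; RFC 23 remains the authoritative definition.

```python
import re

# Multipliers for the unit suffixes named in the text.
# Assumption: a value with no suffix is interpreted as seconds.
_FSD_UNITS = {"": 1.0, "s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def fsd_to_seconds(text):
    """Convert a simple FSD string such as "5s" or "2m" to seconds.

    Hypothetical helper for illustration only; not part of Flux.
    """
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([smhd]?)", text.strip())
    if match is None:
        raise ValueError(f"not a supported duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _FSD_UNITS[unit]


print(fsd_to_seconds("5s"))   # the default kill-timeout, in seconds
print(fsd_to_seconds("30m"))  # a 30 minute timeout, in seconds
```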

max-kill-count

(optional) The maximum number of times kill-signal will be sent to the job shell before the execution system considers the job unkillable and drains the node. The default is 8. Note that the node is drained immediately after the final kill attempt without waiting an additional timeout period. See JOB TERMINATION below for details.

max-kill-timeout

(optional) The maximum amount of time in FSD to wait for a job to terminate before draining nodes with unkillable processes. When set, this overrides max-kill-count by continuing the kill signal escalation sequence until the specified duration has elapsed since termination began. The default is unset (use max-kill-count instead). See JOB TERMINATION below for details.

Example: max-kill-timeout = "30m" gives jobs up to 30 minutes to respond to termination signals before affected nodes are drained.

Note

Choosing between max-kill-count and max-kill-timeout:

Use max-kill-timeout when you want to specify how long to wait before giving up on a job, such as enforcing a site policy that jobs get 30 minutes to clean up before nodes are drained. This is simpler and more intuitive than calculating the number of kill attempts needed.

Use max-kill-count when you want fine-grained control over the number of escalation attempts regardless of timing, or when maintaining backward compatibility with existing configurations.

When both are set, max-kill-timeout takes precedence.

term-signal

(optional) A string specifying an alternate signal to SIGTERM when terminating job tasks. Mainly used for testing.

kill-signal

(optional) A string specifying an alternate signal to SIGKILL when killing tasks and the job shell. Mainly used for testing.

barrier-timeout

(optional) Specify the default job shell start barrier timeout in FSD. All multi-node jobs enter a barrier at startup once the Flux job shell completes initialization tasks such as changing the working directory and processing the initrc file. Once the first node enters this barrier, the job execution system starts a timer, and if the timer expires before the barrier is complete, raises a job exception and drains the nodes on which the barrier is waiting. To disable the barrier timeout, set this value to "0". (Default: 30m).

max-start-delay-percent

(optional) Specify the maximum allowed delay, as a percentage of a job's duration, between when a job is allocated (i.e. the starttime recorded in R) and when the execution system receives the start request from the job manager. If the delay exceeds this percentage, then extend the job's effective expiration by the delay. This prevents short duration jobs from having their runtime significantly reduced, while avoiding a differential between the actual resource set expiration and the time at which a timeout exception is raised for longer running jobs, where any runtime impact will be negligible. The default is 25 percent.

testexec

(optional) A table of keys (see TESTEXEC) for configuring the job-exec test execution implementation (used mainly for testing).
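The max-start-delay-percent rule described above can be sketched as a small function. This is an illustration of the stated rule only, not the job-exec implementation: the function and parameter names are hypothetical, and all times are assumed to be seconds on a common clock.

```python
def effective_expiration(starttime, expiration, start_request_time,
                         max_delay_percent=25.0):
    """Return the expiration to enforce for a job.

    Hypothetical sketch: if the delay between allocation (starttime) and
    the start request exceeds max_delay_percent of the job's duration,
    the expiration is extended by that delay; otherwise it is unchanged.
    """
    duration = expiration - starttime
    delay = start_request_time - starttime
    if delay > duration * max_delay_percent / 100.0:
        return expiration + delay
    return expiration


# 60s job whose start request arrives 20s late (33% of its duration):
print(effective_expiration(0.0, 60.0, 20.0))
# 1h job with the same 20s delay (under 1% of its duration):
print(effective_expiration(0.0, 3600.0, 20.0))
```

In the first call the short job keeps its full 60 seconds of runtime; in the second the delay is small relative to the duration, so the expiration stays aligned with the resource set.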

SDEXEC CONFIGURATION

When using the systemd execution service (service = "sdexec"), additional configuration options control how systemd manages job units and interacts with job-exec's termination sequence. See SDEXEC AND JOB TERMINATION INTERACTION for important information about how these settings interact with the job termination sequence described in JOB TERMINATION.

Tip

The current configuration values and derived settings can be inspected at runtime using flux module stats job-exec. See CONFIGURATION INTROSPECTION for details.

sdexec-constrain-resources

(optional) Boolean value that enables resource containment for jobs. When enabled, the sdexec-mapper module translates job resource allocations (cores, GPUs) into systemd unit properties that constrain jobs to their allocated resources, preventing access to resources owned by other jobs. (Default: false).

When enabled, the mapper sets:

AllowedCPUs - Restricts job to allocated CPU cores
AllowedMemoryNodes - Restricts job to NUMA nodes for allocated cores
DeviceAllow - Grants access only to allocated GPUs
DevicePolicy=closed - Blocks access to physical devices except those explicitly allowed, while permitting standard pseudo devices like /dev/null, /dev/zero, etc.

If memory cap properties (MemoryHigh, MemoryMax, MemorySwapMax) are set in exec.sdexec-properties, the default mapper scales them by the ratio of allocated to total processing units (hardware threads) on the node, so each co-located job is limited proportionally to its CPU share.

After each job unit starts, Flux verifies that the expected CPU set is enforced. This check is necessary because systemd may silently accept AllowedCPUs without enforcing it if the cpuset controller is not properly delegated. If the check fails, Flux drains the node as a likely misconfiguration that would affect all subsequent jobs.

See SDEXEC RESOURCE MAPPER for information about customizing the resource mapping behavior.

Requirements:

cgroups v2 (unified hierarchy)
cpuset cgroup controller delegated to user systemd instance
flux-security >= 0.14.0

See the Systemd and cgroup unified hierarchy section of the Flux Administrator's Guide for the required systemd override file configuration.

Example:

[exec]
service = "sdexec"
sdexec-constrain-resources = true

sdexec-properties

(optional) A table of systemd properties to set for all jobs. All values must be strings. See SDEXEC PROPERTIES below.

sdexec-stop-timer-sec

(optional) Configure the length of time in seconds after a unit enters deactivating state when it will be sent the sdexec-stop-timer-signal. Deactivating state is entered by imp-shell units when the flux-shell(1) terminates. The unit may remain there as long as user processes remain in the unit's cgroup.

If the unit still has not terminated after a further period of the same length, for example due to unkillable processes, the unit is abandoned and the node is drained.

Default: The effective max-kill-timeout value rounded up to the nearest integer (see JOB TERMINATION and SDEXEC AND JOB TERMINATION INTERACTION). For example, if the effective max-kill-timeout is 1220.5 seconds, the default sdexec-stop-timer-sec will be 1221 seconds. This ensures systemd waits at least 2*max-kill-timeout (one period before sending SIGKILL, one period before abandoning the unit) before draining the node, allowing job-exec's normal termination sequence to complete. This can be overridden by explicitly setting this value.

Use flux module stats job-exec to inspect the current effective value. See CONFIGURATION INTROSPECTION.

sdexec-stop-timer-signal

(optional) Configure the signal used by the stop timer. By default, 10 (SIGUSR1, the IMP proxy for SIGKILL) is used.

SDEXEC PROPERTIES

When the sdexec service is selected, systemd unit properties may be set by adding them to the sdexec-properties sub-table. All values must be specified as TOML strings. Properties that require other value types can only be specified if Flux knows about them so it can perform type conversion. Those are:

MemoryMax

Specify the node memory budget available to jobs. The value may be an absolute size in bytes (with optional K, M, G, or T suffix using base-1024 multipliers), a percentage of physical memory, or "infinity" to apply no limit.

With node-exclusive scheduling (one job per node), this value applies directly to the job's systemd unit, capping total memory use to protect system processes from being OOM-killed by a runaway job.

When sdexec-constrain-resources is enabled and jobs share a node, the default HwlocMapper scales this value by the ratio of allocated to total processing units so that co-located jobs collectively stay within the budget. Setting a fixed per-job value is not meaningful in this context since the number of co-located jobs is not known at configuration time. For example, MemoryMax = "95%" on a node with 64 hardware threads gives a 4-thread job a limit of round(95 * 4/64) = 6% of physical memory.
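The proportional scaling described above can be sketched as follows. The function names are hypothetical and this is not the HwlocMapper code. The percentage case follows the rounding stated in the text; the integer truncation used for absolute sizes is an assumption.

```python
# Base-1024 multipliers for the optional size suffixes.
_SUFFIXES = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def scale_percent(percent, allocated_pus, total_pus):
    """Scale a percentage budget by the job's share of processing units."""
    return round(percent * allocated_pus / total_pus)


def scale_bytes(size, allocated_pus, total_pus):
    """Scale an absolute size such as "64G" and return a byte count.

    Assumption: the result is truncated to a whole number of bytes.
    """
    if size[-1] in "KMGT":
        number, suffix = size[:-1], size[-1]
    else:
        number, suffix = size, ""
    return int(number) * _SUFFIXES[suffix] * allocated_pus // total_pus


# The example from the text: a 4-thread job on a 64-thread node.
print(scale_percent(95, 4, 64))
# The same job with an absolute 64G node budget.
print(scale_bytes("64G", 4, 64))
```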

MemoryHigh

Specify the throttling limit on memory used by the job. Values are formatted as described above. Also scaled proportionally by the default mapper when sdexec-constrain-resources is enabled; see the MemoryMax entry above for scaling details.

MemoryMin, MemoryLow

Specify the memory usage protection of the job. Values are formatted as described above.

MemorySwapMax

Specify the absolute limit on swap used by the job. Values are formatted as described above. Also scaled proportionally by the default mapper when sdexec-constrain-resources is enabled; see the MemoryMax entry above for scaling details.

OOMScoreAdjust

Sets the adjustment value for the Linux kernel's OOM killer score. Values range from -1000 to 1000, with 1000 making a process most likely to be selected, and -1000 preventing a process from being selected. See systemd.exec(5) for more information. Setting a negative value is likely a privileged operation in the Flux systemd instance.

The following unit properties are reserved for use by Flux and should not be added to sdexec-properties: AllowedCPUs, AllowedMemoryNodes, DeviceAllow, DevicePolicy, Description, Environment, ExecStart, KillMode, RemainAfterExit, SendSIGKILL, StandardInputFileDescriptor, StandardOutputFileDescriptor, StandardErrorFileDescriptor, TimeoutStopUSec, Type, WorkingDirectory.

SDEXEC RESOURCE MAPPER

When sdexec-constrain-resources is enabled, the sdexec-mapper module translates job resource allocations into systemd unit properties. The mapper is implemented as a Python class that can be customized for site-specific requirements.

The mapper is configured under the [sdexec] TOML table:

mapper

(optional) Fully-qualified Python class name for the resource mapper. (Default: flux.sdexec.map.HwlocMapper).

The mapper class must be a subclass of flux.sdexec.map.ResourceMapper and implement map_<type> methods for each resource type. See Custom Mappers for details on implementing custom mappers.

mapper-searchpath

(optional) Colon-separated list of directories to search for mapper modules. This allows loading custom mappers from site-specific locations without modifying the Python system path. (Default: empty).

Default Mapper Behavior

The default HwlocMapper uses hwloc topology information to map resources:

Core mapping:

Translates logical core IDs to physical CPU IDs using hwloc topology
Sets AllowedCPUs to the physical CPU set for allocated cores
Sets AllowedMemoryNodes to NUMA nodes associated with allocated cores

GPU mapping:

Translates logical GPU IDs to PCI addresses using hwloc
Discovers GPU device nodes via sysfs for each allocated GPU
Sets DeviceAllow to grant access to discovered devices

GPU device discovery is vendor-aware and opportunistic:

NVIDIA GPUs: Includes /dev/nvidia* devices, /dev/nvidiactl (control), /dev/nvidia-uvm (CUDA UVM), /dev/nvidia-uvm-tools (optional), and /dev/dri/renderD* (DRM) devices
AMD GPUs: Includes /dev/dri/renderD*, /dev/dri/card*, and /dev/kfd (ROCm Kernel Fusion Driver) devices
Shared devices like /dev/kfd and /dev/nvidiactl are automatically deduplicated when multiple GPUs are allocated

Device containment:

Always sets DevicePolicy=closed to enforce device containment
Allows standard pseudo devices (/dev/null, /dev/zero, /dev/urandom, etc.)
Blocks physical devices unless explicitly granted via DeviceAllow

Memory cap property scaling:

Scales MemoryHigh, MemoryMax, and MemorySwapMax from exec.sdexec-properties by the ratio of allocated to total processing units (hardware threads) on the node, when those properties are present
Supports both absolute sizes (K/M/G/T suffixes) and percentage values; percentages are rounded to the nearest integer percent
"infinity" values are left unchanged (no limit applied)
Scaling is skipped if AllowedCPUs was not set (e.g., when cores were not part of the allocation)
Protection properties (MemoryMin, MemoryLow) are not scaled

Custom Mappers

Sites can customize resource mapping by providing a Python class that extends flux.sdexec.map.ResourceMapper or flux.sdexec.map.HwlocMapper.

The mapper provides two extension points:

Override map_<type>() methods to customize resource-specific mapping:

class CustomMapper(HwlocMapper):
    def map_gpus(self, gpus):
        # Custom GPU device discovery
        if not gpus:
            return {}
        return {"DeviceAllow": "/dev/my-gpu-device rw"}

Override finalize_properties() to add properties not tied to specific resource types:

```
class CustomMapper(HwlocMapper):
    def finalize_properties(self, properties, R, extra_properties=None):
        # Add resource accounting
        properties.update({
            "CPUAccounting": "true",
            "MemoryAccounting": "true",
        })
        # Always call super() to preserve default behavior
        return super().finalize_properties(
            properties, R, extra_properties=extra_properties
        )
```

finalize_properties() is called after all map_<type>() methods complete. It receives:

- properties: the property dict built by map_<type>() methods
- R: the original ResourceSet for conditional logic
- extra_properties: the exec.sdexec-properties config dict, available for resource-aware scaling

The default HwlocMapper sets DevicePolicy=closed and scales memory cap properties (MemoryHigh, MemoryMax, MemorySwapMax) from extra_properties by the processing-unit ratio. Always call super() to preserve this behavior.

Configuration example:

[exec]
service = "sdexec"
sdexec-constrain-resources = true

[sdexec]
mapper = "site.mappers.AccountingMapper"
mapper-searchpath = "/etc/flux/mappers"

The mapper file at /etc/flux/mappers/site/mappers.py:

from flux.sdexec.map import HwlocMapper

class AccountingMapper(HwlocMapper):
    """Enable resource accounting for all jobs."""
    def finalize_properties(self, properties, R, extra_properties=None):
        properties.update({
            "CPUAccounting": "true",
            "MemoryAccounting": "true",
        })
        return super().finalize_properties(
            properties, R, extra_properties=extra_properties
        )

Replacing a scaled property with a computed value:

A custom finalize_properties() can override any property set by super(), including values scaled from sdexec-properties. For example, to enforce a fixed per-core memory limit instead of using a node-level budget:

from flux.sdexec.map import HwlocMapper

class PerCoreMemoryMapper(HwlocMapper):
    """Enforce a fixed 8 GB-per-core memory limit."""
    def finalize_properties(self, properties, R, extra_properties=None):
        super().finalize_properties(
            properties, R, extra_properties=extra_properties
        )
        local = R.copy_ranks(self._rank)
        # Override after super() so this value wins over any scaled MemoryMax.
        properties["MemoryMax"] = f"{local.ncores * 8}G"
        return properties

Common customization patterns:

Resource accounting: Add CPUAccounting, MemoryAccounting, etc.
Security policies: Set PrivateTmp, ProtectSystem, etc.
Computed memory limits: Override MemoryMax after super() to replace proportional scaling with a fixed per-core or per-GPU value
Conditional properties: Use R or properties to set limits conditionally (e.g., TasksMax only when GPUs are allocated)
Custom device discovery: Override map_gpus() for site-specific hardware or drivers

See the flux.sdexec.map module documentation and t/python/t0043-sdexec-map.py for more examples.

TESTEXEC

allow-guests

Boolean value that enables guest users to access the testexec implementation. By default, guests cannot use this implementation.

CONFIGURATION INTROSPECTION

The current effective configuration and some other runtime statistics can be queried using flux module stats job-exec. This is useful for:

Verifying configuration after changes
Understanding derived/calculated values
Debugging timing issues
Monitoring job execution system behavior

Key configuration settings available:

Termination Settings:

kill-timeout

The configured kill timeout value

term-signal, kill-signal

The signals used for termination

max-kill-count

The configured maximum kill attempt count

max-kill-timeout

The configured max-kill-timeout (reported as -1.0 if unset)

effective-max-kill-timeout

The calculated maximum time the execution system will wait before draining nodes. This is either the configured max-kill-timeout or a value derived from max-kill-count and the exponential backoff sequence. This value is reported as a floating-point number (e.g., 1220.5 seconds).

When using max-kill-count, this represents the time from the start of job termination until the final kill attempt is sent. The node is drained immediately after this final attempt without waiting an additional timeout period.

Sdexec Settings (under bulk-exec.config):

sdexec_stop_timer_sec

The effective stop timer value in seconds. If not explicitly configured, this equals effective-max-kill-timeout rounded up to the nearest integer.

sdexec_stop_timer_signal

The signal number used by the stop timer

Example:

$ flux module stats job-exec
{
 "kill-timeout": 5.0,
 "term-signal": "SIGTERM",
 "kill-signal": "SIGKILL",
 "max-kill-count": 8,
 "max-kill-timeout": -1.0,
 "effective-max-kill-timeout": 640.0,
 "jobs": {},
 "bulk-exec": {
  "config": {
   "default_cwd": "/tmp",
   "default_job_shell": "/home/grondo/git/f.git/src/shell/flux-shell",
   "exec_service": "rexec",
   "exec_service_override": 0,
   "default_barrier_timeout": 1800.0,
   "sdexec_stop_timer_sec": 640,
   "sdexec_stop_timer_signal": 10
  }
 }
}

In this example:

No max-kill-timeout is configured (shown as -1.0).
The effective-max-kill-timeout is 640.0 seconds, calculated from max-kill-count=8 with exponential backoff: (5 * 5) + 5 + 10 + 20 + 40 + 80 + 160 + 300 = 640s (5 * kill-timeout until the first kill attempt, then the backoff delays leading up to the eighth and final attempt).
The sdexec_stop_timer_sec defaults to 640 seconds (the effective max-kill-timeout rounded up, though in this case it is already an integer).

Example with explicit max-kill-timeout:

$ echo exec.max-kill-timeout=\"30m\" | flux config load
$ flux module stats job-exec
{
 "kill-timeout": 5.0,
 "term-signal": "SIGTERM",
 "kill-signal": "SIGKILL",
 "max-kill-count": 8,
 "max-kill-timeout": 1800.0,
 "effective-max-kill-timeout": 1800.0,
 "jobs": {},
 "bulk-exec": {
  "config": {
   "default_cwd": "/tmp",
   "default_job_shell": "/home/grondo/git/f.git/src/shell/flux-shell",
   "exec_service": "rexec",
   "exec_service_override": 0,
   "default_barrier_timeout": 1800.0,
   "sdexec_stop_timer_sec": 1800,
   "sdexec_stop_timer_signal": 10
  }
 }
}

Here the explicit max-kill-timeout setting (1800.0 seconds) determines the effective-max-kill-timeout, and the default sdexec stop timer is set to this value rounded up (1800 seconds, already an integer in this case).
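A site might want to check these values automatically after a configuration change. The following is a minimal sketch, assuming the JSON layout shown in the examples above; check_stop_timer is a hypothetical helper, and the sample string is a trimmed copy of the first example rather than live output.

```python
import json
import math


def check_stop_timer(stats_json):
    """Return (effective_timeout, stop_timer, consistent) from stats output.

    Hypothetical helper: "consistent" is True when the stop timer is at
    least the effective max-kill-timeout rounded up to an integer.
    """
    stats = json.loads(stats_json)
    effective = stats["effective-max-kill-timeout"]
    stop_timer = stats["bulk-exec"]["config"]["sdexec_stop_timer_sec"]
    return effective, stop_timer, stop_timer >= math.ceil(effective)


# Trimmed copy of the first example output above.
sample = """
{
 "effective-max-kill-timeout": 640.0,
 "bulk-exec": {"config": {"sdexec_stop_timer_sec": 640}}
}
"""
print(check_stop_timer(sample))
```

In practice the input would be the text printed by flux module stats job-exec, captured however is convenient for the site.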

JOB TERMINATION

When a job is canceled or gets a fatal exception it is terminated using the following sequence:

The job shells are notified to send term-signal to job tasks, unless the job is being terminated due to a time limit, in which case SIGALRM is sent instead.

After kill-timeout, job shells are notified to send kill-signal to tasks. This repeats every kill-timeout seconds.

After a delay of 5*kill-timeout, the job execution system transitions to sending kill-signal to the job shells directly.

This continues with an exponential backoff starting at kill-timeout, with the timeout doubling after each attempt (capped at 300s).

If max-kill-timeout is set, the execution system continues sending kill-signal to job shells until the specified duration has elapsed since termination began, then drains the nodes immediately.

If max-kill-timeout is not set, the execution system uses max-kill-count to limit the number of kill attempts. After the final kill attempt, nodes are drained immediately without waiting an additional timeout period.

In either case, any nodes still running processes for the job are drained with the reason: "unkillable user processes for job JOBID."

Note

Timing calculation with max-kill-count: The effective max-kill-timeout represents the time from termination start until the final kill attempt. For example, with max-kill-count=4 and kill-timeout=1s, kill attempts occur at 5s, 6s, 8s, and 12s, giving an effective max-kill-timeout of 12s. The node is drained at 12s immediately after the final attempt, not at 20s (12s + 8s timeout period).
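The timing calculation in the note above can be reproduced with a short sketch. It models only the sequence as described in this section (first direct kill at 5*kill-timeout, then a doubling backoff capped at 300s); the function name and the backoff_cap parameter are hypothetical and this is not the job-exec source.

```python
def kill_attempt_times(kill_timeout, max_kill_count, backoff_cap=300.0):
    """Return the times, in seconds since termination began, at which
    kill-signal is sent directly to the job shells.

    The last entry corresponds to the effective max-kill-timeout when
    max-kill-count is in use.
    """
    times = []
    now = 5 * kill_timeout   # first direct kill attempt
    delay = kill_timeout     # backoff starts at kill-timeout
    for _ in range(max_kill_count):
        times.append(now)
        now += min(delay, backoff_cap)  # wait before the next attempt
        delay *= 2                      # double the backoff each time
    return times


print(kill_attempt_times(1, 4))      # the example in the note above
print(kill_attempt_times(5, 8)[-1])  # defaults: kill-timeout=5s, count=8
```

With the defaults, the final attempt falls at 640 seconds, matching the effective-max-kill-timeout shown in CONFIGURATION INTROSPECTION.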

Note

When using sdexec, see SDEXEC AND JOB TERMINATION INTERACTION for information about how the systemd unit lifecycle interacts with this termination sequence.

SDEXEC AND JOB TERMINATION INTERACTION

When using sdexec, the systemd unit lifecycle adds an additional layer to the job termination process. Understanding this interaction is essential for configuring appropriate timeouts.

Job completion with sdexec:

When a job shell exits (either normally after tasks complete, or due to job-exec sending signals during exception-based termination):

The IMP notifies systemd that the unit is stopping.
If no processes remain in the unit's cgroup, the IMP exits and the job completes successfully.
If processes remain in the cgroup, systemd starts the sdexec-stop-timer-sec countdown.
After sdexec-stop-timer-sec seconds, if processes are still present, systemd sends sdexec-stop-timer-signal (by default SIGUSR1, the IMP proxy for SIGKILL) to them.
After another sdexec-stop-timer-sec seconds, if the unit still hasn't terminated, systemd abandons the unit and job-exec drains the node.

Why the sdexec stop timer is necessary:

The stop timer is essential for handling the case where the job shell terminates normally after all tasks exit, but unkillable processes remain in the cgroup. Without this timer, the job would remain in RUN state indefinitely (or until its time limit expires), with no mechanism to detect or handle the problem.

Why the stop timer must not be shorter than max-kill-timeout:

During exception-based job termination (cancellation, timeout, etc.), job-exec may kill the job shell before all tasks have exited. This triggers the sdexec stop timer sequence described above. If sdexec-stop-timer-sec were shorter than the effective max-kill-timeout, systemd would abandon units and drain nodes before job-exec's termination sequence completes, prematurely giving up on jobs that might still respond to signals.

By defaulting sdexec-stop-timer-sec to the effective max-kill-timeout (rounded up), Flux ensures systemd waits at least 2*max-kill-timeout total (one period before sending SIGKILL, another before abandoning the unit). This gives job-exec's termination sequence (which takes up to max-kill-timeout) time to complete, while still providing timely cleanup for the normal termination edge case.

Example: If max-kill-timeout = "30m", then sdexec-stop-timer-sec defaults to 1800 seconds (30 minutes). Systemd will wait:

30 minutes before sending SIGKILL to remaining processes
Another 30 minutes (60 minutes total) before abandoning the unit

This ensures the full 30-minute job-exec termination sequence can complete before systemd intervenes.

Example with max-kill-count: If max-kill-count = 8 (default) and kill-timeout = 5s (default), the effective max-kill-timeout is 640 seconds. The sdexec-stop-timer-sec defaults to 640 seconds, giving systemd 1280 seconds (about 21 minutes) total before abandoning the unit. Since job-exec drains immediately after the final kill attempt at 640 seconds, the longer systemd timeout ensures it doesn't interfere with job-exec's termination sequence.

Sites with jobs that require extended cleanup time should set max-kill-timeout appropriately rather than tuning sdexec-stop-timer-sec directly, as this maintains proper coordination between both systems.
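The relationship between the two timers can be summarized in a small sketch. The function name is hypothetical, and the times are measured from the moment the unit enters deactivating state, following the description in this section.

```python
import math


def sdexec_stop_timeline(effective_max_kill_timeout, stop_timer_sec=None):
    """Return (signal_at, abandon_at) in seconds after the unit starts
    deactivating.

    If stop_timer_sec is not given, it defaults to the effective
    max-kill-timeout rounded up, as described above.
    """
    if stop_timer_sec is None:
        stop_timer_sec = math.ceil(effective_max_kill_timeout)
    signal_at = stop_timer_sec       # stop timer signal is sent
    abandon_at = 2 * stop_timer_sec  # unit abandoned, node drained
    return signal_at, abandon_at


print(sdexec_stop_timeline(1800.0))  # max-kill-timeout = "30m"
print(sdexec_stop_timeline(640.0))   # default max-kill-count and kill-timeout
print(sdexec_stop_timeline(900.0, stop_timer_sec=1800))  # explicit override
```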

EXAMPLES

[exec]
imp = "/usr/libexec/flux/flux-imp"
job-shell = "/usr/libexec/flux/flux-shell-special"

[exec]
service = "sdexec"
sdexec-constrain-resources = true
[exec.sdexec-properties]
MemoryMax = "90%"

[exec]
service = "sdexec"
sdexec-constrain-resources = true
[sdexec]
mapper = "site.mappers.AccountingMapper"
mapper-searchpath = "/etc/flux/mappers"

[exec]
# Give jobs 30 minutes to terminate before draining nodes
max-kill-timeout = "30m"
service = "sdexec"
# sdexec-stop-timer-sec will default to 1800 (30 minutes)
# giving systemd 60 minutes total before abandoning units

[exec]
service = "sdexec"
max-kill-timeout = "15m"
# Override the default if jobs need even more time for cleanup
sdexec-stop-timer-sec = 1800  # 30 minutes instead of 15

[exec.testexec]
allow-guests = true

RESOURCES

Flux: http://flux-framework.org

Flux RFC: https://flux-framework.readthedocs.io/projects/flux-rfc

Issue Tracker: https://github.com/flux-framework/flux-core/issues

Flux Administrator's Guide: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/admin.html

FLUX RFC

15/Independent Minister of Privilege for Flux: The Security IMP, 23/Flux Standard Duration
