flux-config-exec(5)
DESCRIPTION
The exec system is highly configurable. If configuring a Flux system instance for the first time, it may be helpful to consult the Flux Administrator's Guide (see RESOURCES) and start with a simple configuration. See also EXAMPLES below.
The Flux system instance job-exec service requires additional
configuration via the exec table, for example to enlist the services
of a setuid helper to launch jobs as guests.
The exec table may contain the following keys:
KEYS
- imp
(optional) Set the path to the IMP (Independent Minister of Privilege) helper program, as described in RFC 15, so that jobs may be launched with the credentials of the guest user that submitted them. If unset, only jobs submitted by the instance owner may be executed.
- service
(optional) Set the remote subprocess service name. (Default: rexec). Note that systemd.enable must be set to true if sdexec is configured. See flux-config-systemd(5).
- service-override
(optional) Allow service to be overridden on a per-job basis with --setattr system.exec.bulkexec.service=NAME. (Default: false).
- job-shell
(optional) Override the compiled-in default job shell path.
- kill-timeout
(optional) The amount of time in Flux Standard Duration (FSD) to wait after SIGTERM is sent to a job before sending SIGKILL. FSD is a human-readable time format supporting units like "5s" (seconds), "2m" (minutes), "1h" (hours), "3d" (days). See RFC 23 for complete specification. The default is "5s" (5 seconds). See JOB TERMINATION below for details.
- max-kill-count
(optional) The maximum number of times kill-signal will be sent to the job shell before the execution system considers the job unkillable and drains the node. The default is 8. Note that the node is drained immediately after the final kill attempt without waiting an additional timeout period. See JOB TERMINATION below for details.
- max-kill-timeout
(optional) The maximum amount of time in FSD to wait for a job to terminate before draining nodes with unkillable processes. When set, this overrides max-kill-count by continuing the kill signal escalation sequence until the specified duration has elapsed since termination began. The default is unset (use max-kill-count instead). See JOB TERMINATION below for details.
Example: max-kill-timeout = "30m" gives jobs up to 30 minutes to respond to termination signals before affected nodes are drained.
Note
Choosing between max-kill-count and max-kill-timeout:
Use max-kill-timeout when you want to specify how long to wait
before giving up on a job, such as enforcing a site policy that jobs
get 30 minutes to clean up before nodes are drained. This is simpler
and more intuitive than calculating the number of kill attempts needed.
Use max-kill-count when you want fine-grained control over the
number of escalation attempts regardless of timing, or when maintaining
backward compatibility with existing configurations.
When both are set, max-kill-timeout takes precedence.
- term-signal
(optional) A string specifying an alternate signal to SIGTERM when terminating job tasks. Mainly used for testing.
- kill-signal
(optional) A string specifying an alternate signal to SIGKILL when killing tasks and the job shell. Mainly used for testing.
- barrier-timeout
(optional) Specify the default job shell start barrier timeout in FSD. All multi-node jobs enter a barrier at startup once the Flux job shell completes initialization tasks such as changing the working directory and processing the initrc file. Once the first node enters this barrier, the job execution system starts a timer, and if the timer expires before the barrier is complete, raises a job exception and drains the nodes on which the barrier is waiting. To disable the barrier timeout, set this value to "0". (Default: 30m).
- max-start-delay-percent
(optional) Specify the maximum allowed delay, as a percentage of a job's duration, between when a job is allocated (i.e. the starttime recorded in R) and when the execution system receives the start request from the job manager. If the delay exceeds this percentage, then extend the job's effective expiration by the delay. This prevents short duration jobs from having their runtime significantly reduced, while avoiding a differential between the actual resource set expiration and the time at which a timeout exception is raised for longer running jobs, where any runtime impact will be negligible. The default is 25 percent.
- testexec
(optional) A table of keys (see TESTEXEC) for configuring the job-exec test execution implementation (used mainly for testing).
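The max-start-delay-percent rule described above can be sketched in Python as follows. This is an illustration only, not the job-exec implementation: the function name and its arguments are hypothetical, all times are assumed to be in seconds on a common clock, and the handling of jobs without a positive duration is an assumption.

```python
def effective_expiration(starttime, expiration, start_request_time,
                         max_start_delay_percent=25.0):
    """Return the effective expiration of a job (hypothetical helper).

    starttime and expiration are the values recorded in R;
    start_request_time is when the execution system received the
    start request from the job manager.
    """
    duration = expiration - starttime
    delay = start_request_time - starttime
    # Assumption: leave jobs with no positive duration or no delay alone.
    if duration <= 0 or delay <= 0:
        return expiration
    # Extend only when the delay exceeds the configured percentage
    # of the job's duration.
    if delay / duration * 100.0 > max_start_delay_percent:
        return expiration + delay
    return expiration
```

With the default of 25 percent, a 60 second job whose start request arrives 30 seconds late has its effective expiration extended by 30 seconds, while a one hour job with the same delay is left unchanged.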
SDEXEC CONFIGURATION
When using the systemd execution service (service = "sdexec"), additional
configuration options control how systemd manages job units and interacts with
job-exec's termination sequence. See SDEXEC AND JOB TERMINATION INTERACTION for important
information about how these settings interact with the job termination sequence
described in JOB TERMINATION.
Tip
The current configuration values and derived settings can be inspected at
runtime using flux module stats job-exec. See CONFIGURATION INTROSPECTION
for details.
- sdexec-constrain-resources
(optional) Boolean value that enables resource containment for jobs. When enabled, the sdexec-mapper module translates job resource allocations (cores, GPUs) into systemd unit properties that constrain jobs to their allocated resources, preventing access to resources owned by other jobs. (Default: false).
When enabled, the mapper sets:
AllowedCPUs - Restricts job to allocated CPU cores
AllowedMemoryNodes - Restricts job to NUMA nodes for allocated cores
DeviceAllow - Grants access only to allocated GPUs
DevicePolicy=closed - Blocks access to physical devices except those explicitly allowed, while permitting standard pseudo devices like /dev/null, /dev/zero, etc.
If memory cap properties (MemoryHigh, MemoryMax, MemorySwapMax) are set in exec.sdexec-properties, the default mapper scales them by the ratio of allocated to total processing units (hardware threads) on the node, so each co-located job is limited proportionally to its CPU share.
After each job unit starts, Flux verifies that the expected CPU set is enforced. This check is necessary because systemd may silently accept AllowedCPUs without enforcing it if the cpuset controller is not properly delegated. If the check fails, Flux drains the node as a likely misconfiguration that would affect all subsequent jobs.
See SDEXEC RESOURCE MAPPER for information about customizing the resource mapping behavior.
Requirements:
cgroups v2 (unified hierarchy)
cpuset cgroup controller delegated to user systemd instance
flux-security >= 0.14.0
See the Systemd and cgroup unified hierarchy section of the Flux Administrator's Guide for the required systemd override file configuration.
Example:
[exec]
service = "sdexec"
sdexec-constrain-resources = true
- sdexec-properties
(optional) A table of systemd properties to set for all jobs. All values must be strings. See SDEXEC PROPERTIES below.
- sdexec-stop-timer-sec
(optional) Configure the length of time in seconds after a unit enters deactivating state when it will be sent the sdexec-stop-timer-signal. Deactivating state is entered by imp-shell units when the flux-shell(1) terminates. The unit may remain there as long as user processes remain in the unit's cgroup.
After the same length of time, if the unit hasn't terminated, for example due to unkillable processes, the unit is abandoned and the node is drained.
Default: The effective max-kill-timeout value rounded up to the nearest integer (see JOB TERMINATION and SDEXEC AND JOB TERMINATION INTERACTION). For example, if the effective max-kill-timeout is 1220.5 seconds, the default sdexec-stop-timer-sec will be 1221 seconds. This ensures systemd waits at least 2*max-kill-timeout (one period before sending SIGKILL, one period before abandoning the unit) before draining the node, allowing job-exec's normal termination sequence to complete. This can be overridden by explicitly setting this value.
Use flux module stats job-exec to inspect the current effective value. See CONFIGURATION INTROSPECTION.
- sdexec-stop-timer-signal
(optional) Configure the signal used by the stop timer. By default, 10 (SIGUSR1, the IMP proxy for SIGKILL) is used.
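The cpuset delegation requirement listed under sdexec-constrain-resources can be checked by hand before enabling the option. The sketch below is not the verification Flux performs after a unit starts. It only reads the cgroup v2 cgroup.controllers file, which lists the controllers available in a cgroup as space-separated names. The function names are hypothetical, and the cgroup directory of the user systemd instance must be supplied by the caller because it depends on the site.

```python
from pathlib import Path


def controllers_available(cgroup_dir):
    """Return the set of controller names in <cgroup_dir>/cgroup.controllers."""
    text = Path(cgroup_dir, "cgroup.controllers").read_text()
    return set(text.split())


def cpuset_delegated(cgroup_dir):
    """Return True if the cpuset controller is available in cgroup_dir."""
    try:
        return "cpuset" in controllers_available(cgroup_dir)
    except OSError:
        # No readable cgroup.controllers file: not a cgroup v2 directory.
        return False
```

A call might look like cpuset_delegated("/sys/fs/cgroup/user.slice/user-UID.slice/user@UID.service"), where UID is the numeric ID of the user running the Flux systemd instance. That path layout is an assumption about a typical systemd installation.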
SDEXEC PROPERTIES
When the sdexec service is selected, systemd unit properties may be set by
adding them to the sdexec-properties sub-table. All values must be
specified as TOML strings. Properties that require other value types
can only be specified if Flux knows about them so it can perform type
conversion. Those are:
- MemoryMax
Specify the node memory budget available to jobs. The value may be an absolute size in bytes (with optional K, M, G, or T suffix using base-1024 multipliers), a percentage of physical memory, or "infinity" to apply no limit.
With node-exclusive scheduling (one job per node), this value applies directly to the job's systemd unit, capping total memory use to protect system processes from being OOM-killed by a runaway job.
When sdexec-constrain-resources is enabled and jobs share a node, the default HwlocMapper scales this value by the ratio of allocated to total processing units so that co-located jobs collectively stay within the budget. Setting a fixed per-job value is not meaningful in this context since the number of co-located jobs is not known at configuration time. For example, MemoryMax = "95%" on a node with 64 hardware threads gives a 4-thread job a limit of round(95 * 4/64) = 6% of physical memory.
- MemoryHigh
Specify the throttling limit on memory used by the job. Values are formatted as described above. Also scaled proportionally by the default mapper when sdexec-constrain-resources is enabled; see the MemoryMax entry above for scaling details.
- MemoryMin, MemoryLow
Specify the memory usage protection of the job. Values are formatted as described above.
- MemorySwapMax
Specify the absolute limit on swap used by the job. Values are formatted as described above. Also scaled proportionally by the default mapper when sdexec-constrain-resources is enabled; see the MemoryMax entry above for scaling details.
- OOMScoreAdjust
Sets the adjustment value for the Linux kernel's OOM killer score. Values range from -1000 to 1000, with 1000 making a process most likely to be selected, and -1000 preventing a process from being selected. See systemd.exec(5) for more information. Setting a negative value is likely a privileged operation in the Flux systemd instance.
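The proportional scaling described in the MemoryMax entry can be illustrated with a short Python sketch. This is not the HwlocMapper code. The function name is hypothetical, and the decision to return scaled absolute sizes as a whole number of bytes (truncated) is an assumption. Python's round() resolves exact ties to the even integer, which may differ from the mapper's rounding.

```python
# Base-1024 multipliers for the K, M, G and T suffixes.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def scale_memory_cap(value, allocated_pus, total_pus):
    """Scale a memory cap string by allocated/total processing units."""
    if value == "infinity":
        # No limit: left unchanged.
        return value
    ratio = allocated_pus / total_pus
    if value.endswith("%"):
        # Percentages are rounded to the nearest integer percent.
        return f"{round(float(value[:-1]) * ratio)}%"
    suffix = value[-1].upper()
    if suffix in UNITS:
        nbytes = float(value[:-1]) * UNITS[suffix]
    else:
        nbytes = float(value)
    # Assumption: report the scaled absolute size in bytes.
    return str(int(nbytes * ratio))
```

For the example in the MemoryMax entry, scale_memory_cap("95%", 4, 64) returns "6%".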
The following unit properties are reserved for use by Flux and should not be
added to sdexec-properties: AllowedCPUs, AllowedMemoryNodes, DeviceAllow,
DevicePolicy, Description, Environment, ExecStart, KillMode, RemainAfterExit,
SendSIGKILL, StandardInputFileDescriptor, StandardOutputFileDescriptor,
StandardErrorFileDescriptor, TimeoutStopUSec, Type, WorkingDirectory.
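A site that generates its configuration could check an sdexec-properties table against the two rules above (string values only, no reserved names) before loading it. The following Python sketch assumes the table has already been parsed into a dict. The function name is hypothetical and the check is not part of Flux.

```python
# Unit properties reserved for use by Flux, as listed above.
RESERVED = frozenset({
    "AllowedCPUs", "AllowedMemoryNodes", "DeviceAllow", "DevicePolicy",
    "Description", "Environment", "ExecStart", "KillMode",
    "RemainAfterExit", "SendSIGKILL", "StandardInputFileDescriptor",
    "StandardOutputFileDescriptor", "StandardErrorFileDescriptor",
    "TimeoutStopUSec", "Type", "WorkingDirectory",
})


def check_sdexec_properties(props):
    """Return a list of problems found in an sdexec-properties dict."""
    problems = []
    for key, value in props.items():
        if key in RESERVED:
            problems.append(f"{key}: reserved for use by Flux")
        if not isinstance(value, str):
            problems.append(
                f"{key}: value must be a string, got {type(value).__name__}"
            )
    return problems
```

An empty list means no problem was found by these two checks. It does not guarantee that systemd or Flux will accept the property.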
SDEXEC RESOURCE MAPPER
When sdexec-constrain-resources is enabled, the sdexec-mapper module
translates job resource allocations into systemd unit properties. The mapper
is implemented as a Python class that can be customized for site-specific
requirements.
The mapper is configured under the [sdexec] TOML table:
- mapper
(optional) Fully-qualified Python class name for the resource mapper. (Default: flux.sdexec.map.HwlocMapper).
The mapper class must be a subclass of flux.sdexec.map.ResourceMapper and implement map_<type> methods for each resource type. See Custom Mappers for details on implementing custom mappers.
- mapper-searchpath
(optional) Colon-separated list of directories to search for mapper modules. This allows loading custom mappers from site-specific locations without modifying the Python system path. (Default: empty).
Default Mapper Behavior
The default HwlocMapper uses hwloc topology information to map resources:
Core mapping:
Translates logical core IDs to physical CPU IDs using hwloc topology
Sets AllowedCPUs to the physical CPU set for allocated cores
Sets AllowedMemoryNodes to NUMA nodes associated with allocated cores
GPU mapping:
Translates logical GPU IDs to PCI addresses using hwloc
Discovers GPU device nodes via sysfs for each allocated GPU
Sets DeviceAllow to grant access to discovered devices
GPU device discovery is vendor-aware and opportunistic:
NVIDIA GPUs: Includes /dev/nvidia* devices, /dev/nvidiactl (control), /dev/nvidia-uvm (CUDA UVM), /dev/nvidia-uvm-tools (optional), and /dev/dri/renderD* (DRM) devices
AMD GPUs: Includes /dev/dri/renderD*, /dev/dri/card*, and /dev/kfd (ROCm Kernel Fusion Driver) devices
Shared devices like /dev/kfd and /dev/nvidiactl are automatically deduplicated when multiple GPUs are allocated
Device containment:
Always sets DevicePolicy=closed to enforce device containment
Allows standard pseudo devices (/dev/null, /dev/zero, /dev/urandom, etc.)
Blocks physical devices unless explicitly granted via DeviceAllow
Memory cap property scaling:
Scales MemoryHigh, MemoryMax, and MemorySwapMax from exec.sdexec-properties by the ratio of allocated to total processing units (hardware threads) on the node, when those properties are present
Supports both absolute sizes (K/M/G/T suffixes) and percentage values; percentages are rounded to the nearest integer percent
"infinity" values are left unchanged (no limit applied)
Scaling is skipped if AllowedCPUs was not set (e.g., when cores were not part of the allocation)
Protection properties (MemoryMin, MemoryLow) are not scaled
Custom Mappers
Sites can customize resource mapping by providing a Python class that extends
flux.sdexec.map.ResourceMapper or flux.sdexec.map.HwlocMapper.
The mapper provides two extension points:
Override map_<type>() methods to customize resource-specific mapping:
from flux.sdexec.map import HwlocMapper

class CustomMapper(HwlocMapper):
    def map_gpus(self, gpus):
        # Custom GPU device discovery
        if not gpus:
            return {}
        return {"DeviceAllow": "/dev/my-gpu-device rw"}
Override finalize_properties() to add properties not tied to specific resource types:
from flux.sdexec.map import HwlocMapper

class CustomMapper(HwlocMapper):
    def finalize_properties(self, properties, R, extra_properties=None):
        # Add resource accounting
        properties.update({
            "CPUAccounting": "true",
            "MemoryAccounting": "true",
        })
        # Always call super() to preserve default behavior
        return super().finalize_properties(
            properties, R, extra_properties=extra_properties
        )
finalize_properties() is called after all map_<type>() methods complete. It receives:
properties: the property dict built by map_<type>() methods
R: the original ResourceSet for conditional logic
extra_properties: the exec.sdexec-properties config dict, available for resource-aware scaling
The default HwlocMapper sets DevicePolicy=closed and scales memory cap properties (MemoryHigh, MemoryMax, MemorySwapMax) from extra_properties by the processing-unit ratio. Always call super() to preserve this behavior.
Configuration example:
[exec]
service = "sdexec"
sdexec-constrain-resources = true
[sdexec]
mapper = "site.mappers.AccountingMapper"
mapper-searchpath = "/etc/flux/mappers"
The mapper file at /etc/flux/mappers/site/mappers.py:
from flux.sdexec.map import HwlocMapper

class AccountingMapper(HwlocMapper):
    """Enable resource accounting for all jobs."""

    def finalize_properties(self, properties, R, extra_properties=None):
        properties.update({
            "CPUAccounting": "true",
            "MemoryAccounting": "true",
        })
        return super().finalize_properties(
            properties, R, extra_properties=extra_properties
        )
Replacing a scaled property with a computed value:
A custom finalize_properties() can override any property set by
super(), including values scaled from sdexec-properties. For
example, to enforce a fixed per-core memory limit instead of using a
node-level budget:
from flux.sdexec.map import HwlocMapper

class PerCoreMemoryMapper(HwlocMapper):
    """Enforce a fixed 8 GB-per-core memory limit."""

    def finalize_properties(self, properties, R, extra_properties=None):
        super().finalize_properties(
            properties, R, extra_properties=extra_properties
        )
        local = R.copy_ranks(self._rank)
        # Override after super() so this value wins over any scaled MemoryMax.
        properties["MemoryMax"] = f"{local.ncores * 8}G"
        return properties
Common customization patterns:
Resource accounting: Add CPUAccounting, MemoryAccounting, etc.
Security policies: Set PrivateTmp, ProtectSystem, etc.
Computed memory limits: Override MemoryMax after super() to replace proportional scaling with a fixed per-core or per-GPU value
Conditional properties: Use R or properties to set limits conditionally (e.g., TasksMax only when GPUs are allocated)
Custom device discovery: Override map_gpus() for site-specific hardware or drivers
See the flux.sdexec.map module documentation and
t/python/t0043-sdexec-map.py for more examples.
TESTEXEC
- allow-guests
Boolean value that enables access to the testexec implementation for guest users. By default, guests cannot use this implementation.
CONFIGURATION INTROSPECTION
The current effective configuration and some other runtime statistics can
be queried using flux module stats job-exec. This is useful for:
Verifying configuration after changes
Understanding derived/calculated values
Debugging timing issues
Monitoring job execution system behavior
Key configuration settings available:
Termination Settings:
- kill-timeout
The configured kill timeout value
- term-signal, kill-signal
The signals used for termination
- max-kill-count
The configured maximum kill attempt count
- max-kill-timeout
The configured max-kill-timeout in seconds (shown as -1.0 if unset, as in the first example below)
- effective-max-kill-timeout
The calculated maximum time the execution system will wait before draining nodes. This is either the configured max-kill-timeout or a value derived from max-kill-count and the exponential backoff sequence. This value is reported as a floating-point number (e.g., 1220.5 seconds).
When using max-kill-count, this represents the time from the start of job termination until the final kill attempt is sent. The node is drained immediately after this final attempt without waiting an additional timeout period.
Sdexec Settings (under bulk-exec.config):
- sdexec_stop_timer_sec
The effective stop timer value in seconds. If not explicitly configured, this will equal effective-max-kill-timeout rounded up to the nearest integer.
- sdexec_stop_timer_signal
The signal number used by the stop timer
Example:
$ flux module stats job-exec
{
"kill-timeout": 5.0,
"term-signal": "SIGTERM",
"kill-signal": "SIGKILL",
"max-kill-count": 8,
"max-kill-timeout": -1.0,
"effective-max-kill-timeout": 640.0,
"jobs": {},
"bulk-exec": {
"config": {
"default_cwd": "/tmp",
"default_job_shell": "/home/grondo/git/f.git/src/shell/flux-shell",
"exec_service": "rexec",
"exec_service_override": 0,
"default_barrier_timeout": 1800.0,
"sdexec_stop_timer_sec": 640,
"sdexec_stop_timer_signal": 10
}
}
}
In this example:
- No max-kill-timeout is configured (shown as -1.0)
- The effective-max-kill-timeout is 640.0 seconds, calculated from max-kill-count=8 with exponential backoff: (5 * 5) + 5 + 10 + 20 + 40 + 80 + 160 + 300 = 640s (5 * kill-timeout until the first kill attempt, then the backoff delays up to the eighth and final kill attempt)
- The sdexec_stop_timer_sec defaults to 640 seconds (the effective max-kill-timeout rounded up, though in this case it's already an integer)
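The relationship between these values can also be checked from a script. The Python sketch below takes the JSON text printed by flux module stats job-exec and compares the stop timer with the effective max-kill-timeout. The key names are the ones shown in the example output above. The function name and the returned field names are hypothetical.

```python
import json
import math


def check_stop_timer(stats_json):
    """Summarize the stop timer settings found in job-exec module stats."""
    stats = json.loads(stats_json)
    effective = stats["effective-max-kill-timeout"]
    stop_timer = stats["bulk-exec"]["config"]["sdexec_stop_timer_sec"]
    return {
        "effective_max_kill_timeout": effective,
        "sdexec_stop_timer_sec": stop_timer,
        # True when the stop timer equals the documented default.
        "matches_default": stop_timer == math.ceil(effective),
        # True when systemd could give up before job-exec finishes.
        "shorter_than_effective": stop_timer < effective,
    }
```

A result with shorter_than_effective set to True indicates a stop timer that was explicitly configured below the effective max-kill-timeout. See SDEXEC AND JOB TERMINATION INTERACTION for why that is undesirable.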
Example with explicit max-kill-timeout:
$ echo exec.max-kill-timeout=\"30m\" | flux config load
$ flux module stats job-exec
{
"kill-timeout": 5.0,
"term-signal": "SIGTERM",
"kill-signal": "SIGKILL",
"max-kill-count": 8,
"max-kill-timeout": 1800.0,
"effective-max-kill-timeout": 1800.0,
"jobs": {},
"bulk-exec": {
"config": {
"default_cwd": "/tmp",
"default_job_shell": "/home/grondo/git/f.git/src/shell/flux-shell",
"exec_service": "rexec",
"exec_service_override": 0,
"default_barrier_timeout": 1800.0,
"sdexec_stop_timer_sec": 1800,
"sdexec_stop_timer_signal": 10
}
}
}
Here the explicit max-kill-timeout setting (1800.0 seconds) determines
the effective-max-kill-timeout, and the default sdexec stop timer is set to
this value rounded up (1800 seconds, already an integer in this case).
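The conversion from the FSD string "30m" to the 1800.0 seconds reported above can be illustrated with a simplified parser. This Python sketch handles only the units named in this page (s, m, h, d) and assumes that a bare number means seconds. RFC 23 remains the authoritative definition, and the function name is hypothetical.

```python
import re

# Seconds per unit for the suffixes named in this page.
_FSD_UNITS = {"s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_FSD_RE = re.compile(r"^(\d+(?:\.\d+)?)([smhd]?)$")


def fsd_to_seconds(fsd):
    """Convert a simple FSD string such as "30m" to seconds."""
    match = _FSD_RE.match(fsd.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {fsd!r}")
    number, unit = match.groups()
    # Assumption: a value without a suffix is interpreted as seconds.
    return float(number) * _FSD_UNITS[unit or "s"]
```

For example, fsd_to_seconds("30m") returns 1800.0 and fsd_to_seconds("5s") returns 5.0.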
JOB TERMINATION
When a job is canceled or gets a fatal exception it is terminated using the following sequence:
The job shells are notified to send term-signal to job tasks, unless the job is being terminated due to a time limit, in which case SIGALRM is sent instead.
After kill-timeout, job shells are notified to send kill-signal to tasks. This repeats every kill-timeout seconds.
After a delay of 5*kill-timeout, the job execution system transitions to sending kill-signal to the job shells directly.
This continues with an exponential backoff starting at kill-timeout, with the timeout doubling after each attempt (capped at 300s).
If max-kill-timeout is set, the execution system continues sending kill-signal to job shells until the specified duration has elapsed since termination began, then drains the nodes immediately.
If max-kill-timeout is not set, the execution system uses max-kill-count to limit the number of kill attempts. After the final kill attempt, nodes are drained immediately without waiting an additional timeout period.
In either case, any nodes still running processes for the job are drained with the reason: "unkillable user processes for job JOBID."
Note
Timing calculation with max-kill-count: The effective max-kill-timeout
represents the time from termination start until the final kill attempt.
For example, with max-kill-count=4 and kill-timeout=1s, kill
attempts occur at 5s, 6s, 8s, and 12s, giving an effective max-kill-timeout
of 12s. The node is drained at 12s immediately after the final attempt,
not at 20s (12s + 8s timeout period).
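The timing in the note above can be reproduced with a few lines of Python. This sketch follows the sequence as described in this section (first direct kill attempt at 5*kill-timeout, then delays that start at kill-timeout and double up to a 300 second cap). It is a model of the documented behavior, not the job-exec source, and the function names are hypothetical.

```python
def kill_attempt_times(kill_timeout, max_kill_count, cap=300.0):
    """Return the times (seconds since termination began) of each
    kill attempt sent directly to the job shells."""
    times = []
    t = 5 * kill_timeout      # first direct kill attempt
    delay = kill_timeout      # backoff starts at kill-timeout
    for _ in range(max_kill_count):
        times.append(t)
        t += delay
        delay = min(delay * 2, cap)  # double, capped at 300s
    return times


def effective_max_kill_timeout(kill_timeout, max_kill_count):
    """Time of the final kill attempt, after which nodes are drained."""
    times = kill_attempt_times(kill_timeout, max_kill_count)
    return times[-1] if times else 0.0
```

With kill_timeout=1 and max_kill_count=4 the attempts fall at 5, 6, 8, and 12 seconds. With the defaults (kill_timeout=5, max_kill_count=8) the final attempt falls at 640 seconds.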
Note
When using sdexec, see SDEXEC AND JOB TERMINATION INTERACTION for information about how the systemd unit lifecycle interacts with this termination sequence.
SDEXEC AND JOB TERMINATION INTERACTION
When using sdexec, the systemd unit lifecycle adds an additional layer to the job termination process. Understanding this interaction is essential for configuring appropriate timeouts.
Job completion with sdexec:
When a job shell exits (either normally after tasks complete, or due to job-exec sending signals during exception-based termination):
The IMP notifies systemd that the unit is stopping.
If no processes remain in the unit's cgroup, the IMP exits and the job completes successfully.
If processes remain in the cgroup, systemd starts the sdexec-stop-timer-sec countdown.
After sdexec-stop-timer-sec seconds, if processes are still present, systemd sends sdexec-stop-timer-signal (by default SIGUSR1, the IMP proxy for SIGKILL) to them.
After another sdexec-stop-timer-sec seconds, if the unit still hasn't terminated, systemd abandons the unit and job-exec drains the node.
Why the sdexec stop timer is necessary:
The stop timer is essential for handling the case where the job shell terminates normally after all tasks exit, but unkillable processes remain in the cgroup. Without this timer, the job would remain in RUN state indefinitely (or until its time limit expires), with no mechanism to detect or handle the problem.
Why the stop timer must exceed max-kill-timeout:
During exception-based job termination (cancellation, timeout, etc.), job-exec
may kill the job shell before all tasks have exited. This triggers the sdexec
stop timer sequence described above. If sdexec-stop-timer-sec were shorter
than the effective max-kill-timeout, systemd would abandon units and drain
nodes before job-exec's termination sequence completes, prematurely giving up
on jobs that might still respond to signals.
By defaulting sdexec-stop-timer-sec to the effective max-kill-timeout
(rounded up), Flux ensures systemd waits at least 2*max-kill-timeout
total (one period before sending SIGKILL, another before abandoning the unit).
This gives job-exec's termination sequence (which takes up to max-kill-timeout)
time to complete, while still providing timely cleanup for the normal
termination edge case.
Example: If max-kill-timeout = "30m", then sdexec-stop-timer-sec
defaults to 1800 seconds (30 minutes). Systemd will wait:
30 minutes before sending SIGKILL to remaining processes
Another 30 minutes (60 minutes total) before abandoning the unit
This ensures the full 30-minute job-exec termination sequence can complete before systemd intervenes.
Example with max-kill-count: If max-kill-count = 8 (default) and
kill-timeout = 5s (default), the effective max-kill-timeout is 640
seconds. The sdexec-stop-timer-sec defaults to 640 seconds, giving systemd
1280 seconds (about 21 minutes) total before abandoning the unit. Since
job-exec drains immediately after the final kill attempt at 640 seconds,
the longer systemd timeout ensures it doesn't interfere with job-exec's
termination sequence.
Sites with jobs that require extended cleanup time should set
max-kill-timeout appropriately rather than tuning sdexec-stop-timer-sec
directly, as this maintains proper coordination between both systems.
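The two-period stop timer behavior and its default can be summarized in a small Python sketch. Times are measured from the moment the unit enters the deactivating state. The function name and returned keys are hypothetical. Only the arithmetic described in this section is modeled.

```python
import math


def sdexec_stop_timeline(effective_max_kill_timeout, stop_timer_sec=None):
    """Return when systemd signals and abandons a lingering unit."""
    if stop_timer_sec is None:
        # Default: effective max-kill-timeout rounded up to an integer.
        stop_timer_sec = math.ceil(effective_max_kill_timeout)
    return {
        "stop_timer_sec": stop_timer_sec,
        # First period: the stop timer signal is sent.
        "signal_after": stop_timer_sec,
        # Second period: the unit is abandoned and the node drained.
        "abandon_after": 2 * stop_timer_sec,
    }
```

For max-kill-timeout = "30m" (1800 seconds) this gives a signal after 1800 seconds and abandonment after 3600 seconds. For the default effective value of 640 seconds, abandonment occurs after 1280 seconds.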
EXAMPLES
[exec]
imp = "/usr/libexec/flux/flux-imp"
job-shell = "/usr/libexec/flux/flux-shell-special"

[exec]
service = "sdexec"
sdexec-constrain-resources = true
[exec.sdexec-properties]
MemoryMax = "90%"

[exec]
service = "sdexec"
sdexec-constrain-resources = true
[sdexec]
mapper = "site.mappers.AccountingMapper"
mapper-searchpath = "/etc/flux/mappers"

[exec]
# Give jobs 30 minutes to terminate before draining nodes
max-kill-timeout = "30m"
service = "sdexec"
# sdexec-stop-timer-sec will default to 1800 (30 minutes)
# giving systemd 60 minutes total before abandoning units

[exec]
service = "sdexec"
max-kill-timeout = "15m"
# Override the default if jobs need even more time for cleanup
sdexec-stop-timer-sec = 1800 # 30 minutes instead of 15

[exec.testexec]
allow-guests = true
RESOURCES
Flux: http://flux-framework.org
Flux RFC: https://flux-framework.readthedocs.io/projects/flux-rfc
Issue Tracker: https://github.com/flux-framework/flux-core/issues
Flux Administrator's Guide: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/admin.html
FLUX RFC
15/Independent Minister of Privilege for Flux: The Security IMP
23/Flux Standard Duration