Systemd Execution Module
The sdexec broker module implements the sdexec subprocess service,
an alternative to the built-in rexec service. It processes
sdexec.exec requests from job-exec and manages the full lifecycle
of a transient systemd unit for each request. One instance runs per
broker rank.
sdexec uses the libsdexec library and the D-Bus Bridge Module to communicate with systemd.
When resource containment is enabled (exec.sdexec-constrain-resources),
the sdexec-mapper module works alongside sdexec to translate job
resource allocations into systemd unit properties that restrict jobs to
their allocated CPUs, GPUs, and devices. See flux-config-exec(5) for
configuration details.
sdexec Module
Per-process State (struct sdproc)
Each exec request creates an sdproc that holds:
the original request message
the command JSON object
futures for start, stop, and property-watch RPCs
a struct unit tracking current unit state
three struct channel instances for stdin, stdout, and stderr
stop timer state for kill escalation
response-sent flags to prevent duplicate responses
Unit Naming
Each transient unit is given a unique name derived from a UUID, with the
Flux job ID embedded for observability. The name has a .service suffix
as required by systemd.
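
A minimal sketch of such a name follows. The `flux-<jobid>-<uuid>` layout and the `make_unit_name()` helper are hypothetical, chosen only to illustrate the idea. The exact format used by sdexec is not specified here.

```python
import uuid


def make_unit_name(jobid: str) -> str:
    """Build a unique transient unit name (hypothetical layout)."""
    # The random UUID provides uniqueness; the job ID is embedded only so
    # the unit is easy to recognize in systemd tooling output.
    return f"flux-{jobid}-{uuid.uuid4()}.service"


name = make_unit_name("f2WxyzAbC")
```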
I/O Channels
Three socketpair(2) channels are created before the unit is started:
in (stdin) — written by sdexec from incoming write RPCs
out (stdout) — read by sdexec; data forwarded as streaming responses
err (stderr) — read by sdexec; data forwarded as streaming responses
The file descriptors for the systemd side of each pair are passed to
sdexec_start_transient_unit() as the stdin_fd, stdout_fd, and
stderr_fd arguments, and are transmitted to systemd as D-Bus file handle
(h) typed arguments in StartTransientUnit. The Flux side FDs are
retained for reading (stdout/stderr) and writing (stdin).
Both stdout and stderr channels are line-buffered by default (CHANNEL_LINEBUF).
When systemd closes its end of a channel upon unit exit, the Flux side sees
EOF. sdexec waits for both stdout and stderr to reach EOF before sending the
final ENODATA response that closes the exec stream.
Output data is encoded using libioencode (stream name + rank + data) and
sent as streaming RPC responses.
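
The channel behavior described above can be sketched in Python as follows. This is an illustration only, not the sdexec implementation (which is written in C): `read_lines_until_eof()` is a hypothetical helper, and the unit is simulated by writing to one end of the pair and closing it.

```python
import socket


def read_lines_until_eof(sock):
    """Yield complete lines from sock, then any trailing partial line at EOF."""
    buf = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # the peer closed its end: EOF
            break
        buf += chunk
        # Line buffering: only hand out data once a newline has arrived.
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line + b"\n"
    if buf:  # flush whatever is left when the stream ends
        yield buf


# One socketpair per stream; only stdout is shown here.
flux_side, unit_side = socket.socketpair()

# Stand-in for the unit writing output and then exiting.
unit_side.sendall(b"hello\nwor")
unit_side.sendall(b"ld\npartial")
unit_side.close()

lines = list(read_lines_until_eof(flux_side))
flux_side.close()
```

Because the reader only stops at EOF, output still buffered in the socket when the unit exits is not lost, which is why the final response is deferred until both output channels have drained.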
Unit Lifecycle
After calling sdexec_start_transient_unit(), sdexec subscribes to
PropertiesChanged signals on the unit's D-Bus object path. The
following state transitions drive the response protocol:
- ACTIVE / RUNNING with ExecMainPID set
Send started response with PID.
- ACTIVE / EXITED with ExecMainCode available
Send finished response with wait status; call StopUnit.
- FAILED
Send error response with systemd result code.
After StopUnit is called, sdexec waits for stdout and stderr to reach EOF,
then sends ENODATA to close the exec stream.
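
The transitions above can be modeled as a small state tracker. The sketch below is a simplification under stated assumptions: `UnitTracker` is a hypothetical class, the property update is reduced to a plain dict, and the `wait_status` and `result` keys are placeholders for values that sdexec derives from the unit's properties.

```python
class UnitTracker:
    """Turns unit property updates into exec responses (simplified sketch)."""

    def __init__(self):
        # Response-sent flags: each response type goes out at most once.
        self.sent = set()

    def _once(self, kind):
        if kind in self.sent:
            return False
        self.sent.add(kind)
        return True

    def on_change(self, state, substate, props):
        """Return the actions triggered by one property update."""
        actions = []
        if state == "FAILED":
            if self._once("error"):
                actions.append(("error", props.get("result")))
        elif state == "ACTIVE" and substate == "RUNNING":
            if props.get("ExecMainPID") and self._once("started"):
                actions.append(("started", props["ExecMainPID"]))
        elif state == "ACTIVE" and substate == "EXITED":
            if "ExecMainCode" in props and self._once("finished"):
                actions.append(("finished", props.get("wait_status")))
                actions.append(("call", "StopUnit"))
        return actions


tracker = UnitTracker()
events = [
    ("ACTIVE", "RUNNING", {"ExecMainPID": 4242}),
    ("ACTIVE", "RUNNING", {"ExecMainPID": 4242}),  # repeated signal
    ("ACTIVE", "EXITED", {"ExecMainCode": 1, "wait_status": 0}),
]
log = [tracker.on_change(*event) for event in events]
```

The repeated RUNNING update produces no action, which mirrors the role of the response-sent flags in struct sdproc.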
Stop Timer and Kill Escalation
If a process does not exit on its own, the stop timer provides SIGTERM-to-SIGKILL escalation:
When the unit enters DEACTIVATING state, the stop timer is armed (disabled by default; configured by kill-timeout in flux-config-exec(5)).
On first expiry: KillUnit with SIGTERM is sent; timer is reset.
On second expiry: KillUnit with SIGKILL is sent; timer is reset.
On third expiry: the request is failed with EDEADLK.
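
The escalation sequence amounts to counting expiries. The following sketch shows only that decision logic; `StopTimer` is a hypothetical class, and arming or resetting a real timer is left out.

```python
import errno
import signal


class StopTimer:
    """Counts stop timer expiries and picks the next escalation step."""

    def __init__(self):
        self.expiries = 0

    def on_expiry(self):
        self.expiries += 1
        if self.expiries == 1:
            return ("KillUnit", signal.SIGTERM)  # polite request first
        if self.expiries == 2:
            return ("KillUnit", signal.SIGKILL)  # then force termination
        return ("fail", errno.EDEADLK)  # unit still not gone: give up


timer = StopTimer()
steps = [timer.on_expiry() for _ in range(3)]
```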
sdexec-mapper Module
The sdexec-mapper module translates job resource allocations (cores, GPUs)
into systemd unit properties when exec.sdexec-constrain-resources is
enabled. It runs on each broker rank alongside the sdexec module.
Resource Mapping Process
For each job start request, sdexec-mapper:
Extracts the local rank's resources from the job's R (resource set)
Calls map_<type>() methods for each resource type (cores, gpus, etc.)
Calls finalize_properties() to add general properties not tied to specific resource types
Returns the complete property dict to sdexec for inclusion in the StartTransientUnit D-Bus call
The mapper receives exec.sdexec-properties as additional context.
The default HwlocMapper uses this to scale memory cap properties
(MemoryHigh, MemoryMax, MemorySwapMax) proportional to the
job's processing unit allocation; custom mappers may use it for
site-specific adjustments. Mapper-generated properties take precedence
over sdexec-properties values.
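
The dispatch and precedence rules can be sketched as follows. `SketchMapper` is a hypothetical class: only the method names `map_<type>()` and `finalize_properties()` come from the description above, while their signatures, the `map()` entry point, and the place where the merge with sdexec-properties happens are assumptions made for illustration. Core IDs are used directly as CPU IDs here, whereas the default mapper resolves them through hwloc.

```python
class SketchMapper:
    """Hypothetical mapper; method signatures are assumptions."""

    def map_cores(self, ids):
        # Simplification: treat logical core IDs as physical CPU IDs.
        return {"AllowedCPUs": ",".join(str(i) for i in ids)}

    def map_gpus(self, ids):
        return {"DeviceAllow": [f"/dev/nvidia{i}" for i in ids]}

    def finalize_properties(self, props):
        # General properties not tied to one resource type.
        props["DevicePolicy"] = "closed"
        return props

    def map(self, local_resources, sdexec_properties):
        generated = {}
        for rtype, ids in local_resources.items():
            method = getattr(self, f"map_{rtype}", None)
            if method is not None:  # unknown resource types are skipped
                generated.update(method(ids))
        generated = self.finalize_properties(generated)
        # Mapper-generated values override same-named sdexec-properties.
        return {**sdexec_properties, **generated}


result = SketchMapper().map(
    {"cores": [0, 1], "gpus": [0]},
    {"AllowedCPUs": "0-63", "MemoryMax": "8G"},
)
```

In this example the configured `AllowedCPUs` value is replaced by the mapper's value, while `MemoryMax` passes through unchanged because the sketch does not generate it.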
The default HwlocMapper implementation:
Uses hwloc topology XML to map logical core IDs to physical CPUs and NUMA nodes
Discovers GPU device nodes via sysfs based on PCI addresses from hwloc
Sets DevicePolicy=closed to enforce device containment
Scales memory cap properties (MemoryHigh, MemoryMax, MemorySwapMax) from exec.sdexec-properties by the ratio of allocated to total processing units, so jobs sharing a node each receive a proportional memory limit
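
As a worked example of the proportional scaling, the sketch below assumes the memory caps are already expressed as integer byte counts. `scale_memory_caps()` is a hypothetical helper, and how HwlocMapper parses suffixed or special values is not covered here.

```python
MEMORY_KEYS = ("MemoryHigh", "MemoryMax", "MemorySwapMax")


def scale_memory_caps(properties, allocated_pus, total_pus):
    """Scale memory caps (in bytes) by the job's share of processing units."""
    scaled = {}
    for key in MEMORY_KEYS:
        if key in properties:
            # Multiply before dividing to avoid losing precision.
            scaled[key] = properties[key] * allocated_pus // total_pus
    return scaled


GIB = 2**30
caps = scale_memory_caps(
    {"MemoryHigh": 48 * GIB, "MemoryMax": 64 * GIB},
    allocated_pus=16,
    total_pus=64,
)
```

A job holding 16 of 64 processing units receives one quarter of each configured cap: 12 GiB for MemoryHigh and 16 GiB for MemoryMax.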
Property Generation
The mapper generates systemd unit properties that enforce resource constraints:
- AllowedCPUs
CPU affinity mask restricting the job to allocated physical CPUs. Derived from logical core IDs via hwloc topology.
- AllowedMemoryNodes
NUMA node affinity mask restricting memory allocations to nodes associated with allocated cores.
- DeviceAllow
Allow list of device paths the job may access. For GPUs, includes vendor-specific devices (e.g., /dev/nvidia0, /dev/kfd) and DRM render nodes (/dev/dri/renderD*).
- DevicePolicy
Set to "closed" to allow standard pseudo devices (/dev/null, etc.) while blocking access to physical devices unless explicitly allowed via DeviceAllow.
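
To make the shape of these values concrete, the sketch below formats ID sets as compact range lists, a form systemd accepts for AllowedCPUs and AllowedMemoryNodes, and assembles an example property dict. `to_range_list()` is a hypothetical helper, and the device paths, the "rw" permission string, and the (path, permissions) pair layout for DeviceAllow are illustrative assumptions, not output captured from the mapper.

```python
def _fmt(first, last):
    return str(first) if first == last else f"{first}-{last}"


def to_range_list(ids):
    """Format integer IDs as a compact list such as '0-3,8'."""
    parts = []
    run_start = run_end = None
    for i in sorted(set(ids)):
        if run_start is None:
            run_start = run_end = i
        elif i == run_end + 1:
            run_end = i  # extend the current run
        else:
            parts.append(_fmt(run_start, run_end))
            run_start = run_end = i
    if run_start is not None:
        parts.append(_fmt(run_start, run_end))
    return ",".join(parts)


properties = {
    "AllowedCPUs": to_range_list([0, 1, 2, 3, 8]),
    "AllowedMemoryNodes": to_range_list([0]),
    "DevicePolicy": "closed",
    # Example device entries only; real paths depend on the node's hardware.
    "DeviceAllow": [("/dev/nvidia0", "rw"), ("/dev/dri/renderD128", "rw")],
}
```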
Module Statistics
The mapper module provides runtime statistics via flux module stats sdexec-mapper:
{
  "config": {
    "mapper_class": "flux.sdexec.map.HwlocMapper",
    "mapper_searchpath": ""
  },
  "requests": 42
}
Configuration
See the [sdexec] table in flux-config-exec(5) for mapper
configuration options, including how to specify custom mapper classes
and search paths.