Systemd Execution Module

The sdexec broker module implements the sdexec subprocess service, an alternative to the built-in rexec service. It processes sdexec.exec requests from job-exec and manages the full lifecycle of a transient systemd unit for each request. One instance runs per broker rank.

sdexec uses the libsdexec library and the D-Bus Bridge Module to communicate with systemd.

When resource containment is enabled (exec.sdexec-constrain-resources), the sdexec-mapper module works alongside sdexec to translate job resource allocations into systemd unit properties that restrict jobs to their allocated CPUs, GPUs, and devices. See flux-config-exec(5) for configuration details.

sdexec Module

Per-process State (struct sdproc)

Each exec request creates an sdproc that holds:

  • the original request message

  • the command JSON object

  • futures for start, stop, and property-watch RPCs

  • a struct unit tracking current unit state

  • three struct channel instances for stdin, stdout, and stderr

  • stop timer state for kill escalation

  • response-sent flags to prevent duplicate responses

Unit Naming

Each transient unit is given a unique name derived from a UUID, with the Flux job ID embedded for observability. The name has a .service suffix as required by systemd.
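As a rough illustration of the naming scheme, the sketch below builds a name from a random UUID with a job ID embedded. The "example" prefix, the field order, and the make_unit_name() helper are hypothetical; the actual format used by sdexec is not specified here.

```python
import uuid


def make_unit_name(jobid, prefix="example"):
    """Return a unique unit name that embeds jobid.

    The prefix and layout are assumptions for illustration only.
    """
    # uuid4() gives a random UUID, so repeated calls yield distinct names.
    return f"{prefix}-{jobid}-{uuid.uuid4()}.service"
```

Embedding the job ID keeps the name searchable in tools such as systemctl, while the UUID keeps names unique across repeated requests for the same job.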

I/O Channels

Three socketpair(2) channels are created before the unit is started:

  • in (stdin) — written by sdexec from incoming write RPCs

  • out (stdout) — read by sdexec; data forwarded as streaming responses

  • err (stderr) — read by sdexec; data forwarded as streaming responses

The file descriptors for the systemd side of each pair are passed to sdexec_start_transient_unit() as the stdin_fd, stdout_fd, and stderr_fd arguments, and are transmitted to systemd as D-Bus Unix file descriptor (type h) arguments in StartTransientUnit. The Flux-side file descriptors are retained for reading (stdout/stderr) and writing (stdin).

Both stdout and stderr channels are line-buffered by default (CHANNEL_LINEBUF). When systemd closes its end of a channel upon unit exit, the Flux side sees EOF. sdexec waits for both stdout and stderr to reach EOF before sending the final ENODATA response that closes the exec stream.
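The EOF behavior described above can be demonstrated with a small, self-contained sketch. It is not the sdexec implementation: run_channel_demo() is a hypothetical helper, one end of the socketpair stands in for systemd, and the line splitting only approximates what CHANNEL_LINEBUF does.

```python
import socket


def run_channel_demo(payload):
    """Send payload from the 'unit' end, close it, and collect
    line-buffered output on the 'Flux' end until EOF."""
    flux_end, unit_end = socket.socketpair()
    unit_end.sendall(payload)
    unit_end.close()  # stands in for systemd closing its end at unit exit

    lines = []
    pending = b""
    while True:
        chunk = flux_end.recv(4096)
        if not chunk:  # an empty read means EOF on the Flux side
            break
        pending += chunk
        # Emit only complete lines; keep any partial line buffered.
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            lines.append(line + b"\n")
    if pending:
        lines.append(pending)  # flush a trailing partial line at EOF
    flux_end.close()
    return lines
```

For example, run_channel_demo(b"a\nb\nc") yields two complete lines followed by the unterminated remainder, which is only released once EOF is seen.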

Output data is encoded using libioencode (stream name + rank + data) and sent as streaming RPC responses.

Unit Lifecycle

After calling sdexec_start_transient_unit(), sdexec subscribes to PropertiesChanged signals on the unit's D-Bus object path. The following state transitions drive the response protocol:

ACTIVE / RUNNING with ExecMainPID set

Send started response with PID.

ACTIVE / EXITED with ExecMainCode available

Send finished response with wait status; call StopUnit.

FAILED

Send error response with systemd result code.

After StopUnit is called, sdexec waits for stdout and stderr to reach EOF, then sends ENODATA to close the exec stream.
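The transitions above can be read as a small dispatch function. The sketch below is a simplified model rather than the module's code: on_properties_changed(), the "wait_status" key, the use of the "Result" key for the systemd result code, and the action tuples are assumptions made for illustration. The sent set plays the role of the response-sent flags.

```python
def on_properties_changed(active_state, sub_state, props, sent):
    """Return the actions triggered by one property update.

    sent is a set of response names already sent; it is updated in
    place so that no response is emitted twice.
    """
    actions = []
    if active_state == "FAILED":
        if "error" not in sent:
            sent.add("error")
            actions.append(("error", props.get("Result")))
    elif active_state == "ACTIVE" and sub_state == "RUNNING":
        if "ExecMainPID" in props and "started" not in sent:
            sent.add("started")
            actions.append(("started", props["ExecMainPID"]))
    elif active_state == "ACTIVE" and sub_state == "EXITED":
        if "ExecMainCode" in props and "finished" not in sent:
            sent.add("finished")
            # "wait_status" is a hypothetical, precomputed value.
            actions.append(("finished", props.get("wait_status")))
            actions.append(("StopUnit", None))
    return actions
```

Because property updates can arrive more than once for the same state, tracking what has been sent keeps the response stream free of duplicates.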

Stop Timer and Kill Escalation

If a process does not exit on its own, the stop timer provides SIGTERM-to-SIGKILL escalation:

  1. When the unit enters the DEACTIVATING state, the stop timer is armed. The timer is disabled by default and is configured by kill-timeout in flux-config-exec(5).

  2. On first expiry: KillUnit with SIGTERM is sent; timer is reset.

  3. On second expiry: KillUnit with SIGKILL is sent; timer is reset.

  4. On third expiry: the request is failed with EDEADLK.
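The escalation steps amount to a counter over timer expirations. The following sketch models only that counting logic; the StopTimer class and its return values are hypothetical, and arming or resetting a real timer is left out.

```python
import errno


class StopTimer:
    """Model of the escalation sequence, driven by expiry events."""

    def __init__(self):
        self.expirations = 0

    def on_expiry(self):
        """Return the action to take for this expiration."""
        self.expirations += 1
        if self.expirations == 1:
            return ("KillUnit", "SIGTERM")
        if self.expirations == 2:
            return ("KillUnit", "SIGKILL")
        # Third expiry: give up and fail the request.
        return ("fail", errno.EDEADLK)
```

Each call to on_expiry() corresponds to one timer expiration; in the sequence above the timer would be reset after the first two.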

sdexec-mapper Module

The sdexec-mapper module translates job resource allocations (cores, GPUs) into systemd unit properties when exec.sdexec-constrain-resources is enabled. It runs on each broker rank alongside the sdexec module.

Resource Mapping Process

For each job start request, sdexec-mapper:

  1. Extracts the local rank's resources from the job's R (resource set)

  2. Calls map_<type>() methods for each resource type (cores, gpus, etc.)

  3. Calls finalize_properties() to add general properties not tied to specific resource types

  4. Returns the complete property dict to sdexec for inclusion in the StartTransientUnit D-Bus call
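The four steps can be sketched with a toy mapper. Only the method names map_<type>() and finalize_properties() come from the description above; the ToyMapper class, the build_properties() driver, the method signatures, and the /dev/toygpu paths are hypothetical and do not reflect the actual flux.sdexec.map interface.

```python
class ToyMapper:
    """Illustrative mapper; treats logical IDs as physical IDs."""

    def map_cores(self, ids):
        return {"AllowedCPUs": ",".join(str(i) for i in ids)}

    def map_gpus(self, ids):
        # Made-up device paths, for illustration only.
        return {"DeviceAllow": [f"/dev/toygpu{i}" for i in ids]}

    def finalize_properties(self, props):
        # General properties not tied to one resource type.
        props["DevicePolicy"] = "closed"
        return props


def build_properties(mapper, local_resources):
    """Call map_<type>() for each resource type, then finalize."""
    props = {}
    for rtype, ids in local_resources.items():
        method = getattr(mapper, f"map_{rtype}", None)
        if method is not None:  # skip types the mapper does not handle
            props.update(method(ids))
    return mapper.finalize_properties(props)
```

Here local_resources stands in for the local rank's portion of R, already reduced to a dict of resource type to ID list.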

The mapper receives exec.sdexec-properties as additional context. The default HwlocMapper uses this to scale memory cap properties (MemoryHigh, MemoryMax, MemorySwapMax) proportional to the job's processing unit allocation; custom mappers may use it for site-specific adjustments. Mapper-generated properties take precedence over sdexec-properties values.
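The precedence rule can be pictured as a dict merge in which mapper output is applied last. This is a sketch of the rule only; merge_properties() is a hypothetical helper and the property values are made up.

```python
def merge_properties(sdexec_properties, mapper_properties):
    """Combine both property sets; mapper values win on conflict."""
    merged = dict(sdexec_properties)
    merged.update(mapper_properties)  # later update overrides earlier keys
    return merged
```

With {"MemoryMax": 100, "Nice": 5} from configuration and {"MemoryMax": 25} from the mapper, the merged result keeps Nice and uses the mapper's MemoryMax.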

The default HwlocMapper implementation:

  • Uses hwloc topology XML to map logical core IDs to physical CPUs and NUMA nodes

  • Discovers GPU device nodes via sysfs based on PCI addresses from hwloc

  • Sets DevicePolicy=closed to enforce device containment

  • Scales memory cap properties (MemoryHigh, MemoryMax, MemorySwapMax) from exec.sdexec-properties by the ratio of allocated to total processing units, so jobs sharing a node each receive a proportional memory limit
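The proportional scaling in the last item is simple arithmetic. The sketch below assumes the memory caps are given as integer byte counts, which may not match how they are expressed in exec.sdexec-properties; scale_memory_caps() is a hypothetical helper, not part of HwlocMapper.

```python
MEMORY_CAP_KEYS = ("MemoryHigh", "MemoryMax", "MemorySwapMax")


def scale_memory_caps(properties, allocated_pus, total_pus):
    """Scale memory caps by allocated_pus / total_pus.

    Assumes cap values are integers (bytes); other keys pass through.
    """
    scaled = dict(properties)
    for key in MEMORY_CAP_KEYS:
        if key in scaled:
            # Multiply before dividing to avoid losing precision.
            scaled[key] = scaled[key] * allocated_pus // total_pus
    return scaled
```

For a job allocated 16 of 64 processing units on a node with MemoryMax set to 64 GiB, this gives a 16 GiB limit.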

Property Generation

The mapper generates systemd unit properties that enforce resource constraints:

AllowedCPUs

CPU affinity mask restricting the job to allocated physical CPUs. Derived from logical core IDs via hwloc topology.

AllowedMemoryNodes

NUMA node affinity mask restricting memory allocations to nodes associated with allocated cores.

DeviceAllow

Allow list of device paths the job may access. For GPUs, includes vendor-specific devices (e.g., /dev/nvidia0, /dev/kfd) and DRM render nodes (/dev/dri/renderD*).

DevicePolicy

Set to "closed" to allow standard pseudo devices (/dev/null, etc.) while blocking access to physical devices unless explicitly allowed via DeviceAllow.

Module Statistics

The mapper module provides runtime statistics via flux module stats sdexec-mapper:

{
  "config": {
    "mapper_class": "flux.sdexec.map.HwlocMapper",
    "mapper_searchpath": ""
  },
  "requests": 42
}

Configuration

See the [sdexec] table in flux-config-exec(5) for mapper configuration options, including how to specify custom mapper classes and search paths.