flux-config-pam(5)
DESCRIPTION
The pam table configures flux-pam features that manage systemd user
slices for Flux job users. This includes:
- The flux-pam prolog and housekeeping scripts, which run during job prolog and housekeeping phases to start/stop user services and optionally apply resource constraints to user slices.
- The pam_flux.so PAM session module, which attaches login sessions authenticated via the account module to the user's managed slice when manage-user-slice is enabled. See pam_flux(8).
PREREQUISITES
pam.manage-user-slice requires systemd ≥ 239 and the cgroup v2 unified
hierarchy. Resource constraints (AllowedCPUs, AllowedMemoryNodes,
DevicePolicy, DeviceAllow) are applied to user slice units via
systemctl set-property --runtime, which is only enforced by systemd on
the unified hierarchy. cgroup v1 systems are not supported.
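A node can be checked against these two prerequisites before the feature is enabled. The Python sketch below is illustrative only and is not part of flux-pam: the function names are hypothetical, and it assumes the unified hierarchy is mounted at /sys/fs/cgroup, where cgroup v2 exposes a cgroup.controllers file at the mount root.

```python
import os
import re
import subprocess

MIN_SYSTEMD = 239  # minimum version stated in PREREQUISITES


def parse_systemd_version(text):
    """Return the major version from `systemctl --version` output, or None."""
    match = re.match(r"systemd (\d+)", text)
    return int(match.group(1)) if match else None


def systemd_version():
    """Run `systemctl --version`; return None if systemctl is unavailable."""
    try:
        proc = subprocess.run(
            ["systemctl", "--version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None
    return parse_systemd_version(proc.stdout)


def has_unified_cgroup(root="/sys/fs/cgroup"):
    """True if `root` looks like a cgroup v2 (unified) mount point."""
    return os.path.isfile(os.path.join(root, "cgroup.controllers"))


def prerequisites_met(version, unified):
    """Combine the systemd version check and the cgroup v2 check."""
    return version is not None and version >= MIN_SYSTEMD and unified
```

Calling prerequisites_met(systemd_version(), has_unified_cgroup()) on a compute node gives a single yes/no answer. A site would typically run such a check from its own provisioning tooling.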
When pam.manage-user-slice is enabled, Flux takes ownership of the
user@UID.service manager for job users — starting it at first job and
stopping it after the last. systemd linger must not be enabled for job
users on compute nodes. Linger (loginctl enable-linger) keeps
user@UID.service running independently of jobs, which interferes with
Flux's lifecycle management in ways that can cause login sessions to escape
containment silently. The prolog will fail hard if it detects linger is
enabled for a job user, rather than proceeding with an inconsistent state.
The flux-pam package installs flux-pam-prolog and
flux-pam-housekeeping into $libexecdir/flux/prolog.d/ and
$libexecdir/flux/housekeeping.d/, where they are run by
flux-run-prolog and flux-run-housekeeping respectively.
For these scripts to execute on compute nodes, the Flux system instance
must load the perilog.so job-manager plugin with per-rank = true
for prolog and housekeeping. With per-rank = true, the default
command is flux-imp run prolog (or housekeeping), which invokes
flux-run-prolog (or flux-run-housekeeping) as root, executing
all scripts in the drop-in directory.
Flux system instance (/etc/flux/system/conf.d/):
[job-manager]
plugins = [
  { load = "perilog.so" }
]
[job-manager.prolog]
per-rank = true
[job-manager.housekeeping]
per-rank = true
IMP (/etc/flux/imp/conf.d/):
[run.prolog]
allowed-users = [ "flux" ]
allowed-environment = [ "FLUX_*" ]
path = "/usr/libexec/flux/cmd/flux-run-prolog"
[run.housekeeping]
allowed-users = [ "flux" ]
allowed-environment = [ "FLUX_*" ]
path = "/usr/libexec/flux/cmd/flux-run-housekeeping"
See flux-config-job-manager(5) and the Flux Administrator's Guide for further details.
KEYS
All keys are optional and default to false unless otherwise noted.
- manage-user-slice
Boolean value that enables systemd user slice lifecycle management via prolog and housekeeping scripts. When enabled, the prolog starts user@UID.service (if not already running) for each job user, and housekeeping stops it when the user's last job completes. This is the master switch for all user slice management features, including session attachment in pam_flux.so (see pam_flux(8)). (Default: false).
When this feature is disabled, prolog and housekeeping scripts exit early without managing user services or applying resource constraints. The scripts also exit early for the instance owner, whose slice is never managed regardless of this setting.
- kill-user-slice
Boolean value that controls whether housekeeping actively terminates processes remaining in the user slice when stopping user@UID.service. (Default: false).
When set to true, housekeeping implements aggressive cleanup:
1. Checks for orphan processes (processes in user-UID.slice but not under user@UID.service, such as leftover SSH sessions or other systemd scopes)
2. If orphans exist, sends SIGTERM to all processes in the slice
3. Waits for kill-slice-grace-time for processes to exit
4. If processes remain, sends SIGKILL to all processes in the slice
5. Waits for kill-slice-grace-time again
6. If processes still remain, raises an error and drains the node
When set to false (the default), housekeeping stops user@UID.service without attempting to kill processes. Cleanup is delegated to other mechanisms such as site-specific tools.
Warning
All processes in the user's slice — including interactive login sessions — are terminated when the last job completes.
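The escalation described for kill-user-slice = true can be sketched as follows. This is a simplified Python illustration, not the housekeeping script itself: list_pids and send_signal are hypothetical hooks supplied by the caller, the polling interval is an assumption, and the sketch does not distinguish orphans from other processes in the slice.

```python
import time


def cleanup_slice(list_pids, send_signal, grace_seconds, sleep=time.sleep, poll=0.5):
    """Escalate SIGTERM then SIGKILL; return True if the slice ends up empty.

    list_pids() returns the PIDs still in the slice and send_signal(pids, name)
    delivers the named signal. Both are hypothetical caller-supplied hooks.
    """
    for signame in ("SIGTERM", "SIGKILL"):
        pids = list_pids()
        if not pids:
            return True
        send_signal(pids, signame)
        # Wait up to one grace period, stopping early once the slice is empty.
        waited = 0.0
        while waited < grace_seconds and list_pids():
            sleep(poll)
            waited += poll
    # False means processes survived both signals (the drain-the-node case).
    return not list_pids()
```

A False return corresponds to the final step of the sequence, where housekeeping raises an error and the node is drained.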
- kill-slice-grace-time
Duration in Flux Standard Duration (FSD) format specifying how long to wait for processes to exit after each kill signal. Only applies when kill-user-slice = true. (Default: "30s").
The grace time is applied twice: once after SIGTERM, once after SIGKILL. Maximum total cleanup time is therefore 2 * kill-slice-grace-time. If processes remain after both waits, housekeeping drains the node.
See 23/Flux Standard Duration for the FSD format specification.
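As a worked example of the 2 * kill-slice-grace-time bound, the Python sketch below converts a duration string to seconds and doubles it. The parser is a deliberate simplification written for this example: it handles only a number with an optional s, m, h, or d suffix, and it assumes a bare number means seconds. Consult the FSD specification for the full format.

```python
# Simplified duration parser for illustration only; not a full FSD parser.
_UNITS = {"s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}


def fsd_to_seconds(text):
    """Convert a duration such as "30s" or "1.5m" to seconds."""
    suffix = text[-1:]
    if suffix in _UNITS:
        return float(text[:-1]) * _UNITS[suffix]
    return float(text)  # assumption: a bare number is in seconds


def max_cleanup_seconds(grace_time):
    """Upper bound on cleanup waiting: one grace period per kill signal."""
    return 2 * fsd_to_seconds(grace_time)


print(max_cleanup_seconds("30s"))  # default grace time: prints 60.0
print(max_cleanup_seconds("60s"))  # prints 120.0
```

With the default of "30s", housekeeping therefore waits at most 60 seconds on a node before giving up and draining it.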
- debug
Boolean value that enables verbose debug logging for the prolog and housekeeping scripts. When true, each script logs its actions to stderr, which is captured in the Flux job-manager log. Equivalent to setting the FLUX_PAM_SCRIPTS_DEBUG environment variable. (Default: false).
RESOURCE CONSTRAINTS
The pam table works in conjunction with the exec configuration
for resource management:
- exec.sdexec-constrain-resources
When enabled (along with pam.manage-user-slice), prolog scripts compute the union of resources allocated to all of a user's jobs on a node and apply corresponding systemd properties to the user slice:
  - AllowedCPUs - Restricts slice to allocated CPU cores
  - AllowedMemoryNodes - Restricts slice to NUMA nodes for allocated cores
  - DeviceAllow - Grants access only to allocated GPUs
  - DevicePolicy=closed - Blocks access to physical devices except those explicitly allowed
When exec.sdexec-constrain-resources is disabled, prolog/housekeeping still manage the user service lifecycle (start/stop) if pam.manage-user-slice is enabled, but do not apply resource constraints.
See flux-config-exec(5) for details on the exec configuration.
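To illustrate what "union of resources" means for the CPU case, the Python sketch below merges the core IDs of several jobs and formats the result as a comma-separated range list, the form systemd accepts for AllowedCPUs. It is an assumption-laden illustration: in flux-pam the property values are obtained from sdexec-mapper, and the function names and the per-job sets of core IDs used here are hypothetical.

```python
def union_cores(jobs):
    """Union the CPU core IDs allocated to each of a user's jobs on a node."""
    cores = set()
    for job_cores in jobs:
        cores.update(job_cores)
    return cores


def to_cpu_list(cores):
    """Format core IDs as a compact range list, e.g. {0, 1, 2, 5} -> "0-2,5"."""
    ranges = []
    start = prev = None
    for cpu in sorted(cores):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current contiguous range
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Two jobs for the same user on one node (hypothetical allocations).
print(to_cpu_list(union_cores([{0, 1, 2, 3}, {2, 3, 4, 7}])))  # prints 0-4,7
```

Because the slice receives the union, a process from one job is not confined to that job's cores. It may run on any core allocated to any of the user's jobs on the node.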
OPERATION
Prolog Scripts
Prolog scripts run at job start and perform the following actions (when
pam.manage-user-slice is enabled):
1. Acquire an exclusive lock for the user (prevents races between concurrent prolog/housekeeping operations)
2. Check that linger is not enabled for the user (fail hard if it is; see PREREQUISITES)
3. Count active jobs on the node for this user (excluding the starting job)
4. Start user@UID.service (idempotent: no-op if already running due to a concurrent prolog for the same user)
5. If exec.sdexec-constrain-resources is enabled:
   a. Compute the union of resources from all active jobs (including the starting job)
   b. Query sdexec-mapper for systemd properties corresponding to the resource union
   c. Apply properties to user-UID.slice via systemctl set-property
6. Release the lock
Housekeeping Scripts
Housekeeping scripts run at job completion and perform the following
actions (when pam.manage-user-slice is enabled):
1. Acquire an exclusive lock for the user
2. Count remaining active jobs on the node for this user (excluding the completed job)
3. If jobs remain (count > 0) and exec.sdexec-constrain-resources is enabled, recalculate and apply resource constraints for the remaining jobs
4. If no jobs remain (count = 0):
   a. If pam.kill-user-slice is true, perform cleanup sequence (see kill-user-slice above)
   b. Stop user@UID.service
5. Release the lock
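The branching in the housekeeping steps can be summarized as a small decision function. The Python sketch below is illustrative only: the function and the action labels are hypothetical names, and it models just the decisions made while the lock is held, not the actions themselves.

```python
def housekeeping_plan(remaining_jobs, constrain_resources, kill_user_slice):
    """Return the ordered actions housekeeping takes while holding the lock."""
    plan = []
    if remaining_jobs > 0:
        # Other jobs still run: the service stays up, constraints may shrink.
        if constrain_resources:
            plan.append("reapply-constraints")
    else:
        # Last job on this node: optional cleanup, then stop the service.
        if kill_user_slice:
            plan.append("cleanup-slice")
        plan.append("stop-user-service")
    return plan


print(housekeeping_plan(2, True, True))   # prints ['reapply-constraints']
print(housekeeping_plan(0, True, True))   # prints ['cleanup-slice', 'stop-user-service']
print(housekeeping_plan(0, False, False)) # prints ['stop-user-service']
```

Note that kill-user-slice has no effect while any of the user's jobs remain on the node. Cleanup is considered only when the count reaches zero.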
LOCKING AND SERIALIZATION
Prolog and housekeeping scripts acquire an exclusive lock (via flock)
on /run/flux-pam/uid.UID.lock to serialize operations for each user.
This prevents race conditions when multiple jobs for the same user start
or complete concurrently on the same node.
The lock is held for the entire duration of prolog/housekeeping execution and released automatically when the script exits.
The lock directory (/run/flux-pam by default) must have permissions
0700 (owner read/write/execute only) and be owned by root. Lock files
within the directory are created with permissions 0600 (owner read/write
only) and are never deleted (they persist to avoid recreating them on each
operation). If the lock directory has group or other write permissions, both
the prolog/housekeeping scripts and the PAM session module will refuse to
proceed and log an error. The directory is created with correct permissions
at boot by the flux-pam tmpfiles.d drop-in (/usr/lib/tmpfiles.d/flux-pam.conf).
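The directory check and the per-user lock can be sketched in Python with the standard fcntl module. This is a minimal illustration under stated assumptions, not the flux-pam implementation: the function names are hypothetical, and the owner_uid parameter exists only so the check can be exercised without root.

```python
import fcntl
import os
import stat


def lock_dir_is_safe(path, owner_uid=0):
    """Check the documented requirements: a directory, mode 0700, owned by root."""
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)
    return stat.S_ISDIR(st.st_mode) and mode == 0o700 and st.st_uid == owner_uid


def acquire_user_lock(lock_dir, uid):
    """Open (creating with mode 0600 if needed) and lock the per-user file.

    Blocks until the exclusive lock is granted. The lock is released when
    the returned descriptor is closed or the process exits.
    """
    path = os.path.join(lock_dir, f"uid.{uid}.lock")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd
```

A caller would verify lock_dir_is_safe("/run/flux-pam") first, then hold the descriptor returned by acquire_user_lock for the duration of its work. Because the lock is tied to the open file, it is released when the process exits, which matches the automatic release described above.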
SECURITY CONSIDERATIONS
User Isolation
When exec.sdexec-constrain-resources is enabled, systemd resource
constraints ensure that:
- Users can only access CPU cores allocated to their jobs
- Users can only access GPUs allocated to their jobs
- Users cannot access physical devices not explicitly granted
However, all processes in a user's slice (user-UID.slice) share these
constraints. All of a user's jobs on a node, plus any other processes the
user starts within the slice (such as SSH sessions, if permitted),
collectively share the union of resources allocated to the user's jobs.
Orphan Processes
With kill-user-slice = false (the default), processes that outlive a
user's jobs may remain in the user slice even after user@UID.service
stops. These processes may retain access to resources that were allocated
to previous jobs. Sites concerned about this should either:
- Enable kill-user-slice to forcibly terminate orphans
- Configure systemd's KillMode for user slices to handle cleanup
- Deploy separate mechanisms to detect and terminate orphan processes
- Use pam_flux.so account management to deny non-job logins entirely
EXAMPLES
Minimal configuration to enable user slice lifecycle management:
[pam]
manage-user-slice = true
Enable resource constraints (requires systemd execution service):
[exec]
service = "sdexec"
sdexec-constrain-resources = true
[pam]
manage-user-slice = true
Enable aggressive orphan cleanup with 60-second grace time:
[pam]
manage-user-slice = true
kill-user-slice = true
kill-slice-grace-time = "60s"
Enable debug logging for prolog and housekeeping scripts:
[pam]
manage-user-slice = true
debug = true
RESOURCES
Flux Administrator's Guide: https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/admin.html
SEE ALSO
flux-config(5), flux-config-exec(5), flux-config-job-manager(5), Flux Administrator's Guide: Adding Prolog/Housekeeping Scripts, pam_flux(8), pam.d(5)