*************
Configuration
*************

Much of Flux configuration occurs via TOML configuration files found in a
hierarchy under ``/etc/flux``.  There are three separate TOML configuration
spaces: one for flux-security, one for the IMP (an independent component of
flux-security), and one for Flux running as the system instance.  Each
configuration space has a separate directory, from which all files matching
the glob ``*.toml`` are read.  System administrators have the option of using
one file for each configuration space, or breaking up each configuration space
into multiple files.  In the examples below, one file per configuration space
is used.

For more information on the three configuration spaces, please refer to
:man5:`flux-config`, :security:man5:`flux-config-security`, and
:security:man5:`flux-config-security-imp`.

Configuring flux-security
=========================

When Flux is built to support multi-user workloads, job requests are signed
using a library provided by the flux-security project.  This library reads
a static configuration from ``/etc/flux/security/conf.d/*.toml``.  Note that
for security, these files and their parent directory should be owned by
``root`` without write access to other users, so adjust permissions
accordingly.

Example file installed path: ``/etc/flux/security/conf.d/security.toml``

.. code-block:: toml

   # Job requests should be valid for 2 weeks
   # Use munge as the job request signing mechanism
   [sign]
   max-ttl = 1209600  # 2 weeks
   default-type = "munge"
   allowed-types = [ "munge" ]

See also: :security:man5:`flux-config-security-sign`.

Configuring the IMP
===================

The Independent Minister of Privilege (IMP) is the only program that runs
as root, by way of the setuid mode bit.  To enhance security, it has a
private configuration space in ``/etc/flux/imp/conf.d/*.toml``.
Note that the IMP will verify that files in this path and their parent
directories are owned by ``root`` without write access from other users,
so adjust permissions and ownership accordingly.

Example file installed path: ``/etc/flux/imp/conf.d/imp.toml``

.. code-block:: toml

   # Only allow access to the IMP exec method by the 'flux' user.
   # Only allow the installed version of flux-shell(1) to be executed.
   [exec]
   allowed-users = [ "flux" ]
   allowed-shells = [ "/usr/libexec/flux/flux-shell" ]

   # Enable the "flux" PAM stack (requires PAM configuration file)
   pam-support = true

See also: :security:man5:`flux-config-security-imp`.

Configuring the Flux PAM Stack
------------------------------

If PAM support is enabled in the IMP config, the ``flux`` PAM stack must
exist and have at least one ``auth`` and one ``session`` module.

Example file installed path: ``/etc/pam.d/flux``

.. code-block:: console

   auth    required pam_localuser.so
   session required pam_limits.so

The ``pam_limits.so`` module is useful for setting default job resource
limits.  If it is not used, jobs run in the system instance may inherit
inappropriate limits from ``flux-broker``.

.. note::
   The Linux kernel employs a heuristic when assigning initial limits to
   pid 1.  For example, the max user processes and max pending signals
   are scaled by the amount of system RAM.  The Flux system broker inherits
   these limits and passes them on to jobs if PAM limits are not configured.
   This may result in rlimit warning messages similar to

   .. code-block:: console

      flux-shell[0]:  WARN: rlimit: nproc exceeds current max, raising value to hard limit

.. _config_cert:

Configuring the Network Certificate
===================================

Overlay network security requires a certificate to be distributed to all
nodes.  It should be readable only by the ``flux`` user.
To create a new certificate, run :man1:`flux-keygen` as the ``flux`` user,
then copy the result to ``/etc/flux/system`` since the ``flux`` user will
not have write access to this location:

.. code-block:: console

   $ sudo -u flux flux keygen /tmp/curve.cert
   $ sudo mv /tmp/curve.cert /etc/flux/system/curve.cert

Do this once and then copy the certificate to the same location on
the other nodes, preserving owner and mode.

.. note::
   The ``flux`` user only needs read access to the certificate and
   other files and directories under ``/etc/flux``.  Keeping these files
   and directories non-writable by user ``flux`` adds an extra layer of
   security for the system instance configuration.

Systemd and cgroup unified hierarchy
====================================

The flux systemd unit launches a systemd user instance as the flux user.
It is recommended to use this to run user jobs, as it provides cgroups
containment and the ability to enforce resource limits such as memory caps
and CPU isolation.  To do this, Flux requires the cgroup version 2 unified
hierarchy:

- The cgroup2 file system must be mounted on ``/sys/fs/cgroup``

- On some systems, add ``systemd.unified_cgroup_hierarchy=1`` to the
  kernel command line (RHEL 8).

- On some systems, add ``cgroup_enable=memory`` to the kernel command line
  (Debian 12).

The cgroup controllers needed for resource containment must also be delegated
to the Flux systemd user instance.  Create or update the following override
files:

``/etc/systemd/system/flux.service.d/override.conf``:

.. code-block:: ini

   [Service]
   Delegate=cpu cpuset io memory pids

``/etc/systemd/system/user@<uid>.service.d/override.conf`` (where ``<uid>``
is the numeric UID of the ``flux`` user, e.g. from ``id -u flux``):

.. code-block:: ini

   [Service]
   Delegate=cpu cpuset io memory pids

.. note::
   ``cpuset`` delegation is required for ``sdexec-constrain-resources``
   (``AllowedCPUs`` enforcement), and ``memory`` is required for memory
   limits (``MemoryMax``).  The remaining controllers (``cpu``, ``io``,
   ``pids``) are not required but may be useful via ``sdexec-properties``
   or in future Flux releases.

After creating or modifying these files, reload systemd and restart the
Flux user instance::

   systemctl daemon-reload
   systemctl restart user@$(id -u flux).service

The configuration that follows presumes jobs will be launched through systemd,
although it is not strictly required if your system cannot meet these
prerequisites.

When using systemd for job execution, consider enabling resource containment
via ``exec.sdexec-constrain-resources`` to restrict jobs to their allocated
CPUs, GPUs, and devices.  See :man5:`flux-config-exec` for details on
configuring resource containment and customizing the resource mapping behavior.

.. _config-flux:

Configuring the Flux System Instance
====================================

Although the security components need to be isolated, most Flux components
share a common configuration space, which for the system instance is located
in ``/etc/flux/system/conf.d/*.toml``.  The Flux broker for the system instance
is pointed to this configuration by the systemd unit file.

Example file installed path: ``/etc/flux/system/conf.d/system.toml``

.. code-block:: toml

   # Enable the sdbus and sdexec broker modules
   [systemd]
   enable = true

   # Flux needs to know the path to the IMP executable
   [exec]
   imp = "/usr/libexec/flux/flux-imp"

   # Run jobs in a systemd user instance
   service = "sdexec"

   # Constrain jobs to allocated resources (CPUs, GPUs, devices)
   sdexec-constrain-resources = true

   # Limit jobs to a percentage of physical memory
   [exec.sdexec-properties]
   MemoryMax = "95%"

   # Allow users other than the instance owner (guests) to connect to Flux
   # Optionally, root may be given "owner privileges" for convenience
   [access]
   allow-guest-user = true
   allow-root-owner = true

   # Point to shared network certificate generated by flux-keygen(1).
   # Define the network endpoints for Flux's tree based overlay network
   # and inform Flux of the hostnames that will start flux-broker(1).
   [bootstrap]
   curve_cert = "/etc/flux/system/curve.cert"

   default_port = 8050
   default_bind = "tcp://eth0:%p"
   default_connect = "tcp://%h:%p"

   # Rank 0 is the TBON parent of all brokers unless explicitly set with
   # parent directives.
   hosts = [
      { host = "test[1-16]" },
   ]

   # Speed up detection of crashed network peers (system default is around 20m)
   [tbon]
   tcp_user_timeout = "2m"

   # Uncomment 'norestrict' if flux broker is constrained to system cores by
   # systemd or other site policy.  This allows jobs to run on assigned cores.
   # Uncomment 'exclude' to avoid scheduling jobs on certain nodes (e.g. login,
   # management, or service nodes).
   [resource]
   #norestrict = true
   #exclude = "test[1-2]"

   [[resource.config]]
   hosts = "test[1-15]"
   cores = "0-7"
   gpus = "0"

   [[resource.config]]
   hosts = "test16"
   cores = "0-63"
   gpus = "0-1"
   properties = ["fatnode"]

   # Store the kvs root hash in sqlite periodically in case of broker crash.
   # Recommend offline KVS garbage collection when commit threshold is reached.
   [kvs]
   checkpoint-period = "30m"
   gc-threshold = 100000

   # Immediately reject jobs with invalid jobspec or unsatisfiable resources
   [ingest.validator]
   plugins = [ "jobspec", "feasibility" ]

   # Remove inactive jobs from the KVS after one week.
   [job-manager]
   inactive-age-limit = "7d"

   # Jobs submitted without duration get a very short one
   [policy.jobspec.defaults.system]
   duration = "1m"

   # Jobs that explicitly request more than the following limits are rejected
   [policy.limits]
   duration = "2h"
   job-size.max.nnodes = 8
   job-size.max.ncores = 32

   # Configure the flux-sched (fluxion) scheduler policies
   # The 'lonodex' match policy selects node-exclusive scheduling, and can be
   # commented out if jobs may share nodes.
   [sched-fluxion-qmanager]
   queue-policy = "easy"
   [sched-fluxion-resource]
   match-policy = "lonodex"
   match-format = "rv1_nosched"

See also: :man5:`flux-config-exec`, :man5:`flux-config-access`,
:man5:`flux-config-bootstrap`, :man5:`flux-config-tbon`,
:man5:`flux-config-resource`, :man5:`flux-config-ingest`,
:man5:`flux-config-job-manager`, :man5:`flux-config-policy`,
:man5:`flux-config-kvs`, :man5:`flux-config-systemd`,
:sched:man5:`flux-config-sched-fluxion-qmanager`,
:sched:man5:`flux-config-sched-fluxion-resource`.

Configuring Resources
=====================

The Flux system instance must be configured with a static resource set.
The ``resource.config`` TOML array in the example above is the preferred
way to configure clusters with a resource set consisting of only nodes,
cores, and GPUs.

More complex resource sets may be represented by generating a file in
RFC 20 (R version 1) form with scheduler extensions using a combination of
``flux R encode`` and ``flux ion-R encode`` and then configuring
``resource.path`` to its fully-qualified file path.  The details of this
method are beyond the scope of this document.

When Flux is running, ``flux resource list`` shows the configured resource
set and any resource properties.

Persistent Storage on Rank 0
============================

Flux is prolific in its use of disk space to back up its key value store,
proportional to the number of jobs run and the quantity of standard I/O.
On your rank 0 node, ensure that the ``statedir`` directory (normally
``/var/lib/flux``) has plenty of space and is preserved across Flux instance
restarts.

The ``statedir`` directory is used for the ``content.sqlite`` file that
contains content addressable storage backing the Flux key value store (KVS).

Adding Prolog/Epilog/Housekeeping Scripts
=========================================

Flux can execute site-defined scripts as root on compute nodes before and
after each job.

prolog
   The prolog runs as soon as the job enters RUN state.  Job shells are not
   launched until all prolog tasks have completed.  If the prolog fails on
   any nodes, or if any node takes longer than a fail-safe timeout (default
   30m), those nodes are drained and a fatal exception is raised on the job.
   If the job is canceled or reaches its time limit during the prolog, the
   prolog is simply aborted and the job enters COMPLETING state.

epilog
   The epilog runs after the job shell exits on all nodes, with the job held
   in COMPLETING state until all epilog tasks have terminated.  If the epilog
   fails on any nodes, those nodes are drained and a fatal exception is
   raised on the job.  There is no default epilog timeout.

housekeeping
   Housekeeping runs after the job has reached the INACTIVE state.  It is not
   recorded in the job eventlog and does not affect the job result.  If
   housekeeping fails on any nodes, those nodes are drained.  Housekeeping
   releases resources to the scheduler as they complete.

Script Installation Locations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. note::
   New in v0.78.0

When configured as recommended below, Flux runs prolog, epilog and
housekeeping scripts from the following locations (in order):

- Package provided scripts in ``$libexecdir/flux/{name}.d``, where
  ``{name}`` is ``prolog``, ``epilog``, or ``housekeeping``.

- If ``/etc/flux/system/{name}`` exists and is executable, then this
  site provided script is run next.  This provides backward compatible
  support for existing installations or allows sites to override default
  behavior for execution of site-provided scripts.

- If ``/etc/flux/system/{name}`` does not exist or is not executable, then
  all scripts in ``/etc/flux/system/{name}.d`` are executed.

.. note::
   The location of ``$libexecdir`` is system dependent, but can be
   determined from :command:`pkg-config --variable=fluxlibexecdir flux-core`.

Verifying Script Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use :command:`flux admin system-scripts` to view the current configuration
and verify which scripts will be executed:

.. code-block:: console

   $ flux admin system-scripts
   prolog: enabled (per-rank=true)
     system: /usr/libexec/flux/prolog.d
       ✓ 01-syslog.sh
     site: /etc/flux/system/prolog.d
       ✓ 10-local-setup.sh

   epilog: enabled (per-rank=true)
     system: /usr/libexec/flux/epilog.d
       ✓ 01-cleanup.sh

   housekeeping: not configured

This command shows:

- Whether each system script type is enabled or configured

- The execution mode:

  - ``per-rank=true``: scripts execute on all allocated nodes
  - ``per-rank=false``: scripts execute only on rank 0

  Note: The ``per-rank`` setting controls *where* scripts execute, not
  *which* scripts execute.  The same scripts from the directories are used
  in both modes.

- Which scripts will be executed and from which directories.  Scripts are
  shown when using the default ``flux-imp run <name>`` command (used when no
  explicit command is configured).  Custom commands show the command itself
  instead of script directories.

- Whether scripts are executable (✓) or not (✗)

- Whether a legacy single-file script is present, in which case the ``.d``
  directory is skipped

Use :option:`flux admin system-scripts -v` to show script locations even
when not configured.

Script Environment
~~~~~~~~~~~~~~~~~~

Scripts may use :envvar:`FLUX_JOB_ID` and :envvar:`FLUX_JOB_USERID` to take
job or user specific actions.  Flux commands can be run from the scripts
with instance owner credentials if the system is configured for root access
as suggested in :ref:`config-flux`.

The IMP sets :envvar:`PATH` to a safe ``/usr/sbin:/usr/bin:/sbin:/bin``.

Error Handling
~~~~~~~~~~~~~~

By default, the Flux prolog, epilog, and housekeeping collect exit codes
from all scripts and will exit with the highest exit code encountered.  This
allows all scripts to run even if some fail.  To change this behavior and
exit immediately on the first script failure, one or more of the following
entries can be added to the broker configuration:

.. code-block:: toml

   [job-manager.prolog]
   exit-on-first-error = true

   [job-manager.epilog]
   exit-on-first-error = true

   [job-manager.housekeeping]
   exit-on-first-error = true

Configuration
~~~~~~~~~~~~~

1. Flux provides systemd *oneshot* units ``flux-prolog@``, ``flux-epilog@``,
   and ``flux-housekeeping@`` templated by jobid, which run the actual
   workflow described in the sections above.  Configure the IMP to allow the
   system instance user to start these units as root via the provided
   wrapper scripts:

   .. code-block:: toml

      [run.prolog]
      allowed-users = [ "flux" ]
      allowed-environment = [ "FLUX_*" ]
      path = "/usr/libexec/flux/cmd/flux-run-prolog"

      [run.epilog]
      allowed-users = [ "flux" ]
      allowed-environment = [ "FLUX_*" ]
      path = "/usr/libexec/flux/cmd/flux-run-epilog"

      [run.housekeeping]
      allowed-users = [ "flux" ]
      allowed-environment = [ "FLUX_*" ]
      path = "/usr/libexec/flux/cmd/flux-run-housekeeping"

2. Configure the Flux system instance to run prolog, epilog, and
   housekeeping:

   .. code-block:: toml

      [job-manager]
      plugins = [
        { load = "perilog.so" }
      ]

      [job-manager.prolog]
      per-rank = true
      # timeout = "30m"

      [job-manager.epilog]
      per-rank = true
      # timeout = "0"

      [job-manager.housekeeping]
      release-after = "30s"

Standard output and standard error of the prolog, epilog, and housekeeping
units are captured by the systemd journal.  Standard systemd tools like
:linux:man1:`systemctl` and :linux:man1:`journalctl` can be used to observe
and manipulate the prolog, epilog, and housekeeping systemd units.

See also: :man1:`flux-housekeeping`, :man5:`flux-config-job-manager`,
:security:man5:`flux-config-security-imp`.

Adding Job Request Validation
=============================

Jobs are submitted to Flux via a job-ingest service.  This service
validates all jobs before they are assigned a jobid and announced to
the job manager.  By default, only basic validation is done, but the
validator supports plugins so that job ingest validation is configurable.
The list of available plugins can be queried via
``flux job-validator --list-plugins``.  The current list of plugins
distributed with Flux is shown below:

.. code-block:: console

   $ flux job-validator --list-plugins
   Available plugins:
   feasibility           Use feasibility service to validate job
   jobspec               Python bindings based jobspec validator
   require-instance      Require that all jobs are new instances of Flux
   schema                Validate jobspec using jsonschema

Only the ``jobspec`` plugin is enabled by default.

In a system instance, it may be useful to also enable the ``feasibility`` and
``require-instance`` validators.  This can be done by configuring the Flux
system instance via the ``ingest`` TOML table, as shown in the example below:

.. code-block:: toml

   [ingest.validator]
   plugins = [ "jobspec", "feasibility", "require-instance" ]

The ``feasibility`` plugin will allow the scheduler to reject jobs that
are not feasible given the current resource configuration.  Otherwise, these
jobs are enqueued, but will have a job exception raised once the job is
considered for scheduling.

The ``require-instance`` plugin rejects jobs that do not start another
instance of Flux.  That is, jobs are required to be submitted via tools
like :man1:`flux-batch` and :man1:`flux-alloc`, or the equivalent.  For
example, with this plugin enabled, a user running :man1:`flux-run` will
have their job rejected with the message:

.. code-block:: console

   $ flux run -n 1000 myapp
   flux-run: ERROR: [Errno 22] Direct job submission is disabled for this instance. Please use the flux-batch(1) or flux-alloc(1) commands.

See also: :man5:`flux-config-ingest`.

Adding Queues
=============

It may be useful to configure a Flux system instance with multiple queues.
Each queue should be associated with a non-overlapping resource subset,
identified by property name.  It is good practice to create a new property
for each queue that has the same name as the queue.
(There is no requirement that queue properties match the queue name, but
when they differ, the properties appear as extra entries in the PROPERTIES
column for these queues.  When queue names match property names,
:command:`flux resource list` suppresses these matching properties in its
output.)

When queues are defined, all jobs must specify a queue at submission time.
If that is inconvenient, then ``policy.jobspec.defaults.system.queue`` may
define a default queue.

Finally, queues can override the ``[policy]`` table on a per queue basis.
This is useful for setting queue-specific limits.

Here is an example that puts these concepts together:

.. code-block:: toml

   [policy]
   jobspec.defaults.system.duration = "1m"
   jobspec.defaults.system.queue = "debug"

   [[resource.config]]
   hosts = "test[1-4]"
   properties = ["debug"]

   [[resource.config]]
   hosts = "test[5-16]"
   properties = ["batch"]

   [queues.debug]
   requires = ["debug"]
   policy.limits.duration = "30m"

   [queues.batch]
   requires = ["batch"]
   policy.limits.duration = "4h"

When named queues are configured, :man1:`flux-queue` may be used to
list them:

.. code-block:: console

   $ flux queue status
   batch: Job submission is enabled
   debug: Job submission is enabled
   Scheduling is enabled

See also: :man5:`flux-config-policy`, :man5:`flux-config-queues`,
:man5:`flux-config-resource`, :man1:`flux-queue`.

Policy Limits
=============

Job duration and size are unlimited by default, or limited by the scheduler
feasibility check discussed above, if configured.  When policy limits are
configured, the job request is compared against them *after* any configured
jobspec defaults are set, and *before* the scheduler feasibility check.
If the job would exceed a duration or job size policy limit, the job
submission is rejected.

.. warning::
   flux-sched 0.25.0 limitation: jobs that specify nodes but not cores may
   escape flux-core's ``ncores`` policy limit, and jobs that specify cores
   but not nodes may escape the ``nnodes`` policy limit.  The flux-sched
   feasibility check will eventually cover this case.  Until then, be sure
   to set both ``nnodes`` *and* ``ncores`` limits when configuring job size
   policy limits.

Limits are global when set in the top level ``[policy]`` table.  Global
limits may be overridden by a ``policy`` table within a ``[queues]`` entry.
Here is an example which implements duration and job size limits for two
queues:

.. code-block:: toml

   # Global defaults
   [policy]
   jobspec.defaults.system.duration = "1m"
   jobspec.defaults.system.queue = "debug"

   [queues.debug]
   requires = ["debug"]
   policy.limits.duration = "30m"
   policy.limits.job-size.max.nnodes = 2
   policy.limits.job-size.max.ncores = 16

   [queues.batch]
   requires = ["batch"]
   policy.limits.duration = "8h"
   policy.limits.job-size.max.nnodes = 16
   policy.limits.job-size.max.ncores = 128

See also: :man5:`flux-config-policy`.

Use PAM to Restrict Access to Compute Nodes
===========================================

If Pluggable Authentication Modules (PAM) are in use within a cluster, it may
be convenient to use the ``pam_flux.so`` *account* module to configure a PAM
stack that denies users access to compute nodes unless they have a job running
there.

Install the ``flux-pam`` package to make the ``pam_flux.so`` module available
to be added to one or more PAM stacks, e.g.

.. code-block:: console

   account  sufficient pam_flux.so
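
As a rough illustration of where such a line might sit, the sketch below
assumes the stack being edited is the one used by ``sshd`` and that the
standard ``pam_access.so`` module serves as the fallback.  Both choices are
assumptions made for this example and do not come from the Flux
documentation.  Under standard PAM semantics, a ``sufficient`` module that
succeeds ends the account checks with success, while its failure is ignored
and evaluation continues, so a later module has to perform the actual denial.

.. code-block:: console

   # Hypothetical excerpt of /etc/pam.d/sshd (the file choice is an assumption)
   account  sufficient pam_flux.so
   # Assumed fallback: pam_access.so applies the site's /etc/security/access.conf
   account  required   pam_access.so

With an arrangement like this, users with a job running on the node would
pass the account phase through ``pam_flux.so``, and everyone else would be
subject to whatever the site's access rules allow.  Test any change of this
kind from a second, already open session to avoid locking out administrators.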