Flux Accounting Guide
key terms: association, bank
Note
flux-accounting is still beta software and many of the interfaces documented in this guide may change with regularity.
This document is in DRAFT form.
Overview
By default, a Flux system instance treats users equally and schedules work based on demand, without consideration of a user's history of resource consumption, or what share of available resources their organization considers they should be entitled to use relative to other competing users.
Flux-accounting adds a database which stores site policy, banks with with user/project associations, and metrics representing historical usage. It also adds a Flux jobtap plugin that sets the priority on each job that enters the system based on multiple factors including fair share values. The priority determines the order in which jobs are considered by the scheduler for resource allocation. In addition, the jobtap plugin holds or rejects job requests that exceed user/project specific limits or have exhausted their bank allocations.
The database is populated and queried with command line tools prefixed with
flux account
. Accounting scripts are run regularly by
flux-cron(1) to pull historical job information from the Flux
job-list
and job-info
interfaces into the accounting database,
and to push bank and limit data to the jobtap plugin.
At this time, the database is expected to be installed on a cluster management node, co-located with the rank 0 Flux broker, managing accounts for that cluster only. Sites would typically populate the database and keep it up to date automatically using information regularly pulled or pushed from an external source like an identity management system.
Installation and Configuration
System Prerequisites
The Flux Administrator's Guide documents relevant information for the administration and management of a Flux system instance.
The following instructions assume that Flux is configured and working, that
the Flux statedir (/var/lib/flux
) is writable by the flux
user,
and that the flux
user is the system instance owner.
Installing Software Packages
The flux-accounting
package should be installed on the management node
from your Linux distribution package manager. Once installed, the service
that accepts flux account
commands and interacts with the flux-accounting
database can be started.
You can enable the service with systemctl
; if not configured with a custom
path, the flux-accounting systemd unit file will be installed to the same
location as flux-core's systemd unit file:
$ sudo systemctl enable flux-accounting
The service can then be controlled with systemd
. To utilize the service,
the following prerequisites must be met:
1. A flux-accounting database has been created with flux account create-db
.
The service establishes a connection with the database in order to read from
and write to it. If the service has been started before the creation of the
database, you may encounter unexpected behavior from running flux account
commands, such as sqlite3.OperationalError: attempt to write a readonly database
.
2. An active Flux system instance is running. The flux-accounting service will only run after the system instance is started.
Accounting Database Creation
The accounting database is created with the command below. Default
parameters are assumed, including the accounting database path of
/var/lib/flux/FluxAccounting.db
.
$ sudo -u flux flux account create-db
Note
The flux accounting commands should always be run as the flux user. If they are run as root, some commands that rewrite the database could change the owner to root, causing flux-accounting scripts run from flux cron to fail.
Banks must be added to the system, for example:
$ sudo -u flux flux account add-bank root 1
$ sudo -u flux flux account add-bank --parent-bank=root sub_bank_A 1
Users that are permitted to run on the system must be assigned banks, for example:
$ sudo -u flux flux account add-user --username=user1234 --bank=sub_bank_A
Enabling Multi-factor Priority
When flux-accounting is installed, the job manager uses a multi-factor
priority plugin to calculate job priorities. The Flux system instance must
configure the job-manager
to load this plugin.
[job-manager]
plugins = [
{ load = "mf_priority.so" },
]
See also: flux-config-job-manager(5).
The plugin can also be manually loaded with flux jobtap load
. Be sure to
send all flux-accounting data to the plugin after it is loaded:
$ flux jobtap load mf_priority.so
$ flux account-priority-update
Automatic Accounting Database Updates
If updating flux-accounting to a newer version on a system where a flux-accounting DB is already configured and set up, it is important to update the database schema, as tables and columns may have been added or removed in the newer version. The flux-accounting database schema can be updated with the following command:
$ sudo -u flux flux account-update-db
A series of actions should run periodically to keep the accounting system in sync with Flux:
A script fetches inactive jobs and inserts them into a
jobs
table in the flux-accounting DB.The job-archive module scans inactive jobs and dumps them to a sqlite database.
A script reads the archive database and updates the job usage data in the accounting database.
A script updates the per-user fair share factors in the accounting database.
A script pushes updated factors to the multi-factor priority plugin.
The Flux system instance must configure the job-archive
module to run
periodically:
[archive]
period = "1m"
See also: flux-config-archive(5).
The scripts should be run by flux-cron(1):
# /etc/flux/system/cron.d/accounting
30 * * * * bash -c "flux account-fetch-job-records; flux account update-usage; flux account-update-fshare; flux account-priority-update"
Periodically fetching and storing job records in the flux-accounting database
can cause the DB to grow large in size. Since there comes a point where job
records become no longer useful to flux-accounting in terms of job usage and
fair-share calculation, you can run flux account scrub-old-jobs
to
remove old job records. If no argument is passed to this command, it will
delete any job record that has completed more than 6 months ago. This can be
tuned by specifying the number of weeks to go back when determining which
records to remove. The example below will remove any job record more than 4
weeks old:
$ flux account scrub-old-jobs 4
By default, the memory occupied by a SQLite database does not decrease when
records are DELETE
'd from the database. After scrubbing old job records
from the flux-accounting database, if space is still an issue, the VACUUM
command will clean up the space previously occupied by those deleted records.
You can run this command by connecting to the flux-accounting database in a
SQLite shell:
$ sqlite3 FluxAccounting.db
sqlite> VACUUM;
Note that running VACUUM
can take minutes to run and also requires an
exclusive lock on the database; it will fail if the database has a pending SQL
statement or open transaction.
Database Administration
The flux-accounting database is a SQLite database which stores user account information and bank information. Administrators can add, disable, edit, and view user and bank information by interfacing with the database through front-end commands provided by flux-accounting. The information in this database works with flux-core to calculate job priorities submitted by users, enforce basic job accounting limits, and calculate fair-share values for users based on previous job usage.
Each user belongs to at least one bank. This user/bank combination is known as an association, and henceforth will be referred to as an association throughout the rest of this document.
Note
In order to interact with the flux-accounting database, you must have read and write permissions to the directory that the database resides in. The SQLite documentation states that since "SQLite reads and writes an ordinary disk file, the only access permissions that can be applied are the normal file access permissions of the underlying operating system."
The front-end commands provided by flux-accounting allow an administrator to
interact with association or bank information. flux account -h
will list
all possible commands that interface with the information stored in their
respective tables in the flux-accounting database. The current database
consists of the following tables:
table name |
description |
---|---|
association_table |
stores associations |
bank_table |
stores banks |
job_usage_factor_table |
stores past job usage factors for associations |
t_half_life_period_table |
keeps track of the current half-life period for calculating job usage factors |
queue_table |
stores queues, their limits properties, as well as their associated priorities |
project_table |
stores projects for associations to charge their jobs against |
jobs |
stores inactive jobs for job usage and fair share calculation |
To view all associations in a flux-accounting database, the view-bank
command will print this DB information in a hierarchical format. An example is
shown below showing all associations under the root bank:
$ flux account view-bank root -t
Account Username RawShares RawUsage Fairshare
root 1 0.0
bank_A 1 0.0
bank_A user_1 1 0.0 0.5
bank_B 1 0.0
bank_B user_2 1 0.0 0.5
bank_B user_3 1 0.0 0.5
bank_C 1 0.0
bank_C_a 1 0.0
bank_C_a user_4 1 0.0 0.5
bank_C_b 1 0.0
bank_C_b user_5 1 0.0 0.5
bank_C_b user_6 1 0.0 0.5
Job Usage Factor Calculation
An association's job usage represents their usage on a cluster in relation to
the size of their jobs and how long they ran. The raw job usage value is
defined as the sum of products of the number of nodes used (nnodes
) and
time elapsed (t_elapsed
):
RawUsage = sum(nnodes * t_elapsed)
This job usage factor per association has a half-life decay applied to it as time passes. By default, this half-life decay is applied to jobs every week for four weeks; jobs older than four weeks no longer play a role in determining an association's job usage factor. The configuration parameters that determine how to represent a half-life for jobs and how long to consider jobs as part of an association's overall job usage are represented by PriorityDecayHalfLife and PriorityUsageResetPeriod, respectively. These parameters are configured when the flux-accounting database is first created.
Example Job Usage Calculation
Below is an example of how flux-accounting calculates an association's current job usage. Let's say a user has the following job records from the most recent half-life period (by default, jobs that have completed in the last week):
UserID Username JobID T_Submit T_Run T_Inactive Nodes R
0 1002 1002 102 1605633403.22141 1605635403.22141 1605637403.22141 2 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
1 1002 1002 103 1605633403.22206 1605635403.22206 1605637403.22206 2 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
2 1002 1002 104 1605633403.22285 1605635403.22286 1605637403.22286 2 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
3 1002 1002 105 1605633403.22347 1605635403.22348 1605637403.22348 1 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
4 1002 1002 106 1605633403.22416 1605635403.22416 1605637403.22416 1 {"version":1,"execution": {"R_lite":[{"rank":"0","children": {"core": "0"}}]}}
From these job records, we can gather the following information:
total nodes used (
nnodes
): 8total time elapsed (
t_elapsed
): 10000.0
So, the usage of the association from this current half life is:
sum(nnodes * t_elapsed) = (2 * 2000) + (2 * 2000) + (2 * 2000) + (1 * 2000) + (1 * 2000)
= 4000 + 4000 + 4000 + 2000 + 2000
= 16000
This current job usage is then added to the association's previous job usage stored in the flux-accounting database. This sum then represents the association's overall job usage.
Multi-Factor Priority Plugin
The multi-factor priority plugin is a jobtap plugin that generates an integer job priority for incoming jobs in a Flux system instance. It uses a number of factors to calculate a priority and, in the future, can add more factors. Each factor \(F\) has an associated integer weight \(W\) that determines its importance in the overall priority calculation. The current factors present in the multi-factor priority plugin are:
- fair-share
The ratio between the amount of resources allocated vs. resources consumed. See the Glossary definition for a more detailed explanation of how fair-share is utilized within flux-accounting.
- queue
A configurable factor assigned to a queue.
- urgency
A user-controlled factor to prioritize their own jobs.
Thus the priority \(P\) is calculated as follows:
\(P = (F_{fairshare} \times W_{fairshare}) + (F_{queue} \times W_{queue}) + (F_{urgency} - 16)\)
Each of these factors can be configured with a custom weight to increase their relevance to the final calculation of a job's integer priority. By default, fair-share has a weight of 100000 and the queue the job is submitted in has a weight of 10000. These can be modified to change how a job's priority is calculated. For example, if you wanted the queue to be more of a factor than fair-share, you can adjust each factor's weight accordingly:
[accounting.factor-weights]
fairshare = 1000
queue = 100000
In addition to generating an integer priority for submitted jobs in a Flux system instance, the multi-factor priority plugin also enforces per-association job limits to regulate use of the system. The two per-association limits enforced by this plugin are:
max_active_jobs: a limit on how many active jobs an association can have at any given time. Jobs submitted after this limit has been hit will be rejected with a message saying that the association has hit their active jobs limit.
max_running_jobs: a limit on how many running jobs an association can have at any given time. Jobs submitted after this limit has been hit will be held by adding a
max-running-jobs-user-limit
dependency until one of the association's currently running jobs finishes running.
Both "types" of jobs, running and active, are based on Flux's definitions of job states. Active jobs can be in any state but INACTIVE. Running jobs are jobs in either RUN or CLEANUP states.
Glossary
- association
A 2-tuple combination of a username and bank name.
- bank
An account that contains associations.
- fair-share
A metric used to ensure equitable resource allocation among associations within a shared system. It represents the ratio between the amount of resources an association is allocated versus the amount actually consumed. The fair-share value influences an association's priority when submitting jobs to the system, adjusting dynamically to reflect current usage compared to allocated quotas. High consumption relative to allocation can decrease an association's fair-share value, reducing their priority for future resource allocation, thereby promoting balanced usage across all associations to maintain system fairness and efficiency.
Note
The design of flux-accounting was driven by LLNL site requirements. Years ago, the design of Slurm accounting and its multi-factor priority plugin were driven by similar LLNL site requirements. We chose to reuse terminology and concepts from Slurm to facilitate a smooth transition to Flux. The flux-accounting code base is all completely new, however.