Debugging Notes

source tree executables

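The build uses libtool, so an executable in the source tree may be a libtool wrapper script rather than the actual compiled binary. To launch a tool such as GDB against the actual compiled executable, run the tool through libtool e. The flux command front end further complicates this, because sub-commands and the broker are normally launched by the front end rather than directly.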

Note

libtool e is shorthand for libtool --mode=execute.

Example: run a built-in sub-command under GDB

$ libtool e gdb --ex run --args src/cmd/flux version

Example: run an external sub-command under GDB

$ src/cmd/flux /usr/bin/libtool e gdb --ex run --args src/cmd/flux-keygen

Example: run a broker module separately from the broker under GDB

$ src/cmd/flux /usr/bin/libtool e gdb --ex run --args src/broker/flux-module-exec heartbeat

Example: run the broker under GDB

$ src/cmd/flux start --wrap=libtool,e,gdb,--ex,run

Example: run the broker under valgrind

$ src/cmd/flux start --wrap=libtool,e,valgrind
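
In the two broker examples above, the wrapper command is passed to --wrap as a single comma-separated list. As a minimal sketch, the hypothetical helper below (wrap_arg is not part of Flux) builds that list from ordinary shell words. It assumes that no individual argument contains a comma.

```shell
# Hypothetical helper, not part of Flux: join the arguments with commas
# so they can be passed as the value of --wrap.
# Assumes no individual argument contains a comma.
wrap_arg() (
    # The function body runs in a subshell, so changing IFS here
    # does not affect the caller.
    IFS=,
    printf '%s\n' "$*"
)

wrap_arg libtool e gdb --ex run
# prints: libtool,e,gdb,--ex,run
```

It could then be used as src/cmd/flux start --wrap="$(wrap_arg libtool e valgrind)".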

message tracing

Example: trace messages sent/received by a command

$ FLUX_HANDLE_TRACE=t flux kvs get foo

Example: trace messages sent/received by two broker modules

$ flux module trace --full content kvs

Example: trace messages sent/received by this broker on the overlay network

$ flux overlay trace --full

CI failures

Failures that occur only in a GitHub CI workflow can be examined directly on the CI runner over an SSH session provided by tmate.

In flux-core, the tmate action can be enabled by restarting the workflow and selecting the "Enable debug logging" checkbox.

For other framework projects, the tmate action can be temporarily patched into the workflow configuration, as in the diff below for the macOS workflow:

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index c65ea7f08..fba80d919 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -130,6 +130,9 @@ jobs:
       run: make check -j4 TESTS=
     - name: check what works so far
       run: scripts/check-macos.sh
+    - name: tmate
+      if: failure()
+      uses: mxschmitt/action-tmate@v3

Push that temporary change to a personal fork, then find the SSH address in the output of the tmate step.

running tests as Flux jobs

Tests in the testsuite can be run as Flux jobs to help with debugging and testing under different conditions. This approach is particularly useful for ensuring tests are isolated from the enclosing Flux environment, running tests in parallel across multiple nodes, or detecting race conditions.

Example: run a single test as a Flux job

$ flux run -o pty -n1 ./t1234-test.t -d -v

This ensures the test is properly isolated from the enclosing environment, which can be helpful for debugging environment-related issues or verifying that a test doesn't depend on external state. The -o pty option allocates a pty for the job to enable colorized output from the sharness tests.

Example: run all sharness tests in parallel as Flux jobs

$ flux bulksubmit -n1 -o pty --watch --progress --job-name={./%} ./{} -d -v ::: t*.t

This is useful for running the entire test suite (or a subset depending on what the glob after ::: matches) quickly when you have access to multiple nodes or cores. Each test runs as a separate job using one core (cores per test can be adjusted with -c, --cores-per-task=N), allowing you to leverage available resources efficiently. The --watch option displays live output from jobs as they execute, --progress shows progress and pass/fail counts, and --job-name={./%} sets each job name to the test file basename with the .t extension removed.
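
For a .t test file, the job name described above matches what basename produces when told to strip the .t suffix. The sketch below only illustrates that naming. The job_name function is hypothetical, and it is not how bulksubmit implements {./%}.

```shell
# Hypothetical illustration of the job name chosen for each test:
# drop the leading directory and the trailing .t extension.
job_name() {
    basename "$1" .t
}

job_name ./t1234-test.t
# prints: t1234-test
```

Knowing the name in advance makes it easier to find a particular test in the flux jobs listing.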

Note

After running tests with bulksubmit, you can list any failures with flux jobs -f failed and then examine the output of a specific failed test using flux job attach JOBID.
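
When several tests fail, the two commands in the note above can be combined in a loop. This is a minimal sketch. The show_failed function is hypothetical, and the --no-header and --format options are taken from flux-jobs(1), so check them against the installed version.

```shell
# Hypothetical helper: print the output of every failed job in turn.
show_failed() {
    # --no-header and --format='{id}' reduce the listing to bare job IDs.
    for id in $(flux jobs -f failed --no-header --format='{id}'); do
        echo "=== $id ==="
        # attach replays the job's output; its exit status reflects the
        # failed job, and the loop continues regardless.
        flux job attach "$id"
    done
}

# Usage, from inside the Flux instance that ran the tests:
#   show_failed
```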

Example: run a single test multiple times to detect race conditions

$ flux submit --cc=1-16 -o pty --watch --progress -n1 ./t1234-test.t -d -v --root={{tmpdir}}

This runs the same test 16 times simultaneously, each with one core and a unique temporary directory. This is particularly effective for finding race conditions or intermittent failures that only appear under concurrent execution. Adjust the --cc range as needed for your testing requirements. Test concurrency can be further increased by running this example under flux start -N, where N is greater than 1.

Note

As with bulksubmit, failures can be listed with flux jobs -f failed and examined with flux job attach JOBID.

Note

The -d -v flags enable debug mode and verbose output respectively, which provide more detailed information when tests fail.