Debugging Notes
source tree executables
libtool is in use, so programs in the build tree may be libtool wrapper scripts rather than binaries, and libtool e trickery is needed to launch a tool against
an actual compiled executable. Command front ends further complicate this.
Note
libtool e is shorthand for libtool --mode=execute.
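If the long form is preferred in scripts, the shorthand can be wrapped in a small shell function. This is only a sketch: the name lte is hypothetical and not part of flux-core or libtool.

```shell
# Hypothetical helper (not part of flux-core): run a tool through
# libtool's execute mode, spelling out the long option.
lte() {
    libtool --mode=execute "$@"
}

# Example invocation (not run here, since it needs a built source tree):
#   lte gdb --ex run --args src/cmd/flux version
```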
Example: run a built-in sub-command under GDB
$ libtool e gdb --ex run --args src/cmd/flux version
Example: run an external sub-command under GDB
$ src/cmd/flux /usr/bin/libtool e gdb --ex run --args src/cmd/flux-keygen
Example: run a broker module separately from the broker under GDB
$ src/cmd/flux /usr/bin/libtool e gdb --ex run --args src/broker/flux-module-exec heartbeat
Example: run the broker under GDB
$ src/cmd/flux start --wrap=libtool,e,gdb,--ex,run
Example: run the broker under valgrind
$ src/cmd/flux start --wrap=libtool,e,valgrind
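The --wrap values above are the wrapper command's words joined with commas. For longer wrapper commands, the value can be generated rather than typed by hand. A minimal sketch, assuming no argument itself contains a comma; the helper name wrap_arg is hypothetical:

```shell
# Hypothetical helper: join all arguments with commas to form a --wrap value.
# The body runs in a subshell so the IFS change does not leak to the caller.
wrap_arg() (
    IFS=,
    printf '%s\n' "$*"
)

wrap_arg libtool e gdb --ex run    # prints: libtool,e,gdb,--ex,run
```

It could then be used as: src/cmd/flux start --wrap="$(wrap_arg libtool e valgrind)"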
message tracing
Example: trace messages sent/received by a command
$ FLUX_HANDLE_TRACE=t flux kvs get foo
Example: trace messages sent/received by two broker modules
$ flux module trace --full content kvs
Example: trace messages sent/received by this broker on the overlay network
$ flux overlay trace --full
CI failures
Failures that occur only in a GitHub CI workflow can be examined directly with tmate access.
In flux-core, the tmate action can be enabled by restarting the workflow and selecting the "Enable debug logging" checkbox.
For other framework projects, the tmate action can be temporarily patched into the workflow config, as in the diff below for the macOS workflow:
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index c65ea7f08..fba80d919 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -130,6 +130,9 @@ jobs:
       run: make check -j4 TESTS=
     - name: check what works so far
       run: scripts/check-macos.sh
+    - name: tmate
+      if: failure()
+      uses: mxschmitt/action-tmate@v3
Push that temporary change to a personal fork, then find the SSH address by examining the action output.
running tests as Flux jobs
Tests in the testsuite can be run as Flux jobs to help with debugging and testing under different conditions. This approach is particularly useful for ensuring tests are isolated from the enclosing Flux environment, running tests in parallel across multiple nodes, or detecting race conditions.
Example: run a single test as a Flux job
$ flux run -o pty -n1 ./t1234-test.t -d -v
This ensures the test is properly isolated from the enclosing environment,
which can be helpful for debugging environment-related issues or verifying
that a test doesn't depend on external state. The -o pty option
allocates a pty for the job to enable colorized output from the sharness
tests.
Example: run all sharness tests in parallel as Flux jobs
$ flux bulksubmit -n1 -o pty --watch --progress --job-name={./%} ./{} -d -v ::: t*.t
This is useful for running the entire test suite (or a subset depending on
what the glob after ::: matches) quickly when you have access to multiple
nodes or cores. Each test runs as a separate job using one core (cores per
test can be adjusted with -c, --cores-per-task=N), allowing
you to leverage available resources efficiently. The --watch
option displays live output from jobs as they execute, --progress
shows progress and pass/fail counts, and --job-name={./%} sets
each job name to the test file basename with the .t extension removed.
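To illustrate what the text says {./%} produces, the same transformation can be written with plain shell parameter expansion. This sketch only mirrors the described result (basename with the extension removed); it is not how bulksubmit implements it, and the function name job_name is hypothetical:

```shell
# Hypothetical helper: compute the job name described above for {./%}.
job_name() {
    base=${1##*/}                 # strip any leading directory components
    printf '%s\n' "${base%.*}"    # strip the final extension (e.g. ".t")
}

job_name ./t1234-test.t    # prints: t1234-test
```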
Note
After running tests with bulksubmit, you can list any failures with
flux jobs -f failed and then examine the output of a specific failed
test using flux job attach JOBID.
Example: run a single test multiple times to detect race conditions
$ flux submit --cc=1-16 -o pty --watch --progress -n1 ./t1234-test.t -d -v --root={{tmpdir}}
This submits 16 copies of the same test, each with one core and a
unique temporary directory, so they run concurrently when enough cores are available. This is particularly effective for finding
race conditions or intermittent failures that only appear under concurrent
execution. Adjust the --cc range as needed for your testing requirements.
Test concurrency can be further increased by running this example under
flux start -N, where N is greater than 1.
Note
As with bulksubmit, failures can be identified with
flux jobs -f failed and examined with flux job attach JOBID.
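When many copies fail, the two commands can be combined in a loop. A minimal sketch, assuming flux jobs is asked for bare job IDs with its no-header and format options (-n and -o '{id}'); the function name show_failed is hypothetical:

```shell
# Hypothetical helper: print the output of every failed job in turn.
show_failed() {
    # -f failed selects failed jobs; -n drops the header; -o '{id}' prints only IDs
    for id in $(flux jobs -f failed -no '{id}'); do
        echo "=== $id ==="
        flux job attach "$id"
    done
}
```

Run show_failed inside the same Flux instance that ran the tests.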
Note
The -d -v flags enable debug mode and verbose output respectively,
which provide more detailed information when tests fail.