Execute LUE framework programs

Programs that use the LUE framework can be executed on a single node, like a laptop or desktop computer, or on a distributed memory system, like a compute cluster.

LUE framework programs are designed to use the available hardware as well as possible. The hardware available to a running program is determined by command line arguments.

The LUE framework makes use of the HPX AMT (asynchronous many-task) runtime, which can be configured using configuration files and command line arguments. How this works is described in the HPX manual. Additionally, LUE framework programs provide LUE-specific command line arguments, for example for setting the partition size of partitioned arrays.
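
As an illustration (with a hypothetical script name and argument), HPX options are passed on the same command line as the program's own arguments: everything prefixed with --hpx: goes to the runtime, the rest goes to the program itself.

# Hypothetical script and argument names. --hpx:ini overrides an HPX
# configuration setting; --hpx:print-bind prints the resulting CPU
# bindings so the configuration can be verified.
python my_lue_script.py my_argument \
    --hpx:ini="hpx.os_threads=6" \
    --hpx:print-bind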

The next sections provide examples for executing LUE framework programs. Unless stated otherwise, the information applies both to binary programs using the LUE framework C++ API and to Python scripts using the LUE framework Python API.

In the examples in this section, we assume that a cluster exists with nodes that have the following properties:

CPUs: 2 AMD EPYC 7451 (2 packages)
NUMA nodes: 8 (4 / package)
Cores: 48 (6 / NUMA node)

Each cluster node contains 2 CPUs, each of which contains 4 NUMA nodes, each of which contains 6 cores. Memory latencies within a NUMA node are lower than between NUMA nodes. An easy way to take advantage of this is to assign one LUE framework process (an HPX locality) to each NUMA node (8 per cluster node).
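
To see how packages, NUMA nodes, and cores are laid out on your own hardware, you can inspect the topology with hwloc, the library HPX itself uses for thread binding. A minimal sketch:

# Print a textual summary of the machine topology: packages, NUMA nodes,
# caches, cores and processing units (hardware threads).
hwloc-ls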

Note

If you experience crashes when running LUE framework programs, they may be caused by the custom memory allocator being loaded only after the program has already started allocating memory. Memory allocations and deallocations must be handled by the same allocation library. Depending on how the HPX library that LUE depends on was built, a non-standard allocation library with faster memory allocation and deallocation functions may be used. For Python scripts using LUE, this allocation library is only loaded once the script has already started allocating memory. We have noticed that, when HPX is built with support for the tcmalloc memory allocation library, starting LUE Python scripts like this solved the crashes:

LD_PRELOAD=<prefix>/lib/libtcmalloc_minimal.so.4 python ./my_lue_script.py
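
Whether your HPX build links against tcmalloc can be checked by listing the shared libraries of the HPX core library, assuming it is installed as <prefix>/lib/libhpx.so (adjust the path for your installation):

# If HPX was built with tcmalloc support, a libtcmalloc entry shows up
# among the shared libraries libhpx.so depends on.
ldd <prefix>/lib/libhpx.so | grep -i tcmalloc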

Single-node execution

TODO
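
A minimal sketch, assuming a hypothetical my_model.py script with hypothetical arguments: on a single machine no job scheduler or MPI launcher is needed, and the number of HPX worker threads can be passed directly on the command line.

# Hypothetical script and argument names. --hpx:threads sets the number
# of worker threads; --hpx:print-bind prints the resulting CPU bindings.
python my_model.py my_argument1 my_argument2 \
    --hpx:threads=6 \
    --hpx:print-bind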

Distributed execution

SLURM

See also How to use HPX applications with SLURM.

Synchronous

The following example script shows how to start a synchronous SLURM job. This is handy for quick runs, for example while you are still developing the model.

The UCX_LOG_LEVEL environment variable and the --mca btl_openib_allow_ib true argument of mpirun are necessary on the cluster on which this was tested. These are unrelated to LUE or HPX, and may or may not be necessary on other clusters.

# Fixed. Depends on platform.
partition="my_partition"
nr_numa_domains_per_node=8
nr_cores_per_socket=6
nr_cpus_per_task=12
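# HPX thread binding: worker threads 0-5 each get the first processing
# unit (pu:0) of cores 0-5.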
cpu_binding="thread:0-5=core:0-5.pu:0"

# Depends on size of job.
nr_nodes=1

# Fixed.
nr_localities=$(expr $nr_nodes \* $nr_numa_domains_per_node)

export UCX_LOG_LEVEL=error

salloc \
    --partition=$partition \
    --nodes=$nr_nodes \
    --ntasks=$nr_localities \
    --cpus-per-task=$nr_cpus_per_task \
    --cores-per-socket=$nr_cores_per_socket \
    mpirun \
        --mca btl_openib_allow_ib true \
        python my_model.py \
            my_argument1 my_argument2 \
            --hpx:ini="hpx.os_threads=$nr_cores_per_socket" \
            --hpx:bind=$cpu_binding \
            --hpx:print-bind

Note

In case you see this warning:

hpx::init: command line warning: --hpx:localities used when
running with SLURM, requesting a different number of localities
(8) than have been assigned by SLURM (1), the application might
not run properly.

but the printed CPU bindings seem fine, then you can safely ignore it.

Asynchronous

The following example script shows how to start an asynchronous SLURM job. This is handy for long-running jobs, for example after you have finished developing the model.

# Fixed. Depends on platform.
partition="my_partition"
nr_numa_domains_per_node=8
nr_cores_per_socket=6
nr_cpus_per_task=12
cpu_binding="thread:0-5=core:0-5.pu:0"

# Depends on size of job.
nr_nodes=12

# Fixed.
nr_localities=$(expr $nr_nodes \* $nr_numa_domains_per_node)

sbatch --job-name my_job_name << END_OF_SLURM_SCRIPT
#!/usr/bin/env bash
#SBATCH --nodes=$nr_nodes
#SBATCH --ntasks=$nr_localities
#SBATCH --cpus-per-task=$nr_cpus_per_task
#SBATCH --cores-per-socket=$nr_cores_per_socket
#SBATCH --partition=$partition

set -e

module purge
module load my_required_model

mpirun \
    --mca btl_openib_allow_ib true \
    python my_model.py \
        my_argument1 my_argument2 \
        --hpx:ini="hpx.os_threads=$nr_cores_per_socket" \
        --hpx:bind=$cpu_binding \
        --hpx:print-bind

END_OF_SLURM_SCRIPT
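
sbatch returns immediately after printing the job id; the job itself runs whenever SLURM schedules it. By default its output is written to a slurm-<jobid>.out file in the submission directory, and its state can be checked with the usual SLURM tools:

# Show the state of your pending and running jobs.
squeue -u $USER

# Follow the output of the job while it runs; <jobid> is the id printed
# by sbatch.
tail -f slurm-<jobid>.out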