Globus Compute Endpoints#

A Globus Compute Endpoint is a persistent service launched by the user on a compute system to serve as a conduit for executing functions on that computer. Globus Compute supports a range of target systems, enabling an endpoint to be deployed on a laptop, the login node of a campus cluster, a cloud instance, or a Kubernetes cluster, for example.

The endpoint requires outbound network connectivity. That is, it must be able to connect to Globus Compute at compute.api.globus.org.
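A quick way to confirm outbound connectivity before installing anything is to attempt a TCP connection from the target machine. The following is a minimal sketch, not part of the Globus Compute tooling: the helper name `can_connect` is hypothetical, and the port to test (for example 443, the standard HTTPS port) is an assumption you should adapt to your site.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only tests reachability and
    does not speak HTTPS or authenticate with any service.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, `can_connect("compute.api.globus.org", 443)` should return `True` on a machine that is allowed to reach the service.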

The Globus Compute Endpoint package is available on PyPI (and thus available via pip). However, we strongly recommend installing the Globus Compute endpoint into an isolated virtual environment. Pipx automatically manages package-specific virtual environments for command line applications, so install Globus Compute endpoint via:

$ python3 -m pipx install globus-compute-endpoint

Note

Please note that the Globus Compute endpoint is only supported on Linux.

After installing the Globus Compute endpoint, use the globus-compute-endpoint command to manage existing endpoints.

First time setup#

You will be required to authenticate the first time you run globus-compute-endpoint. Once you have authenticated, the endpoint caches access tokens in the local configuration file.

Globus Compute requires authentication in order to associate endpoints with users and to ensure that only the authorized user can run tasks on that endpoint. As part of this step, we request access to your identity and Globus Groups.

To get started, you will first want to configure a new endpoint:

$ globus-compute-endpoint configure

Once you’ve run this command, a directory will be created at $HOME/.globus_compute and a set of default configuration files will be generated.

You can also set up auto-completion for the globus-compute-endpoint commands in your shell, by using the command:

$ globus-compute-endpoint --install-completion [zsh bash fish ...]

Configuring an Endpoint#

Globus Compute endpoints act as gateways to diverse computational resources, including clusters, clouds, supercomputers, and even your laptop. To make the best use of your resources, the endpoint must be configured to match the capabilities of the resource on which it is deployed.

Globus Compute provides a Python class-based configuration model that allows you to specify the shape of the resources (number of nodes, number of cores per worker, walltime, etc.) as well as allowing you to place limits on how Globus Compute may scale the resources in response to changing workload demands.

To generate the appropriate directories and default configuration template, run the following command:

$ globus-compute-endpoint configure <ENDPOINT_NAME>

This command will create a profile for your endpoint in $HOME/.globus_compute/<ENDPOINT_NAME>/ and will instantiate a config.yaml file. This file should be updated with the appropriate configurations for the computational system you are targeting before you start the endpoint. Globus Compute is configured using a Config object. Globus Compute uses Parsl to manage resources. For more information, see the Config class documentation and the Parsl documentation.
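If you script against endpoint profiles, the location described above can be computed rather than hard-coded. This is a small sketch under the assumption that the directory layout is exactly `$HOME/.globus_compute/<ENDPOINT_NAME>/config.yaml`; the function name `endpoint_config_path` is hypothetical.

```python
from pathlib import Path


def endpoint_config_path(endpoint_name: str = "default") -> Path:
    """Return the expected location of an endpoint's config.yaml.

    Assumes the layout $HOME/.globus_compute/<ENDPOINT_NAME>/config.yaml.
    The file exists only after `globus-compute-endpoint configure` has run.
    """
    return Path.home() / ".globus_compute" / endpoint_name / "config.yaml"
```

Calling `endpoint_config_path("my_endpoint").exists()` is then a simple way to check whether a profile has already been configured.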

Note

If the ENDPOINT_NAME is not specified, a default endpoint named “default” is configured.

Starting an Endpoint#

To start a new endpoint run the following command:

$ globus-compute-endpoint start <ENDPOINT_NAME>

Note

If the ENDPOINT_NAME is not specified, a default endpoint named “default” is started.

Starting an endpoint will perform a registration process with Globus Compute. The registration process provides Globus Compute with information regarding the endpoint. The endpoint also establishes an outbound connection to RabbitMQ to retrieve tasks, send results, and communicate command information. Thus, the Globus Compute endpoint requires outbound access to the Globus Compute services over HTTPS (port 443) and AMQPS (port 5671).

Once started, the endpoint uses a daemon process to run in the background.

Note

If the endpoint was not stopped correctly previously (e.g., after a computer restart when the endpoint was running), the endpoint directory will be cleaned up to allow a fresh start.

Warning

Only the owner of an endpoint is authorized to start an endpoint. Thus if you register an endpoint using one identity and try to start an endpoint owned by another identity, it will fail.

To start an endpoint using a client identity, rather than as a user, you can export the FUNCX_SDK_CLIENT_ID and FUNCX_SDK_CLIENT_SECRET environment variables. This is explained in detail in Client Credentials with Clients.
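When starting an endpoint from a script or service, it can help to verify that both variables are present before launching. The sketch below only inspects the environment; the helper name `missing_client_credentials` is hypothetical, and it does not validate the credentials themselves.

```python
import os

# Variable names as given in the text above.
REQUIRED_VARS = ("FUNCX_SDK_CLIENT_ID", "FUNCX_SDK_CLIENT_SECRET")


def missing_client_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list means both variables are set; otherwise the returned names indicate what still needs to be exported.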

Stopping an Endpoint#

To stop an endpoint, run the following command:

$ globus-compute-endpoint stop <ENDPOINT_NAME>

If the endpoint is not running and was stopped correctly previously, this command does nothing.

If the endpoint is not running but was not stopped correctly previously (e.g., after a computer restart when the endpoint was running), this command will clean up the endpoint directory such that the endpoint can be started cleanly again.

Note

If the ENDPOINT_NAME is not specified, the default endpoint is stopped.

Listing Endpoints#

To list available endpoints on the current system, run:

$ globus-compute-endpoint list
+---------------+-------------+--------------------------------------+
| Endpoint Name |   Status    |             Endpoint ID              |
+===============+=============+======================================+
| default       | Active      | 1e999502-b434-49a2-a2e0-d925383d2dd4 |
+---------------+-------------+--------------------------------------+
| KNL_test      | Inactive    | 8c01d13c-cfc1-42d9-96d2-52c51784ea16 |
+---------------+-------------+--------------------------------------+
| gpu_cluster   | Initialized | None                                 |
+---------------+-------------+--------------------------------------+
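For scripting, the table above can be turned into a list of records. This is a minimal sketch that assumes the output keeps the ASCII layout shown here (rows starting with `|`, border lines starting with `+`); the layout may differ between versions, and the function name `parse_endpoint_table` is hypothetical.

```python
def parse_endpoint_table(text: str) -> list[dict[str, str]]:
    """Parse an ASCII table like the one printed by the list command."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines such as +----+ and +====+
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, *body = rows  # the first data row holds the column names
    return [dict(zip(header, row)) for row in body]


sample = """\
+---------------+-------------+--------------------------------------+
| Endpoint Name |   Status    |             Endpoint ID              |
+===============+=============+======================================+
| default       | Active      | 1e999502-b434-49a2-a2e0-d925383d2dd4 |
+---------------+-------------+--------------------------------------+
| gpu_cluster   | Initialized | None                                 |
+---------------+-------------+--------------------------------------+
"""
endpoints = parse_endpoint_table(sample)
```

Note that an unregistered endpoint's ID is parsed as the literal string `"None"`, not Python's `None`.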

Endpoints can be in the following states:

Initialized: The endpoint has been created, but not started following configuration and is not registered with the Globus Compute service.
Running: The endpoint is active and available for executing functions.
Stopped: The endpoint was stopped by the user. It is not running and therefore cannot service any functions. It can be started again without issues.
Disconnected: The endpoint disconnected unexpectedly. It is not running and therefore cannot service any functions. Starting this endpoint will first invoke necessary endpoint cleanup, since it was not stopped correctly previously.

Ensuring execution environment#

When running a function, endpoint worker processes expect to have all the necessary dependencies readily available to them. For example, if a function uses numpy to do some calculations, and a worker is running on a machine without numpy installed, any attempts to execute that function using that worker will result in an error.
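One way to catch a missing dependency early is to run a small probe in the worker environment before submitting real work. The following is a sketch only: the name `missing_modules` is hypothetical, and how you submit it to an endpoint is left to the SDK documentation. The import sits inside the function body so that the function is self-contained when shipped to a worker.

```python
def missing_modules(names):
    """Return the top-level module names that cannot be imported here."""
    import importlib.util  # imported inside so the function is self-contained

    # find_spec returns None when a top-level module cannot be located.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Running `missing_modules(["numpy"])` on a worker returns `["numpy"]` if numpy is not installed there, and an empty list otherwise.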

In HPC contexts, the endpoint process - which receives tasks from the Compute central services and queues them up for execution - generally runs on a separate node from the workers which actually do the computation. As a result, it’s often necessary to load in some kind of pre-initialized environment. In general there are two solutions here:

Python-based environment isolation, such as a conda environment or venv
Containers: containerization with Docker or Apptainer (Singularity)

Python based environment isolation#

To use Python-based environment management, use the worker_init config option:

engine:
  provider:
      worker_init: conda activate my-conda-env

The exact behavior of worker_init depends on the ExecutionProvider being used.

In some cases, it may also be helpful to run some setup during the startup process of the endpoint itself, before any workers start. This can be achieved using the top-level endpoint_setup config option:

endpoint_setup: |
  conda create -n my-conda-env
  conda activate my-conda-env
  pip install -r requirements.txt

Note that endpoint_setup is run by the system shell, as a child of the endpoint startup process.

Similarly, artifacts created by endpoint_setup can be cleaned up with endpoint_teardown:

endpoint_teardown: |
  conda remove -n my-conda-env --all

Containerized Environments#

Container support is limited to GlobusComputeEngine on the globus-compute-endpoint. To configure the endpoint the following options are now supported:

container_type : Specify container type from one of ('docker', 'apptainer', 'singularity', 'custom', 'None')
container_uri: Specify container uri, or file path if specifying sif files
container_cmd_options: Specify custom command options to pass to the container launch command, such as filesystem mount paths, network options etc.

display_name: Docker
engine:
  type: GlobusComputeEngine
  container_type: docker
  container_uri: funcx/kube-endpoint:main-3.10
  container_cmd_options: -v /tmp:/tmp
  provider:
    init_blocks: 1
    max_blocks: 1
    min_blocks: 0
    type: LocalProvider

For more custom use cases, where either an unsupported container technology is required or building the container string programmatically is preferred, use container_type='custom'. In this case, container_cmd_options is treated as a string template, in which the following two strings are expected:

{EXECUTOR_RUNDIR} : Used to specify mounting of the RUNDIR to share logs
{EXECUTOR_LAUNCH_CMD} : Used to specify the worker launch command within the container.

Here’s an example:

display_name: Docker Custom
engine:
  type: GlobusComputeEngine
  container_type: custom
  container_cmd_options: docker run -v {EXECUTOR_RUNDIR}:{EXECUTOR_RUNDIR} funcx/kube-endpoint:main-3.10 {EXECUTOR_LAUNCH_CMD}
  provider:
    init_blocks: 1
    max_blocks: 1
    min_blocks: 0
    type: LocalProvider
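To see what such a template expands to, the substitution can be imitated with Python's `str.format`. This is an illustration only: the endpoint's actual substitution mechanism may differ, and the values of `rundir` and `launch_cmd` below are hypothetical placeholders, not real paths or commands.

```python
# Same template string as in the configuration above.
template = (
    "docker run -v {EXECUTOR_RUNDIR}:{EXECUTOR_RUNDIR} "
    "funcx/kube-endpoint:main-3.10 {EXECUTOR_LAUNCH_CMD}"
)

# Hypothetical values, chosen only to make the substitution visible.
rundir = "/home/user/.globus_compute/default/rundir"
launch_cmd = "worker-launch-command --placeholder"

command = template.format(EXECUTOR_RUNDIR=rundir, EXECUTOR_LAUNCH_CMD=launch_cmd)
print(command)
```

The run directory is mounted at the same path inside the container, which is what allows logs written by the worker to appear on the host.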

Restarting endpoint when machine restarts#

To ensure that a compute endpoint comes back online when its host machine restarts, a systemd service can be configured to run the endpoint.

Ensure the endpoint isn’t running:

$ globus-compute-endpoint stop <my-endpoint-name>

Update the endpoint’s config.yaml to set detach_endpoint to false.

Create a service file at /etc/systemd/system/my-globus-compute-endpoint.service, and populate it with the following settings:

[Unit]
Description=Globus Compute Endpoint systemd service
After=network.target
StartLimitIntervalSec=0

[Service]
ExecStart=</full/path/to/globus-compute-endpoint/executable> start <my-endpoint-name>
User=<user_account>
Type=simple
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target

Enable the service and start the endpoint:

$ sudo systemctl enable my-globus-compute-endpoint.service --now

To edit an existing systemd service, make changes to the service file and then run the following:

$ sudo systemctl daemon-reload
$ sudo systemctl restart my-globus-compute-endpoint.service
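If you manage several endpoints, the service file can be generated instead of edited by hand. The sketch below fills in the same unit file shown above; the function name `render_unit` and the example path and user in the usage note are hypothetical, and writing the result to /etc/systemd/system still requires root privileges.

```python
UNIT_TEMPLATE = """\
[Unit]
Description=Globus Compute Endpoint systemd service
After=network.target
StartLimitIntervalSec=0

[Service]
ExecStart={executable} start {endpoint_name}
User={user}
Type=simple
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
"""


def render_unit(executable: str, endpoint_name: str, user: str) -> str:
    """Return the text of a systemd unit file for one endpoint."""
    if not executable.startswith("/"):
        # The service file above calls for the full path to the executable.
        raise ValueError("executable must be an absolute path")
    return UNIT_TEMPLATE.format(
        executable=executable, endpoint_name=endpoint_name, user=user
    )
```

For example, `render_unit("/home/alice/.local/bin/globus-compute-endpoint", "my-endpoint", "alice")` returns text that can be saved as the service file.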

AMQP Port#

Endpoints receive tasks and communicate task results via the AMQP messaging protocol. As of v2.11.0, newly configured endpoints use AMQP over port 443 by default, since firewall rules usually leave that port open. In case 443 is not open on a particular cluster, the port to use can be changed in the endpoint config via the amqp_port option, like so:

amqp_port: 5671
display_name: My Endpoint
engine: ...

Note that only ports 5671, 5672, and 443 are supported with the Compute hosted services. Also note that when amqp_port is omitted from the config, the port is based on the connection URL the endpoint receives after registering itself with the services, which typically means port 5671.
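A configuration can be checked against this restriction before the endpoint is started. The following sketch encodes only the port list stated above; the function name `check_amqp_port` is hypothetical, and `None` stands for an omitted amqp_port option.

```python
SUPPORTED_AMQP_PORTS = {443, 5671, 5672}


def check_amqp_port(port):
    """Return the port unchanged if supported; None means the option is omitted."""
    if port is None:
        return None  # the services decide the port after registration
    if port not in SUPPORTED_AMQP_PORTS:
        allowed = ", ".join(str(p) for p in sorted(SUPPORTED_AMQP_PORTS))
        raise ValueError(f"amqp_port {port} is not supported; use one of {allowed}")
    return port
```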

Example configurations#

Globus Compute has been used on various systems around the world. Below are example configurations for commonly used systems. If you would like to add your system to this list please contact the Globus Compute Team via Slack.

Note

All configuration examples below must be customized for the user’s allocation, Python environment, file system, etc.

Anvil (RCAC, Purdue)#

The following snippet shows an example configuration for executing remotely on Anvil, a supercomputer at Purdue University’s Rosen Center for Advanced Computing (RCAC). The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Anvil CPU
engine:
  type: HighThroughputEngine
  max_workers_per_node: 2

  address:
    type: address_by_interface
    ifname: ib0

  provider:
    type: SlurmProvider
    partition: debug

    account: {{ ACCOUNT }}
    launcher:
        type: SrunLauncher

    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --constraint=knl,quad,cache"
    scheduler_options: {{ OPTIONS }}

    # Command to be run before starting a worker
    # e.g., "module load anaconda; source activate gce_env"
    worker_init: {{ COMMAND }}

    init_blocks: 1
    max_blocks: 1
    min_blocks: 0

    walltime: 00:05:00

Blue Waters (NCSA)#

The following snippet shows an example configuration for executing remotely on Blue Waters, a supercomputer at the National Center for Supercomputing Applications. The configuration assumes the user is running on a login node, uses the TorqueProvider to interface with the scheduler, and uses the AprunLauncher to launch workers.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 1
    worker_debug: False
    address: 127.0.0.1

    provider:
        type: TorqueProvider
        queue: normal

        launcher:
            type: AprunLauncher
            overrides: -b -- bwpy-environ --

        # string to prepend to #PBS blocks in the submit script
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load bwpy; source activate compute_env"
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Delta (NCSA)#

The following snippet shows an example configuration for executing remotely on Delta, a supercomputer at the National Center for Supercomputing Applications. The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: NCSA Delta 2 CPU
engine:
    type: HighThroughputEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: eth6.560

    provider:
        type: SlurmProvider
        partition: cpu
        account: {{ ACCOUNT NAME }}

        launcher:
            type: SrunLauncher

        # Command to be run before starting a worker
        # e.g., "module load anaconda3; source activate gce_env"
        worker_init: {{ COMMAND }}

        init_blocks: 1
        min_blocks: 0
        max_blocks: 1

        walltime: 00:30:00

Expanse (SDSC)#

The following snippet shows an example configuration for executing remotely on Expanse, a supercomputer at the San Diego Supercomputer Center. The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: SDSC Expanse
engine:
    type: HighThroughputEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ib0

    provider:
        type: SlurmProvider
        partition: compute
        account: {{ ACCOUNT }}

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load anaconda3; source activate gce_env"
        worker_init: {{ COMMAND }}

        init_blocks: 1
        min_blocks: 0
        max_blocks: 1

        walltime: 00:05:00

UChicago AI Cluster#

The following snippet shows an example configuration for the University of Chicago’s AI Cluster. The configuration assumes the user is running on a login node and uses the SlurmProvider to interface with the scheduler and launch onto the GPUs.

engine:
    type: HighThroughputEngine
    label: fe.cs.uchicago
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ens2f1

    provider:
        type: SlurmProvider
        partition: general

        # This is a hack. We use hostname ; to terminate the srun command, and
        # start our own
        launcher:
            type: SrunLauncher
            overrides: >
                hostname; srun --ntasks={{ TOTAL_WORKERS }}
                --ntasks-per-node={{ WORKERS_PER_NODE }}
                --gpus-per-task=rtx2080ti:{{ GPUS_PER_WORKER }}
                --gpu-bind=map_gpu:{{ GPU_MAP }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: {{ NODES_PER_JOB }}
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Here is some Python that demonstrates how to compute the variables in the YAML example above:

# With the example values below: 2 workers per node, each bound to 2 GPUs
# Modify before use
NODES_PER_JOB = 2
GPUS_PER_NODE = 4
GPUS_PER_WORKER = 2

# DO NOT MODIFY
TOTAL_WORKERS = int((NODES_PER_JOB * GPUS_PER_NODE) / GPUS_PER_WORKER)
WORKERS_PER_NODE = int(GPUS_PER_NODE / GPUS_PER_WORKER)
GPU_MAP = ",".join([str(x) for x in range(1, TOTAL_WORKERS + 1)])

Midway (RCC, UChicago)#

The Midway cluster is a campus cluster hosted by the Research Computing Center at the University of Chicago. The snippet below shows an example configuration for executing remotely on Midway. The configuration assumes the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: bond0

    provider:
        type: SlurmProvider
        partition: broadwl

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., module load Anaconda; source activate parsl_env
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00
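The scaling limits in a configuration like the one above can be turned into concrete numbers. This is a back-of-the-envelope sketch, assuming each block is one scheduler job spanning nodes_per_block nodes and each node runs at most max_workers_per_node workers; the function names are hypothetical.

```python
def walltime_seconds(walltime: str) -> int:
    """Convert an HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def peak_workers(max_blocks: int, nodes_per_block: int, max_workers_per_node: int) -> int:
    """Upper bound on concurrent workers under the stated assumptions."""
    return max_blocks * nodes_per_block * max_workers_per_node


# Values taken from the Midway configuration above.
print(peak_workers(max_blocks=1, nodes_per_block=2, max_workers_per_node=2))
print(walltime_seconds("00:30:00"))
```

With these values, at most 4 workers run at once, and each block is held for up to 1800 seconds.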

The following configuration is an example of using a Singularity container on Midway.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 10

    address:
        type: address_by_interface
        ifname: bond0

    scheduler_mode: soft
    worker_mode: singularity_reuse
    container_type: singularity
    container_cmd_options: -H /home/$USER

    provider:
        type: SlurmProvider
        partition: broadwl

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # eg: "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Kubernetes Clusters#

Kubernetes is an open-source system for container management, such as automating deployment and scaling of containers. The snippet below shows an example configuration for deploying pods as workers on a Kubernetes cluster. The KubernetesProvider exploits the Python Kubernetes API, which assumes that you have a kube config in ~/.kube/config.

heartbeat_period: 15
heartbeat_threshold: 200
log_dir: "."

engine:
    type: HighThroughputEngine
    label: Kubernetes_funcX
    max_workers_per_node: 1

    address:
      type: address_by_route

    scheduler_mode: hard
    container_type: docker

    strategy:
        type: KubeSimpleStrategy
        max_idletime: 3600

    provider:
        type: KubernetesProvider
        init_blocks: 0
        min_blocks: 0
        max_blocks: 2
        init_cpu: 1
        max_cpu: 4
        init_mem: 1024Mi
        max_mem: 4096Mi

        # e.g., python:3.8-buster
        image: {{ IMAGE }}

        # e.g., "pip install --force-reinstall 'globus_compute_endpoint>=2.0.1'"
        worker_init: {{ COMMAND }}

        # e.g., default
        namespace: {{ NAMESPACE }}

        incluster_config: False

Theta (ALCF)#

The following snippet shows an example configuration for executing on Argonne Leadership Computing Facility’s Theta supercomputer. This example uses the HighThroughputEngine and connects to Theta’s Cobalt scheduler using the CobaltProvider. This configuration assumes that the script is being executed on the login nodes of Theta.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 1
    worker_debug: False

    address:
        type: address_by_interface
        ifname: vlan2360

    provider:
        type: CobaltProvider
        queue: debug-flat-quad

        # Specify the account/allocation to which jobs should be charged
        account: {{ YOUR_THETA_ALLOCATION }}

        launcher:
            type: AprunLauncher
            overrides: -d 64

        # string to prepend to #COBALT blocks in the submit
        # script to the scheduler
        # eg: "#COBALT -t 50"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate compute_env"
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

The following configuration is an example of using a Singularity container on Theta.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 1
    worker_debug: False

    address:
        type: address_by_interface
        ifname: vlan2360

    scheduler_mode: soft
    worker_mode: singularity_reuse
    container_type: singularity
    container_cmd_options: -H /home/$USER

    provider:
        type: CobaltProvider
        queue: debug-flat-quad

        # Specify the account/allocation to which jobs should be charged
        account: {{ YOUR_THETA_ALLOCATION }}

        launcher:
            type: AprunLauncher
            overrides: -d 64

        # string to prepend to #COBALT blocks in the submit
        # script to the scheduler
        # eg: "#COBALT -t 50"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate compute_env"
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Cooley (ALCF)#

The following snippet shows an example configuration for executing on Argonne Leadership Computing Facility’s Cooley cluster. This example uses the HighThroughputEngine and connects to Cooley’s Cobalt scheduler using the CobaltProvider. This configuration assumes that the script is being executed on the login nodes of Cooley.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ib0

    provider:
        type: CobaltProvider
        queue: default
        account: {{ YOUR_COOLEY_ALLOCATION }}

        launcher:
            type: MpiExecLauncher

        # string to prepend to #COBALT blocks in the submit
        # script to the scheduler
        # eg: "#COBALT -t 50"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate compute_env"
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Polaris (ALCF)#

The following snippet shows an example configuration for executing on Argonne Leadership Computing Facility’s Polaris cluster. This example uses the HighThroughputEngine and connects to Polaris’s PBS scheduler using the PBSProProvider. This configuration assumes that the script is being executed on the login node of Polaris.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 1

    # Un-comment to give each worker exclusive access to a single GPU
    # available_accelerators: 4

    strategy:
        type: SimpleStrategy
        max_idletime: 300

    address:
        type: address_by_interface
        ifname: bond0

    provider:
        type: PBSProProvider

        launcher:
            type: MpiExecLauncher
            # Ensures 1 manager per node, work on all 64 cores
            bind_cmd: --cpu-bind
            overrides: --depth=64 --ppn 1

        account: {{ YOUR_POLARIS_ACCOUNT }}
        queue: preemptable
        cpus_per_node: 32
        select_options: ngpus=4

        # e.g., "#PBS -l filesystems=home:grand:eagle\n#PBS -k doe"
        scheduler_options: "#PBS -l filesystems=home:grand:eagle"

        # Node setup: activate necessary conda environment and such
        worker_init: {{ COMMAND }}

        walltime: 01:00:00
        nodes_per_block: 1
        init_blocks: 0
        min_blocks: 0
        max_blocks: 2

Perlmutter (NERSC)#

The following snippet shows an example configuration for accessing NERSC’s Perlmutter supercomputer. This example uses the HighThroughputEngine and connects to Perlmutter’s Slurm scheduler. It is configured to request 2 nodes configured with 1 TaskBlock per node. Finally, it includes override information to request a particular node type (GPU) and to configure a specific Python environment on the worker nodes using Anaconda.

engine:
    type: HighThroughputEngine
    worker_debug: False

    strategy:
        type: SimpleStrategy
        max_idletime: 300

    address:
        type: address_by_interface
        ifname: hsn0

    provider:
        type: SlurmProvider

        # We request all hyperthreads on a node.
        # GPU nodes have 128 threads, CPU nodes have 256 threads
        launcher:
            type: SrunLauncher
            overrides: -c 128

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # For GPUs in the debug qos eg: "#SBATCH --constraint=gpu -q debug"
        scheduler_options: {{ OPTIONS }}
        # Your NERSC account, eg: "m0000"
        account: {{ NERSC_ACCOUNT }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        # increase the command timeouts
        cmd_timeout: 120

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 10 minutes
        walltime: 00:10:00

Frontera (TACC)#

The following snippet shows an example configuration for accessing the Frontera system at TACC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
      type: address_by_interface
      ifname: ib0

    provider:
        type: SlurmProvider

        # e.g., EAR22001
        account: {{ YOUR_FRONTERA_ACCOUNT }}

        # e.g., development
        partition: {{ PARTITION }}

        launcher:
            type: SrunLauncher

        # Enter scheduler_options if needed
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        # Add extra time for slow scheduler responses
        cmd_timeout: 60

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Bebop (LCRC, ANL)#

The following snippet shows an example configuration for accessing the Bebop system at Argonne’s LCRC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 1
    worker_debug: False

    address:
        type: address_by_interface
        ifname: ib0

    provider:
        type: SlurmProvider
        partition: {{ PARTITION }}  # e.g., bdws
        launcher:
          type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # eg: '#SBATCH --constraint=knl,quad,cache'
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., "module load Anaconda; source activate parsl_env"
        worker_init: {{ COMMAND }}

        nodes_per_block: 1
        init_blocks: 1
        min_blocks: 0
        max_blocks: 1
        walltime: 00:30:00

Bridges-2 (PSC)#

The following snippet shows an example configuration for accessing the Bridges-2 system at PSC. The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 2
    worker_debug: False

    address:
      type: address_by_interface
      ifname: ens3f0

    provider:
        type: SlurmProvider
        partition: RM

        launcher:
            type: SrunLauncher

        # string to prepend to #SBATCH blocks in the submit
        # script to the scheduler
        # e.g., "#SBATCH --constraint=knl,quad,cache"
        scheduler_options: {{ OPTIONS }}

        # Command to be run before starting a worker
        # e.g., module load Anaconda; source activate parsl_env
        worker_init: {{ COMMAND }}

        # Scale between 0-1 blocks with 2 nodes per block
        nodes_per_block: 2
        init_blocks: 1
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 30 minutes
        walltime: 00:30:00

Stampede2 (TACC)#

The following snippet shows an example configuration for accessing the Stampede2 system at the Texas Advanced Computing Center (TACC). The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

display_name: Stampede2.TACC.batch

engine:
  type: HighThroughputEngine
  address:
    type: address_by_interface
    ifname: em3

  max_workers_per_node: 2

  provider:
    type: SlurmProvider
    partition: development

    launcher:
        type: SrunLauncher

    # string to prepend to #SBATCH blocks in the submit
    # script to the scheduler
    # e.g., "#SBATCH --constraint=knl,quad,cache"
    scheduler_options: {{ OPTIONS }}

    # Command to be run before starting a worker
    # e.g., module load Anaconda; source activate parsl_env
    # e.g., "source ~/anaconda3/bin/activate; conda activate gce_py3.9"
    worker_init: {{ COMMAND }}

    # Scale between 0-1 blocks with 2 nodes per block
    nodes_per_block: 2
    init_blocks: 1
    max_blocks: 1
    min_blocks: 0

    # Blocks are provisioned for a maximum of 10 minutes
    walltime: 00:10:00

FASTER (TAMU)#

The following snippet shows an example configuration for accessing the FASTER system at Texas A&M University (TAMU). The configuration below assumes that the user is running on a login node, uses the SlurmProvider to interface with the scheduler, and uses the SrunLauncher to launch workers.

engine:
    type: HighThroughputEngine
    worker_debug: False

    strategy:
        type: SimpleStrategy
        max_idletime: 90

    address:
        type: address_by_interface
        ifname: eno8303

    provider:
        type: SlurmProvider
        partition: cpu
        mem_per_node: 128

        launcher:
            type: SrunLauncher

        scheduler_options: {{ OPTIONS }}

        worker_init: {{ COMMAND }}

        # increase the command timeouts
        cmd_timeout: 120

        # Scale between 0-1 blocks with 1 node per block
        nodes_per_block: 1
        init_blocks: 0
        min_blocks: 0
        max_blocks: 1

        # Hold blocks for 10 minutes
        walltime: 00:10:00

Pinning Workers to devices#

Many modern clusters provide multiple accelerators per compute node, yet many applications are best suited to using a single accelerator per task. Globus Compute supports pinning each worker to a different accelerator using the available_accelerators option of the HighThroughputEngine. Provide either the number of accelerators (Globus Compute will assume they are named with integers starting from zero) or a list of the names of the accelerators available on the node. Each Globus Compute worker will have the following environment variables set to the worker-specific identity assigned: CUDA_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, SYCL_DEVICE_FILTER.

engine:
    type: HighThroughputEngine
    max_workers_per_node: 4

    # `available_accelerators` may be a natural number or a list of strings.
    # If an integer, then each worker launched will have an automatically
    # generated environment variable. In this case, one of 0, 1, 2, or 3.
    # Alternatively, specific strings may be utilized.
    available_accelerators: 4
    # available_accelerators: ["opencl:gpu:1", "opencl:gpu:2"]  # alternative

    provider:
        type: LocalProvider
        init_blocks: 1
        min_blocks: 0
        max_blocks: 1
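The two accepted forms of available_accelerators, and the way a function can discover its assignment, can be sketched in Python. This is an illustration under the description above, not the engine's implementation; both function names are hypothetical.

```python
import os


def accelerator_names(available):
    """Expand an available_accelerators value into a list of names.

    An integer N is read as the names "0" .. "N-1"; a list is used as given.
    """
    if isinstance(available, int):
        return [str(index) for index in range(available)]
    return [str(name) for name in available]


def assigned_accelerator(env=None):
    """Return the accelerator a worker was pinned to, or None if unset."""
    env = os.environ if env is None else env
    return env.get("CUDA_VISIBLE_DEVICES")
```

A function running on a pinned worker can call `assigned_accelerator()` to see which device it was given; the same lookup applies to ROCR_VISIBLE_DEVICES and SYCL_DEVICE_FILTER.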