@erezzarum erezzarum commented Oct 24, 2025

  • Custom aws-ofi-nccl support
  • Cleanup LD_LIBRARY_PATH and PATH in favor of /etc/ld.so.conf.d

Issue #, if available:

Description of changes:
Even though we pass an AWS OFI NCCL version at build time, it was never used, so there was no way to test a custom version of AWS OFI NCCL.
I have added support for building a custom AWS OFI NCCL version. It is not the default: following best practice, the version provided by the EFA installer remains the default.
A user who wants the custom build can select it by passing LD_LIBRARY_PATH=/opt/aws-ofi-nccl/build/lib
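
To check which copy of a plugin library a given LD_LIBRARY_PATH would select, a rough sketch along these lines may help. `resolve_lib` is a hypothetical helper and is not part of the image. It only approximates the dynamic loader's search order (LD_LIBRARY_PATH entries first, then the ldconfig cache) and ignores RPATH/RUNPATH and other loader rules.

```shell
#!/bin/sh
# resolve_lib NAME: print the first path where NAME is found, looking in
# LD_LIBRARY_PATH entries first and then in the ldconfig cache.
# Hypothetical helper; a simplification of the real loader search order.
resolve_lib() (
  name=$1
  set -f          # no globbing while splitting the path list
  IFS=:           # split LD_LIBRARY_PATH on colons (subshell only)
  for dir in $LD_LIBRARY_PATH; do
    # Empty entries (which the loader treats as the current directory) are skipped here.
    if [ -n "$dir" ] && [ -e "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      exit 0
    fi
  done
  # Fall back to the system-wide cache built from /etc/ld.so.conf.d
  ldconfig -p 2>/dev/null | awk -v n="$name" '$1 == n { print $NF; exit }'
)

resolve_lib libnccl-net.so
```

With LD_LIBRARY_PATH=/opt/aws-ofi-nccl/build/lib set, the helper should report the custom build; with it unset, it should report whatever the ldconfig cache provides.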

I cleaned up the places where we set LD_LIBRARY_PATH aggressively for no reason and registered the needed library directories in /etc/ld.so.conf.d instead.
The main reason is that many of our EKS examples pass LD_LIBRARY_PATH=<PATH>:$LD_LIBRARY_PATH in the env field, and the $LD_LIBRARY_PATH reference is not interpolated as expected. It is best to provide the default libraries system-wide (/etc/ld.so.conf.d) and let users pass any customization directly through LD_LIBRARY_PATH, without relying on the previous value of this environment variable.
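
The interpolation problem can be reproduced outside Kubernetes with a small sketch. Single quotes stand in for the env field, which hands the string to the container verbatim. The value /usr/local/lib and the helper `split_path` are made up for illustration.

```shell
#!/bin/sh
# Single quotes mimic a Kubernetes env value: no shell ever expands it.
k8s_env_value='/opt/aws-ofi-nccl/build/lib:$LD_LIBRARY_PATH'

# For comparison, a shell expands the reference before the process starts.
LD_LIBRARY_PATH=/usr/local/lib   # hypothetical value baked into the image
shell_value="/opt/aws-ofi-nccl/build/lib:$LD_LIBRARY_PATH"

# split_path VALUE: print one search-path entry per line (hypothetical helper)
split_path() { printf '%s\n' "$1" | tr ':' '\n'; }

echo "entries from the Kubernetes env field:"
split_path "$k8s_env_value"
echo "entries after shell expansion:"
split_path "$shell_value"
```

In the first case the second entry is the literal text `$LD_LIBRARY_PATH`, which is not a useful directory, so the image's default path is lost. As far as I understand, Kubernetes only expands the `$(VAR)` form, and only for variables defined earlier in the same env list, so it does not help here either.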

When we need the NCCL tuner plugin, we can set the environment variable NCCL_TUNER_PLUGIN=ofi instead of setting a direct path. NCCL then finds the tuner plugin that belongs to the aws-ofi-nccl library in use.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-tuner-plugin

$ ls -ll /opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/
total 332
-rw-r--r--. 1 root root 339744 Aug 12 23:37 libnccl-net-ofi.so
lrwxrwxrwx. 1 root root     18 Aug 12 23:37 libnccl-net.so -> libnccl-net-ofi.so
lrwxrwxrwx. 1 root root     18 Aug 12 23:37 libnccl-ofi-tuner.so -> libnccl-net-ofi.so
lrwxrwxrwx. 1 root root     18 Aug 12 23:37 libnccl-tuner-ofi.so -> libnccl-net-ofi.so

$ ls -ll /opt/aws-ofi-nccl/build/lib
total 1920
drwxr-xr-x. 2 root root  16384 Oct 24 13:24 ./
drwxr-xr-x. 4 root root     30 Oct 24 13:24 ../
-rwxr-xr-x. 1 root root   1077 Oct 24 13:24 libnccl-net-ofi.la*
-rwxr-xr-x. 1 root root 480272 Oct 24 13:24 libnccl-net-ofi.so*
-rwxr-xr-x. 1 root root   1053 Oct 24 13:24 libnccl-net.la*
-rwxr-xr-x. 1 root root 480272 Oct 24 13:24 libnccl-net.so*
-rwxr-xr-x. 1 root root   1089 Oct 24 13:24 libnccl-ofi-tuner.la*
-rwxr-xr-x. 1 root root 480272 Oct 24 13:24 libnccl-ofi-tuner.so*
-rwxr-xr-x. 1 root root   1089 Oct 24 13:24 libnccl-tuner-ofi.la*
-rwxr-xr-x. 1 root root 480272 Oct 24 13:24 libnccl-tuner-ofi.so*
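
The file names in these listings line up with how NCCL_TUNER_PLUGIN is resolved. The sketch below prints the names NCCL is expected to try for a given value, based on my reading of the NCCL documentation linked above; the exact order may differ between NCCL versions, and `tuner_candidates` is a hypothetical helper.

```shell
#!/bin/sh
# tuner_candidates VALUE: print the library names NCCL is expected to try,
# in order, for NCCL_TUNER_PLUGIN=VALUE (assumption based on the NCCL docs).
tuner_candidates() {
  value=$1
  if [ -z "$value" ]; then
    # Variable unset: NCCL looks for the generic tuner name.
    printf '%s\n' "libnccl-tuner.so"
  else
    # Variable set: the value itself as a library name, then the suffix form.
    printf '%s\n' "$value" "libnccl-tuner-$value.so"
  fi
}

tuner_candidates ofi
```

For the value ofi, the second candidate is libnccl-tuner-ofi.so, which is present in both directories listed above, so whichever directory the loader searches first decides which build supplies the tuner.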

Example of an MPIJob that uses the custom-built AWS OFI NCCL:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-tests
spec:
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 20
  slotsPerWorker: 8
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            mpi/name: test
        spec:
          containers:
          - image: nccl-tests:cuda12.8.1-efa1.43.3-ofiv1.17.0-ncclv2.28.7-1-testsv2.17.2
            imagePullPolicy: IfNotPresent
            name: test-nccl-launcher
            env:
              - name: LD_LIBRARY_PATH
                value: /opt/aws-ofi-nccl/build/lib
            command:
            - /opt/amazon/openmpi/bin/mpirun
            - --allow-run-as-root
            - --tag-output
            - -np
            - "16"
            - -N
            - "8"
            - --bind-to
            - none
            - -x
            - PATH
            - -x
            - LD_LIBRARY_PATH
            - -x
            - FI_PROVIDER=efa
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - NCCL_TUNER_PLUGIN=ofi
            - --mca
            - pml
            - ^ucx
            - --mca
            - btl
            - tcp,self
            - --mca
            - btl_tcp_if_exclude
            - lo,docker0,veth_def_agent
            - /opt/nccl-tests/build/all_reduce_perf
            - -b
            - "8"
            - -e
            - "10G"
            - -f
            - "2"
            - -c
            - "1"
            - -g
            - "1"
            - -n
            - "100"
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi/name: test
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - topologyKey: topology.kubernetes.io/zone
                  labelSelector:
                    matchLabels:
                      mpi/name: test
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchLabels:
                      mpi/name: test
                  topologyKey: "kubernetes.io/hostname"
          nodeSelector:
            node.kubernetes.io/instance-type: "p5.48xlarge"
          containers:
          - image: nccl-tests:cuda12.8.1-efa1.43.3-ofiv1.17.0-ncclv2.28.7-1-testsv2.17.2-2
            imagePullPolicy: IfNotPresent
            name: nccl-tests-worker
            volumeMounts:
            - name: shmem
              mountPath: /dev/shm
            resources:
              limits:
                nvidia.com/gpu: 8
                vpc.amazonaws.com/efa: 32
              requests:
                nvidia.com/gpu: 8
                vpc.amazonaws.com/efa: 32
          volumes:
          - name: shmem
            hostPath:
              path: /dev/shm

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

nghtm commented Oct 31, 2025

Thanks Erez for the PR, assigning @pbelevich and @amanshanbhag to review.

nghtm commented Oct 31, 2025

Confirmed a successful 2-node NCCL test using the launcher script with the new container build.

p5en-dy-gpu-3:17295:17295 [7] NCCL INFO NCCL_NVLSTREE_MAX_CHUNKSIZE set by environment to 524288.
           8             2     float     sum      -1    57.02    0.00    0.00      0    53.50    0.00    0.00      0
          16             4     float     sum      -1    52.30    0.00    0.00      0    56.32    0.00    0.00      0
          32             8     float     sum      -1    47.35    0.00    0.00      0    47.69    0.00    0.00      0
          64            16     float     sum      -1    47.76    0.00    0.00      0    47.42    0.00    0.00      0
         128            32     float     sum      -1    47.93    0.00    0.01      0    48.20    0.00    0.00      0
         256            64     float     sum      -1    58.54    0.00    0.01      0    48.26    0.01    0.01      0
         512           128     float     sum      -1    50.06    0.01    0.02      0    49.00    0.01    0.02      0
        1024           256     float     sum      -1    49.82    0.02    0.04      0    49.79    0.02    0.04      0
        2048           512     float     sum      -1    52.43    0.04    0.07      0    51.80    0.04    0.07      0
        4096          1024     float     sum      -1    57.70    0.07    0.13      0    57.27    0.07    0.13      0
        8192          2048     float     sum      -1    59.02    0.14    0.26      0    57.79    0.14    0.27      0
       16384          4096     float     sum      -1    60.26    0.27    0.51      0    59.16    0.28    0.52      0
       32768          8192     float     sum      -1    61.44    0.53    1.00      0    61.13    0.54    1.01      0
       65536         16384     float     sum      -1    61.81    1.06    1.99      0    62.82    1.04    1.96      0
      131072         32768     float     sum      -1    67.47    1.94    3.64      0    67.18    1.95    3.66      0
      262144         65536     float     sum      -1    76.10    3.44    6.46      0    75.77    3.46    6.49      0
      524288        131072     float     sum      -1    81.64    6.42   12.04      0    81.19    6.46   12.11      0
     1048576        262144     float     sum      -1    91.26   11.49   21.54      0    90.82   11.55   21.65      0
     2097152        524288     float     sum      -1    110.2   19.03   35.68      0    110.3   19.00   35.63      0
     4194304       1048576     float     sum      -1    127.2   32.96   61.81      0    127.1   33.00   61.88      0
     8388608       2097152     float     sum      -1    164.7   50.94   95.51      0    163.5   51.31   96.21      0
    16777216       4194304     float     sum      -1    247.8   67.71  126.96      0    246.5   68.07  127.64      0
    33554432       8388608     float     sum      -1    386.1   86.90  162.93      0    324.4  103.43  193.94      0
    67108864      16777216     float     sum      -1    496.6  135.15  253.40      0    498.6  134.60  252.37      0
   134217728      33554432     float     sum      -1    770.1  174.29  326.79      0    771.2  174.05  326.33      0
   268435456      67108864     float     sum      -1   1320.2  203.32  381.23      0   1310.8  204.79  383.99      0
   536870912     134217728     float     sum      -1   2476.4  216.80  406.50      0   2478.6  216.60  406.13      0
  1073741824     268435456     float     sum      -1   4570.6  234.92  440.48      0   4578.4  234.52  439.73      0
  2147483648     536870912     float     sum      -1   9102.0  235.93  442.38      0   9086.3  236.34  443.14      0
  4294967296    1073741824     float     sum      -1    17397  246.88  462.89      0    17403  246.80  462.75      0
  8589934592    2147483648     float     sum      -1    33733  254.64  477.45      0    33707  254.84  477.83      0
 17179869184    4294967296     float     sum      -1    66129  259.79  487.11      0    66059  260.07  487.63      0
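
As a sanity check on these numbers: for all-reduce, nccl-tests derives bus bandwidth from algorithm bandwidth as busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks. The sketch below assumes the run was all_reduce_perf on 16 ranks (2 nodes with 8 GPUs each) and that the seventh and eighth columns are out-of-place algbw and busbw; none of this is stated in the comment, but it is consistent with the table. `busbw_allreduce` is a hypothetical helper.

```shell
#!/bin/sh
# busbw_allreduce ALGBW RANKS: all-reduce bus bandwidth from algorithm bandwidth.
# Assumes the nccl-tests all-reduce correction factor 2 * (n - 1) / n.
busbw_allreduce() {
  awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

# Last row of the table: algbw 259.79 GB/s on an assumed 16 ranks
busbw_allreduce 259.79 16   # prints 487.11
```

The result matches the 487.11 reported in the last row, which supports the 16-rank reading of the output.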

@nghtm nghtm self-requested a review October 31, 2025 15:52
@nghtm nghtm left a comment

I've tested successfully with our Slurm launcher and the new container build, thank you for the PR. I will approve; however, before merging, could you kindly take a look at the following launcher manifest for Kubernetes and see if anything should be updated to accommodate the LD_LIBRARY_PATH changes you have made in the Dockerfile?

https://github.com/erezzarum/awsome-distributed-training/blob/nccl-tests-aws-ofi-nccl/micro-benchmarks/nccl-tests/kubernetes/nccl-tests.yaml

@erezzarum
Author

> I've tested successfully with our Slurm launcher and the new container build, thank you for the PR. I will approve; however, before merging, could you kindly take a look at the following launcher manifest for Kubernetes and see if anything should be updated to accommodate the LD_LIBRARY_PATH changes you have made in the Dockerfile?
>
> https://github.com/erezzarum/awsome-distributed-training/blob/nccl-tests-aws-ofi-nccl/micro-benchmarks/nccl-tests/kubernetes/nccl-tests.yaml

There are changes needed, mostly cleaning up unused variables that are set either by the container image or by the NCCL tuner.
Would you like me to update it via this PR or open a new one?
