nccl-tests/Dockerfile - Add support for custom aws-ofi-nccl & cleanup #881
base: main
* Custom aws-ofi-nccl support
* Cleanup LD_LIBRARY_PATH and PATH in favor of /etc/ld.so.conf.d
Thanks Erez for the PR, assigning @pbelevich and @amanshanbhag to review.
Confirmed a successful 2-node NCCL test using the launcher script with the new container build.
I've tested successfully with our Slurm launcher and the new container build; thank you for the PR. I will approve. However, before merging, could you kindly take a look at the following launcher manifest for Kubernetes and see whether anything should be updated to accommodate the new LD_LIBRARY_PATH changes you have made in the Dockerfile?
There are changes needed, mostly cleaning up unused variables that are set either by the container image or by the NCCL tuner.
Issue #, if available:
Description of changes:
Even though we pass an AWS OFI NCCL version during the build, it was never used, so there was no way to test a custom version of AWS OFI NCCL.
I have added support for building a custom AWS OFI NCCL version. It is not the default: the one provided by the EFA installer remains the default, as a best practice.
A user who wants to use the custom-built one can do so by passing
`LD_LIBRARY_PATH=/opt/aws-ofi-nccl/build/lib`
I also cleaned up the places where we set `LD_LIBRARY_PATH` aggressively for no reason, and moved the needed library paths into `/etc/ld.so.conf.d`.
The main reason is that many EKS-related examples pass `LD_LIBRARY_PATH=<PATH>:$LD_LIBRARY_PATH` as an `env` field, where `$LD_LIBRARY_PATH` does not get interpolated as expected. It is therefore best to provide the default libraries system-wide (through `/etc/ld.so.conf.d`) and to pass any customization the user wants directly through `LD_LIBRARY_PATH`, without relying on previous values of this variable.
When we need the NCCL tuner plugin, we can set the environment variable `NCCL_TUNER_PLUGIN=ofi` instead of a direct path; this finds the tuner plugin that corresponds to the aws-ofi-nccl library we use.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-tuner-plugin
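As a rough illustration of the settings described above, here is a minimal sketch of a container `env` fragment for a Kubernetes manifest. It is not the manifest from this PR: only the `LD_LIBRARY_PATH` value and `NCCL_TUNER_PLUGIN=ofi` come from the description, and the surrounding layout is an assumed, generic container spec.

```yaml
# Hypothetical container env fragment; not the manifest from this PR.
env:
  # Pattern to avoid: Kubernetes does not shell-expand "$LD_LIBRARY_PATH",
  # so the literal string would end up in the variable.
  # - name: LD_LIBRARY_PATH
  #   value: "<PATH>:$LD_LIBRARY_PATH"

  # Opt in to the custom-built AWS OFI NCCL with a self-contained value.
  - name: LD_LIBRARY_PATH
    value: /opt/aws-ofi-nccl/build/lib
  # Select the tuner plugin by name instead of by a direct path.
  - name: NCCL_TUNER_PLUGIN
    value: ofi
```

Omitting the `LD_LIBRARY_PATH` entry should leave the EFA installer provided library in effect, since the defaults are resolved through `/etc/ld.so.conf.d`.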
Example of MPIJob that uses the custom built AWS OFI NCCL
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.