From e5db782866eddd0b4d5fb4e2e42fe0c4b770db8a Mon Sep 17 00:00:00 2001
From: Aravind Neelakantan
Date: Tue, 4 Nov 2025 10:16:37 -0600
Subject: [PATCH 1/2] Updated docker build command with the correct context

---
 3.test_cases/pytorch/FSDP/slurm/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/3.test_cases/pytorch/FSDP/slurm/README.md b/3.test_cases/pytorch/FSDP/slurm/README.md
index 96c1654d3..55dae9285 100644
--- a/3.test_cases/pytorch/FSDP/slurm/README.md
+++ b/3.test_cases/pytorch/FSDP/slurm/README.md
@@ -33,7 +33,7 @@ You will first build the container image with the command below:


 ```bash
-docker build -f ../Dockerfile -t fsdp:pytorch2.7.1 .
+docker build -f ../Dockerfile -t fsdp:pytorch2.7.1 ../.
 ```

 You will then convert the container image to a squash file via Enroot:

From 004da4158deda9327f5851123c12b2570f76a0fc Mon Sep 17 00:00:00 2001
From: Aravind Neelakantan
Date: Tue, 4 Nov 2025 10:22:26 -0600
Subject: [PATCH 2/2] Removed incomplete line - issue #813

---
 3.test_cases/pytorch/FSDP/slurm/README.md | 2 --
 1 file changed, 2 deletions(-)

diff --git a/3.test_cases/pytorch/FSDP/slurm/README.md b/3.test_cases/pytorch/FSDP/slurm/README.md
index 55dae9285..dba9123ff 100644
--- a/3.test_cases/pytorch/FSDP/slurm/README.md
+++ b/3.test_cases/pytorch/FSDP/slurm/README.md
@@ -71,8 +71,6 @@ Also, under `User Variables` make sure to adjust `GPUS_PER_NODE` to match the nu

 You can also adjust the training parameters in `TRAINING_ARGS` (for example, to increase batch size). Additional parameters can be found in `model/arguments.py`. Note that we use the same directory for both `--checkpoint_dir` and `--resume_from_checkpoint`. If there are multiple checkpoints, `--resume_from_checkpoint` will automatically select the most recent one. This way if our training is interupted for any reason, it will automatically pick up the most recent checkpoint.

-If you are using a container image, you need to uncomment the line below in the
-
 ### Llama 3.1 8B training

 To launch your training for Llama 3.1 8B, run