Best Practices for Multi-Node Training on ABCI with PyTorch
This article summarizes a simple method for running distributed training on ABCI, the GPU cloud computing service operated by AIST. The following repository provides a simple example of training code for ABCI with support for multi-node training.
https://github.com/yukara-ikemiya/abci-code-sample
Let’s use ABCI for training large-scale models
AI Bridging Cloud Infrastructure (ABCI) is the world’s first large-scale Open AI Computing Infrastructure, constructed and operated by National Institute of Advanced Industrial Science and Technology (AIST). (https://abci.ai/en/about_abci/)
Create a Python environment
ABCI supports Singularity containers for building training environments. There are several ways to build a Singularity image file (SIF); in my repository, scripts are provided that first create a Docker image and then convert it into a SIF file.
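As a rough sketch, the Docker-to-SIF conversion boils down to a few standard commands like the following (the image and file names here are placeholders, and the actual scripts in the repository may differ):

# build a Docker image from your Dockerfile
docker build -t my-training-env:latest .

# export the image to a tar archive
docker save -o my-training-env.tar my-training-env:latest

# convert the archive into a Singularity image file (SIF)
singularity build my-training-env.sif docker-archive://my-training-env.tar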
Once the SIF file is created, move it to any directory on ABCI and use it to run your training.
Prepare training codes
When conducting distributed training with PyTorch, it is common to write the training code using DistributedDataParallel (DDP). Here, however, I recommend HuggingFace Accelerate, which wraps DDP and allows for a more straightforward implementation. This not only simplifies the code but also helps prevent potential bugs.
Model (a minimal auto-encoder)
import torch
from torch import nn


class SimpleModule(nn.Module):
    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        self.dim_in = dim_in
        self.dim_hidden = dim_hidden
        self.net = nn.Sequential(
            nn.Linear(self.dim_in, self.dim_hidden),
            nn.Linear(self.dim_hidden, self.dim_in)
        )

    def forward(self, x):
        return self.net(x)
Toy dataset
import numpy as np
import torch
from torch.utils.data import Dataset


class DummyDataset(Dataset):
    def __init__(self, dim: int, num_data: int = 10000):
        super().__init__()
        self.dim = dim
        self.num_data = num_data

    def get_item(self, idx):
        data = np.linspace(0, idx, self.dim)
        return torch.from_numpy(data.astype(np.float32))

    def __len__(self):
        return self.num_data

    def __getitem__(self, idx):
        return self.get_item(idx)
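Before moving on to distributed training, the two components above can be sanity-checked in a single process. This is an illustrative snippet (not part of the repository), assuming the classes are saved as simple_module.py and dummy_dataset.py, matching the imports used in the training code below.

from torch.utils.data import DataLoader

from simple_module import SimpleModule
from dummy_dataset import DummyDataset

model = SimpleModule(dim_in=1000, dim_hidden=100)
dataset = DummyDataset(dim=model.dim_in, num_data=100)
loader = DataLoader(dataset, batch_size=8)

x = next(iter(loader))
y = model(x)
print(x.shape, y.shape)  # torch.Size([8, 1000]) torch.Size([8, 1000])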
Training code
import argparse

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

from simple_module import SimpleModule
from dummy_dataset import DummyDataset


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--bs", type=int, default=50, help="batch size")
    parser.add_argument("--lr", type=float, default=0.0001, help="learning rate")
    parser.add_argument("--amp", type=str, default='fp16', help="automatic mixed precision")
    return parser.parse_args()


def main():
    args = get_args()

    # Initialize accelerator
    accelerator = Accelerator(mixed_precision=args.amp, split_batches=True)

    # Model
    model = SimpleModule(dim_in=1000, dim_hidden=100)

    # Dataset
    num_data = 10000
    dataset = DummyDataset(model.dim_in, num_data)
    dataloader = DataLoader(dataset, batch_size=args.bs, num_workers=4,
                            pin_memory=True, persistent_workers=True, shuffle=True)

    # Optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, betas=[0.0, 0.99])

    # Prepare for distributed training
    model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

    for idx_e in range(200):
        loss_epoch = 0.
        for idx_d, x in enumerate(dataloader):
            # forward
            optimizer.zero_grad(set_to_none=True)
            y = model(x)

            # RMSE loss
            loss = ((x - y) ** 2).mean().sqrt()

            # backward
            accelerator.backward(loss)
            optimizer.step()

            loss_epoch += loss.detach()

        if accelerator.is_main_process:
            loss_epoch /= idx_d + 1
            print(f'Epoch {idx_e+1} : {loss_epoch}')


if __name__ == '__main__':
    main()
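Before submitting a multi-node job, you can try the script on a single machine. A minimal sketch, assuming the accelerate CLI that ships with the package is available:

# single-process debug run (mixed precision disabled for CPU-only environments)
python train.py --bs 50 --lr 0.0001 --amp no

# single-node multi-GPU run via the Accelerate launcher (e.g. 2 GPUs)
accelerate launch --multi_gpu --num_processes 2 train.py --bs 50 --lr 0.0001 --amp fp16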
Prepare job scripts
There are cases where the environment variables used internally by HuggingFace Accelerate are not set automatically. To address this, the python command is wrapped as follows so that the corresponding variables are copied from OpenMPI.
python.bash
#!/bin/bash
# define (copy) environment variables for Huggingface Accelerate
export WORLD_SIZE=$OMPI_COMM_WORLD_SIZE
export RANK=$OMPI_COMM_WORLD_RANK
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
export MASTER_PORT=11111
exec python "$@"
Most of the required configuration (e.g. the number of GPUs per node) can be obtained automatically.
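As a quick check, the values that Accelerate derives from these variables can be printed from inside the training script (an illustrative snippet, not part of the repository):

from accelerate import Accelerator

accelerator = Accelerator()

# these values are derived from WORLD_SIZE / RANK / LOCAL_RANK set in python.bash
print(f'num_processes       : {accelerator.num_processes}')
print(f'process_index       : {accelerator.process_index}')
print(f'local_process_index : {accelerator.local_process_index}')
print(f'device              : {accelerator.device}')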
train.bash
#!/bin/bash
#$ -cwd
# load modules
# (this can be modified based on the ABCI version.)
source /etc/profile.d/modules.sh
module load hpcx/2.12
module load singularitypro/3.11
module load cuda/11.6/11.6.2
module load nccl/2.11/2.11.4-1
# Singularity container path
CONTAINER_PATH="/path/to/your/container.sif"
# the job ID is available if you need it
JOB_NAME=$JOB_ID
# detect GPU type, V100 or A100
GPU_INFO=$(nvidia-smi --query-gpu=gpu_name --format=csv)
if [[ $GPU_INFO =~ "V100" ]]; then
    NUM_GPUS_PER_NODE=4
elif [[ $GPU_INFO =~ "A100" ]]; then
    NUM_GPUS_PER_NODE=8
else
    echo "Error: unsupported GPU type." >&2
    exit 1
fi
# get number of GPUs
GPUS_IN_ONE_NODE=$(nvidia-smi --list-gpus | wc -l)
NUM_GPU=$(expr ${NHOSTS} \* ${GPUS_IN_ONE_NODE})
echo "NUM_GPU = ${NUM_GPU}"
# MPI options
MPIOPTS="-np $NUM_GPU -N ${NUM_GPUS_PER_NODE} -x MASTER_ADDR=${HOSTNAME} -hostfile $SGE_JOB_HOSTLIST"
# directories to be mounted
ROOT_SRC="/path/to/your/source/codes/"
# configurations
batch_size=256 # must be divisible by the total number of GPUs (16 for 2 A100 nodes) since split_batches=True
learning_rate=0.0001
amp=fp16
# execute
mpirun $MPIOPTS \
singularity exec --nv --pwd ${ROOT_SRC}/src/ -B ${ROOT_SRC} \
${CONTAINER_PATH} \
${ROOT_SRC}/job/python.bash ${ROOT_SRC}/src/train.py \
--bs ${batch_size} \
--lr ${learning_rate} \
--amp ${amp}
Finally, you can run your training on multiple nodes on ABCI.
_run_job_on_abci_nodes.bash
# training with 2 nodes
qsub -j y -g gce12345 -l rt_AF=2 -l h_rt=0:30:00 ./job/train.bash
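For reference, switching to V100 nodes should only require changing the resource type (I assume rt_F is the full V100 node type here; the available resource types depend on the ABCI version), and the status of submitted jobs can be checked with qstat.

# training with 2 V100 nodes (4 GPUs each)
qsub -j y -g gce12345 -l rt_F=2 -l h_rt=0:30:00 ./job/train.bash

# check the status of submitted jobs
qstat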