#ai #deeplearning

Best Practices for Multi-Node Training on ABCI with PyTorch

This article summarizes a simple method for conducting distributed training on ABCI, the GPU cloud computing service operated by AIST. The following repository provides a simple example of training code for ABCI with support for multi-node training.

https://github.com/yukara-ikemiya/abci-code-sample

Let’s use ABCI for training large-scale models

AI Bridging Cloud Infrastructure (ABCI) is the world’s first large-scale Open AI Computing Infrastructure, constructed and operated by National Institute of Advanced Industrial Science and Technology (AIST). (https://abci.ai/en/about_abci/)

How to use ABCI

Create a Python environment

ABCI supports the use of Singularity containers for building training environments. There are several ways to build a Singularity image file (SIF); in my repository, scripts are provided that first create a Docker image and then convert that image into a SIF file.

Once a SIF file is created, move it to any directory on ABCI and use it for running your training.
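
If you have root privileges on a local machine with Docker and Singularity installed, the conversion can be done with standard Singularity commands. Below is a minimal sketch, assuming your image is named my-training-image:latest and that SSH access to ABCI is configured; the actual image name and the scripts in the repository may differ.

# build the Docker image from your Dockerfile (image name is an example)
docker build -t my-training-image:latest .

# convert the local Docker image into a SIF file
singularity build my-training-image.sif docker-daemon://my-training-image:latest

# copy the SIF file to ABCI (host alias and path are placeholders)
scp my-training-image.sif abci:/path/to/your/container.sif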

Prepare the training code

When conducting distributed training with PyTorch, it is common to write the training code using DistributedDataParallel (DDP). Here, however, I recommend HuggingFace Accelerate, which wraps DDP and allows for a more straightforward implementation. This not only simplifies the code but also helps prevent the introduction of potential bugs.

HuggingFace Accelerate

Model (simplest auto-encoder)

import torch
from torch import nn

class SimpleModule(nn.Module):
    def __init__(self, dim_in, dim_hidden):
        super().__init__()

        self.dim_in = dim_in
        self.dim_hidden = dim_hidden

        self.net = nn.Sequential(
            nn.Linear(self.dim_in, self.dim_hidden),
            nn.Linear(self.dim_hidden, self.dim_in)
        )

    def forward(self, x):
        return self.net(x)

Toy dataset

import numpy as np
import torch
from torch.utils.data import Dataset


class DummyDataset(Dataset):
    def __init__(self, dim:int, num_data:int=10000):
        super().__init__()
        self.dim = dim
        self.num_data = num_data
    
    def get_item(self, idx):
        data = np.linspace(0, idx, self.dim)
        return torch.from_numpy(data.astype(np.float32))

    def __len__(self):
        return self.num_data

    def __getitem__(self, idx):
        return self.get_item(idx)

Training code

import argparse

import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator

from simple_module import SimpleModule
from dummy_dataset import DummyDataset

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--bs", type=int, default=50, help="batch size")
    parser.add_argument("--lr", type=float, default=0.0001, help="learning rate")
    parser.add_argument("--amp", type=str, default='fp16', help="autmatic mixed precision")
    return parser.parse_args()

def main():
    args = get_args()

    # Initialize accelerator
    accelerator = Accelerator(mixed_precision=args.amp, split_batches=True)

    # Model
    model = SimpleModule(dim_in=1000, dim_hidden=100)

    # Dataset
    num_data = 10000
    dataset = DummyDataset(model.dim_in, num_data)
    dataloader = DataLoader(dataset, batch_size=args.bs, num_workers=4,
                            pin_memory=True, persistent_workers=True, shuffle=True)

    # Optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, betas=[0.0, 0.99])

    # Prepare for distributed training
    model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

    for idx_e in range(200):
        loss_epoch = 0.
        for idx_d, x in enumerate(dataloader):
            # forward
            optimizer.zero_grad(set_to_none=True)
            y = model(x)

            # RMSE loss
            loss = ((x - y) ** 2).mean().sqrt()

            # backward
            accelerator.backward(loss)
            optimizer.step()

            loss_epoch += loss.detach()

        if accelerator.is_main_process:
            loss_epoch /= idx_d + 1
            print(f'Epoch {idx_e+1} : {loss_epoch}')

if __name__ == '__main__':
    main()
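
Before submitting a job to ABCI, it can be convenient to sanity-check the script locally. When the distributed environment variables are not set, Accelerate falls back to single-process mode, so the script can be run directly on a single GPU (assuming it is saved as train.py next to simple_module.py and dummy_dataset.py):

python train.py --bs 50 --lr 0.0001 --amp fp16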

Prepare job scripts

There are cases where the environment variables used internally by HuggingFace Accelerate are not set automatically. To address this, the python command is wrapped as follows so that the corresponding variables from Open MPI are copied and used.

python.bash

#!/bin/bash

# define (copy) environment variables for Huggingface Accelerate
export WORLD_SIZE=$OMPI_COMM_WORLD_SIZE
export RANK=$OMPI_COMM_WORLD_RANK
export LOCAL_RANK=$OMPI_COMM_WORLD_LOCAL_RANK
export MASTER_PORT=11111
exec python "$@"

Most of the required configuration values (e.g. the number of GPUs per node) can be obtained automatically inside the job script.

train.bash

#!/bin/bash
#$ -cwd

# load modules 
# (this can be modified based on the ABCI version.)
source /etc/profile.d/modules.sh
module load hpcx/2.12
module load singularitypro/3.11
module load cuda/11.6/11.6.2
module load nccl/2.11/2.11.4-1

# Singularity container path
CONTAINER_PATH="/path/to/your/container.sif"

# job ID is available if you need
JOB_NAME=$JOB_ID

# detect GPU type, V100 or A100
GPU_INFO=$(nvidia-smi --query-gpu=gpu_name --format=csv)
if [[ $GPU_INFO =~ "V100" ]]; then
    NUM_GPUS_PER_NODE=4
elif [[ $GPU_INFO =~ "A100" ]]; then
    NUM_GPUS_PER_NODE=8
else
    echo "Error: unknown GPU type." >&2
    exit 1
fi

# get number of GPUs
GPUS_IN_ONE_NODE=$(nvidia-smi --list-gpus | wc -l)
NUM_GPU=$(expr ${NHOSTS} \* ${GPUS_IN_ONE_NODE})
echo "NUM_GPU = ${NUM_GPU}"

# MPI options
MPIOPTS="-np $NUM_GPU -N ${NUM_GPUS_PER_NODE} -x MASTER_ADDR=${HOSTNAME} -hostfile $SGE_JOB_HOSTLIST"

# directories to be mounted
ROOT_SRC="/path/to/your/source/codes/"

# configurations
batch_size=256 # must be a multiple of the total number of GPUs (16 for 2 A100 nodes) because split_batches=True is used
learning_rate=0.0001
amp=fp16

# execute
mpirun $MPIOPTS \
    singularity exec --nv --pwd ${ROOT_SRC}/src/ -B ${ROOT_SRC} \
    ${CONTAINER_PATH} \
    ${ROOT_SRC}/job/python.bash ${ROOT_SRC}/src/train.py \
    --bs ${batch_size} \
    --lr ${learning_rate} \
    --amp ${amp}

Finally, you can run your training on multiple nodes on ABCI.

_run_job_on_abci_nodes.bash

# training with 2 nodes
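# option summary:
#   -j y            : merge stderr into stdout
#   -g gce12345     : your ABCI group ID (placeholder)
#   -l rt_AF=2      : request two full A100 nodes (use rt_F for V100 nodes)
#   -l h_rt=0:30:00 : maximum wall-clock time of 30 minutes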
qsub -j y -g gce12345 -l rt_AF=2 -l h_rt=0:30:00 ./job/train.bash
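
After submission, the job can be monitored with the standard Grid Engine commands available on ABCI. A small usage sketch (the log file name depends on the script name and the job ID assigned by qsub):

# check the status of your submitted jobs
qstat

# follow the merged stdout/stderr log in the submission directory
# (replace <job_id> with the ID printed by qsub)
tail -f train.bash.o<job_id>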