PyTorch training/inference performance optimization (1/2)

This post builds on the referenced blog post and explains how to optimize training and inference performance in the PyTorch framework.

Performance here means two things: 1. speed and 2. memory. Good speed performance means a low time cost for training and inference; good memory performance means that training and inference use little memory.

The optimization methods covered in this post are:

Data Loading optimization
- Set num_workers
- Use pinned memory
Data Operation optimization
- Use torch.Tensor and assign the device at creation
- Reduce data transfer between CPU and GPU
- Use tensor.to(non_blocking=True)
Training optimization
- Set architecture dimensions and batch size to multiples of 8
- Use Mixed Precision Training
- Set gradients to None before the optimizer updates the weights
- Use gradient accumulation
Inference optimization
- Turn off gradient calculation during inference
CNN optimization
- Use torch.backends.cudnn.benchmark = True
- Use the channels_last memory format for 4D NCHW tensors
- Do not use the Conv bias in a Conv-BN structure

1. Data Loading optimization

1.1 Set num_workers (time cost ↓)

The DataLoader parameter num_workers sets how many worker subprocesses are used for data loading and augmentation on the CPU. With num_workers=0, data is loaded in the main process, so the next batch is loaded only after the previous step (for example the weight update) has finished. This is synchronous and hurts speed. Setting num_workers > 0 lets data loading and augmentation run asynchronously, which reduces the training time cost. A very large num_workers adds memory overhead, though, so num_workers=4*num_GPU is reported as an empirically reasonable value.

DataLoader(dataset, num_workers=4 * num_GPU)

1.2 Use pinned memory (time cost ↓)

As the left side of the figure below shows, the GPU cannot read directly from pageable CPU memory; the data is first copied into staging memory (a.k.a. pinned memory) and transferred from there. That extra hop adds time cost. Setting pin_memory=True makes the DataLoader place batches in pinned memory on the CPU up front, which removes the copy from pageable memory into staging memory. This option is used alongside num_workers from above.

https://miro.medium.com/max/1400/1*M8mejDZ5WbnFl8h59UfjCg.png

DataLoader(dataset, pin_memory=True)
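
A minimal sketch combining the two DataLoader options above. The helper name build_loader, the toy TensorDataset, and the batch size are illustrative assumptions, not part of the referenced post; pin_memory is switched on only when a CUDA device is present, since pinning is only useful for host-to-GPU copies.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(dataset, num_gpus, batch_size=32):
    """Hypothetical helper: applies the 4 * num_GPU rule of thumb and pins memory for GPU runs."""
    use_cuda = torch.cuda.is_available()
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=4 * num_gpus,  # 0 falls back to loading in the main process
        pin_memory=use_cuda,       # pinned memory only matters for host-to-GPU copies
    )


if __name__ == "__main__":
    # Toy dataset: 256 random "images" with integer labels (illustration only)
    dataset = TensorDataset(torch.rand(256, 3, 8, 8), torch.randint(0, 10, (256,)))
    loader = build_loader(dataset, num_gpus=torch.cuda.device_count())
    images, labels = next(iter(loader))
    print(images.shape, labels.shape)
```

The 4 * num_GPU value is a starting point; the best setting depends on the CPU core count and how heavy the augmentation is.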

2. Data Operation optimization

2.1 Use torch.Tensor and assign the device at creation (time cost ↓)

When you define or create data, it is efficient to use torch.Tensor and assign the device at creation time. Put differently, avoid building the data with native Python or NumPy first. Training usually runs on a GPU, so data built in Python or NumPy and then converted to a torch.Tensor is created on the CPU and then moved to the GPU, which adds time cost. Creating the torch.Tensor directly on the GPU device avoids that transfer.

import torch

# Same as np.random.rand(10, 5), but created directly on the GPU
tensor = torch.rand([10, 5], device=torch.device('cuda:0'))

# Same as np.random.randn(10, 5), but created directly on the GPU
tensor = torch.randn([10, 5], device=torch.device('cuda:0'))
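
A small sketch of the same idea with a device fallback, so it also runs on a machine without a GPU. The variable names are illustrative; torch.zeros, torch.arange, torch.randn_like, and torch.from_numpy are standard PyTorch calls.

```python
import numpy as np
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Slower path: build on the CPU with NumPy, then convert and copy to the device
mask_np = np.zeros((10, 5), dtype=np.float32)
mask_slow = torch.from_numpy(mask_np).to(device)

# Preferred path: allocate directly on the target device
mask_fast = torch.zeros(10, 5, device=device)
steps = torch.arange(10, device=device)

# A new tensor can also inherit device and dtype from an existing one
noise = torch.randn_like(mask_fast)
```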

2.2 Reduce data transfer between CPU and GPU (time cost ↓)

To keep I/O cost as low as possible, avoid CPU-GPU data transfers like the following unless they are necessary.

# BAD! AVOID THEM IF UNNECESSARY!
print(cuda_tensor)
cuda_tensor.cpu()
cuda_tensor.to('cpu')
cpu_tensor.cuda()
cpu_tensor.to('cuda')
cuda_tensor.item()
cuda_tensor.cpu().numpy()
cuda_tensor.nonzero()
cuda_tensor.tolist()
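
One common source of hidden transfers is logging: calling .item() on the loss every iteration forces a GPU-to-CPU copy each time. A sketch of one way to avoid it, using a hypothetical list of per-batch losses as a stand-in for a real training loop, keeps the running sum on the device and converts once at the end.

```python
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Stand-in for per-batch losses produced during an epoch (illustration only)
batch_losses = [torch.tensor(float(i), device=device) for i in range(1, 5)]

# Avoid: total += loss.item()  -> one device-to-host sync per batch
running = torch.zeros((), device=device)
for loss in batch_losses:
    running += loss.detach()  # stays on the device, no transfer

# Single transfer at the end of the epoch
mean_loss = (running / len(batch_losses)).item()
print(mean_loss)
```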

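2.3 Use tensor.to(non_blocking=True) (time cost ↓)

As the figure below shows, calling tensor.to(..., non_blocking=True) lets the data transfer proceed asynchronously, which can reduce execution time. The copy is only asynchronous when the source tensor is in pinned memory.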

https://miro.medium.com/max/1390/1*no-gQHz8daJbmYhCfAGNOA.png

for input, target in dataloader:
    # These two lines start non-blocking transfers that can overlap with other work
    input = input.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)

    # Code placed here that does not use input or target
    # runs while the transfers are in flight, which reduces execution time

    output = model(input)  # synchronization point: waits for the two transfers above
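
A self-contained sketch of the same pattern outside a DataLoader. It pins the source tensor explicitly, because an asynchronous copy needs pinned host memory, and falls back to a plain CPU run when no GPU is available. The tensor shape and the CPU-side work are arbitrary placeholders.

```python
import torch

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

batch = torch.rand(64, 3, 32, 32)
if use_cuda:
    batch = batch.pin_memory()  # page-locked host memory enables async copies

# Returns immediately on the host when the source is pinned
batch_gpu = batch.to(device, non_blocking=True)

# Independent CPU work can overlap with the transfer here
cpu_side = sum(range(1000))

if use_cuda:
    torch.cuda.synchronize()  # make sure the copy has finished before reading on the host

print(batch_gpu.device, cpu_side)
```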

3. Training optimization

3.1 Set architecture dimensions and batch size to multiples of 8 (time cost ↓)

To maximize GPU computation efficiency, set the model's input and output sizes, channel counts, and batch size all to multiples of 8. The reason is that the Tensor Cores in Nvidia GPUs perform best when matrix dimensions are aligned to multiples of 8.

In the referenced experiment, setting the output size and batch size to multiples of 8 (e.g. 33712, 4088, 4096) made computation about 1.3~4 times faster than values that are not multiples of 8 (e.g. 33708, 4084, 4095). How large the gap is depends mainly on the process type (e.g. forward pass, gradient calculation) and the cuBLAS version.
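
A tiny helper, not from the referenced post, for padding a dimension such as a vocabulary size or channel count up to the next multiple of 8. Applied to the non-aligned sizes from the experiment, it yields the aligned ones.

```python
def round_up(value: int, multiple: int = 8) -> int:
    """Round value up to the nearest multiple (hypothetical helper)."""
    return ((value + multiple - 1) // multiple) * multiple


for size in (33708, 4084, 4095):
    print(size, "->", round_up(size))
# 33708 -> 33712
# 4084 -> 4088
# 4095 -> 4096
```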

3.2 Use Mixed Precision Training (time, memory cost ↓)

Mixed Precision Training combines the single-precision (FP32) and half-precision (FP16) formats during training. Because it mixes in FP16, which has a smaller data size than the FP32-only approach, it saves memory and also speeds up training.

For details on this method, see the previous post on Mixed Precision Training.
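
For orientation, a minimal sketch of one training step with PyTorch's automatic mixed precision. The toy model, data, and learning rate are assumptions for illustration; torch.autocast and torch.cuda.amp.GradScaler are the PyTorch AMP entry points (recent releases also expose the scaler as torch.amp.GradScaler). AMP is enabled only when a GPU is available, so the sketch degrades to plain FP32 on a CPU.

```python
import torch
from torch import nn

use_amp = torch.cuda.is_available()  # this sketch enables AMP only on a GPU
device = torch.device("cuda:0" if use_amp else "cpu")

model = nn.Linear(16, 8).to(device)  # toy model, sizes are multiples of 8
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # scales the loss to avoid FP16 underflow

x = torch.rand(32, 16, device=device)
y = torch.rand(32, 8, device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type=device.type, enabled=use_amp):
    loss = criterion(model(x), y)  # forward pass runs in reduced precision where safe

scaler.scale(loss).backward()  # backward on the scaled loss
scaler.step(optimizer)         # unscales gradients, skips the step if they contain inf/nan
scaler.update()                # adjusts the scale factor for the next iteration
```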

3.3 Set gradients to None before the optimizer updates the weights (time cost ↓)

Zeroing gradients the usual way with model.zero_grad() or optimizer.zero_grad() runs a memset over every parameter, and the next backward pass then updates the gradients with both read and write operations. Setting the gradients to None skips the memset, and the next backward pass only has to write the gradients. That is why setting gradients to None is faster than zeroing them.

# Set gradients to None (PyTorch < 1.7)
for param in model.parameters():
    param.grad = None

# Set gradients to None (PyTorch >= 1.7; this is the default since PyTorch 2.0)
optimizer.zero_grad(set_to_none=True)

3.4 Use gradient accumulation (time cost ↓)

Gradient accumulation does not update the weights right after the loss of each batch; it accumulates gradients over several batches and then updates the weights once. It helps when the input data is very large or GPU memory is small and the batch size has to be kept small: it can reduce time cost and improve accuracy.

accum_steps = 2  # number of batches to accumulate before each update

for i, (input, target) in enumerate(dataloader):
    output = model(input)
    # Scale the loss (assuming a mean-reduced criterion) so the accumulated gradient matches one larger batch
    loss = criterion(output, target) / accum_steps
    loss.backward()

    # Update the weights after every 2 iterations, which has the effect of training with a doubled batch size
    if (i + 1) % accum_steps == 0 or (i + 1) == len(dataloader):
        optimizer.step()  # weight update
        optimizer.zero_grad(set_to_none=True)
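
A quick check of the "doubled batch size" claim, assuming a mean-reduced loss and a model without batch-dependent layers: when each micro-batch loss is divided by the number of accumulation steps, the accumulated gradient matches the gradient of the full batch. The toy linear model and random data are illustrative only.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
criterion = nn.MSELoss()  # mean reduction
x, y = torch.rand(8, 4), torch.rand(8, 1)

# Reference: one batch of 8 samples
model.zero_grad(set_to_none=True)
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated: two micro-batches of 4, each loss scaled by 1/2
model.zero_grad(set_to_none=True)
for xb, yb in zip(x.chunk(2), y.chunk(2)):
    (criterion(model(xb), yb) / 2).backward()  # backward() adds into .grad
accum_grad = model.weight.grad

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```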

4. Inference optimization

4.1 Turn off gradient calculation during inference (time, memory cost ↓)

Inference is not training, so gradient computation is unnecessary; disable the gradient-related operations.

# Use torch.no_grad() as a decorator in inference code
@torch.no_grad()
def validation(model, input):
    output = model(input)
    return output
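
torch.no_grad() also works as a context manager, and it is usually paired with model.eval(), which switches layers such as dropout and batch normalization to their inference behaviour. The toy model below is an assumption for illustration.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.5), nn.Linear(8, 2))
model.eval()  # disable dropout / use BN running statistics

x = torch.rand(4, 8)

with torch.no_grad():
    out = model(x)

print(out.requires_grad)  # False: no autograd graph was recorded
```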

5. CNN optimization

5.1 Use torch.backends.cudnn.benchmark = True (time cost ↓)

Setting torch.backends.cudnn.benchmark = True before the training loop can speed up computation. The performance of cuDNN algorithms varies with factors such as kernel size, so the auto-tuner runs a benchmark to find the best algorithm for each configuration. The setting pays off when the input size does not change, so it is worth enabling when training CNN models with fixed-size inputs.

torch.backends.cudnn.benchmark = True

5.2 Use the channels_last memory format for 4D NCHW tensors (time cost ↓)

https://miro.medium.com/max/1400/1*yZF37VL9xLoYs6EpwpnyqQ.png

Images are normally stored in NCHW layout, where (in memory) the values are grouped per RGB channel. Converting with x = x.to(memory_format=torch.channels_last) switches the in-memory layout to NHWC, where the RGB values are interleaved per pixel as in the figure above. Used together with Mixed Precision Training, the NHWC format is reported to give a 7~19% speed-up over the NCHW format.

Only the in-memory layout of the pixels differs; the shape of the data itself does not change.

N, C, H, W = 10, 3, 32, 32
x = torch.rand(N, C, H, W)

# Stride is the gap (distance in elements) between one element and the next along each dimension
print(x.stride())  # stride: (3072, 1024, 32, 1)

x2 = x.to(memory_format=torch.channels_last)  # switch to NHWC layout in memory
print(x2.shape)  # shape stays (10, 3, 32, 32)
print(x2.stride())  # stride becomes (3072, 1, 96, 3): the channel stride is now 1
print((x == x2).all())  # tensor(True): values are unchanged, only the memory layout differs
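
In practice the model is converted as well as the input, so that the convolution weights use the same layout. A minimal sketch, assuming a single toy Conv2d layer in place of a real network:

```python
import torch
from torch import nn

model = nn.Conv2d(3, 8, kernel_size=3, padding=1)
model = model.to(memory_format=torch.channels_last)  # converts the 4D conv weight

x = torch.rand(10, 3, 32, 32).to(memory_format=torch.channels_last)
out = model(x)

print(x.is_contiguous(memory_format=torch.channels_last))
print(model.weight.is_contiguous(memory_format=torch.channels_last))
print(out.shape)
```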

5.3 Do not use the Conv bias in a Conv-BN structure (time, memory cost ↓)

Those familiar with the theory of Batch Normalization (BN) will know that the BN layer already has its own bias weight, so a bias in the preceding Conv does not improve performance and just becomes a redundant weight. In a Conv-BN structure, therefore, leave out the Conv bias.

nn.Conv2d(..., bias=False)
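
A small numerical check of why the bias is redundant, under the assumption that BN runs in training mode (batch statistics): the per-channel mean subtraction removes any constant added by the Conv bias, so the outputs with and without the bias agree. Layer sizes and input shape are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
conv_with_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_no_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_no_bias.weight.copy_(conv_with_bias.weight)  # same filters, no bias

bn = nn.BatchNorm2d(8)  # training mode: normalizes with batch statistics
x = torch.rand(4, 3, 16, 16)

out_with = bn(conv_with_bias(x))
out_without = bn(conv_no_bias(x))

# The per-channel mean subtraction in BN removes the constant conv bias
print(torch.allclose(out_with, out_without, atol=1e-4))
```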

That covers the performance optimization methods available in the PyTorch framework. The next post will apply these methods in practice and measure how much faster they actually make things.
