[2025-1] 임수연 - Large scale distributed neural network training through online distillation | Relational knowledge distillation | Be your own teacher: Improve the performance of convolutional neural networks via self distillation

Hello. In this post we walk through several variants of distillation, one after another. The summary focuses on the core ideas, so please refer to the papers for details such as the experiments.

Knowledge perspective

< Relational Knowledge Distillation, 2019 >

1. Introduction

The paper proposes a new approach, RKD (Relational Knowledge Distillation), together with two losses: a distance-wise and an angle-wise distillation loss. With these, the student model can outperform its teacher, notably in metric learning.

Conventional knowledge distillation uses KL divergence or Euclidean distance as the loss function and passes each teacher output to the student individually. RKD instead passes relational information computed by a relational potential function. Details follow below.

2. Relational Knowledge Distillation

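RKD uses the mutual relations among data examples in the teacher model's output representation.
A relational potential Φ is computed from n-tuples of examples and transferred from the teacher (t) model to the student (s) model.
Advantages: higher-order properties can be transferred, and knowledge can be transferred even when the output dimensions of t and s differ.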
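
The transfer described above can be written as one objective. The following is a sketch in this post's notation (Φ is the relational potential, t_i and s_i are the teacher and student outputs for example x_i). The penalty function l and the set of n-tuples X^n are symbols introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{RKD}}
  = \sum_{(x_1,\dots,x_n)\in\mathcal{X}^n}
    l\big(\Phi(t_1,\dots,t_n),\ \Phi(s_1,\dots,s_n)\big)
```

Because Φ maps each tuple to a relational value rather than to the raw outputs, the two models only need to agree on those values, which is why their output dimensions may differ.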

3. Distance-wise distillation loss

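The distance-wise potential is the Euclidean distance between two data examples in the output representation space, computed separately for t and for s.

Normalizing the distances by their mean within the mini-batch makes training more stable and convergence faster.
Advantages: simple but effective; it absorbs the scale difference between t and s.
The relation is transferred by penalizing the difference between the teacher's and the student's distances between data examples in the output representation space.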
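
The distance-wise loss can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the Huber penalty is an assumption of this sketch (the post only says that the difference is penalized). It also assumes that not all examples in the batch coincide, since the mean distance is used as a divisor.

```python
import math
from itertools import combinations


def normalized_pair_distances(embeddings):
    """Euclidean distance of every pair, divided by the mean pair distance."""
    dists = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    mean = sum(dists) / len(dists)  # mini-batch normalization factor
    return [d / mean for d in dists]


def huber(x, y, delta=1.0):
    """Huber penalty; the choice of penalty is an assumption of this sketch."""
    diff = abs(x - y)
    return 0.5 * diff * diff if diff <= delta else delta * (diff - 0.5 * delta)


def rkd_distance_loss(teacher, student):
    """Mean penalty between teacher and student normalized pair distances."""
    t = normalized_pair_distances(teacher)
    s = normalized_pair_distances(student)
    return sum(huber(a, b) for a, b in zip(t, s)) / len(t)


# Hypothetical mini-batch: 3-D teacher outputs, 2-D student outputs.
teacher = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
student = [(0.0, 0.0), (0.3, 0.4), (0.6, 0.8)]
print(rkd_distance_loss(teacher, student))
```

In this example the student lives in a different dimension and at a different scale, yet its normalized pairwise distances match the teacher's, so the loss is numerically zero. This illustrates the two advantages listed above.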

4. Angle-wise distillation loss

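The relation is defined by the angle that three data examples form in the output representation space, measured by its cosine.

The relation is transferred by penalizing the difference between the angles of the t and s models.
Advantages: an angle is a higher-order property than a distance, so it can transfer relational information more effectively, giving faster convergence and higher performance.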
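
The angle-wise loss can be sketched in the same style. The function names are again hypothetical and the Huber penalty is an assumption of this sketch. The angle is taken at the middle example of each triplet, and all examples are assumed to be distinct so that no vector has zero length.

```python
import math
from itertools import permutations


def cos_angle(a, b, c):
    """Cosine of the angle at vertex b formed by the points a, b, c."""
    ba = [x - y for x, y in zip(a, b)]
    bc = [x - y for x, y in zip(c, b)]
    norm = math.hypot(*ba) * math.hypot(*bc)  # assumes distinct points
    return sum(x * y for x, y in zip(ba, bc)) / norm


def huber(x, y, delta=1.0):
    """Huber penalty; the choice of penalty is an assumption of this sketch."""
    diff = abs(x - y)
    return 0.5 * diff * diff if diff <= delta else delta * (diff - 0.5 * delta)


def rkd_angle_loss(teacher, student):
    """Mean penalty between teacher and student cosines over all triplets."""
    triplets = list(permutations(range(len(teacher)), 3))
    total = 0.0
    for i, j, k in triplets:
        t_cos = cos_angle(teacher[i], teacher[j], teacher[k])
        s_cos = cos_angle(student[i], student[j], student[k])
        total += huber(t_cos, s_cos)
    return total / len(triplets)


# Hypothetical mini-batch: the student is a scaled copy in a higher dimension.
teacher = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
student = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(rkd_angle_loss(teacher, student))
```

Angles are unchanged by uniform scaling, so no batch normalization factor is needed here, unlike in the distance-wise case.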

Distillation perspective

< Large scale distributed neural network training through online distillation, 2018 >

1. Introduction

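Ensembles increase test-time cost, and distillation increases the complexity of the training pipeline, so neither is easy to use in industrial settings.
To overcome this, the paper proposes online distillation.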

2. Online Distillation

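Data-parallel training across multiple GPUs, where copies of the network pass knowledge to one another.
A distillation method in which each model is trained to make predictions similar to those of the other models.
Parameters are updated in parallel, and each model learns to match the average prediction of the other models, which improves accuracy and speed.
The model itself does not become smaller, but the variability of its predictions is lower.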

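for loop: for every network, compute the error against the true y value and update the parameters.
while loop: average the predictions of all networks other than the current one (the t models) to form a soft target, and add a distillation term that measures the error between that soft target and the current network's (s model's) prediction. Keep updating the parameters until convergence.
Both loops are executed in parallel.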
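
The per-network objective described above can be sketched as follows. This is a simplified single-example illustration with hypothetical function names. It uses cross entropy against the peer average as the distillation term and a `distill_weight` parameter that are assumptions of this sketch, and it ignores how predictions are exchanged between workers in an actual distributed setup.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred):
    """Cross entropy of pred against a (hard or soft) target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)


def peer_average(all_probs, index):
    """Average prediction of every network except the one at `index`."""
    peers = [p for j, p in enumerate(all_probs) if j != index]
    return [sum(col) / len(peers) for col in zip(*peers)]


def codistillation_loss(all_logits, index, one_hot_label, distill_weight=1.0):
    """Hard-label loss plus a distillation term toward the peer average."""
    all_probs = [softmax(z) for z in all_logits]
    own = all_probs[index]
    hard = cross_entropy(one_hot_label, own)  # error against the true y
    soft = cross_entropy(peer_average(all_probs, index), own)  # soft target
    return hard + distill_weight * soft


# Hypothetical logits from three copies of the network for one example.
logits = [[2.0, 0.5, 0.1], [1.5, 0.7, 0.2], [0.1, 2.0, 0.3]]
label = [1.0, 0.0, 0.0]
print(codistillation_loss(logits, 0, label))
```

Setting `distill_weight=0.0` reduces the objective to the plain hard-label update of the for loop, while a positive weight adds the distillation term of the while loop.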

Reference: [DMQA Open Seminar] Introduction to Knowledge Distillation, https://www.youtube.com/watch?v=pgfsxe8sROQ

< Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation >

1. Introduction

Knowledge distillation is a model compression technique for making networks lightweight. A compact s model is trained to approximate an over-parameterized t model, and sometimes s even surpasses t.

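Setbacks of traditional distillation:

Low efficiency: the t model cannot hand over all of its knowledge to the s model.
Few s models actually outperform their teacher.
It is unclear how to design and train a suitable t model.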

Instead of the conventional two-step KD, the paper proposes a one-step self distillation framework.

2. Self Distillation

The model is divided into several shallow sections. The deepest part (the final layers) is treated as the teacher model, and the remaining shallow sections are trained as student models.

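Steps:

Divide the target CNN into shallow sections.
After each shallow section, add a classifier consisting of a bottleneck layer and a fully connected layer. These classifiers are used only during training and are removed at inference.
cf) The bottleneck is added to minimize the influence between the shallow classifiers; an L2 loss is added to pass hints.
To improve the performance of the s models, three loss terms are introduced during training: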
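1. Cross Entropy Loss :
  - Every shallow classifier and the deepest classifier learn from the label information.
  - Cross entropy is computed between each classifier's softmax output and the dataset's ground-truth labels.
  - The knowledge hidden in the dataset labels is introduced directly to all classifiers.
2. KL Divergence Loss :
  - Learns the difference between the softmax outputs of the deepest section (= t model) and the shallow sections (= s models).
  - KL divergence is computed between the softmax outputs of each shallow classifier and the deepest classifier.
  - The t model's knowledge is passed to each shallow classifier, guiding s to follow the output distribution of t.
3. L2 Loss :
  - The feature map of the deepest classifier is passed to the bottleneck layer of each shallow classifier.
  - An L2 loss is computed between that feature map and the bottleneck layer's output.
  - The inexplicit knowledge in the feature map is introduced to the shallow classifiers, so that every s learns to fit the feature map of t.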
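
How the three terms combine can be sketched for a single example as follows. This is an illustration with hypothetical function names, not the authors' implementation. The weights `alpha` and `lam`, their default values, and the KL direction (teacher relative to student) are assumptions of this sketch. Feature maps are flattened to plain lists here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross entropy against a ground-truth class index."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def squared_l2(a, b):
    """Squared L2 distance between two flattened feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_distillation_loss(shallow_logits, shallow_feats, deep_logits,
                           deep_feat, label, alpha=0.3, lam=0.03):
    """Sum of the three loss terms; alpha and lam are hypothetical weights."""
    teacher = softmax(deep_logits)
    total = cross_entropy(teacher, label)  # deepest classifier: labels only
    for logits, feat in zip(shallow_logits, shallow_feats):
        student = softmax(logits)
        total += (1 - alpha) * cross_entropy(student, label)  # 1. labels
        total += alpha * kl_divergence(teacher, student)      # 2. soft output
        total += lam * squared_l2(feat, deep_feat)            # 3. feature hint
    return total


# Hypothetical outputs: two shallow classifiers and the deepest classifier.
deep_logits, deep_feat = [2.0, 1.0, 0.1], [0.5, -0.2]
shallow_logits = [[0.3, 0.2, 0.1], [1.5, 0.9, 0.2]]
shallow_feats = [[0.1, 0.0], [0.4, -0.1]]
print(self_distillation_loss(shallow_logits, shallow_feats,
                             deep_logits, deep_feat, label=0))
```

When a shallow classifier reproduces the deepest classifier's logits and features exactly, its KL and L2 terms vanish and only the label term remains.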
