Paper overview

Title: Quantifying Attention Flow in Transformers
Year: 2020 (arXiv:2005.00928)
Citations: 1,331 as of 2025-11-08

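Background: is attention visualization an explanation?
- Self-attention quantifies how much each token refers to every other token, so attention heatmaps have often been used as if they were explanations.
- In a Transformer, however, information is contextualized and mixed layer by layer, and residual connections and FFNs let it bypass or accumulate across layers. As a result, raw attention in higher layers often becomes nearly uniform (flat), and token contributions are hard to read off directly.
- Around 2019, several studies reported mismatches between attention weights and feature importance. The gist: because of layer-by-layer mixing, looking only at upper-layer weights can distort token-level attribution.

How this differs from raw attention
- Raw attention only gives embedding-to-embedding ratios within a single layer; it does not directly describe the information flow traced all the way back to the input tokens.
- If residual connections and multi-layer paths are ignored, it is not visible how far a given input token's identity has propagated.
- The paper therefore proposes two simple, general post-hoc methods that convert "embedding attention" back into "token attention".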

Post-hoc methods explain, interpret, or diagnose a model after it has been fully trained, without changing its internals.

1) Attention Rollout

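Summary: treat attention as a linear combination, and compute how much information propagates by multiplying the weights along each path through the layers.
Implementation core:

For each layer l, average the attention over heads to get W^(l), then account for the residual connection with A^(l) = 0.5·W^(l) + 0.5·I (a normalization that includes the residual). The A^(l) matrices are then multiplied together starting from the bottom layer.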
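
The rollout steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `attention_rollout`, the input layout `(layers, heads, n, n)`, and the random toy attention are all assumptions made for the example.

```python
import numpy as np

def attention_rollout(attentions):
    """Rollout over an array of shape (layers, heads, n, n) whose rows sum to 1."""
    # Average over heads to get one matrix W^(l) per layer.
    w = np.asarray(attentions, dtype=float).mean(axis=1)
    n = w.shape[-1]
    # Mix in the residual connection: A^(l) = 0.5 * W^(l) + 0.5 * I.
    a = 0.5 * w + 0.5 * np.eye(n)
    # Re-normalize rows (a no-op when the input rows already sum to 1).
    a = a / a.sum(axis=-1, keepdims=True)
    # Accumulate from the bottom layer upward: R^(l) = A^(l) @ R^(l-1).
    rollout = np.empty_like(a)
    rollout[0] = a[0]
    for layer in range(1, a.shape[0]):
        rollout[layer] = a[layer] @ rollout[layer - 1]
    return rollout

# Toy input (assumption): 4 layers, 2 heads, 5 tokens, softmax over random logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 2, 5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

rollout = attention_rollout(attn)
# rollout[l][i, j]: estimated share of input token j in embedding i at layer l.
cls_importance = rollout[-1][0]
```

Each row of `rollout[l]` stays a probability distribution over input tokens, because a product of row-stochastic matrices is again row-stochastic.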

2) Attention Flow

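Summary: treat the attention graph as a flow network with edge capacity = attention weight, and use max-flow to compute the maximum flow from each source (a hidden embedding) to each sink (an input token).
The weight of a single path is set by its minimum capacity, and overlapping paths share edge capacity so that no edge overflows.
The residual connection is handled the same way as in rollout (0.5·W + 0.5·I).

(Note) Flow: this is the max-flow concept from graph theory applied to the attention graph. Attention flow = the maximum amount of "water" (information) that can flow from a given embedding in the final layer (e.g. [CLS]) to the input tokens, with each edge's capacity set to its attention weight.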
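
As a rough sketch of the max-flow idea, the code below builds the layered attention graph and runs a plain Edmonds-Karp max-flow from one top-layer embedding to each input token. The choice of Edmonds-Karp, the dense capacity matrix, and the names `max_flow` and `attention_flow` are assumptions for illustration; the paper's own implementation may differ.

```python
from collections import deque
import numpy as np

def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow on a dense capacity matrix (source != sink)."""
    residual = capacity.astype(float).copy()
    num_nodes = residual.shape[0]
    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = np.full(num_nodes, -1)
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in np.nonzero(residual[u] > 1e-12)[0]:
                if parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            return total
        # A path carries only as much as its narrowest edge (min capacity).
        bottleneck = np.inf
        v = sink
        while v != source:
            bottleneck = min(bottleneck, residual[parent[v], v])
            v = parent[v]
        # Push the flow and update residual capacities along the path.
        v = sink
        while v != source:
            residual[parent[v], v] -= bottleneck
            residual[v, parent[v]] += bottleneck
            v = parent[v]
        total += bottleneck

def attention_flow(attentions, source_pos):
    """Max-flow from (top layer, source_pos) to every input token."""
    w = np.asarray(attentions, dtype=float).mean(axis=1)  # (layers, n, n)
    layers, n, _ = w.shape
    a = 0.5 * w + 0.5 * np.eye(n)  # residual-adjusted capacities
    # Node id = level * n + position, where level 0 holds the input tokens.
    capacity = np.zeros(((layers + 1) * n, (layers + 1) * n))
    for l in range(layers):
        # Edges run from level l + 1 down to level l.
        capacity[(l + 1) * n:(l + 2) * n, l * n:(l + 1) * n] = a[l]
    source = layers * n + source_pos
    return np.array([max_flow(capacity, source, j) for j in range(n)])

# Toy input (assumption): 3 layers, 2 heads, 4 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 2, 4, 4))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
flow = attention_flow(attn, source_pos=0)
```

With a single layer the flow to token j reduces to the direct edge capacity. With more layers, several paths can contribute, but each edge's capacity is shared among them.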

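Figure 1: Comparison of attention visualizations

The figure shows the head-averaged self-attention map of each layer. Nodes are the token embeddings at each layer; an edge is the attention weight indicating how much a token in one layer attended to each token in the layer directly below. This shows which tokens from the layer below each token referred to.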

(a) Embedding attentions

Draws only the raw self-attention edges from embedding to embedding in each layer.
The deeper the layer, the closer the edges are to uniform, so it is hard to tell where each input token's identity went.

(b) Attention rollout

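Applies the residual correction (0.5·W + 0.5·I) to each layer's attention matrix, then accumulates the matrices by matrix product from the bottom layer up (rollout).
As a result, even at high layers a concentrated (sharp) importance pattern over the input tokens survives.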

(c) Attention flow

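Treats the attention graph as a flow network with edge capacity = attention, and computes max-flow.
Because multiple paths share capacity, the flow is spread (amortized) over a set of important tokens.
The resulting importance distribution is less sharp than rollout but more robust.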

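Figure 2: Limits of raw attention (especially [CLS])

Heatmap of the raw attention that [CLS] gives to the input tokens at each layer.
Some structure is visible through L1 and L2, but the maps flatten in upper layers, suggesting raw attention is unsuitable for token-level attribution.

Correctly predicted sentence (middle); incorrectly predicted sentence (right): attracted to the NNS (plural noun).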

Figure 3: Token maps from [CLS]: rollout vs. flow

Heatmaps of [CLS] → input token importance for three example sentences.
Rollout: concentrates on a few tokens.
Flow: highlights a group of meaningful tokens (more spread out).

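Table 1: Correlation with blank-out ablation (subject-verb agreement)

Blank-out: replace tokens one at a time with UNK and measure how much the probability of the correct class drops (a large drop means the token is important).
Metric: Spearman's ρ (attention-based importance vs. blank-out score).
Raw: high only at L1, then drops sharply.
Rollout/Flow: correlation increases with depth, reaching ≈0.70 at L4–L6.
→ As an approximation of token importance, both methods are clearly better than raw attention.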
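
To make the evaluation protocol concrete, here is a small sketch of blank-out scoring and Spearman's ρ. Everything model-related is hypothetical: `toy_predict_proba` is a stand-in for a trained classifier, and the token weights and `attention_scores` are made-up values used only to exercise the functions.

```python
import numpy as np

def blank_out_scores(tokens, predict_proba, target, unk="UNK"):
    """Drop in target-class probability when each token is replaced by UNK."""
    base = predict_proba(tokens)[target]
    drops = []
    for i in range(len(tokens)):
        blanked = tokens[:i] + [unk] + tokens[i + 1:]
        drops.append(base - predict_proba(blanked)[target])
    return np.array(drops)

def ranks(x):
    """1-based ranks; tied values share their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tie = x == value
        r[tie] = r[tie].mean()
    return r

def spearman_rho(a, b):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

def toy_predict_proba(tokens):
    # Hypothetical classifier: a logistic score over made-up token weights.
    weights = {"keys": 2.0, "cabinet": -1.0, "the": 0.1}
    score = sum(weights.get(t, 0.0) for t in tokens)
    p = 1.0 / (1.0 + np.exp(-score))
    return np.array([1.0 - p, p])

tokens = ["the", "keys", "to", "the", "cabinet"]
drops = blank_out_scores(tokens, toy_predict_proba, target=1)

# Made-up attention-based importance for the same five tokens.
attention_scores = np.array([0.10, 0.50, 0.05, 0.10, 0.25])
rho = spearman_rho(attention_scores, drops)
```

In a real comparison, `attention_scores` would come from raw attention, rollout, or flow at a given layer, and ρ would be averaged over many sentences.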

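Table 2: Correlation with input gradients (same model)

Input gradients: the magnitude of ∂logit/∂input with respect to the input embeddings (a larger gradient means a more important token).
Metric: Spearman's ρ (attention-based importance vs. input-gradient saliency).
Raw: still low or unstable in upper layers.
Rollout/Flow: consistent improvement, at about 0.54–0.61 from L3 onward.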
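
The input-gradient baseline can be illustrated without a deep learning framework by approximating ∂logit/∂input with central finite differences. This is a stand-in for the automatic differentiation a real setup would use, and reducing each token's gradient to its L2 norm is an assumption. `toy_logit` is a hypothetical differentiable model.

```python
import numpy as np

def input_gradient_saliency(logit_fn, embeddings, eps=1e-5):
    """Per-token L2 norm of d logit / d embedding, via central differences."""
    x = np.asarray(embeddings, dtype=float)
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        grad[idx] = (logit_fn(x + step) - logit_fn(x - step)) / (2 * eps)
    # One saliency value per token: the norm over the embedding dimension.
    return np.linalg.norm(grad, axis=-1)

# Hypothetical toy model: logit = sum_i u[i] * tanh(x[i] . v).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 3))  # 4 tokens, embedding size 3
v = rng.normal(size=3)
u = rng.normal(size=4)

def toy_logit(x):
    return float(np.tanh(x @ v) @ u)

saliency = input_gradient_saliency(toy_logit, embeddings)
```

The resulting `saliency` vector has one entry per token and can be rank-correlated with attention-based importance in the same way as the blank-out scores.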

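Figure 4: Case study with BERT masked LM (pronoun resolution)

Left bars: the model's predicted probabilities (his/her).
Right heatmaps: the importance that the embedding at the mask position assigns to the candidate referents (nouns).
Raw: the pattern varies erratically from layer to layer and becomes uniform toward the top, so the basis for the decision is hard to state clearly.
Example (a): rollout and flow are consistent with the model's prediction, while raw attention is erratic across layers.
Example (b): only flow agrees well with the model's prediction.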

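When to use which?

Rollout: when you want to pinpoint the most important tokens sharply (focused).
Flow: when you want the set of tokens the model consulted (spread out and more forgiving; also useful for analyzing misclassifications).
→ The two methods are complementary.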

Computational complexity

Rollout: cost scales with the number of layers d and the sequence length n.
Flow: heavier; the cost usually rises as n grows.

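Summary

Raw attention loses much of its explanatory power in higher layers because of layer-wise mixing.
Attention rollout (path products) and attention flow (max-flow) account for residual connections and multi-layer paths, and so approximate input-token importance better:
- Rollout = sharper (pinpoints key tokens),
- Flow = more robust (captures the set of important tokens).
Against both the ablation and gradient baselines, the two methods show consistently higher correlation.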

[2025-2] 정유림 - Quantifying Attention Flow in Transformers
