LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images.
1. Introduction
- Current investigations focus on unimodal text
- Multimodal conversational AI is trained on billions of image-text pairs from the public web
- Such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images
2. Methods
Biomedical Visual Instruction-Following Data
- Collect 15 million biomedical image-caption pairs from PubMed Central (PMC-15M)
- Use GPT-4 to generate multi-round instruction-following conversations from the image captions (a packaging sketch follows)
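A minimal sketch of how one figure-caption pair could be packaged into an instruction-following conversation record; the `make_instruction_sample` helper, the field names, and the file path are illustrative assumptions, not the paper's released schema.

```python
import json

# Hypothetical helper: wrap one PMC figure-caption pair into a
# multimodal instruction-following record (schema is illustrative only).
def make_instruction_sample(image_path: str, caption: str, question: str) -> dict:
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": caption},  # the caption serves as the reference answer
        ],
    }

sample = make_instruction_sample(
    image_path="pmc_figures/PMC123456_fig2.jpg",
    caption="Axial CT of the chest showing a right upper lobe nodule.",
    question="Describe the findings in this image.",
)
print(json.dumps(sample, indent=2))
```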
Curriculum Learning Methods
- Stage 1: Medical Concept Alignment
- Stage 2: Medical Instruction Tuning
- Fine-Tuning
(1) Stage 1: Medical Concept Alignment
- Both the visual encoder and LM weights are frozen; only the projection matrix is updated (sketched after this list)
- This expands the vocabulary of aligned image-text tokens to the biomedical domain
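A sketch of the stage-1 freezing scheme in PyTorch, assuming a toy model whose `vision_encoder`, `mm_projector`, and `language_model` submodules stand in for the real components; the module names and learning rate are placeholders, not the official training code.

```python
import torch
import torch.nn as nn

# Placeholder model with the three components named in the method description.
class ToyLLaVA(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)  # stands in for the CLIP/BiomedCLIP encoder
        self.mm_projector = nn.Linear(vis_dim, llm_dim)    # the projection matrix trained in stage 1
        self.language_model = nn.Linear(llm_dim, llm_dim)  # stands in for the LLM

model = ToyLLaVA()

# Stage 1: freeze the visual encoder and the LM; only the projection is trainable.
for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.language_model.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-3)  # lr is illustrative
```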
(2) Stage 2: Medical Instruction Tuning
- Keep the visual encoder weights frozen
- Update the pre-trained weights of both the projection layer and the LM (see the sketch below)
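Continuing the toy model from the stage-1 sketch, stage 2 keeps the encoder frozen and puts both the projection layer and the LM weights into the optimizer; again a sketch under the same placeholder names, not the released script.

```python
import torch

# Stage 2: the visual encoder stays frozen; projection layer and LM are both updated.
# `model` is the ToyLLaVA placeholder defined in the stage-1 sketch above.
for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.mm_projector.parameters():
    p.requires_grad = True
for p in model.language_model.parameters():
    p.requires_grad = True

stage2_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(stage2_params, lr=2e-5)  # lr is illustrative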
(3) Fine-tuning to Downstream Datasets
- For some specific biomedical scenarios, there is a need to develop highly accurate, dataset-specific models to improve the service quality of the assistant
- The model responds in free-form text for both closed-set and open-set questions (a prompt-formatting sketch follows)
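A hedged sketch of posing both question types as free-form text prompts; listing the candidate answers for closed-set questions is an assumption here, and `build_vqa_prompt` is a hypothetical helper.

```python
from typing import List, Optional

# Hypothetical prompt builder: both question types become free-form text,
# with closed-set questions listing their candidate answers.
def build_vqa_prompt(question: str, candidates: Optional[List[str]] = None) -> str:
    prompt = f"<image>\n{question}"
    if candidates:  # closed-set: constrain the answer to the listed options
        prompt += "\nAnswer with one of the following options: " + ", ".join(candidates)
    return prompt

print(build_vqa_prompt("Is there evidence of pneumothorax?", ["yes", "no"]))
print(build_vqa_prompt("What abnormality is seen in the left lung?"))
```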
3. Experiments
Performance as an open-ended biomedical visual chatbot
Performance on established benchmarks
- Achieves SoTA performance on VQA-RAD and PathVQA
- Visual encoder: BiomedCLIP outperforms the general-domain CLIP
- Instruction data: the 60K-IM dataset performs best
- Stage 1: training for 3 epochs performs better than 1 epoch
- Stage 2: performance improves as training is extended from 9 to 15 epochs
- Model size: the 13B model outperforms the 7B model in both overall zero-shot and fine-tuned performance