
[2025-1] 백승우 - LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

by BaekDaBang 2025. 3. 4.
 

Paper: LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day (arxiv.org)

1. Introduction

  • Current investigations into conversational generative AI for biomedicine focus on unimodal text.
  • Multimodal conversational AI has progressed rapidly by training on billions of image-text pairs from the public web.
  • Such general-domain models still lack sophistication in understanding and conversing about biomedical images.

2. Methods

Biomedical Visual Instruction-Following Data

  • Collect PMC-15M, 15 million medical image-caption pairs extracted from PubMed Central, and use language-only GPT-4 to generate multi-round instruction-following conversations from the captions (see the sketch below).
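Below is a minimal sketch of what such caption-to-instruction generation could look like. The prompt wording and the `call_llm` helper are illustrative assumptions standing in for the paper's GPT-4 prompting setup, not its actual code.

```python
import json

def build_prompt(caption: str, inline_mentions: str) -> str:
    """Compose the text-only context handed to the language model.

    The model never sees the image; it must produce a plausible multi-turn
    conversation about the figure from the caption (and, when available,
    the sentences in the paper that mention the figure).
    """
    return (
        "You are an AI assistant specialized in biomedical topics.\n"
        f"Figure caption: {caption}\n"
        f"Context sentences: {inline_mentions}\n"
        "Generate a multi-turn conversation between a user asking about the "
        "figure and an assistant answering, as a JSON list of "
        '{"from": "human"|"gpt", "value": ...} turns.'
    )

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for a GPT-4 API call; returns a canned reply
    # so the sketch runs end to end. Swap in a real client in practice.
    return json.dumps([
        {"from": "human", "value": "What does the figure show?"},
        {"from": "gpt", "value": "A chest X-ray with a right-sided pleural effusion."},
    ])

def caption_to_instructions(sample: dict) -> dict:
    """Turn one image-caption record into an instruction-following record."""
    prompt = build_prompt(sample["caption"], sample.get("mentions", ""))
    conversation = json.loads(call_llm(prompt))
    return {"image": sample["image"], "conversations": conversation}

# Example usage with a made-up sample record.
sample = {"image": "pmc_fig_001.jpg",
          "caption": "Chest X-ray showing right-sided pleural effusion.",
          "mentions": "Figure 1 demonstrates the effusion before drainage."}
print(caption_to_instructions(sample))
```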

 

Curriculum Learning Methods

  • Stage 1: Medical Concept Alignment
  • Stage 2: Medical Instruction Tuning
  • Fine-Tuning to Downstream Datasets

(1) Stage 1: Medical Concept Alignment

  • Both the visual encoder and the LM weights are frozen; only the projection matrix is updated.
  • This expands the vocabulary of aligned image-text concepts to the biomedical domain (a PyTorch sketch follows this list).
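A minimal PyTorch sketch of this stage, assuming a LLaVA-style architecture (vision encoder + linear projection + causal LM). The class and dimension names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class ProjectionOnlyAlignment(nn.Module):
    """Wraps a frozen vision tower and frozen LM; only the projection trains."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, lm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        # The trainable image-to-text projection (a single linear layer here).
        self.projection = nn.Linear(vision_dim, lm_dim)

        # Stage 1: freeze both the vision tower and the language model.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False

    def trainable_parameters(self):
        return [p for p in self.parameters() if p.requires_grad]

# Only the projection parameters would be handed to the optimizer, e.g.:
# optimizer = torch.optim.AdamW(model.trainable_parameters(), lr=2e-3)
```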

(2) Stage 2: Medical Instruction Tuning

  • The visual encoder weights remain frozen.
  • Both the pre-trained projection layer and the LM weights are updated (sketched below).
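Continuing the same illustrative module from the stage-1 sketch, stage 2 only changes which parameters receive gradients:

```python
def prepare_stage2(model: ProjectionOnlyAlignment):
    """Return the parameters to optimize for stage-2 instruction tuning."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False   # vision tower stays frozen
    for p in model.projection.parameters():
        p.requires_grad = True    # keep updating the projection layer
    for p in model.language_model.parameters():
        p.requires_grad = True    # now also update the LM weights
    return [p for p in model.parameters() if p.requires_grad]
```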

(3) Fine-tuning to Downstream Datasets

  • For some specific biomedical scenarios, highly accurate, dataset-specific models are needed to improve the service quality of the assistant.
  • The model responds in free-form text to both closed-set and open-set questions (a prompt-formatting sketch follows).
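A hedged sketch of how such VQA prompts might be assembled. The exact template is an assumption rather than the paper's verbatim format; appending the candidate answers for closed-set questions is one common way to let a free-form text model answer multiple-choice-style questions.

```python
from typing import List, Optional

def build_vqa_prompt(question: str, candidates: Optional[List[str]] = None) -> str:
    """Build a text prompt for an image-question pair.

    If `candidates` is given (closed-set question), the allowed answers are
    listed in the prompt; otherwise the model answers freely (open-set).
    """
    prompt = f"<image>\n{question}"
    if candidates:
        prompt += "\nAnswer with one of the following options: " + ", ".join(candidates)
    return prompt

# Open-set: the model answers freely.
print(build_vqa_prompt("What abnormality is visible in the chest X-ray?"))

# Closed-set: the model still replies in free-form text, but the answer is
# expected to match one of the listed candidates.
print(build_vqa_prompt("Is there pleural effusion?", ["yes", "no"]))
```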

3. Experiments

Performance as an Open-Ended Biomedical Visual Chatbot

Performance on Established Benchmarks

  • Achieves state-of-the-art performance on VQA-RAD and PathVQA.
  • The BiomedCLIP visual encoder outperforms the general-domain CLIP encoder.

Ablation Studies

  • Instruction data: the 60K-IM instruction set performs best.
  • Stage 1: 3 epochs of training perform better than 1 epoch.
  • Stage 2: performance improves when training is extended from 9 to 15 epochs.
  • Model size: the 13B model outperforms the 7B model in both overall zero-shot and fine-tuned performance.