LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images.
1. Introduction
- Current investigations focus on unimodal text
- Multimodal conversational AI is trained on billions of image-text pairs from the public web
- Such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images
2. Methods
Biomedical Visual Instruction-Following Data
- Collect 15 million biomedical image-caption pairs from PubMed Central (PMC-15M)
- Use GPT-4 to generate multi-round instruction-following conversations from the image captions (a packaging sketch follows)
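A minimal sketch of how one figure-caption pair could be packaged into an instruction-following conversation record; the `make_instruction_sample` helper, the field names, and the file path are illustrative assumptions, not the paper's released schema.

```python
import json

# Hypothetical helper: wrap one PMC figure-caption pair into a
# multimodal instruction-following record (schema is illustrative only).
def make_instruction_sample(image_path: str, caption: str, question: str) -> dict:
    return {
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": caption},  # the caption serves as the reference answer
        ],
    }

sample = make_instruction_sample(
    image_path="pmc_figures/PMC123456_fig2.jpg",
    caption="Axial CT of the chest showing a right upper lobe nodule.",
    question="Describe the findings in this image.",
)
print(json.dumps(sample, indent=2))
```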
Curriculum Learning Methods
- Stage 1: Medical Concept Alignment
- Stage 2: Medical Instruction Tuning
- Fine-Tuning
(1) Stage 1: Medical Concept Alignment
- Both the visual encoder and LM weights are frozen; only the projection matrix is updated (sketched after this list)
- This expands the vocabulary of aligned image-text tokens to the biomedical domain
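A sketch of the stage-1 freezing scheme in PyTorch, assuming a toy model whose `vision_encoder`, `mm_projector`, and `language_model` submodules stand in for the real components; the module names and learning rate are placeholders, not the official training code.

```python
import torch
import torch.nn as nn

# Placeholder model with the three components named in the method description.
class ToyLLaVA(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)  # stands in for the CLIP/BiomedCLIP encoder
        self.mm_projector = nn.Linear(vis_dim, llm_dim)    # the projection matrix trained in stage 1
        self.language_model = nn.Linear(llm_dim, llm_dim)  # stands in for the LLM

model = ToyLLaVA()

# Stage 1: freeze the visual encoder and the LM; only the projection is trainable.
for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.language_model.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-3)  # lr is illustrative
```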
(2) Stage 2: Medical Instruction Tuning
- Keep the visual encoder weights frozen
- Update the pre-trained weights of both the projection layer and the LM (see the sketch below)
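Continuing the toy model from the stage-1 sketch, stage 2 keeps the encoder frozen and puts both the projection layer and the LM weights into the optimizer; again a sketch under the same placeholder names, not the released script.

```python
import torch

# Stage 2: the visual encoder stays frozen; projection layer and LM are both updated.
# `model` is the ToyLLaVA placeholder defined in the stage-1 sketch above.
for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.mm_projector.parameters():
    p.requires_grad = True
for p in model.language_model.parameters():
    p.requires_grad = True

stage2_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(stage2_params, lr=2e-5)  # lr is illustrative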
(3) Fine-tuning to Downstream Datasets
- For some specific biomedical scenarios, there is a need to develop highly accurate, dataset-specific models to improve the service quality of the assistant
- The model responds in free-form text for both closed-set and open-set questions (a prompt-formatting sketch follows)
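A hedged sketch of posing both question types as free-form text prompts; listing the candidate answers for closed-set questions is an assumption here, and `build_vqa_prompt` is a hypothetical helper.

```python
from typing import List, Optional

# Hypothetical prompt builder: both question types become free-form text,
# with closed-set questions listing their candidate answers.
def build_vqa_prompt(question: str, candidates: Optional[List[str]] = None) -> str:
    prompt = f"<image>\n{question}"
    if candidates:  # closed-set: constrain the answer to the listed options
        prompt += "\nAnswer with one of the following options: " + ", ".join(candidates)
    return prompt

print(build_vqa_prompt("Is there evidence of pneumothorax?", ["yes", "no"]))
print(build_vqa_prompt("What abnormality is seen in the left lung?"))
```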
3. Experiments
Performance as an open-ended biomedical visual chatbot
Performance on established benchmarks
- Achieves SoTA performance on VQA-RAD and PathVQA
- Visual encoder: BiomedCLIP outperforms the general-domain CLIP
- Instruction data: the 60K-IM dataset performs best
- Stage 1: training for 3 epochs performs better than 1 epoch
- Stage 2: performance improves as training is extended from 9 to 15 epochs
- Model size: the 13B model outperforms the 7B model in both overall zero-shot and fine-tuned performance