
We provide a Playground for experiencing the world beyond the desk, and we present a new lifestyle for moving from passive learning to a life of creation.

Multi Modal (5)

[2025-1] 정인아 - CoCa: Contrastive Captioners are Image-Text Foundation Models. https://arxiv.org/abs/2205.01917 Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an.. (arxiv.org) Intro. Problem: Captioning and Contrastive Learnin.. 2025. 1. 25.

[2024-1] 백승우 - VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. We present a framework for learning multimodal representations from unlabeled data using convolution-free Transformer architectures. Specifically, our Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations t.. (arxiv.org) 1. Abstract: VATT uses a Transformer architecture, with no labels.. 2024. 3. 4.

[2023-2] 백승우 - 🦩 Flamingo: a Visual Language Model for Few-Shot Learning. Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propo.. (arxiv.org) 0. Abstract: Flamingo's key architectural advances: (1) bridging powerful pretrained vision-only and language-only models, (2) .. 2024. 2. 23.

[2023-2] 양소정 - KOSMOS-G: Generating Images in Context with Multimodal Large Language Models. Figure 1: KOSMOS-G treats input images as a “foreign language” and can both understand generalized vision-language inputs that contain multiple images and generate images. Abstract: Text-to-image (T2I) and vision-language-to-image (VL2I) generation have advanced considerably in recent years. Generation from generalized vision-language inputs that contain multiple images, however, remains unexplored. This paper proposes KOSMOS-G, a model that addresses this challenge by leveraging the advanced perception capabilities of Multimodal Large Language Models (MLLMs). As for KOSMOS-G, zero-shot multi-entity subject-driven generation (zero-shot m.. 2024. 2. 19.

