  • We provide a Playground for experiencing the world beyond the desk, and propose a new lifestyle for the shift from passive learning to a life of creation.

Multi-Modal (25)

[2026-1] 정재훈 - CoCa: Contrastive Captioners are Image-Text Foundation Models https://arxiv.org/abs/2205.01917v2 1. Introduction: Foundation language models pretrained on web-scale data, such as BERT, T5, and GPT-3, have recently risen to prominence by demonstrating broad multitask ability through zero-shot, few-shot, and transfer learning. Compared with separate models specialized for each task, a model pretrained for many downstream tasks amortizes its training cost, suggesting a path past current limits toward human-level model intelligence. For vision-language problems, several foundation-model candidates have been explored. 1. Single-encoder: prior work pretrained with a cross-entropy loss .. 2026. 2. 21.
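The excerpt breaks off while listing prior model families; CoCa's own recipe, per the linked paper, trains a single image-text model with a contrastive loss plus an autoregressive captioning loss. Below is a minimal PyTorch sketch of that combined objective, assuming in-batch paired embeddings; the function name and the weighting values are illustrative, not taken from the post or the paper's code.

```python
import torch
import torch.nn.functional as F

def coca_loss(image_emb, text_emb, caption_logits, caption_targets,
              temperature=0.07, lambda_con=1.0, lambda_cap=2.0):
    """Sketch of CoCa's combined objective: symmetric contrastive loss over
    paired image/text embeddings plus a captioning cross-entropy loss.
    The lambda weights and temperature here are illustrative values."""
    # Contrastive part: in-batch InfoNCE, matching image i to text i.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_con = (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2
    # Captioning part: token-level cross-entropy on decoder outputs.
    # caption_logits: (B, T, vocab), caption_targets: (B, T)
    loss_cap = F.cross_entropy(
        caption_logits.flatten(0, 1), caption_targets.flatten())
    return lambda_con * loss_con + lambda_cap * loss_cap
```

The two terms pull the encoders toward CLIP-style alignment while the decoder retains generative captioning ability, which is what lets one pretrained model serve both retrieval-style and generation-style downstream tasks.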
[2026-1] 장인영 - BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Paper link: https://arxiv.org/abs/2301.12597 The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from .. 2026. 2. 21.
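As the snippet notes, BLIP-2's efficiency comes from keeping both the image encoder and the LLM frozen and training only a lightweight bridging module, the Q-Former. A toy sketch of that bridging idea follows, with generic dimensions; the class name, parameter values, and single-attention-layer structure are hypothetical simplifications, not BLIP-2's actual implementation.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Toy stand-in for BLIP-2's Q-Former: a small set of learned query
    tokens cross-attends to frozen image features, and the result is
    projected into the language model's embedding space. Dimensions are
    illustrative; the real Q-Former is a multi-layer transformer."""
    def __init__(self, num_queries=32, dim=768, llm_dim=2560, num_heads=12):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)  # maps queries into LLM token space

    def forward(self, image_feats):
        # image_feats: (B, N_patches, dim), produced by a frozen image encoder
        q = self.queries.expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return self.proj(out)  # (B, num_queries, llm_dim): soft prompts for the frozen LLM
```

Only the queries, attention, and projection are trainable, which is the source of the cost savings relative to end-to-end pretraining.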
[2026-1] 백승우 - The Evolution of Human-Like Computer-Using Agents: From Perception to Command. UFO: A UI-Focused Agent for Windows OS Interaction. We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision. UFO employs a dual-agent framework to meticulously observe and analyze the graphical user interface .. UFO2: The Desktop AgentOS. Recent Computer-Using Agents (CUAs), powered by multimod.. 2026. 1. 21.
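UFO's dual-agent framework pairs a host agent that selects the target application with an app agent that iterates observe, decide, act inside it, both driven by a vision-capable model reading screenshots. A hypothetical control loop in that spirit is sketched below; every class, method, and hard-coded decision here is a placeholder, not UFO's real API.

```python
from dataclasses import dataclass

# Hypothetical sketch of a dual-agent computer-using-agent loop in the
# spirit of UFO: a host agent routes the request to an application, then
# an app agent repeats observe -> decide -> act until it reports done.

@dataclass
class Action:
    control: str    # e.g., a button or menu item identified in the screenshot
    operation: str  # e.g., "click", "type"
    done: bool = False

class HostAgent:
    def select_app(self, request: str) -> str:
        return "Notepad"  # placeholder: a VLM would pick from open windows

class AppAgent:
    def observe(self, app: str) -> bytes:
        return b"<screenshot>"  # placeholder: capture the app's current UI

    def decide(self, request: str, screenshot: bytes) -> Action:
        return Action("File > Save", "click", done=True)  # placeholder VLM call

def run(request: str, max_steps: int = 10) -> None:
    host, app_agent = HostAgent(), AppAgent()
    app = host.select_app(request)
    for _ in range(max_steps):
        action = app_agent.decide(request, app_agent.observe(app))
        print(f"[{app}] {action.operation} on {action.control}")
        if action.done:
            break

run("Save the current document")
```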
[2025-2] 백승우 - MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents. To enhance the efficiency of GUI agents on various platforms like smartphones and computers, a hybrid paradigm that combines flexible GUI operations with efficient shortcuts (e.g., API, deep links) is emerging as a promising direction. However, a framework.. 2025. 12. 24.
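The hybrid paradigm MAS-Bench evaluates mixes step-by-step GUI operations with one-shot shortcuts such as deep links or API calls. A hypothetical dispatcher illustrating that trade-off is sketched below; the shortcut table, task names, and action strings are invented for this sketch, not part of the benchmark.

```python
# Hypothetical sketch of the hybrid GUI + shortcut paradigm that
# MAS-Bench benchmarks: try a registered shortcut (deep link / API)
# first, and fall back to step-by-step GUI operations otherwise.

SHORTCUTS = {
    "open_settings_wifi": "app://settings/wifi",  # illustrative deep link
}

def gui_fallback(task: str) -> list[str]:
    # A real agent would plan taps/swipes from screenshots; these are stubs.
    return ["tap('Settings')", "tap('Network')", f"tap('{task}')"]

def execute(task: str) -> list[str]:
    if task in SHORTCUTS:
        return [f"invoke_deeplink('{SHORTCUTS[task]}')"]  # one efficient step
    return gui_fallback(task)                             # flexible but slower

print(execute("open_settings_wifi"))  # shortcut path: a single action
print(execute("enable_bluetooth"))    # GUI fallback path: multiple actions
```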