Multi-Modal (15)

[2025-2] 백승우 - Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset...
arxiv.org
2025. 8. 20.

[2025-2] 백승우 - UI-TARS: Pioneering Automated GUI Interaction with Native Agents
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial...
arxiv.org
2025. 7. 30.

[2025-1] 백승우 - GUI Agent by Script-based Automation
2025. 7. 4.

[2025-1] 박제우 - Scaling Language-Image Pre-training via Masking
https://arxiv.org/abs/2212.00794
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs give...
arxiv.org
https://blog.outta.ai/284
This paper: the previously reviewed natural-language supervised learning mo..
2025. 5. 17.