Multi-Modal (17)

[2025-2] 백승우 - UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action
Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich… (arxiv.org)
2025. 10. 29.

[2025-2] 박제우 - AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection
https://arxiv.org/abs/2310.18961
Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is a crucial task when training data is not accessible due to various concerns, e.g., data priva… (arxiv.org)
0. Abstract: Zero-shot anomaly detection (ZS…
2025. 9. 27.

[2025-2] 백승우 - Scalable Video-to-Dataset Generation for Cross-Platform Mobile Agents
Recent advancements in Large Language Models (LLMs) and Vision-Language Models (VLMs) have sparked significant interest in developing GUI visual agents. We introduce MONDAY (Mobile OS Navigation Task Dataset for Agents from YouTube), a large-scale dataset… (arxiv.org)
2025. 8. 20.

[2025-2] 백승우 - UI-TARS: Pioneering Automated GUI Interaction with Native Agents
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial… (arxiv.org)
2025. 7. 30.