본문 바로가기
  • 책상 밖 세상을 경험할 수 있는 Playground를 제공하고, 수동적 학습에서 창조의 삶으로의 전환을 위한 새로운 라이프 스타일을 제시합니다.

Multi-Modal22

[2025-1] 백승우 - GUI Agent by Script-based Automation 2025. 7. 4.
[2025-1] 박지원-You Said That?: Synthesising Talking Faces from Audio 원문) https://arxiv.org/abs/1705.02966 You said that?We present a method for generating a video of a talking face. The method takes as inputs: (i) still images of the target face, and (ii) an audio speech segment; and outputs a video of the target face lip synched with the audio. The method runs in real timearxiv.org 1. INTRODUCTION i) 개요 및 핵심 아이디어: 대상 얼굴의 이미지와 오디오 음성 segment를 input -> 얼굴이 오디오에 맞.. 2025. 5. 20.
[2025-1]박제우 - Scaling Language-Image Pre-training via Masking https://arxiv.org/abs/2212.00794 Scaling Language-Image Pre-training via MaskingWe present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs givearxiv.org https://blog.outta.ai/284 본 논문은 지난번 리뷰했던 자연어 지도 학습 모.. 2025. 5. 17.
[2025-1] 박제우 - CLIP : Learning Transferable Visual Models From Natural Language Supervision https://arxiv.org/abs/2103.00020 Learning Transferable Visual Models From Natural Language SupervisionState-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text .. 2025. 5. 6.