
Demystifying the Visual Quality Paradox in Multimodal Large Language Models
This research explores a **"visual-quality paradox"** in Multimodal Large Language Models (MLLMs): **higher human-perceived image quality does not always lead to better MLLM performance**, and degraded images can even improve results on complex reasoning tasks. The study attributes this to **degradations potentially sharpening MLLM attention on semantically relevant features**, as evidenced by relative-attention and logit-lens analyses. Furthermore, **conventional image restoration methods often fail to improve MLLM performance** because they optimize for human-centric visual aesthetics rather than the features MLLMs actually rely on. To address this, the authors propose **Visual-Quality Test-Time Tuning (VQ-TTT)**, a lightweight adaptation module that dynamically modulates input image quality and fine-tunes shallow vision-encoder layers to align with each MLLM's task-specific preferences. VQ-TTT delivers **consistent performance gains with minimal computational overhead**, suggesting that MLLMs need adaptive, model-aligned image processing rather than universally "clean" inputs.
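To make the attention claim concrete, here is a minimal, hypothetical sketch of how one might compare the share of attention a vision encoder places on semantically relevant patches for a clean versus a degraded image. It assumes an off-the-shelf CLIP-style encoder from Hugging Face `transformers` as a stand-in for the MLLM's vision tower; the file name, the Gaussian-blur degradation, and the hand-picked "relevant" patch mask are placeholders, not the paper's actual protocol.

```python
# Hypothetical sketch: compare CLS->patch attention mass on a "relevant" patch
# mask for a clean vs. a degraded image, using a CLIP-style vision encoder
# as a stand-in for the MLLM's vision tower.
import torch
from PIL import Image, ImageFilter
from transformers import CLIPVisionModel, CLIPImageProcessor

model_id = "openai/clip-vit-base-patch32"          # stand-in encoder, not the paper's MLLM
encoder = CLIPVisionModel.from_pretrained(model_id)
processor = CLIPImageProcessor.from_pretrained(model_id)

def relevant_attention_mass(image: Image.Image, relevant_patches: torch.Tensor) -> float:
    """Fraction of last-layer CLS->patch attention falling on a given patch mask."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**inputs, output_attentions=True)
    attn = out.attentions[-1].mean(dim=1)[0]       # average over heads: (seq, seq)
    cls_to_patches = attn[0, 1:]                   # CLS token attending to patch tokens
    return (cls_to_patches * relevant_patches).sum().item() / cls_to_patches.sum().item()

image = Image.open("example.jpg").convert("RGB")              # placeholder image path
degraded = image.filter(ImageFilter.GaussianBlur(radius=3))   # one illustrative degradation
num_patches = (224 // 32) ** 2                                # 49 patches for ViT-B/32 at 224px
mask = torch.zeros(num_patches); mask[20:30] = 1.0            # placeholder "relevant" region

print("clean   :", relevant_attention_mass(image, mask))
print("degraded:", relevant_attention_mass(degraded, mask))
```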
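And here is an equally hedged sketch of what a VQ-TTT-style setup could look like: a small learnable module that modulates input image quality, plus low-rank adapters on shallow vision-encoder layers, tuned for a few steps at test time. The modulator design, the LoRA placement, the `mllm(images=..., input_ids=...)` interface, and the entropy-minimization objective are all assumptions for illustration; the paper's actual module and tuning signal may differ.

```python
# Hypothetical sketch of a VQ-TTT-style setup: a learnable input-quality
# modulator plus low-rank adapters on shallow vision-encoder layers,
# adapted at test time. Everything below is illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityModulator(nn.Module):
    """Learnable blend between the original image and a blurred version (assumed form)."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.blur = nn.Conv2d(channels, channels, 5, padding=2, groups=channels, bias=False)
        nn.init.constant_(self.blur.weight, 1.0 / 25)   # start as a mild box blur

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * x + (1 - self.scale) * self.blur(x)

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update, for shallow layers only."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.a = nn.Parameter(torch.zeros(rank, base.in_features))   # zero init -> no change at start
        self.b = nn.Parameter(torch.randn(base.out_features, rank) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.a.t() @ self.b.t()

def test_time_tune(mllm, images, question_ids, modulator, steps: int = 5, lr: float = 1e-3):
    """Illustrative tuning loop: minimize answer-token entropy on the test input itself."""
    params = list(modulator.parameters()) + \
             [p for p in mllm.parameters() if p.requires_grad]   # only adapters stay trainable
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        # assumed MLLM interface returning next-token logits of shape (batch, seq, vocab)
        logits = mllm(images=modulator(images), input_ids=question_ids)
        probs = F.softmax(logits[:, -1], dim=-1)
        loss = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return modulator
```

Keeping only the modulator and the low-rank adapters trainable is what would keep the overhead small, consistent with the lightweight, minimal-cost adaptation described above.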
Information
- Program
- Frequency: Updated weekly
- Published: August 30, 2025, 3:50 AM UTC
- Length: 17 minutes
- Rating: All ages