Visionwhich
Visionwhich is a proposed framework in computer vision and human–computer interaction for resolving referential ambiguity in visual scenes: when several candidate objects could match a request, it interprets explicit "which" prompts (e.g., "which cup do you mean?") together with contextual cues to identify the intended target. It describes a family of models that aim to map user intent, expressed as a natural language question or prompt, to a specific object or region within an image or video.
The term has appeared in early-2020s academic discussions and demonstration materials as part of broader work on referring expression comprehension and language-grounded interactive vision.
Techniques associated with Visionwhich typically fuse visual encoders with language models, sometimes incorporating dialog history or other contextual signals, to ground a referring expression in a specific object or region.
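For illustration, this kind of encoder fusion can be sketched with an off-the-shelf vision–language model: each candidate region is cropped and scored against the user's prompt, and the highest-scoring candidate is taken as the intended referent. The Python sketch below assumes a CLIP backbone from the Hugging Face transformers library and a detector-supplied list of candidate boxes; the model choice and the pick_target helper are illustrative and not drawn from any published Visionwhich implementation.

    # Minimal sketch of prompt-based candidate scoring for referential
    # disambiguation. Candidate boxes would normally come from an object
    # detector; here they are assumed inputs. The CLIP backbone and the
    # pick_target helper are illustrative assumptions.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    MODEL_ID = "openai/clip-vit-base-patch32"  # assumed backbone
    model = CLIPModel.from_pretrained(MODEL_ID)
    processor = CLIPProcessor.from_pretrained(MODEL_ID)

    def pick_target(image: Image.Image,
                    boxes: list[tuple[int, int, int, int]],
                    prompt: str) -> int:
        """Return the index of the candidate box best matching the prompt."""
        crops = [image.crop(box) for box in boxes]  # one crop per candidate
        inputs = processor(text=[prompt], images=crops,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            # logits_per_image has shape (num_crops, 1): one image-text
            # similarity score per candidate crop
            logits = model(**inputs).logits_per_image
        return int(logits.squeeze(1).argmax())

    # Example usage: resolve "which mug?" among three detected candidates.
    # image = Image.open("scene.jpg")
    # idx = pick_target(image,
    #                   [(10, 40, 120, 160), (140, 35, 250, 170),
    #                    (260, 50, 360, 180)],
    #                   "the red mug on the left")

A full system in this vein would add dialog history or gaze cues as extra conditioning; the crop-and-score step shown here is only the simplest grounding baseline.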
Applications span collaborative robotics, accessibility tools that assist users in cluttered environments, augmented reality interfaces, and other settings in which a system must disambiguate among multiple visual candidates.
See also: visual question answering, referring expression comprehension, multimodal AI, interactive vision systems.