Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Qirui Wang1, Jingyi He1, Yining Pan2, Xulei Yang2, Shijie Li2
1TUM, 2A*STAR

Abstract

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AffordMem, a framework that grounds 3D functional affordances by remembering at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AffordMem requires no model fine-tuning and no target-scene annotation, relying instead on a reusable memory bank built from source scenes. On SceneFun3D, our method improves over the prior training-free state of the art by +3.23 AP50 on Split 0 and +3.7 AP50 on Split 1. Ablation studies show the two memories are complementary: cross-scene memory improves fine-grained localization, while in-scene memory provides the larger gain on spatially qualified queries.
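
The sketch below illustrates the two memory levels described in the abstract; it is not the paper's released implementation. All names (MemoryEntry, AffordanceMemoryBank, SceneGraph, recall, serialize) and the cosine-similarity retrieval are assumptions made for the example.

```python
# Illustrative sketch only: data structures and names are hypothetical,
# not AffordMem's actual API.
from dataclasses import dataclass, field
from math import sqrt

@dataclass
class MemoryEntry:
    category: str            # e.g. "handle"
    overlay_image: str       # RGB view with the affordance region rendered as an overlay
    embedding: list          # visual embedding used for retrieval

class AffordanceMemoryBank:
    """Cross-scene memory: category-level exemplars built once from source scenes."""
    def __init__(self) -> None:
        self._entries: dict = {}

    def add(self, entry: MemoryEntry) -> None:
        self._entries.setdefault(entry.category, []).append(entry)

    def recall(self, category: str, query: list, k: int = 3) -> list:
        # Rank stored exemplars by similarity to the query view and return the
        # top-k; these serve as visual in-context examples for the frozen VLM.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0
        candidates = self._entries.get(category, [])
        return sorted(candidates, key=lambda e: cosine(e.embedding, query), reverse=True)[:k]

@dataclass
class SceneGraph:
    """In-scene spatial memory: candidate instances and their 3D relations."""
    nodes: dict = field(default_factory=dict)   # id -> {"category": ..., "centroid": (x, y, z)}
    edges: list = field(default_factory=list)   # (id_a, relation, id_b), e.g. ("h2", "above", "h3")

    def serialize(self) -> str:
        # Text form handed to the language model so it can resolve spatial
        # references such as "the second handle from the top" over all
        # candidates, including ones outside the current view.
        lines = [f"{i}: {n['category']} at {n['centroid']}" for i, n in self.nodes.items()]
        lines += [f"{a} {rel} {b}" for a, rel, b in self.edges]
        return "\n".join(lines)
```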