Created: April 4, 2024

Tags: Analysis-by-Synthesis, Neuroscience

Link: https://arxiv.org/abs/2301.03711

Status: Reading

Why

People can see shape even when the surface cues that normally support 3D shape perception (e.g., bounding contours, shading and texture gradients) are missing. Current CNN-based models underperform on such atypical recognition tasks. The authors propose an analysis-by-synthesis model, or equivalently, inference in a generative model of image formation. The model integrates intuitive physics to explain how shape can be inferred from the deformation it causes in other objects, as in cloth draping. Experiments show that this approach best matches human observers in both accuracy and response times, and is the only model tested that correlates significantly with human performance on difficult discriminations.

Top-down: analysis-by-synthesis models
Bottom-up: deep CNNs

Marr levels

Computational level: how humans perceive 3D shape, even in the absence of traditional surface cues
Algorithmic level: top-down (analysis-by-synthesis) vs. bottom-up (CNNs)
Implementation level:

3D shape perception cues

Several cues help humans and machines recover shape: edges, bounding contours, gradients of shading or texture, stereo disparity, and motion parallax. Yet even when the surface is obscured, so the object is never seen directly, humans can sometimes still perceive its shape.

Bottom-up approach

Instead of relying on interpretable, hand-designed cues, CNNs learn features in a black-box fashion. That knowledge can be transferred to other vision tasks via fine-tuning (transfer learning), so it is conceivable that CNNs could generalize to even more extreme image transformations, such as cloth occlusions. A sketch of this baseline follows below.
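A minimal sketch of what such a bottom-up baseline could look like, assuming a standard PyTorch fine-tuning setup; the backbone, class count, frozen layers, and learning rate here are all illustrative assumptions, not the paper's actual configuration:

```python
# Hypothetical bottom-up baseline: fine-tune an ImageNet-pretrained CNN
# on cloth-occluded images. All specifics are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet features (transfer learning).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Replace the classification head for the shape-recognition task.
num_shape_classes = 10  # hypothetical number of candidate target shapes
backbone.fc = nn.Linear(backbone.fc.in_features, num_shape_classes)

# Freeze early layers so only high-level features adapt to cloth occlusion.
for name, param in backbone.named_parameters():
    if not name.startswith(("layer4", "fc")):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-4
)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch of cloth-occluded renderings."""
    optimizer.zero_grad()
    loss = criterion(backbone(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```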

Top-down approach

Analysis-by-synthesis means inference in a physics-based generative model of how scenes form and give rise to images. Such models do not rely on a fixed, universal set of image cues. Instead, they infer 3D shape through a top-down process, based on an internal model of how images are formed and of the role that shape plays in that model. Cloth draping is just one instance of the essentially unlimited variation in scene physics.

Because that variation is unlimited, it is arguably impossible to learn universal bottom-up image cues (CNN features) that stay reliable and robust. So instead of learning features, the idea is to model the individual, scene-level causes of the image; by inverting this generative process, the original physical scene can be recovered, as in the sketch below.
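A toy sketch of that inversion loop, with stand-ins for each stage: smoothing plays the role of cloth physics, and a Gaussian pixel-noise likelihood plays the role of rendering and comparison. These simplifications are my assumptions for illustration, not the paper's simulator:

```python
# Toy analysis-by-synthesis: synthesize a draped image for each shape
# hypothesis, then keep the hypothesis that best explains the observation.
import numpy as np
from scipy.ndimage import gaussian_filter

def drape(shape_map, cloth_stiffness=2.0):
    """Stand-in physics: draped cloth smooths the underlying geometry."""
    return gaussian_filter(shape_map, sigma=cloth_stiffness)

def log_likelihood(rendered, observed, noise=0.1):
    """Gaussian pixel-noise likelihood of the observed draped image."""
    return -np.sum((rendered - observed) ** 2) / (2 * noise ** 2)

def infer_shape(observed, candidate_shapes):
    """Invert the generative model by scoring each hypothesis's synthesis."""
    scores = [log_likelihood(drape(s), observed) for s in candidate_shapes]
    return int(np.argmax(scores))

# Usage: two candidate objects; the observation is a draped, noisy box.
rng = np.random.default_rng(0)
box = np.zeros((32, 32))
box[8:24, 8:24] = 1.0                                   # blocky object
yy, xx = np.indices((32, 32))
bump = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / 50)  # rounded object
observed = drape(box) + rng.normal(0, 0.05, size=(32, 32))
print(infer_shape(observed, [box, bump]))               # -> 0 (the box)
```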

Result

Unlike people and the top-down model presented here, CNNs and pixel-based observers perform no better than chance on the harder instances, and show little improvement even after extensive specialized training.

Method

Task: match-to-sample. This task was chosen because the authors are primarily interested in how generative models can support online, detailed 3D perception, rather than object categorization or any memory-based process.

Assumption: humans perceive 3D shape by approximately simulating in their minds how the cloth drapes over the object in three dimensions, and imagining what the resulting 2D image would look like.

Experiments:
Occluded condition: show a cloth-draped target shape together with the unoccluded target object and an unoccluded distractor object (negative class). Which of the unoccluded objects is the same as the occluded object above? The distractor was drawn either from a different category than the target (easier) or from the same category (harder).
Unoccluded condition: show the target shape without cloth, but with viewpoint and shape variability. Which of these objects is the same as the one in the image above? A toy model observer for one such trial is sketched below.
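A minimal sketch of one match-to-sample trial for a baseline model observer: embed the sample and both choices, then respond with the more similar choice. The pixel-flattening embedding is an assumption standing in for the pixel-based observer; swapping it for CNN features would give the CNN observer:

```python
# Baseline model observer for a single match-to-sample trial.
import numpy as np

def embed(image):
    """Pixel-based observer: the embedding is the normalized flat image.
    Replacing this with CNN features would yield the CNN observer."""
    v = image.ravel().astype(float)
    return v / (np.linalg.norm(v) + 1e-8)

def match_to_sample(sample, choice_a, choice_b):
    """Return 0 or 1 for whichever choice is more similar to the sample
    (cosine similarity between embeddings)."""
    sims = [embed(sample) @ embed(c) for c in (choice_a, choice_b)]
    return int(np.argmax(sims))
```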

Model