Description
Understanding how an observer decides where to look is central to both human-vision research and machine perception. We tackle this question by training a neural network that produces artificial scan paths while performing tasks such as classification, visual search, and counting. At each fixation, the network receives a log-polar, foveated view of the image that retains high-resolution detail at the point of gaze and compresses the periphery, mirroring the falloff in retinal acuity. A controller proposes the next fixation, and a task solver integrates the resulting glimpses to answer the query. The controller can be trained with standard back-propagation or, in an alternative configuration, entirely with reinforcement learning (RL).
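To make the glimpse mechanism concrete, here is a minimal sketch of a log-polar foveated sampler. It is written in PyTorch as an assumption (the talk does not name a framework), and the function name, output size, and radius parameters are illustrative rather than the speaker's actual code.

```python
import torch
import torch.nn.functional as F

def log_polar_glimpse(img, fixation, out_hw=(32, 64), r_min=0.02, r_max=1.0):
    """Sample a log-polar glimpse centred on `fixation` (illustrative sketch).

    img:      (N, C, H, W) batch of images.
    fixation: (N, 2) gaze points in normalized [-1, 1] (x, y) coordinates.
    Returns   (N, C, out_hw[0], out_hw[1]) glimpses: rows index radius
              (log-spaced, so the fovea is over-sampled), columns index angle.
    """
    n_r, n_theta = out_hw
    device = img.device
    # Log-spaced radii: dense near the fixation, sparse in the periphery.
    u = torch.linspace(0.0, 1.0, n_r, device=device)
    radii = r_min * (r_max / r_min) ** u                       # (n_r,)
    # Evenly spaced angles; drop the last point so 0 and 2*pi are not duplicated.
    theta = torch.linspace(0.0, 2 * torch.pi, n_theta + 1, device=device)[:-1]
    # Offsets of every sample point from the fixation, in normalized coords.
    dx = radii[:, None] * torch.cos(theta)[None, :]            # (n_r, n_theta)
    dy = radii[:, None] * torch.sin(theta)[None, :]
    grid = torch.stack([dx, dy], dim=-1)                       # (n_r, n_theta, 2)
    grid = grid[None] + fixation[:, None, None, :]             # (N, n_r, n_theta, 2)
    # grid_sample reads the last dim as (x, y) in [-1, 1]; peripheral samples
    # that fall outside the image are zero-padded.
    return F.grid_sample(img, grid, align_corners=False, padding_mode="zeros")
```

A controller network would emit the next `fixation`, and the task solver would integrate the resulting glimpses across time steps.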
On an MNIST-derived benchmark that places digits on a 224 × 224 canvas with distractor strokes, the model matches full-vision accuracy (99%) when the centre of the target digit is supplied as an oracle fixation. The same accuracy is maintained when the controller learns its own fixation sequence, and RL offers a small additional gain. Under similar perceptual constraints, the model achieves 54% top-1 accuracy on ImageNet. Qualitative inspection shows that the learned fixations cluster around semantically informative regions. Ongoing experiments add an explicit novelty reward to study how curiosity incentives reshape exploration behaviour and downstream performance.
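Because choosing the next fixation is a non-differentiable decision in the RL configuration, a score-function estimator such as REINFORCE is one natural way to train the controller. The sketch below assumes that setup; the novelty bonus (mean pairwise distance between a trajectory's own fixations) is a hypothetical stand-in for the curiosity reward under study, and every name and parameter here is illustrative.

```python
import torch

def reinforce_loss(log_probs, task_reward, fixations, novelty_weight=0.0):
    """REINFORCE loss for a fixation sequence (illustrative sketch).

    log_probs:   (T, N) log-probability of each chosen fixation.
    task_reward: (N,) terminal reward, e.g. 1.0 for a correct answer.
    fixations:   (T, N, 2) visited gaze points, used for the novelty bonus.
    """
    # Optional curiosity term: reward trajectories whose fixations spread out,
    # here measured as the mean pairwise distance between their own fixations.
    novelty = torch.zeros_like(task_reward)
    if novelty_weight > 0:
        dists = torch.cdist(fixations.transpose(0, 1),   # (N, T, T)
                            fixations.transpose(0, 1))
        novelty = dists.mean(dim=(1, 2))                 # (N,)
    returns = task_reward + novelty_weight * novelty     # (N,)
    baseline = returns.mean()                            # variance reduction
    # Policy gradient: raise the log-probability of above-baseline episodes.
    return -((returns - baseline).detach() * log_probs.sum(dim=0)).mean()
```

With `novelty_weight = 0` this reduces to plain task-reward REINFORCE with a batch-mean baseline, matching the configuration trained on the task reward alone.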