For years computer programmers have been trying to design algorithms that even remotely approach the ability of young infants, in their first few months of life, to rapidly learn to recognize complex objects and events in their visual input, particularly hand movements and gaze direction. Even the most powerful probabilistic learning models, as well as connectionist and dynamical models, do not by themselves automatically learn about hands, detect them, attend to what they are doing, or use them to make inferences and predictions. Ullman et al. develop a model that incorporates a plausible innate or early-acquired bias, based on cognitive and perceptual findings, to detect “mover” events. This bias leads to the automatic acquisition of increasingly complex concepts and capabilities, which do not emerge without domain-specific biases. After exposure to video sequences of people performing everyday actions, and without supervision, the model develops the capacity to locate hands in complex configurations, both by their appearance and by surrounding context, and to detect direction of gaze. Here is their abstract:
Early in development, infants learn to solve visual problems that are highly challenging for current computational methods. We present a model that deals with two fundamental problems in which the gap between computational difficulty and infant learning is particularly striking: learning to recognize hands and learning to recognize gaze direction. The model is shown a stream of natural videos and learns without any supervision to detect human hands by appearance and by context, as well as direction of gaze, in complex natural scenes. The algorithm is guided by an empirically motivated innate mechanism—the detection of “mover” events in dynamic images, which are the events of a moving image region causing a stationary region to move or change after contact. Mover events provide an internal teaching signal, which is shown to be more effective than alternative cues and sufficient for the efficient acquisition of hand and gaze representations. The implications go beyond the specific tasks, by showing how domain-specific “proto concepts” can guide the system to acquire meaningful concepts, which are significant to the observer but statistically inconspicuous in the sensory input.
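To make the abstract's definition of a mover event more concrete, here is a toy sketch in Python. It is not the authors' implementation, and every name in it (detect_mover_events, motion_mask, neighbor_moving, the thresholds, the tiny grayscale grids) is an assumption made for illustration. It treats each grid entry as one image cell and flags a cell when three things hold: motion arrives from a neighboring cell while the cell is stationary, the motion then leaves toward a neighbor, and the cell's appearance afterward differs from what it was before the motion arrived. Simple frame differencing stands in for real motion estimation.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # toy grayscale grid; each entry stands for one image cell


def motion_mask(prev: Frame, cur: Frame, thresh: int) -> List[List[bool]]:
    """Mark cells whose value changed by more than `thresh` between two frames."""
    return [[abs(c - p) > thresh for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, cur)]


def neighbor_moving(mask: List[List[bool]], r: int, c: int) -> bool:
    """True if any of the 8 neighbors of cell (r, c) is marked as moving."""
    rows, cols = len(mask), len(mask[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc]:
                return True
    return False


def detect_mover_events(frames: List[Frame],
                        motion_thresh: int = 1,
                        change_thresh: int = 1) -> List[Tuple[int, int, int]]:
    """Return (frame_index, row, col) for each cell flagged as a mover event.

    Assumed toy criterion, not the paper's exact procedure.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    prev_mask = [[False] * cols for _ in range(rows)]  # nothing moves at t = 0
    # Appearance of a cell just before motion flowed into it (None = no pending motion).
    baseline: List[List[Optional[int]]] = [[None] * cols for _ in range(rows)]
    events = []
    for t in range(1, len(frames)):
        mask = motion_mask(frames[t - 1], frames[t], motion_thresh)
        for r in range(rows):
            for c in range(cols):
                started = mask[r][c] and not prev_mask[r][c]
                stopped = prev_mask[r][c] and not mask[r][c]
                if started:
                    # Incoming motion: a neighbor was already moving one step earlier.
                    incoming = neighbor_moving(prev_mask, r, c)
                    baseline[r][c] = frames[t - 1][r][c] if incoming else None
                elif stopped:
                    before = baseline[r][c]
                    baseline[r][c] = None
                    # Outgoing motion: a neighbor is still moving as this cell settles.
                    if (before is not None
                            and neighbor_moving(mask, r, c)
                            and abs(frames[t][r][c] - before) > change_thresh):
                        events.append((t, r, c))
        prev_mask = mask
    return events


# One-row scene: a "hand" (9) pushes an "object" (5) one cell right, then retreats.
frames = [
    [[9, 0, 5, 0, 0]],
    [[0, 9, 5, 0, 0]],
    [[0, 0, 9, 5, 0]],
    [[0, 9, 0, 5, 0]],
    [[9, 0, 0, 5, 0]],
    [[9, 0, 0, 5, 0]],
]
events = detect_mover_events(frames)
print(events)  # [(4, 0, 2)]
```

In this toy scene the only flagged cell is the one the object used to occupy: it was stationary, the hand moved in, the hand moved out, and the cell looks different afterward. The cell where the hand merely comes to rest is not flagged, because no motion flows out of it. In the model described in the abstract, image regions associated with such events serve as the internal teaching signal for learning hand appearance; this sketch stops at detection.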