What were we all looking at? Identifying objects of collective visual attention


We aim to identify the salient objects in an image by applying a model of visual attention. We automate the process by predicting the objects in an image that are most likely to be the focus of someone’s visual attention. Concretely, we first generate fixation maps from eye tracking data; these express the ground truth of people’s visual attention for each training image. We then extract high-level features based on the bag-of-visual-words image representation and use them as input attributes, together with the fixation maps, to train a support vector regression model. With this model, we can predict the saliency of a new query image. Our experiments show that the model provides a good estimate of human visual attention on test image sets containing either one salient object or multiple salient objects. In this way, we seek to reduce the redundant information within the scene and thus provide a more accurate depiction of the scene.
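
To illustrate the first step described in the abstract, the following is a minimal sketch of how a fixation map might be built from eye tracking data. It is not the authors' implementation: the function name `fixation_map`, the Gaussian spread `sigma`, the grid size and the peak normalisation are all assumptions made for illustration, and the fixation coordinates are invented.

```python
import math


def fixation_map(fixations, width, height, sigma=2.0):
    """Build a fixation map on a height x width grid.

    Each fixation (x, y) contributes an isotropic Gaussian; the summed map
    is scaled so that its peak equals 1. The kernel shape, sigma and the
    normalisation are illustrative assumptions, not details from the paper.
    """
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in grid)
    if peak > 0:
        grid = [[v / peak for v in row] for row in grid]
    return grid


# Hypothetical fixations pooled over several viewers: two land close
# together on one object, one lands on a second object.
demo = fixation_map([(3, 3), (4, 3), (10, 8)], width=16, height=12)
```

In a pipeline like the one the abstract outlines, values from such a map would serve as regression targets, paired with bag-of-visual-words features of the corresponding image regions. How regions and targets are paired is not specified in this record.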


The file attached to this record is the authors' final peer-reviewed version. The publisher's final version can be found by following the DOI link below.


visual attention, bag of visual words, eye tracking, support vector regression


Ma, Z., Vickers, S., Istance, H., Ackland, S., Zhao, X. and Wang, W. (2015) What were we all looking at? Identifying objects of collective visual attention. Journal of Experimental & Theoretical Artificial Intelligence, pp.1-14.

