Gaze based quality assessment
of visual media understanding
Jean Martinet∗ , Adel Lablack∗ , Stanislas Lew∗ , Chabane Djeraba∗
∗LIFL – University of Lille, France
Abstract— Visual media is one of the most widely used forms of communication in our societies. With the increasing demand for digital image and video technologies in applications such as communication, advertising, or entertainment, there is a growing need for assessment tools to evaluate the quality of visual media understanding. It is necessary to quantify the adequacy between the audience's perception of a visual media and the original message or idea that the media creator intended to transmit.
The aim of our work is to build a framework for measuring the
quality of a visual media, that is to say its ability to transmit the
original idea of the creator, and possibly to give recommendations
to the creator about how to better design the media.
Based on the recorded gaze data of people viewing the media, we design qualitative indicators that help assess the perception of the media by a target audience.
I. INTRODUCTION
With the increasing demand for digital image and video technologies in applications such as communication, advertising, or entertainment, there is a growing need for assessment tools to evaluate the quality of visual media understanding. It is necessary to quantify the adequacy of the audience's understanding of the visual media to the original message or idea that the media creator intended to transmit.
Indeed, when advertising agencies and film makers produce a visual media – an image or a movie – the authors carefully choose their subject, scene, and settings with the objective of transmitting a precise message to the viewer. How can one be sure that the intended message is correctly received by the audience? And how can one verify that the most important items displayed in the material are well perceived by the audience? Because of the physiology of the human eye and the human visual system, only a restricted area of the scene, the one falling on the fovea region, can be perceived at a time.
For applications and products that target human consumers,
it is desirable to have metrics that will predict the perceived
visual quality as measured with human subjects. Quality assessment of visual media understanding aims at quantifying
the quality of visual media understanding by the audience,
including still pictures and image sequences, by means of
quality metrics.
Providing such an evaluation tool is crucial for controlling
the audience perception of the media in existing and emerging
multimedia systems. Such tools are especially important in
constrained environments, for instance when the media is an
advertisement (still image or video) meant to be viewed in
a passing place. It also has the potential to impact next-generation systems by providing objective metrics to be used
during the design and testing stages, thereby reducing the need
for extensive evaluation with human subjects. With such a
tool, media producers could potentially save time by designing
suitable media, following the recommendations of the system.
The tool could for instance state that a given item is not correctly seen in a given shot, and would be better seen if placed at a given location at a given moment.
This paper presents a first step towards gaze-based quality assessment of visual media understanding. This step consists in recording and clustering gaze points from persons viewing the visual media, and then elaborating several estimators for analyzing these data. The promising results of ongoing experiments are presented.
We briefly review in the following section some related research about gaze analysis and its applications. Then we describe in Section III our modeling of the problem in terms of basic quality descriptors that are combined into a global estimator. Section IV describes experiments that we carried out to demonstrate the usefulness of the proposed approach.
II. RELATED WORK
With the recent development of low-cost gaze tracker devices, the possibility of taking advantage of the information conveyed in gaze has opened many research directions: in image compression, where users' gaze is used to set variable compression ratios at different places in an image; in marketing, for detecting products of interest for customers; in civil security, for detecting drowsiness or lack of concentration of persons operating machinery such as motor vehicles or air traffic control systems; and in human-computer interaction. In the latter for instance, the user's gaze is used as a complementary input device to traditional ones such as the mouse and keyboard, namely for disabled users.
A. Gaze analysis
The analysis of gaze has been studied for over a century in
several disciplines, including physiology, psychology, psychoanalysis, and cognitive sciences. The objective is to analyze
eye saccades and fixations of persons watching a given scene,
in order to extract several kinds of information. During visual perception, human eyes move and successively fixate on the most informative parts of the image [1].
Attention is the cognitive process of selectively concentrating
on one aspect of the environment while ignoring other things.
For images and video, visual attention is at the core of visual perception, because it drives the gaze to salient points in the scene.
B. Applications
The main applications of gaze analysis in Computer Science include the localization of regions of interest for compression purposes. For instance, Osberger and Maeder [2] have defined a way to identify perceptually important regions in an image based on human visual attention and eye movement characteristics. In a similar way, Itti and Koch [3] have developed a visual attention system based on the early primate vision system for scene analysis. Later, Stentiford [4] applied visual attention to similarity matching.
Besides, some works are oriented towards estimating the visual quality of a media in terms of low-level metrics [5], [6], e.g. estimating the distortion after reconstructing a media encoded with lossy compression. The low-level criteria used are objective and evaluate the quality of the media itself at a local level.
Our work aims at evaluating the subjective, higher-level process of viewing a media. This process, which is driven by visual attention, is influenced both by low-level aspects (color, texture, shape, orientation) and by the topology of objects in the scene (and their interpretation, which involves cognition).
III. BUILDING AUDIENCE TRACK PATHS FOR ANALYSIS
In order to detect and track users' gaze, it is necessary to employ a gaze tracking device which is able to determine the fixation point of a user on a screen from the position of their eyes. Non-intrusive gaze tracking systems usually rely on a static camera capturing the face of the user and detecting the direction of their gaze with respect to a known position. A gaze tracker system is usually composed of an infra-red light source directed towards the user's eyes, a static infra-red camera, a display device, and software providing an interface between them.
A. From raw data...
Gaze trackers provide the horizontal and vertical coordinates
of the point of regard relative to the display device. Thus, one
can obtain a sequence of points corresponding to the sampled
positions of the eye direction. These points pi correspond to triplets of the form (xi, yi, ti) and reflect the scanpath for a given user (see Figure 2). The scanpath consists of a sequence of eye fixations (the gaze is kept still at a location, yielding regions with an important density of points), separated by eye saccades (fast movements, yielding large spaces with only a few isolated points).
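As an illustration, the grouping of raw (x, y, t) samples into fixations can be sketched with a simple velocity threshold, one widespread identification technique; the function name and threshold values below are illustrative, not those used in the paper:

```python
import numpy as np

def detect_fixations(points, velocity_threshold=50.0, min_duration=0.1):
    """Split a scanpath into fixations using a velocity threshold.

    points: (x, y, t) samples for one user, sorted by timestamp t.
    velocity_threshold: maximum point-to-point speed (units/s) inside a fixation.
    min_duration: fixations shorter than this (in seconds) are discarded.
    Returns a list of arrays, one array of samples per fixation.
    """
    points = np.asarray(points, dtype=float)
    dx = np.diff(points[:, 0])
    dy = np.diff(points[:, 1])
    dt = np.diff(points[:, 2])
    speed = np.hypot(dx, dy) / np.maximum(dt, 1e-9)

    fixations, current = [], [points[0]]
    for i, v in enumerate(speed):
        if v < velocity_threshold:          # still within the same fixation
            current.append(points[i + 1])
        else:                               # saccade: close the current group
            current = np.array(current)
            if current[-1, 2] - current[0, 2] >= min_duration:
                fixations.append(current)
            current = [points[i + 1]]
    current = np.array(current)             # close the last group
    if current[-1, 2] - current[0, 2] >= min_duration:
        fixations.append(current)
    return fixations
```

The minimal-duration test implements the second threshold mentioned below, which eliminates insignificant groups.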
The points in the set Pu for a user u participating in our experiment are to be categorized into two classes: fixations and saccades.

B. ...to identified fixations

An essential part of scanpath analysis and processing is the identification of fixations and saccades. Indeed, fixations and saccades are often used as the basic source for the various metrics used for interpreting eye movements [7], [8] (number of fixations, number of saccades, duration of the first fixation, average amplitude of saccades, etc.). The most widespread identification technique is to compute the velocity of each point (defined as the angular speed of the eye in degrees per second). The velocity of a point corresponds to the distance that separates it from its predecessor or successor. Separation of points into fixations and saccades is achieved by using a velocity threshold. Consecutive points pi and pi+1 separated by a distance under the threshold are grouped into what is considered to be an eye fixation fj. All fj form the set Fu of fixations for the user u, and F denotes the set of all fixations for all users. Another threshold involving a minimal duration of a fixation allows eliminating insignificant groups.

C. Multi-user data merging

Once the fixations are identified for all users and grouped into the set F, a clustering process allows reducing the spatial characteristics of fixations into a limited set of clusters Ki. The clustering is to be achieved by unsupervised techniques, such as K-means, because the spatial distribution of the fixation points is unknown, and their number is finite. The obtained clusters define the locations where most of the audience has looked.

D. Working hypothesis

Together with practitioners and human scientists, we have set a working hypothesis stating that a message in a visual media is likely to be best understood when most people in the audience are able to see directly the main target objects at given times. The target objects are to be selected by the media producer.

This working hypothesis has inspired the definition of the following indicators, to help measure some metrics about the scanpaths.

E. Fixation distribution indicators

Based on the above-defined set of clusters Ki, we can now define dedicated indicators to estimate some features about the distribution of the fixation points. We have identified the following indicators for the fixation point analysis:

• The average total number of fixations: the average total number of fixations n̄ for all participants is defined by:

    n̄ = (1/||U||) Σ_{u∈U} ||Fu||    (1)

where U denotes the set of all participants (hence its cardinal ||U|| is the number of participants), and ||Fu|| is the number of fixation points for the participant u.

• The average duration of fixations: the average duration of fixations ∆̄ is obtained with the following formula:

    ∆̄ = (1/||F||) Σ_{f∈F} ∆(f)    (2)

where ∆(f) is the duration of the fixation f, obtained by subtracting the timestamp of the first sample point in f from the timestamp of the last sample point in f.
• The maximum duration of fixations: the maximum duration of fixations ∆max is defined to be:

    ∆max = max_{f∈F} ∆(f)    (3)

• The average duration of the first fixation: the average duration of the first fixation ∆̄1 is defined as:

    ∆̄1 = (1/||U||) Σ_{u∈U} ∆(f1u)    (4)

where f1u is the first fixation for the participant u.

• The average scanpath length: an example of scanpath is given in Figure 1. The scanpath length Lu for a given participant is the sum of the lengths of all segments between consecutive fixation points:

    Lu = Σ_{i=1..||Fu||−1} dist(fi, fi+1)

Then the average scanpath length L̄ is:

    L̄ = (1/||U||) Σ_{u∈U} Lu    (5)

Fig. 1. Definition of the scanpath length.

• The average scanpath duration (for still images): the average scanpath duration D̄ represents the average time spent by participants to explore the displayed media:

    D̄ = (1/||U||) Σ_{u∈U} Du    (6)

where Du is the scanpath duration for participant u, obtained by subtracting the timestamp of the first sample point in Fu from the timestamp of the last sample point in Fu.

• The average scanpath convex area: the scanpath convex area Su for a participant is given by the surface of the convex hull of the fixation points:

    Su = convexHullSurface(Fu)

Then the average scanpath convex area S̄ is:

    S̄ = (1/||U||) Σ_{u∈U} Su    (7)

• The average number of regressions: Figure 2 shows examples of regressions. The number of regressions Ru for a participant is:

    Ru = ||{ fi | angle(fi−1, fi, fi+1) < 90° }||

Then the average number of regressions R̄ is:

    R̄ = (1/||U||) Σ_{u∈U} Ru    (8)

Fig. 2. A regression corresponds to the case where three consecutive fixation points form an angle less than 90°.

• Gini coefficient: the Gini coefficient is a measure of statistical dispersion widely used in Economics to estimate the inequality of wealth or income distribution. In our application, after tessellating the displayed media into rectangular patches, we use the following definition of the Gini coefficient:

    G = Σ_{i,j} (φi,j / ||F||)²    (9)

where ||F|| is the total number of fixation points, and φi,j ∈ [0, ||F||] is the number of fixation points in the patch (i, j). The value of the Gini coefficient, which belongs to [0, 1], quantifies the degree of concentration (or dispersion) of the set of fixation points in the image. A value close to 1 indicates that the points are strongly concentrated in a few patches; a value close to 0 indicates a strong dispersion of the points.
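As a sketch, several of these indicators (the averages of Eqs. 1 and 5, and the concentration measure of Eq. 9) can be computed directly from the recorded fixation points; the function names and the default patch grid size are illustrative assumptions:

```python
import numpy as np

def average_fixation_count(fixations_per_user):
    """n̄ = (1/||U||) Σ_u ||Fu||  (Eq. 1); input maps each user to its fixations."""
    return sum(len(F) for F in fixations_per_user.values()) / len(fixations_per_user)

def scanpath_length(fixation_points):
    """Lu: sum of segment lengths between consecutive fixation points (Eq. 5)."""
    p = np.asarray(fixation_points, dtype=float)
    return float(np.sum(np.hypot(np.diff(p[:, 0]), np.diff(p[:, 1]))))

def gini_concentration(fixation_points, width, height, nx=4, ny=4):
    """G = Σ_{i,j} (φij / ||F||)²  over an nx-by-ny tessellation (Eq. 9).

    With this formula, a value near 1 means the points fall in few patches.
    """
    p = np.asarray(fixation_points, dtype=float)
    ix = np.clip((p[:, 0] / width * nx).astype(int), 0, nx - 1)
    iy = np.clip((p[:, 1] / height * ny).astype(int), 0, ny - 1)
    counts = np.zeros((ny, nx))
    np.add.at(counts, (iy, ix), 1)      # φij: fixation count per patch
    shares = counts / len(p)
    return float(np.sum(shares ** 2))
```

The remaining averages (Eqs. 2–4, 6–8) follow the same pattern of a per-user quantity averaged over ||U||.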
We believe that the above-listed indicators are useful for estimating metrics about the scanpaths across the participants, although the list is not exhaustive. We describe in the following section the experimental setting for measuring gaze data from participants, and show some examples to illustrate our approach.
IV. EXPERIMENTS
We have used a single-camera gaze tracker setting, based on the Pupil-Centre/Corneal-Reflection (PCCR) method to determine the gaze direction. The video camera is located below the computer screen and monitors the subject's eyes. No attachment to the head is required, but the head still needs to be static. A small low-power infrared light-emitting diode (LED) is embedded in the infrared camera and directed towards the eye. The LED generates the corneal reflection and causes the bright pupil effect, which enhances the camera's image of the pupil. The centers of both the pupil and the corneal reflection are identified and located, and trigonometric calculations allow projecting the gaze point onto the image.
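The projection step is device-specific; as a hedged sketch, a common practical alternative to an explicit trigonometric model is to fit a polynomial mapping from pupil-corneal-reflection vectors to screen coordinates during a calibration phase. The function names and the second-order basis below are assumptions, not the tracker's actual implementation:

```python
import numpy as np

def fit_pccr_mapping(pg_vectors, screen_points):
    """Fit a 2nd-order polynomial mapping from pupil-glint vectors (vx, vy)
    to screen coordinates, from calibration samples, by least squares."""
    v = np.asarray(pg_vectors, dtype=float)
    # Design matrix: [1, vx, vy, vx*vy, vx^2, vy^2]
    A = np.column_stack([np.ones(len(v)), v[:, 0], v[:, 1],
                         v[:, 0] * v[:, 1], v[:, 0] ** 2, v[:, 1] ** 2])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(screen_points, dtype=float),
                                 rcond=None)
    return coeffs  # shape (6, 2): one coefficient column per screen axis

def project_gaze(coeffs, vx, vy):
    """Map one pupil-glint vector to an on-screen point of regard."""
    basis = np.array([1.0, vx, vy, vx * vy, vx ** 2, vy ** 2])
    return basis @ coeffs
```

A calibration run with a handful of on-screen targets provides the (pg_vector, screen_point) pairs needed by the fit.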
Given this experimental setting, we have recorded the gaze of 10 persons participating in our tests. Participants were shown images successively and were asked to watch the presented scenes attentively. All gaze information has been recorded and processed according to the description given in Section III. We show in the remainder of this section some results of our approach for still images and for movies.
A. Still images

Figure 3 shows three images from our database, representing some advertising posters for specific social or scientific events. For the first two posters, we highlight the recorded scanpath. The scanpath, which is superimposed on the pictures, is composed of several points of regard, each being linked to the previous one to represent the path.

On the third poster, we have superimposed the result of clustering the fixation points, which defines a heat map of the most seen locations in the poster, giving an indication of whether the main information is seen or not. This information should be matched with the requirements of the media producer, e.g. to check if the list of sponsors is well seen.

Figure 4 shows another example, which is a movie poster (the movie is entitled Into the Blue, directed by John Stockwell in 2005). The left image shows the recorded scanpath, and the right image shows the clustered points as disks (the size of a disk represents its relative importance in the poster, according to cluster features such as the number of fixation points in the cluster and their variance from the centroid). This example demonstrates that the movie title is hardly seen by the target audience.

Fig. 4. Another example for a movie poster, with the scanpath displayed on the left, and clustered fixation points displayed on the right.

The last example for static images, shown in Figure 5, illustrates how our approach can help advertising designers to evaluate the impact of their advertisements during sport events, and how to best dispose them. These two examples show some keyframes taken from broadcast sport events (soccer and tennis). For this specific application, if we want to focus on the specific advertisement areas in the media, while disregarding other irrelevant parts, the model could benefit from a supplementary estimator counting the hits of the fixation points in advertisement areas:

• The average number of relevant hits: the number of relevant hits Hur in a specific image/frame region r for a participant is given by:

    Hur = ||{ fi | fi ∈ r }||

Then the average number of relevant hits H̄r is:

    H̄r = (1/||U||) Σ_{u∈U} Hur    (10)

Fig. 5. Illustration of our approach to evaluate the impact of advertisement campaigns during sport events.
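Eq. 10 translates directly into code; the rectangular encoding of the region r below is an assumption for illustration (the paper does not specify how regions are described):

```python
def relevant_hits(fixation_points, region):
    """Hur: number of fixation points falling inside region r.

    region = (x_min, y_min, x_max, y_max), e.g. an advertisement banner.
    """
    x0, y0, x1, y1 = region
    return sum(1 for (x, y) in fixation_points
               if x0 <= x <= x1 and y0 <= y <= y1)

def average_relevant_hits(fixations_per_user, region):
    """H̄r = (1/||U||) Σ_u Hur  (Eq. 10)."""
    return sum(relevant_hits(F, region) for F in fixations_per_user.values()) \
        / len(fixations_per_user)
```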
B. Movies
We now apply our approach to movies. The material that we
have used is a 250 advertising movie of Coca Cola1 . Figure 6
shows a sampled sequence of keyframes from the movie, with
the superimposed fixation tracks from all 10 participants.
Based on the previously defined indicators, we have estimated a dispersion value for each frame in the video, denoting how spread out the fixations from all participants are. The higher this dispersion value, the more undefined the focus of attention; the lower it is, the more precise the focus. We show in Figure 7 a plot of the evolution of the dispersion value against time for this short movie. Under the graph are highlighted the time intervals where the dispersion value is below a threshold, denoting moments of concentrated focus in the video. Sampled keyframes extracted from the video and corresponding to such moments are also displayed.
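The text does not fix the exact dispersion formula; one plausible choice, shown here as a sketch, is the root-mean-square distance of all participants' fixation points to their centroid, together with the thresholding used to recover the highlighted intervals:

```python
import numpy as np

def frame_dispersion(points):
    """Dispersion of all participants' fixation points for one frame:
    RMS distance to the centroid (one possible choice, not the paper's)."""
    p = np.asarray(points, dtype=float)
    centroid = p.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum((p - centroid) ** 2, axis=1))))

def focused_intervals(dispersions, threshold):
    """Return (start, end) frame-index intervals where dispersion < threshold,
    i.e. the moments of concentrated focus highlighted under the plot."""
    intervals, start = [], None
    for i, d in enumerate(dispersions):
        if d < threshold and start is None:
            start = i                       # interval of good focus opens
        elif d >= threshold and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:                   # interval still open at the end
        intervals.append((start, len(dispersions) - 1))
    return intervals
```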
The dispersion value reveals the structure of the movie. It is composed of a fast sequence of short shots, with a few longer shots in between, in which the faces of the characters can be clearly seen. The intervals of low dispersion (i.e. good focus) correspond to the longer shots, and the intervals of high dispersion (i.e. uncertain focus) correspond to chains of short-length shots, with corresponding hard cuts in which the last frame of a shot is replaced with (followed by) the first frame of the next shot, with no transition. In such situations, the focus of attention shifts from the last salient location in the first shot towards the first salient location in the second shot. Due to individual physiological variations, the duration of the gaze shift is not the same for all participants, which explains the high values of dispersion.

¹This movie has been submitted to the workshop as an attached document. It is also available from http://www.lifl.fr/∼martinej/martinet09gazebasedWCVIM.MP4. Please refer to the original movie to experience the dynamics of the recorded fixation points.

Fig. 3. Example of three images from our database, representing some advertising posters for specific social or scientific events.

Fig. 6. Sampled sequence of keyframes from the movie, with superimposed fixation tracks from 10 participants.
We can also notice that the last shot corresponds to a static
display of the advertised product alone, with a red color on
a white background. In this situation, the dispersion value is
very low.
V. DISCUSSION

An alternative solution to evaluate how people discover a visual media is to use saliency maps [3] to simulate the human vision process. Saliency maps have been widely used in the computer vision field to solve the attention selection problem. The purpose of a saliency map is to represent the conspicuity (as a scalar value) of each location of the image. According to such a map, it is possible to simulate the successive fixation points of a user for the image by selecting salient locations. Hence another direction is to compare the result of merging all fixation points from all participants with the result obtained with saliency maps.

A. Highlight

In our last example, we highlight the fact that a simple analysis of the results for the test movie reveals the structure of the movie, and also allows formulating the following recommendation for advertising movie designers: the focus of attention is better preserved for most people in the audience when salient locations coincide between two consecutive shots. Indeed, when the salient locations differ between two consecutive shots, the gaze is naturally shifted, with varying durations for different people.

B. Validation of our approach

To the best of our knowledge, no prior work has been done for the purpose of evaluating the quality of a visual media from gaze points recorded from the audience. As a consequence, it is difficult to compare our findings with other approaches. However, we think that the satisfaction of media producers implementing our approach would be a first qualitative validation.
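Such a comparison of merged fixation points with a saliency map could proceed, for instance, by gridding both into maps of the same size and correlating them. This sketch assumes a model saliency map is already available (it does not implement the Itti-Koch model itself), and the function names are illustrative:

```python
import numpy as np

def fixation_density_map(points, width, height, nx=16, ny=16):
    """Histogram of merged fixation points over a grid, normalized to sum 1."""
    p = np.asarray(points, dtype=float)
    ix = np.clip((p[:, 0] / width * nx).astype(int), 0, nx - 1)
    iy = np.clip((p[:, 1] / height * ny).astype(int), 0, ny - 1)
    m = np.zeros((ny, nx))
    np.add.at(m, (iy, ix), 1)
    return m / m.sum()

def map_correlation(density, saliency):
    """Pearson correlation between the measured fixation density and a model
    saliency map downsampled to the same grid size."""
    a = density.ravel()
    b = np.asarray(saliency, dtype=float).ravel()
    return float(np.corrcoef(a, b / b.sum())[0, 1])
```

A high correlation would indicate that the audience's actual fixations agree with the model's predicted salient locations.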
VI. CONCLUSION
We have presented a framework for measuring the quality
of a visual media, that is to say its ability to transmit the
original idea of the creator. This framework is based on
the natural human gaze of people watching static images or
dynamic scenes. It includes qualitative indicators of the nature
of the collected scanpaths, to evaluate how well the message
is received by the audience.
A. Towards a global quality estimator
The work presented here is part of ongoing work aimed at defining useful tools for visual media designers. As stated in Section III, we have carried out this work based on the hypothesis from practitioners and human scientists that a message in a visual media is likely to be best understood when most people in the audience are able to see directly the main target objects at given times.

Fig. 7. Evolution of the dispersion value against time for the short movie. Under the graph are highlighted the time intervals where the dispersion value is below a threshold, denoting moments of concentrated focus in the video. Sampled keyframes corresponding to such moments extracted from the video are displayed.
We have defined a number of indicators for estimating some
metrics related to the way the audience perceives the visual
media. An important part of this work will consist in providing
an explicit matching between our indicators and cognitive
interpretations in terms of quality of media understanding. For
this purpose, it is necessary to test the indicators individually, and to further define a sound way to combine them into a global quality estimator.
B. Large scale settings
An active research area in computer vision aims at developing gaze tracking software operating with a simple webcam. From an accurate face detection, it is possible to determine the position of the eyes, and therefore to find the iris. The eye detection can be based on (among other methods) template matching, appearance classification, or feature detection. In the template matching methods, a generic eye model is first created based on the eye shape, and a template matching process is then used to search for the eyes in the image.
The appearance-based methods detect eyes based on their appearance, using a classifier trained on a large number of image patches representing the eyes of several users under different orientations and illumination conditions. The feature detection methods explore the visual characteristics of the eyes (such as edges, the intensity of the iris, or color distributions) to identify some distinctive features around the eyes.
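As an illustration of the template-matching family of methods, a minimal normalized cross-correlation matcher can be sketched as follows; a real system would use an optimized implementation and a bank of eye templates, so names and the brute-force search are illustrative only:

```python
import numpy as np

def match_template_ncc(image, template):
    """Locate the best match of an eye template in a grayscale image using
    normalized cross-correlation; returns the (row, col) of the top-left
    corner of the best-matching window."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()          # zero-mean template
    t_norm = np.sqrt((t ** 2).sum())
    best, best_pos = -np.inf, (0, 0)
    for r in range(ih - th + 1):            # brute-force sliding window
        for c in range(iw - tw + 1):
            w = image[r:r + th, c:c + tw]
            wz = w - w.mean()
            denom = np.sqrt((wz ** 2).sum()) * t_norm
            if denom == 0:                  # flat window: no correlation defined
                continue
            score = float((wz * t).sum() / denom)
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos
```

The score is invariant to brightness and contrast changes of the window, which is what makes this family usable across illumination conditions.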
The availability of such techniques is relevant to our research activity in the sense that it will enable the deployment of the presented approach at a large scale, using different types of display devices (computer screen, TV, shopping mall LCD displays, etc.). We believe that in the near future, webcam-based trackers will permit such large-scale experiments to further validate our approach.
Providing such an evaluation tool is crucial for controlling
the audience perception of the media in existing and emerging
multimedia systems. Hence the possibility to automatically
gather and analyze naturally obtained feedback from the audience is a promising proposition.
VII. ACKNOWLEDGMENTS
This work has been supported by the French National Research Agency (ANR) through the project ANAFIX (2006-RIAM-026-01).
REFERENCES
[1] A. L. Yarbus. Eye Movements and Vision. Plenum Press, New York, 1967.
[2] W. Osberger and A. J. Maeder. Automatic identification of perceptually
important regions in an image using a model of the human visual system.
In International Conference on Pattern Recognition, Brisbane, Australia,
1998.
[3] Laurent Itti, Christof Koch, and Ernst Niebur. A model of saliency-based
visual attention for rapid scene analysis. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 20(11):1254–1259, 1998.
[4] F. Stentiford. An attention based similarity measure with application to
content based information retrieval, 2003.
[5] Stefan Winkler. Visual quality assessment using a contrast gain control model. In IEEE Signal Processing Society Workshop on Multimedia Signal Processing, pages 527–532, 1999.
[6] Stefan Winkler. Issues in vision modeling for perceptual video quality assessment. Signal Processing, pages 231–252, 1999.
[7] R. J. Jacob and K. S. Karn. Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In The Mind's Eye: Cognitive and Applied Aspects of Eye Movements. Elsevier Science, Oxford, UK, 2004.
[8] A. Poole, L. J. Ball, and P. Phillips. In search of salience: A response-time
and eye-movement analysis of bookmark recognition. In Conference on
Human-Computer Interaction (HCI), pages 19–26, 2004.