direction. Due to the computational cost of processing two streams simultaneously, the image of each eye is often captured at a low resolution. This makes such systems less accurate, although increasing computational power and falling costs mean that more computationally intensive algorithms can be run
in real time. As an alternative, in [181], the authors propose using a single high-resolution image of
one eye to improve accuracy. Infrared-based systems, on the other hand, usually use only one camera, although the use of two cameras has been proposed to further increase accuracy [152].
Although most research on non-wearable systems has focused on desktop users, the ubiquity of
computing devices has allowed for applications in other domains in which the user is stationary (e.g., [168, 152]). For example, the authors of [168] monitor driver visual attention using a single non-wearable camera placed on a car’s dashboard to track facial features and detect gaze.
Wearable eye trackers have also been investigated mostly for desktop applications (or for users who do not walk while wearing the device). However, advances in hardware (e.g., reductions in size and weight) and lower costs have allowed researchers to investigate novel applications.
For example, in [193], eye-tracking data are combined with video from the user’s perspective, head directions, and hand motions to learn words from natural interactions with users; the authors of [137] use a wearable eye tracker to understand hand–eye coordination in natural tasks; and the authors of [38] use one to detect eye contact and record video for blogging.
The main issues in developing gaze tracking systems are intrusiveness, speed, robustness, and
accuracy. The types of hardware and algorithms required, however, depend heavily on the desired level of analysis. Gaze analysis can be performed at three different levels [23]: (a) highly detailed
low-level micro-events, (b) low-level intentional events, and (c) coarse-level goal-based events.
Micro-events include micro-saccades, jitter, nystagmus, and brief fixations, which are studied for their physiological and psychological relevance by vision scientists and psychologists. Low-level intentional events are the smallest coherent units of movement of which the user is aware during visual activity; these include sustained fixations and revisits. Although most work in HCI has focused on coarse-level goal-based events (e.g., using gaze as a pointer [165]), it is easy to foresee the importance of analysis at lower levels, particularly to infer the user’s cognitive state in affective interfaces (e.g., [62]).
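To make these event types concrete, the sketch below shows one common way to extract fixations, the basic low-level event, from a raw gaze-sample stream: a dispersion-threshold detector in the spirit of the classic I-DT algorithm. This is a minimal illustration rather than the method of any work cited here; the data layout, function name, and the 25-pixel/100-ms thresholds are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Fixation:
    start_ms: float  # timestamp of the first sample in the fixation
    end_ms: float    # timestamp of the last sample
    x: float         # centroid of the gaze samples, in pixels
    y: float

def detect_fixations(samples: List[Tuple[float, float, float]],
                     max_dispersion: float = 25.0,  # pixels (assumed threshold)
                     min_duration: float = 100.0    # ms (assumed threshold)
                     ) -> List[Fixation]:
    """Dispersion-threshold (I-DT-style) fixation detection.

    `samples` is a time-ordered list of (timestamp_ms, x, y) gaze points.
    A run of consecutive samples counts as a fixation if its dispersion,
    (max(x) - min(x)) + (max(y) - min(y)), stays below `max_dispersion`
    for at least `min_duration` milliseconds.
    """
    fixations: List[Fixation] = []
    i, n = 0, len(samples)
    while i < n:
        j = i
        # Grow the window one sample at a time while dispersion stays small.
        while j + 1 < n:
            xs = [s[1] for s in samples[i:j + 2]]
            ys = [s[2] for s in samples[i:j + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        if samples[j][0] - samples[i][0] >= min_duration:
            window = samples[i:j + 1]
            fixations.append(Fixation(
                start_ms=samples[i][0], end_ms=samples[j][0],
                x=sum(s[1] for s in window) / len(window),
                y=sum(s[2] for s in window) / len(window)))
            i = j + 1  # continue after the detected fixation
        else:
            i += 1  # window too brief to be a fixation; slide forward
    return fixations
```

Events finer than this (micro-saccades, jitter) would require higher sampling rates and much tighter thresholds than such a sketch assumes.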
Within this context, an important issue that is often overlooked is how to interpret eye-tracking data. In other words, as the user moves their eyes during interaction, the system must decide what the movements mean in order to react accordingly. We move our eyes 2–3 times per second, so a system may have to process large amounts of data within a short time, a task that is not trivial even when processing does not occur in real time. One way to interpret eye-tracking data is to cluster fixation points and assume, for instance, that clusters correspond to areas of interest.
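As a minimal sketch of this idea, the function below (reusing the `Fixation` type and imports from the previous sketch) groups fixation points into candidate areas of interest with a simple greedy, distance-based clustering. The 80-pixel threshold and the function name are illustrative assumptions, not the clustering approach discussed in [154].

```python
def cluster_fixations(fixations: List[Fixation],
                      max_distance: float = 80.0  # pixels (assumed AOI radius)
                      ) -> List[List[Fixation]]:
    """Greedy distance-based clustering of fixation points.

    Each fixation joins the nearest existing cluster whose centroid lies
    within `max_distance`; otherwise it starts a new cluster. Each cluster
    is a candidate area of interest.
    """
    clusters: List[List[Fixation]] = []
    centroids: List[Tuple[float, float]] = []
    for f in fixations:
        best, best_d = -1, float("inf")
        for k, (cx, cy) in enumerate(centroids):
            d = ((f.x - cx) ** 2 + (f.y - cy) ** 2) ** 0.5
            if d < best_d:
                best, best_d = k, d
        if best >= 0 and best_d <= max_distance:
            clusters[best].append(f)
            c = clusters[best]
            # Update the running centroid of the enlarged cluster.
            centroids[best] = (sum(p.x for p in c) / len(c),
                               sum(p.y for p in c) / len(c))
        else:
            clusters.append([f])  # no cluster close enough: start a new one
            centroids.append((f.x, f.y))
    return clusters
```

Summing fixation durations within a cluster then yields a dwell time per region, a common proxy for how much attention that region received.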
Clustering of fixation points is only one option, however, and as the authors of [154] discuss, it can be