Reconfigurability. Hand-gesture systems are used by many different types
of users, so hand-gesture interfaces are not “one size fits all.” Location, anthropometric characteristics,
and type and number of gestures are
some of the most common features
that vary among users.
Challenges. This requirement is
not technically challenging; the main
problem is deciding which functionalities
within the interface can change
and which cannot. The designer
should avoid overwhelming the user
with an endless array of tunable parameters
and menus. On the other hand, users
should have enough flexibility to freely set up the system when a major component is replaced or extended.
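As a concrete, purely illustrative sketch of this split, the configuration below exposes a handful of user-tunable options while keeping implementation details fixed; all parameter names are hypothetical, not drawn from any particular system:

from dataclasses import dataclass, field

@dataclass
class GestureInterfaceConfig:
    # User-tunable settings: few, meaningful, and safe to change.
    handedness: str = "right"                  # "left" or "right"
    envelope_radius_cm: float = 60.0           # size of the virtual interaction envelope
    active_gestures: list = field(default_factory=lambda: ["point", "grab", "swipe"])

    # Designer-fixed settings: not exposed, to avoid overwhelming the user.
    camera_resolution: tuple = (640, 480)
    frame_rate_hz: int = 30

The point is only that a few meaningful options (for example, which gestures are active) remain adjustable when a component is replaced or extended, while the rest stays hidden.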
Interaction space. Most systems
assume users are standing in a fixed
place with hands extended (limited
by a virtual interaction envelope) and
recognize gestures within that envelope. But these assumptions do not
hold for mobile ubiquitous hand-gesture-recognition systems, where the interaction envelope surrounds only the
mobile device.

Figure 1. Head and hand detection using depth from stereo, illumination-specific color segmentation, and knowledge of typical body characteristics. 17

Figure 2. Hue-saturation histogram of skin color. The circled region contains the hand pixels in the photo; the high spike is caused by grayish and white pixels.
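The Figure 2 caption hints at a common approach to the segmentation step: locating hand pixels through their hue-saturation distribution. A minimal OpenCV sketch using histogram backprojection, assuming a hand-labeled skin patch and an input frame (the file names are placeholders):

import cv2
import numpy as np

# Build a hue-saturation histogram from a sample patch known to contain skin.
sample = cv2.cvtColor(cv2.imread("skin_sample.png"), cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([sample], [0, 1], None, [32, 32], [0, 180, 0, 256])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

# Backproject the histogram onto a new frame: high values mark likely skin pixels.
frame = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2HSV)
backproj = cv2.calcBackProject([frame], [0, 1], hist, [0, 180, 0, 256], 1)

# Threshold and clean up to obtain a binary hand mask.
_, mask = cv2.threshold(backproj, 50, 255, cv2.THRESH_BINARY)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

In practice, the spike from grayish and white pixels noted in the caption argues for discarding low-saturation pixels before building the histogram.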
Challenges. Recognition of 3D body-arm configurations is usually achieved
through at least two cameras with
stereo vision, a setup that requires prior calibration and usually responds more slowly than single-camera-based
systems. Monocular vision can instead
disambiguate 3D location using accurate anthropomorphic models of the
body, but fitting such a model to the
image is computationally expensive.
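For illustration only, a minimal OpenCV sketch of recovering coarse depth from a calibrated, rectified stereo pair with a block matcher; the file names are placeholders, and a real system would still need the prior calibration and rectification noted above:

import cv2

# Rectified left/right views from a calibrated stereo rig (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching yields a disparity map; nearer objects such as an extended hand
# produce larger disparities, which is enough to separate hands from the torso.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype("float32") / 16.0  # fixed-point to pixels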
Gesture spotting and the immersion syndrome. Gesture spotting consists of distinguishing useful gestures
from unintentional movement related
to the immersion-syndrome phenomenon, 2 where unintended movement is
interpreted against the user’s will. Unintended gestures typically arise
when the user is simultaneously interacting with other people or devices, or
is simply resting the hands.
Challenges. The main challenge
here is cue selection to determine the
temporal landmarks where gesture interaction starts and ends; for example,
hand tension can be used to find the
“peak” of the gesture temporal trajectory, or “stroke,” while voice can be
used to mark the beginning and culmination of the interaction. However,
recognition alone is not a reliable measure when the start and end of a gesture are unknown, since irrelevant activities often occur during the gesture
period. One solution is to assume that
relevant gestures are associated with
activities that produce some kind of
sound; audio-signal analysis can therefore aid the recognition task. 52
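A minimal sketch of the audio-cue idea, assuming only that a gesture of interest coincides with sound-producing activity; the frame length and energy threshold are arbitrary choices:

import numpy as np

def spot_gesture_intervals(audio, rate, frame_ms=50, energy_thresh=0.01):
    # Return (start_s, end_s) intervals where short-time audio energy exceeds
    # a threshold; these serve as temporal landmarks for gesture recognition.
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    energy = np.array([np.mean(audio[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    active = energy > energy_thresh
    intervals, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            intervals.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        intervals.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return intervals

Recognition would then be attempted only inside the returned intervals, rather than over the entire movement stream.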
While responsiveness, accuracy,
intuitiveness, “come as you are,” and
gesture spotting apply to all classes of
gesture interface, other requirements
gesture interface, other requirements
are more specific to the context of the
application. For mobile environments
in particular, ubiquity and wearability
represent special requirements:
Ubiquity and wearability. For mobile hand-gesture interfaces, these
requirements should be incorporated into every aspect of daily activity,
in every location and every context; for example, small cameras attached
to the body or distributed, networked sensors can be used to access
information when the user is mobile.
Hand-Gesture Recognition
Hand gestures can be captured
through a variety of sensors, including “data gloves” that precisely record
every digit’s flex and abduction angles,
and electromagnetic or optical position and orientation sensors for the
wrist. Yet wearing gloves or trackers,
as well as associated tethers, is uncomfortable and increases the “time-to-interface,” or setup time. Conversely,
computer-vision-based interfaces offer
unencumbered interaction, providing
several notable advantages:
˲ Computer vision is nonintrusive;
˲ Sensing is passive, silent, possibly stealthy;
˲ Installed camera systems can perform other tasks aside from hand-gesture interfaces; and
˲ Sensing and processing hardware is commercially available at low cost.
However, vision-based systems usually require application-specific algorithm development, programming,
and machine learning. Deploying them
in everyday environments is a challenge, particularly for achieving the robustness necessary for user-interface
acceptability: robustness to camera
sensor and lens characteristics, scene
and background details, lighting conditions, and user differences. Here, we
look at methods employed in systems
that have overcome these difficulties,
first discussing feature-extraction
methods (aimed at gaining information about gesture position, orientation, posture, and temporal progression), then briefly covering popular
approaches to feature classification
(see Figure 1).
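As a taste of the feature-extraction step, a minimal sketch that derives hand position and in-plane orientation from a binary hand mask using image moments; the mask is assumed to come from a prior segmentation step such as the skin-color example above:

import cv2
import numpy as np

def hand_position_orientation(mask):
    # Centroid and principal-axis angle (radians) of the largest blob in a binary mask.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    hand = max(contours, key=cv2.contourArea)
    m = cv2.moments(hand)
    if m["m00"] == 0:
        return None
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]               # position
    theta = 0.5 * np.arctan2(2 * m["mu11"], m["mu20"] - m["mu02"])  # orientation
    return (cx, cy), theta

Posture and temporal progression would require richer features (for example, contour shape descriptors and tracking across frames), but position and orientation already fall out of the moments.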