Storing and processing biometric data on the mobile device itself, rather than offloading these tasks to a remote server, eliminates the challenge of securely transmitting the biometric data and authentication decisions across potentially insecure networks. In addition, this approach alleviates consumers’ concerns regarding the security, privacy, and misuse of their biometric data in transit to and on remote systems. Mechanisms enabling secure storage and processing of biometric data on the device must therefore be in place; one approach isolates the biometric data from the device software and hardware, and the Galaxy S5 uses it to protect fingerprint data.22

Performance Evaluation

We compared Proteus’s recognition accuracy to that of unimodal systems based on face and voice biometrics. We measured that accuracy using the standard equal error rate (EER) metric, the value at which the false acceptance rate (FAR) equals the false rejection rate (FRR).
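To make the metric concrete, the following minimal sketch shows one common way to estimate the EER by sweeping a decision threshold over match scores; the synthetic score distributions are illustrative and are not drawn from our experiments:

```python
import numpy as np

def compute_eer(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping a decision threshold over all scores.

    FAR is the fraction of impostor scores at or above the threshold
    (wrongly accepted); FRR is the fraction of genuine scores below it
    (wrongly rejected). The EER lies where the two rates cross.
    """
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))  # closest approach of FAR and FRR
    return (far[i] + frr[i]) / 2

# Synthetic, illustrative score distributions (not our experimental data).
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.15, 500)    # scores for legitimate users
impostor = rng.normal(0.4, 0.15, 500)   # scores for impostors
print(f"EER = {compute_eer(genuine, impostor):.2%}")
```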
Database. For our experiments, we created CSUF-SG5, a homegrown multimodal database of face and voice samples collected with the Galaxy S5 (hence the name) from California State University, Fullerton, students and employees, as well as from individuals outside the university. To incorporate various types and levels of variation and distortion in the samples, we collected them in a variety of real-world settings. Because such a diverse database of multimodal biometrics is otherwise unavailable, we plan to make our own publicly available. The database today includes video recordings of 54 people of different genders and ethnicities holding a phone camera in front of their faces while speaking a certain simple phrase.
The faces in these videos show the following types of variation:
Five expressions. Neutral, happy, sad, angry, and scared;
Three poses. Frontal and sideways (left and right); and
Two illumination conditions. Uniform and partial shadows.
The voice samples show different
levels of background noise, from car
traffic to music to people chatter, coupled with distortions in the voice itself
(such as raspiness). We used 20 different popular phrases, including “Roses
are red,” “Football,” and “13.”
Results. In our experiments, we
trained the Proteus face, voice, and
fusion algorithms using videos from
half of the subjects in our database
(27 subjects out of a total of 54), while
we considered all subjects for testing. We collected most of the training
videos in controlled conditions with
good lighting and low background
noise levels and with the camera held
directly in front of the subject’s face.
For these subjects, we also added a few face and voice samples from videos of less-than-ideal quality (to simulate the limited variation in training samples a typical consumer would be expected to provide) to increase the algorithms’ chances of correctly identifying the user under similar conditions. Overall, we used three face frames and five voice recordings per subject (extracted from video) as training samples.
For testing, we used a randomly selected face-and-voice sample from a randomly selected subject among the 54 subjects in the database, leaving out the training samples. Overall, we created and used 480 training and test-set combinations and averaged their EERs and testing times. We took this statistical cross-validation approach to assess and validate the effectiveness of our proposed schemes on the available database of 54 subjects.
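As a sketch of this protocol, the draw of one training/test combination might look like the following; samples, train_model, and match are hypothetical stand-ins for the actual Proteus data structures and enrollment and scoring routines, which we do not specify here:

```python
import random

def draw_combination(samples, subject_ids, train_model, match):
    """Draw one training/test combination from the database.

    `samples` maps a subject id to that subject's face-and-voice samples;
    `train_model` and `match` are hypothetical stand-ins for the actual
    enrollment and scoring routines, which are not specified here.
    Returns genuine and impostor match scores for one random probe.
    """
    subject = random.choice(subject_ids)              # pick a subject at random
    probe = random.choice(samples[subject])           # hold out one sample
    training = [s for s in samples[subject] if s is not probe]
    model = train_model(training)                     # enroll on the remainder
    genuine_scores = [match(model, probe)]
    impostor_scores = [match(model, random.choice(samples[other]))
                       for other in subject_ids if other != subject]
    return genuine_scores, impostor_scores
```

Repeating the draw 480 times, computing an EER for each combination, and averaging the results corresponds to the protocol described above.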
Quality-based score-level fusion. Table 1 lists the average EERs and testing times for the unimodal and multimodal schemes. We attribute the high EER of our HMM voice-recognition algorithm to the complex noise in many of our samples, including traffic, people chatter, and music, which was difficult to detect and eliminate. Our quality-based score-level fusion scheme detected the low SNR levels and compensated by adjusting the weights in favor of the face images, which were of substantially better quality. The face biometric thus had a greater impact than the voice biometric on the final decision of whether or not a user is legitimate. In the contrasting scenario, where the voice samples were of relatively better quality than the face samples (the reverse of the conditions behind Table 1), the EERs were 21.25% for unimodal voice and 20.83% for score-level fusion.

These results are promising: they show that the quality of the different modalities can vary with the circumstances in which mobile users find themselves, and that Proteus adapts to those conditions by scaling the quality weights appropriately. With further refinements (such as more robust normalization techniques), the multimodal method can yield even better accuracy.
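A minimal sketch of quality-based fusion in this spirit appears below; the SNR breakpoints, the linear weight schedule, and the assumption of normalized match scores in [0, 1] are illustrative choices, not Proteus’s actual parameters:

```python
import numpy as np

def voice_snr_db(speech, noise):
    """Rough SNR estimate in dB from a speech segment and a noise estimate."""
    return 10 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

def fuse_scores(face_score, voice_score, snr_db,
                low_snr_db=5.0, high_snr_db=20.0):
    """Quality-weighted sum of normalized match scores.

    A noisy voice sample (low SNR) shifts weight toward the face score;
    a clean one restores an even split. The linear schedule and the
    5/20 dB breakpoints are illustrative, not Proteus's parameters.
    """
    quality = np.clip((snr_db - low_snr_db) / (high_snr_db - low_snr_db),
                      0.0, 1.0)
    w_voice = 0.5 * quality          # voice gets at most an equal say
    w_face = 1.0 - w_voice
    return w_face * face_score + w_voice * voice_score

# A noisy recording (3 dB SNR): the face biometric dominates the decision.
print(f"fused score = {fuse_scores(0.82, 0.41, snr_db=3.0):.3f}")
```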
Feature-level fusion. Table 2 outlines our performance results for the feature-level fusion scheme, showing that feature-level fusion yielded significantly greater authentication accuracy than the unimodal schemes. Our experiments clearly reflect the potential of multimodal biometric authentication.
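Feature-level fusion combines the modalities before matching rather than combining their scores afterward. One common realization, sketched below under the assumption of fixed-length feature vectors and a cosine-similarity matcher (both illustrative, not necessarily Proteus’s design), is to normalize and concatenate the per-modality features and match in the joint space:

```python
import numpy as np

def fuse_features(face_features, voice_features):
    """Concatenate per-modality feature vectors into one joint vector.

    Each vector is z-score normalized first so that neither modality's
    numeric range dominates the joint representation.
    """
    def normalize(v):
        return (v - v.mean()) / (v.std() + 1e-8)
    return np.concatenate([normalize(face_features),
                           normalize(voice_features)])

def match(enrolled, probe):
    """Cosine similarity between enrolled and probe joint vectors."""
    return float(np.dot(enrolled, probe) /
                 (np.linalg.norm(enrolled) * np.linalg.norm(probe)))

# Illustrative dimensions only (e.g., a face embedding plus voice features).
rng = np.random.default_rng(1)
enrolled = fuse_features(rng.normal(size=128), rng.normal(size=64))
probe = fuse_features(rng.normal(size=128), rng.normal(size=64))
print(f"joint-space similarity = {match(enrolled, probe):.3f}")
```

One common explanation for gains like those in Table 2 is that the joint representation preserves cross-modal information that is lost once each modality has been reduced to a single score.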
Table 1. EER results from score-level fusion.
Modality EER Testing Time (sec.)
Face 27.17% 0.065
Voice 41.44% 0.045
Score-level fusion 25.70% 0.108
Table 2. EER results from feature-level fusion.
Modality EER Testing Time (sec.)
Face 4.29% 0.13
Voice 34.72% 1.42
Feature-level fusion 2.14% 1.57