expected in the case of face images obtained through mobile devices. FisherFaces uses pixel intensities in the face images as identifying features. In the future, we plan to explore other face-recognition techniques, including Gabor wavelets6 and Histograms of Oriented Gradients (HOG).5
We used two approaches for voice recognition: Hidden Markov Models (HMMs) based on Mel-Frequency Cepstral Coefficients (MFCCs) as voice features,10 the basis of our score-level fusion scheme; and Linear Discriminant Analysis (LDA),14 the basis for our feature-level fusion scheme. Both approaches recognize a user's voice independently of the phrase spoken.
Assessing face and voice sample quality. Assessing biometric sample quality is important for ensuring the accuracy of any biometric-based authentication system, particularly for mobile devices, as discussed earlier. Proteus thus assesses facial-image quality based on luminosity, sharpness, and contrast, and voice-recording quality based on signal-to-noise ratio (SNR). These classic quality metrics are well documented in the biometrics research literature.1,17,24 We plan to explore other promising metrics, including face orientation, in the future.
Proteus computes the average luminosity, sharpness, and contrast of a face image based on the intensity of its constituent pixels, using approaches described in Nasrolli and Moeslund.17 It then normalizes each quality measure to the [0, 1] range using min-max normalization and finally averages them to obtain a single quality score for the face image. One interesting problem here is determining the impact each quality metric has on the final face-quality score; for example, if the face image is too dark, poor luminosity would have the greatest impact, as the absence of light would be the most significant impediment to recognition. Likewise, in a well-lit image distorted by motion blur, sharpness would have the greatest impact.
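The normalize-then-average step can be sketched as follows. The metric definitions here are illustrative stand-ins, not the exact formulas from Nasrolli and Moeslund; the per-metric normalization bounds are assumed to come from training data.

```python
# Hypothetical sketch of Proteus-style face-image quality scoring.
# Luminosity = mean intensity; contrast = intensity standard deviation;
# sharpness = mean absolute horizontal gradient (a simple proxy).

def min_max_normalize(value, lo, hi):
    """Map value into [0, 1] via min-max normalization; clamp outliers."""
    if hi == lo:
        return 0.0
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

def face_quality_score(pixels, bounds):
    """pixels: 2D list of grayscale intensities in [0, 255].
    bounds: assumed per-metric (min, max) ranges, e.g. from training data."""
    flat = [p for row in pixels for p in row]
    n = len(flat)
    luminosity = sum(flat) / n
    contrast = (sum((p - luminosity) ** 2 for p in flat) / n) ** 0.5
    sharpness = sum(
        abs(row[i + 1] - row[i]) for row in pixels for i in range(len(row) - 1)
    ) / max(1, sum(len(row) - 1 for row in pixels))
    scores = [
        min_max_normalize(luminosity, *bounds["luminosity"]),
        min_max_normalize(sharpness, *bounds["sharpness"]),
        min_max_normalize(contrast, *bounds["contrast"]),
    ]
    return sum(scores) / len(scores)  # unweighted average, as described above
```

A per-metric weighting scheme, as the weighting discussion suggests, would replace the final unweighted average with a weighted one.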
SNR is defined as the ratio of the voice-signal level to the level of background noise. To obtain a voice-quality score, Proteus adapts the probabilistic approach described in Vondrasek and Pollak25 to estimate the voice and noise signals, then normalizes the SNR value to the [0, 1] range using min-max normalization.
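A minimal sketch of that last step, assuming signal and noise power estimates are already available (they stand in for the probabilistic estimator of Vondrasek and Pollak); the dB bounds are assumed operating limits, not values from the paper:

```python
import math

def snr_db(signal_power, noise_power):
    """SNR in decibels from average signal and noise power."""
    return 10.0 * math.log10(signal_power / noise_power)

def voice_quality_score(signal_power, noise_power, snr_min=0.0, snr_max=40.0):
    """Min-max normalize the estimated SNR into [0, 1], clamping outliers."""
    snr = snr_db(signal_power, noise_power)
    return max(0.0, min(1.0, (snr - snr_min) / (snr_max - snr_min)))
```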
Multimodal biometric fusion. In multimodal biometric systems, information from different modalities can be consolidated, or fused, at the following levels:21
Feature. Either the data or the feature sets originating from multiple sensors and/or sources are fused;
Match score. The match scores generated from multiple trait-matching algorithms pertaining to the different biometric modalities are combined; and
Decision. The final decisions of multiple matching algorithms are consolidated into a single decision through techniques like majority voting.
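The three levels can be illustrated with a toy sketch, assuming two modalities that each produce a feature vector, a match score in [0, 1], and an accept/reject decision; the function names and weights are illustrative, not Proteus's actual interfaces:

```python
def feature_level_fusion(face_features, voice_features):
    """Feature level: concatenate feature vectors before matching."""
    return face_features + voice_features

def score_level_fusion(face_score, voice_score, w_face=0.5):
    """Match-score level: weighted sum of per-modality match scores."""
    return w_face * face_score + (1.0 - w_face) * voice_score

def decision_level_fusion(decisions):
    """Decision level: majority vote over accept (True) / reject (False)."""
    return sum(decisions) > len(decisions) / 2
```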
Biometric researchers believe integrating information at earlier stages of processing (such as at the feature level) is more effective than integration at a later stage (such as at the score level).20
Multimodal Mobile Biometrics Framework
Proteus fuses face and voice biometrics at either the score or the feature level. Since decision-level fusion typically produces only limited improvement,21 we did not pursue it when developing Proteus.
Proteus performs its training and testing with videos of people holding a phone camera in front of their faces while speaking a certain phrase. From each video, the face is detected through the Viola-Jones algorithm24 and the system extracts the soundtrack. The system de-noises all sound frames by removing frequencies outside the human voice range (85Hz–255Hz) and drops frames without voice activity. It then uses the results as inputs to our fusion schemes.
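The audio-frame filtering step might look like the following sketch, assuming the soundtrack arrives as fixed-length frames of PCM samples. A naive DFT peak search and an energy threshold stand in for the system's actual de-noising and voice-activity detection:

```python
import math

def dominant_frequency(frame, sample_rate):
    """Naive O(n^2) DFT peak search; fine for short frames."""
    n = len(frame)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2):
        re = sum(frame[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(frame[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = re * re + im * im
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k * sample_rate / n

def keep_voiced_frames(frames, sample_rate, energy_threshold=1e-4):
    """Drop frames that are silent or whose dominant frequency falls
    outside the 85Hz-255Hz fundamental range cited above."""
    kept = []
    for f in frames:
        if sum(s * s for s in f) / len(f) < energy_threshold:
            continue  # no voice activity in this frame
        if 85.0 <= dominant_frequency(f, sample_rate) <= 255.0:
            kept.append(f)
    return kept
```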
Score-level fusion scheme. Figure 1 outlines our score-level fusion approach, integrating face and voice biometrics. The contribution of each modality's match score toward the final decision concerning a user's authenticity is determined by the respective sample quality. Proteus works as outlined in the following paragraphs.
Let t1 and t2, respectively, denote
the average face- and voice-quality
scores of the training samples from
the user of the device. Next, from a
To get its algorithm to scale to the constrained resources of the device, Proteus had to be able to shrink the size of face images to prevent the algorithm from exhausting the available device memory.