erage dropping accuracy of 92.38%;
per-class accuracy is included in the
online Appendix.
Modalities compared. We recruited
12 graduate and undergraduate students, including eight males and four
females, all 20 to 30 years old, to test
the effectiveness of modality training
on a mock surgical task simulating an
abdominal incision and closure (see
Figure 4), a task requiring five instrument classes: scalpel, scissors, needle, retractor, and hemostat (four hemostats), a total of eight instruments.
We tested Gestonurse under three
conditions: speech (S), gesture (G), and
combined speech and gesture (SG).
Note that in SG, subjects used both the gestures and the speech commands (see the online Appendix) to request surgical instruments, but not simultaneously. While Gestonurse can handle simultaneous requests from multiple modalities, simultaneous requests through different modalities are not desirable during real-life surgeries. Surgeons are allowed to use only speech, only gestures, or speech and gestures one at a time during surgery.
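The one-at-a-time policy can be enforced with a simple arbitration step over timestamped requests. A minimal sketch, assuming hypothetical event names and a one-second conflict window (the article does not specify Gestonurse's actual arbitration logic):

```python
from dataclasses import dataclass

@dataclass
class Command:
    modality: str      # "speech" or "gesture"
    instrument: str    # e.g. "scalpel"
    timestamp: float   # seconds since procedure start

def arbitrate(events, window=1.0):
    """Accept commands one modality at a time: a command is dropped
    when a command from the *other* modality was accepted within
    `window` seconds, since simultaneous multimodal requests are
    undesirable during surgery."""
    accepted = []
    last = None  # most recently accepted command
    for ev in sorted(events, key=lambda e: e.timestamp):
        if (last is not None and ev.modality != last.modality
                and ev.timestamp - last.timestamp < window):
            continue  # overlapping request from the other modality
        accepted.append(ev)
        last = ev
    return accepted
```

Requests arriving through a single modality, or spaced apart in time, pass through unchanged; only cross-modality overlaps are filtered.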
We assigned the 12 subjects randomly to one of three test groups depending on whether they would be
using speech, gestures, or both, each
participating in two experiments. Subjects in the S and G groups could use
only five commands to request five instruments from the robot for the mock
procedure. We asked the SG test group
to use speech to request half the required instruments and gestures for
the rest.
Within each group, we trained two of the subjects to communicate with Gestonurse before they performed the procedure, asking them to repeat each command 15 times. The name of the recognized instrument was then read back to the subject through a text-to-speech program (Microsoft Sam). We conducted gesture-recognition training similarly, with each test subject repeating each gesture 15 times and being shown a bar graph of the gesture's log-likelihood score for each gesture class.
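The bar-graph feedback reflects a standard maximum-likelihood decision: the recognizer scores the observed gesture under each class model and the highest-scoring class wins. A minimal sketch, assuming each class is modeled as a one-dimensional Gaussian over a hand-shape feature (the class models and feature values here are invented for illustration; the article does not describe Gestonurse's actual models):

```python
import math

# Hypothetical per-class Gaussian models (mean, standard deviation)
# over a single hand-shape feature.
MODELS = {
    "scalpel":   (0.0, 1.0),
    "scissors":  (3.0, 1.0),
    "retractor": (6.0, 1.5),
}

def log_likelihood(x, mean, std):
    """Log-density of x under a Gaussian N(mean, std**2)."""
    return (-0.5 * math.log(2 * math.pi * std * std)
            - (x - mean) ** 2 / (2 * std * std))

def score_gesture(x):
    """Per-class log-likelihoods, as shown in the training bar graph."""
    return {cls: log_likelihood(x, m, s) for cls, (m, s) in MODELS.items()}

def classify(x):
    """Maximum-likelihood gesture class for feature value x."""
    scores = score_gesture(x)
    return max(scores, key=scores.get)
```

Showing trainees the full score vector, rather than only the winning class, lets them see how close their gesture came to being confused with a neighboring class.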
Each subject performed the surgical task six times, while we recorded the task-completion times, which were determined mainly by the type of surgical procedure, not by the speed of the computer-vision algorithm; for
example, repairing an open abdominal aortic aneurysm can take up to
eight hours.
Discussion
Having robotics support surgical performance promises shorter operating times, greater accuracy, and fewer
risks to the patient compared with
traditional, human-only surgery. Gestonurse assists the main surgeon by
passing surgical instruments while
freeing surgical technicians to perform other tasks. Such a system could
potentially reduce miscommunication
and compensate for understaffing by
understanding nonverbal communication (hand gestures) and speech
commands with recognition accuracy,
as we measured it, over 97%. We validated the system in a mock surgery, an
abdominal incision and closure. In it
we computed learning rates of 73.16% and 73.09% for the test subjects with and without gesture training, indicating learning occurred at the same rate in both groups, and that the improvement of 75.44 seconds (12.92% less) in task-completion time was due directly to the training provided to the test subjects prior to
the six trials. This means the test subjects' skill came from understanding and participating in the surgical task rather than from learning to use hand gestures. Gesturing is presumably intuitive enough to be used by surgical
staff with (almost) no training. Our informal discussions with surgical staff
at Wishard Hospital, a public hospital
affiliated with the Indiana University
School of Medicine in Indianapolis,
found surgeons excited about the possibility of using such a robot in a surgical setting.
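The learning rates reported above can be obtained by fitting a learning curve to per-trial completion times. The article does not state which model was used; a common choice is the power law T_n = T_1 · n^(−b), whose learning rate 2^(−b) is the fraction of the previous time needed each time practice doubles. A minimal sketch with invented timing data (the trial times below are illustrative, not the study's measurements):

```python
import math

def learning_rate(times):
    """Fit T_n = T1 * n**(-b) by least squares in log-log space and
    return the learning rate 2**(-b): the multiplier on completion
    time each time the number of trials doubles."""
    n = len(times)
    xs = [math.log(i + 1) for i in range(n)]   # log trial number
    ys = [math.log(t) for t in times]          # log completion time
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))  # slope = -b
    return 2 ** slope                           # 2**(-b)

# Illustrative six-trial completion times (seconds); a rate near 0.73
# would mean each doubling of practice cuts time to ~73% of before.
times = [600, 450, 390, 340, 320, 300]
rate = learning_rate(times)
```

Identical rates for trained and untrained groups, as in the study, indicate the practice effect is independent of the gesture training itself.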
The multimodal system is 55.95 seconds faster (14.9% less) than a speech-only system on average. However, we
also found that gesture and voice together are no faster than gesture alone,
an effect possibly due to having to switch between modalities, with the related additional cognitive load affecting performance time. Our future
work aims to address the kind of performance (in terms of functionality, usability, and accuracy) a robotic system
must deliver to be a useful, cost-effective alternative to traditional human-only practice.