used to capture, clean, animate, and
share sketches, is a demonstration of
the former [8]. With this app, users
define regions of a sketch, optionally add
audio annotations to each region, and
ultimately generate a movie from the
sketch and annotated regions (Figure
1). In SketchScan, users do not actually
shoot video. Instead, the system creates
a video from a sequence of multimedia
bookmarks. Each bookmark includes a
highlighted subregion of an image and
an optional audio clip. Users capture
a static image, then create bookmarks
and arrange them to tell a story. The
sequenced bookmarks and their
annotations are then forwarded to a
remote server that combines them all
into a single video.
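The bookmark sequence described above can be pictured as a simple data structure. The following sketch is purely illustrative; the class and field names (`Bookmark`, `SketchStory`, `to_render_request`) are our own assumptions, not SketchScan's actual API.

```python
# Hypothetical model of SketchScan-style multimedia bookmarks:
# a single static image, an ordered list of highlighted regions
# (each with optional audio), serialized for a remote render server.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Bookmark:
    """A highlighted subregion of the captured sketch, plus optional audio."""
    region: tuple                       # (x, y, width, height) in the image
    audio_clip: Optional[str] = None    # path to a recorded annotation, if any


@dataclass
class SketchStory:
    image: str                          # the single captured static image
    bookmarks: List[Bookmark] = field(default_factory=list)

    def to_render_request(self) -> dict:
        """Order the bookmarks into a request for the remote server,
        which combines them into one video."""
        return {
            "image": self.image,
            "segments": [
                {"region": b.region, "audio": b.audio_clip}
                for b in self.bookmarks
            ],
        }


story = SketchStory(image="sketch.png")
story.bookmarks.append(Bookmark(region=(0, 0, 100, 80), audio_clip="intro.m4a"))
story.bookmarks.append(Bookmark(region=(50, 40, 60, 60)))
request = story.to_render_request()
```

The key point is that no video is ever shot on the device: the client only sequences lightweight region-plus-audio records, and the server does the heavy lifting of assembling them into a movie.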
The reverse case, extracting media
from videos for use in static documents,
has been explored previously, mostly
for summarization purposes. For
example, video-summary tools have
been developed that extract keyframes
into a pleasing static design [9]. But
there are many other ways to leverage
video content in user interfaces. As part
of Cemint, we are building tools that
allow users to extract any keyframe
from a video, or automatically detected
subregions of keyframes, at any time.
With these tools, users directly interact
with video content using familiar
techniques such as dragging a selection
box over an area to highlight text,
using the mouse wheel to scroll up and
down, or double-clicking to identify
rectangular areas of importance (Figure
2). Users then use familiar copy-and-paste or drag-and-drop techniques
to extract content to multimedia
documents [10].
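The selection-box interaction above ultimately reduces to cropping a rectangle out of a keyframe. A minimal sketch, assuming the keyframe is modeled as a row-major grid of pixels (the function name `crop_region` is ours, not Cemint's):

```python
# Illustrative only: extract the rectangular subregion a user selects
# by dragging a box (or double-clicking) over a keyframe.
def crop_region(keyframe, x, y, width, height):
    """Return the subregion [x, x+width) x [y, y+height) of a keyframe,
    where keyframe is a list of pixel rows (row-major)."""
    return [row[x:x + width] for row in keyframe[y:y + height]]


# A toy 4x6 "keyframe" whose pixels record their own (row, col) position.
keyframe = [[(r, c) for c in range(6)] for r in range(4)]
patch = crop_region(keyframe, x=2, y=1, width=3, height=2)
# patch covers rows 1-2, columns 2-4 of the frame
```

Once extracted, such a patch (and any text recognized within it) is what gets copied or dragged into the target document.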
One effect of supporting the flexible
repurposing of content across media is
freeing users to compose thoughts in
the domain of their choice for ideation.
Users can then reuse media directly,
without having to shoehorn their work
to fit a particular tool. This could be a
boon for new learners, as, for example,
many novice users of word processors
tend to spend more time constructing
their thoughts outside the context of
the program than within the word
processor itself [11]. Furthermore,
content analysis can support users’
compositions. Analysis can leverage
user input solicited via familiar
interactions with both the original
content and exposed intermediate
results of real-time analysis. This user-driven approach to content analysis can avert many difficulties that plague the predominant automatic end-to-end approaches.
We are just beginning our work in
this area—we are far from providing
full-fledged multimedia document
support. And there are many other ways
to apply text document concepts to help
users navigate and extract content from
video. For example, we are currently
exploring how real-time analysis of live
video, such as from video conferences or
lectures, can enable better note-taking,
review, and content reuse. Tools or
techniques that make it easy for spatial
navigation to trigger side effects that
enrich the reading experience without
detracting from the comprehension of
main concepts represent another gap
in current support. Finally, we believe
that better integration of video and
demonstration tools could dramatically
improve the way that many research
results in the HCI community are
communicated. As David Weinberger
writes, “If your medium doesn’t
easily allow you to correct mistakes,
knowledge will tend to be carefully
vetted. If it’s expensive to publish,
then you will create mechanisms that
winnow out contenders. If you’re
publishing on paper, you will create
centralized locations where you amass
books.... Traditional knowledge has
been an accident of paper” [12].
The main goal for any multimedia
document tool is to allow users to tell
a story using the most appropriate
combination of rich and traditional
media. As reading continues to move
to mobile and tablet devices, a rich
multimedia approach will increasingly
be the most natural way to convey formal
and informal concepts. Ultimately,
this will lead to a reformulation of the
very notion of knowledge.
2. NEA. To Read or Not To Read: A Question of National Consequence. 2007.
4. Vardi, M.Y. Will MOOCs destroy academia? Comm. of the ACM 55, 11 (2012), 5.
5. Wolf, M. Proust and the Squid: The Story and Science of the Reading Brain. Harper.
6. Hauptmann, A., Yan, R., Lin, W.-H., Christel, M., and Wactlar, H. Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Transactions on Multimedia 9, 5 (2007), 958–966.
7. Weilenmann, A., Säljö, R., and Engström, A. Mobile video literacy: Negotiating the use of a new visual technology. Personal and Ubiquitous Computing 18, 3 (2014).
9. Uchihashi, S., Foote, J., Girgensohn, A., and Boreczky, J. Video Manga: Generating semantically meaningful video summaries. Proc. of the ACM International Conference on Multimedia. 1999, 383–392.
10. Denoue, L., Carter, S., and Cooper, M. Content-based copy and paste from video documents. Proc. of the ACM Symposium on Document Engineering. 2013, 215–218.
11. Huh, J. Why Microsoft Word does not work for novice writers. Interactions 20, 2.
12. Weinberger, D. Too Big to Know: Rethinking Knowledge Now That the Facts Aren’t the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room. Basic Books, 2012.
Scott Carter is a senior research scientist
at FX Palo Alto Laboratory. His primary
research focus is developing innovative
multimedia user interfaces.
Matthew Cooper is a senior research
scientist at FX Palo Alto Laboratory, leading
the Interactive Media group. His primary
research focus is developing content-analysis
techniques that enable multimedia information
management and retrieval applications.
Laurent Denoue is a researcher at FX Palo
Alto Laboratory interested in user interaction
design and document and video processing.
He worked on XLibris, an annotation system;
ProjectorBox, an appliance for capturing
meetings; and TalkMiner, a service that detects
slides in online lectures. His recent interest is
client-based video processing to manipulate
video documents in real time.
John Doherty is a senior media specialist at
FX Palo Alto Laboratory. His primary interest
is in designing processes and systems that
make video easier to produce, repurpose, and
integrate into multimedia documents.
Vikash Rugoobur is a visiting researcher at
FX Palo Alto Laboratory. His interests include
quantified self, wearable devices, and user
DOI: 10.1145/2617379 COPYRIGHT HELD BY AUTHORS. PUBLICATION RIGHTS LICENSED TO ACM. $15.00