their semantics can be relatively
primitive in comparison with textual
analogs.
Additionally, many existing tools
treat video monolithically, rather than
as a potentially interactive, mineable,
sharable, and reconfigurable medium.
Many startups offer systems that
allow users to remix video, but these
tools tend to operate breadth-first,
simply letting users string clips
together rather than organizing or
exposing the content buried within. Research
has focused on the related problem of
understanding and developing visual
literacy toward the production of video
(e.g., A. Weilenmann et al. [7]). While
this work is valuable, it limits media to a
particular representation.
In his book Mindstorms, Seymour
Papert suggests that “in the most
fundamental sense, we, as learners, are
all bricoleurs” and that we build our
understanding of complicated processes
by tinkering and reconfiguration. But in
order to tinker you need building blocks,
fully ready-to-hand components so that
learners and creators can engage in
what Lévi-Strauss called the bricoleur’s
“dialogue with…materials.” Once video
content can be manipulated using the
same techniques and metaphors we
apply to text, such as cut-and-paste,
drag-and-drop, and spatial editing,
we can build tools that support the
construction of multimedia documents
that richly convey procedural and
analytical content in concert with the
most appropriate media.
We further suggest that we need
tools that focus on content rather than
markup. When he created HTML,
Tim Berners-Lee never intended for
people to “have to deal with HTML.”
Multimedia documents have received
some support (e.g., in wikis),
but these tools are not conceptually
different from HTML: they still
require users to mark up text rather
than directly manipulate content.
Media bricolage tools must allow
users to extract media so that it can
be seamlessly remixed in multimedia
documents. But what exactly do we
mean by a multimedia document? For
our purposes, a multimedia document
does not simply place different pieces
of multimedia in proximity—websites
have done this quite well for years.
Rather, we mean documents in which
spatial and temporal layouts have equal
weight, can influence one another, and
through which content can flow in any
direction. Text documents are designed
to be consumed spatially, while videos
are designed for temporal navigation.
In a multimedia document, the goal
is to take advantage of a traditional
document’s spatial qualities to augment
video, and vice versa. Spatial-navigation
events should be able to trigger changes
in time-based media. Several Web-
based journalism sites have been
exploring this approach. For example,
in ESPN’s long-form piece on the
Iditarod, the reader follows the author
as he travels through Alaska along the
race course. As the reader scrolls, a map at
the top tracks his progress. Similarly, in
the New York Times piece “Snow Fall,”
animations respond to a user’s spatial
navigation. Multimedia documents
should also support spatial changes
triggered by temporal events. For
example, Mozilla’s PopcornMaker tool
allows content creators to trigger the
appearance of documents when a video
reaches a certain time point (SMIL-
based authoring tools have supported
similar features for many years).
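As a minimal sketch of this bidirectional coupling (not drawn from Cemint, PopcornMaker, or SMIL, and using assumed element IDs and an assumed cue time purely for illustration), standard browser APIs are already enough to let scrolling drive a video's timeline and to let playback reveal a piece of the surrounding document:

```typescript
// Hypothetical sketch: coupling scroll position to video time, and video time
// to document layout, using only standard DOM APIs. The element IDs
// ("tour-video", "sidebar-note") and the 30-second reveal point are assumptions.

const video = document.getElementById("tour-video") as HTMLVideoElement;
const note = document.getElementById("sidebar-note") as HTMLElement;

// Spatial -> temporal: as the reader scrolls through the article,
// map the scroll fraction onto the video's timeline.
window.addEventListener("scroll", () => {
  const scrollable = document.documentElement.scrollHeight - window.innerHeight;
  const fraction = scrollable > 0 ? window.scrollY / scrollable : 0;
  if (video.duration) {
    video.currentTime = fraction * video.duration;
  }
});

// Temporal -> spatial: when playback passes a marked time point,
// reveal an accompanying document fragment.
const REVEAL_AT_SECONDS = 30; // assumed cue point
video.addEventListener("timeupdate", () => {
  note.style.display = video.currentTime >= REVEAL_AT_SECONDS ? "block" : "none";
});
```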
We can expand the idea of
responsive documents more broadly to
include spatial events that trigger other
spatial changes (e.g., a background
changes as the user navigates) and
temporal events triggering other
temporal changes (e.g., pausing a video
upon reaching a marked time and then
playing an animated GIF in a separate
window to emphasize a point).
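The other two pairings can be sketched in the same spirit; again, the element IDs, class names, and the marked pause time below are assumptions for illustration rather than features of any particular tool:

```typescript
// Hypothetical sketch of spatial -> spatial and temporal -> temporal responses.
// IDs, class names, and the 45-second pause point are illustrative assumptions.

const clip = document.getElementById("demo-clip") as HTMLVideoElement;
const gif = document.getElementById("emphasis-gif") as HTMLImageElement;

// Spatial -> spatial: swap the page background theme as each section scrolls into view.
const observer = new IntersectionObserver((entries) => {
  for (const entry of entries) {
    if (entry.isIntersecting) {
      document.body.className = `theme-${entry.target.id}`;
    }
  }
});
document.querySelectorAll("section").forEach((s) => observer.observe(s));

// Temporal -> temporal: pause the video at a marked time, then start an
// animated GIF in a separate element to emphasize the point.
const PAUSE_AT_SECONDS = 45; // assumed marker
let fired = false;
clip.addEventListener("timeupdate", () => {
  if (!fired && clip.currentTime >= PAUSE_AT_SECONDS) {
    fired = true;
    clip.pause();
    gif.src = "emphasis.gif"; // assigning src starts the GIF animation
  }
});
```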
It is important that content flow
easily between media types so it can
be tightly integrated. We are currently
developing a suite of tools to support
such seamless intermedia synthesis. The
suite, called Cemint (for Component
Extraction from Media for Interaction,
Navigation, and Transformation),
includes mobile- and Web-based tools
that allow users to create temporal
content from spatial resources and vice
versa. SketchScan, a mobile application
Figure 1. SketchScan overview screen, with the second of three bookmarks selected (left). The
bookmark includes a region of a static image as well as an audio clip. Users can rearrange the
order of clips (right). When users are satisfied with their bookmarks and annotations, they send
the data to a server, which generates a video.
Figure 2. Directly interacting with video content with Cemint. Users can highlight text (top),
manipulate the mouse wheel to scroll (middle), and select regions of importance (bottom) to
crop the video or to copy content to their personal notepad.