DOI: 10.1145/1435417.1435436
Technical Perspective
Customizing Media to Displays
By Harry Shum
A few years ago, I bought a wide-screen TV with an aspect ratio of 16:9. It
is great for watching movies shot with a
wide-screen format. However, on most
other occasions, I am faced with a dilemma: If I choose the option to fill the
entire screen, everything looks wider
than normal, while preserving the aspect ratio of the video means seeing
wasted space at both ends.
There is a mind-boggling array of
displays that are readily available, from
large plasma displays and high-resolution LCDs to low-resolution cellphone
screens. These displays differ greatly
in resolution and aspect ratio. The
problem is that images and videos are captured at fixed resolutions and aspect
ratios, and, in my personal experience,
viewing them properly on a given display can
be a challenge.
What, then, is the correct way of displaying media? Global scaling solves
part of the problem, but naively stretching or squashing one of the dimensions to fill the screen introduces undesirable distortions. Cropping is not
a satisfactory solution either because
important elements in the scene may
either be partially removed or totally
cut out. We need a solution that intelligently customizes media to displays.
The answer may well lie in the work
of Ariel Shamir and Shai Avidan. Their
technique, intriguingly called “seam
carving,” removes pixels from, or adds pixels to,
swaths of the image deemed less important.
Importance can be measured by
contrast or by the need to preserve people or
objects. Given an energy function that
measures this importance, the process
of removing pixels so as to minimize this energy is nontrivial, because we must preserve both the rectangular shape and the visual coherence of
the image. Shamir and Avidan devised
a simple but powerful idea: carve (remove) seams iteratively.
A seam is a connected path of low-energy pixels crossing the image from
top to bottom or from left to right. Their
seam carving algorithm changes the
aspect ratio of the image by iteratively
carving the seams with the lowest importance, horizontally or vertically. The
optimal seam at each iteration can be
found using dynamic programming.
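The dynamic-programming step can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not Shamir and Avidan's implementation: the image is a grayscale grid stored as a list of lists, the energy is a simple neighbor-difference contrast measure, and the function names (`energy`, `find_vertical_seam`, `remove_vertical_seam`, `carve_width`) are hypothetical.

```python
def energy(img):
    """Assumed energy: absolute horizontal plus vertical neighbor difference."""
    h, w = len(img), len(img[0])
    e = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            left = img[y][max(x - 1, 0)]
            right = img[y][min(x + 1, w - 1)]
            up = img[max(y - 1, 0)][x]
            down = img[min(y + 1, h - 1)][x]
            e[y][x] = abs(right - left) + abs(down - up)
    return e


def find_vertical_seam(e):
    """Return one column index per row for the minimum-energy connected seam."""
    h, w = len(e), len(e[0])
    # cost[y][x] = energy of the cheapest seam from the top row ending at (y, x)
    cost = [row[:] for row in e]
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 1, w - 1)
            cost[y][x] += min(cost[y - 1][lo:hi + 1])
    # Backtrack from the cheapest bottom pixel, moving at most one column per row.
    x = min(range(w), key=lambda i: cost[h - 1][i])
    seam = [x]
    for y in range(h - 1, 0, -1):
        lo, hi = max(x - 1, 0), min(x + 1, w - 1)
        x = min(range(lo, hi + 1), key=lambda i: cost[y - 1][i])
        seam.append(x)
    seam.reverse()
    return seam


def remove_vertical_seam(img, seam):
    """Delete one pixel per row, so the image stays rectangular."""
    return [row[:x] + row[x + 1:] for row, x in zip(img, seam)]


def carve_width(img, new_width):
    """Iteratively carve vertical seams until the image has new_width columns."""
    while len(img[0]) > new_width:
        img = remove_vertical_seam(img, find_vertical_seam(energy(img)))
    return img
```

On a small image whose left columns are flat and dark and whose right columns are bright, `carve_width` removes pixels from the flat region and leaves the dark-to-bright edge in place. Recomputing the energy after every seam is the simplest choice here, not the fastest.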
Herein lies the magic of seam carving: removing a seam has only a local
impact and the produced visual artifacts are globally imperceptible. As a
result, seam carving maintains both
a rectangular shape and visual coherence of the image.
To enlarge the image, the seam
carving process is run in reverse:
interpolated pixels are added along the
lowest-energy seam. The authors have
also demonstrated other applications
of seam carving, such as content amplification, object removal, multisize
image format, and last but not least,
video resizing. Video resizing is nontrivial because of the need for temporal coherence in addition to spatial
coherence. Shamir and Avidan cleverly
achieved video resizing by casting the
problem on a 3D graph in which the seam is a 2D manifold (instead of a 2D graph with a 1D curve,
as for images).
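Enlarging an image by one column can be sketched in the same spirit. The helper below, a hypothetical `insert_vertical_seam`, takes a grayscale image (a list of lists) and one column index per row, and inserts next to each seam pixel the average of that pixel and its right-hand neighbor. The averaging rule and the border handling are assumptions chosen for brevity, not necessarily the authors' exact interpolation.

```python
def insert_vertical_seam(img, seam):
    """Widen the image by one column by adding an interpolated pixel per row.

    seam[y] is the column of the seam pixel in row y (assumed to come from a
    minimum-energy seam search, which is not shown here).
    """
    out = []
    for row, x in zip(img, seam):
        # Assumed rule: average the seam pixel with its right neighbor;
        # at the right border, fall back to duplicating the seam pixel.
        neighbor = row[x + 1] if x + 1 < len(row) else row[x]
        new_pixel = (row[x] + neighbor) / 2
        out.append(row[:x + 1] + [new_pixel] + row[x + 1:])
    return out
```

Because exactly one pixel is added to every row, the result stays rectangular, mirroring the removal case.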
We need a solution like the one
Shamir and Avidan explore here. However, there are two important issues
that must be addressed before such a
solution is exposed to the masses. First,
there must be real-time performance
(that is, real-time rendering of media).
Even if the algorithm is highly optimized, I would imagine it is difficult to
achieve real-time resizing for high-resolution images and HD videos. Shamir
and Avidan recommend precomputing
the resizing operations for the most
popular resolutions and aspect ratios
and storing the vertical and horizontal
seam index maps. The player or TV set
recognizes the display format, fetches
the relevant metadata,
and re-renders the original video appropriately. While this is a good idea, in
order for this solution to be practical,
there needs to be an efficient compression
scheme for the seam index maps,
especially for video. The other issue
is related to algorithmic robustness:
How can the intent, tenor, and attractiveness of media be preserved after it
has been resized? Can these qualities
be reliably codified? Human visual attention has been modeled to some extent (see, for example, the work of Itti,
Koch, and Niebur [1]), but is such a model
enough?
Independent of these questions,
as any computer vision scientist will
tell you, completely automatic vision
techniques are typically not foolproof.
All techniques come with assumptions
that may not be satisfied all the time.
Consumers may not be forgiving if
their video looks less than attractive
(a horde of people brandishing pitchforks
and torches comes to mind). Like it
or not, I believe we will need a human
in the loop for media customization. I
am a big proponent of interactive computer vision, that is, the concept of judiciously adding interaction to complement what can be automated. This is
acceptable in the context of media customization because it needs only to be
done once for each video. (Plus, it may
spawn a sizeable cottage industry.) The
trick, then, is to design an interface that
minimizes manual input. Shamir and
Avidan’s innovative algorithm should
be adapted to take into consideration
manual annotation to preserve the intent of the media.
I now look at my wide-screen TV and
wistfully think, if only it were possible to
customize media to displays now…
Reference
1. Itti, L., Koch, C., and Niebur, E. A model of saliency-based visual attention for rapid scene analysis.
IEEE Transactions on Pattern Analysis and Machine
Intelligence 20, 11 (Nov. 1998), 1254–1259.
Harry Shum (hshum@microsoft.com) is a Fellow of ACM
and a corporate vice president of Microsoft Corporation,
Redmond, WA.