billion videos on YouTube (and more
than 400 hours of new content being
uploaded every minute)—introduces further complexities in terms of both
process and technology.
Many researchers think the long-term
solution to monitoring online content
will inevitably involve artificial intelligence. Recent advances in big data, machine learning, and embedded Graphics
Processing Units (GPUs) are starting to
pave the way for more scalable approaches to computer vision, allowing neural
networks to identify emerging patterns
in user-generated content that may demand further human scrutiny.
“With machine learning, we can
now understand a lot more about
what’s going on in a video or an image,”
says Reza Zadeh, an adjunct professor
at Stanford University and founder and
CEO of Matroid, a Palo Alto, CA-based
computer vision software start-up that
is developing tools for video analysis.
Built atop TensorFlow (Google’s
open-source library for machine intelligence), Matroid uses a video player coupled with a so-called detector program
to identify similar images in a given video stream. For example, a Matroid detector could look for images of Donald
Trump across five hours of video—like a
few weeks’ worth of network news
broadcasts, or large volumes of YouTube videos—and pinpoint the spots
where those images appear. It can also
easily detect images containing gore or
violence, nudity and other forms of
NSFW (not safe for work) content, and
look for “more like this” elsewhere in
other video streams.
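A detector of this kind can be sketched generically: sample timestamped frames, score each with an image classifier, and collect the moments where the target concept appears. The Python sketch below is illustrative only; `run_detector` and the stand-in classifier are invented for this example and do not reflect Matroid's or TensorFlow's actual APIs.

```python
# Sketch of a video "detector": scan timestamped frames with an image
# classifier and report where a target concept appears. The classifier
# here is a stand-in callable; a real system would use a trained
# neural network.

def run_detector(frames, classifier, target_label, threshold=0.8):
    """frames: iterable of (timestamp_seconds, frame) pairs.
    classifier: callable mapping a frame to {label: confidence}.
    Returns timestamps where target_label scores at or above threshold."""
    hits = []
    for timestamp, frame in frames:
        scores = classifier(frame)
        if scores.get(target_label, 0.0) >= threshold:
            hits.append(timestamp)
    return hits

# Toy usage with fake frames and a fake classifier keyed on content.
fake_frames = [(0.0, "news"), (1.0, "nsfw"), (2.0, "news"), (3.0, "nsfw")]
fake_classifier = lambda frame: {"nsfw": 0.95 if frame == "nsfw" else 0.05}
print(run_detector(fake_frames, fake_classifier, "nsfw"))  # [1.0, 3.0]
```

The same loop works whether the target is a public figure's face or an NSFW category; only the trained classifier changes.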
The company offers a self-service
tool for non-technical users to train the
system to spot particular images, as well
as a more advanced version geared toward machine learning engineers that
enables them to edit the neural network
architecture, explore histograms, and
ultimately create their own detectors for
others to use.
While deep learning approaches are
yielding advances in analyzing videos
and other image-based content, the
wide variety in the type and quality of
video across different capture devices
poses additional obstacles.
“Applying machine learning techniques to analyzing video content works reasonably well when the right conditions exist,” says George Awad, project director of TRECVID, a U.S. National Institute of Standards and Technology (NIST)-sponsored project that evaluates video search engines and explores new approaches to content-based video retrieval. “The major challenges occur when dealing with user videos in the wild that are not professionally edited.”
Even if it were possible to identify
visual elements across all kinds of video
files with 100% accuracy, that alone
wouldn’t solve the problem of screening
for potentially objectionable material.
Much of the “content” of a video involves
spoken words, after all, or other contex-
tual cues that won’t be readily apparent
from simply identifying an image. It’s
notoriously difficult for computers to
distinguish news from satire, for exam-
ple—or an editorial opinion piece about
terrorism from a call to arms.
In order to automate the process of
screening video content at scale, researchers will likely need to apply natural language search techniques to begin parsing videos for deeper levels of
nuance. “The gap between what the
videos demonstrate and what an automated system would generate for a natural language description is still very
big and challenging,” says Awad.
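A deliberately naive sketch shows why that gap is wide: stitching detected labels into a template sentence yields a grammatical "description" with none of the nuance a human would supply. The function and labels below are made up for illustration.

```python
# Naive label-to-caption generation: technically a natural language
# description, but it captures only what objects appear, not what is
# actually happening in the video.

def describe_segment(labels):
    """Turn a set of detected object labels into a template caption."""
    if not labels:
        return "No recognizable objects detected."
    return "This segment contains: " + ", ".join(sorted(labels)) + "."

print(describe_segment({"person", "flag", "crowd"}))
# "This segment contains: crowd, flag, person." -- true, yet it cannot
# distinguish a news report about a protest from incitement to one.
```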
Looking ahead, Zadeh also sees plenty of opportunity on the hardware front, with semiconductor makers devising computer vision-capable chips that can work on devices like next-generation smartphones and cameras, self-driving cars, and a wide range of other video-capable devices throughout our homes and offices.
Whereas today, machine learning
happens primarily over the network—
with supercomputers in datacenters
analyzing big datasets stored in the
cloud—eventually some of those processes will migrate toward edge-layer devices. Over time, the task of identifying objectionable content may become more diffuse, as computer vision algorithms increasingly come pre-coded on chips embedded in these devices. “The more algorithms move to the source of data capture, the more challenging it will be to cope with real-world factors,” says Awad.
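That division of labor can be sketched as a simple triage: an on-device classifier clears or flags frames locally and escalates only the ambiguous ones for cloud-side review. The thresholds, labels, and `edge_filter` function below are illustrative assumptions, not any vendor's API.

```python
# Sketch of edge-layer screening: an on-chip classifier handles the
# clear-cut cases locally and forwards only uncertain frames to the
# datacenter, instead of streaming everything over the network.

def edge_filter(frames, on_device_classifier, clear_below=0.2, flag_above=0.8):
    """Partition frames into locally cleared, locally flagged, and
    uncertain ones that must be escalated to cloud-side review."""
    cleared, flagged, escalate = [], [], []
    for frame in frames:
        risk = on_device_classifier(frame)  # estimated probability of objectionable content
        if risk < clear_below:
            cleared.append(frame)
        elif risk >= flag_above:
            flagged.append(frame)
        else:
            escalate.append(frame)  # ambiguous: send to the datacenter
    return cleared, flagged, escalate

# Toy run with a stand-in classifier keyed on frame content.
risk_of = {"cat": 0.05, "gore": 0.95, "crowd": 0.5}
cleared, flagged, escalate = edge_filter(list(risk_of), risk_of.get)
print(flagged, escalate)  # ['gore'] ['crowd']
```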
Ultimately, the future of monitoring
digital content may have less to do with
policy-making and brute-force processing at the platform provider level,
and more to do with algorithmic filters
making their way into the devices all
around us—for better or worse.
As Zadeh puts it: “Computers are
opening their eyes.”
Alex Wright is a writer and researcher based in Brooklyn, NY.
© 2017 ACM 0001-0782/17/11 $15.00