contributed articles
DOI: 10.1145/1400214.1400231
How he helped develop the SkyServer,
delivering computation directly to terabytes
of astronomical data.
BY ALEXANDER S. SZALAY
Jim Gray,
astronomer
Jim Gray worked with astronomers for more than
a decade, right up to the time he went missing in
2007. My collaboration with him created some of the
world’s largest astronomy databases and enabled us
to test many unorthodox data-management ideas in
practice. The astronomers collaborating with us have
continued to be very receptive to them, embracing
Jim as a card-carrying member of their community.
Jim’s contributions have left a permanent mark on
astronomy worldwide, as well as on e-science
in general.
Astronomy data has doubled in size every year
for the past 20 years, due mostly to the emergence
of electronic sensors. The largest sky survey of the
past decade, the Sloan Digital Sky Survey, or SDSS
(www.sdss.org), is often called the cosmic genome
project. When it began in 1992, the size of the data
set to be used for scientific analysis was measured in
terabytes, shockingly large for the time. My group at
Johns Hopkins University was selected by the SDSS
Collaboration to build the science archive for the
SDSS, a task we quickly realized would
require a powerful search engine with
spatial search capabilities. Our experimental system, based on object-oriented technologies, was good enough
to develop an understanding of how
the eventual system should function,
though we knew we would also need to
do something different, most notably
in terms of query performance.
One SDSS collaboration meeting
in the mid-1990s took me to Seattle
where I had dinner with Charles Simonyi, then at Microsoft, who recognized
the similarities between our problem
and the Microsoft TerraServer (www.terraserver.com), which provides free
online access to U.S. Geological Survey
digital aerial photographs, and immediately called Jim to arrange a meeting.
A few weeks later I flew to San Francisco and visited him at the Bay Area
Research Center. Thus began a lively
discussion about the TerraServer, how
it could be turned inside out for a new
(astronomical) purpose, and how spatial searches over the Earth were both
similar to and different from spatial
searches over the sky. We spent a full
day dissecting the problem.
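One way to picture the spatial searches discussed above is a "cone search": find every object within a given angular radius of a point on the sphere. The following is a minimal sketch under stated assumptions, not the SkyServer's or TerraServer's actual code. The function names, the dictionary-based catalog, and the sample coordinates are all hypothetical, and the brute-force scan stands in for the spatial index a real archive would need. The same great-circle formula applies to latitude/longitude on the Earth and to right ascension/declination on the sky.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle angle in degrees between two points given in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form: numerically stable for the small angles typical of sky searches.
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(objects, ra0, dec0, radius_deg):
    """Return the objects within radius_deg of (ra0, dec0), by scanning every object."""
    return [o for o in objects
            if angular_separation_deg(ra0, dec0, o["ra"], o["dec"]) <= radius_deg]


# Hypothetical toy catalog; coordinates are illustrative only.
catalog = [
    {"name": "a", "ra": 10.0, "dec": 20.0},
    {"name": "b", "ra": 10.05, "dec": 20.02},
    {"name": "c", "ra": 190.0, "dec": -20.0},
    {"name": "d", "ra": 359.99, "dec": 0.0},
]

nearby = cone_search(catalog, 10.0, 20.0, 0.1)
```

Because the formula works on angles rather than raw coordinate differences, a search centered near right ascension 0 still finds objects just below 360 degrees. A full scan like this costs time proportional to the catalog size, which is why query performance becomes the central concern at terabyte scale.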
Jim asked about our “20 queries,”
his incisive way of learning about an
application and a deceptively simple
way to jump-start a dialogue between
him (a database expert) and me (an
astronomer, or any scientist). Jim said,
“Give me your 20 most important questions you would like to ask of your data
system and I will design the system for
you.” It was amazing to watch how well
this simple heuristic approach, combined with Jim’s imagination, worked
to produce quick results.
Jim then came to Baltimore to look
over our computer room and within
30 seconds declared, with a grin, that we
had the wrong database layout. My
colleagues and I were stunned. Jim
explained later that he listened to
the sounds the machines were making as they operated; the disks rattled
too much, telling him there was too
much random disk access. We began
mapping SDSS database hardware re-