the join, 10 one of the finest examples
of SQL wizardry I have ever seen. He
worked with Maria Nieto-Santisteban
of Johns Hopkins to create parallel implementations of this cross-matching
operation across many servers; performance is nothing short of stunning. 6, 12
These ideas are the basis of the next-generation SkyQuery engine we are
building today.
It was around 2001 that astronomers began to explore the idea of a U.S.
National Virtual Observatory (www.us-vo.org/). 10, 21 Given that most
of the world’s astronomy data is public (“worthless”) and online, the time
seemed right to develop a framework
where all of it would appear as part of
a single system. Jim was an enthusiastic supporter of the idea and an active participant in all the discussions
about its design. His ideas are still at
the heart of its service-based architecture. His advice helped us avoid many
computational and design pitfalls we
would undoubtedly have fallen into.
He helped many different groups from
around the world bring their data into
databases; his astronomy collaborators are found everywhere, from Edinburgh to Beijing, Pasadena, Munich,
and Budapest. He bought several
sneakernet boxes, inexpensive servers
that travel the world as a low-cost
way to transport data, and was highly
amused by the fact that in spite of
the delays due to postal services and
customs checks the bandwidth still
exceeds that of the scientific world’s
high-speed networks. 11
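The arithmetic behind that observation is easy to sketch. The function below is only an illustration: the payload size and transit time are made-up figures, not measurements from the project, and the function name is hypothetical.

```python
def effective_bandwidth_mbps(payload_terabytes: float, transit_days: float) -> float:
    """Average throughput, in megabits per second, of a box shipped door to door."""
    bits = payload_terabytes * 1e12 * 8   # decimal terabytes converted to bits
    seconds = transit_days * 24 * 3600    # total transit time, postal and customs delays included
    return bits / seconds / 1e6


# Hypothetical figures, chosen only for illustration: 3 TB of disks, one week in transit.
shipped = effective_bandwidth_mbps(payload_terabytes=3.0, transit_days=7.0)
print(f"shipped box: {shipped:.0f} Mbit/s sustained")
```

With these assumed numbers the box averages roughly 40 Mbit/s around the clock for a full week, and the figure scales linearly with the amount of disk packed into it.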
The SkyServer also turned out to be
a groundbreaking exercise in publishing and curating digital scientific data.
We learned that once a data set is released, it cannot be changed and must
be treated like an edition of a printed
book, in the sense that one would not
destroy an old copy just because a new
one appears on the shelves. To date,
we carry forward all the old releases of
SDSS data.
We also aimed to capture all relevant information in the database. We
created a framework for automatically
supporting physical units and descriptions by the database, using markup
tags in the comments of our SQL
scripts. We recently (2008) archived all
email sent during the project in a free-text searchable database.
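The idea of keeping units and descriptions in markup tags inside SQL comments can be sketched as follows. The text does not give the tag syntax, so everything here is an assumption for illustration: the `--/U` (unit) and `--/D` (description) tags, the sample table, and the function name `parse_column_metadata`.

```python
import re

# Assumed tag syntax: "--/U <unit>" and "--/D <description>" in a trailing comment.
TAG = re.compile(r"--/(?P<tag>[UD])\s+(?P<value>.*?)(?=\s*--/|$)")
# A column definition is taken to start with "<name> <type>".
COLUMN = re.compile(r"^\s*(?P<name>\w+)\s+(?P<type>\w+)")

# Illustrative schema script, not copied from the actual archive.
SAMPLE = """
CREATE TABLE PhotoObj (
    objID bigint NOT NULL,      --/D Unique object identifier
    ra float NOT NULL,          --/U deg --/D Right ascension
    modelMag_r real NOT NULL    --/U mag --/D Model magnitude, r band
);
"""


def parse_column_metadata(sql: str) -> dict:
    """Collect the type, unit, and description of every tagged column."""
    columns = {}
    for line in sql.splitlines():
        start = line.find("--/")
        if start < 0:
            continue  # no markup tags on this line
        col = COLUMN.match(line[:start])
        if col is None:
            continue  # tagged comment that is not attached to a column
        tags = {m.group("tag"): m.group("value").strip()
                for m in TAG.finditer(line[start:])}
        columns[col.group("name")] = {
            "type": col.group("type"),
            "unit": tags.get("U"),          # None when no unit tag is present
            "description": tags.get("D"),
        }
    return columns


for name, info in parse_column_metadata(SAMPLE).items():
    print(name, info)
```

Because the metadata lives next to the column definition, a single pass over the schema scripts is enough to load it into documentation tables, which is the kind of automation the framework was after.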
We were indeed anxious to see how
scientists would interact with the database. Analyses, we knew, must be
done as close to the data as possible,
but it is also difficult to allow general
users to create and run their own functions inside a shared, public database.
Nolan Li, a graduate student at Johns
Hopkins, and Wil O’Mullane, a senior
programmer in the Johns Hopkins
SDSS group, proposed giving users
their own server-side databases (called
MyDB/CasJobs) where they could do
anything yet still link to the main database as well. Jim embraced the idea
and was instrumental in turning it into
a generic dataspace. 13
[Figure 4: Aggregate SkyServer monthly traffic, 2001–2006, when the number of Web hits doubled each year. Plot title: Traffic by Month; series: hits and page views; vertical axis from 1.e+5 to 1.e+7; horizontal axis from 2001/4 to 2006/4.]
Over the years, we also noticed another interesting user pattern. Even
though the MyDB interface gave users who wanted to run long jobs a way
around our five-minute timeouts for
anonymous queries, many astronomers and non-astronomers alike
were writing Python and Perl crawlers
where a simple query template was repeatedly submitted with a different set
of parameters, occasionally leading to
problems.
In one case someone was submitting a poorly written query every 10 seconds,
and each run took more than 10 seconds to execute.
As a result, the requests kept piling
up, and the server became severely
overloaded. Once we noticed this odd behavior and isolated the
“guilty” query, Jim quickly modified
the stored procedure that executed
the user-written free-form SQL queries. He put in a statement conditional
on the IP address of the user running
the particular robot script, so, for that
user alone, the query would not be executed and the procedure would instead return the message:
“Please contact Jim Gray at the following email address:…” The queries
stopped immediately. We later learned
they were coming from a CS graduate
student in Tokyo who had the shock of
his life from Jim’s email, which (for a
student of CS) must have sounded like
the voice of God. Jim followed up and
sent the student an email that said: “It
is OK to use the system and OK to send
an email.”
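Both halves of this story can be sketched in a few lines. This is not the stored procedure itself, which lived in the database; it is a minimal Python illustration under stated assumptions. The first function models a single worker that runs queries one at a time, which is a simplification of a real server. The second shows an IP-conditional gate. The blocked address (taken from a range reserved for documentation), the message text, and all function names are hypothetical.

```python
from typing import Callable


def pending_queries(elapsed_s: float, submit_every_s: float, run_time_s: float) -> int:
    """Count queries submitted but unfinished after elapsed_s seconds.

    Simplified model: one worker, first come first served.
    """
    free_at = 0.0   # time at which the worker finishes its current backlog
    pending = 0
    t = 0.0         # submission time of the next query
    while t <= elapsed_s:
        start = max(t, free_at)         # wait until the worker is free
        free_at = start + run_time_s
        if free_at > elapsed_s:
            pending += 1                # still queued or running at the cutoff
        t += submit_every_s
    return pending


# Hypothetical block list: client address mapped to the message returned instead of results.
BLOCKED_MESSAGES = {"203.0.113.7": "Please contact the database administrator."}


def run_user_query(client_ip: str, sql: str, execute: Callable[[str], object]) -> object:
    """Run free-form SQL unless the client is on the block list."""
    message = BLOCKED_MESSAGES.get(client_ip)
    if message is not None:
        return message      # the query is never executed for this client
    return execute(sql)


# A query every 10 s that needs 15 s (an assumed run time), observed after 10 minutes.
print(pending_queries(elapsed_s=600, submit_every_s=10, run_time_s=15))
```

In this model the backlog grows without bound whenever the run time exceeds the submission interval, while the gate costs every other user only a single dictionary lookup per query.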
We logged all traffic from day one
and were amazed to see how it grew
(see Figure 4) and how a New York
Times article on a new SDSS result
caused a huge spike in user traffic. It
was gratifying to see that afterward the
traffic continued to stay higher than
before, indicating that many people,
astronomers and non-astronomers
alike, liked what they saw. Our analysis
of SkyServer traffic found that most of
the one million users were non-astronomers and that there is a power law
with no obvious breaks in any of the