the join, one of the finest examples of SQL wizardry I have ever seen. 10 He worked with Maria Nieto-Santisteban of Johns Hopkins to create parallel implementations of this cross-matching operation across many servers; the performance is nothing short of stunning. 6, 12 These ideas are the basis of the next-generation SkyQuery engine we are building today.

It was around 2001 that astronomers began to explore the idea of a U.S. National Virtual Observatory (www.us-vo.org/). 10, 21 Given that most of the world’s astronomy data is public (“worthless”) and online, the time seemed right to develop a framework in which all of it would appear as part of a single system. Jim was an enthusiastic supporter of the idea and an active participant in all the discussions about its design. His ideas are still at the heart of its service-based architecture. His advice helped us avoid many computational and design pitfalls we would undoubtedly have fallen into. He helped many different groups from around the world bring their data into databases; his astronomy collaborators are found everywhere, from Edinburgh to Beijing, Pasadena, Munich, and Budapest. He bought several sneakernet boxes, inexpensive servers that travel the world as a cheap way to transport data, and was highly amused by the fact that, in spite of the delays due to postal services and customs checks, the bandwidth still exceeds that of the scientific world’s high-speed networks. 11
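The sneakernet comparison comes down to simple arithmetic: the effective bandwidth of a shipped box is its payload divided by the door-to-door transit time. The sketch below illustrates that calculation only. The 3 TB payload and five-day transit are hypothetical numbers chosen for illustration, not figures from the project, and the function name is ours.

```python
def effective_bandwidth_bps(payload_bytes: int, transit_seconds: int) -> float:
    """Average bits per second delivered by shipping `payload_bytes` in `transit_seconds`."""
    if transit_seconds <= 0:
        raise ValueError("transit_seconds must be positive")
    return payload_bytes * 8 / transit_seconds


# Hypothetical example: a 3 TB box that spends five days in the mail and in customs.
box_bytes = 3 * 10**12
transit = 5 * 24 * 3600  # five days, in seconds

shipped_bps = effective_bandwidth_bps(box_bytes, transit)
print(f"{shipped_bps / 1e6:.1f} Mbit/s sustained")  # prints "55.6 Mbit/s sustained"
```

Under these assumed numbers the box delivers a sustained rate that has to be compared with the throughput a network link actually achieves end to end over the same five days, not with its nominal peak speed.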

The SkyServer also turned out to be a groundbreaking exercise in publishing and curating digital scientific data. We learned that once a data set is released, it cannot be changed and must be treated like an edition of a printed book, in the sense that one would not destroy an old copy just because a new one appears on the shelves. To date, we carry forward all the old releases of SDSS data.

We also aimed to capture all relevant information in the database. We created a framework for automatically supporting physical units and descriptions by the database, using markup tags in the comments of our SQL scripts. We recently (2008) archived all email sent during the project in a free-text searchable database.
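As a rough illustration of how markup tags in SQL comments can carry units and descriptions, the sketch below extracts them from a schema script. The tag syntax (`--/U` for a unit, `--/D` for a description), the example table, and the function name are assumptions made for this example; the text above does not specify the format the project used.

```python
import re

# Assumed tag syntax: "--/X value", where X is a single capital letter.
# A value runs until the next tag or the end of the line.
TAG_RE = re.compile(r"--/(?P<tag>[A-Z])\s+(?P<value>.*?)(?=\s*--/|$)")
# A column definition starts with a name followed by a type.
COLUMN_RE = re.compile(r"^\s*(?P<name>\w+)\s+(?P<type>\w+)")


def parse_column_metadata(sql_script: str) -> dict:
    """Map each tagged column to its type, unit, and description."""
    columns = {}
    for line in sql_script.splitlines():
        if "--/" not in line:
            continue  # no markup tags on this line
        code, _, comment = line.partition("--")
        match = COLUMN_RE.match(code)
        if match is None:
            continue  # a comment-only line, not a column definition
        tags = {
            m.group("tag"): m.group("value").strip()
            for m in TAG_RE.finditer("--" + comment)
        }
        columns[match.group("name")] = {
            "type": match.group("type"),
            "unit": tags.get("U"),
            "description": tags.get("D"),
        }
    return columns


# Hypothetical schema fragment used only to exercise the parser.
example_script = """CREATE TABLE PhotoObj (
    ra float NOT NULL, --/D Right ascension --/U deg
    dec float NOT NULL, --/D Declination --/U deg
    objID bigint NOT NULL --/D Unique identifier
)"""

for name, info in parse_column_metadata(example_script).items():
    print(name, info)
```

Keeping the metadata in the same script that creates the table means the documentation and the schema cannot drift apart, which is the appeal of the approach described above.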

We were indeed anxious to see how scientists would interact with the database. Analyses, we knew, must be done as close to the data as possible, but it is also difficult to allow general users to create and run their own functions inside a shared, public database. Nolan Li, a graduate student at Johns Hopkins, and Wil O’Mullane, a senior programmer in the Johns Hopkins SDSS group, proposed giving users their own server-side databases (called MyDB/CasJobs) where they could do anything yet still link to the main database as well. Jim embraced the idea and was instrumental in turning it into a generic dataspace. 13

Figure 4: Aggregate SkyServer monthly traffic, 2001–2006, when the number of Web hits doubled each year. (Plot “Traffic by Month”: hits and page views on a logarithmic axis from 1.e+5 to 1.e+7, April 2001 to April 2006.)

Over the years, we also noticed another interesting user pattern. Even though the MyDB interface gave users who wanted to run long jobs a way around our five-minute timeouts for anonymous queries, many astronomers and non-astronomers alike were writing Python and Perl crawlers where a simple query template was repeatedly submitted with a different set of parameters, occasionally leading to problems.

In one case someone was submitting a poorly optimized query every 10 seconds, and each run took more than 10 seconds to execute. As a result, the requests kept piling up, and the server became extremely overloaded. Once we had noticed this odd behavior and identified and isolated the “guilty” query, Jim quickly modified the stored procedure that executed the user-written free-form SQL queries. He put in a statement conditional on the IP address of the user running the particular robot script, so that, for that user alone, the query would not be executed but would instead return the message: “Please contact Jim Gray at the following email address:…” The queries stopped immediately. We later learned they were coming from a CS graduate student in Tokyo who had the shock of his life from Jim’s email, which (for a student of CS) must have sounded like the voice of God. Jim followed up and sent the student an email that said: “It is OK to use the system and OK to send an email.”
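The pile-up in this anecdote follows from simple queueing arithmetic: when each query takes longer to run than the interval between submissions, the backlog grows without bound. The sketch below is a deliberately simplified model, assuming a single worker that runs queries one at a time in arrival order; the real server's scheduling is not described here, and the 12-second and 8-second run times are hypothetical.

```python
def outstanding_requests(arrival_interval: int, service_time: int, horizon: int) -> int:
    """Count requests still unfinished at time `horizon` (all times in whole seconds).

    Model (an assumption for illustration): one request arrives every
    `arrival_interval` seconds starting at t=0, and a single worker serves
    them first-in, first-out, taking `service_time` seconds each.
    """
    arrived = 0
    completed = 0
    worker_free_at = 0  # time at which the worker finishes its current request
    for arrival in range(0, horizon + 1, arrival_interval):
        arrived += 1
        start = max(arrival, worker_free_at)  # wait if the worker is still busy
        worker_free_at = start + service_time
        if worker_free_at <= horizon:
            completed += 1
    return arrived - completed


# A query submitted every 10 s that takes 12 s (hypothetical) falls further behind each minute.
print(outstanding_requests(10, 12, 600))   # prints 11
print(outstanding_requests(10, 12, 3600))  # prints 61

# The same schedule with an 8 s query never accumulates more than the request in flight.
print(outstanding_requests(10, 8, 3600))   # prints 1
```

In this model the backlog grows linearly with time once the run time exceeds the submission interval, which is why refusing the requests at the stored-procedure level relieved the server immediately.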

We logged all traffic from day one and were amazed to see how it grew (see Figure 4) and how a New York Times article on a new SDSS result caused a huge spike in user traffic. It was gratifying to see that afterward the traffic continued to stay higher than before, indicating that many people, astronomers and non-astronomers alike, liked what they saw. Our analysis of SkyServer traffic found that most of the one million users were non-astronomers and that there is a power law with no obvious breaks in any of the
