handle more than 15,000 messages per
second, and several such services are
typically used in parallel. This ensures
that communication among the agents
is highly reliable, even at very high message-passing rates.
The set of agents is also used to create a global path or tree, as it knows the
state and performance of each local
and wide area network link, and the
state of the cross connections in each
switch. The routing algorithm provides
global optimization by considering the
“cost” of each link or cross-connect.
This makes the optimization algorithm
capable of being adapted to handle
various policies on priorities and prereservation schemes. The time to determine and construct an optical path (or
a multicast tree) end to end is typically
less than one second, independent of
the number of links along the path and
the overall length of the path. If network errors are detected, an alternative
path is set up rapidly enough to avoid
a TCP timeout, so that data transport
continues uninterrupted.
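The cost-based path selection described above can be illustrated as a shortest-path search over weighted links. The Java sketch below is not the MonALISA implementation, and the node names and link costs are purely illustrative; it shows only the core idea of minimizing the total "cost" along a path:

```java
import java.util.*;

// Illustrative sketch: global path selection as Dijkstra's algorithm
// over per-link costs. Not the MonALISA code; names are hypothetical.
public class PathFinder {
    // adjacency: node -> (neighbor -> link cost)
    private final Map<String, Map<String, Double>> links = new HashMap<>();

    public void addLink(String a, String b, double cost) {
        links.computeIfAbsent(a, k -> new HashMap<>()).put(b, cost);
        links.computeIfAbsent(b, k -> new HashMap<>()).put(a, cost);
    }

    /** Returns the minimum-cost node sequence from src to dst,
     *  or an empty list if dst is unreachable. */
    public List<String> cheapestPath(String src, String dst) {
        Map<String, Double> dist = new HashMap<>();
        Map<String, String> prev = new HashMap<>();
        PriorityQueue<Object[]> pq =
            new PriorityQueue<>(Comparator.comparingDouble((Object[] e) -> (double) e[0]));
        dist.put(src, 0.0);
        pq.add(new Object[]{0.0, src});
        while (!pq.isEmpty()) {
            Object[] top = pq.poll();
            double d = (double) top[0];
            String u = (String) top[1];
            // Skip stale queue entries for nodes already settled cheaper.
            if (d > dist.getOrDefault(u, Double.POSITIVE_INFINITY)) continue;
            for (Map.Entry<String, Double> e :
                     links.getOrDefault(u, Collections.emptyMap()).entrySet()) {
                double nd = d + e.getValue();
                if (nd < dist.getOrDefault(e.getKey(), Double.POSITIVE_INFINITY)) {
                    dist.put(e.getKey(), nd);
                    prev.put(e.getKey(), u);
                    pq.add(new Object[]{nd, e.getKey()});
                }
            }
        }
        if (!dist.containsKey(dst)) return Collections.emptyList();
        LinkedList<String> path = new LinkedList<>();
        for (String at = dst; at != null; at = prev.get(at)) path.addFirst(at);
        return path;
    }

    public static void main(String[] args) {
        PathFinder pf = new PathFinder();
        pf.addLink("CERN", "Chicago", 3.0);    // costs are illustrative
        pf.addLink("Chicago", "Caltech", 2.0);
        pf.addLink("CERN", "NewYork", 1.0);
        pf.addLink("NewYork", "Caltech", 5.0);
        System.out.println(pf.cheapestPath("CERN", "Caltech"));
        // prints [CERN, Chicago, Caltech]
    }
}
```

Because the "cost" is an arbitrary number per link, the same search can encode priorities or prereservation policies simply by adjusting the weights.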
The most laborious part of developing such global services, which control connectivity in the WAN, is handling communication errors. Parts of our environment are in
hybrid networks—some in research
or dedicated networks only and some
reachable from both academic and
commercial networks. Most of the
time everything works as expected and
problems do not occur very frequently.
When they do occur, however, it is important to understand what is happening before acting on it. In particular, we would like to discuss two cases of asymmetry in the system. When the asymmetry exists only at the routing level, both sides of a communication can reach each other, but over different routes. This affects the throughput and reliability of the communication, but it is not hard to detect and is usually easy to recover from.
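A routing-level asymmetry of this kind can be detected by comparing the forward hop sequence between two endpoints with the return sequence traversed in reverse, for example from traceroute-style measurements collected by the monitoring agents. The sketch below is illustrative only, and the hop names are hypothetical:

```java
import java.util.*;

// Illustrative sketch: flag routing asymmetry by checking whether the
// return path, read backward, matches the forward path. Hop names are
// hypothetical; real data would come from monitoring measurements.
public class RouteAsymmetry {
    public static boolean isAsymmetric(List<String> forward, List<String> reverse) {
        List<String> rev = new ArrayList<>(reverse);
        Collections.reverse(rev);
        return !forward.equals(rev);
    }

    public static void main(String[] args) {
        List<String> fwd = Arrays.asList("geneva-gw", "amsterdam", "newyork", "chicago");
        List<String> ret = Arrays.asList("chicago", "kansas", "amsterdam", "geneva-gw");
        System.out.println(isAsymmetric(fwd, ret));  // prints true: different return route
    }
}
```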
A more serious problem occurs when different parts of the distributed framework involved in decision making
have different views of the system. We
had a case where some services in Europe could not reach the services in
the U.S., while at the same time other services could see all the others. When
you have a partial but consistent view
of the system, you can act locally, but
in this case we reached the conclusion
that the best approach was to stay on
the safe side and not make any decisions. Such problems do not occur frequently in our environment, but it is
really difficult to detect them and avoid
making wrong decisions for the types
of systems we described.
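The conservative rule we settled on, acting only on a consistent view, can be expressed compactly. In the hypothetical sketch below, a decision is allowed only when every reporting peer sees the same set of services; on divergent views, the safe choice is to do nothing:

```java
import java.util.*;

// Illustrative sketch: before acting on global state, require that all
// reporting peers agree on the set of visible services. The peer and
// view types are simplified assumptions, not the MonALISA data model.
public class ViewGuard {
    /** viewsByPeer maps each peer to the set of services it can see. */
    public static boolean safeToAct(Map<String, Set<String>> viewsByPeer) {
        // One distinct view (or none) means all peers agree.
        return new HashSet<>(viewsByPeer.values()).size() <= 1;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> views = new HashMap<>();
        views.put("peer-EU", new HashSet<>(Arrays.asList("svcA", "svcB")));
        views.put("peer-US", new HashSet<>(Arrays.asList("svcA", "svcB", "svcC")));
        System.out.println(safeToAct(views));  // prints false: views diverge, make no decision
    }
}
```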
Conclusion
During the past seven years we have
been developing a monitoring platform that provides the functionality to
acquire, process, analyze, and create
hierarchical structures for information on the fly in a large distributed
environment. The system is based on principles that provide scalability and reliability while easing communication among the distributed entities. This approach to collecting any
type of monitoring information in such
a flexible distributed framework can be
used in further developments to help
operate and efficiently use distributed
computing facilities.
It is fair to say that at the beginning of this project we underestimated
some of the potential problems in developing a large distributed system in
WAN, and indeed the “eight fallacies
of distributed computing” are very important lessons.7
The distributed architecture we
used, without single points of failure,
proved to offer a reliable distributed
service system. In round-the-clock
operation over the past five years we
never had a breakdown of the entire
system. Replicated major services in
several academic centers successfully
handled major network breakdowns
and outages.
As of this writing, more than 350
MonALISA services are running around
the clock throughout the world. These
services monitor more than 20,000
compute servers, hundreds of WAN
links, and tens of thousands of concurrent jobs. Over 1.5 million parameters
are monitored in near real time with an
aggregate update rate of approximately
25,000 parameters per second. Global
MonALISA repositories are used by
many communities to aggregate information from many sites, properly
organize them for the users, and keep
long-term histories. During the past
year, the repository system served more
than 8 million user requests.
Related articles
on queue.acm.org
Monitoring, at Your Service
Bill Hoffman
http://queue.acm.org/detail.cfm?id=1113335
Modern Performance Monitoring
Mark Purdy
http://queue.acm.org/detail.cfm?id=1117404
Web Services and IT Management
Pankaj Kumar
http://queue.acm.org/detail.cfm?id=1080876
References
1. ALICE Collaboration; http://aliceinfo.cern.ch/
Collaboration.
2. Caltech High Energy Physics. High Energy Physicists Set New Record for Network Data Transfer. Supercomputing 2007 (Reno); http://media.caltech.edu/press_releases/13073.
3. Caltech High Energy Physics. High Energy Physics Team Sets New Data-Transfer World Records. Supercomputing 2008 (Austin); http://media.caltech.edu/press_releases/13216.
4. CERN; http://www.cern.ch.
5. Default TCP implementation in Linux 2.6 kernels;
http://netsrv.csc.ncsu.edu/twiki/bin/view/Main/BIC.
6. EVO Collaboration Network; http://evo.caltech.edu.
7. Fallacies of distributed computing; http://
en.wikipedia.org/wiki/Fallacies_of_Distributed_
Computing.
8. HPCWire. Physicists Set Record for Network Data Transfer. Supercomputing 2006; http://www.hpcwire.com/topic/networks/17889729.html.
9. Java; http://java.sun.com/.
10. Jini; http://www.jini.org/.
11. MonALISA; http://monalisa.caltech.edu.
12. MonALISA Repository for ALICE; http://pcalimonitor.cern.ch.
13. Voicu, R., Legrand, I., Newman, H., Dobre, C., and Tapus, N. A distributed agent system for dynamic optical path provisioning. In Proceedings of Intelligent Systems and Agents (ISA), part of the IADIS Multi Conference on Computer Science and Information Systems (Lisbon, 2007).
14. WS-Notification; http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=wsn.
15. XDR; http://en.wikipedia.org/wiki/External_Data_
Representation.
Iosif Legrand is a senior research engineer at Caltech
and the technical lead of the MonALISA project. He
has worked for more than 16 years in high-performance
computing, algorithms, modeling and simulation, and control and optimization for distributed systems.
Ramiro Voicu is a research engineer at Caltech working
for USLHCNet at CERN, where he was a Marie Curie
fellow. His research interests include global optimization
in distributed systems and high-performance data
transfers.
Catalin Cirstoiu is a software engineer in the finance
industry. He works on parallel and distributed systems,
focusing on reliability, optimizations, and high-performance issues.
Costin Grigoras is a software engineer in ALICE at
CERN, where he is a fellow. His research interests
include distributed systems monitoring and automated
decision making.
Latchezar Betev is working in the offline team of the
ALICE collaboration at CERN and is responsible for the
operation of the Grid infrastructure of the experiment.
His main interests include large-scale distributed
computing, monitoring, and control of remote systems.
Alexandru Costan is a Ph.D. student and teaching
assistant at the computer science department of
the University Politehnica of Bucharest. His research
interests include grid computing, data storage and
modeling, and P2P systems.