puting center that runs the management software for the local resources.
The same node is also running a MonALISA service that collects monitoring
information from all computing nodes,
storage systems, data-transfer applications, and software running in the local
cluster. This yields more than 1. 1 million parameters published in MonALISA, each with an update frequency of
one minute. Moreover, ALICE-specific
filters aggregate the raw parameters to
produce system-overview parameters
in real time. These higher-level values
are usually collected, stored, and displayed in the central MonALISA Repository for ALICE12 (see Figure 3) and are
the fuel for taking automatic actions.
In this particular case we have managed to reduce the data volume to only
about 35,000 parameters by aggregating, for example, the entire CPU usage
by all the jobs in the local cluster in a
single parameter, by summing up network traffic on all machines, and so on.
These overviews are usually enough to
identify problems and take global actions in the system, and they can be
stored in a central database for long-term archival and analysis. The details
are available on demand on the origi-
nating site and can be consulted with
the GUI client. This approach proved
to be very useful for debugging purposes—for example, to track the behavior
of a particular application or host.
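The aggregation step described above can be sketched as a simple filter that collapses per-host raw parameters into per-cluster overview values. This is an illustrative sketch, not MonALISA's actual filter API; the tuple layout and metric names are assumptions.

```python
from collections import defaultdict

def aggregate_overview(raw_params):
    """Collapse per-host raw parameters into per-cluster overview values.

    raw_params: iterable of (cluster, host, metric, value) tuples, as a
    MonALISA-style filter might receive them (layout is illustrative).
    """
    totals = defaultdict(float)
    for cluster, host, metric, value in raw_params:
        # Sum the same metric across all hosts of a cluster, e.g. the
        # total CPU usage of all jobs, or aggregate network traffic.
        totals[(cluster, metric)] += value
    return dict(totals)

# Four raw parameters from two hosts reduce to two overview parameters.
raw = [
    ("SiteA", "node01", "cpu_usage", 3.5),
    ("SiteA", "node02", "cpu_usage", 2.5),
    ("SiteA", "node01", "net_out_mbps", 120.0),
    ("SiteA", "node02", "net_out_mbps", 80.0),
]
overview = aggregate_overview(raw)
```

The same pattern scales the 1.1 million raw parameters down to the roughly 35,000 overview parameters mentioned above: the per-host detail stays at the originating site, while only the sums travel to the central repository.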
The ALICE computing model
matched closely with MonALISA’s architecture so the pieces fit naturally
together, but it also provided a great opportunity to fulfill the project's initial
goal: using the monitoring data to improve the observed system. Indeed, the
actions framework implemented within MonALISA represents the first step
toward the automation of decisions
that can be made based on the monitoring information. It is worth noting that
actions can be used in two key points:
locally, close to the data source (in the
MonALISA service) where simple actions can be taken; and globally, in a
MonALISA client where the logic for
triggering the action can be more sophisticated, as it can depend on several
flows of data. Hence, the central client is equipped with several decision-making agents that help in operating
this complex system: restarting remote
services when they don’t pass the functional tests, sending e-mail alerts or instant messages when automatic restart
procedures do not fix the problems,
coordinating network-bandwidth tests
between pairs of remote sites, managing the DNS-based load balancing of
the central machines, and automatically executing standard applications
when CPU resources are idle.
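The escalation logic of such a decision-making agent, restart first, alert a human only when restarts do not help, can be sketched as follows. The function and callback names are hypothetical; they stand in for the repository's real test, restart, and alerting mechanisms.

```python
def supervise(service, run_test, restart, send_alert, max_restarts=3):
    """Restart a remote service that fails its functional test; escalate
    to an e-mail/instant-message alert only if restarts do not fix it.
    All callbacks are illustrative stand-ins, not a real MonALISA API."""
    for _ in range(max_restarts):
        if run_test(service):
            return "ok"          # functional test passes, nothing to do
        restart(service)         # automatic remedy, tried first
    if run_test(service):
        return "recovered"       # last restart fixed the problem
    send_alert(f"{service}: still failing after {max_restarts} restarts")
    return "alerted"             # automation gave up, a human takes over

# Demo with stand-in callbacks: the service recovers after two restarts.
restarts = []
alerts = []
outcome = supervise(
    "remote_service",
    run_test=lambda s: len(restarts) >= 2,
    restart=lambda s: restarts.append(s),
    send_alert=lambda msg: alerts.append(msg),
)
```

Keeping the simple remedy local and automatic, and escalating only unresolved failures, is what lets a small operations team run a system of this size.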
The actions framework has become
a key component of the ALICE grid.
Apart from monitoring the state of the
various grid components and alerting
the appropriate people to any problems
that occur during the operation, this
framework is also used to automate the
processes. One such automation takes
care of generating Monte Carlo data
that simulates the experiment’s behavior or analyzes the data. In normal cases
jobs run for 10 to 12 hours and generate or analyze files on the order of 10GB
each. ALICE jobs, however, can fail for
a number of reasons: among the most
frequent are network issues and local
machine, storage, or central services
problems. By continuously monitoring
the central task queue for production
jobs, the MonALISA repository takes action when the number of waiting jobs
goes below the preset threshold (4,000
jobs at the moment). First, it looks to
see whether any failed jobs can be re-
Figure 4. EVO overlay network topology—a dynamic minimum spanning tree. (Diagram of reflector sites with per-link quality values; not reproduced here.)
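The task-queue automation described above can be sketched as a periodic check against the preset threshold. The callback names are hypothetical, and the assumption that recoverable failed jobs are resubmitted before new production jobs are created is an illustration, not AliEn's actual interface.

```python
JOB_THRESHOLD = 4000  # preset threshold for waiting production jobs

def check_task_queue(waiting_jobs, failed_jobs, resubmit, submit_new):
    """Top up the central production task queue when it runs low.

    `resubmit` and `submit_new` are hypothetical callbacks; per the text,
    recoverable failed jobs are examined first when the queue is short.
    Returns the number of jobs added in this check.
    """
    if waiting_jobs >= JOB_THRESHOLD:
        return 0                       # queue is healthy, nothing to do
    deficit = JOB_THRESHOLD - waiting_jobs
    # First, try to recover failed jobs (assumed to be resubmitted).
    recovered = min(len(failed_jobs), deficit)
    for job in failed_jobs[:recovered]:
        resubmit(job)
    # Then fill the remaining deficit with new Monte Carlo jobs.
    for _ in range(deficit - recovered):
        submit_new()
    return deficit

# Demo: 3,990 waiting jobs and 3 recoverable failures leave 7 new jobs.
resubmitted = []
new_jobs = []
added = check_task_queue(
    waiting_jobs=3990,
    failed_jobs=["job-1", "job-2", "job-3"],
    resubmit=resubmitted.append,
    submit_new=lambda: new_jobs.append("mc"),
)
```

Running such a check on every monitoring cycle keeps the queue near its target without any manual intervention.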