Given the need for collaboration to help
sysadmins share their understanding
of systems, it is possible to imagine better tools for sharing system state. These
tools should take full advantage of different forms of communication to share
more completely what is going on with
both system and sysadmin.
We now turn to another example of
collaboration we observed among system administrators working on a much
more complex system exhibiting a problem that required incredible effort to understand.
The Crit-Sit
A critical situation, or crit-sit, is a practice that is invoked when an IT system’s
performance becomes unacceptable
and the IT provider must devote specific resources to solving the problem
as quickly as possible. Several sysadmins—experts on different components—are brought into a room and
told to work together until the problem
is fixed. Crit-sits occur more often than
sysadmins would like (one we interviewed estimated taking part in four
crit-sits per year), and they can last days,
weeks, or even months.
We observed one crit-sit for a day,
just after it had started, and followed its
progress over two months until its solution was found. This was exceptionally
long for a crit-sit. It involved an intermittent Web application failure resulting
from a subtle interaction of a Web application server and back-end database.
Other potential problems were found
and fixed along the way, but it took more
than 80 days for a dedicated team of experts to determine the true root cause.
At a micro level, being in the room
during the crit-sit was fascinating. Eight
to 10 people were present in the large
conference room, either sitting at the
two tables or walking around the room
talking; an additional four to six people
joined in via conference call and chat
room (including technical support representatives
for the various software
products involved). At first, it seemed
amazing to us that this many people
had been instructed to work together
in a single room until the problem was
solved. Indeed, one of the people in the
room complained via an instant message
to a colleague offsite:
“We’re doing lots of PD [problem determination],
but nothing that I couldn’t
have done from home.”
After watching the people at work,
however, we saw real value in having all
of them together in one place. The room
was alive with different conversations,
usually many at once diverging and rejoining,
and with different experts exchanging
ideas or asking questions. People
would use the whiteboard to diagram
theories, and could see and supplement
what others were writing. When something
important occurred, the attention
of everybody in the room was instantly
focused. A group chat room was also
used as a historical record for system
status, error messages, and ideas. Chat
was also used for private conversations
within the room and beyond, and for exchanging
technical information. At one
point we saw them build a monitoring
script collaboratively through talking,
looking at each other’s screens, and exchanging
code snippets over IM both inside
and outside the room.