Eric, maybe you could say a little bit about why it works,
and how, for example, integration control and quality
control are done.
EA First of all—and I may be committing heresy here—but
what we see are the successful projects; we don’t see all the
ones that have failed, so it looks like open source development is this great solution to all these problems. In fact,
it can be done very badly. There’s an old maxim that you
can write spaghetti code in any language, and I think
that’s definitely true with development methodologies
as well. The ones that do it well use a couple of different
models. One is the benevolent dictator model, in which
one person ultimately does all of the integration and so
forth. This is pretty much how Linux works at this point,
where Linus [Torvalds] is the dictator and he has lieutenants who help out, but it’s a very tree-structured approach.
BC Surprisingly so, I might add. Linux is more hierarchical than any other project I have encountered in proprietary software products.
EA Yes, that is kind of surprising. In contrast, FreeBSD has
a much more spread-out network. There is no one who
is absolutely in control. A core team has to make some
of the big decisions, but that team consists of around 30
people. That seems to work really well, and they’ve done
a lot of work to structure that so that the communication paths work. One of the other reasons that it works
is that it’s a big project and there are a lot of folks who
are working on just their little pieces of the system, so
the integration doesn’t have to be done continuously. It’s
not like you change the kernel and everyone who is on
the system crashes. That was kind of the way we did it at
Berkeley when we were developing the VM/Unix stuff.
We ate our own dog food, and that meant there were
crashes a lot of the time, but it also meant that we fixed
them very quickly.
BC That’s a very important general principle in terms of
using your own software. We call that “avoiding the quality death spiral.” Solaris went through a very interesting
transition. Prior to Solaris 2.5, there was much more of a,
for lack of a better word, waterfall model in terms of the
way new releases were distributed to people. As a result,
people would not run the latest bits on their desktop
or on the server; they would develop their own little
bits and integrate them into a whole that they never
saw. Solaris was in the quality death spiral because once
people refused to use the latest stuff because it was known
to be broken, then people used the latest stuff less and
less and it got to be more and more broken.
To break the quality death spiral, you’ve got to force
people to use the latest stuff. I think it’s much more
important when you’re in a distributed environment
where you don’t necessarily have the kind of immediate
peer pressure to do that.
EA Probably, although I don’t like the concept of forcing
people to use things because they won’t. You can’t really
force them to do stuff. You’ve got to get them to want to
use it. I think if you’re providing good enough quality,
then most people will use it, particularly if they feel like
they’re part of the development effort.
The traditional models in which you hold the product
away from the users until you’ve done all the debugging,
and then throw it over the wall to them, produce a lower
quality than having the users always running the latest
test version so they’re giving constant feedback.
SB I was involved in the Solaris development early on
and one of the things we tried to do there was bring the
test capability to the desktop of the engineers so that they
had the tools they needed. I would be interested in how
that has progressed, because I’ve been out of the corporate engineering business for a while. Can the engineers
effectively test their code in the environment that it’s
being shipped in?
EA One of the things I learned about sendmail a long
time ago is that it’s really hard to write a simulator for
the Internet that’s not bigger than the Internet itself. To
a certain extent you can’t, but we certainly do have test
labs. We have special programs designed to create load
artificially. We have basic sources and sinks, and we will
go in and intentionally introduce errors. There’s actually
some code in sendmail to force timeouts and things, to
make sure that that kind of thing is working.
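EA's point about intentionally introducing errors can be illustrated in miniature. The following is a hedged sketch, not sendmail's actual test code: a deliberately stalled "sink" server plays the role of a broken peer, and a client with a short timeout exercises the timeout-handling path on demand rather than waiting for the real network to misbehave.

```python
# Toy fault-injection sketch (hypothetical; not sendmail's test code):
# a sink that never answers, and a probe that must detect the timeout.
import socket
import threading
import time

def slow_sink(server_sock):
    """Accept one connection and stall, simulating a hung peer."""
    conn, _ = server_sock.accept()
    time.sleep(2)  # stall well past the client's timeout
    conn.close()

def probe_with_timeout(port, timeout=0.5):
    """Return True if the peer answered in time, False on a timeout."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
        try:
            s.recv(1)          # a healthy server would send a greeting
            return True
        except socket.timeout:
            return False       # the injected fault was caught

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=slow_sink, args=(server,), daemon=True).start()
result = probe_with_timeout(port)  # the stalled sink forces a timeout
print("forced timeout detected:", not result)
```

The same idea generalizes: any code path that only fires under rare network conditions can be driven deterministically by a test double that misbehaves on purpose.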
BC In our group we’re really focused on having easy-to-run test suites. I feel the mistake that we made in
DTrace development was not starting the test suite soon
enough. On this [new] project, we developed the test
suite moments after the first line of code was written, so
we have a complete test suite that we try to run. There
are problems when you do that. The test suite right now
takes a long time to run. It takes several hours now where
it used to take seconds and then minutes, so engineers
are running only those portions of the test suite that they
know affect their code. In general, that’s the right decision to make, but you do end up with tradeoffs when you
have a tightly integrated test suite.
EA You can automate running them so that everything
gets run every night.
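EA's suggestion, a nightly run of everything, might look like the following driver, meant to be kicked off by cron or a similar scheduler. This is a placeholder sketch: `run_test` stands in for a real harness, and the summary it produces is exactly the thing someone still has to look at.

```python
# Minimal nightly-run driver (hypothetical; run_test is a placeholder).
import time

def run_test(name):
    """Stand-in for invoking one test group; returns (name, passed)."""
    return (name, True)  # a real harness would execute and inspect

def nightly(suite):
    """Run every test group and return a summary for review."""
    started = time.strftime("%Y-%m-%d %H:%M")
    results = [run_test(t) for t in suite]
    failures = [name for name, ok in results if not ok]
    return {"started": started, "ran": len(results), "failed": failures}

report = nightly(["vm-tests", "fs-tests", "net-tests"])
print(report["ran"], "groups run,", len(report["failed"]), "failures")
```

Automation runs the suite, but as the discussion below notes, it does not replace having someone watch the results.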
BC You can do that, but someone has to watch the
results. We’ve got one engineer here who is very diligent
about watching the results, and he got frustrated because
he was the only one doing that. The test suite would be