and enables full deduplication (massively
reducing storage costs), integrity
checking, and tracking of reuse across
all software projects at the file level.
But it also poses novel challenges
when it comes to efficiently indexing
and querying its contents.
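The deduplication and integrity checking that a Merkle DAG provides can be illustrated with a toy content-addressed store. This is a minimal sketch, not the archive's actual implementation: the `ObjectStore` class, the choice of SHA-256, and the serialization format are all assumptions made for illustration.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store (hypothetical, for illustration only).

    Every object is keyed by the hash of its own bytes, so identical
    content is stored once and any corruption is detectable.
    """

    def __init__(self):
        self.objects = {}

    def _put(self, kind: str, payload: bytes) -> str:
        data = kind.encode() + b"\0" + payload
        key = hashlib.sha256(data).hexdigest()  # assumed hash function
        self.objects.setdefault(key, data)      # deduplication happens here
        return key

    def put_file(self, content: bytes) -> str:
        return self._put("file", content)

    def put_directory(self, entries: dict) -> str:
        # A directory node refers to its children by identifier, so its own
        # identifier depends on everything beneath it (the Merkle property).
        lines = [f"{name} {child}" for name, child in sorted(entries.items())]
        return self._put("dir", "\n".join(lines).encode())

    def verify(self) -> bool:
        # Integrity check: recompute each hash and compare it with its key.
        return all(
            hashlib.sha256(data).hexdigest() == key
            for key, data in self.objects.items()
        )


store = ObjectStore()
license_id = store.put_file(b"MIT License\n")

# Two unrelated projects ship the same license file.
project_a = store.put_directory({
    "LICENSE": license_id,
    "main.c": store.put_file(b"int main(void) { return 0; }\n"),
})
project_b = store.put_directory({
    "LICENSE": store.put_file(b"MIT License\n"),
    "app.py": store.put_file(b"print('hello')\n"),
})

print(len(store.objects))  # 5: three files and two directories, license stored once
print(store.verify())      # True
```

Because both directory nodes point at the same license identifier, the store also records that the file is reused across the two projects, which is the file-level reuse tracking described above.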
Sharing
The raw material that Software Heritage
collects must be properly organized
to make it easy to use. On top of the information captured by version-control
systems, we need metadata describing
the software and means to classify the
millions of harvested projects, each written
in one of the thousands of known programming languages.[e] We need to extract and reconcile existing information
from many different sources, encoded
in one of the many different software
ontologies, and complete it using either
automatic tools or crowdsourcing.
We must also support the many use
cases that it enables. Programmers
may want to search for specific project
versions or code snippets to reuse, and
then browse them online or download
source code bundles with full history. Companies may want to access an API to
build applications that use the archive.
Researchers may want to access the
whole corpus to perform big data operations or train machine learning models.
We must carefully assess which
functionalities are generic enough to
be incorporated in the archive, and
which are so specific that they are best
implemented externally by third parties. And there are of course legal and
ethical issues to be dealt with when
redistributing parts—or all—of the
contents of the archive.
Current Status
Software Heritage is an active project
that has already assembled the largest
existing collection of software source
code. At the time of writing, the Software
Heritage Archive contains more than
four billion unique source code files and
one billion individual commits, gathered
from more than 80 million publicly
available source code repositories
(including a full and up-to-date mirror
of GitHub) and packages (including a
full and up-to-date mirror of Debian).
Three copies are currently maintained,
e See http://hopl.info/
mission. While a full article detailing
our approach is available online, 2 we
focus here on the challenges raised by
the three main goals: collecting,
preserving, and sharing the source code
of all the software ever written.
Collection
There are various kinds of source code.
Some is current, actively developed,
and technically easy to make available;
some is legacy source code that
must be painfully retrieved from offline
media. Some is open, and free for all to
read and reuse; some is closed behind
proprietary doors. Software Heritage’s
ambition is to collect it all.
For current, open source code, we
need an automated process to harvest all
software projects, with all the available
development history, from the many
places where development and distribution take place, like forges and package repositories. Yes, we really mean
harvesting everything available, with no
a priori filtering, because the value of
an active software project will only be
known in the future, and because storing all present and future source code
can be done at a reasonable cost.
The technical challenge is to build
crawlers for each code-hosting platform, as there is no common protocol
available, and to develop adapters for
all version-control systems and package
formats. It is a significant undertaking,
but once a standard platform is available, each of these crawlers and adapters
can be developed in parallel.
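One way to picture why crawlers and adapters can be developed in parallel is a pair of small interfaces with a shared driver. The sketch below is an assumption made for illustration: the `Crawler`, `Adapter`, `StaticCrawler`, and `EchoAdapter` classes, the `harvest` function, and the example.org URLs are all hypothetical and do not describe the project's actual code.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List, Tuple


class Crawler(ABC):
    """Enumerates the repositories hosted on one platform (hypothetical)."""

    @abstractmethod
    def list_origins(self) -> Iterator[Tuple[str, str]]:
        """Yield (kind, url) pairs, e.g. ("git", "https://...")."""


class Adapter(ABC):
    """Imports repositories of one version-control or package kind (hypothetical)."""

    kind: str

    @abstractmethod
    def load(self, url: str) -> str:
        """Import the repository at url and return a short report."""


class StaticCrawler(Crawler):
    """Stand-in crawler that replays a fixed list instead of calling a platform."""

    def __init__(self, origins: Iterable[Tuple[str, str]]):
        self._origins = list(origins)

    def list_origins(self) -> Iterator[Tuple[str, str]]:
        yield from self._origins


class EchoAdapter(Adapter):
    """Stand-in adapter that only reports what it would import."""

    def __init__(self, kind: str):
        self.kind = kind

    def load(self, url: str) -> str:
        return f"loaded {self.kind} origin {url}"


def harvest(crawlers: Iterable[Crawler],
            adapters: Iterable[Adapter]) -> Tuple[List[str], List[str]]:
    """Route every discovered origin to the adapter for its kind.

    No filtering is applied: origins are skipped only when no adapter
    exists yet for their kind.
    """
    by_kind = {adapter.kind: adapter for adapter in adapters}
    loaded, skipped = [], []
    for crawler in crawlers:
        for kind, url in crawler.list_origins():
            adapter = by_kind.get(kind)
            if adapter is None:
                skipped.append(url)
            else:
                loaded.append(adapter.load(url))
    return loaded, skipped


loaded, skipped = harvest(
    [StaticCrawler([("git", "https://example.org/a.git"),
                    ("svn", "https://example.org/b")])],
    [EchoAdapter("git")],
)
print(loaded)   # ['loaded git origin https://example.org/a.git']
print(skipped)  # ['https://example.org/b']
```

In this arrangement, supporting a new platform means writing one more `Crawler`, and supporting a new version-control system means writing one more `Adapter`; neither requires changes to the other or to the driver.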
For legacy, open source code, we
need a crowdsourcing platform to
empower the volunteers that are willing to help recover their preferred
software artifacts. Guidelines must be
offered to help properly reconstruct
from the raw material the interesting
history that lies behind it, as in the
beautiful work that has been done for
the history of Unix. 5
Closed software contains precious
knowledge that is more difficult to recover.
For example, the Computer History
Museum[b] and Living Computers[c]
have shown, in the case of the mythical
Alto system,[d] that once the business
need to keep software closed fades
away, a focused search (that requires a
costly and dedicated effort) can succeed
in recovering and liberating its source
code, growing our software commons.
b See http://www.computerhistory.org/
c See http://www.livingcomputers.org/
d See http://xeroxalto.computerhistory.org and
http://www.livingcomputers.org/Discover/News/ContrAlto-A-Xerox-Alto-Emulator.aspx
Finally, by providing a means to
safely keep closed source software under embargo, much like what happens
already with software escrow, we may
succeed in collecting current and future
closed source, and be ready to liberate it
when the time comes, dispensing altogether with costly technical recovery efforts.
Preservation
In the extensive literature on digital
preservation, it is now well established
that long-term preservation requires
full access to the source code of the
tools used for the task. Software Heritage uses and develops exclusively free
and open source software tools for
building its archive.
Also, replication and diversification are best practices to mitigate the
threats—from technical failures to
legal and economic decisions—that
endanger any long-term preservation
initiative. Hence, we want to foster a
geographically distributed network of
mirrors, implemented using a variety
of storage technologies, in different administrative domains, controlled by a
plurality of institutions, and located in
different jurisdictions.
Finally, preserving software source
code also requires preserving the development history of source code,
which carries precious insights into
the structure of programs and also
tracks inter-project relationships.
Software Heritage’s unique approach
is to store all available source code
and its revisions in a single Merkle
DAG (Directed Acyclic Graph), shared
among all software projects. This
data structure facilitates distribution