When edited, an entry’s content
doesn’t use the Web’s relatively complex, error-prone HTML syntax but rather a simplified text annotation scheme
called wiki markup, or wikitext. Creating a link from one entry to another is
as simple as enclosing the other entry’s
identifying name in double square
brackets. Markup tags can also group
together related articles into categories
(such as “Nobel laureates in physics,”
“liberal democracies,” and “bowed instruments”). One use of a category tag
is to mark entries as stubs, indicating
to readers and future contributors that
a particular entry is incomplete and requires expansion. In the snapshot we
studied, about 20% of the entries were
marked as stubs. For a better idea of
Wikipedia’s process and technology,
access an entry in your own specialty
and contribute an improvement.
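
The markup conventions described above can be illustrated with a short Python sketch. It is not the tooling used in the study: the regular expressions, the function name parse_wikitext, and the sample text are illustrative assumptions, and the stub pattern assumes stubs are flagged by a template such as {{music-stub}}. It handles simple links only and does not attempt nested markup such as image captions.

```python
import re

# [[Target]], [[Target|label]], or [[Target#section|label]]; group 1 is the target.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
# Assumption: stubs are flagged by a template such as {{stub}} or {{music-stub}}.
STUB_RE = re.compile(r"\{\{[^{}|]*stub\}\}", re.IGNORECASE)


def parse_wikitext(text):
    """Return (article links, categories, is_stub) for one revision's wikitext."""
    links, categories = set(), set()
    for match in LINK_RE.finditer(text):
        target = match.group(1).strip()
        if not target:
            continue
        if target.lower().startswith("category:"):
            # A category tag uses link syntax with a "Category:" prefix.
            categories.add(target[len("category:"):].strip())
        else:
            links.add(target)
    return links, categories, bool(STUB_RE.search(text))


# Hypothetical sample entry, written for this example.
sample = (
    "The [[violin]] is a [[Bowed string instrument|bowed instrument]] "
    "played with a [[Bow (music)#History|bow]]. "
    "[[Category:Bowed instruments]] {{music-stub}}"
)
links, categories, is_stub = parse_wikitext(sample)
```

Collecting targets in a set yields unique references, so a target linked several times in one entry is counted once.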
Existing research on Wikipedia
employs descriptive, analytic, and
empirical methodologies. A series of
measurements has been published
that identifies power laws in terms of
number of distinct authors per article,
articles edited per author, and ingoing,
outgoing, and broken links [13]. On the
analysis front, notable work has used
simulation models to demonstrate
preferential attachment [3], visualization
techniques to identify cooperation and
conflict among authors [12], social-activity
theories to understand participation [2],
and small-worlds network analysis to
locate genre-specific characteristics
in linking [8]. Finally, given the anarchic
nature of Wikipedia development, it is
not surprising that some studies have
also critically examined the quality of
Wikipedia’s articles [7, 11]. The work we describe here focuses on the dynamics of
Wikipedia growth, examining the relationship between existing and pending
articles, the addition of new articles as
a response to references to them, and
the building of a scale-free network of
articles and references.
Methods
The complete content of the Wikipedia database is available online in the
form of compressed XML documents
containing separate revisions of every
entry, together with metadata (such
as the revision’s timestamp, contributor, and modification comment). We
processed the February 2006 complete
dump of the English-language Wikipedia, a 485GB XML document. (In June
2008, we looked to rerun the study with
more recent data, but complete dumps
were no longer available.) The text of
each entry was internally represented
through the wiki-specific annotation
format; we used regular expressions
and explicit state transitions in a flex-generated analyzer for parsing both
the XML document structure and the
annotated text. We skipped all entries residing in alternative namespaces (such
as “talk” pages containing discussions
about specific articles, user pages, and
category pages). In total, we processed
28.2 million revisions on 1.9 million
pages.
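
As a rough illustration of this processing step, the sketch below streams a dump-like XML document and skips pages outside the article namespace. It deliberately swaps in Python's standard xml.etree.ElementTree.iterparse for the flex-generated analyzer described above. The element names (page, title, revision, timestamp, contributor, username, ip, text), the prefix list, and the sample data are assumptions made for the example, not details taken from the study.

```python
import io
import xml.etree.ElementTree as ET

# Assumption: a partial, illustrative list of non-article namespace prefixes.
SKIPPED_PREFIXES = (
    "Talk:", "User:", "User talk:", "Wikipedia:", "Wikipedia talk:",
    "Category:", "Category talk:", "Template:", "Image:", "Help:",
)


def local(tag):
    """Strip the '{namespace-uri}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Text of the first direct child called `name`, or '' if absent."""
    for child in elem:
        if local(child.tag) == name:
            return child.text or ""
    return ""


def iter_article_revisions(source):
    """Yield (title, timestamp, contributor, text) per revision of each article."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title = child_text(elem, "title")
        if not title.startswith(SKIPPED_PREFIXES):
            for rev in elem:
                if local(rev.tag) != "revision":
                    continue
                who = ""
                for c in rev:
                    if local(c.tag) == "contributor":
                        # Assumed layout: <username> for registered users, <ip> otherwise.
                        who = child_text(c, "username") or child_text(c, "ip")
                yield title, child_text(rev, "timestamp"), who, child_text(rev, "text")
        elem.clear()  # release the finished page's subtree


# Hypothetical two-page sample; the xmlns value is illustrative only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Violin</title>
    <revision><timestamp>2005-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
      <text>A [[bow]]ed instrument.</text></revision>
    <revision><timestamp>2005-02-01T00:00:00Z</timestamp>
      <contributor><ip>10.0.0.1</ip></contributor>
      <text>A bowed [[string instrument]].</text></revision>
  </page>
  <page><title>Talk:Violin</title>
    <revision><timestamp>2005-01-02T00:00:00Z</timestamp>
      <contributor><username>Bob</username></contributor>
      <text>Discussion.</text></revision>
  </page>
</mediawiki>"""
revisions = list(iter_article_revisions(io.BytesIO(SAMPLE)))
```

Matching against a list of known prefixes, rather than any colon, keeps article titles that legitimately contain a colon. Streaming matters at this scale because a document of several hundred gigabytes cannot be loaded as a single tree.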
For each Wikipedia entry we maintained a record containing the contributor identifiers and timestamps for the
entry’s definition and for its first reference, the number of efferent (outgoing)
article references (unique references
to other Wikipedia articles in the current version of the entry), the number
of unique contributors, the number of
revisions, a vector containing the number of the entry’s afferent (incoming)
references from other Wikipedia articles for each month, and a corresponding vector of Boolean values identifying
the months during which the entry was
marked as a stub. (The source code for
the tools we used and the raw results
we obtained are at www.dmst.aueb.gr/dds/sw/wikipedia.)
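
One possible shape for such a per-entry record is sketched below in Python. The class name, field names, and update methods are hypothetical and are not taken from the published tools. The sketch assumes month 0 is January 2001, that timestamps are ISO-8601 strings (which compare correctly as text), and that each observed reference adds one to its month's afferent count.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

N_MONTHS = 62  # assumption: January 2001 through February 2006, inclusive


def month_index(timestamp):
    """Map a timestamp such as '2005-01-31T12:00:00Z' to months since January 2001."""
    year, month = int(timestamp[0:4]), int(timestamp[5:7])
    return (year - 2001) * 12 + (month - 1)


@dataclass
class EntryRecord:
    title: str
    defined_at: Optional[str] = None    # timestamp of the entry's definition
    defined_by: Optional[str] = None    # contributor who defined it
    first_ref_at: Optional[str] = None  # timestamp of the first reference to it
    first_ref_by: Optional[str] = None  # contributor who first referenced it
    efferent_refs: int = 0              # unique outgoing links, current version
    contributors: Set[str] = field(default_factory=set)
    revisions: int = 0
    afferent_by_month: List[int] = field(default_factory=lambda: [0] * N_MONTHS)
    stub_by_month: List[bool] = field(default_factory=lambda: [False] * N_MONTHS)

    def note_revision(self, timestamp, contributor, n_links, is_stub):
        """Fold one revision of this entry into the record."""
        if self.defined_at is None or timestamp < self.defined_at:
            self.defined_at, self.defined_by = timestamp, contributor
        self.efferent_refs = n_links  # assumes revisions arrive oldest first
        self.contributors.add(contributor)
        self.revisions += 1
        if is_stub:
            self.stub_by_month[month_index(timestamp)] = True

    def note_reference(self, timestamp, contributor):
        """Record that another article's revision at `timestamp` links here."""
        if self.first_ref_at is None or timestamp < self.first_ref_at:
            self.first_ref_at, self.first_ref_by = timestamp, contributor
        self.afferent_by_month[month_index(timestamp)] += 1


# Hypothetical history: the entry is referenced before it is defined.
record = EntryRecord("Violin")
record.note_reference("2004-11-03T10:00:00Z", "Bob")
record.note_revision("2005-01-01T00:00:00Z", "Alice", n_links=1, is_stub=True)
record.note_revision("2005-02-01T00:00:00Z", "10.0.0.1", n_links=3, is_stub=False)
```

Keeping both the definition timestamp and the first-reference timestamp in one record makes it straightforward to ask whether an entry was referenced before it existed.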
Growth and Unresolved References
We were motivated to do this research
when one of us (Spinellis), in the
course of writing a new Wikipedia entry, observed that the article ended up
containing numerous links to other
nonexistent articles. This observation
led us to the “inflationary hypothesis”
of Wikipedia growth: the
number of links to nonexistent articles increases faster than
new articles are entered into
Wikipedia, so Wikipedia’s utility decreases over time as its coverage
deteriorates, accumulating more and more
references to concepts that lack a corresponding article. An alternative—the
“deflationary hypothesis”—involves
links to nonexistent articles increasing
at a rate less than the rate of the addition of new articles. Under this hypoth-