created in the month of their first reference. Interestingly, the reference and
subsequent definition of an article in
Wikipedia appear to be a collaborative
phenomenon; of the 1.7 million entries for which both the contributor entering the first reference and the contributor entering the first definition are
known, that contributor is the same for
only 47,000, or 3%, of entries.
Similarly, the mean number of first
references to entries (see Figure 2b)
rises exponentially until the referenced entry becomes an article. (For
calculating the mean we offset each
entry’s time of definition and time
points in which it was referenced to
center them at time 0.) The point in
time when the referenced entry becomes an article marks an inflection
point; from then on the number of references to a defined article rises only
linearly (on average).
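The centering step described in the parenthesis can be sketched as follows. This is a minimal illustration only; the data layout (a definition month plus a list of reference months per entry) and the function name are assumptions, not the procedure actually used for Figure 2b.

```python
from collections import Counter


def mean_references_by_offset(entries):
    """Average references per entry at each month offset from definition.

    `entries` is a hypothetical layout: a list of
    (definition_month, [reference_months]) pairs, with months as integers.
    """
    counts = Counter()
    for defined_at, reference_months in entries:
        for month in reference_months:
            # Shift each reference so that the entry's definition sits at 0.
            counts[month - defined_at] += 1
    n = len(entries)
    return {offset: counts[offset] / n for offset in sorted(counts)}


# Toy data: two entries defined in months 10 and 3.
toy = [(10, [7, 9, 9, 10, 12]), (3, [1, 2, 2, 3])]
profile = mean_references_by_offset(toy)
```

Negative offsets in the result are months before the entry became an article, and positive offsets are months after it.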
Figure 3a: Expected and actual number of references added each month to an entry; quantile-quantile plot of the expected and actual number of references added each month to each article. (Horizontal axis: 1,000-quantiles of expected number of references added to article in a month; vertical axis: 1,000-quantiles of actual number of references added to article in a month; both axes logarithmic, 1 to 1,000.)
Figure 3b: Expected and actual number of references added each month to an entry; frequency distributions of the expected and actual number of references added each month to each article. (Horizontal axis: references added to article in a month, 1 to 100,000; vertical axis: number of articles, 1 to 1,000,000; both axes logarithmic; series: expected, actual.)
Building a scale-free network
We established that entries are added
to Wikipedia as a response to references to them, but what process adds
references and entries? Several models have been proposed to explain the
appearance of scale-free networks like
the one formed by Wikipedia’s entries
and references. The models can be divided into two groups [9]: those treating power
laws as the result of an optimization
process, and those treating power laws as
the result of a growth model, the most
popular of which is Barabási’s preferential attachment model [1]. In-vitro
model simulations verify that the proposed growth models do indeed lead
to scale-free graphs. Having the complete record of Wikipedia history allows us to examine in-vivo whether a
particular model is indeed being followed.
Barabási’s model of the formation
of scale-free networks starts with a
small number ($m_0$) of vertices. Every
subsequent time step involves the addition of a new vertex, with $m \le m_0$ edges
linking it to $m$ different vertices already
in the system. The probability $P$ that a
new vertex will be connected to vertex
$i$ is $P(k_i) = k_i / \sum_j k_j$, where $k_i$ is the vertex’s
connectivity at that step.
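A minimal simulation of this growth rule might look like the sketch below. The function name, the default parameters, and the choice to connect the initial $m_0$ vertices in a ring (so that each starts with nonzero connectivity) are assumptions made for illustration; the model description above does not specify how the initial vertices are linked.

```python
import random
from collections import Counter


def barabasi_albert(n, m0=3, m=2, seed=0):
    """Grow a graph of n vertices by preferential attachment; return its edges."""
    assert m0 >= 3 and 1 <= m <= m0 <= n
    rng = random.Random(seed)
    # Assumption: the m0 initial vertices form a ring.
    edges = [(i, (i + 1) % m0) for i in range(m0)]
    # Vertex i appears k_i times in `endpoints`, so a uniform draw from it
    # selects vertex i with probability k_i / sum_j k_j.
    endpoints = [v for edge in edges for v in edge]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:  # m different existing vertices
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


edges = barabasi_albert(200, m0=3, m=2)
degree = Counter(v for edge in edges for v in edge)
```

Plotting the frequency of each value in `degree` on logarithmic axes is the usual way to check such a simulation for a heavy-tailed connectivity distribution.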
The situation in Wikipedia is more
complex, as the number of vertices
and edges added in a time step is not
constant and new edges are added
between existing vertices as well. We
therefore consider a model where at
each time step $t$ (a month), a variable
number of entries and $r_t$ references
are added. The references are distributed among all entries following a
probability $P(k_{i,t}) = k_{i,t} / \sum_j k_{j,t}$, with the
sums and the connectivities calculated at the start of $t$. The expected number of references added to entry $i$ at
month $t$ is then $\langle k_{i,t} \rangle = r_t P(k_{i,t})$. We find
a close match between the expected
and the actual numbers in our data.
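As a worked illustration of the expected-value formula, the sketch below distributes one month's references among entries in proportion to their connectivity at the start of the month. The entry names and counts are invented toy values, and the function name is hypothetical.

```python
def expected_references(connectivity, r_t):
    """Expected references per entry for one month.

    `connectivity` maps each entry to its connectivity k at the start of the
    month; `r_t` is the total number of references added during that month.
    """
    total = sum(connectivity.values())
    return {entry: r_t * k / total for entry, k in connectivity.items()}


# Toy month: three entries and 20 new references.
k_t = {"A": 6, "B": 3, "C": 1}
expected = expected_references(k_t, r_t=20)
# A holds 6/10 of the connectivity, so it is expected to gain 12 of the 20.
```

The expected values always sum to the month's total, which gives a simple sanity check when applying the formula to real data.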
Figure 3a is a quantile-quantile plot
of the expected and the actual numbers at the 1,000-quantiles; Figure 3b
outlines the frequency distributions
of the number of articles (expected vs.
actual) gaining a number of references in a month. The two data sets have
a Pearson’s product-moment correlation of 0.97, with the 95% confidence
interval being (0.9667, 0.9740). If $na_x$
is the number of articles that gained $x > 30$
(to focus on the tails) references
in a month and $na'_x$ is the expected
number of such articles, we have $na_x \approx 1.11\,na'_x$ (p-value < 0.001).
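The two comparisons used here, a quantile-quantile pairing and Pearson’s product-moment correlation, can be sketched with the Python standard library alone. The helper names and the toy numbers are assumptions for illustration; this is not the analysis code behind Figures 3a and 3b.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson's product-moment correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def qq_points(expected, actual, n=1000):
    """Pair the n-quantile cut points of two samples for a Q-Q plot."""
    return list(zip(statistics.quantiles(expected, n=n),
                    statistics.quantiles(actual, n=n)))


# Toy samples standing in for expected and actual monthly reference counts.
expected_counts = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
actual_counts = [1.0, 2.0, 4.0, 4.0, 5.0, 7.0, 7.0, 9.0]
r = pearson(expected_counts, actual_counts)
points = qq_points(expected_counts, actual_counts, n=4)
```

If the two samples follow the same distribution, the paired quantiles fall close to the diagonal when plotted against each other.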
It has never been possible to examine the emergence of scaling in other