Since the protomeme’s μs always appear in posts containing the
protomeme itself, their πm,μ always
equals 1. Finally, a low canonicity
score is obtained when the post contains
many words with low πm,μ.
If a post includes only one unusual
word, its canonicity score is still high,
because it is still composed mostly of
the protomeme itself.
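The canonicity computation described here can be sketched in a few lines of Python. This is a minimal illustration, assuming each post title is given as a list of words and a protomeme as a set of words; the function names are illustrative, not from the paper:

```python
from collections import Counter

def cooccurrence_probs(posts, protomeme):
    """Estimate pi_{m,mu}: the fraction of posts containing
    protomeme m whose title also contains word mu."""
    m_posts = [p for p in posts if protomeme <= set(p)]
    counts = Counter(w for p in m_posts for w in set(p))
    return {w: c / len(m_posts) for w, c in counts.items()}

def canonicity(post, pi):
    """Gamma(M, m): average pi_{m,mu} over the words of post M,
    including the protomeme's own words."""
    return sum(pi.get(w, 0.0) for w in post) / len(post)
```

Note that the protomeme’s own words automatically receive πm,μ = 1, since by construction they appear in every post containing m; a post made only of those words thus scores Γ = 1.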
How canonicity is distributed in
Reddit is reported in the online appendix. To test the connection between
canonicity and popularity for posts using protomemes appearing on the
front page the previous day, we create a
binary rank variable φM that records
whether or not the post was among the
5% best-scoring posts of the day. This is
the target variable of the following logistic regression
φMl = α + β Γ(M,m) + um + εm.
This is the φ Model. Given a post M
containing protomeme m, the φ Model
estimates its probability of experienc-
ing a popularity spike on day i, after m
hit the front page on day i − 1. Note the
set of posts we include in the model is
still dependent on the l parameter. For
different l values the set of posts in-
cluded is different, because, for in-
creasing front page size, more pro-
tomemes hit the front page, and more
posts on the day after will thus be con-
sidered in the model.
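The construction of the binary rank variable φ can be sketched as follows. This is a pure-Python sketch; the paper does not specify how ties at the 5% cutoff are handled, so this version flags at least one post per day and includes ties:

```python
def phi_flags(day_scores):
    """Mark each post of a day with phi = 1 if its score is in the
    top 5% of that day's scores, else 0."""
    ranked = sorted(day_scores, reverse=True)
    # how many posts count as the day's top 5% (at least one)
    k = max(1, int(len(ranked) * 0.05))
    cutoff = ranked[k - 1]
    # posts tied with the k-th best score are also flagged
    return [1 if s >= cutoff else 0 for s in day_scores]
```

On a day with 100 distinct scores, exactly the five best posts receive φ = 1.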
Figure 3 reports φ Model’s βs for increasing l. For Reddit (see Figure 3a),
β never takes values greater than −0.7,
a notable effect: since exp(−0.7) ≈ 0.5, high canonicity
roughly halves the odds of being a high-scoring post; for
a deeper discussion see the online appendix. As we increase l, the canonicity effect grows weaker. This
is expected, as we are including in the
regression posts that might not have
hit the front page. All β values in Figure 3a are significant (p < 0.0001). We
thus expect a null result in Hacker
News, given the result of the MAX
Model covered earlier. Indeed, Figure
3b shows the effect of canonicity in
Hacker News is indistinguishable from zero, as no p-value for any l falls below 0.01.
We also run two Poisson mixed
models with the same form as the φ
Model, the only differences being
the dependent variable (here, the
post score) and the data included in
them. In the Zero Model, we consider
only the posts for which φMl = 0, while
in the One Model we focus on the posts
for which φMl = 1. In practice, the φ
Model tells us the effect of canonicity
on the odds of experiencing two popularity spikes in a row, while the One
and Zero models reveal the score effect
of canonicity on the posts that did and
did not experience two popularity
spikes in a row.
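As an illustration of the score models, here is a minimal pure-Python Poisson regression of score on canonicity, fit by Newton-Raphson. It is a fixed-effects-only sketch: it omits the protomeme random effect um that the paper’s mixed models include, and it would be run separately on the φMl = 1 (One Model) and φMl = 0 (Zero Model) subsets:

```python
import math

def poisson_fit(x, y, iters=50):
    """Newton-Raphson fit of a Poisson regression log E[y] = a + b*x.
    Returns the maximum-likelihood estimates (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = 0.0           # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0  # Fisher information (negative Hessian)
        for xi, yi in zip(x, y):
            mu = math.exp(a + b * xi)  # predicted mean count
            g_a += yi - mu
            g_b += (yi - mu) * xi
            h_aa += mu
            h_ab += mu * xi
            h_bb += mu * xi * xi
        # Newton step: theta <- theta + H^{-1} g (2x2 system solved by hand)
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b
```

At the optimum the fitted means sum to the observed total, a standard property of the Poisson maximum-likelihood estimate; the sign of b plays the role of the β discussed in the text.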
In the One Model, β has a negative
sign (see Figure 4a); all βs are significant with p < 0.0001. If the φ Model told
us that canonicity lowers the odds of
experiencing two popularity spikes in a
row, the One Model would tell us that if
a post can nevertheless overcome those
odds, it is additionally penalized with a
worse score. In the Zero Model (see Figure 4b), β is positive and significant.
For the unsuccessful posts in the Zero
Model, canonicity has a positive effect.
For robustness, we also ran a negative
binomial model, which yields estimates
similar to those of the Poisson model (see
the online appendix).
The discordance of β signs in the
Zero and One models can be interpreted
as a similarity between protomeme
We introduce the concept of canonicity of a post, measuring how much a
post containing a protomeme m differs
from the usual usage of m. A post is said
to be canonic if it uses m as expected,
without introducing elements not
strongly associated with m itself. Consider, for example, a post M as a bag of
words. Each word μ in the bag of words
co-occurs with protomeme m with a given probability πm,μ. If m appears in 100
posts, and in 30 of them the post title
also includes μ, then πm,μ = 0.3. The canonicity of M is calculated like this
Γ(M,m) = Σ∀/∈Μ πm,μ / |M|.
The formula means the canonicity
of a post M is the average probability of
its words appearing with the meme it
contains. Note some posts contain no
words other than those of the protomeme m itself. For this reason, the
formula includes the m words in M;
otherwise, such posts would have
Γ(M,m) = 0/0, which is undefined. Moreover, posts including only
a protomeme’s words must have
Γ(M,m) = 1 because they use m in its
purest form.
Figure 3. Distribution of the φ Model’s β for varying l; thin lines represent the 95% confidence intervals. [β plotted against l: panel (a) Reddit, panel (b) Hacker News.]
Figure 4. Distribution of the One Model’s and Zero Model’s β for varying l; thin lines represent the 95% confidence intervals. [β plotted against l: panel (a) One Model, panel (b) Zero Model.]