We used the frequent-itemset-mining
algorithm Eclat4 to extract the frequent
n-grams and discarded an n-gram if it had
not been used for a certain number of
days. We also discarded all n-grams that
are proper subsets of another n-gram.
We performed the analysis using different thresholds to ensure independence between parameter choice and
results. From most- to least-restrictive
threshold choices, we obtained 2,731 to
5,585 protomemes on Reddit and 817
to 2,538 protomemes on Hacker News;
the preprocessed data we used is available for result replication (see footnote b).
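The extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each post title has already been reduced to a set of tokens, and the function names, the `min_support` threshold, and the toy titles are all hypothetical. The recency filter (dropping n-grams unused for a number of days) is omitted for brevity.

```python
def vertical_layout(token_sets):
    """Map each token to the set of post IDs (its tidset) in which it appears."""
    tidsets = {}
    for post_id, tokens in enumerate(token_sets):
        for tok in tokens:
            tidsets.setdefault(tok, set()).add(post_id)
    return tidsets


def eclat(tidsets, min_support, prefix=frozenset(), out=None):
    """Depth-first Eclat: grow `prefix` by intersecting tidsets."""
    if out is None:
        out = {}
    items = sorted(tidsets)
    for i, item in enumerate(items):
        tids = tidsets[item]
        if len(tids) < min_support:
            continue  # infrequent, so no superset can be frequent either
        itemset = prefix | {item}
        out[itemset] = len(tids)
        # Conditional database: later items restricted to posts containing `itemset`.
        suffix = {other: tids & tidsets[other] for other in items[i + 1:]}
        eclat(suffix, min_support, itemset, out)
    return out


def drop_proper_subsets(frequent, min_size=2):
    """Keep n-grams (n >= min_size) that are not proper subsets of another n-gram."""
    candidates = [s for s in frequent if len(s) >= min_size]
    return {s for s in candidates if not any(s < other for other in candidates)}


# Toy token sets standing in for preprocessed post titles (hypothetical data).
posts = [
    {"grumpy", "cat", "meme"},
    {"grumpy", "cat"},
    {"grumpy", "cat", "meme"},
    {"doge", "meme"},
]
frequent = eclat(vertical_layout(posts), min_support=2)
protomemes = drop_proper_subsets(frequent)
```

In this toy run the pair {cat, grumpy} is frequent but is discarded because it is a proper subset of the frequent triple {cat, grumpy, meme}, mirroring the subset rule stated above.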
Results
We first show how popularity
spikes result in reduced future popularity for a meme, then introduce the
concept of “canonicity” and how it can
shed light on this phenomenon.
Popularity curse. Common sense
tells us that popular ideas are likely to
be imitated; a protomeme used in a
very popular post today will be used in
many posts tomorrow. Such intuition
about the demand-supply relation on
the web is corroborated by several studies, including Ciampaglia et al. 9 However, dissimilarity-driven success theory would predict that flooding a system
with imitations of a protomeme will
cause the imitations to be less popular.
At first glance, such a prediction might
seem to find support in two observations: the average score of the posts
containing a protomeme is less than
expected the day after it experiences a
popularity spike (see Figure 1a and Figure 1c), and the number of posts containing that protomeme increases (see
Figure 1b and Figure 1d).
These observations support our
theory about viral connections but do
not prove it. First, the total score
awarded and the average score per
post are not constant over time (see
the online appendix, dl.acm.org/citation.cfm?doid=3158227&picked=formats).
The lower score might be just
a relative change; if, for example,
there are fewer upvotes awarded on
that particular day, a lower absolute
number could still represent an increase in upvote share for the day.
Second, each protomeme is characterized by its own expected populari-
b http://www.michelecoscia.com/?page_id=870
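The first caveat above, that a lower absolute score can still be a larger share of the day's upvotes, can be illustrated with a small calculation. The numbers and the `upvote_share` helper are hypothetical and chosen only to show the effect; they are not taken from the dataset.

```python
def upvote_share(post_score, day_total):
    """Fraction of all score awarded on a given day that went to one post."""
    return post_score / day_total


# Hypothetical figures: the second day has fewer upvotes overall.
busy_day_share = upvote_share(post_score=500, day_total=100_000)
quiet_day_share = upvote_share(post_score=400, day_total=50_000)

# The absolute score drops from 500 to 400, yet the share of the day's
# total rises, so the raw score alone does not show a real decline.
share_increased = quiet_day_share > busy_day_share
```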
highly visible forever. The most popular
(highest upvote/downvote ratio) posts appear on Reddit’s “front page,” giving
them a further boost in visibility. By
default, the front page hosts 25 posts.
Each entry in the dataset we studied
consisted of a post, its title, and its
number of upvotes/downvotes, which
were combined into a post score by Reddit’s sorting algorithm. Note that we can observe only the final score
of a post, not its full upvote timeline.
This might introduce bias when looking to establish whether or not the post
hit the front page. We assume the final
post score is highly correlated with the
post score on its first day of life. We
base this assumption on the fact that
the vast majority of upvotes come within 24 hours following post submission.
Note the terms “score” and “popularity”
are not interchangeable, as they
refer to related but different concepts.
Score is the one-off measurement of a
single instance of a protomeme in a
day, and popularity is the overall success of all instances of all memes over a
longer period of time.
All 22,329,506 posts added to Reddit
from April 5, 2012 to April 26, 2013
were part of the dataset. To cross-test
our results, we also used a similar
dataset from Hacker News, which uses
the same dynamics as Reddit but
focuses on a more specialized technical audience and has a much smaller
user base. The Hacker News dataset included 1,194,436 posts from January 7,
2010 to May 29, 2014.
Here, we consider protomemes, or
catchphrases with the potential to go
viral. Note there are more possible types
of memes (such as pictures and videos),
but given the nature of the data we limit
ourselves here to catchphrases. Catchphrases were also used as a meme proxy
in Memetracker18 and Nifty. 22 We extracted them using information taken
exclusively from the post title.
We apply our definition by borrowing the bag-of-words methodology
from the text-mining literature, meaning a protomeme is seen as a set of at
least two “tokens” (also called an “n-gram”
with n ≥ 2). A token is a stemmed word;
“stop words” are filtered out and do not
count as tokens. To be
classified as a protomeme, an n-gram
must have been used frequently and
constantly over the observation period.
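As a rough illustration of the bag-of-words step, the sketch below turns a post title into a set of stemmed, non-stop-word tokens. The stop-word list and the crude suffix-stripping stemmer are stand-in assumptions; the article does not say which stemmer or stop-word list was used.

```python
import re

# Tiny illustrative stop-word list (assumption; not the list used in the study).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and",
              "is", "are", "for", "my", "this"}


def naive_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def title_tokens(title):
    """Lowercase the title, drop stop words, stem the rest, return a token set."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return frozenset(naive_stem(w) for w in words if w not in STOP_WORDS)


tokens = title_tokens("The cats are sleeping on my keyboards")
# A title qualifies as a candidate source of n-grams only if it yields
# at least two tokens, matching the n >= 2 requirement.
has_candidate_ngram = len(tokens) >= 2
```

Because word order is discarded, two titles that use the same stemmed words in a different order map to the same token set.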
The more a meme
is imitated, the
less the original
meme (and all its
imitations) will be
successful in going
viral in the future.