the age of a submission in digg hours
at a given time t, we measure how many
diggs were received on the portal between t and the promotion time of the
story, divided by 5,478 diggs.
Similar hourly activity plots were
not possible for YouTube in 2008,
given that the API provided video view counts only about
once a day, whereas the Digg data recorded every digg
a story received. Moreover, we
were able to capture only a fraction of
the large amount of traffic the YouTube site handled by monitoring only
the selected videos in our sample.
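The digg-hour measure described above can be sketched in a few lines. This is only an illustration: the function name, the input format (a sorted list of site-wide digg timestamps), and the choice to count diggs after the promotion time up to and including t are assumptions, not details given in the text. Only the constant 5,478 comes from the article.

```python
from bisect import bisect_right

DIGGS_PER_DIGG_HOUR = 5478  # normalization constant quoted in the text


def age_in_digg_hours(all_digg_times, promotion_time, t):
    """Age of a story at time t, measured in digg hours.

    all_digg_times is a sorted list of timestamps of every digg cast
    on the portal (hypothetical input format; any comparable unit works).
    """
    if t < promotion_time:
        raise ValueError("t must not precede the promotion time")
    # Count site-wide diggs with promotion_time < timestamp <= t
    # (the interval convention is an assumption).
    lo = bisect_right(all_digg_times, promotion_time)
    hi = bisect_right(all_digg_times, t)
    return (hi - lo) / DIGGS_PER_DIGG_HOUR
```

Because the count runs over all diggs on the portal rather than over one story's diggs, the resulting clock runs faster when the site is busy and slower when it is quiet.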
Predicting the future
Here, we cover the process we used to
model and predict the future popularity of individual content and measure
the performance of the predictions:
First, we performed a logarithmic
transformation on the popularities of
submissions. The transformed variables exhibit strong correlations between early and later time periods; on
this scale, the naturally random fluctuations can be expressed as an additive
noise term. We call reference time t_r
the time at which we intend to predict
the popularity of a submission whose
age with respect to its upload (YouTube) or
promotion (Digg) time is t_r. By indicator time t_i we
mean the point in the life cycle of the submission at which we performed the prediction,
or how long we can observe submission history in order to extrapolate to
future popularity (t_i < t_r).
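As a rough illustration of the logarithmic transformation and the early/late correlation it exposes, the sketch below computes the Pearson correlation between log popularities at the indicator and reference times. The function names and the decision to drop submissions with zero counts (whose logarithm is undefined) are assumptions made for this example, not steps stated in the article.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_correlation(early_counts, late_counts):
    """Correlation of ln N(t_i) with ln N(t_r) across submissions."""
    # Assumption: submissions with a zero count at either time are skipped.
    pairs = [(a, b) for a, b in zip(early_counts, late_counts) if a > 0 and b > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two submissions with nonzero counts")
    log_early = [math.log(a) for a, _ in pairs]
    log_late = [math.log(b) for _, b in pairs]
    return pearson(log_early, log_late)
```

A value close to 1 on the log scale means early popularity differs from later popularity mostly by a multiplicative factor plus noise.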
To help determine whether the
popularity of submissions early on is
a predictor of later popularity, see Figures 3 and 4, which plot the popularity of submissions at the reference time t_r = 30 days
against the popularity measured at the indicator times t_i = 1 digg
hour for Digg and t_i = 7 days for YouTube,
respectively. We measured the popularity of YouTube videos at the end of
the seventh day, so the view counts at
that time ranged from 10^1 to 10^4, similar to Digg in this measurement. We
logarithmically rescaled the horizontal
and vertical axes in the figures due to
the large variances present among the
popularity of different submissions,
which span three decades.
Observing the Digg data, we noted
the popularity of about 11% of the stories (lighter blue in Figure 3) grew much
more slowly than the popularity of the
majority of submissions; by the end
of the first hour of their lifetimes, they
had received most of the diggs they would
ever receive. The difference in popularity growth between the two clusters remains visible until approximately the seventh
digg hour, after which the separation
vanishes due to digg counts of stories
mostly saturating to their respective
maximum values, as in Figure 1.
A Bayesian network analysis of submission features (day of the week/hour
of the day of submission/promotion,
category of submission, number of
diggs in the upcoming phase) reveals
no obvious reason for the presence of
clustering; we assume it arises when
the Digg promotion algorithm misjudges the expected future popularity
of stories, promoting stories from the
“upcoming” phase that are unlikely to sustain
user interest. Users lose interest in these stories much
sooner than in stories in the
upper cluster. We used k-means clustering with k = 2 and the cosine distance
measure to separate the two clusters,
as in Figure 3, and discarded the stories in the lower cluster.
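The clustering step can be illustrated with a small two-cluster k-means that uses cosine distance. This is a sketch under assumptions, not the procedure actually run on the Digg data: the article does not say which feature vectors were clustered or how the centroids were initialized. Here each story is a hypothetical vector of positive numbers (for example, log popularities at a few ages), vectors are normalized to unit length, and the second centroid is seeded with the point farthest from the first.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cosine distance is undefined for the zero vector")
    return [x / norm for x in v]


def _cosine_distance(u, v):
    # u and v are unit vectors, so their dot product is the cosine similarity.
    return 1.0 - sum(a * b for a, b in zip(u, v))


def two_means_cosine(points, max_iter=100):
    """Split points into two clusters by direction; returns 0/1 labels."""
    unit = [_normalize(p) for p in points]
    # Deterministic seeding (an assumption): the first point and the point
    # farthest from it in cosine distance.
    centroids = [unit[0], max(unit, key=lambda u: _cosine_distance(unit[0], u))]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if _cosine_distance(u, centroids[0]) <= _cosine_distance(u, centroids[1]) else 1
            for u in unit
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in (0, 1):
            members = [u for u, lab in zip(unit, labels) if lab == k]
            if members:
                # New centroid: mean direction of the members, renormalized.
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[k] = _normalize(mean)
    return labels
```

Cosine distance compares the direction of the vectors and ignores their length, so two stories with the same growth profile but different overall popularity fall in the same cluster.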
Trends and randomness. Our in-depth analysis of the data found strong
linear correlations between the logarithmically transformed popularities of submissions at early and
later times, with
correlation coefficients exceeding 0.9. Such a
strong correlation suggests that the more
popular a submission is at the beginning, the more popular it will also
be later on. The connection can be described by a linear model:
$$\ln N(t_r) = \ln\left[ r(t_i, t_r)\, N(t_i) \right] + \xi(t_i, t_r)$$
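To make the linear model concrete, the sketch below estimates ln r(t_i, t_r) from a training set and uses it for a point prediction. It rests on assumptions not stated in the text so far: the noise term is taken to have zero mean, so that the least-squares estimate of ln r with unit slope is the average of ln N(t_r) - ln N(t_i), and the prediction simply drops the noise term. The function names are hypothetical.

```python
import math


def fit_log_ratio(early_counts, late_counts):
    """Least-squares estimate of ln r(t_i, t_r) for the unit-slope model.

    Minimizing the sum of (ln N(t_r) - ln N(t_i) - c)^2 over c gives the
    mean of the log differences. All counts must be positive.
    """
    diffs = [math.log(late) - math.log(early)
             for early, late in zip(early_counts, late_counts)]
    return sum(diffs) / len(diffs)


def predict_popularity(early_count, log_r):
    """Point prediction N(t_r) = r * N(t_i), ignoring the noise term."""
    return math.exp(log_r) * early_count
```

For example, if every training submission doubled between t_i and t_r, the fitted log ratio is ln 2 and a new submission with 50 diggs at t_i is predicted to reach about 100 at t_r.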