only when its visibility is minuscule
compared to its other stages, and the
highest number of diggs accrues when
the social-network effect is nonexistent. We therefore do not consider this
feature (otherwise deemed important)
a main contributor from a prediction
point of view in terms of total popularity count.
Conclusion
In this article we have presented our
method for predicting the long-term
popularity of online content based on
early measurements of user access.
Using two very popular content-sharing portals, Digg and YouTube, we
showed that by modeling the accrual
of votes on and views of content offered by these services we are able to
predict the dynamics of individual
submissions from initial data. In Digg,
measuring access to given stories during the first two hours after posting allowed us to forecast their popularity 30
days ahead with a remarkable relative
error of 10%, while downloads of YouTube videos had to be followed for 10
days to achieve the same relative error.
The differing time scales of the predictions are due to differences in how content is consumed on the two portals;
Digg stories quickly become outdated,
while YouTube videos are still found
long after they are submitted to the
portal. Predictions are therefore more
accurate for submissions for which attention fades quickly, whereas predictions for content with a longer life cycle
are prone to larger statistical error.
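The 10% figure quoted above is a relative error. This section does not restate how that error is defined, so the following minimal sketch assumes a plain mean of absolute relative deviations; the function name `mean_relative_error` and the sample numbers are hypothetical and are not taken from the Digg or YouTube data.

```python
def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / actual over all submissions.

    Assumed definition for illustration; the article's exact error
    measure may differ (for example, a squared relative error).
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) / a for p, a in zip(predicted, actual))
    return total / len(actual)


# Hypothetical forecasts and observed popularity counts for three submissions.
forecast = [110, 45, 1000]
observed = [100, 50, 1000]
error = mean_relative_error(forecast, observed)
```

Under this definition, a value of 0.10 would correspond to forecasts that are off by 10% of the true popularity on average.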
We performed experiments showing
that once content is exposed to a wide
audience, the social network provided
by the service does not affect which users will tend to look at the content, and
social networks are thus not effective
in promoting downloads on a large scale.
However, they are important in the
stages when content exposure is constrained to a small number of users.
On a technical level, a strong linear
correlation exists between the logarithmically transformed popularity of content at early and later times, with the residual noise on this transformed scale
being normally distributed. Based on
our understanding of this correlation,
we presented a model to be used to predict future popularity, comparing its
performance to the data we collected.
We thus based our predictions of future popularity only on values measurable at the time of the study and did
not consider the semantics of popularity, that is, why some submissions become
more popular than others; however, such
semantic information may be used to
predict click-through rates in the absence of early-access data [8]. In the presence of a large user base, predictions
can be based on observed early time
series, while semantic analysis of content is more useful when no early click-through information is available.
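The log-linear relationship summarized above (a linear correlation between log-transformed early and later popularity, with normally distributed residuals on the log scale) can be sketched as an ordinary least-squares fit. This is an illustrative sketch, not the authors' exact model: the function names `fit_log_linear` and `predict`, the free slope parameter, and the synthetic data are all assumptions introduced here.

```python
import math
import random
from statistics import mean, stdev


def fit_log_linear(early, late):
    """Least-squares fit of ln(late) = a + b * ln(early).

    Returns (a, b, sigma), where sigma is the standard deviation of
    the residuals on the log scale.
    """
    x = [math.log(v) for v in early]
    y = [math.log(v) for v in late]
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, stdev(residuals)


def predict(early_count, a, b):
    """Point forecast on the original scale, obtained by exponentiating
    the fitted line (the median if the log-scale noise is normal)."""
    return math.exp(a + b * math.log(early_count))


# Synthetic data, NOT from the article: later popularity is early
# popularity times a constant factor, with Gaussian noise in log space.
rng = random.Random(0)
early = [rng.randint(10, 1000) for _ in range(500)]
late = [math.exp(1.5 + math.log(n) + rng.gauss(0.0, 0.3)) for n in early]

a, b, sigma = fit_log_linear(early, late)
forecast = predict(200, a, b)
```

Fitting on the log scale means the noise acts multiplicatively on the original scale, so the expected size of the forecast error grows in proportion to the popularity being predicted.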
However, we could not explore several related areas here. For example,
it would be interesting to extend the
analysis by focusing on different sections of the portals (such as how the
YouTube “news & politics” section
differs from the YouTube “entertainment” section). We would also like to
learn whether it is possible to forecast
a Digg submission’s popularity when
the diggs come from only a small number of users whose voting history is
known, as it is for stories in Digg’s “upcoming” section.
References
1. Alexa Web Information Service; http://www.alexa.com
2. Cha, M., Kwak, H., Rodriguez, P., Ahn, Y.-Y., and Moon, S. I tube, you tube, everybody tubes: Analyzing the world’s largest user-generated content video system. In Proceedings of the Seventh ACM SIGCOMM Conference on Internet Measurement (San Diego, Oct. 24–26). ACM Press, New York, 2007, 1–14.
3. Cheng, X., Dale, C., and Liu, J. Statistics and social network of YouTube videos. In Proceedings of the 16th International Workshop on Quality of Service (Enschede, The Netherlands, June 2–4, 2008), 229–238.
4. Digg API; http://digg.com/api/docs/overview
5. Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 1. John Wiley & Sons, Inc., New York, 1968.
6. Gill, P., Arlitt, M., Li, Z., and Mahanti, A. YouTube traffic characterization: A view from the edge. In Proceedings of the Seventh ACM SIGCOMM Conference on Internet Measurement (San Diego, Oct. 24–26). ACM Press, New York, 2007, 15–28.
7. Lerman, K. Social information processing in news aggregation. IEEE Internet Computing (Special Issue on Social Search) 11, 6 (Nov. 2007), 16–28.
8. Richardson, M., Dominowska, E., and Ragno, R. Predicting clicks: Estimating the click-through rate for new ads. In Proceedings of the 16th International Conference on the World Wide Web (Banff, Alberta, Canada, May 8–12). ACM Press, New York, 2007, 521–530.
9. Wu, F. and Huberman, B.A. Novelty and collective attention. Proceedings of the National Academy of Sciences 104, 45 (Nov. 2007).
10. YouTube API; http://code.google.com/apis/youtube/overview.html
Gabor Szabo (gabors@hp.com) is a research scientist in the Social Computing Lab at Hewlett-Packard Labs, Palo Alto, CA.
Bernardo A. Huberman (bernardo.huberman@hp.com) is an HP Senior Fellow and director of the Social Computing Lab at Hewlett-Packard Labs, Palo Alto, CA.