lenges, with disambiguation being required at the artist, band, track, and
album levels.
Determining the entity being referred to in a particular text is akin to
a classification problem, whereby content (“comment” in our case) must be
assigned to a specific bucket, or category (artist, band, and/or track). Ellen
Riloff13 highlighted domain-cognizant
techniques for text classification, reflecting the need to focus on local linguistic context for classification and
retrieval.
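The idea of classifying a mention by its local linguistic context can be illustrated with a minimal sketch. It is not the Sound Index implementation or Riloff's method: the cue-word lists, the function name classify_mention, the window size, and the name "Skyline" are all hypothetical, and a real system would learn its cues from labeled comments rather than hard-code them.

```python
import re
from collections import Counter

# Hypothetical cue words per category (an assumption for illustration only).
CONTEXT_CUES = {
    "artist": {"she", "he", "her", "his", "singer", "voice", "concert"},
    "band": {"they", "their", "band", "gig", "members", "tour"},
    "track": {"song", "track", "single", "chorus", "lyrics", "listen"},
}


def classify_mention(comment: str, mention: str, window: int = 4) -> str:
    """Assign a mention to a category using only the words near it."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    mention_tokens = re.findall(r"[a-z']+", mention.lower())
    n = len(mention_tokens)
    if n == 0:
        return "unknown"

    scores = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != mention_tokens:
            continue
        # Local context: up to `window` tokens on either side of the mention.
        context = tokens[max(0, i - window):i] + tokens[i + n:i + n + window]
        for category, cues in CONTEXT_CUES.items():
            scores[category] += sum(1 for tok in context if tok in cues)

    # No occurrence of the mention, or no cue word nearby: decline to guess.
    if not scores or max(scores.values()) == 0:
        return "unknown"
    return max(scores, key=scores.get)


print(classify_mention("I love the new Skyline song, the chorus is great", "Skyline"))
# -> track
print(classify_mention("Saw Skyline in concert, her voice was amazing", "Skyline"))
# -> artist
```

The same name lands in different buckets depending on its neighboring words, which is the point of context-sensitive classification; returning "unknown" when no cue is present avoids forcing a label onto an ambiguous comment.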
In terms of engineering, the world
of mashups mirrors the music data requirements of Sound Index—a robust,
reliable, repeatable means of gathering data from multiple, diverse online sources. ScrAPIs (Screen-scraper
+ API) were proposed by John Musser
in 2006 as a means of mitigating the
problem of unreliable or unavailable
APIs from multiple content providers,11 though they, too, suffer from the
issues facing traditional screen-scrapers (such as breaking down when site
changes are made).
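One way to picture the trade-off is a single call that prefers a provider's API and falls back to screen-scraping when the API is unavailable. The sketch below is an illustration under stated assumptions, not Musser's definition or the Sound Index code: the markup (a span with class "plays"), the function names, and the play-count field are all hypothetical.

```python
from html.parser import HTMLParser
from typing import Callable, Optional


class PlayCountParser(HTMLParser):
    """Reads the integer inside <span class="plays">...</span> (hypothetical markup)."""

    def __init__(self) -> None:
        super().__init__()
        self._in_target = False
        self.play_count: Optional[int] = None

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "plays") in attrs:
            self._in_target = True

    def handle_data(self, data):
        if self._in_target and self.play_count is None:
            digits = data.replace(",", "").strip()
            if digits.isdigit():
                self.play_count = int(digits)

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_target = False


def scrape_play_count(html: str) -> int:
    """Screen-scrape the play count; fails when the expected markup is missing."""
    parser = PlayCountParser()
    parser.feed(html)
    if parser.play_count is None:
        # The classic scraper failure mode: the site layout changed.
        raise ValueError("play count not found; page layout may have changed")
    return parser.play_count


def get_play_count(api_call: Callable[[], int], fetch_page: Callable[[], str]) -> int:
    """Prefer the API; fall back to scraping when the API is unavailable."""
    try:
        return api_call()
    except (ConnectionError, TimeoutError):
        return scrape_play_count(fetch_page())


def unavailable_api() -> int:
    raise ConnectionError("API is down")


page = '<div><span class="plays">1,234</span></div>'
print(get_play_count(unavailable_api, lambda: page))
# -> 1234
```

The fallback keeps data flowing when an API is down, but scrape_play_count still raises as soon as the page no longer contains the expected element, which is the fragility noted above.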
Pilot
The BBC ran the Sound Index pilot
from March to August 2008. Its measures for success included feedback
from its editorial team, Web-use statistics, and general feedback from the
online community. Despite having no marketing or promotion budget or effort, Sound Index
went from a standing start as a public
beta in April 2008 to 43,469 visits from 37,900 unique users in June
2008. That month it drew 140,383 page
views, an average of 3.67 per user,
with each user spending an average of three
minutes and 40 seconds on the site, or
53 seconds per page. In August 2008,
it attracted more than 772,000 Web-page references.
The Sound Index team monitored
the online feedback by setting up
Google Alerts on all possible permutations of the project name, manually
evaluating each link. There was a lot
of positive comment from the Web
and from the traditional business and
technology press. It was named “Web
2.0 technology of the week” by the U.K.
Observer (http://www.guardian.co.uk/music) for several consecutive weeks
(between April and August 2008), as well
as “the hottest thing in music” (in
March 2008) by the U.K.’s Guardian
Music Monthly (http://www.guardian.co.uk/music). It also generated much
debate in European music circles
about what constitutes music popularity and what the results mean. The
pilot closed in August 2008, with the BBC
planning for its future.
Conclusion
Called the “first definitive music chart
for the Internet age,”14 Sound Index is
a novel demonstration of research into
processing, analyzing, collating, ranking, and presenting large quantities
of unstructured and structured multimodal information in response to a
change in the behavior of key demographic groups and a pressing industry need to innovate or risk becoming irrelevant. It models
a new approach to service and product delivery: integrating, in real time,
relevant online information from multiple sources
with one’s own data to create new and
significant value for one’s customer base, reinvigorate the connection to it, and strengthen its brand affinity.
Here, we’ve described the system’s
technical underpinnings, highlighted
some of the technical challenges already addressed, and showcased the
engineering and research themes that
require further investigation. The underlying concepts and processes are
also applicable to myriad other fields
that depend on the capture of Internet
buzz. We hope it inspires future software products and research projects
to harness the wisdom of the crowds.
Acknowledgments
We would like to thank the BBC, specifically Geoff Goodwin, Head of BBC
Switch, for its vision, support, and
encouragement, as well as Alfredo
Alba (IBM Almaden Research Center),
Jan Pieper (IBM Almaden Research
Center), Anna Liu (IBM Almaden Research Center), Bill J. Scott (formerly
IBM Global Business Services), Aidan
Toase (IBM Global Business Services),
and IBM’s partners at NovaRising, who
helped make the Sound Index system a
reality.
References
1. Adali, S., Hill, B., and Magdon-Ismail, M. The impact
of ranker quality on rank-aggregation algorithms:
Information vs. robustness. In Proceedings of the
22nd International Conference on Data Engineering
Workshops (Atlanta, GA, Apr. 3–7). IEEE Computer
Society, Washington, D.C., 2006, 37.
2. Alba, A., Bhagwan, V., and Grandison, T. Accessing the
deep Web: When good ideas go bad. In Proceedings
of the ACM SIGPLAN International Conference on
Object-Oriented Programming, Systems, Languages
and Applications (OOPSLA) (Nashville, TN, Oct.
25–29). ACM Press, New York, 2008, 815–818.
3. Alba, A., Bhagwan, V., Grace, J., Gruhl, D., Haas, K.,
Nagarajan, M., Pieper, J., Robson, C., and Sahoo, N.
Applications of voting theory to information mashups.
In Proceedings of the Second IEEE International
Conference on Semantic Computing (Santa Clara, CA,
Aug. 4–7). IEEE Press, 2008, 10–17.
4. de Borda, J.-C. Mémoire sur les élections au scrutin.
Histoire de l’Académie Royale des Sciences 1781;
http://asklepios.chez.com/XIX/borda.htm.
5. Diaconis, P. and Graham, R. Spearman’s footrule as a
measure of disarray. Journal of the Royal Statistical
Society, Series B (Methodological) 39, 2 (1977),
262–268.
6. Ferrucci, D. and Lally, A. UIMA: An architectural
approach to unstructured information processing
in the corporate research environment. Journal
of Natural Language Engineering 10, 3–4 (2004),
327–348.
7. Han, J. and Kamber, M. Data Mining: Concepts and
Techniques. Morgan Kaufmann Publishers, Inc., San
Francisco, 2001.
8. Hassell, J., Aleman-Meza, B., and Arpinar, I.B.
Ontology-driven automatic entity disambiguation in
unstructured text. In Proceedings of the International
Semantic Web Conference LNCS 4273 (Athens, GA,
Nov. 5–9). Springer, 2006, 44–57.
9. Lloyd, L., Bhagwan, V., Gruhl, D., and Tomkins, A.
Disambiguation of References to Individuals. IBM
Research Report RJ10364 (A0510-011). San Jose,
CA, Oct. 28, 2005; http://domino.watson.ibm.com/library/cyberdig.nsf/papers/D8265335C0AD4CD5852570AB00514720/$File/rj10364.pdf.
10. Mayfield, G. Billboard Hot 100 to include digital
streams. (July 31, 2007); http://www.billboard.com/bbcom/news/article_display.jsp?vnu_content_id=1003619084.
11. Musser, J. ScrAPIs. (Mar. 21, 2006); http://blog.programmableweb.com/2006/03/21/scrapis/.
12. Quinn, M. and Chang, A. More teens dissing discs in
favor of online tunes. Los Angeles Times (Feb. 27,
2008); http://www.latimes.com/news/nationworld/nation/la-fi-music-270208,1,2028285.story.
13. Riloff, E. Little words can make a big difference
for text classification. In Proceedings of the 18th
Annual ACM SIGIR Conference on Research and
Development in Information Retrieval (Seattle, WA,
July 9–13). ACM Press, New York, 1995, 130–136.
14. Salmon, C. Click to download. U.K. Guardian (Apr.
18, 2008); http://arts.guardian.co.uk/filmandmusic/story/0,,2274132,00.html.
15. Styvén, M. Exploring the Online Music Market:
Consumer Characteristics and Value Perceptions. Ph.D.
Thesis. Department of Business Administration and
Social Sciences, Luleå University of Technology, Luleå,
Sweden, 2007; http://epubl.ltu.se/14021544/2007/71/LTU-DT-0771-SE.pdf.
16. Walsh, G., Mitchell, V.-W., Frenzel, T., and Wiedmann,
K.-P. Internet-induced changes in consumer music
procurement behavior: A German perspective. Journal
of Marketing Intelligence & Planning 21, 5 (2003),
305–317.
17. Zhu, H., Siegel, M.D., and Madnick, S.E. Information
aggregation: A value-added e-service. In Proceedings
of the International Conference on Technology, Policy,
and Innovation: Critical Infrastructures (The Hague,
The Netherlands, June 26–29, 2001).
Varun Bhagwan (vbhagwan@us.ibm.com) is an advisory
software engineer in the Computer Science Department of
IBM Almaden Research Center, San Jose, CA.
Tyrone Grandison (tyroneg@us.ibm.com) is a manager
in the Computer Science Department of IBM Almaden
Research Center, San Jose, CA.
Daniel Gruhl (dgruhl@almaden.ibm.com) is a senior
software engineer in the Computer Science Department of
IBM Almaden Research Center, San Jose, CA.