As a result, the all-time Billboard record
for single-week upward movement has
been broken five times since 2006.
Meanwhile, the possibility of a new
payola scandal continues to haunt
radio stations and record-company
executives. This illegal marketing
phenomenon involves record labels
paying radio stations and/or disc jockeys for broadcasting, and more recently
streaming, records as part of a normal
day’s broadcast. U.S. federal law made
the practice illegal in 1934, yet as of
summer 2009, major broadcasters and record labels, including Clear Channel, CBS Radio, EMI,
Sony BMG, Universal Music, and Warner Music, have come under federal
investigation and in some cases had to
pay tens of millions of dollars in fines
and settlements. As radio airplay is a
major component of the music charts
and perceived popularity, these investigations in turn raise concerns about
the validity of the traditional music
charts themselves.
In order to address these issues and
incorporate today’s increasingly popular platform for music consumption,
the Web, the music-charts industry
must keep evolving or be left behind.
Solution
The Sound Index system catalogs the
hottest artists and tracks being talked
about on the Web. Incorporating
“listens,” plays, downloads, sales, and
comments from a multitude of online
communities and social networks, it
provides a current view of popular music content online; the associated filtering enables customized views of the
data to learn about, say, new tracks in a
particular genre of interest.
The system can be divided into four
distinct parts (see Figure 1), leveraging
technology called MONitoring Global
Online Opinions via Semantic Extraction, or MONGOOSE (http://www.almaden.ibm.com/cs/projects/iis/mongoose/). The first, ingestion, is the act
of gathering relevant unstructured and
structured content from various Web
sites (such as Bebo, Google Groups,
iTunes, LastFM, MySpace, and YouTube). These sources were chosen
because the BBC’s review team of music-domain experts identified them as
relevant and important to identifying
the tastes of its target demographic—
teens. The system analyzes and transforms the data into a standard schema.
The now-structured content is then
stored in the system’s database. Finally, the system generates music charts by
applying relevant ordering schemes.
Sound Index relies on broken-English text analytics technology, techniques for integrating information from different modalities, and ranking technologies.
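The flow described above (ingest, transform into a standard schema, store, then order into a chart) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Sound Index implementation: the `Mention` schema, the raw record keys (`site`, `artist`, `kind`, `count`), the in-memory SQLite store, and the "sum of counts" ordering scheme are all hypothetical choices made for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Mention:
    """Hypothetical standard schema shared by all sources."""
    source: str
    artist: str
    kind: str   # e.g. "play", "comment", "download"
    count: int


def transform(raw: dict) -> Mention:
    """Map one raw, source-specific record onto the shared schema."""
    return Mention(
        source=raw["site"],
        artist=raw["artist"].strip().title(),  # crude name normalization
        kind=raw.get("kind", "comment"),
        count=int(raw.get("count", 1)),
    )


def store(conn: sqlite3.Connection, mentions: list) -> None:
    """Persist the now-structured records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mentions "
        "(source TEXT, artist TEXT, kind TEXT, count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO mentions VALUES (?, ?, ?, ?)",
        [(m.source, m.artist, m.kind, m.count) for m in mentions],
    )


def chart(conn: sqlite3.Connection, limit: int = 10) -> list:
    """Apply one simple ordering scheme: total activity per artist."""
    rows = conn.execute(
        "SELECT artist, SUM(count) AS total FROM mentions "
        "GROUP BY artist ORDER BY total DESC, artist ASC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()


def run_pipeline(raw_records: list) -> list:
    """Ingested records in, ranked (artist, total) pairs out."""
    conn = sqlite3.connect(":memory:")
    store(conn, [transform(r) for r in raw_records])
    return chart(conn)
```

Keeping the transformation step separate from storage and ranking means a new source only needs its own mapping onto the shared schema; the ordering scheme can then be swapped without touching ingestion.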
Ingestion. In an ideal world, social
networking data, comments, and click
streams would all have a common
format that sites publish, facilitating
easy download and integration of information. However, most sites lack
functional application programming
interfaces (APIs). As a result, screen
scraping[a] is the rule for data ingestion [2],
which is problematic because screen scrapers
are susceptible to (even fairly minor)
changes in Web sites. Unfortunately,
these changes are common, as sites
strive to stay fashionable in an ever-changing cultural and business environment.
Screen scrapers also require a fair
amount of monitoring and maintenance. They need to log in to sites and
download the necessary content (such as
comments and view counts), then transform it into a simple format, normally just a collection of running-text
comments broken out (with markup
removed) for further processing.
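As a rough illustration of the "comments broken out with markup removed" step, the following sketch uses Python's standard `html.parser` module. It assumes, purely for the example, that each comment sits in an element whose `class` attribute contains `comment`; real sites use their own markup, which is exactly why scrapers break when a page layout changes.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class CommentExtractor(HTMLParser):
    """Collect the plain text of every element marked as a comment."""

    def __init__(self, comment_class: str = "comment"):
        super().__init__(convert_charrefs=True)
        self.comment_class = comment_class  # hypothetical class name
        self.comments = []
        self._depth = 0      # > 0 while inside a comment element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self._depth:
                self._buffer.append(" ")  # treat <br> etc. as whitespace
            return
        if self._depth:
            self._depth += 1  # nested markup inside a comment
        elif self.comment_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace left behind by the removed markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.comments.append(text)

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_comments(html: str) -> list:
    """Return the running-text comments found in one downloaded page."""
    parser = CommentExtractor()
    parser.feed(html)
    parser.close()
    return parser.comments
```

A scraper built this way depends entirely on the assumed class name, so even a minor redesign of the source page silently yields an empty list, which is one reason the monitoring discussed later matters.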
Some sites provide Really Simple
Syndication (RSS)-type[b] feeds that are especially useful for ingesting aggregated
data (such as total listens for a particular song). Sound Index uses a combination of screen scrapers, RSS feeds, and
APIs to ingest content based on the
quality and reliability of each ingestion
method for a given site.
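One simple way to combine several ingestion methods for a site is to try them in a per-site order of preference and fall back when one fails. The sketch below assumes that approach; the article does not say how Sound Index actually selects a method, and the `ingest_site` function, the method names, and the preference list are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A fetcher takes no arguments and returns a list of raw records.
Fetcher = Callable[[], List[dict]]


def ingest_site(
    methods: Dict[str, Fetcher],
    preference: List[str],
) -> Tuple[Optional[str], List[dict], Dict[str, str]]:
    """Try each ingestion method in order of preference.

    Returns the name of the method that worked (or None), the records
    it produced, and the error recorded for each method that failed.
    """
    errors: Dict[str, str] = {}
    for name in preference:
        fetch = methods.get(name)
        if fetch is None:
            continue  # this site does not offer that method
        try:
            records = fetch()
        except Exception as exc:  # any failure triggers the fallback
            errors[name] = str(exc)
            continue
        if records:
            return name, records, errors
        errors[name] = "no records returned"
    return None, [], errors
```

For example, a site might be configured with the preference `["api", "rss", "scraper"]`, so the scraper, the most fragile option, is used only when the other two yield nothing. The collected errors can be passed on to whatever monitoring the operator uses.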
Providing a reliable stream of data,
even from sites that are flaky and untrustworthy, is critical to Sound Index’s
success. As such, we have developed a
suite of tools and techniques to deal
with common error conditions and
quickly identify exotic ones and bring
them to the operator’s attention. In addition to the sanity-checking of values,
the system monitors a number of bulk
statistics on the streams themselves
at each step in the processing. This
monitoring allows the system to detect
when, say, the quantity of documents
entering the database from MySpace
[a] Screen scraping extracts data from machine- and display-friendly code.
[b] RSS is a family of Web-feed formats used to
publish frequently updated works (such as
blog entries, news headlines, audio, and video)
in a standard format.