is, say, half of what it was yesterday. The system then spot-checks the crawler statistics; if it sees that the number of documents fetched per hour has decreased, some kind of format change is likely preventing the low-level parsers from correctly splitting the comments out of the discussion pages. While these bulk statistics don’t tell the operator or Sound Index itself why something is not working, they are quite effective at helping reveal when something is not working.
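A minimal sketch of such a spot check, in Python with hypothetical metric names and thresholds (the article does not give Sound Index’s actual rules), might look like this:

def spot_check(today, yesterday):
    """Compare today's bulk crawler statistics against yesterday's."""
    alerts = []
    # Trigger: comment volume is, say, half of what it was yesterday.
    if today["comments"] <= 0.5 * yesterday["comments"]:
        # Fewer documents fetched per hour suggests a format change is
        # keeping the low-level parsers from splitting comments out of
        # the discussion pages.
        if today["docs_per_hour"] < yesterday["docs_per_hour"]:
            alerts.append("likely format change: parsers not splitting comments")
        else:
            alerts.append("comment volume halved; flag for operator review")
    return alerts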
Sound Index automates simple corrective actions, including killing and restarting fetchers and flushing domain name system cachesᶜ to correctly pick up changes in, say, the addresses of the targeted servers being crawled. Developing and automating these solutions is critical, as they reduce the need for early-morning service calls to system administrators. Sound Index uses Nagiosᵈ to monitor all aspects of the system’s performance, raising flags over problems (such as no data in the ingest feed and database-connection errors). Alba et al.² detailed additional challenges affecting Sound Index data access.
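A Python sketch of one such corrective action; the process names, paths, and cache-flush command here are placeholders, as the article does not describe Sound Index’s actual scripts:

import subprocess

def restart_fetcher(name):
    # Kill the wedged fetcher process and start a fresh one.
    subprocess.run(["pkill", "-f", name], check=False)
    subprocess.run(["/opt/soundindex/bin/fetcher", "--source", name],
                   check=False)  # placeholder restart command
    # Flush the local DNS cache so that targeted servers that have moved
    # to new IP addresses resolve correctly on the next fetch.
    subprocess.run(["nscd", "-i", "hosts"], check=False)  # placeholder flush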
Processing. All acquired data must be “cleaned” before it undergoes processing and analysis. For example, the cleaning of structured data generally consists of a few sanity checks. For numeric data that is expected to constantly increase (such as total video views), the system checks whether today’s total is lower than yesterday’s. If it is, the implied number of new views is negative, and something is clearly in error. Sound Index might then report zero views for the period rather than pass a clearly broken number to upstream processing, a scenario that is surprisingly frequent in the music domain. Also, some sources perform corrections that result in big jumps in structured numbers. As Sound Index reports data every six hours (while some source numbers are updated only every week), the system’s developers incorporated techniques for smoothing these numbers.
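In Python, these two cleaning steps might look like the following sketch; spreading a weekly jump evenly over the week’s 28 six-hour reporting periods is one simple smoothing choice, not necessarily the technique Sound Index uses:

def clean_delta(today_total, yesterday_total):
    # Total views should only grow; a negative delta means the source
    # published a broken number, so report zero views for the period
    # instead of propagating the error upstream.
    return max(today_total - yesterday_total, 0)

def smooth_weekly_jump(jump, periods_per_week=28):
    # Spread a once-a-week correction evenly over the 28 six-hour
    # reporting periods in that week (4 per day x 7 days).
    return jump / periods_per_week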
A major challenge in developing the system was figuring out how to eliminate “spam” from comment streams. Popular artists draw many visitors, a fact advertisers are quick to capitalize on. Up to 50% of a popular artist’s comments are what could be considered spam, ranging from the blatant (“Check out my page <URL>”) to the relatively subtle (“If you like this artist you will love <URL>”) to the simply off-topic (“I like ducks!”). As these are not music-related expressions, Sound Index needs to remove them from the tally; otherwise they could easily dominate (and distort) the results.
The Sound Index topic-detection methodology accounts for whether a post is on- or off-topic, with the latter consisting of spam or nonsense posts. Employing a combination of template spotting for extremely common spam phrases and a domain dictionary, it identifies the presence or absence of music-related terminology. This approach provides reasonable spam identification, reducing spam to the point where it has virtually no effect on relative counts. For on-topic posts, Sound Index extracts the relevant noun phrases, as well as the associated sentiment.
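A minimal Python sketch of this two-stage filter; the spam templates and music dictionary below are illustrative stand-ins, since the article does not publish Sound Index’s actual lists:

import re

SPAM_TEMPLATES = [
    re.compile(r"check out my page", re.I),
    re.compile(r"if you like .+ you will love", re.I),
]
MUSIC_TERMS = {"song", "album", "track", "tour", "lyrics", "voice", "band"}

def is_on_topic(post):
    # Template spotting: reject posts matching extremely common spam phrases.
    if any(t.search(post) for t in SPAM_TEMPLATES):
        return False
    # Domain dictionary: keep only posts containing music-related terms,
    # so off-topic chatter such as "I like ducks!" also falls out.
    words = set(re.findall(r"[a-z']+", post.lower()))
    return bool(words & MUSIC_TERMS)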
Identifying and removing spam is even more challenging in unstructured data. Especially in the music domain, slang, nonstandard spellings, and unusual linguistic constructs appear with some frequency. A good example is the comment “U R 50 Bad.” Parsing it is a complex, multi-step process. First, common variants must be rewritten into their standard English equivalents; for example, numbers substituted for letters must be reversed and texting abbreviations expanded, yielding “You are so bad” as the comment. The next step employs a feed of common slang expressions from sources like Urban Dictionary (http://www.urbandictionary.com/) to rewrite slang, getting the system to “You are very good.”
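These two rewrite passes can be sketched in Python as follows; the rewrite tables are illustrative, and a production system would draw its slang entries from a feed such as Urban Dictionary:

VARIANTS = {"u": "you", "r": "are", "50": "so"}  # letter substitutions, texting shorthand
SLANG = {"so bad": "very good"}                  # slang inversion: here "bad" means good

def normalize(comment):
    # Pass 1: rewrite common variants into standard English
    # ("U R 50 Bad" -> "you are so bad").
    text = " ".join(VARIANTS.get(w.lower(), w.lower()) for w in comment.split())
    # Pass 2: rewrite slang phrases using the dictionary feed
    # ("you are so bad" -> "you are very good").
    for phrase, standard in SLANG.items():
        text = text.replace(phrase, standard)
    return text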
Sound Index must also resolve ambiguous references. To do so, it considers all the possible artists to whom “You” could refer. If the comment appears on a fan page for, say, Amy Winehouse, the system concludes that she is the artist most likely being mentioned. The final parsed comment becomes “Amy Winehouse is very good,” a specific mention of an artist with a positive sentiment.
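A deliberately simplified sketch of that resolution step; the article says only that the system weighs all candidate artists and favors the one whose fan page hosts the comment:

def resolve_reference(comment, page_artist):
    # Prefer the artist whose fan page the comment appears on as the
    # referent of the ambiguous "you".
    return comment.replace("you are", page_artist + " is", 1)

# resolve_reference("you are very good", "Amy Winehouse")
# -> "Amy Winehouse is very good"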
The system then examines the demographic data for the poster (if available), perhaps determining that the poster is a 17-year-old female in the U.K. This is tallied as a single positive mention for Amy Winehouse by a user with those demographics. Each such data point serves as a dimension for aggregation in a subsequent step.
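One way to represent such a tally in Python, keying each mention on the dimensions just described (the exact field set is an assumption for illustration):

from collections import Counter

tallies = Counter()

def record_mention(artist, sentiment, age, gender, country):
    # Each field of the key is a dimension the later aggregation step
    # can roll up: by artist, by sentiment, by demographic slice, etc.
    tallies[(artist, sentiment, age, gender, country)] += 1

record_mention("Amy Winehouse", "positive", 17, "F", "UK")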
Resolving entity ambiguity is a ma-
[Figure 1. Sound Index system data flow: multiple sources feed crawlers; spam-filter, profanity-filter, transliteration, and sentiment annotators process the content into a database behind the Sound Index front end. Stages: data ingest (fetch data from Web sites, boards, blogs); annotators (process incoming content and add annotations); join and upload (perform joins on the data and prepare for upload); front end (data presented based on the preferences of target demographics).]
ᶜ DNS is the hierarchical naming system for Internet resources; its caches help route, resolve, and link domain names to IP addresses.
ᵈ Nagios (http://www.nagios.org/) is open source network-monitoring software.