studies that would be prohibitive at this scale and precision using manually annotated data.

The combination of an augmented training set and increased capacity and representational power of deep-learning models yields state-of-the-art performance. Our current machine-listening models can perform robust multi-label classification for 10 common classes of urban sound sources in real time, running on a laptop. We will soon adapt them to run under the computational constraints of the Raspberry Pi.
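As a rough illustration of what such a model involves (this is not the SONYC implementation), the following Python sketch tags an audio buffer with a small convolutional network whose per-class sigmoid outputs allow several overlapping sources to be detected at once; the class list, feature settings, and threshold are placeholders, and the untrained network only exercises the input/output shape of the problem.

    # Minimal multi-label urban-sound tagging sketch (illustrative only).
    import numpy as np
    import librosa
    import tensorflow as tf

    # Hypothetical label set of 10 common urban sound sources.
    CLASSES = ["jackhammer", "car horn", "siren", "dog bark", "drilling",
               "engine idling", "music", "people talking", "alarm", "chainsaw"]

    def log_mel(audio, sr=16000, n_mels=64):
        """Log-scaled mel spectrogram, a typical input representation."""
        mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
        return librosa.power_to_db(mel).T[np.newaxis, ..., np.newaxis]  # (1, time, mels, 1)

    # Small CNN with per-class sigmoid outputs: sources can overlap in a city
    # soundscape, so this is multi-label rather than single-label classification.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu"),
        tf.keras.layers.MaxPool2D(2),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(len(CLASSES), activation="sigmoid"),
    ])

    def tag(audio, sr=16000, threshold=0.5):
        """Return the classes whose predicted probability exceeds the threshold."""
        probs = model(log_mel(audio, sr), training=False).numpy()[0]
        return [(c, float(p)) for c, p in zip(CLASSES, probs) if p >= threshold]

    # With untrained weights the scores are meaningless; this just runs the pipeline.
    print(tag(np.zeros(16000, dtype=np.float32)))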
However, despite the advantages of data augmentation and synthesis, the lack of a significant amount of annotated data for supervised learning remains the main bottleneck in the development of machine-listening solutions that can detect more sources of noise. To address this need, we developed a framework for Web-based human audio annotation and conducted a large-scale experimental study on how visualization aids and acoustic conditions affect the annotation process and its effectiveness.6 We aimed to quantify the reliability/redundancy trade-off in crowdsourced soundscape annotation, investigate how visualizations affect accuracy and efficiency, and characterize how performance varies as a function of audio characteristics. Our study followed a between-subjects factorial experimental design in which we tested 18 different experimental conditions with 540 participants recruited through Amazon’s Mechanical Turk.
We found that more complex audio scenes result in lower annotator agreement and that spectrogram visualizations are superior at producing higher-quality annotations at lower cost in terms of time and human labor. Given enough time, all tested visualization aids enable annotators to identify sound events with similar recall, but the spectrogram visualization enables annotators to identify sounds more quickly. We speculate this may be because annotators are able to more easily identify visual patterns in the spectrogram, in turn enabling them to identify sound events and their boundaries more precisely and efficiently. We also found participants learn to use each interface more effectively over time, suggesting we can expect higher-quality annotations with only a small amount of additional training.
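One way to make the notion of annotator agreement concrete, purely as an illustration with a hypothetical data layout rather than our study's actual format, is to rasterize each annotator's sound-event labels onto a common frame grid and average the pairwise overlap, as in the following sketch; complex scenes with many overlapping sources tend to produce lower values.

    # Frame-level inter-annotator agreement for one clip (illustrative layout).
    from itertools import combinations
    import numpy as np

    def rasterize(events, clip_len=10.0, hop=0.1, classes=("siren", "jackhammer")):
        """events: list of (class, start_sec, end_sec) tuples from one annotator."""
        grid = np.zeros((len(classes), int(clip_len / hop)), dtype=bool)
        for label, start, end in events:
            grid[classes.index(label), int(start / hop):int(end / hop)] = True
        return grid

    def mean_pairwise_jaccard(annotations):
        """Average Jaccard overlap between every pair of annotators' grids."""
        grids = [rasterize(a) for a in annotations]
        scores = []
        for g1, g2 in combinations(grids, 2):
            inter = np.logical_and(g1, g2).sum()
            union = np.logical_or(g1, g2).sum()
            scores.append(1.0 if union == 0 else inter / union)
        return float(np.mean(scores))

    # Three annotators who largely agree on a siren from roughly 1s to 4s.
    anns = [[("siren", 1.0, 4.0)], [("siren", 1.2, 4.1)], [("siren", 0.9, 3.5)]]
    print(mean_pairwise_jaccard(anns))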
We found the value of additional annotators decreased after five to 10 annotators and that having 16 annotators was sufficient for capturing 90% of the gain in annotation quality. However, when resources are limited and cost is a concern, our findings suggest five annotators may be a reasonable choice for reliable annotation with respect to the trade-off between cost and quality. These findings are valuable for the design of audio-annotation interfaces and the use of crowdsourcing and citizen science strategies for audio annotation at scale.
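The redundancy trade-off can be illustrated with a small simulation (synthetic annotators, not the study's data): aggregate increasing numbers of noisy annotators by majority vote and score the aggregate against a reference labeling. The marginal benefit of each additional annotator shrinks quickly, which is the pattern behind the five-to-16 annotator guidance above.

    # Simulated quality-vs-redundancy curve via majority voting (illustrative).
    import numpy as np

    rng = np.random.default_rng(0)
    frames = 200
    reference = rng.random(frames) < 0.3            # "true" frame-level activity

    def noisy_copy(ref, flip=0.15):
        """A simulated annotator that flips each frame label with some probability."""
        return np.where(rng.random(frames) < flip, ~ref, ref)

    annotators = [noisy_copy(reference) for _ in range(20)]

    def f1(pred, ref):
        tp = np.logical_and(pred, ref).sum()
        fp = np.logical_and(pred, ~ref).sum()
        fn = np.logical_and(~pred, ref).sum()
        return 2 * tp / (2 * tp + fp + fn)

    for k in (1, 3, 5, 10, 16):
        majority = np.sum(annotators[:k], axis=0) > k / 2  # vote of the first k annotators
        print(k, round(f1(majority, reference), 3))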
Noise Analytics
One main SONYC promise is its future
ability to analyze and understand noise
pollution at city scale in an interactive
and efficient manner. As of December
2018, we had deployed 56 sensors, primarily in the city’s Greenwich Village
neighborhood, as well as in other locations in Manhattan, Brooklyn, and
Queens. Collectively, the sensors have
gathered the equivalent of 30 years of
audio data and more than 60 years of
sound-pressure levels and telemetry.
These numbers are a clear indication of
the magnitude of the challenge from a
data-analytics perspective.
We are currently developing a flexible, powerful visual-analytics framework that enables visualization of
noise levels in the context of the city,
together with other related urban data
streams. Working with urban data
poses further research challenges.
Although much work has focused on
scaling databases for big data, existing data-management technologies do
not meet the requirements needed to
interactively explore massive or even
reasonable-size datasets.8
Accomplishing interactivity requires not only efficient techniques for data and query management but also scalable visualization techniques capable of rendering large amounts of information.
In addition, visualizations and interfaces must be rendered in a form that is easily understood by domain experts and non-expert users alike.
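As a sketch of the kind of pre-aggregation that makes such interactive exploration feasible (with a hypothetical schema, not the SONYC data model), per-second sound-pressure readings can be rolled up into an hourly sensor-by-time cube that a visualization front end can query cheaply instead of scanning the raw stream.

    # Rolling raw SPL readings up into an hourly cube for interactive queries.
    import numpy as np
    import pandas as pd

    # Simulated raw stream: one A-weighted SPL reading per second per sensor.
    idx = pd.date_range("2018-12-01", periods=3600 * 6, freq="s")
    raw = pd.DataFrame({
        "timestamp": np.tile(idx, 2),
        "sensor_id": ["gv-01"] * len(idx) + ["bk-07"] * len(idx),
        "spl_dba": np.random.default_rng(0).normal(65, 5, 2 * len(idx)),
    })

    # Hourly aggregates per sensor: mean, peak, and an exceedance count
    # (seconds above 70 dBA), the quantities a dashboard would typically plot.
    cube = (raw.groupby(["sensor_id", pd.Grouper(key="timestamp", freq="1h")])["spl_dba"]
               .agg(mean="mean", peak="max", loud_seconds=lambda s: int((s > 70).sum()))
               .reset_index())
    print(cube.head())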