New York City also collects large amounts of data about city life, ranging from public safety, traffic, and taxi activity to construction, making much of it publicly available.c Our work involves close collaboration with city agencies, including DEP, DOHMH, various business-improvement districts, and private initiatives (such as LinkNYC) that provide access to existing infrastructure. As a powerful sensing-and-analysis infrastructure, SONYC thus holds the potential to enable new research in environmental psychology, public health, and public policy, as well as to empower citizens seeking to improve their own communities. We next describe the technology and methods underpinning the project, presenting some of our early findings and future challenges.

Acoustic Sensor Network
As mentioned earlier, SONYC’s intelligent sensing platform should be scalable and capable of source identification and high-quality, round-the-clock noise monitoring. To that end we have developed an acoustic sensor18 (see Figure 2) based on the popular Raspberry Pi single-board computer outfitted with a custom microelectromechanical-systems (MEMS) microphone module. We chose MEMS microphones for their low cost, their consistency across units, and their small size, which can be 10x smaller than that of conventional microphones. Our custom standalone microphone module adds further circuitry: in-house analog-to-digital converters and preamplifier stages, along with an on-board microcontroller that preprocesses the incoming audio signal to compensate for the microphone’s frequency response. The digital MEMS microphone features a wide dynamic range of 32dBA–120dBA, ensuring all urban sound pressure levels are monitored effectively. We calibrated it using a precision-grade sound-level meter as reference under low-noise anechoic conditions, and it was empirically shown to produce sound-pressure-level data compliant in accuracy with the ANSI Type-2 standard20 required by most local and national noise codes.

The sensor’s computing core is housed in an aluminum casing we chose to reduce radio-frequency interference (RFI) and
solar heat gain. The microphone module is mounted externally via a flexible
metal gooseneck attachment, making
it possible to reconfigure the sensor
node for deployment in varying locations, including sides of buildings,
light poles, and building ledges.
Apart from continuous SPL measurements, we designed the nodes to
sample 10-second audio snippets at
random intervals over a limited period of time, collecting data to train
and benchmark our machine-listening solutions. SONYC compresses the audio using the lossless FLAC audio coding format and encrypts it using AES, with the session key protected by a 4,096-bit RSA public/private key pair.
Sensor nodes communicate with the
server via a virtual private network, uploading audio and SPL data at one-minute intervals.
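
In outline, this hybrid scheme works as follows. The code below is a minimal sketch of the general technique, assuming Python’s cryptography package; the key sizes, padding choices, and file name are our illustrative assumptions, not the project’s published configuration:

import os
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Each node holds only the server's RSA public key; the private key
# needed for decryption never leaves the server.
with open("server_public_key.pem", "rb") as f:
    server_key = serialization.load_pem_public_key(f.read())

def encrypt_snippet(flac_bytes):
    # Encrypt the FLAC-compressed snippet with a fresh AES-256 session key.
    session_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)  # must be unique per encryption
    ciphertext = AESGCM(session_key).encrypt(nonce, flac_bytes, None)
    # Wrap the session key with RSA-OAEP so only the server can recover it.
    wrapped_key = server_key.encrypt(
        session_key,
        padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None))
    return wrapped_key, nonce, ciphertext

The node uploads the wrapped key, nonce, and ciphertext; because decryption requires the server’s private key, audio captured earlier cannot be recovered from a compromised node.
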
As of December 2018, each sensor cost approximately $80 in parts, using mostly off-the-shelf components. We fully expect to reduce the
unit cost significantly through custom
redesign for high-volume, third-party
assembly. However, even at the current price, SONYC sensors are significantly more affordable, and thus amenable to large-scale deployment, than
existing noise-monitoring solutions.
Moreover, this reduced cost does not
come at the expense of measurement
accuracy: our sensors’ performance is comparable to that of high-quality devices costing orders of magnitude more, while outperforming solutions in the same price range. Finally,
the dedicated computing core opens
the possibility for edge computing,
particularly for in-situ machine listening intended to automatically and
robustly identify the presence of common sound sources. This unique feature of SONYC goes well beyond the
capabilities of existing noise-monitoring solutions.
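
To make the edge-computing possibility concrete, an on-device detection loop might look like the following sketch; the mel-spectrogram front end and the classify stub are our placeholder assumptions, not SONYC’s deployed pipeline:

import numpy as np
import librosa

CLASSES = ["jackhammer", "idling engine", "car horn", "siren"]

def classify(features):
    # Placeholder for a trained on-device model; returns per-class probabilities.
    return np.zeros(len(CLASSES))

def detect(y, sr, threshold=0.5):
    # Summarize one audio window as a log-mel feature vector and report
    # detected labels; only labels, not raw audio, need be transmitted.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    features = librosa.power_to_db(mel).mean(axis=1)
    probs = classify(features)
    return [c for c, p in zip(CLASSES, probs) if p >= threshold]
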
Machine Listening at the Edge
Machine listening is the auditory counterpart to computer vision, combining techniques from signal processing and machine learning to develop systems able to extract meaningful information from sound. In the context of SONYC, we focus on developing computational methods to automatically detect specific types of sound sources (such as jackhammers, idling engines, car horns, and police sirens) from environmental audio. Detection is a challenge, given the complexity and diversity of sources, auditory scenes, and background conditions routinely found in noisy urban acoustic environments.
We thus created an urban sound taxonomy, annotated datasets, and various cutting-edge methods for urban sound-source identification.25,26 Our research shows that feature learning, even using simple dictionary-based methods (such as spherical k-means), yields a significant improvement in performance over the traditional approach of feature engineering. Moreover, we have found that temporal-shift invariance, whether through modulation spectra or deep convolutional networks, is crucial not only for overall accuracy but also for robustness in low signal-to-noise-ratio (SNR) conditions, as when sources of interest are in the background of acoustic scenes. Shift invariance also results in more compact models that can be trained with less data, adding value for edge-computing solutions. More recent results highlight the benefits of convolutional recurrent architectures, as well as ensembles of various models combined via late fusion.
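
As an illustration of the dictionary-based approach, the following sketch (our own construction, not the project’s code) learns a spherical k-means codebook over unit-normalized spectrogram patches and encodes a clip by max-pooling codeword similarities over time; the max-pooling step is one simple way to obtain temporal-shift invariance:

import numpy as np

def spherical_kmeans(X, k, iters=30, seed=0):
    # X: (n, d) array of patches with L2-normalized rows.
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), size=k, replace=False)]  # init from data
    for _ in range(iters):
        assign = (X @ D.T).argmax(axis=1)  # nearest centroid by cosine
        for j in range(k):
            members = X[assign == j]
            if len(members):
                c = members.sum(axis=0)
                D[j] = c / (np.linalg.norm(c) + 1e-12)  # back to unit sphere
    return D

def encode(spec, D, patch=8):
    # Slide over a (freq, time) spectrogram, compare each patch with the
    # codebook, and max-pool over time for a shift-invariant clip feature.
    frames = np.stack([spec[:, t:t + patch].ravel()
                       for t in range(spec.shape[1] - patch + 1)])
    frames /= np.linalg.norm(frames, axis=1, keepdims=True) + 1e-12
    return (frames @ D.T).max(axis=0)
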
Deep-learning models require large volumes of labeled data, traditionally unavailable for environmental sound. To address this shortage, we developed an audio data-augmentation framework that systematically deforms the data using well-known audio transformations (such as time stretching, pitch shifting, dynamic range compression, and addition of background noise at different SNRs), significantly increasing the amount of data available for model training.
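
The deformations themselves are standard signal-processing operations. A minimal sketch, assuming the librosa package (time_stretch and pitch_shift are librosa functions; the SNR-mixing helper is our own):

import numpy as np
import librosa

def mix_at_snr(signal, noise, snr_db):
    # Scale the noise so the mixture has the requested signal-to-noise ratio.
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + gain * noise

def augmentations(y, sr, noise):
    # Yield deformed copies of y; noise is assumed at least as long as y.
    for rate in (0.81, 1.23):
        yield librosa.effects.time_stretch(y, rate=rate)
    for steps in (-2, 2):
        yield librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
    for snr_db in (0, 10, 20):
        yield mix_at_snr(y, noise[:len(y)], snr_db)

Each deformation preserves the source label, so one annotated recording yields many distinct training examples.
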
We also developed an open-source tool for soundscape synthesis.27 Given a
collection of isolated sound events,
it functions as a high-level sequencer
that can generate multiple soundscapes from a single probabilistically
defined “specification.” We generated
large datasets of perfectly annotated
data in order to assess algorithmic
performance as a function of, say,
maximum polyphony and SNR.
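
A toy version of such a probabilistic sequencer (our illustration; the actual tool is described in the cited reference) samples events from a specification, mixes them into a background at sampled SNRs, and records the resulting ground truth:

import numpy as np

rng = np.random.default_rng(0)

def synthesize(spec, events, background, sr):
    # spec: {"n_events": (lo, hi), "labels": [...], "snr_db": (lo, hi)}
    # events: dict mapping label -> list of mono arrays, each shorter
    # than background. Returns one soundscape plus exact annotations.
    scape = background.copy()
    annotations = []
    for _ in range(rng.integers(*spec["n_events"])):
        label = rng.choice(spec["labels"])
        clip = events[label][rng.integers(len(events[label]))]
        onset = rng.integers(0, len(scape) - len(clip))
        snr_db = rng.uniform(*spec["snr_db"])
        # Scale the event against the background to hit the target SNR.
        gain = np.sqrt(np.mean(background ** 2) * 10 ** (snr_db / 10) /
                       (np.mean(clip ** 2) + 1e-12))
        scape[onset:onset + len(clip)] += gain * clip
        annotations.append((label, onset / sr, (onset + len(clip)) / sr, snr_db))
    return scape, annotations

Because the annotations are generated rather than hand-labeled, they are exact, which is what makes such data suitable for benchmarking.
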
c https://nycopendata.socrata.com