Machine Listening for Bird Migration Monitoring

Two posters at the North East Music Information Special Interest Group (NEMISIG)

On February 9th, 2019, the “North East Music Information Special Interest Group” (NEMISIG) workshop was held at Brooklyn College in New York, NY, USA. Elizabeth Mendoza, from Forest Hills High School, presented a poster on current BirdVox research, initiated during her ARISE internship with Vincent Lostanlen in the summer of 2018. Vincent Lostanlen presented a poster on computational music analysis.

We reproduce the presentation material below.

Vincent Lostanlen. Sparsity bounds in rhythmic tiling canons

Sparsity bounds in rhythmic tiling canons. Poster by Vincent Lostanlen at NEMISIG 2019


Elizabeth Mendoza. Synthesizing Training Data for Automatic Detection and Classification of Bird Songs

Elizabeth Mendoza's poster for NEMISIG 2019. Synthesizing Training Data for Automatic Detection and Classification of Bird Songs

Keynote and poster at the “Speech and Audio in the North-East” workshop

On October 18th, 2018, the “Speech and Audio in the North-East” (SANE) workshop was held at the headquarters of Google in Cambridge, MA, USA. Justin Salamon gave a 50-minute keynote on the recent research activities of BirdVox, and Kendra Oudyk presented a poster following her internship under the supervision of Vincent Lostanlen.

We reproduce the abstract and presentation material below.


Robust sound event detection in acoustic sensor networks

Justin Salamon

The combination of remote acoustic sensors with automatic sound recognition represents a powerful emerging technology for studying both natural and urban environments. At NYU we’ve been working on two projects whose aim is to develop and leverage this technology: the Sounds of New York City (SONYC) project is using acoustic sensors to understand noise patterns across NYC to improve noise mitigation efforts, and the BirdVox project is using them for the purpose of tracking bird migration patterns in collaboration with the Cornell Lab of Ornithology. Acoustic sensors present both unique opportunities and unique challenges when it comes to developing machine listening algorithms for automatic sound event detection: they facilitate the collection of large quantities of audio data, but the data are unlabeled, constraining our ability to leverage supervised machine learning algorithms. Training generalizable models becomes particularly challenging when training data come from a limited set of sensor locations (and times), and yet our models must generalize to unseen natural and urban environments with unknown and sometimes surprising confounding factors. In this talk I will present our work towards tackling these challenges along several different lines with neural network architectures, including novel pooling layers that allow us to better leverage weakly labeled training data, self-supervised audio embeddings that allow us to train high-accuracy models with a limited amount of labeled data, and context-adaptive networks that improve the robustness of our models to heterogeneous acoustic environments.



BirdVox-Imitation: A dataset of human imitations of birdsong with potential for research in psychology and machine listening

Kendra Oudyk, Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, and Juan Pablo Bello

Bird watchers imitate bird sounds in order to elicit vocal responses from birds in the forest, and thus locate the birds. Field guides offer various strategies for learning birdsong, from visualizing spectrograms to memorizing onomatopoeic sounds such as “fee bee”. However, imitating birds can be challenging for humans because we have a different vocal apparatus. Many birds can sing at higher pitches and over a wider range of pitches than humans. In addition, they can alternate between notes more rapidly and some produce complex timbres. Little is known about how humans spontaneously imitate birdsong, and the imitations themselves pose an interesting problem for machine listening. In order to facilitate research into these areas, here we present BirdVox-Imitation, an audio dataset of human imitations of birdsong. This dataset includes audio of a) 1700 imitations from 17 participants who each performed 10 imitations of 10 bird species; b) 100 birdsong stimuli that elicited the imitations; and c) over 6500 excerpts of birdsong, from which the stimuli were selected. These excerpts are ‘clean’ 3-10 second excerpts that were manually annotated and segmented from field recordings of birdsong. The original recordings were scraped from an online, open-access, crowdsourced repository of bird sounds. This dataset holds potential for research in both psychology and machine learning. In psychology, questions could be asked about how humans imitate birds: for example, about how humans imitate the pitch, timing, and timbre of birdsong, about when they use different imitation strategies (e.g., humming, whistling, and singing), and about the role of individual differences in musical training and bird-watching experience. In machine learning, this may be the first dataset that is both multimodal (human versus bird) and domain-adversarial (wherein domain refers to imitation strategy, such as whistling vs. humming), so there is plenty of room for developing new methods.
This dataset will soon be released on Zenodo to facilitate research in these novel areas of investigation.



The official page of the workshop is:


“PCEN: Why and How” published in IEEE SPL

We are happy to announce that our article, “Per-channel energy normalization: Why and How,” is featured in the latest issue of IEEE Signal Processing Letters.

In the context of automatic speech recognition and acoustic event detection, an adaptive procedure named per-channel energy normalization (PCEN) has recently been shown to outperform the pointwise logarithm of the mel-frequency spectrogram (logmelspec) as an acoustic frontend. This article investigates the adequacy of PCEN for spectrogram-based pattern recognition in far-field noisy recordings, from both theoretical and practical standpoints. First, we apply PCEN on various datasets of natural acoustic environments and find empirically that it Gaussianizes distributions of magnitudes while decorrelating frequency bands. Second, we describe the asymptotic regimes of each component in PCEN: temporal integration, gain control, and dynamic range compression. Third, we give practical advice for adapting PCEN parameters to the temporal properties of the noise to be mitigated, the signal to be enhanced, and the choice of time-frequency representation. As it converts a large class of real-world soundscapes into additive white Gaussian noise (AWGN), PCEN is a computationally efficient frontend for robust detection and classification of acoustic events in heterogeneous environments.

Per-channel energy normalization: Why and How
V. Lostanlen, J. Salamon, M. Cartwright, B. McFee, A. Farnsworth, S. Kelling, and J. P. Bello
In IEEE Signal Processing Letters, vol. 26, no. 1, pp. 39-43, January 2019.
[PDF][IEEE][Companion website][BibTeX][Copyright]

Below is a plot from our paper comparing the application of log vs PCEN on a mel-spectrogram computed from an audio recording captured by a remote acoustic sensor for avian flight call detection (as part of our BirdVox project). In the top plot (log) we clearly see energy from undesired noise sources such as insects and a car, whereas in the bottom plot (PCEN) we see these confounding factors have been attenuated, while the flight calls we wish to detect (which appear as very short chirps) are kept.

A soundscape comprising bird calls, insect stridulations, and a passing vehicle. The logarithmic transformation of the mel-frequency spectrogram (a) maps all magnitudes to a decibel-like scale, whereas per-channel energy normalization (b) enhances transient events (bird calls) while discarding stationary noise (insects) as well as slow changes in loudness (vehicle). Data provided by BirdVox.



Elizabeth Mendoza joins BirdVox

We are delighted to announce that Elizabeth Mendoza, a junior at Forest Hills High School (in the New York City borough of Queens), is joining BirdVox for a one-month research internship under the mentorship of Dr. Vincent Lostanlen, as part of the ARISE program at the NYU Tandon School of Engineering. Below is her research proposal.

ARISE (Applied Research Innovations in Science and Engineering) is an intensive program for academically strong, current 10th- and 11th-grade New York City students with a demonstrated interest in science, technology, engineering, and math. More information on ARISE can be found at this address.

Mendoza and Lostanlen working at the whiteboard

Synthesizing training data for automatic detection and classification of bird songs

Annual variations in the migratory routes of passerines are among the predominant markers of ecological disruption at temperate latitudes. Yet, although it is well established that migratory birds face an ever-increasing number of threats (including habitat loss, invasive species, and collisions with buildings or vehicles), little is known about the respective risk factors influencing the abundance of a given species at a fine spatiotemporal resolution. In this context, the deployment of an acoustic sensor network of autonomous recording units (ARUs) offers an interesting trade-off between a relatively low cost and a highly informative output. Yet, despite the growing interest in bioacoustic analysis in avian ecology, the scalability of ARU deployment is currently hampered by the shortage of human experts trained to pinpoint and identify bird vocalizations in continuous audio recordings. Closing the discrepancy between the cost of hardware ($1k/year) and the cost of human labor ($1M/year) is crucial to achieving the long-term goal of deploying an acoustic sensor network for bird migration monitoring at the continental scale. One way to reduce this annotation overhead is to assist human experts with software. As the past years have witnessed a relative democratization of high-performance computing (HPC), it has become possible to design more ambitious software architectures, notably deep learning, for large-scale automated classification of bird songs and calls. The main contribution of this ARISE internship is to address the lack of diversity in training data in the context of avian flight call detection in audio. To this aim, the intern will synthesize artificial sound recordings containing bird calls, alongside computer-generated annotations.
The release of these synthetic recordings to the international research community could enable the deployment of larger deep learning models while avoiding statistical overfitting, by virtue of a source of training data that is virtually infinite.
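As a rough illustration of this kind of data synthesis, the hypothetical helper below mixes an isolated flight call into a background recording at a chosen signal-to-noise ratio and emits the corresponding annotation. This is a minimal numpy sketch under assumed conventions (mono float arrays, onset in seconds), not the internship's actual pipeline; dedicated soundscape synthesis tools such as Scaper offer a more complete solution, including jams-format annotations.

```python
import numpy as np

def mix_call(background, call, onset, snr_db, sr=22050):
    """Mix an isolated flight call into a background recording at a target
    signal-to-noise ratio; return the mixture and a synthetic annotation."""
    mixture = background.copy()
    start = int(onset * sr)
    segment = mixture[start:start + len(call)]
    # Scale the call so that its power relative to the background
    # segment matches the requested SNR (in decibels).
    p_bg = np.mean(segment ** 2)
    p_call = np.mean(call ** 2)
    gain = np.sqrt(p_bg / p_call * 10 ** (snr_db / 10))
    mixture[start:start + len(call)] += gain * call
    annotation = {"onset": onset,
                  "offset": onset + len(call) / sr,
                  "label": "flight_call",
                  "snr_db": snr_db}
    return mixture, annotation
```

Sampling onsets, SNRs, and call exemplars at random from large pools of backgrounds and isolated calls yields a virtually unlimited labeled training set, as the proposal describes.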

Interview of BirdVox for NYU Scienceline podcast

Brianna Abbott, a graduate student in the Science, Health, and Environmental Reporting Program at New York University, interviewed Andrew Farnsworth and Vincent Lostanlen to discuss their research as part of the BirdVox project. Her podcast is published by Scienceline, the online outlet for science journalism of the Arthur L. Carter Journalism Institute.


It’s a bird! It’s a plane! No, wait, can you hear that? It actually is a bird. Keeping tabs on our feathered friends during migration is vital for conservation efforts, though dark skies and massive amounts of data make it tricky to do so. But individual species of birds talk with each other through flight calls, so we can listen in to determine exactly which species are flying overhead. And now, researchers are developing a machine learning system — dubbed BirdVox — that automatically picks out and identifies the different calls. In this podcast, creators of BirdVox lay out how they cut through the noise to get to the birds.
— Brianna Abbott, June 2018

Kendra Oudyk joins BirdVox

We are delighted to announce that Kendra Oudyk from the University of Jyväskylä (Finland) is joining BirdVox for a research internship. She is working on developing new computational tools for understanding how humans imitate bird songs.

Below is her research proposal and biography.


What was that bird? Birdsong query-by-humming using asymmetric set inclusion of pitch-curve segments 

The purpose of this project is to create a query-by-humming system for birdsong. Such a system would take a human imitation of birdsong as input, and output likely species classifications, as well as retrieved bird audio recordings that resemble the query.
This presents a unique methodological situation for two reasons:
  1. many methods for birdsong classification may not be applicable because they rely on spectral features that may not be imitable by humans; and
  2. alternatively, many methods for music query-by-humming may not be ideal because birdsong query by humming involves classifying a species rather than a particular song, and birdsong may vary both between and within individual birds of a species.
Therefore, this project will test a novel methodology for query-by-humming; the proposed method involves asymmetric set inclusion of query pitch-curve segments in the set of birdsong pitch-curve segments for each species in the system. This proof-of-concept research may have applications for creating a birdsong query-by-humming tool for everyday users, and additionally it may further our understanding of how humans imitate birdsong.
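To make the proposed matching criterion concrete, here is a minimal sketch of asymmetric set inclusion over pitch-curve segments. The function names, the resampling length, and the matching tolerance are illustrative assumptions, not the project's actual implementation: segments are resampled to a common length and transposed to a common mean before comparison, and the score counts only how many query segments find a match, without penalizing unmatched segments in the species' repertoire (a human typically imitates only part of it).

```python
import numpy as np

def segment_match(query_seg, ref_seg, tol=1.0):
    """Two pitch-curve segments (in semitones) 'match' if, after
    transposition (mean removal) and resampling to a common length,
    their mean absolute difference is below a tolerance."""
    n = 32
    q = np.interp(np.linspace(0, 1, n),
                  np.linspace(0, 1, len(query_seg)), query_seg)
    r = np.interp(np.linspace(0, 1, n),
                  np.linspace(0, 1, len(ref_seg)), ref_seg)
    q -= q.mean()
    r -= r.mean()
    return np.mean(np.abs(q - r)) < tol

def inclusion_score(query_segments, species_segments, tol=1.0):
    """Asymmetric set inclusion: the fraction of query segments matched
    by at least one segment in the species' repertoire. Unmatched
    repertoire segments are deliberately not penalized."""
    hits = sum(any(segment_match(q, s, tol) for s in species_segments)
               for q in query_segments)
    return hits / len(query_segments)
```

Ranking species by `inclusion_score` then yields the candidate classifications and retrieved recordings that the proposal describes.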


Kendra is in the second and final year of the Music, Mind, and Technology Master's Degree Program at the University of Jyväskylä in Finland, where she is also completing a minor in Cognitive Neuroscience. For her master's thesis, she used functional Magnetic Resonance Imaging to investigate how personality modulates brain responses to emotion in music, under the supervision of Drs. Iballa Burunat, Elvira Brattico, and Petri Toiviainen. She received funding from the European Commission to work on this project during the summer of 2017 at the Center for Music in the Brain at Aarhus University in Denmark. Previously, Kendra completed her undergraduate studies in Music Cognition as well as a Diploma in Music Performance (piano) at McMaster University in Canada. At McMaster, she received two Undergraduate Student Research Awards to investigate choral-conducting gestures using three-dimensional motion-capture technology, under the supervision of Drs. Steven Livingstone and Rachel Rensink-Hoff. Additionally, she has worked as a research assistant, teaching assistant, private piano teacher, and leader of wilderness camping trips. Kendra will begin doctoral studies in September at McGill University in the Integrated Program in Neuroscience's Rotation Program.

New publication in ICASSP 2018: BirdVox-full-night

We have recently released BirdVox-full-night, a challenging new dataset for machine learning on bioacoustic data.

Details about the dataset and the models we benchmarked are provided in our ICASSP 2018 paper:

BirdVox-full-night: a dataset and benchmark for avian flight call detection

V. Lostanlen, J. Salamon, J. P. Bello, A. Farnsworth, and S. Kelling
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, April 2018.
[PDF][Poster][Companion website][BibTeX][Copyright]

This article addresses the automatic detection of vocal, nocturnally migrating birds from a network of acoustic sensors. Thus far, owing to the lack of annotated continuous recordings, existing methods had been benchmarked in a binary classification setting (presence vs. absence). Instead, with the aim of comparing them in event detection, we release BirdVox-full-night, a dataset of 62 hours of audio comprising 35402 flight calls of nocturnally migrating birds, as recorded from 6 sensors. We find a large performance gap between energy-based detection functions and data-driven machine listening. The best model is a deep convolutional neural network trained with data augmentation. We correlate recall with the density of flight calls over time and frequency and identify the main causes of false alarm.

You can download the dataset after filling in the form on the companion website of the paper:

New publication in ICASSP 2017: Fusing Shallow and Deep Learning

Following on the heels of the PLOS ONE article, the second BirdVox publication will be presented at the ICASSP 2017 conference:

Fusing Shallow and Deep Learning for Bioacoustic Bird Species Classification
J. Salamon, J. P. Bello, A. Farnsworth and S. Kelling
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.


Automated classification of organisms to species based on their vocalizations would contribute tremendously to our ability to monitor biodiversity, with a wide range of applications in the field of ecology. In particular, automated classification of migrating birds’ flight calls could yield new biological insights and conservation applications for birds that vocalize during migration. In this paper we explore state-of-the-art classification techniques for large-vocabulary bird species classification from flight calls. In particular, we contrast a “shallow learning” approach based on unsupervised dictionary learning with a deep convolutional neural network combined with data augmentation. We show that the two models perform comparably on a dataset of 5428 flight calls spanning 43 different species, with both significantly outperforming an MFCC baseline. Finally, we show that by combining the models using a simple late-fusion approach we can further improve the results, obtaining a state-of-the-art classification accuracy of 0.96.
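The late-fusion step can be illustrated in a few lines: given class-probability vectors from the shallow and deep models, a weighted average is taken before the argmax. This is a minimal sketch of simple late fusion; the equal weighting shown here is an assumption for illustration, not necessarily the paper's exact fusion rule.

```python
import numpy as np

def late_fusion(proba_shallow, proba_deep, weight=0.5):
    """Late fusion of two classifiers: weighted average of their
    class-probability vectors, then argmax for the predicted class."""
    fused = (weight * np.asarray(proba_shallow)
             + (1 - weight) * np.asarray(proba_deep))
    return fused.argmax(axis=-1), fused
```

Because the two models make partially uncorrelated errors, the averaged posterior can be more accurate than either model alone.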

New publication in PLOS ONE

The first study to come out of the BirdVox project has just been published in PLOS ONE:

Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring
J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck, and S. Kelling
PLOS ONE 11(11): e0166866, 2016. doi: 10.1371/journal.pone.0166866.


Automatic classification of animal vocalizations has great potential to enhance the monitoring of species movements and behaviors. This is particularly true for monitoring nocturnal bird migration, where automated classification of migrants’ flight calls could yield new biological insights and conservation applications for birds that vocalize during migration. In this paper we investigate the automatic classification of bird species from flight calls, and in particular the relationship between two different problem formulations commonly found in the literature: classifying a short clip containing one of a fixed set of known species (N-class problem) and the continuous monitoring problem, the latter of which is relevant to migration monitoring. We implemented a state-of-the-art audio classification model based on unsupervised feature learning and evaluated it on three novel datasets, one for studying the N-class problem including over 5000 flight calls from 43 different species, and two realistic datasets for studying the monitoring scenario comprising hundreds of thousands of audio clips that were compiled by means of remote acoustic sensors deployed in the field during two migration seasons. We show that the model achieves high accuracy when classifying a clip to one of N known species, even for a large number of species. In contrast, the model does not perform as well in the continuous monitoring case. Through a detailed error analysis (that included full expert review of false positives and negatives) we show the model is confounded by varying background noise conditions and previously unseen vocalizations. We also show that the model needs to be parameterized and benchmarked differently for the continuous monitoring scenario. Finally, we show that despite the reduced performance, given the right conditions the model can still characterize the migration pattern of a specific species. The paper concludes with directions for future research.

BirdVox awarded grant from the National Science Foundation (NSF)

BirdVox has been awarded a $1.5 million grant from the National Science Foundation’s Big Data program for the project “BirdVox: Automatic Bird Species Identification from Flight Calls,” conducted jointly by NYU and the Cornell Lab of Ornithology (CLO), who lead the project.

Further information is provided in the NYU press release.

Collecting reliable, real-time data on the migratory patterns of birds can help foster more effective conservation practices, and – when correlated with other data – provide insight into important environmental phenomena. Scientists at CLO currently rely on information from weather surveillance radar, as well as reporting data from over 400,000 active birdwatchers, one of the largest and longest-standing citizen science networks in existence. However, there are important gaps in this information since radar imaging cannot differentiate between species, and most birds migrate at night, unobserved by citizen scientists. The combination of acoustic sensing and machine listening in this project addresses these shortcomings, providing valuable species-specific data that can help biologists complete the bird migration puzzle.


© 2019 BirdVox
