BirdVox

Machine Listening for Bird Migration Monitoring

Sripathi Sridhar presents Helicality at ISMIR LBD

On October 15th, NYU student Sripathi Sridhar will present a poster by the BirdVox team to the attendees of the Late-Breaking/Demo (LBD) session of the International Society for Music Information Retrieval (ISMIR) conference. We reproduce the abstract of the paper below.

 

Helicality: An Isomap-based Measure of Octave Equivalence in Audio Data

Sripathi Sridhar, Vincent Lostanlen

Octave equivalence serves as domain knowledge in MIR systems, including the chromagram, spiral convolutional networks, and the harmonic CQT. Prior work has applied the Isomap manifold learning algorithm to unlabeled audio data to embed frequency sub-bands in 3-D space, where Euclidean distances are inversely proportional to the strength of their Pearson correlations. However, discovering octave equivalence via Isomap requires visual inspection and is not scalable. To address this problem, we define “helicality” as the goodness of fit of the 3-D Isomap embedding to a Shepard-Risset helix. Our method is unsupervised and uses a custom Frank-Wolfe algorithm to minimize a least-squares objective inside a convex hull. Numerical experiments indicate that isolated musical notes have a higher helicality than speech, which in turn has a higher helicality than drum hits.
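
For intuition, here is a minimal sketch of the helicality idea: fit a parametric helix to a 3-D embedding by ordinary least squares and report the goodness of fit. It assumes 24 sub-bands per octave and a helix axis aligned with the third embedding coordinate, and it uses SciPy’s generic least-squares solver rather than the constrained Frank-Wolfe procedure of the paper; see the GitHub repository linked below for the actual implementation.

import numpy as np
from scipy.optimize import least_squares

def helix(params, q, bins_per_octave):
    # Parametric helix evaluated at sub-band indices q:
    # a circle in the (x, y) plane plus a linear drift along z.
    cx, cy, r, phi, z0, dz = params
    theta = 2 * np.pi * q / bins_per_octave + phi
    return np.stack([cx + r * np.cos(theta),
                     cy + r * np.sin(theta),
                     z0 + dz * q], axis=1)

def helicality(embedding, bins_per_octave=24):
    # embedding: array of shape (n_subbands, 3), one Isomap point per
    # CQT sub-band, ordered by increasing center frequency.
    q = np.arange(len(embedding))
    def residuals(params):
        return (helix(params, q, bins_per_octave) - embedding).ravel()
    p0 = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0 / bins_per_octave])
    fit = least_squares(residuals, p0)
    ss_res = np.sum(fit.fun ** 2)
    ss_tot = np.sum((embedding - embedding.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot  # 1.0 corresponds to a perfect helix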

 

We have uploaded the video of Sripathi’s presentation to YouTube:

 

Link to the preprint of the ISMIR LBD paper:
https://arxiv.org/abs/2010.00673

Link to the TinySOL dataset of isolated musical notes:
https://zenodo.org/record/3685367

Link to the source code to reproduce the figures of the paper: https://github.com/sripathisridhar/sridhar2020ismir

Sripathi Sridhar presents “Learning the helix topology of musical pitch” at IEEE ICASSP

In May 2020, NYU student Sripathi Sridhar presented a new paper by the BirdVox team to the attendees of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). We reproduce the abstract of the paper below.  

Learning the helix topology of musical pitch

Vincent Lostanlen, Sripathi Sridhar, Andrew Farnsworth, Juan Pablo Bello.

To explain the consonance of octaves, music psychologists represent pitch as a helix where azimuth and axial coordinate correspond to pitch class and pitch height respectively. This article addresses the problem of discovering this helical structure from unlabeled audio data. We measure Pearson correlations in the constant-Q transform (CQT) domain to build a K-nearest neighbor graph between frequency subbands. Then, we run the Isomap manifold learning algorithm to represent this graph in a three-dimensional space in which straight lines approximate graph geodesics. Experiments on isolated musical notes demonstrate that the resulting manifold resembles a helix which makes a full turn at every octave. A circular shape is also found in English speech, but not in urban noise. We discuss the impact of various design choices on the visualization: instrumentarium, loudness mapping function, and number of neighbors K.
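
As a rough illustration of this pipeline, the sketch below computes a constant-Q transform with librosa, measures Pearson correlations between sub-bands, and embeds them in three dimensions with scikit-learn’s Isomap. The bundled audio example, the 24-bins-per-octave resolution, and the use of 1 − correlation as the dissimilarity are illustrative choices, not the exact settings of the paper (whose code is linked below).

import numpy as np
import librosa
from sklearn.manifold import Isomap

# Any recording works; a bundled librosa example keeps the sketch self-contained.
y, sr = librosa.load(librosa.example("trumpet"), sr=22050)

# Constant-Q transform magnitudes: 24 bins per octave over 6 octaves.
C = np.abs(librosa.cqt(y, sr=sr, n_bins=24 * 6, bins_per_octave=24))

# Pearson correlation between frequency sub-bands across time frames.
rho = np.corrcoef(C)

# Turn correlations into dissimilarities and embed the K-nearest-neighbor
# graph in 3-D; n_neighbors plays the role of the hyperparameter K.
dissimilarity = np.clip(1.0 - rho, 0.0, None)
embedding = Isomap(n_neighbors=5, n_components=3,
                   metric="precomputed").fit_transform(dissimilarity)

print(embedding.shape)  # (144, 3): one 3-D point per frequency sub-band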

 

We have uploaded the video of Sripathi’s presentation to YouTube: 

 

 

The preprint of the ICASSP paper can be found at:
https://arxiv.org/abs/1910.10246

The TinySOL and SONYC-UST datasets can be downloaded from Zenodo:
https://zenodo.org/record/3685367

https://zenodo.org/record/3873076

The source code to reproduce the figures of the paper can be cloned from GitHub: https://github.com/BirdVox/lostanlen2020icassp

Jason Cramer presents TaxoNet at IEEE ICASSP

On May 5th, 2020, PhD candidate Jason Cramer presented a new paper by the BirdVox team to the attendees of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

We reproduce the abstract of the paper below.

 

Chirping up the right tree: Incorporating biological taxonomies in deep bioacoustic classifiers

Jason Cramer, Vincent Lostanlen, Andrew Farnsworth, Justin Salamon, Juan Pablo Bello

Class imbalance in the training data hinders the generalization ability of machine listening systems. In the context of bioacoustics, this issue may be circumvented by aggregating species labels into super-groups of higher taxonomic rank: genus, family, order, and so forth. However, different applications of machine listening to wildlife monitoring may require different levels of granularity. This paper introduces TaxoNet, a deep neural network for structured classification of signals from living organisms. TaxoNet is trained as a multitask and multilabel model, following a new architectural principle in end-to-end learning named “hierarchical composition”: shallow layers extract a shared representation to predict a root taxon, while deeper layers specialize recursively to lower-rank taxa. In this way, TaxoNet is capable of handling taxonomic uncertainty, out-of-vocabulary labels, and open-set deployment settings. An experimental benchmark on two new bioacoustic datasets (ANAFCC and BirdVox-14SD) leads to state-of-the-art results in bird species classification. Furthermore, on a task of coarse-grained classification, TaxoNet also outperforms a flat single-task model trained on aggregate labels.
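
To make the idea of hierarchical composition concrete, here is a minimal, hypothetical Keras sketch in which shallow shared layers predict a coarse taxon and progressively deeper heads predict finer taxa. The input shape, layer sizes, and numbers of classes are placeholders and do not correspond to the actual TaxoNet architecture, which is available in the birdvoxclassify repository linked below.

import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical input: a spectrogram patch (e.g., PCEN-normalized mel bands).
inputs = layers.Input(shape=(128, 104, 1))

# Shallow layers extract a representation shared by all taxonomic levels...
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2)(x)
shared = layers.GlobalAveragePooling2D()(x)

# ...which predicts the coarse (root) taxon directly.
coarse = layers.Dense(4, activation="softmax", name="order")(shared)

# Deeper layers specialize recursively: each level refines the previous one.
h_family = layers.Dense(64, activation="relu")(shared)
family = layers.Dense(12, activation="softmax", name="family")(h_family)

h_species = layers.Dense(64, activation="relu")(h_family)
species = layers.Dense(30, activation="softmax", name="species")(h_species)

model = Model(inputs, [coarse, family, species])
model.compile(optimizer="adam",
              loss={"order": "categorical_crossentropy",
                    "family": "categorical_crossentropy",
                    "species": "categorical_crossentropy"})
model.summary()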

 

We have uploaded the video of Jason’s presentation to YouTube:

 

The preprint of the ICASSP paper can be found at: http://www.justinsalamon.com/uploads/4/3/9/4/4394963/cramer_taxonet_icassp_2020.pdf

 

The BirdVox-ANAFCC dataset can be downloaded from Zenodo:
https://zenodo.org/record/3666782

 

The source code for the TaxoNet deep learning system can be cloned from GitHub:
https://github.com/BirdVox/birdvoxclassify

Kendra Oudyk presents “Matching human vocal imitations to birdsong” at VIHAR workshop

We are happy to announce that our article: “Matching human vocal imitations to birdsong: An exploratory analysis” is featured in the proceedings of the 2nd international workshop on Vocal Interactivity in-and-between Humans, Animals, and Robots (VIHAR).

This paper was written by two MSc students: Kendra Oudyk (now at McGill University) and Yun-Han Wu (now at Fraunhofer IIS). They were supervised by the BirdVox team: Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, and Juan Pablo Bello.

The abstract of the paper is reproduced below.

We explore computational strategies for matching human vocal imitations of birdsong to actual birdsong recordings. We recorded human vocal imitations of birdsong and subsequently analysed these data using three categories of audio features for matching imitations to original birdsong: spectral, temporal, and spectrotemporal. These exploratory analyses suggest that spectral features can help distinguish imitation strategies (e.g., whistling vs. singing) but are insufficient for distinguishing species. Similarly, whereas temporal features are correlated between human imitations and natural birdsong, they are also insufficient. Spectrotemporal features showed the greatest promise, in particular when used to extract a representation of the pitch contour of birdsong and human imitations. This finding suggests a link between the task of matching human imitations to birdsong and retrieval tasks in the music domain such as query-by-humming and cover song retrieval; we borrow from such existing methodologies to outline directions for future research.
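
As a hedged illustration of the pitch-contour idea, the sketch below extracts contours with librosa’s pYIN implementation and compares them with dynamic time warping, in the spirit of query-by-humming. The file names are hypothetical placeholders, and this is one plausible instantiation of the approach rather than the analysis pipeline used in the paper.

import numpy as np
import librosa

def pitch_contour(path, fmin=150.0, fmax=8000.0):
    # Log-frequency pitch contour estimated with pYIN; unvoiced frames are dropped.
    y, sr = librosa.load(path, sr=22050)
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    contour = np.log2(f0[voiced])
    # Subtract the median so that only the shape of the contour matters.
    return contour - np.median(contour)

# Hypothetical file names: one human imitation and one birdsong excerpt.
imitation = pitch_contour("imitation.wav")
birdsong = pitch_contour("birdsong.wav")

# Dynamic time warping between the two contours, as in query-by-humming.
D, wp = librosa.sequence.dtw(imitation[np.newaxis, :], birdsong[np.newaxis, :])
print("Normalized matching cost:", D[-1, -1] / len(wp))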

Kendra Oudyk presented the paper on August 30th in London, UK. The website of the VIHAR workshop is: http://vihar-2019.vihar.org/

Vincent Lostanlen presents at the Acoustical Society of America meeting

On December 5th, 2019, BirdVox researcher Vincent Lostanlen gave an invited 20-minute talk at the 178th meeting of the Acoustical Society of America (ASA) in San Diego, CA, USA. The talk was entitled “BirdVox: From flight call classification to full-season migration monitoring” and was featured in the special session on machine learning for animal bioacoustics.

We reproduce the abstract below. It appeared in the Journal of the Acoustical Society of America (JASA), volume 146, number 4, page 2984.

The BirdVox project aims at inventing new machine listening methods for the bioacoustic analysis of avian migration at the continental scale. It relies on an acoustic sensor network of low-cost, autonomous recording units to detect nocturnal flight calls and classify them in terms of family, genus, and species. As a result, each sensor produces a daily checklist of the species currently aloft, next to their respective individual counts. In this talk, I describe the research methods of BirdVox and their implications for advancing the understanding of animal behavior and conservation biology. The commonality of these methods is that they tightly integrate data-driven components alongside the induction of domain-specific knowledge. Furthermore, the resort to machine learning is not restricted to supervised acoustic event classification tasks, but also encompasses audio representation learning, few-shot active learning for efficient annotation, and Bayesian inference for adapting to multiple acoustic environments. I conclude with an overview of some open-source software tools for large-scale bioacoustics: librosa (spectrogram analysis), pysox (audio transformations), JAMS (rich annotation of audio events), muda (data augmentation), scaper (soundscape synthesis), pescador (stochastic sampling), and mir_eval (evaluation).

 

Link to the published JASA abstract:
https://asa.scitation.org/doi/abs/10.1121/1.5137328

Vincent Lostanlen presents at Dolby in SoHo

On April 25th, 2019, BirdVox researcher Vincent Lostanlen gave a 25-minute talk in the New York City neighborhood of SoHo, as part of a scientific workshop named “Artificial Intelligence in Audio: Applications, Advancements, and Trends” and organized by Dolby.

The video of the event can be found below.

More details about the event can be found at: https://soho.dolby.com/artificialintelligenceinaudio

“Long-distance detection of bioacoustic events with PCEN” presented at DCASE workshop

We are happy to announce that our article: “Long-distance detection of bioacoustic events with per-channel energy normalization” is featured in the proceedings of the DCASE 2019 workshop. This paper is a collaboration between the BirdVox project; Kaitlin Palmer from San Diego State University; Elly Knight from the University of Alberta; Christopher Clark and Holger Klinck from the Cornell Lab of Ornithology; NYU ARISE student Tina Wong; and Jason Cramer from New York University.

This paper proposes to perform unsupervised detection of bioacoustic events by pooling the magnitudes of spectrogram frames after per-channel energy normalization (PCEN). Although PCEN was originally developed for speech recognition, it also has beneficial effects in enhancing animal vocalizations, despite the presence of atmospheric absorption and intermittent noise. We prove that PCEN generalizes logarithm-based spectral flux, yet with a tunable time scale for background noise estimation. In comparison with pointwise logarithm, PCEN reduces false alarm rate by 50x in the near field and 5x in the far field, both on avian and marine bioacoustic datasets. Such improvements come at moderate computational cost and require no human intervention, thus heralding a promising future for PCEN in bioacoustics.
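
For a sense of how PCEN is applied in practice, here is a hedged sketch that contrasts the pointwise logarithm with librosa’s PCEN implementation and derives a simple unsupervised detection function by max-pooling PCEN magnitude over frequency. The input file, frequency band, time constant, and threshold are illustrative assumptions, not the settings of the paper.

import numpy as np
import librosa

# Hypothetical input: a long field recording containing flight calls.
y, sr = librosa.load("field_recording.wav", sr=22050)

# Mel spectrogram restricted to the band of interest (2-10 kHz for flight calls).
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmin=2000, fmax=10000)

# Baseline representation for comparison: pointwise logarithm.
log_S = librosa.power_to_db(S)

# Per-channel energy normalization; time_constant sets the time scale
# of background noise estimation discussed in the abstract.
pcen_S = librosa.pcen(S * (2 ** 31), sr=sr, hop_length=512, time_constant=0.4)

# Minimal detection function: max-pool PCEN magnitude over frequency,
# then flag frames that exceed a global threshold.
detection = pcen_S.max(axis=0)
threshold = detection.mean() + 3 * detection.std()
print("Detected frames:", np.flatnonzero(detection > threshold))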

 

Long-distance detection of bioacoustic events with per-channel energy normalization
V. Lostanlen, K. Palmer, E. Knight, C. Clark, H. Klinck, A. Farnsworth, T. Wong, J. Cramer, J.P. Bello
In Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE), 2019.
[PDF][Companion website]

 

@inproceedings{lostanlen2019dcase,
    author = "Lostanlen, Vincent and Palmer, Kaitlin and Knight, Elly and Clark, Christopher and Klinck, Holger and Farnsworth, Andrew and Wong, Tina and Cramer, Jason and Bello, Juan",
    title = "Long-distance Detection of Bioacoustic Events with Per-channel Energy Normalization",
    booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)",
    address = "New York University, NY, USA",
    month = "October",
    year = "2019",
    pages = "144--148",
}

 

 

The figure below displays the mel-frequency spectrogram of one Common Nighthawk call at various distances, after processing with either the pointwise logarithm (left) or PCEN (right). Atmospheric absorption is particularly noticeable beyond 200 meters, especially in the highest frequencies. Furthermore, we observe that max-pooled spectral flux is numerically unstable, because it triggers at different time-frequency bins from one sensor to the next. In comparison, PCEN is more consistent in reaching maximal magnitude at the onset of the call, and in the same frequency band.

Effect of pointwise logarithm (left) and per-channel energy normalization (right) on the same Common Nighthawk vocalization, as recorded from various distances. White dots depict the time-frequency locations of maximal spectral flux (left) or maximal PCEN magnitude (right). The spectrogram covers a duration of 700 ms and a frequency range between 2 and 10 kHz.

 

Phincho Sherpa presents BirdVoxPaint at NYU ARISE colloquium

We are delighted to announce that Phincho Sherpa, a junior student from Long Island City School (New York City borough of Queens) has joined BirdVox for a one-month research internship under the mentorship of Dr. Vincent Lostanlen, as part of the ARISE program at the NYU Tandon School of Engineering. Below is her research proposal.

ARISE (Applied Research Innovations in Science and Engineering) is an intensive program for academically strong, current 10th and 11th grade New York City students with a demonstrated interest in science, technology, engineering and math. More information on ARISE can be found on the website of the NYU Tandon School of Engineering.

Portrait of Phincho Sherpa

 

Visualization of bird call activity in the time-frequency domain

Understanding the sounds of wildlife is paramount to the conservation of ecosystems. To this end, researchers in bioacoustics deploy recording devices in the field, resulting in massive amounts of digital audio data. The goal of this internship is to develop a new method for making sense of long audio files, lasting an entire day or more, without needing to listen to them in full. In the case of short recordings, a convenient way to visualize sounds is the spectrogram, which represents the vocalizations of animals as bursts of acoustic energy in the time-frequency domain. For bird sounds, this method cannot be directly scaled up to long recordings, because the duration of a bird call (about one tenth of a second) is too short in comparison with the size of a long-term spectrogram pixel (about one minute). Phincho Sherpa will address this problem by modifying the representation of visual intensity in the time-frequency domain. In addition to energy, as in the conventional spectrogram, Phincho will explore other indices of acoustic complexity, such as entropy, flux, phase deviation, and per-channel energy normalization magnitude, and assign them to a multidimensional color map.
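
As a sketch of what such a representation might look like, the function below summarizes a long recording into an RGB image: each pixel column covers one minute, each row covers one mel-frequency band, and the red, green, and blue channels encode average energy, temporal entropy, and spectral flux respectively. These index choices, the block length, and the normalization are illustrative assumptions, not the final method developed during the internship.

import numpy as np
import librosa

def false_color_spectrogram(path, block_duration=60.0, n_mels=64, hop_length=512):
    # Summarize a long recording (hours) as an image of shape (n_mels, n_blocks, 3).
    y, sr = librosa.load(path, sr=22050)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    frames_per_block = int(block_duration * sr / hop_length)
    n_blocks = S.shape[1] // frames_per_block
    image = np.zeros((n_mels, n_blocks, 3))
    for b in range(n_blocks):
        block = S[:, b * frames_per_block:(b + 1) * frames_per_block]
        # Red channel: average energy per band.
        image[:, b, 0] = block.mean(axis=1)
        # Green channel: temporal entropy per band (how spread out the energy is in time).
        p = block / (block.sum(axis=1, keepdims=True) + 1e-12)
        image[:, b, 1] = -(p * np.log2(p + 1e-12)).sum(axis=1)
        # Blue channel: average spectral flux per band.
        image[:, b, 2] = np.abs(np.diff(block, axis=1)).mean(axis=1)
    # Normalize each channel to [0, 1] before display.
    return image / (image.max(axis=(0, 1), keepdims=True) + 1e-12)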

 

The yearly colloquium of the NYU ARISE program was held on August 16th, 2019. Phincho Sherpa’s poster can be found below:

Visualization of bird call activity in the time-frequency domain

 

BirdVoxPaint: False-color spectrograms for long-duration bioacoustic monitoring

During her internship, Phincho Sherpa created an open-source Python module, named BirdVoxPaint, for computing spectrotemporal indices of acoustic complexity in long-duration recordings. The initial release (v0.1.0) offers five of these indices:

  • acoustic complexity index (ACI) of Farina and Pieretti
  • average energy
  • entropy-based concentration of Sueur et al.
  • maximum energy
  • maximum spectral flux

BirdVoxPaint can be downloaded at the following link:

https://github.com/BirdVox/birdvoxpaint/releases/tag/v0.1.0
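
For reference, the first index in this list can be computed per frequency band with a few lines of NumPy. The function below is a simplified sketch for illustration, not the BirdVoxPaint implementation.

import numpy as np

def acoustic_complexity_index(S):
    # Acoustic complexity index of Farina and Pieretti for one time block.
    # S: magnitude spectrogram of shape (n_bands, n_frames) covering the block.
    # Returns one ACI value per frequency band: total frame-to-frame variation
    # normalized by the total magnitude in that band.
    flux = np.abs(np.diff(S, axis=1)).sum(axis=1)
    return flux / (S.sum(axis=1) + 1e-12)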

 

Two posters at the North East Music Information Special Interest Group (NEMISIG)

The “North East Music Information Special Interest Group” (NEMISIG) workshop was held on February 9th, 2019, at Brooklyn College in New York, NY, USA. Elizabeth Mendoza, from Forest Hills High School, presented a poster on current BirdVox research, initiated during her ARISE internship with Vincent Lostanlen in summer 2018. Vincent Lostanlen presented a poster on computational music analysis.

We reproduce the presentation material below.

Vincent Lostanlen. Sparsity bounds in rhythmic tiling canons

 

Sparsity bounds in rhythmic tiling canons. Poster by Vincent Lostanlen at NEMISIG 2019

 

Elizabeth Mendoza. Synthesizing Training Data for Automatic Detection & Classification of Bird Songs

Elizabeth Mendoza's poster for NEMISIG 2019. Synthesizing Training Data for Automatic Detection and Classification of Bird Songs

Keynote and poster at the “Speech and Audio in the North-East” workshop

The “Speech and Audio in the North-East” (SANE) workshop was held on October 18th, 2018, at the Google offices in Cambridge, MA, USA. Justin Salamon gave a 50-minute keynote on the recent research activities of BirdVox, and Kendra Oudyk presented a poster following her internship under the supervision of Vincent Lostanlen.

We reproduce the abstract and presentation material below.

 

Robust sound event detection in acoustic sensor networks

Justin Salamon

The combination of remote acoustic sensors with automatic sound recognition represents a powerful emerging technology for studying both natural and urban environments. At NYU we’ve been working on two projects whose aim is to develop and leverage this technology: the Sounds of New York City (SONYC) project is using acoustic sensors to understand noise patterns across NYC to improve noise mitigation efforts, and the BirdVox project is using them for the purpose of tracking bird migration patterns in collaboration with the Cornell Lab of Ornithology. Acoustic sensors present both unique opportunities and unique challenges when it comes to developing machine listening algorithms for automatic sound event detection: they facilitate the collection of large quantities of audio data, but the data are unlabeled, constraining our ability to leverage supervised machine learning algorithms. Training generalizable models becomes particularly challenging when training data come from a limited set of sensor locations (and times), and yet our models must generalize to unseen natural and urban environments with unknown and sometimes surprising confounding factors. In this talk I will present our work towards tackling these challenges along several different lines with neural network architectures, including novel pooling layers that allow us to better leverage weakly labeled training data, self-supervised audio embeddings that allow us to train high-accuracy models with a limited amount of labeled data, and context-adaptive networks that improve the robustness of our models to heterogeneous acoustic environments.

[YouTube][slides]
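
One example of such a pooling layer is softmax-weighted adaptive pooling over time (“auto-pool”). Below is a simplified, hedged sketch of that idea, with a single learnable sharpness parameter shared across all classes; it is an illustration, not the exact layer used in the SONYC or BirdVox models.

import tensorflow as tf

class AutoPool(tf.keras.layers.Layer):
    # Softmax-weighted pooling over time with a learnable sharpness alpha:
    # alpha = 0 recovers mean pooling; large alpha approaches max pooling.

    def build(self, input_shape):
        self.alpha = self.add_weight(name="alpha", shape=(1,),
                                     initializer="zeros", trainable=True)

    def call(self, frame_probs):
        # frame_probs: (batch, time, classes) frame-level event probabilities.
        weights = tf.nn.softmax(self.alpha * frame_probs, axis=1)
        return tf.reduce_sum(weights * frame_probs, axis=1)

# Usage: pool frame-level predictions into one clip-level prediction per class.
frames = tf.random.uniform((8, 100, 14))  # 8 clips, 100 frames, 14 classes
clip_probs = AutoPool()(frames)           # shape: (8, 14)
print(clip_probs.shape)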

 

BirdVox-Imitation: A dataset of human imitations of birdsong with potential for research in psychology and machine listening

Kendra Oudyk, Vincent Lostanlen, Justin Salamon, Andrew Farnsworth, and Juan Pablo Bello

Bird watchers imitate bird sounds in order to elicit vocal responses from birds in the forest, and thus locate the birds. Field guides offer various strategies for learning birdsong, from visualizing spectrograms to memorizing onomatopoeic sounds such as “fee bee”. However, imitating birds can be challenging for humans because we have a different vocal apparatus. Many birds can sing at higher pitches and over a wider range of pitches than humans. In addition, they can alternate between notes more rapidly and some produce complex timbres. Little is known about how humans spontaneously imitate birdsong, and the imitations themselves pose an interesting problem for machine listening. In order to facilitate research into these areas, here we present BirdVox-Imitation, an audio dataset of human imitations of birdsong. This dataset includes audio of a) 1700 imitations from 17 participants who each performed 10 imitations of 10 bird species; b) 100 birdsong stimuli that elicited the imitations; and c) over 6500 excerpts of birdsong, from which the stimuli were selected. These are ‘clean’ 3-10 second excerpts that were manually annotated and segmented from field recordings of birdsong. The original recordings were scraped from xeno-canto.org: an online, open-access, crowdsourced repository of bird sounds. This dataset holds potential for research in both psychology and machine learning. In psychology, questions could be asked about how humans imitate birds: for example, about how humans imitate the pitch, timing, and timbre of birdsong, about when they use different imitation strategies (e.g., humming, whistling, and singing), and about the role of individual differences in musical training and bird-watching experience. In machine learning, this may be the first dataset that is both multimodal (human versus bird) and domain-adversarial (wherein domain refers to imitation strategy, such as whistling vs. humming), so there is plenty of room for developing new methods. This dataset will soon be released on Zenodo to facilitate research in these novel areas of investigation.

[PDF]

 

The official page of the workshop is: http://www.saneworkshop.org/sane2018/

 


© 2024 BirdVox
