YAMNet: A pretrained audio event classifier

Dan Ellis

Nov 21, 2019, 4:30:06 PM

to audioset-users, [email protected], Manoj Plakal

We are pleased to announce the release of YAMNet. To quote the README on GitHub:

YAMNet is a pretrained deep net that predicts 521 audio event classes based on the AudioSet-YouTube corpus, and employing the Mobilenet_v1 depthwise-separable convolution architecture.
This directory contains the Keras code to construct the model, and example code for applying the model to input sound files.

This release includes a Jupyter notebook that illustrates reading an audio file (at 16 kHz sampling rate) and displaying the scores for the most likely audio event classes at a 10 Hz frame rate.
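As a rough illustration of the "most likely classes" step, here is a minimal sketch that averages per-frame scores over time and lists the top classes. It does not load YAMNet: the `scores` matrix, the three class names, and the `top_classes` helper are made-up placeholders (the real model produces 521 scores per frame), so treat it as an assumption about how such a summary might be computed rather than as the notebook's actual code.

```python
# Hypothetical per-frame scores: one row per time frame, one column per class.
# Real YAMNet output has 521 columns; three placeholder classes keep this readable.
class_names = ["Speech", "Music", "Dog"]
scores = [
    [0.80, 0.10, 0.05],
    [0.70, 0.20, 0.10],
    [0.60, 0.10, 0.40],
]


def top_classes(scores, class_names, k=2):
    """Average scores over frames and return the k best (name, mean score) pairs."""
    n_frames = len(scores)
    means = [
        sum(frame[c] for frame in scores) / n_frames
        for c in range(len(class_names))
    ]
    # Sort classes by mean score, highest first.
    ranked = sorted(zip(class_names, means), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


for name, mean_score in top_classes(scores, class_names):
    print(f"{name}: {mean_score:.3f}")
```

With the placeholder data above, "Speech" ranks first and "Dog" second.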

YAMNet is trained on 1,574,587 10-second YouTube soundtrack excerpts from within the AudioSet unbalanced train segments. We included a few refinements to mitigate the challenges of imbalanced priors and weak labels, which we will describe in a forthcoming paper (details TBA). Over the 521 labels it predicts, on the 20,366-segment AudioSet Eval set, YAMNet achieves a d-prime of 2.318, balanced mAP of 0.306, and a balanced average lwlrap of 0.393 (lwlrap is a per-sample label-ranking measure described in section 2.1 of the DCASE 2019 Task 2 Overview Paper).
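For readers unfamiliar with these metrics, the following sketch shows how d-prime and lwlrap are commonly computed. It assumes the usual AudioSet convention that d-prime is derived from ROC AUC as d' = sqrt(2) * Φ⁻¹(AUC), and it computes lwlrap as the mean, over every (sample, true label) pair, of the precision at that label's rank. The function names are illustrative, and this is not the evaluation code used to produce the numbers above (in particular, it does not reproduce the class balancing mentioned there).

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def d_prime_from_auc(auc):
    """d' = sqrt(2) * inverse normal CDF of the ROC AUC (requires 0 < auc < 1)."""
    return sqrt(2.0) * _STD_NORMAL.inv_cdf(auc)


def auc_from_d_prime(d_prime):
    """Inverse of d_prime_from_auc."""
    return _STD_NORMAL.cdf(d_prime / sqrt(2.0))


def lwlrap(scores, truth):
    """Label-weighted label-ranking average precision.

    scores: list of per-sample score lists (one score per class).
    truth:  list of per-sample sets of true class indices.
    """
    total, count = 0.0, 0
    for sample_scores, true_labels in zip(scores, truth):
        # Class indices ordered by descending score (ties keep index order).
        order = sorted(
            range(len(sample_scores)),
            key=lambda c: sample_scores[c],
            reverse=True,
        )
        hits = 0
        for rank, c in enumerate(order, start=1):
            if c in true_labels:
                hits += 1
                total += hits / rank  # precision at this true label's rank
                count += 1
    return total / count if count else 0.0
```

Under this convention, `auc_from_d_prime(2.318)` works out to roughly 0.949, so the reported d-prime corresponds to an average per-class AUC of about 0.95.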

We are releasing this model to provide a baseline for audio event classification, and to stimulate the development of novel audio event classification applications. We hope you enjoy it!

DAn.

on behalf of Sound Understanding in Google AI Perception

https://research.google.com/teams/perception/

Honghe Wu

Nov 29, 2019, 3:16:44 AM

to audioset-users

Thank you for your work and for the model! The forthcoming paper would be wonderful; I can't wait.

Alex Kravchenko

Dec 13, 2019, 3:41:07 AM

to audioset-users

Hello

YAMNet seems to recognize sounds better than VGGish + YouTube-8M. But how can we train YAMNet ourselves in the first place?

Best Regards.

On Friday, November 22, 2019 at 0:30:06 UTC+3, Dan Ellis wrote:

Dan Ellis

Dec 13, 2019, 7:33:42 AM

to Alex Kravchenko, audioset-users

> YAMNet seems to recognize sounds better than VGGish + YouTube-8M. But how can we train YAMNet ourselves in the first place?

We are working on a paper describing how YAMNet was trained, but we will not be releasing the actual training data (nor the actual code to implement our training scheme).

We hope that the pre-trained network will be useful nonetheless.

DAn.

--
You received this message because you are subscribed to the Google Groups "audioset-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/audioset-users/ce488727-0727-47e7-977e-f0233df90ded%40googlegroups.com.

Keunhong Park

Dec 19, 2019, 7:49:59 PM

to audioset-users

Hello,

Could you elaborate on where YAMNet stands relative to VGGish? Does it supersede the VGGish model?

Thanks,

Keunhong

Dan Ellis

Dec 20, 2019, 9:16:23 AM

to Keunhong Park, audioset-users

YAMNet is trained on AudioSet data to predict specific AudioSet labels. It includes a bunch of innovations to improve the quality of those predictions.

VGGish was trained on YT8M (generic video topic labels, not audio-specific) to provide a general-purpose embedding, not specific class outputs.

YAMNet is about 1/20th the size of VGGish (because it employs the efficient scheme of depthwise-separable convolutions). Like VGGish, it can also be used to generate an audio embedding vector describing each 960 ms frame of audio (by taking the values before the final logistic layer), but this is 1024-D instead of the 128-D of VGGish.
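To make the embedding sizes concrete, here is a small arithmetic sketch of how many embedding frames a clip yields and how large the resulting matrix is. Only the 960 ms frame length and the 1024-D and 128-D sizes come from the text above. The hop values (480 ms for YAMNet, 960 ms for VGGish) and the `embedding_shape` helper are assumptions for illustration, and the calculation ignores padding and spectrogram edge effects, so the real models may return a slightly different frame count.

```python
def embedding_shape(duration_ms, window_ms=960, hop_ms=480, dims=1024):
    """Approximate (n_frames, dims) for a clip of the given length.

    window_ms is the 960 ms frame length from the thread; hop_ms is an
    assumed hop between frames, not something stated in the thread.
    """
    if duration_ms < window_ms:
        return (0, dims)  # clip too short for even one full frame
    n_frames = (duration_ms - window_ms) // hop_ms + 1
    return (n_frames, dims)


# A 10-second AudioSet-style excerpt under the assumed hops.
yamnet_shape = embedding_shape(10_000)                        # assumed 480 ms hop
vggish_shape = embedding_shape(10_000, hop_ms=960, dims=128)  # assumed 960 ms hop

for name, (frames, dims) in [("YAMNet", yamnet_shape), ("VGGish", vggish_shape)]:
    print(f"{name}: {frames} frames x {dims} dims = {frames * dims} values")
```

Under these assumptions, the larger per-frame embedding is the main storage cost of switching from VGGish to YAMNet embeddings.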

The VGGish model we released included PCA rotation of the final output, to match the data released with YT8M. YAMNet doesn't need this extra complication, since we're not trying to match any other system.

I hope this helps clarify the distinctions. Unless you need compatibility with other work, or are looking for a compact embedding, I think YAMNet should cover most needs for a pretrained audio network.

DAn.

