# YAMNet

YAMNet is a pretrained deep net that predicts 521 audio event classes based on
the [AudioSet-YouTube corpus](http://g.co/audioset), employing the
[Mobilenet_v1](https://arxiv.org/pdf/1704.04861.pdf) depthwise-separable
convolution architecture.

This directory contains the Keras code to construct the model, and example code
for applying the model to input sound files.

## Installation

YAMNet depends on the following Python packages:

* [`numpy`](http://www.numpy.org/)
* [`resampy`](http://resampy.readthedocs.io/en/latest/)
* [`tensorflow`](http://www.tensorflow.org/)
* [`pysoundfile`](https://pysoundfile.readthedocs.io/)

These are all easily installable via, e.g., `pip install numpy` (as in the
example command sequence below).

Any reasonably recent version of these packages should work. TensorFlow should
be at least version 1.8 to ensure Keras support is included. Note that while
the code works fine with TensorFlow v1.x or v2.x, we explicitly enable v1.x
behavior.
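
If you are running a TensorFlow 2.x install and want to match the v1.x behavior
mentioned above, the snippet below shows one common way to pin it. This is only
a sketch of the general technique; see the code in this directory for the
mechanism it actually uses.

```python
# One common way to run TF1-style (graph/session) behavior under TF2.
# This illustrates the idea only; the YAMNet code may do it differently.
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()

print(tf.__version__)
```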

YAMNet also requires downloading the following data file:

* [YAMNet model weights](https://storage.googleapis.com/audioset/yamnet.h5),
  provided as Keras saved weights in HDF5 format.

After downloading this file into the same directory as this README, the
installation can be tested by running `python yamnet_test.py`, which
runs some synthetic signals through the model and checks the outputs.

Here's a sample installation and test session:

```shell
# Upgrade pip first. Also make sure wheel is installed.
python -m pip install --upgrade pip wheel
# Install dependencies.
pip install numpy resampy tensorflow soundfile
# Clone the TensorFlow models repo into a 'models' directory.
git clone https://github.com/tensorflow/models.git
cd models/research/audioset/yamnet
# Download the data file into the same directory as the code.
curl -O https://storage.googleapis.com/audioset/yamnet.h5
# Installation ready, let's test it.
python yamnet_test.py
# If we see "Ran 4 tests ... OK ...", then we're all set.
```

## Usage

You can run the model over existing sound files using `inference.py`:

```shell
python inference.py input_sound.wav
```

The code will report the five highest-scoring classes, averaged over all the
frames of the input. You can access greater detail by modifying the example
code in `inference.py`.
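
If you would rather get the scores from your own Python code, the sketch below
mirrors the general shape of `inference.py`. The helper names
(`yamnet_frames_model`, `class_names`), the use of the `params` module, and the
model's output signature are assumptions based on the example code, so verify
them against the files in this directory before relying on this snippet.

```python
# Sketch of programmatic inference, modeled loosely on inference.py.
# Helper names and output ordering are assumptions -- check yamnet.py and
# params.py in this directory before relying on them.
import numpy as np
import resampy
import soundfile as sf

import params
import yamnet as yamnet_model

# Build the model and load the downloaded weights.
yamnet = yamnet_model.yamnet_frames_model(params)
yamnet.load_weights('yamnet.h5')
class_names = yamnet_model.class_names('yamnet_class_map.csv')

# Read a wav file as mono float32 in [-1, 1] and resample to 16 kHz.
waveform, sr = sf.read('input_sound.wav', dtype='float32')
if waveform.ndim > 1:
  waveform = waveform.mean(axis=1)
if sr != 16000:
  waveform = resampy.resample(waveform, sr, 16000)

# The first model output is assumed to be the per-patch class scores,
# shaped (num_patches, 521). Average over patches and report the top 5.
outputs = yamnet.predict(np.reshape(waveform, [1, -1]), steps=1)
mean_scores = np.mean(outputs[0], axis=0)
for i in np.argsort(mean_scores)[::-1][:5]:
  print(f'{class_names[i]}: {mean_scores[i]:.3f}')
```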

See the Jupyter notebook `yamnet_visualization.ipynb` for an example of
displaying the per-frame model output scores.

## About the Model

The YAMNet code layout is as follows:

* `yamnet.py`: Model definition in Keras.
* `params.py`: Hyperparameters. You can usefully modify `PATCH_HOP_SECONDS`
  (see the sketch after this list).
* `features.py`: Audio feature extraction helpers.
* `inference.py`: Example code to classify input wav files.
* `yamnet_test.py`: Simple test of the YAMNet installation.
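
For instance, to get more closely spaced score frames (at the cost of more
computation), you might shrink the patch hop before building the model. This is
a minimal sketch only: it assumes `PATCH_HOP_SECONDS` is a module-level value in
`params.py` that is read at model-construction time, so check `params.py` and
`features.py` for how the value is actually consumed.

```python
# Sketch: denser output frames by reducing the patch hop.
# Assumes PATCH_HOP_SECONDS is read from params when the model graph is
# built -- verify in params.py and features.py.
import params
import yamnet as yamnet_model

params.PATCH_HOP_SECONDS = 0.1  # hypothetical value; smaller hop = more overlap

model = yamnet_model.yamnet_frames_model(params)
model.load_weights('yamnet.h5')
```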

### Input: Audio Features

See `features.py`.

As with our previous release,
[VGGish](https://github.com/tensorflow/models/tree/master/research/audioset/vggish),
YAMNet was trained with audio features computed as follows:

* All audio is resampled to 16 kHz mono.
* A spectrogram is computed using magnitudes of the Short-Time Fourier Transform
  with a window size of 25 ms, a window hop of 10 ms, and a periodic Hann
  window.
* A mel spectrogram is computed by mapping the spectrogram to 64 mel bins
  covering the range 125-7500 Hz.
* A stabilized log mel spectrogram is computed by applying
  log(mel-spectrum + 0.001), where the offset is used to avoid taking a
  logarithm of zero.
* These features are then framed into 50%-overlapping examples of 0.96 seconds,
  where each example covers 64 mel bands and 96 frames of 10 ms each.
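
To make these numbers concrete, here is a rough, numpy-only sketch of the same
pipeline. It is not the code the model uses (that lives in `features.py`), and
details such as the FFT length and the mel filterbank construction are
simplified stand-ins, so its output will not match the model's features exactly.

```python
# Rough numpy sketch of the feature pipeline described above.
# Illustration only; the model's own features come from features.py and
# will differ in details (FFT length, filterbank shape, etc.).
import numpy as np

def hz_to_mel(hz):
  return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
  return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(num_spectrogram_bins, sample_rate, num_mel=64,
                   fmin=125.0, fmax=7500.0):
  # Triangular filters with centers evenly spaced on the mel scale.
  mel_edges = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), num_mel + 2)
  hz_edges = mel_to_hz(mel_edges)
  fft_freqs = np.linspace(0.0, sample_rate / 2.0, num_spectrogram_bins)
  fb = np.zeros((num_spectrogram_bins, num_mel))
  for i in range(num_mel):
    lo, center, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
    rising = (fft_freqs - lo) / (center - lo)
    falling = (hi - fft_freqs) / (hi - center)
    fb[:, i] = np.maximum(0.0, np.minimum(rising, falling))
  return fb

def log_mel_patches(waveform, sample_rate=16000):
  """waveform: mono float array, already resampled to 16 kHz."""
  win = int(round(sample_rate * 0.025))   # 25 ms window = 400 samples
  hop = int(round(sample_rate * 0.010))   # 10 ms hop = 160 samples
  fft_len = 512                           # next power of two >= window size
  # Periodic Hann window.
  window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(win) / win)
  num_frames = 1 + (len(waveform) - win) // hop
  frames = np.stack([waveform[i * hop:i * hop + win] for i in range(num_frames)])
  spectrogram = np.abs(np.fft.rfft(frames * window, n=fft_len))        # (frames, 257)
  mel = spectrogram @ mel_filterbank(fft_len // 2 + 1, sample_rate)    # (frames, 64)
  log_mel = np.log(mel + 0.001)           # stabilized log
  # 0.96 s patches = 96 frames, 50% overlap = 48-frame hop.
  num_patches = 1 + (log_mel.shape[0] - 96) // 48
  return np.stack([log_mel[i * 48:i * 48 + 96] for i in range(num_patches)])

# One second of noise yields a single (96, 64) patch.
print(log_mel_patches(np.random.randn(16000)).shape)  # (1, 96, 64)
```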

These 96x64 patches are then fed into the Mobilenet_v1 model to yield a 3x2
array of activations for 1024 kernels at the top of the convolution. These are
averaged to give a 1024-dimension embedding, then put through a single logistic
layer to get the 521 per-class output scores corresponding to the 960 ms input
waveform segment. (Because of the STFT window framing, you need at least 975 ms
of input waveform to get the first frame of output scores: 95 hops of 10 ms
plus one 25 ms window.)

### Class vocabulary

The file `yamnet_class_map.csv` describes the audio event classes associated
with each of the 521 outputs of the network. Its format is:

```text
index,mid,display_name
```

where `index` is the model output index (0..520), `mid` is the machine
identifier for that class (e.g. `/m/09x0r`), and `display_name` is a
human-readable description of the class (e.g. `Speech`).
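
For example, to map an output index to its display name, you can read the CSV
directly. The snippet below assumes only the header and format shown above; the
example code in this directory reads the same file for you.

```python
# Sketch: load the class map so that output index i maps to a display name.
# Assumes only the index,mid,display_name header described above.
import csv

with open('yamnet_class_map.csv', newline='') as f:
  class_names = [row['display_name'] for row in csv.DictReader(f)]

print(class_names[0])  # e.g. 'Speech'
```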

The original AudioSet data release had 527 classes. This model drops six of
them on the recommendation of our Fairness reviewers to avoid potentially
offensive mislabelings. We dropped the gendered versions (Male/Female) of
Speech and Singing. We also dropped Battle cry and Funny music.

### Performance

On the 20,366-segment AudioSet eval set, over the 521 included classes, the
balanced average d-prime is 2.318, balanced mAP is 0.306, and the balanced
average lwlrap is 0.393.

According to our calculations, the classifier has 3.7M weights and performs
69.2M multiplies for each 960 ms input frame.

### Contact information

This model repository is maintained by [Manoj Plakal](https://github.com/plakal)
and [Dan Ellis](https://github.com/dpwe).