BirdNET Audio Prediction Script

This script loads a WAV file and runs the BirdNET ONNX model on it to predict bird species. It supports single-window analysis (first 3 seconds only) and moving-window analysis (the entire file), and maps class indices to species names.

Features

Species Name Mapping: Uses BirdNET_GLOBAL_6K_V2.4_Labels.txt to display actual bird species names instead of class indices
Moving Window Analysis: Analyzes entire audio files using overlapping 3-second windows
Single Window Mode: Quick analysis of just the first 3 seconds
Configurable Parameters: Adjustable confidence thresholds, overlap ratios, and result counts
Detection Summary: Per-species overview of detections with maximum confidence, detection count, time range, and per-window timestamps

Requirements

Python 3.7+
Audio input: the model expects exactly 3 seconds at a 48kHz sample rate (144,000 samples); the script resamples and pads or truncates as needed
BirdNET labels file: BirdNET_GLOBAL_6K_V2.4_Labels.txt

Installation

Install the required dependencies:

pip install -r requirements.txt

Required packages:

numpy>=1.21.0
librosa>=0.9.0
onnxruntime>=1.12.0

Usage

Moving Window Analysis (Full File)

Analyze the entire audio file with overlapping windows:

python predict_audio.py audio.wav

Single Window Analysis (First 3 seconds only)

Quick analysis of just the beginning:

python predict_audio.py audio.wav --single-window

Advanced Usage Examples

# High sensitivity analysis with more results
python predict_audio.py audio.wav --confidence 0.05 --top-k 15

# Fine-grained analysis with 75% window overlap
python predict_audio.py audio.wav --overlap 0.75 --confidence 0.3

# Custom model and labels files
python predict_audio.py audio.wav --model custom_model.onnx --labels custom_labels.txt

Command Line Arguments

audio_file: Path to the WAV audio file (required)
--model: Path to the ONNX model file (default: model.onnx)
--labels: Path to the species labels file (default: BirdNET_GLOBAL_6K_V2.4_Labels.txt)
--top-k: Number of top predictions to show (default: 5)
--overlap: Window overlap ratio 0.0-1.0 (default: 0.5 = 50% overlap)
--confidence: Minimum confidence threshold for detections (default: 0.1)
--batch-size: Batch size for inference processing (default: 128)
--single-window: Analyze only first 3 seconds instead of full file
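
The options above map naturally onto an argparse parser. The following is a minimal sketch that mirrors the documented names and defaults; the function name build_parser and the help strings are illustrative and may not match what predict_audio.py actually contains.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options and defaults (illustrative)."""
    parser = argparse.ArgumentParser(
        description="Predict bird species from audio with a BirdNET ONNX model"
    )
    parser.add_argument("audio_file", help="Path to the WAV audio file")
    parser.add_argument("--model", default="model.onnx",
                        help="Path to the ONNX model file")
    parser.add_argument("--labels", default="BirdNET_GLOBAL_6K_V2.4_Labels.txt",
                        help="Path to the species labels file")
    parser.add_argument("--top-k", type=int, default=5,
                        help="Number of top predictions to show")
    parser.add_argument("--overlap", type=float, default=0.5,
                        help="Window overlap ratio")
    parser.add_argument("--confidence", type=float, default=0.1,
                        help="Minimum confidence threshold for detections")
    parser.add_argument("--batch-size", type=int, default=128,
                        help="Batch size for inference processing")
    parser.add_argument("--single-window", action="store_true",
                        help="Analyze only the first 3 seconds")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["audio.wav", "--overlap", "0.75"])
print(args.overlap, args.top_k, args.single_window)
```

Note that argparse converts dashes in option names to underscores, so --top-k is read back as args.top_k.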

Output Examples

Single Window Output

Loading labels from: BirdNET_GLOBAL_6K_V2.4_Labels.txt
Loaded 6522 species labels
Loading ONNX model: model.onnx
Loading first 3 seconds of audio file: bird_recording.wav
Audio loaded successfully. Shape: (144000,)
Running inference on single window...

Top 5 predictions for first 3 seconds:
 1. American Robin: 0.892456
 2. Song Sparrow: 0.234567
 3. House Finch: 0.123789
 4. Northern Cardinal: 0.089234
 5. Blue Jay: 0.056789

Moving Window Output

Loading labels from: BirdNET_GLOBAL_6K_V2.4_Labels.txt
Loaded 6522 species labels
Loading ONNX model: model.onnx
Loading full audio file: long_recording.wav
Audio loaded successfully. Duration: 45.32 seconds
Creating windows with 50% overlap...
Created 28 windows of 3 seconds each
Running inference on all windows...
Processing window 1/28 (t=0.0s)
Processing window 11/28 (t=15.0s)
Processing window 21/28 (t=30.0s)
Completed inference on 28 windows
Analyzing detections with confidence threshold 0.1...

=== DETECTION SUMMARY ===
Audio duration: 45.32 seconds
Windows analyzed: 28
Species detected (>0.10 confidence): 4

Top detections:

American Robin
  Max confidence: 0.892456
  Detections: 12
  Time range: 0.0s - 18.0s
      1.5s: 0.892456
      3.0s: 0.845231
      4.5s: 0.723456

Song Sparrow
  Max confidence: 0.567890
  Detections: 6
  Time range: 22.5s - 36.0s
     24.0s: 0.567890
     25.5s: 0.445678
     27.0s: 0.334567

House Finch
  Max confidence: 0.345678
  Detections: 3
  Time range: 39.0s - 42.0s
     39.0s: 0.345678

Technical Details

Model Input/Output

Input: Audio array of shape [batch_size, 144000] (3 seconds at 48kHz)
Output: Classification scores of shape [batch_size, 6522], one score per species label
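
Turning one window's score vector into the ranked list controlled by --top-k is a matter of picking the k largest scores and looking up their labels. A minimal sketch, assuming scores and labels are index-aligned sequences (the helper name top_k is hypothetical):

```python
import heapq


def top_k(scores, labels, k=5):
    """Return the k highest-scoring (label, score) pairs, best first."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Rank class indices by their score without sorting the full vector.
    best = heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])
    return [(labels[i], float(scores[i])) for i in best]


labels = ["American Robin", "Song Sparrow", "House Finch"]
scores = [0.12, 0.89, 0.23]
for rank, (name, score) in enumerate(top_k(scores, labels, k=2), start=1):
    print(f"{rank:2d}. {name}: {score:.6f}")
```

The same function works on a plain list or a 1-D NumPy array, since it only relies on len() and integer indexing.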

Audio Preprocessing

The script automatically handles:

Loading audio files with librosa (supports WAV, MP3, FLAC, etc.)
Resampling to 48kHz if necessary
Padding with zeros or truncating to exactly 3 seconds (144,000 samples)
Converting to float32 format
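
The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the script's actual code: the helper names fit_to_window and load_first_window are hypothetical, and librosa is imported lazily so the padding logic can be tried without it installed.

```python
import numpy as np

SAMPLE_RATE = 48_000
WINDOW_SAMPLES = 3 * SAMPLE_RATE  # 144,000 samples


def fit_to_window(samples, length=WINDOW_SAMPLES):
    """Zero-pad or truncate a 1-D signal to exactly `length` float32 samples."""
    samples = np.asarray(samples, dtype=np.float32)
    if samples.shape[0] >= length:
        return samples[:length]
    # np.pad defaults to constant (zero) padding; here it pads only at the end.
    return np.pad(samples, (0, length - samples.shape[0]))


def load_first_window(path):
    """Load the first 3 seconds, resampled to 48kHz mono, as one model window."""
    import librosa  # lazy import: only needed when reading real audio files

    audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True, duration=3.0)
    return fit_to_window(audio)


# A half-second signal is padded with zeros up to the full window length.
padded = fit_to_window(np.ones(SAMPLE_RATE // 2))
print(padded.shape, padded.dtype)
```

Passing sr=48000 to librosa.load makes it resample on load, so no separate resampling step is needed.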

Moving Window Analysis

Creates overlapping 3-second windows from the full audio
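The step between windows is 3 × (1 − overlap) seconds, so the default 50% overlap places windows at 0s, 1.5s, 3s, 4.5s, etc.
Higher overlap (e.g., 75%, a 0.75s step) gives finer time resolution but produces about twice as many windows and takes correspondingly longer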
Each window is analyzed independently, then results are aggregated
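
The window layout follows directly from the overlap ratio. The sketch below computes window start times under one assumed policy: only windows that fit entirely inside the audio are kept, and audio shorter than one window yields a single (padded) window. How the script itself treats a trailing partial window is not specified here, so its window counts may differ.

```python
def window_starts(duration_s, overlap=0.5, window_s=3.0):
    """Start times (in seconds) of overlapping windows over the audio.

    Assumed policy: keep only windows that fit fully; short audio gets one window.
    """
    if not 0.0 <= overlap < 1.0:
        # An overlap of 1.0 would give a zero step and never advance.
        raise ValueError("overlap must be in [0.0, 1.0)")
    step = window_s * (1.0 - overlap)
    starts = []
    index = 0
    # Multiply rather than accumulate to avoid floating-point drift.
    while index * step + window_s <= duration_s:
        starts.append(index * step)
        index += 1
    return starts or [0.0]


print(window_starts(9.0))                # [0.0, 1.5, 3.0, 4.5, 6.0]
print(window_starts(4.5, overlap=0.75))  # [0.0, 0.75, 1.5]
```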

Batch Processing

Windows are processed in configurable batches (default: 128 windows per batch)
Improves throughput by running many windows through the model in one inference call rather than one call per window
Automatically handles memory management and progress reporting
Optimal batch size depends on available system memory and model complexity
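
Batched inference amounts to slicing the stacked windows into chunks and concatenating the per-chunk outputs. The sketch below takes the model as a plain callable so it runs without a model file; the name predict_in_batches and the callable interface are assumptions for illustration.

```python
import numpy as np


def predict_in_batches(windows, predict, batch_size=128):
    """Run `predict` over `windows` in chunks and stack the results.

    windows: non-empty array of shape [num_windows, num_samples]
    predict: callable mapping [n, num_samples] -> [n, num_classes]
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    outputs = []
    for start in range(0, len(windows), batch_size):
        batch = windows[start:start + batch_size]  # last batch may be smaller
        outputs.append(predict(batch))
    return np.concatenate(outputs, axis=0)


# With ONNX Runtime, `predict` could plausibly be built like this (assumption;
# the model's input name is read from the session rather than hard-coded):
#   session = onnxruntime.InferenceSession("model.onnx")
#   name = session.get_inputs()[0].name
#   predict = lambda batch: session.run(None, {name: batch})[0]

# Stand-in model for demonstration: one "score" per window.
demo = np.ones((300, 8), dtype=np.float32)
scores = predict_in_batches(demo, lambda b: b.mean(axis=1, keepdims=True))
print(scores.shape)  # (300, 1)
```

The demo uses 8-sample windows to stay small; real windows of 144,000 float32 samples take about 0.58 MB each, so a batch of 128 needs roughly 74 MB for the input alone.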

Species Labels

Uses the official BirdNET labels file with 6522 species
Format: one label per line, written as Scientific name_Common Name (e.g., Turdus migratorius_American Robin)
Script extracts and displays the common name (the part after the underscore)
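
Extracting the common name is a single split on the underscore. A minimal sketch, assuming each line holds one scientific name, one underscore, and one common name; the helper names common_name and load_labels are hypothetical.

```python
def common_name(label_line):
    """Return the part after the first underscore, or the whole line if none."""
    line = label_line.strip()
    _, sep, common = line.partition("_")
    return common if sep else line


def load_labels(path):
    """Read a labels file into a list of common names, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [common_name(line) for line in handle if line.strip()]


print(common_name("Turdus migratorius_American Robin"))  # American Robin
```

Because list order follows file order, the position of each name matches the class index in the model output.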

Performance Tips

Use --single-window for quick identification of prominent species
Increase --overlap (0.75-0.9) for detailed analysis of complex recordings
Lower --confidence below the 0.1 default (e.g., 0.05) to catch weaker signals
Higher --confidence (0.3-0.5) for only very confident detections
Use --top-k 1 to see only the most confident detection per analysis
Batch Processing: The default --batch-size 128 is a reasonable starting point
- Increase batch size (256, 512) if you have spare GPU memory or RAM
- Decrease batch size (32, 64) if you encounter memory issues
- Batching helps most on longer audio files, which produce many windows