
BirdNET Audio Prediction Script

This script loads a WAV file and uses the BirdNET ONNX model to predict bird species from audio recordings. It supports both single-window analysis (first 3 seconds) and moving window analysis (entire file) with species name mapping.

Features

  • Species Name Mapping: Uses BirdNET_GLOBAL_6K_V2.4_Labels.txt to display actual bird species names instead of class indices
  • Moving Window Analysis: Analyzes entire audio files using overlapping 3-second windows
  • Single Window Mode: Quick analysis of just the first 3 seconds
  • Configurable Parameters: Adjustable confidence thresholds, overlap ratios, and result counts
  • Detection Summary: Comprehensive overview of all detections with timestamps and confidence scores

Requirements

  • Python 3.7+
  • The model expects exactly 3 seconds of audio at a 48 kHz sample rate (144,000 samples)
  • BirdNET labels file: BirdNET_GLOBAL_6K_V2.4_Labels.txt

Installation

Install the required dependencies:

pip install -r requirements.txt

Required packages:

  • numpy>=1.21.0
  • librosa>=0.9.0
  • onnxruntime>=1.12.0

Usage

Moving Window Analysis (Full File)

Analyze the entire audio file with overlapping windows:

python predict_audio.py audio.wav

Single Window Analysis (First 3 seconds only)

Quick analysis of just the beginning:

python predict_audio.py audio.wav --single-window

Advanced Usage Examples

# High sensitivity analysis with more results
python predict_audio.py audio.wav --confidence 0.1 --top-k 15

# Fine-grained analysis with 75% window overlap
python predict_audio.py audio.wav --overlap 0.75 --confidence 0.3

# Custom model and labels files
python predict_audio.py audio.wav --model custom_model.onnx --labels custom_labels.txt

Command Line Arguments

  • audio_file: Path to the WAV audio file (required)
  • --model: Path to the ONNX model file (default: model.onnx)
  • --labels: Path to the species labels file (default: BirdNET_GLOBAL_6K_V2.4_Labels.txt)
  • --top-k: Number of top predictions to show (default: 5)
  • --overlap: Window overlap ratio 0.0-1.0 (default: 0.5 = 50% overlap)
  • --confidence: Minimum confidence threshold for detections (default: 0.1)
  • --batch-size: Batch size for inference processing (default: 128)
  • --single-window: Analyze only first 3 seconds instead of full file
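
For reference, the arguments and defaults listed above could be declared with argparse roughly as follows. This is a minimal sketch, not the actual contents of predict_audio.py; the function name build_parser and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        description="Predict bird species with a BirdNET ONNX model"
    )
    parser.add_argument("audio_file", help="Path to the WAV audio file")
    parser.add_argument("--model", default="model.onnx",
                        help="Path to the ONNX model file")
    parser.add_argument("--labels", default="BirdNET_GLOBAL_6K_V2.4_Labels.txt",
                        help="Path to the species labels file")
    parser.add_argument("--top-k", type=int, default=5,
                        help="Number of top predictions to show")
    parser.add_argument("--overlap", type=float, default=0.5,
                        help="Window overlap ratio")
    parser.add_argument("--confidence", type=float, default=0.1,
                        help="Minimum confidence threshold for detections")
    parser.add_argument("--batch-size", type=int, default=128,
                        help="Batch size for inference")
    parser.add_argument("--single-window", action="store_true",
                        help="Analyze only the first 3 seconds")
    return parser


# Example: parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(["audio.wav", "--overlap", "0.75"])
print(args.audio_file, args.overlap, args.top_k, args.single_window)
```

argparse converts dashes in option names to underscores, so --top-k and --batch-size are read back as args.top_k and args.batch_size.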

Output Examples

Single Window Output

Loading labels from: BirdNET_GLOBAL_6K_V2.4_Labels.txt
Loaded 6522 species labels
Loading ONNX model: model.onnx
Loading first 3 seconds of audio file: bird_recording.wav
Audio loaded successfully. Shape: (144000,)
Running inference on single window...

Top 5 predictions for first 3 seconds:
 1. American Robin: 0.892456
 2. Song Sparrow: 0.234567
 3. House Finch: 0.123789
 4. Northern Cardinal: 0.089234
 5. Blue Jay: 0.056789

Moving Window Output

Loading labels from: BirdNET_GLOBAL_6K_V2.4_Labels.txt
Loaded 6522 species labels
Loading ONNX model: model.onnx
Loading full audio file: long_recording.wav
Audio loaded successfully. Duration: 45.32 seconds
Creating windows with 50% overlap...
Created 28 windows of 3 seconds each
Running inference on all windows...
Processing window 1/28 (t=0.0s)
Processing window 11/28 (t=15.0s)
Processing window 21/28 (t=30.0s)
Completed inference on 28 windows
Analyzing detections with confidence threshold 0.1...

=== DETECTION SUMMARY ===
Audio duration: 45.32 seconds
Windows analyzed: 28
Species detected (>0.10 confidence): 4

Top detections:

American Robin
  Max confidence: 0.892456
  Detections: 12
  Time range: 0.0s - 18.0s
      1.5s: 0.892456
      3.0s: 0.845231
      4.5s: 0.723456

Song Sparrow
  Max confidence: 0.567890
  Detections: 6
  Time range: 22.5s - 36.0s
     24.0s: 0.567890
     25.5s: 0.445678
     27.0s: 0.334567

House Finch
  Max confidence: 0.345678
  Detections: 3
  Time range: 39.0s - 42.0s
     39.0s: 0.345678
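
A per-species summary like the one above (maximum confidence, detection count, time range) could be computed from the per-window scores along these lines. This is a sketch under assumptions: the function name summarize_detections and the returned dictionary layout are hypothetical, and a score equal to the threshold is counted as a detection, which the script may or may not do.

```python
from collections import defaultdict


def summarize_detections(window_scores, start_times, labels, threshold=0.1):
    """Group per-window scores by species.

    window_scores: one list of scores per window (one score per label).
    start_times:   start time in seconds of each window.
    Returns {species: {"max_confidence", "count", "first", "last"}}.
    """
    hits = defaultdict(list)  # species -> [(time, score), ...]
    for t, scores in zip(start_times, window_scores):
        for label, score in zip(labels, scores):
            if score >= threshold:  # assumption: threshold is inclusive
                hits[label].append((t, score))

    summary = {}
    for label, events in hits.items():
        times = [t for t, _ in events]
        summary[label] = {
            "max_confidence": max(s for _, s in events),
            "count": len(events),
            "first": min(times),
            "last": max(times),
        }
    return summary


# Example with three windows and two species
demo = summarize_detections(
    window_scores=[[0.9, 0.05], [0.8, 0.2], [0.05, 0.6]],
    start_times=[0.0, 1.5, 3.0],
    labels=["American Robin", "Song Sparrow"],
)
print(demo["American Robin"])
```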

Technical Details

Model Input/Output

  • Input: Audio array of shape [batch_size, 144000] (3 seconds at 48kHz)
  • Output: Classification scores for 6522 bird species
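
Turning one window's scores into a ranked list, as --top-k does, can be sketched with NumPy. The helper name top_k is hypothetical, and the sketch assumes the scores for a single window arrive as a 1-D array with one entry per label.

```python
import numpy as np


def top_k(scores: np.ndarray, labels: list, k: int = 5):
    """Return the k highest-scoring (label, score) pairs for one window."""
    k = min(k, scores.shape[0])
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [(labels[i], float(scores[i])) for i in order]


# Example with three classes
example_scores = np.array([0.1, 0.9, 0.3], dtype=np.float32)
for name, score in top_k(example_scores, ["a", "b", "c"], k=2):
    print(f"{name}: {score:.6f}")
```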

Audio Preprocessing

The script automatically handles:

  • Loading audio files with librosa (supports WAV, MP3, FLAC, etc.)
  • Resampling to 48kHz if necessary
  • Padding with zeros or truncating to exactly 3 seconds (144,000 samples)
  • Converting to float32 format
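
The padding and truncation step can be sketched as follows. This is an illustration, not the script's own code: fit_to_window is a hypothetical helper, and the loading step is shown only as a comment (librosa.load with sr=48000 resamples while loading).

```python
import numpy as np

SAMPLE_RATE = 48_000
WINDOW_SAMPLES = 3 * SAMPLE_RATE  # 144,000 samples

# Loading and resampling would happen first, for example:
#   audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)


def fit_to_window(audio, length: int = WINDOW_SAMPLES) -> np.ndarray:
    """Zero-pad or truncate a mono signal to exactly `length` float32 samples."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.shape[0] < length:
        # np.pad defaults to constant padding with zeros
        audio = np.pad(audio, (0, length - audio.shape[0]))
    return audio[:length]


# Example: a 1-second signal is padded up to 3 seconds
padded = fit_to_window(np.ones(SAMPLE_RATE))
print(padded.shape, padded.dtype)
```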

Moving Window Analysis

  • Creates overlapping 3-second windows from the full audio
  • The default 50% overlap means windows start at 0s, 1.5s, 3s, 4.5s, etc.
  • Higher overlap (e.g., 75%) provides more fine-grained analysis but takes longer
  • Each window is analyzed independently, then results are aggregated
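
The relation between overlap and window start times is step = window length × (1 − overlap). A minimal sketch, assuming only full windows are kept and that a file shorter than one window still yields a single (padded) window; the script's handling of the final partial window may differ, and window_starts is a hypothetical name.

```python
def window_starts(duration_s: float, window_s: float = 3.0, overlap: float = 0.5):
    """Return the start time in seconds of each analysis window."""
    if not 0.0 <= overlap < 1.0:
        # overlap == 1.0 would give a step of zero and never advance
        raise ValueError("overlap must be in [0.0, 1.0)")
    step = window_s * (1.0 - overlap)
    starts = [0.0]  # always analyze at least one window
    i = 1
    # Multiply by the index instead of accumulating to limit float drift
    while i * step + window_s <= duration_s:
        starts.append(i * step)
        i += 1
    return starts


# Example: a 9-second file at the default 50% overlap
print(window_starts(9.0))
```

At 75% overlap the step drops to 0.75 s, so the same 9-second file yields 9 windows instead of 5, which is why higher overlap takes longer.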

Batch Processing

  • Windows are processed in configurable batches (default: 128 windows per batch)
  • Running many windows in one inference call is typically much faster than running them one at a time
  • Automatically handles memory management and progress reporting
  • Optimal batch size depends on available system memory and model complexity
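
Batched inference can be sketched as slicing the stacked windows and concatenating the per-batch outputs. The names iter_batches, predict_in_batches, and run_batch are hypothetical; run_batch stands in for the model call so the sketch runs without a model file.

```python
import numpy as np


def iter_batches(windows: np.ndarray, batch_size: int = 128):
    """Yield consecutive slices of at most `batch_size` windows."""
    for start in range(0, len(windows), batch_size):
        yield windows[start:start + batch_size]


def predict_in_batches(windows: np.ndarray, run_batch, batch_size: int = 128):
    """Apply `run_batch` to each batch and stack the results in window order."""
    outputs = [run_batch(batch) for batch in iter_batches(windows, batch_size)]
    return np.concatenate(outputs, axis=0)


# With onnxruntime, run_batch might wrap a call such as
#   session.run(None, {input_name: batch})[0]
# where input_name = session.get_inputs()[0].name (assumes a single input).

# Example with a stand-in "model" that sums each window
fake_windows = np.zeros((300, 8), dtype=np.float32)
result = predict_in_batches(fake_windows, lambda b: b.sum(axis=1, keepdims=True))
print(result.shape)
```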

Species Labels

  • Uses the official BirdNET labels file with 6522 species
  • Format: one "Scientific name_Common Name" entry per line (the two names are separated by a single underscore)
  • Script extracts and displays the common names (part after underscore)
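
Extracting the common name from a label line can be sketched as a split on the first underscore. The helper names common_name and load_labels are hypothetical, and falling back to the whole line when no underscore is present is an assumption.

```python
def common_name(label_line: str) -> str:
    """Return the part after the first underscore, or the whole line if none."""
    line = label_line.strip()
    _, sep, common = line.partition("_")
    return common if sep else line


def load_labels(path: str) -> list:
    """Read one label per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [common_name(line) for line in f if line.strip()]


# Example
print(common_name("Turdus migratorius_American Robin"))
```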

Performance Tips

  • Use --single-window for quick identification of prominent species
  • Increase --overlap (0.75-0.9) for detailed analysis of complex recordings
  • Lower --confidence (0.05-0.1) to catch weaker signals
  • Higher --confidence (0.3-0.5) for only very confident detections
  • Use --top-k 1 to see only the most confident detection per analysis
  • Batch Processing: The default --batch-size 128 is a reasonable starting point; the best value depends on available memory
    • Increase the batch size (256, 512) if you have spare GPU memory or RAM
    • Decrease the batch size (32, 64) if you encounter memory issues
    • Batch processing helps most on longer audio files, which produce many windows