# RT-DisDINOv3-ConvNext: A Distilled RT-DETR-L Model
This model is an RT-DETR-L whose backbone and encoder have been pre-trained via knowledge distillation from a DINOv3 ConvNeXt-Base teacher. The distillation matches the student's feature maps to the teacher's on images from the TACO (Trash Annotations in Context) dataset.
This pre-trained checkpoint contains the "distilled knowledge" and is intended to be used as a starting point for fine-tuning on downstream object detection tasks, potentially leading to better performance compared to standard pre-trained weights.
This work is part of the RT-DisDINOv3 project. For full details on the training pipeline, baseline comparisons, and analysis, please visit the main GitHub repository.
## How to Use
You can load these distilled weights and apply them to the original RT-DETR-L model's backbone and encoder before fine-tuning.
```python
import torch
from torch.hub import load_state_dict_from_url

# 1. Load the original RT-DETR-L model architecture.
#    Requires the 'lyuwenyu/RT-DETR' repository to be reachable by torch.hub
#    (or cloned locally).
rtdetr_l = torch.hub.load('lyuwenyu/RT-DETR', 'rtdetrv2_l', pretrained=True)
model = rtdetr_l.model

# 2. Load the distilled weights from this Hugging Face Hub repository.
MODEL_URL = "https://huggingface.co/hnamt/RT-DisDINOv3-ConvNext-Base/resolve/main/distilled_rtdetr_convnext_teacher_BEST.pth"
distilled_state_dict = load_state_dict_from_url(MODEL_URL, map_location='cpu')['model']

# 3. Load the weights into the model's backbone and encoder.
#    With strict=False, only the matching keys (backbone + encoder) are loaded;
#    non-matching keys (e.g., the detection head) keep their original weights.
missing, unexpected = model.load_state_dict(distilled_state_dict, strict=False)
print(f"Successfully loaded distilled knowledge from the ConvNeXt teacher "
      f"({len(missing)} missing keys, {len(unexpected)} unexpected keys).")

# The model is now ready for fine-tuning on your own dataset, for example:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# model.train()
# ... your fine-tuning loop ...
```
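If you want to protect the distilled representations during fine-tuning, one common option is to give the backbone and encoder a smaller learning rate than the freshly initialized detection head. The sketch below is a minimal example of that setup, not part of the project's released code: it assumes the loaded `model` exposes its distilled modules under parameter names beginning with `backbone` or `encoder` (verify against your checkpoint's keys), and the learning rates are placeholders.

```python
# Minimal sketch: lower LR for the distilled backbone/encoder, higher LR for the rest.
# The 'backbone' / 'encoder' name prefixes and the LR values are assumptions.
distilled_params, head_params = [], []
for name, param in model.named_parameters():
    if name.startswith(('backbone', 'encoder')):
        distilled_params.append(param)
    else:
        head_params.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": distilled_params, "lr": 1e-5},  # move distilled features slowly
        {"params": head_params, "lr": 1e-4},       # train the detection head faster
    ],
    weight_decay=1e-4,
)
```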
## Training Details
- Student Model: RT-DETR-L (`rtdetrv2_l` from `lyuwenyu/RT-DETR`).
- Teacher Model: DINOv3 ConvNeXt-Base (facebook/dinov3-convnext-base-pretrain-lvd1689m).
- Dataset for Distillation: TACO dataset images.
- Distillation Procedure: The student model's backbone and encoder were trained to minimize the Mean Squared Error (MSE) between their output feature maps and those of the teacher model.
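For illustration, below is a minimal sketch of this kind of feature-map distillation objective. It is not the project's exact training code: the 1x1 projection used to match channel dimensions, the bilinear resizing to align spatial sizes, and the placeholder channel counts are all assumptions about how the student and teacher features are brought into a common shape before the MSE is computed.

```python
import torch
import torch.nn.functional as F
from torch import nn

def feature_distillation_loss(student_feat: torch.Tensor,
                              teacher_feat: torch.Tensor,
                              proj: nn.Conv2d) -> torch.Tensor:
    """MSE between student and teacher feature maps.

    student_feat: (B, C_s, H_s, W_s) from the RT-DETR backbone/encoder
    teacher_feat: (B, C_t, H_t, W_t) from the frozen DINOv3 ConvNeXt-Base teacher
    proj:         1x1 conv mapping C_s -> C_t (assumed; only needed if channels differ)
    """
    student_proj = proj(student_feat)
    # Align spatial resolution to the teacher's grid (assumption).
    if student_proj.shape[-2:] != teacher_feat.shape[-2:]:
        student_proj = F.interpolate(student_proj, size=teacher_feat.shape[-2:],
                                     mode='bilinear', align_corners=False)
    return F.mse_loss(student_proj, teacher_feat.detach())

# Example usage with random tensors standing in for real feature maps
# (C_s=256 and C_t=1024 are placeholder channel counts):
proj = nn.Conv2d(256, 1024, kernel_size=1)
s = torch.randn(2, 256, 20, 20)
t = torch.randn(2, 1024, 14, 14)
loss = feature_distillation_loss(s, t, proj)
loss.backward()
```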
## Evaluation Results
After the distillation pre-training, the model was fine-tuned on the TACO dataset. The distilled model improves both mAP metrics over the baseline at comparable inference latency.
| Model | mAP@50-95 | mAP@50 | Latency (ms) | Notes |
|---|---|---|---|---|
| RT-DETR-L (baseline) | 2.80% | 4.60% | 50.05 | Fine-tuned from COCO pre-trained weights. |
| RT-DisDINOv3 (w/ ConvNeXt) | 3.60% | 5.30% | 49.80 | +28.6% relative mAP@50-95 over the baseline. |
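The +28.6% figure is the relative gain in mAP@50-95: (3.60 - 2.80) / 2.80 ≈ 0.286, i.e. roughly a 28.6% relative improvement (an absolute gain of 0.80 points).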
## License
The weights in this repository are released under the Apache 2.0 License. Please be aware that the models used for training (RT-DETR, DINOv3) have their own licenses.