# RT-DisDINOv3-ConvNext: A Distilled RT-DETR-L Model
This model is an RT-DETR-L whose backbone and encoder have been pre-trained via knowledge distillation from a DINOv3 ConvNeXt-Base teacher. The distillation matches the student's feature maps to the teacher's on images from the TACO (Trash Annotations in Context) dataset.
This pre-trained checkpoint contains the "distilled knowledge" and is intended to be used as a starting point for fine-tuning on downstream object detection tasks, potentially leading to better performance compared to standard pre-trained weights.
This work is part of the RT-DisDINOv3 project. For full details on the training pipeline, baseline comparisons, and analysis, please visit the main GitHub repository.
## How to Use
You can load these distilled weights and apply them to the original RT-DETR-L model's backbone and encoder before fine-tuning.
```python
import torch
from torch.hub import load_state_dict_from_url

# 1. Load the original RT-DETR-L model architecture.
#    Requires the 'lyuwenyu/RT-DETR' repository to be reachable by torch.hub
#    (or cloned locally).
rtdetr_l = torch.hub.load('lyuwenyu/RT-DETR', 'rtdetrv2_l', pretrained=True)
model = rtdetr_l.model

# 2. Load the distilled weights from this Hugging Face Hub repository.
MODEL_URL = "https://huggingface.co/hnamt/RT-DisDINOv3-ConvNext-Base/resolve/main/distilled_rtdetr_convnext_teacher_BEST.pth"
distilled_state_dict = load_state_dict_from_url(MODEL_URL, map_location='cpu')['model']

# 3. Load the weights into the model's backbone and encoder.
#    With strict=False, only the matching keys (backbone + encoder) are loaded;
#    non-matching keys (e.g., the detection head) keep their original weights.
missing, unexpected = model.load_state_dict(distilled_state_dict, strict=False)
print(f"Successfully loaded distilled knowledge from the ConvNeXt teacher "
      f"({len(missing)} missing keys, {len(unexpected)} unexpected keys).")

# The model is now ready for fine-tuning on your own dataset, for example:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# model.train()
# ... your fine-tuning loop ...
```
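If you want to protect the distilled representations during fine-tuning, one common option is to give the backbone and encoder a smaller learning rate than the freshly initialized detection head. The sketch below is a minimal example of that setup, not part of the project's released code: it assumes the loaded `model` exposes its distilled modules under parameter names beginning with `backbone` or `encoder` (verify against your checkpoint's keys), and the learning rates are placeholders.

```python
# Minimal sketch: lower LR for the distilled backbone/encoder, higher LR for the rest.
# The 'backbone' / 'encoder' name prefixes and the LR values are assumptions.
distilled_params, head_params = [], []
for name, param in model.named_parameters():
    if name.startswith(('backbone', 'encoder')):
        distilled_params.append(param)
    else:
        head_params.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": distilled_params, "lr": 1e-5},  # move distilled features slowly
        {"params": head_params, "lr": 1e-4},       # train the detection head faster
    ],
    weight_decay=1e-4,
)
```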
## Training Details
- Student Model: RT-DETR-L (`rtdetrv2_l` from `lyuwenyu/RT-DETR`).
- Teacher Model: DINOv3 ConvNeXt-Base (facebook/dinov3-convnext-base-pretrain-lvd1689m).
- Dataset for Distillation: TACO dataset images.
- Distillation Procedure: The student model's backbone and encoder were trained to minimize the Mean Squared Error (MSE) between their output feature maps and those of the teacher model.
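For illustration, below is a minimal sketch of this kind of feature-map distillation objective. It is not the project's exact training code: the 1x1 projection used to match channel dimensions, the bilinear resizing to align spatial sizes, and the placeholder channel counts are all assumptions about how the student and teacher features are brought into a common shape before the MSE is computed.

```python
import torch
import torch.nn.functional as F
from torch import nn

def feature_distillation_loss(student_feat: torch.Tensor,
                              teacher_feat: torch.Tensor,
                              proj: nn.Conv2d) -> torch.Tensor:
    """MSE between student and teacher feature maps.

    student_feat: (B, C_s, H_s, W_s) from the RT-DETR backbone/encoder
    teacher_feat: (B, C_t, H_t, W_t) from the frozen DINOv3 ConvNeXt-Base teacher
    proj:         1x1 conv mapping C_s -> C_t (assumed; only needed if channels differ)
    """
    student_proj = proj(student_feat)
    # Align spatial resolution to the teacher's grid (assumption).
    if student_proj.shape[-2:] != teacher_feat.shape[-2:]:
        student_proj = F.interpolate(student_proj, size=teacher_feat.shape[-2:],
                                     mode='bilinear', align_corners=False)
    return F.mse_loss(student_proj, teacher_feat.detach())

# Example usage with random tensors standing in for real feature maps
# (C_s=256 and C_t=1024 are placeholder channel counts):
proj = nn.Conv2d(256, 1024, kernel_size=1)
s = torch.randn(2, 256, 20, 20)
t = torch.randn(2, 1024, 14, 14)
loss = feature_distillation_loss(s, t, proj)
loss.backward()
```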
## Evaluation Results
After the distillation pre-training, the model was fine-tuned on the TACO dataset. The distilled model improves both mAP metrics over the baseline at comparable inference latency.
| Model | mAP@50-95 | mAP@50 | Latency (ms) | Notes |
|---|---|---|---|---|
| RT-DETR-L (baseline) | 2.80% | 4.60% | 50.05 | Fine-tuned from COCO pre-trained weights. |
| RT-DisDINOv3 (w/ ConvNeXt) | 3.60% | 5.30% | 49.80 | +28.6% relative mAP@50-95 over the baseline. |
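The +28.6% figure is the relative gain in mAP@50-95: (3.60 - 2.80) / 2.80 ≈ 0.286, i.e. roughly a 28.6% relative improvement (an absolute gain of 0.80 points).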
## License
The weights in this repository are released under the Apache 2.0 License. Please be aware that the models used for training (RT-DETR, DINOv3) have their own licenses.