CodeModernBERT-Owl-v1 🦉

Model Details

  • Model type: Bi-encoder architecture based on ModernBERT
  • Architecture:
    • Hidden size: 768
    • Layers: 22
    • Attention heads: 12
    • Intermediate size: 1,152
    • Max position embeddings: 8,192
    • Local attention window size: 128
    • RoPE positional encoding: ΞΈ = 160,000
    • Local RoPE positional encoding: ΞΈ = 10,000
  • Sequence length: up to 2,048 tokens for code and docstring inputs during pretraining
  • Implementation: Python backend; integrated into OwlSpotLight, a Visual Studio Code extension.
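
A minimal usage sketch, assuming the checkpoint loads through transformers' AutoModel/AutoTokenizer as a plain encoder and that query–code similarity is computed over mean-pooled embeddings (the pooling choice is an assumption; the card does not state how OwlSpotLight pools hidden states):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "Shuu12121/CodeModernBERT-Owl-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

def embed(texts):
    # Truncate at the 2,048-token limit used during pretraining.
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=2048, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Mean-pool over non-padding tokens (assumed pooling strategy).
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (out.last_hidden_state * mask).sum(1) / mask.sum(1)

query = embed(["read a JSON file into a dict"])
code = embed(["def load_json(path):\n    import json\n"
              "    with open(path) as f:\n        return json.load(f)"])
print(F.cosine_similarity(query, code).item())
```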

Pretraining

  • Tokenizer: Custom BPE tokenizer trained on code and docstring pairs.

  • Data: Functions and natural language descriptions extracted from GitHub repositories.

  • Masking strategy: Two-phase pretraining.

    • Phase 1: Random Masked Language Modeling (MLM)
      30% of tokens in code functions are randomly masked and predicted with standard MLM (see the first sketch after this list).
    • Phase 2: Line-level Span Masking
      Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity (see the second sketch after this list):
      1. Convert input tokens back to strings.
      2. Detect newline tokens with regex and segment inputs by line.
      3. Exclude whitespace-only tokens from masking.
      4. Apply padding to align sequence lengths.
      5. Randomly mask 30% of tokens in each line segment and predict them.
  • Pretraining hyperparameters (see the TrainingArguments sketch at the end of this section):

    • Batch size: 20
    • Gradient accumulation steps: 6
    • Effective batch size: 120
    • Optimizer: AdamW
    • Learning rate: 5e-5
    • Scheduler: Cosine
    • Epochs: 2
    • Precision: Mixed precision (fp16) using transformers
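
Phase 1's random masking corresponds directly to the stock MLM collator in transformers with the masking probability raised to 0.3; a minimal sketch (the sample input is illustrative, not the authors' pipeline):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Owl-v1")

# Select 30% of tokens for prediction, as in Phase 1. The collator's
# standard recipe replaces most selected tokens with the mask token.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.3
)

enc = tokenizer("def add(a, b):\n    return a + b")
batch = collator([{"input_ids": enc["input_ids"]}])
print(batch["input_ids"])  # masked inputs
print(batch["labels"])     # original ids at selected positions, -100 elsewhere
```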
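Phase 2's five steps could be reconstructed along the following lines; `line_span_mask` and every detail in it are illustrative assumptions, not the released training code:

```python
import random
import re
import torch

def line_span_mask(input_ids, tokenizer, mask_rate=0.3):
    # Hypothetical reconstruction of the Phase 2 steps described above.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)  # step 1
    labels = [-100] * len(input_ids)                     # -100 = ignored by the loss
    masked = list(input_ids)

    # Step 2: segment token positions into lines at newline tokens.
    lines, line = [], []
    for i, tok in enumerate(tokens):
        line.append(i)
        if re.search(r"\n", tokenizer.convert_tokens_to_string([tok])):
            lines.append(line)
            line = []
    if line:
        lines.append(line)

    for seg in lines:
        # Step 3: exclude whitespace-only tokens from masking.
        cands = [i for i in seg
                 if tokenizer.convert_tokens_to_string([tokens[i]]).strip()]
        if not cands:
            continue
        # Step 5: mask 30% of the maskable tokens in this line segment.
        k = max(1, round(len(cands) * mask_rate))
        for i in random.sample(cands, k):
            labels[i] = masked[i]
            masked[i] = tokenizer.mask_token_id

    # Step 4 (padding to align sequence lengths) is left to the batch collator.
    return torch.tensor(masked), torch.tensor(labels)
```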
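The hyperparameter list maps onto transformers' TrainingArguments roughly as follows; output_dir and anything not stated above are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="owl-pretrain",         # hypothetical path
    per_device_train_batch_size=20,
    gradient_accumulation_steps=6,     # effective batch: 20 x 6 = 120
    optim="adamw_torch",               # AdamW
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=2,
    fp16=True,                         # mixed precision as stated
)
```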