wissamantoun/WebOrganizer-FormatClassifier-ModernBERT

[Paper] [Website] [GitHub]

All credit goes to the original authors of the model and dataset. This model is a retraining of the original classifier with a different base model.

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages. The model is a ModernBERT-base with 149M parameters fine-tuned on the following training data:

WebOrganizer/FormatAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

All Domain Classifiers

wissamantoun/WebOrganizer-FormatClassifier-ModernBERT ← you are here!
wissamantoun/WebOrganizer-TopicClassifier-ModernBERT

Usage

This classifier expects input in the following format (the URL, a blank line, then the page text):

{url}

{text}
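
As a minimal sketch of the template above, the helper below joins a URL and the page text with one blank line. The name `format_input` is hypothetical and not part of the model repository, and stripping surrounding whitespace is an assumption of this sketch, not a documented requirement.

```python
def format_input(url: str, text: str) -> str:
    """Build the classifier input: URL, one blank line, then the page text.

    Hypothetical helper for illustration; whitespace stripping is an
    assumption of this sketch.
    """
    return f"{url.strip()}\n\n{text.strip()}"


web_page = format_input(
    "http://www.example.com",
    "How to build a computer from scratch? Here are the components you need...",
)
print(web_page)
```

The resulting string can be passed to the tokenizer exactly like the hand-written `web_page` string in the example.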

Example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("wissamantoun/WebOrganizer-FormatClassifier-ModernBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "wissamantoun/WebOrganizer-FormatClassifier-ModernBERT",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """http://www.example.com

How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> index of the predicted format; look up its name in model.config.id2label

You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels, also see id2label and label2id in the model config):

Academic Writing
Content Listing
Creative Writing
Customer Support
Comment Section
FAQ
Truncated
Knowledge Article
Legal Notices
Listicle
News Article
Nonfiction Writing
About (Org.)
News (Org.)
About (Pers.)
Personal Blog
Product Page
Q&A Forum
Spam / Ads
Structured Data
Documentation
Audio Transcript
Tutorial
User Review

The full definitions of the categories can be found in the taxonomy config.
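
To turn a predicted index into a readable category, the logits can be passed through a softmax and ranked. The sketch below does this in plain Python so it runs without downloading the model. `FORMAT_LABELS` and `top_k_formats` are hypothetical names, the label order is copied from the list above on the assumption that it matches the model config, and the logits are made up for illustration. With the real model, `model.config.id2label` remains the authoritative mapping.

```python
import math

# Label order copied from the list above (assumed to match model.config.id2label).
FORMAT_LABELS = [
    "Academic Writing", "Content Listing", "Creative Writing", "Customer Support",
    "Comment Section", "FAQ", "Truncated", "Knowledge Article",
    "Legal Notices", "Listicle", "News Article", "Nonfiction Writing",
    "About (Org.)", "News (Org.)", "About (Pers.)", "Personal Blog",
    "Product Page", "Q&A Forum", "Spam / Ads", "Structured Data",
    "Documentation", "Audio Transcript", "Tutorial", "User Review",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_formats(logits, k=3):
    """Return the k most probable (label, probability) pairs."""
    if len(logits) != len(FORMAT_LABELS):
        raise ValueError("expected one logit per format category")
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(FORMAT_LABELS[i], probs[i]) for i in ranked[:k]]


# Made-up logits for illustration only; these are not real model outputs.
example_logits = [0.0] * len(FORMAT_LABELS)
example_logits[22] = 4.0  # index 22 is "Tutorial" in the list above
example_logits[7] = 2.0   # index 7 is "Knowledge Article"

for label, prob in top_k_formats(example_logits, k=2):
    print(f"{label}: {prob:.3f}")
```

With the model from the usage example, the same function could be called on `outputs.logits[0].tolist()`.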

Scores

***** pred metrics *****
  test_accuracy                      =     0.8154
  test_accuracy__0                   =      0.855
  test_accuracy__1                   =     0.7558
  test_accuracy__10                  =     0.9071
  test_accuracy__11                  =     0.6869
  test_accuracy__12                  =     0.8055
  test_accuracy__13                  =     0.7897
  test_accuracy__14                  =     0.8592
  test_accuracy__15                  =     0.8541
  test_accuracy__16                  =     0.8788
  test_accuracy__17                  =     0.7733
  test_accuracy__18                  =     0.7286
  test_accuracy__19                  =     0.6989
  test_accuracy__2                   =     0.7474
  test_accuracy__20                  =     0.7609
  test_accuracy__21                  =     0.7807
  test_accuracy__22                  =     0.7703
  test_accuracy__23                  =     0.7931
  test_accuracy__3                   =     0.6351
  test_accuracy__4                   =      0.871
  test_accuracy__5                   =     0.8333
  test_accuracy__6                   =     0.6125
  test_accuracy__7                   =     0.6416
  test_accuracy__8                   =       0.78
  test_accuracy__9                   =     0.7668
  test_accuracy_conf50               =     0.8312
  test_accuracy_conf50__0            =     0.8852
  test_accuracy_conf50__1            =     0.7651
  test_accuracy_conf50__10           =     0.9167
  test_accuracy_conf50__11           =     0.7168
  test_accuracy_conf50__12           =     0.8256
  test_accuracy_conf50__13           =     0.7996
  test_accuracy_conf50__14           =     0.8696
  test_accuracy_conf50__15           =     0.8684
  test_accuracy_conf50__16           =     0.8878
  test_accuracy_conf50__17           =     0.7838
  test_accuracy_conf50__18           =     0.7663
  test_accuracy_conf50__19           =     0.7276
  test_accuracy_conf50__2            =     0.7609
  test_accuracy_conf50__20           =     0.7907
  test_accuracy_conf50__21           =        0.8
  test_accuracy_conf50__22           =     0.7927
  test_accuracy_conf50__23           =     0.7904
  test_accuracy_conf50__3            =     0.6617
  test_accuracy_conf50__4            =      0.877
  test_accuracy_conf50__5            =     0.8571
  test_accuracy_conf50__6            =     0.6299
  test_accuracy_conf50__7            =     0.6786
  test_accuracy_conf50__8            =     0.7755
  test_accuracy_conf50__9            =     0.7796
  test_accuracy_conf75               =     0.9003 <--- Metric from the paper
  test_accuracy_conf75__0            =     0.9412
  test_accuracy_conf75__1            =     0.8318
  test_accuracy_conf75__10           =     0.9542
  test_accuracy_conf75__11           =     0.8478
  test_accuracy_conf75__12           =     0.8841
  test_accuracy_conf75__13           =     0.8724
  test_accuracy_conf75__14           =      0.914
  test_accuracy_conf75__15           =     0.9345
  test_accuracy_conf75__16           =     0.9316
  test_accuracy_conf75__17           =     0.8667
  test_accuracy_conf75__18           =     0.8446
  test_accuracy_conf75__19           =     0.8209
  test_accuracy_conf75__2            =     0.8333
  test_accuracy_conf75__20           =     0.9333
  test_accuracy_conf75__21           =     0.8587
  test_accuracy_conf75__22           =     0.8708
  test_accuracy_conf75__23           =     0.8309
  test_accuracy_conf75__3            =     0.7292
  test_accuracy_conf75__4            =     0.9357
  test_accuracy_conf75__5            =     0.9032
  test_accuracy_conf75__6            =     0.7816
  test_accuracy_conf75__7            =     0.8011
  test_accuracy_conf75__8            =     0.8409
  test_accuracy_conf75__9            =     0.8592
  test_accuracy_label_average        =     0.7744
  test_accuracy_label_average_conf50 =     0.7919
  test_accuracy_label_average_conf75 =     0.8676
  test_accuracy_label_min            =     0.6125
  test_accuracy_label_min_conf75     =     0.7292 <--- Metric from the paper
  test_loss                          =     0.6023
  test_proportion_conf50             =     0.9638
  test_proportion_conf75             =     0.7951
  test_runtime                       = 0:00:08.38
  test_samples_per_second            =   1192.262
  test_steps_per_second              =     37.318
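
The log does not say how the aggregate metrics are derived. The reported values are consistent with `test_accuracy_label_average` being the unweighted mean of the 24 per-label accuracies and `test_accuracy_label_min` being their minimum. The sketch below checks this reading against the numbers printed above; it is an interpretation of the log, not the original evaluation code.

```python
# Per-label test accuracies copied from the log above (list index = label id).
per_label_accuracy = [
    0.8550, 0.7558, 0.7474, 0.6351, 0.8710, 0.8333, 0.6125, 0.6416,
    0.7800, 0.7668, 0.9071, 0.6869, 0.8055, 0.7897, 0.8592, 0.8541,
    0.8788, 0.7733, 0.7286, 0.6989, 0.7609, 0.7807, 0.7703, 0.7931,
]

# Unweighted mean over labels: every category counts equally,
# regardless of how many test documents it has.
label_average = sum(per_label_accuracy) / len(per_label_accuracy)

# Worst-performing category and its label id.
label_min = min(per_label_accuracy)
worst_label = per_label_accuracy.index(label_min)

print(f"label_average = {label_average:.4f}")  # 0.7744, as in the log
print(f"label_min     = {label_min:.4f} (label {worst_label})")  # 0.6125 (label 6)
```

The overall `test_accuracy` (0.8154) is higher than the label average (0.7744), which suggests that the more frequent categories in the test set are also the ones classified more accurately.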

Citation

@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  journal={arXiv preprint arXiv:2502.10341},
  year={2025}
}