---
library_name: transformers
license: apache-2.0
language:
  - ja
---
# llm-jp-modernbert-base
This model is based on the ModernBERT-base architecture and uses the llm-jp-tokenizer. It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192 tokens.

For details on the training methods, evaluation, and analysis, please refer to the paper [llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length](https://arxiv.org/abs/2504.15544).
## Usage
Please install the transformers library:

```bash
pip install "transformers>=4.48.0"
```
If your GPU supports FlashAttention 2, installing flash-attn is recommended:

```bash
pip install flash-attn --no-build-isolation
```
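Once installed, FlashAttention 2 can be requested when loading the model through the `attn_implementation` argument of `from_pretrained`. The snippet below is a minimal sketch of this, with a fallback to the default attention when flash-attn cannot be used:

```python
import torch
from transformers import AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base"

try:
    # Use FlashAttention 2 if flash-attn is installed and the GPU supports it.
    model = AutoModelForMaskedLM.from_pretrained(
        model_id,
        attn_implementation="flash_attention_2",
        torch_dtype=torch.bfloat16,
    )
except (ImportError, ValueError):
    # Fall back to the default attention implementation.
    model = AutoModelForMaskedLM.from_pretrained(model_id)
```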
Using `AutoModelForMaskedLM`:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "日本の首都は<MASK|LLM-jp>です。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get the prediction for the masked position.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  東京
```
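The encoder can also be used for feature extraction via `AutoModel`. The following is an illustrative sketch only; mean pooling over the last hidden state is an assumption made here for the example, not a pooling strategy prescribed by this model card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

texts = ["日本の首都は東京です。", "今日は良い天気です。"]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden state over non-padding tokens to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```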
## Training
This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2.
Training code can be found in the [llm-jp-modernbert repository](https://github.com/llm-jp/llm-jp-modernbert).
| Hyperparameter | Stage 1 | Stage 2 |
|---|---|---|
| max_seq_len | 1024 | 8192 |
| max_steps | 500,000 | 200,000 |
| Total batch size | 3328 | 384 |
| Peak LR | 5e-4 | 5e-5 |
| Warmup steps | 24,000 | |
| LR schedule | Linear decay | |
| Adam beta 1 | 0.9 | |
| Adam beta 2 | 0.98 | |
| Adam eps | 1e-6 | |
| MLM prob | 0.30 | |
| Gradient clipping | 1.0 | |
| Weight decay | 1e-5 | |
| line_by_line | True | |

Blank cells in the stage 2 column indicate the same value as in stage 1.
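For reference, the masking setup in the table corresponds to the standard `DataCollatorForLanguageModeling` in transformers. The snippet below is an illustrative sketch of that single piece only; the actual two-stage pipeline, including the line_by_line preprocessing, is in the repository linked above:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base")

# MLM probability of 0.30, matching the table above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

# Collate a toy batch of tokenized sentences into masked inputs and labels.
examples = [
    tokenizer("日本の首都は東京です。", truncation=True, max_length=1024),
    tokenizer("富士山は日本で一番高い山です。", truncation=True, max_length=1024),
]
batch = collator(examples)
print(batch["input_ids"].shape, batch["labels"].shape)
```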
## Evaluation
JSTS, JNLI, and JCoLA from JGLUE were used for evaluation. Evaluation code can be found in the [llm-jp-modernbert repository](https://github.com/llm-jp/llm-jp-modernbert).
| Model | JSTS (Pearson) | JNLI (accuracy) | JCoLA (accuracy) | Avg |
|---|---|---|---|---|
| tohoku-nlp/bert-base-japanese-v3 | 0.920 | 0.912 | 0.880 | 0.904 | 
| sbintuitions/modernbert-ja-130m | 0.916 | 0.927 | 0.868 | 0.904 | 
| sbintuitions/modernbert-ja-310m | 0.932 | 0.933 | 0.883 | 0.916 | 
| llm-jp/llm-jp-modernbert-base | 0.918 | 0.913 | 0.844 | 0.892 | 
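The scores above are obtained by fine-tuning the encoder with a task-specific classification head; the exact setup is in the evaluation code linked above. As a rough sketch of the general pattern only (the label count and the commented-out training arguments are placeholders, not the authors' settings):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "llm-jp/llm-jp-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 3-way classification head, e.g. for an NLI-style task such as JNLI (illustrative only).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# `train_dataset` and `eval_dataset` are assumed to be tokenized JGLUE splits prepared separately.
# args = TrainingArguments(output_dir="out", per_device_train_batch_size=32, num_train_epochs=3)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```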
## License

Apache License, Version 2.0
## Citation
```bibtex
@misc{sugiura2025llmjpmodernbertmodernbertmodeltrained,
      title={llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length},
      author={Issa Sugiura and Kouta Nakayama and Yusuke Oda},
      year={2025},
      eprint={2504.15544},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.15544},
}
```

