Paper: YaRN: Efficient Context Window Extension of Large Language Models (arXiv:2309.00071)
Modern Multilingual BERT with 32K context length, extended from 8K to 32K tokens using YaRN RoPE scaling.
This model extends jhu-clsp/mmBERT-base (Modern Multilingual BERT supporting 1800+ languages) from a maximum context length of 8,192 to 32,768 tokens using the YaRN (Yet another RoPE extensioN) scaling method.
| Property | Value |
|---|---|
| Base Model | jhu-clsp/mmBERT-base |
| Architecture | ModernBERT (RoPE + Flash Attention 2) |
| Parameters | 307M |
| Max Context | 32,768 tokens (extended from 8,192) |
| Languages | 1800+ languages |
| Vocab Size | 256,000 (Gemma 2 tokenizer) |
| Scaling Method | YaRN RoPE (4x extension) |
This model is designed for long-context multilingual tasks such as masked language modeling and embedding extraction over documents of up to 32K tokens. It is part of the vLLM Semantic Router Mixture-of-Models (MoM) family.
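As a quick sanity check of the 4x extension, the published configuration can be inspected directly. The snippet below is a minimal sketch using only the standard transformers Auto classes; the rope_scaling attribute is read defensively, since where the YaRN parameters live depends on how the config was exported.

```python
from transformers import AutoConfig, AutoTokenizer

# Minimal sketch: confirm the advertised 32K context from the published config.
config = AutoConfig.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
print(config.max_position_embeddings)         # expected: 32768
print(getattr(config, "rope_scaling", None))  # YaRN parameters, if stored in the config

tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
print(tokenizer.model_max_length)             # tokenizer-side length limit
```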
Long-range retrieval accuracy by token distance:
| Distance (tokens) | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|
| 64 | 100% | 100% |
| 128 | 100% | 100% |
| 256 | 100% | 100% |
| 512 | 100% | 100% |
| 1024 | 100% | 100% |
| 2048 | 100% | 100% |
| 4096 | 0% | 0% |
| 8192 | 0% | 0% |
Summary: Perfect retrieval up to 2048 tokens. Long-range capability improved from baseline ~33% to 50% (averaged across all distances ≥1024).
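The card does not spell out the retrieval protocol, so the following is only a hedged sketch of one way to probe retrieval at a given token distance: a named fact is planted at the start of the sequence, filler text pushes the query roughly `distance` tokens away, and the model fills a mask that refers back to the fact. The prompt wording and filler are illustrative assumptions, not the evaluation actually used.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "llm-semantic-router/mmbert-32k-yarn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def retrieval_probe(distance: int, key: str = "Paris") -> list[str]:
    # Plant the fact, pad with filler to roughly `distance` tokens, then query it.
    fact = f"The secret city is {key}. "
    filler = "and then " * (distance // 2)  # rough token-count filler (assumption)
    query = f"The secret city is {tokenizer.mask_token}."
    inputs = tokenizer(fact + filler + query, return_tensors="pt",
                       truncation=True, max_length=32768)
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    top5 = logits[0, mask_pos].topk(5).indices
    return [tokenizer.decode(i).strip() for i in top5]

for d in (64, 1024, 4096):
    print(d, retrieval_probe(d))
```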
Per-language correctness spot check:
| Language | Correct |
|---|---|
| English (en) | ✅ |
| German (de) | ✅ |
| French (fr) | ✅ |
| Spanish (es) | ✅ |
| Chinese (zh) | ✅ |
| Japanese (ja) | ✅ |
| Russian (ru) | ✅ |
| Arabic (ar) | ✅ |
| Korean (ko) | ✅ |
| Portuguese (pt) | ✅ |
Overall: 100% (10/10 languages tested)
Masked-LM loss and perplexity across context lengths:
| Context Length (tokens) | Loss | Perplexity |
|---|---|---|
| 512 | 0.0110 | 1.01 |
| 1024 | 0.0082 | 1.01 |
| 2048 | 0.0065 | 1.01 |
| 4096 | 0.0036 | 1.00 |
| 8192 | 0.0014 | 1.00 |
| 16384 | 0.0014 | 1.00 |
| 24576 | 0.0014 | 1.00 |
| 32768 | 0.0003 | 1.00 |
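Perplexity here is exp(mean masked-LM loss), so exp(0.0110) ≈ 1.011, which rounds to the 1.01 shown for 512 tokens. The snippet below is a rough sketch of how such a number could be reproduced at a chosen context length; the evaluation text, and the use of DataCollatorForLanguageModeling with the mlm_probability of 0.3 listed under the training hyperparameters, are assumptions rather than the exact evaluation setup.

```python
import math
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

model_id = "llm-semantic-router/mmbert-32k-yarn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def mlm_perplexity(text: str, context_length: int, mlm_probability: float = 0.3) -> float:
    # Tokenize to the target context length and apply random masking.
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=context_length)
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=mlm_probability)
    batch = collator([{"input_ids": enc["input_ids"][0]}])
    with torch.no_grad():
        loss = model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    return math.exp(loss.item())

# Example (hypothetical evaluation text):
# print(mlm_perplexity(open("my_long_document.txt").read(), context_length=8192))
```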
Accuracy by position range within the extended context:
| Position Range (tokens) | Accuracy |
|---|---|
| 0-2048 | 100% |
| 2048-4096 | 100% |
| 4096-6144 | 100% |
| 6144-8192 | 100% |
| 8192-10240 | 100% |
| 10240-12288 | 100% |
| 12288-14336 | 100% |
| 14336-16384 | 100% |
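# YaRN RoPE scaling configuration (a frequency-interpolation sketch follows this block)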
base_model: jhu-clsp/mmBERT-base
rope_scaling_type: yarn
original_max_position_embeddings: 8192
target_max_position_embeddings: 32768
scaling_factor: 4.0
yarn_beta_fast: 32.0
yarn_beta_slow: 1.0
# Training hyperparameters
learning_rate: 1e-5
batch_size: 1 (effective: 16 with gradient accumulation)
gradient_accumulation_steps: 16
num_epochs: 1
warmup_steps: 100
lr_scheduler: constant_with_warmup
mlm_probability: 0.3
bf16: true
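The scaling_factor and yarn_beta_fast / yarn_beta_slow values above control how each RoPE frequency band is interpolated. The sketch below is a simplified, stand-alone rendition of the YaRN "NTK-by-parts" rule, not the exact code used to patch this checkpoint: dimensions that rotate more than beta_fast times over the original 8,192-token window keep their original frequency, dimensions that rotate fewer than beta_slow times are divided by the 4x scale, and the band in between is linearly blended.

```python
import math
import torch

def yarn_inv_freq(dim: int, base: float = 10000.0, scale: float = 4.0,
                  original_max_pos: int = 8192,
                  beta_fast: float = 32.0, beta_slow: float = 1.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: theta_i = base^(-2i / dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))

    # Number of full rotations each dimension completes over the original window;
    # high-frequency dimensions rotate many times, low-frequency ones barely move.
    rotations = original_max_pos * inv_freq / (2 * math.pi)

    # Ramp is 0 for dimensions rotating >= beta_fast times (keep original frequency)
    # and 1 for dimensions rotating <= beta_slow times (fully interpolate).
    ramp = torch.clamp((beta_fast - rotations) / (beta_fast - beta_slow), 0.0, 1.0)

    # Blend plain position interpolation (inv_freq / scale) with the original frequencies.
    # (YaRN additionally rescales attention logits by roughly 0.1 * ln(scale) + 1,
    #  which this sketch omits.)
    return (1 - ramp) * inv_freq + ramp * (inv_freq / scale)

# The head dimension and RoPE base of the actual checkpoint may differ; these are placeholders.
print(yarn_inv_freq(dim=64)[:4])
```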
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")

# Multilingual MLM example
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the mask position and read off the top-5 predictions
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
logits = outputs.logits[0, mask_idx]
top5 = [tokenizer.decode(idx).strip() for idx in logits.topk(5).indices]
print(top5)  # ['Paris', 'Strasbourg', 'Nice', 'Brussels', 'Lyon']
# Process long documents (up to 32K tokens)
long_document = "..." * 30000  # Your long text in any of 1800+ languages
inputs = tokenizer(
    long_document,
    return_tensors="pt",
    max_length=32768,
    truncation=True
)
outputs = model(**inputs)
import torch

# Get embeddings for downstream tasks
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the last hidden state over the sequence dimension
embeddings = outputs.hidden_states[-1].mean(dim=1)
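The simple mean over dim=1 above also averages any padding positions when several texts are batched together. A common alternative, sketched below reusing the `model` and `inputs` from the example above, is to weight the mean by the attention mask; this is not necessarily what the Semantic Router pipeline itself does.

```python
import torch

def masked_mean_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 for real tokens.
    mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
embeddings = masked_mean_pool(outputs.hidden_states[-1], inputs["attention_mask"])
```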
An ONNX export is available for high-performance inference with ONNX Runtime.
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer and ONNX model
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
sess = ort.InferenceSession(
    "onnx/model.onnx",  # or download from HF
    providers=['ROCmExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Inference
text = "What is the weather like today?"
inputs = tokenizer(text, return_tensors="np", padding=True)
outputs = sess.run(None, {
    'input_ids': inputs['input_ids'].astype(np.int64),
    'attention_mask': inputs['attention_mask'].astype(np.int64)
})
embeddings = outputs[0].mean(axis=1)  # Mean pooling
The ONNX export can also be used from Rust via the onnx_semantic_router crate:
use onnx_semantic_router::MmBertEmbeddingModel;

let model = MmBertEmbeddingModel::load("./mmbert-32k-yarn-onnx", false)?;
let embeddings = model.embed("What is the weather?")?;
ONNX Runtime inference latency (lower is better):
| Backend | Single text | Batch of 4 (per text) |
|---|---|---|
| CPU | 10.1ms | 6.8ms |
| ROCm GPU | 4.7ms | 1.2ms |
@misc{mmbert-32k-yarn,
  title={mmBERT-32K-YaRN: Extended Context Modern Multilingual BERT},
  author={vLLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mmbert-32k-yarn}
}
MIT License (same as mmBERT base model)