mmBERT-32K-YaRN

Modern Multilingual BERT with a 32K context length, extended from 8K to 32K tokens using YaRN RoPE scaling.

This model extends jhu-clsp/mmBERT-base (Modern Multilingual BERT supporting 1800+ languages) from a maximum context length of 8,192 tokens to 32,768 tokens using the YaRN (Yet another RoPE extensioN) scaling method.
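
A quick way to confirm the extended window after downloading is to inspect the released configuration; the attribute name below follows the standard Transformers convention, so treat it as an assumption if the config layout differs.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
print(config.max_position_embeddings)  # expected: 32768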

Model Description

  • Base Model: jhu-clsp/mmBERT-base
  • Architecture: ModernBERT (RoPE + Flash Attention 2)
  • Parameters: 307M
  • Max Context: 32,768 tokens (extended from 8,192)
  • Languages: 1800+
  • Vocab Size: 256,000 (Gemma 2 tokenizer)
  • Scaling Method: YaRN RoPE (4x extension)

Intended Use

This model is designed for:

  • Long-document understanding in any of 1800+ languages
  • Semantic routing for LLM request classification
  • Document classification with extended context
  • Information retrieval from long texts
  • Multilingual NLP tasks requiring long context

Part of the vLLM Semantic Router Mixture-of-Models (MoM) family.
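
As an illustration of the semantic-routing use case above, the sketch below embeds a request and compares it against embedded route descriptions with cosine similarity. It is a minimal example, not the actual vLLM Semantic Router pipeline; the route names and descriptions are invented for this snippet.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "llm-semantic-router/mmbert-32k-yarn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
encoder = AutoModel.from_pretrained(model_id).eval()

def embed(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=32768)
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mask-aware mean pooling

routes = {
    "code": "programming, debugging, and software engineering questions",
    "math": "math word problems and symbolic calculations",
    "chat": "open-ended conversation and small talk",
}
route_vecs = embed(list(routes.values()))
query_vec = embed(["Why does my Rust program fail to compile?"])
scores = F.cosine_similarity(query_vec, route_vecs)  # one score per route
print(list(routes)[scores.argmax().item()])          # expected to pick "code"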

Evaluation Results

Distance-Based Retrieval (Key Metric for Long-Context)

Distance (tokens) Top-1 Accuracy Top-5 Accuracy
64 100% 100%
128 100% 100%
256 100% 100%
512 100% 100%
1024 100% 100%
2048 100% 100%
4096 0% 0%
8192 0% 0%

Summary: Retrieval is perfect up to a distance of 2,048 tokens and drops to 0% at 4,096 tokens and beyond. Averaged over all distances ≥ 1,024 tokens, long-range retrieval improves from ~33% for the baseline to 50% for this model.
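
The exact probe script is not included in this card. As a hedged illustration, one way to set up such a distance test is to state a fact once, pad with filler, and ask the model to fill the same fact back in roughly distance tokens later; the filler sentence and answer word below are placeholders.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "llm-semantic-router/mmbert-32k-yarn"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id).eval()

def retrieval_at_distance(distance: int, answer: str = "Paris") -> bool:
    fact = f"Remember this: the hidden city is {answer}. "
    # Each filler sentence is roughly a dozen tokens, so repeat it enough times
    # to put about `distance` tokens between the fact and the query.
    filler = "This sentence is neutral filler text with no useful content. " * max(1, distance // 12)
    query = f"As stated at the beginning, the hidden city is {tokenizer.mask_token}."
    inputs = tokenizer(fact + filler + query, return_tensors="pt",
                       truncation=True, max_length=32768)
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    prediction = tokenizer.decode(logits[0, mask_pos].argmax().item()).strip()
    return prediction == answer

print(retrieval_at_distance(1024))  # True if retrieval succeeds at this distance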

Multilingual MLM Accuracy

Language Correct
English (en) ✓
German (de) ✓
French (fr) ✓
Spanish (es) ✓
Chinese (zh) ✓
Japanese (ja) ✓
Russian (ru) ✓
Arabic (ar) ✓
Korean (ko) ✓
Portuguese (pt) ✓

Overall: 100% (10/10 languages tested)
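
A per-language spot check can be run with a handful of fill-mask prompts; the examples below are illustrative, not the exact evaluation set behind the table, and reuse model and tokenizer from the retrieval sketch above.

prompts = {
    "en": "The capital of France is {}.",
    "de": "Die Hauptstadt von Frankreich ist {}.",
    "fr": "La capitale de la France est {}.",
    "es": "La capital de Francia es {}.",
    "zh": "法国的首都是{}。",
}
for lang, template in prompts.items():
    enc = tokenizer(template.format(tokenizer.mask_token), return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    pos = (enc["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    print(lang, tokenizer.decode(logits[0, pos].argmax().item()).strip())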

Perplexity by Context Length

Context Length Loss Perplexity
512 0.0110 1.01
1024 0.0082 1.01
2048 0.0065 1.01
4096 0.0036 1.00
8192 0.0014 1.00
16384 0.0014 1.00
24576 0.0014 1.00
32768 0.0003 1.00
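
Perplexity here is the exponential of the mean masked-LM loss, so each row can be checked directly:

import math
print(round(math.exp(0.0110), 2))  # 1.01, the 512-token row
print(round(math.exp(0.0003), 2))  # 1.0, the 32,768-token row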

Position-wise Accuracy (16K context)

Position Range Accuracy
0-2048 100%
2048-4096 100%
4096-6144 100%
6144-8192 100%
8192-10240 100%
10240-12288 100%
12288-14336 100%
14336-16384 100%

Training Details

Training Configuration

base_model: jhu-clsp/mmBERT-base
rope_scaling_type: yarn
original_max_position_embeddings: 8192
target_max_position_embeddings: 32768
scaling_factor: 4.0
yarn_beta_fast: 32.0
yarn_beta_slow: 1.0

# Training hyperparameters
learning_rate: 1e-5
batch_size: 1 (effective: 16 with gradient accumulation)
gradient_accumulation_steps: 16
num_epochs: 1
warmup_steps: 100
lr_scheduler: constant_with_warmup
mlm_probability: 0.3
bf16: true
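
For reference, below is a minimal sketch of the YaRN "NTK-by-parts" frequency interpolation that these parameters describe; the head dimension and RoPE base are placeholders, and the released config.json holds the actual values.

import math
import numpy as np

def yarn_inv_freq(dim=64, base=10000.0, factor=4.0,
                  orig_max_pos=8192, beta_fast=32.0, beta_slow=1.0):
    # Standard RoPE inverse frequencies, one per pair of head dimensions.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    # Full rotations each frequency completes over the original 8,192-token window.
    rotations = orig_max_pos * inv_freq / (2 * math.pi)
    # Ramp is 0 for low-frequency dims (fully interpolated) and 1 for
    # high-frequency dims (left unchanged), linear in between.
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    # Blend plain position interpolation (inv_freq / factor) with the originals.
    return (inv_freq / factor) * (1.0 - ramp) + inv_freq * ramp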

Training Data

  • Dataset: CC-100 (Common Crawl) multilingual corpus
  • Samples: 30,774 sequences
  • Sequence Length: 32,768 tokens each
  • Total Tokens: ~1B tokens

Hardware

  • GPU: AMD Instinct MI300X (192GB VRAM)
  • Training Time: ~6.5 hours
  • Framework: PyTorch 2.3 + ROCm 6.2

Usage

Basic Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")

# Multilingual MLM example (use the tokenizer's own mask token string)
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Locate the masked position and decode the top-5 predictions
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
logits = outputs.logits[0, mask_idx]
top5 = [tokenizer.decode(idx.item()).strip() for idx in logits.topk(5).indices]
print(top5)  # e.g. ['Paris', 'Strasbourg', 'Nice', 'Brussels', 'Lyon']

Long Context Usage (32K tokens)

# Process long documents (up to 32K tokens)
long_document = "..." * 30000  # Your long text in any of 1800+ languages
inputs = tokenizer(
    long_document, 
    return_tensors="pt", 
    max_length=32768, 
    truncation=True
)
outputs = model(**inputs)

Feature Extraction

import torch

# Get embeddings for downstream tasks
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
    # Use last hidden state or pooled output
    embeddings = outputs.hidden_states[-1].mean(dim=1)  # Mean pooling
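
For batched, padded inputs, a mask-aware mean pooling avoids averaging over padding tokens (the plain mean above is fine for a single unpadded text):

mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.hidden_states[-1] * mask).sum(dim=1) / mask.sum(dim=1)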

ONNX Runtime Usage

An ONNX export is available for high-performance inference with ONNX Runtime.

Python

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load tokenizer and ONNX model
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-32k-yarn")
sess = ort.InferenceSession(
    "onnx/model.onnx",  # or download from HF
    providers=['ROCmExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
)

# Inference
text = "What is the weather like today?"
inputs = tokenizer(text, return_tensors="np", padding=True)
outputs = sess.run(None, {
    'input_ids': inputs['input_ids'].astype(np.int64),
    'attention_mask': inputs['attention_mask'].astype(np.int64)
})
embeddings = outputs[0].mean(axis=1)  # Mean pooling

Rust (ort-binding)

use onnx_semantic_router::MmBertEmbeddingModel;

let model = MmBertEmbeddingModel::load("./mmbert-32k-yarn-onnx", false)?;
let embeddings = model.embed("What is the weather?")?;

Latency Benchmarks (AMD MI300X)

Backend  Single text  Batch of 4 (per text)
CPU  10.1 ms  6.8 ms
ROCm GPU  4.7 ms  1.2 ms
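
The exact figures depend on hardware, ONNX Runtime version, and execution provider. A rough way to measure latency yourself, reusing sess and inputs from the Python ONNX example above:

import time
import numpy as np

feed = {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
}
for _ in range(10):        # warm-up runs
    sess.run(None, feed)
timings = []
for _ in range(100):
    start = time.perf_counter()
    sess.run(None, feed)
    timings.append(time.perf_counter() - start)
print(f"median latency: {1000 * np.median(timings):.1f} ms")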

Limitations

  • Long-range retrieval: While the model handles 32K-token inputs, retrieval accuracy drops sharply beyond a distance of about 2,048 tokens (0% at 4,096 and 8,192 in the probe above)
  • Training data: Trained on CC-100 which may have biases from web crawl data
  • Compute requirements: Full 32K context requires significant GPU memory (~180GB for batch size 1)

Citation

@misc{mmbert-32k-yarn,
  title={mmBERT-32K-YaRN: Extended Context Modern Multilingual BERT},
  author={vLLM Semantic Router Team},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/llm-semantic-router/mmbert-32k-yarn}
}

License

MIT License (same as the mmBERT base model)
