TermiGen-32B

TermiGen-32B achieves 31.3% pass@1 on TerminalBench 1.0, establishing a new open-weight state-of-the-art and surpassing proprietary models like o4-mini with Codex CLI (20.0%).

📄 Paper: TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents
💻 Environments: https://github.com/ucsb-mlsec/terminal-bench-env
🧪 Benchmark: https://github.com/laude-institute/terminal-bench


Model Description

This model is fine-tuned from Qwen2.5-Coder-32B-Instruct using the TermiGen pipeline, which synthesizes high-fidelity training data through two phases:

Phase I: Environment Synthesis

  • Multi-agent system generates 3,500+ verified Docker environments
  • Tasks span 11 categories: system administration, security forensics, scientific computing, MLOps, etc.
  • 420 unique command-line tools across 16 functional domains
  • Automated unit test validation ensures task solvability

Phase II: Error-Correction Trajectory Collection

  • Generator-Critic framework with 20% error injection rate
  • Teaches error → diagnosis → recovery cycles
  • 3,291 trajectories (avg. 25.5 turns, 8,722 tokens each)
  • Teacher model: Claude-4.5-Sonnet

Training Details

Training Hyperparameters:

  • Base Model: Qwen2.5-Coder-32B-Instruct
  • Learning Rate: 5e-6 (cosine schedule, 10% warmup)
  • Batch Size: 32 (8 GPUs × 4 gradient accumulation)
  • Sequence Length: 20,000 tokens
  • Epochs: 5
  • Precision: BF16 with DeepSpeed ZeRO-3
  • Hardware: 8× AMD MI325X GPUs

Dataset Statistics:

  • 3,500+ verified environments across 11 task categories
  • 3,291 training trajectories
  • Tool diversity: 420 unique CLI tools
  • Average trajectory: 25.5 turns, 8,722 tokens

Evaluation Results

TerminalBench Performance

Benchmark Pass@1
TerminalBench 1.0 31.3%
TerminalBench 2.0 18.0%

Usage

We implemented a minimal BashAgent framework based on TerminalBench for agentic terminal execution. The agent interacts with Docker containers via bash shell, generating ReAct-style responses at each turn.

For detailed usage and integration examples, please refer to our GitHub repository.


Citation

@article{zhu2026termigen,
  title={TermiGen: High-Fidelity Environment and Robust Trajectory Synthesis for Terminal Agents},
  author={Zhu, Kaijie and Nie, Yuzhou and Li, Yijiang and Huang, Yiming and Wu, Jialian and Liu, Jiang and Sun, Ximeng and Yin, Zhenfei and Wang, Lun and Liu, Zicheng and Barsoum, Emad and Wang, William Yang and Guo, Wenbo},
  journal={arXiv preprint arXiv:2602.07274},
  url={https://arxiv.org/abs/2602.07274}, 
  year={2026}
}

License

Apache 2.0 (inherited from Qwen2.5-Coder base model)

Contact: Kaijie Zhu ([email protected])

Downloads last month
55
Safetensors
Model size
1.12M params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for UCSB-SURFI/TermiGen-32B

Base model

Qwen/Qwen2.5-32B
Finetuned
(116)
this model
Quantizations
2 models

Paper for UCSB-SURFI/TermiGen-32B