2026
IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor W. Killian, Aviral Kumar
arXiv pre-print
We establish scaling laws for LLM reinforcement learning by identifying how to optimally allocate sampling compute across parallel rollouts, problem batch size, and update steps. Our analysis reveals that increasing the number of parallel rollouts per problem is the primary driver of performance, improving solution quality on easy tasks and coverage on hard ones, and yields practical rules for compute-efficient post-training.
Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization
Matt Landers, Taylor W. Killian, Tom Hartvigsen, Afsaneh Doryab
ICLR 2026
SPIN is a two-stage framework for offline reinforcement learning in large combinatorial action spaces: it first pre-trains an Action Structure Model to learn valid action patterns, then trains lightweight heads for control. This approach significantly improves performance and stability, outperforming current methods by up to 39% in reward while converging up to $12.8\times$ faster.
2025
SAINT: Attention-Based Policies for Discrete Combinatorial Action Spaces
Matt Landers, Taylor W. Killian, Tom Hartvigsen, Afsaneh Doryab
arXiv pre-print
SAINT is a novel policy architecture that uses Transformers to model combinatorial action spaces as unordered sets, capturing complex sub-action dependencies via self-attention. This permutation-invariant approach significantly outperforms traditional baselines in environments with up to $1.35 \times 10^{18}$ possible actions by improving sample efficiency and joint behavior modeling.
BraVE: Offline Reinforcement Learning for Discrete Combinatorial Action Spaces
Matt Landers, Taylor W. Killian, Hugo Barnes, Tom Hartvigsen, Afsaneh Doryab
NeurIPS 2025
BraVE addresses the computational challenges of high-dimensional, discrete action spaces in offline RL by using tree-structured traversal to capture sub-action dependencies efficiently. This approach requires evaluating only a linear number of joint actions, outperforming existing methods by up to $20\times$ in complex environments.
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
NeurIPS 2025
Reinforcement learning has emerged as a promising approach to improving large language model reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. We introduce Guru, a curated RL reasoning corpus spanning six reasoning domains.
K2-Think: A Parameter-Efficient Reasoning System
Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing
MBZUAI IFM Technical Report
K2-Think is a parameter-efficient 32B model that rivals much larger systems by combining long chain-of-thought training with advanced test-time computation techniques. It achieves state-of-the-art reasoning performance in math, code, and science while delivering ultra-fast inference speeds of over 2,000 tokens per second.
Robust Autonomy Emerges from Self-Play
Marco Cusumano-Towner, David Hafner, Alex Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor W. Killian, Stuart Bowers, Ozan Sener, Philipp Krahenbuhl, Vladlen Koltun
ICML 2025
We developed a robust autonomous driving agent in simulation via self-play at massive scale. The simulator was designed to run in massively parallel settings, allowing us to aggressively randomize each agent's physical and behavioral characteristics and generate substantial amounts of experience.