Publications

2026

From Reasoning Traces to Reusable Modules: Reinforcement Learning for Compositional Generalization in Language Model Reasoning
Lingjing Kong, Xin Liu, Guangyi Chen, Martin Q. Ma, Xiangchen Song, Yuekai Sun, Mikhail Yurochkin, Taylor W. Killian, Ruslan Salakhutdinov, Kun Zhang, Eric P. Xing, Zhengzhong Liu
ICML 2026
We demonstrate that RL enables LLMs to achieve compositional generalization by extracting and recombining reusable atomic modules from reasoning traces. Our theoretical framework shows RL's exploratory nature provides the coverage needed to identify latent structure, while experiments confirm that training on compound traces yields stronger generalization than isolated modules alone.
IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL
Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor W. Killian, Aviral Kumar
ICML 2026
We establish scaling laws for LLM reinforcement learning by identifying how to optimally allocate compute across parallel rollouts, problem batch size, and update steps. Our analysis shows that increasing parallel rollouts per problem is the primary driver of performance (improving solution quality for easy tasks and coverage for hard ones), and we provide practical rules for compute-efficient post-training.
Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization
Matt Landers, Taylor W. Killian, Tom Hartvigsen, Afsaneh Doryab
ICLR 2026
SPIN is a two-stage framework that optimizes reinforcement learning in complex combinatorial spaces by first pre-training an Action Structure Model to learn valid action patterns and then training lightweight heads for control. This approach significantly improves performance and stability, outperforming current methods by up to 39% in rewards while achieving convergence up to $12.8\times$ faster.

2025

Concise Reasoning in the Lens of Lagrangian Optimization
Chengqian Gao, Haonan Li, Taylor W. Killian, Jianshu She, Renxi Wang, Liqun Ma, Zhoujun Cheng, Shibo Hao, Zhiqiang Xu
arXiv pre-print
PALU is a pragmatic optimization strategy that streamlines LLM reasoning by treating concision as a mathematical trade-off between output length and accuracy. It reduces response length by 65% while boosting performance across various domains and model scales, indicating that shorter, more focused reasoning chains can be more effective.
K2-V2: A 360-Open, Reasoning-Enhanced LLM
K2 Team, Institute of Foundation Models
MBZUAI IFM Technical Report
K2-V2 is a fully open-source ("360-open") LLM designed as a high-performance foundation for complex reasoning, knowledge retrieval, and tool use. By releasing its complete training history and data, the model provides a transparent, "reasoning-centric" base that rivals leading open-weight models like Qwen2.5-72B and approaches the performance of much larger systems.
SAINT: Attention-Based Policies for Discrete Combinatorial Action Spaces
Matt Landers, Taylor W. Killian, Tom Hartvigsen, Afsaneh Doryab
arXiv pre-print
SAINT is a novel policy architecture that uses Transformers to model combinatorial action spaces as unordered sets, capturing complex sub-action dependencies via self-attention. This permutation-invariant approach significantly outperforms traditional baselines in environments with up to $1.35 \times 10^{18}$ possible actions by improving sample efficiency and joint behavior modeling.
BraVE: Offline Reinforcement Learning for Discrete Combinatorial Action Spaces
Matt Landers, Taylor W. Killian, Hugo Barnes, Tom Hartvigsen, Afsaneh Doryab
NeurIPS 2025
BraVE addresses the computational challenges of high-dimensional, discrete action spaces in offline RL by using tree-structured traversal to capture sub-action dependencies efficiently. This approach enables the evaluation of a linear number of joint actions, outperforming existing methods by up to $20\times$ in complex environments.
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
NeurIPS 2025
Reinforcement learning has emerged as a promising approach to improve large language model reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. We introduce Guru, a curated RL reasoning corpus spanning six reasoning domains.
K2-Think: A Parameter-Efficient Reasoning System
Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing
MBZUAI IFM Technical Report
K2-Think is a parameter-efficient 32B model that rivals much larger systems by combining long chain-of-thought training with advanced test-time computation techniques. It achieves state-of-the-art reasoning performance in math, code, and science while delivering ultra-fast inference speeds of over 2,000 tokens per second.
Robust Autonomy Emerges from Self-Play
Marco Cusumano-Towner, David Hafner, Alex Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor W. Killian, Stuart Bowers, Ozan Sener, Philipp Krahenbuhl, Vladlen Koltun
ICML 2025
We developed a robust autonomous driving agent entirely in simulation via self-play at massive scale. The simulator was designed to run in massively parallel settings, where we could aggressively randomize each agent's physical and behavioral characteristics and generate substantial amounts of experience.

2024

Clinically Motivated Sequential Decision Making Under Uncertainty in Offline Settings
Taylor W. Killian
PhD Thesis, University of Toronto, Department of Computer Science
To develop practical machine-learning-aided technology for the benefit of human users, it is critical to anchor scientific research and development in the intended real-world use cases. In this thesis, I propose specific modeling decisions that can be made to develop actionable insights from sequentially observed healthcare data.

2023

Continuous Time Evidential Distributions for Irregular Time Series
Taylor W. Killian, Haoran Zhang, Thomas Hartvigsen, Ava Amini
Interpretable Machine Learning in Healthcare Workshop, ICML 2023
We extend recent evidential deep learning approaches to sequential settings in continuous time to handle irregularly sampled time series such as those encountered in healthcare. This method provides stable, temporally correlated predictions and corresponding well-calibrated uncertainty estimates based on the evidence gained with each collected observation.
Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning
Taylor W. Killian, Sonali Parbhoo, Marzyeh Ghassemi
Transactions on Machine Learning Research (TMLR)
We improve upon our prior dead-ends work by taking a risk-sensitive approach to dead-end discovery, leveraging distributional RL for value estimation. This allows for earlier indication of dead-ends in a manner that is tunable based on the risk tolerance of the designed task.

2022

Continuous Time Evidential Distributions for Processing Irregular Time Series
Taylor W. Killian, Ava Amini
Learning from Time Series for Health Workshop at NeurIPS
Identifying Disparities in Sepsis Treatment using Inverse Reinforcement Learning
Hyewon Jeong, Taylor W. Killian, Sanjat Kanjilal, Siddharth Nayak, Marzyeh Ghassemi
WiML: Women in Machine Learning and RL4RealLife workshops at NeurIPS
Counterfactually Guided Policy Transfer in Clinical Settings
Taylor W. Killian, Marzyeh Ghassemi, Shalmali Joshi
Conference on Health, Inference and Learning (CHIL), 2022

2021

Medical Dead-ends and Learning to Identify High-Risk States and Treatments
Mehdi Fatemi, Taylor W. Killian, Jayakumar Subramanian, Marzyeh Ghassemi
Neural Information Processing Systems, 2021
In data-constrained offline settings, optimal sequential decision policies may not be attainable. However, negative outcomes in the data can be used to identify behaviors to avoid, thereby guarding against overoptimistic decisions in safety-critical domains that may be significantly biased due to reduced data availability.

2020

An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare
Taylor W. Killian, Haoran Zhang, Jayakumar Subramanian, Mehdi Fatemi, Marzyeh Ghassemi
ML4H: Machine Learning for Health Workshop at NeurIPS
Multiple Sclerosis Severity Classification From Clinical Text
Alister D'Costa, Stefan Denkovski, Michal Malyska, Sae Young Moon, Brandon Rufino, Zhen Yang, Taylor W. Killian, Marzyeh Ghassemi
The 3rd Clinical Natural Language Processing Workshop
Counterfactual Transfer via Inductive Bias in Clinical Settings
Taylor W. Killian, Marzyeh Ghassemi, Shalmali Joshi
Inductive Biases, Invariances and Generalization in RL (BIG) ICML Workshop
Optimization Methods for Interpretable Differentiable Decision Trees Applied to Reinforcement Learning
Andrew Silva, Taylor W. Killian, Ivan Rodriguez Jimenez, Sung-Hyun Son, Matthew Gombolay
The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS)

2019

Kernelized Capsule Networks
Taylor W. Killian, Justin Goodwin, Olivia Brown, Sung-Hyun Son
1st Workshop on Understanding and Improving Generalization in Deep Learning, ICML

2018

Direct Policy Transfer with Hidden Parameter Markov Decision Processes
Jiayu Yao, Taylor W. Killian, George Konidaris, Finale Doshi-Velez
Lifelong Learning: A Reinforcement Learning Approach Workshop at FAIM 2018

2017

Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes
Taylor W. Killian, Samuel Daulton, George Konidaris, Finale Doshi-Velez
Neural Information Processing Systems, pp. 6245-6250, 2017
Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes
Taylor W. Killian, George Konidaris, Finale Doshi-Velez
AAAI, pp. 4949-4950, 2017

2012

Rebound and jet formation of a fluid-filled sphere
Taylor W. Killian, Robert A. Klaus, Tadd T. Truscott
Physics of Fluids, 24, 122106, 2012