Publications

2026

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

Tianrun Yu, Kaixiang Zhao, Chih-Chun Chen, Amanda Hughes, Taylor W. Killian, Fenglong Ma, Weitong Zhang, Porter Jenkins

arXiv pre-print

LARK introduces a principled approach to distillation that selects teacher-generated reasoning trajectories according to what the student can actually learn from them, rather than relying on heuristics such as quality or confidence. By characterizing a trajectory's learnability by how quickly it reduces training loss, and balancing this against distributional coverage, we achieve consistent improvements in reasoning distillation across multiple models and tasks.

Paper Code

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

arXiv pre-print

We present SR²AM, which decomposes AI agent decision-making into simulative reasoning (planning via world model predictions), self-regulation (deciding when to plan), and reactive execution (handling actions). By planning only when necessary rather than always relying on extended reasoning chains, our approach achieves performance competitive with much larger models while using 25.8-95.3% fewer reasoning tokens.

Paper Code

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

Christopher Z. Cui, Taylor W. Killian, Prithviraj Ammanabrolu

arXiv pre-print

We introduce Behavior Cue Reasoning, which trains LLMs to emit special token sequences before specific behaviors, making their reasoning more transparent and controllable. This approach enables weaker monitors to prune up to 50% of wasted reasoning tokens and, in safety-constrained environments, recovers safe actions from 80% of traces that would otherwise fail, boosting success rates from 46% to 96% with no performance cost.

Paper Code

From Reasoning Traces to Reusable Modules: Reinforcement Learning for Compositional Generalization in Language Model Reasoning

Lingjing Kong, Xin Liu, Guangyi Chen, Martin Q. Ma, Xiangchen Song, Yuekai Sun, Mikhail Yurochkin, Taylor W. Killian, Ruslan Salakhutdinov, Kun Zhang, Eric P. Xing, Zhengzhong Liu

ICML 2026

We demonstrate that RL enables LLMs to achieve compositional generalization by extracting and recombining reusable atomic modules from reasoning traces. Our theoretical framework shows RL's exploratory nature provides the coverage needed to identify latent structure, while experiments confirm that training on compound traces yields stronger generalization than isolated modules alone.

Paper

IsoCompute Playbook: Optimally Scaling Sampling Compute for LLM RL

Zhoujun Cheng, Yutao Xie, Yuxiao Qu, Amrith Setlur, Shibo Hao, Varad Pimpalkhute, Tongtong Liang, Feng Yao, Zhengzhong Liu, Eric Xing, Virginia Smith, Ruslan Salakhutdinov, Zhiting Hu, Taylor W. Killian, Aviral Kumar

ICML 2026

We establish scaling laws for LLM reinforcement learning by identifying how to optimally allocate compute across parallel rollouts, problem batch size, and update steps. Our analysis reveals that increasing parallel rollouts per problem is the primary driver of performance, improving solution quality on easy tasks and coverage on hard ones, and it yields practical rules for compute-efficient post-training.

Website Paper

Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization

Matt Landers, Taylor W. Killian, Tom Hartvigsen, Afsaneh Doryab

ICLR 2026

SPIN is a two-stage framework that optimizes reinforcement learning in complex combinatorial spaces by first pre-training an Action Structure Model to learn valid action patterns and then training lightweight heads for control. This approach significantly improves performance and stability, outperforming current methods by up to 39% in rewards while achieving convergence up to $12.8\times$ faster.

Paper Code

2025

Concise Reasoning in the Lens of Lagrangian Optimization

Chengqian Gao, Haonan Li, Taylor W. Killian, Jianshu She, Renxi Wang, Liqun Ma, Zhoujun Cheng, Shibo Hao, Zhiqiang Xu

arXiv pre-print

PALU is a pragmatic optimization strategy that streamlines LLM reasoning by treating concision as a mathematical trade-off between output length and accuracy. It successfully reduces response length by 65% while boosting performance across various domains and model scales, proving that shorter, more focused reasoning chains can actually be more effective.

Paper

K2-V2: A 360-Open, Reasoning-Enhanced LLM

K2 Team, Institute of Foundation Models

MBZUAI IFM Technical Report

K2-V2 is a fully open-source ("360-open") LLM designed as a high-performance foundation for complex reasoning, knowledge retrieval, and tool use. By releasing its complete training history and data, the model provides a transparent, "reasoning-centric" base that rivals leading open-weight models like Qwen2.5-72B and approaches the performance of much larger systems.

Website Paper Code Data Models

SAINT: Attention-Based Policies for Discrete Combinatorial Action Spaces

Matt Landers, Taylor W. Killian, Tom Hartvigsen, Afsaneh Doryab

arXiv pre-print

SAINT is a novel policy architecture that uses Transformers to model combinatorial action spaces as unordered sets, capturing complex sub-action dependencies via self-attention. This permutation-invariant approach significantly outperforms traditional baselines in environments with up to $1.35 \times 10^{18}$ possible actions by improving sample efficiency and joint behavior modeling.

Paper Code

BraVE: Offline Reinforcement Learning for Discrete Combinatorial Action Spaces

Matt Landers, Taylor W. Killian, Hugo Barnes, Tom Hartvigsen, Afsaneh Doryab

NeurIPS 2025

BraVE addresses the computational challenges of high-dimensional, discrete action spaces in offline RL by using tree-structured traversal to capture sub-action dependencies efficiently. This approach evaluates only a linear number of joint actions, outperforming existing methods by up to 20x in complex environments.

Website Paper Code

Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective

Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu

NeurIPS 2025

Reinforcement learning has emerged as a promising approach to improve large language model reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. We introduce Guru, a curated RL reasoning corpus spanning six reasoning domains.

Website Paper Code Data Models

K2-Think: A Parameter-Efficient Reasoning System

Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing

MBZUAI IFM Technical Report

K2-Think is a parameter-efficient 32B model that rivals much larger systems by combining long chain-of-thought training with advanced test-time computation techniques. It achieves state-of-the-art reasoning performance in math, code, and science while delivering ultra-fast inference speeds of over 2,000 tokens per second.

Website Paper Code Models

Robust Autonomy Emerges from Self-Play

Marco Cusumano-Towner, David Hafner, Alex Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor W. Killian, Stuart Bowers, Ozan Sener, Philipp Krahenbuhl, Vladlen Koltun

ICML 2025

We developed a robust autonomous driving agent entirely in simulation, via self-play at massive scale. The simulator was designed to run in massively parallel settings, where we could aggressively randomize each agent's physical and behavioral characteristics and generate substantial amounts of experience.

Paper

2024

Clinically Motivated Sequential Decision Making Under Uncertainty in Offline Settings

Taylor W. Killian

PhD Thesis, University of Toronto, Department of Computer Science

To develop practical machine-learning-aided technology for the benefit of human users, it is critical to anchor scientific research and development in the intended real-world use cases. In this thesis, I propose specific modeling decisions that can be made to develop actionable insights from sequentially observed healthcare data.

Thesis

2023

Continuous Time Evidential Distributions for Irregular Time Series

Taylor W. Killian, Haoran Zhang, Thomas Hartvigsen, Ava Amini

Interpretable Machine Learning in Healthcare Workshop, ICML 2023

We extend recent evidential deep learning approaches to sequential settings in continuous time to handle irregularly sampled time series such as those encountered in healthcare. This method provides stable, temporally correlated predictions and corresponding well-calibrated uncertainty estimates based on the evidence gained with each collected observation.

Paper Code

Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning

Taylor W. Killian, Sonali Parbhoo, Marzyeh Ghassemi

Transactions on Machine Learning Research (TMLR)

We improve upon our prior dead-ends work by taking a risk-sensitive approach to dead-end discovery, leveraging distributional RL for value estimation. This allows for earlier indication of dead-ends in a manner that is tunable based on the risk tolerance of the designed task.

Paper Forum Code

2022

Continuous Time Evidential Distributions for Processing Irregular Time Series

Taylor W. Killian, Ava Amini

Learning from Time Series for Health Workshop at NeurIPS

Paper

Identifying Disparities in Sepsis Treatment using Inverse Reinforcement Learning

Hyewon Jeong, Taylor W. Killian, Sanjat Kanjilal, Siddharth Nayak, Marzyeh Ghassemi

WiML: Women in Machine Learning and RL4RealLife workshops at NeurIPS

Paper

Counterfactually Guided Policy Transfer in Clinical Settings

Taylor W. Killian, Marzyeh Ghassemi, Shalmali Joshi

Conference on Health, Inference and Learning (CHIL), 2022

Paper Poster

2021

Medical Dead-ends and Learning to Identify High-Risk States and Treatments

Mehdi Fatemi, Taylor W. Killian, Jayakumar Subramanian, Marzyeh Ghassemi

Neural Information Processing Systems, 2021

In data-constrained offline settings optimal sequential decision policies may not be attainable. However, negative outcomes in data can be used to identify behaviors to avoid, thereby guarding against overoptimistic decisions in safety-critical domains that may be significantly biased due to reduced data availability.

Paper Poster Code MSR Blog

2020

An Empirical Study of Representation Learning for Reinforcement Learning in Healthcare

Taylor W. Killian, Haoran Zhang, Jayakumar Subramanian, Mehdi Fatemi, Marzyeh Ghassemi

ML4H: Machine Learning for Health Workshop at NeurIPS

Paper Poster Code

Multiple Sclerosis Severity Classification From Clinical Text

Alister D'Costa, Stefan Denkovski, Michal Malyska, Sae Young Moon, Brandon Rufino, Zhen Yang, Taylor W. Killian, Marzyeh Ghassemi

The 3rd Clinical Natural Language Processing Workshop

Paper Model

Counterfactual Transfer via Inductive Bias in Clinical Settings

Taylor W. Killian, Marzyeh Ghassemi, Shalmali Joshi

Inductive Biases, Invariances and Generalization in RL (BIG) ICML Workshop

Paper

Optimization Methods for Interpretable Differentiable Decision Trees Applied to Reinforcement Learning

Andrew Silva, Taylor W. Killian, Ivan Rodriguez Jimenez, Sung-Hyun Son, Matthew Gombolay

The 23rd International Conference on Artificial Intelligence and Statistics (AISTATS)

Paper

2019

Kernelized Capsule Networks

Taylor W. Killian, Justin Goodwin, Olivia Brown, Sung-Hyun Son

1st Workshop on Understanding and Improving Generalization in Deep Learning, ICML

Paper Poster

2018

Direct Policy Transfer with Hidden Parameter Markov Decision Processes

Jiayu Yao, Taylor W. Killian, George Konidaris, Finale Doshi-Velez

Lifelong Learning: A Reinforcement Learning Approach Workshop at FAIM 2018

Paper Poster Slides

2017

Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes

Taylor W. Killian, Samuel Daulton, George Konidaris, Finale Doshi-Velez

Neural Information Processing Systems, pp. 6245-6250, 2017

Paper Poster Code Slides Video

Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes

Taylor W. Killian, George Konidaris, Finale Doshi-Velez

AAAI, pp. 4949-4950, 2017

Paper Poster Slides

2012

Rebound and jet formation of a fluid-filled sphere

Taylor W. Killian, Robert A. Klaus, Tadd T. Truscott

Physics of Fluids, 24, 122106, 2012

Paper Slides Video