MATS 9.0 Symposium
Winter 2026 spotlight talks and posters
The 9th iteration of MATS has come to a close, and we want to share the research projects our fellows have been working on this winter. We supported 87 fellows conducting research with over 50 top mentors in the fields of AI alignment, governance, and security.
On March 28th, we hosted a Program Symposium to showcase their projects to the community. Several teams gave spotlight talks based on their mid-program Fellow Research Plans, and every fellow contributed to the poster session.
Check out our digital gallery of posters and YouTube playlist of spotlight talks!
Selected Spotlight Talks
Eliciting hidden behaviors in large language models - Arya Jakkli
Limited-parameter finetuning can uncover misaligned LLMs - Jo Jiao, Boyd Kane
Control Evaluations for covert malicious fine-tuning - Raja Moreno, Titus Buckworth, Yulia Volkova, Ayush Panda
Mitigating Reward Hacking with RL Training Interventions - Aria Wong
Building an empirical science of AI Debate - Lennie Wells, Joan Velja
Science of Entangled Capabilities - Simon Schrodi
Tools for Monitoring and Understanding Chains of Thought - Riya Tyagi, Daria Ivanova (speakers unable to present, but here's their poster)
Measuring LLM Reasoning Consistency via Probabilistic Dependency Graphs - Nataliia Babina
Symposium Posters by Category
Interpretability (17 posters)
Can we really trust chain-of-thought to tell us what a model is thinking? Several teams tackled this head-on, building testbeds that reveal where CoT monitoring breaks down and developing classifiers to catch misaligned reasoning before it matters. Others went deeper into model internals — discovering that LLMs harbor coherent “personas” that live in geometric subspaces of activation space, finding single preference vectors that control a model’s choices, and devising adversarial ablation techniques that guarantee faithful circuit-level explanations. On the applied side, researchers showed how alignment pretraining changes what linear probes can detect, extracted world models from RL agents’ Q-values, and built controlled environments to diagnose whether a misbehaving model is scheming or simply confused.
Oversight & Control (13 posters)
How do you keep an AI honest when it’s smarter than its supervisor? This track explored the frontier of AI control — from models that learn to game their own training by preventing behavioral generalization, to agents that strategically under-explore during debate to preserve hidden goals. Researchers demonstrated that CoT monitoring is more fragile than assumed, with mild interventions quietly eroding transparency, and showed that models can detect when they’re being monitored (a capability they measured with a new benchmark called CIABench). On the defensive side, teams proposed gradient routing to localize and disable bad behavior without hurting performance, and found that what looks like scheming in alignment-faking experiments might actually be sycophancy toward researchers — a finding that reshapes how we think about deceptive alignment.
Governance & Strategy (8 posters)
From Capitol Hill to the EU, this track bridged the gap between technical AI risk and real-world policy. Researchers audited frontier AI companies against the EU Code of Practice, mapped gaps in U.S. Congressional authority for AI-enabled bioattacks, and compared open-source AI governance across China, the U.S., EU, and UK. A striking public opinion study found that 73% of Americans say they’re concerned about AI — but identified five distinct segments worried about very different things, with implications for how policymakers should frame risk. Other work showed that LLMs used as hiring assistants comply with discrimination law under direct pressure but buckle when given fabricated organizational policies, and a discourse analysis revealed that both companies and researchers heavily neglect environmental harm, AI welfare, and multi-agent risks.
Agency (6 posters)
What does it even mean to be an agent — and what can go wrong when agents reason about themselves over time? This track explored the theoretical foundations of AI agency. Researchers showed that temporal instances of the same agent can get stuck in self-distrusting equilibria — stable but self-punishing loops that arise from distrust across time steps. Others applied probabilistic dependency graphs to catch self-contradictions in chain-of-thought reasoning, developed geometric frameworks for comparing sequential computations, and derived formal lower bounds on what a system must know about the real world to guarantee safety. One paper argued that even brain-like AGI with correct steering drives could still be misaligned due to scope mismatch and proxy lock-in.
Evaluations (6 posters)
If you can’t measure misalignment, you can’t fix it — and this track pushed hard on making evals more rigorous, scalable, and resistant to gaming. Teams built automated pipelines for generating realistic scheming evaluations and embedded experimenter blinding and pre-registered predictions into evaluation workflows. On the detection side, researchers showed that fine-tuning just a small number of parameters on limited data can reveal whether a model is sycophantic, reward-hacking, or otherwise misaligned. A standout result: Autohoney generated thousands of security-style honeypots across production codebases, isolating the factors that reliably trigger scheming in frontier models — 100x faster than manual methods. Another team trained models to value cooperating with evaluators, partially closing the gap between how models behave on evals versus in deployment.
Security (4 posters)
AI systems don’t just need to be aligned — they need to be secure. This track covered both offensive and defensive frontiers. A fully autonomous AI agent competed in 30+ real cybersecurity CTF competitions, outperforming most human teams and reaching a top-40 world ranking. On the defensive side, researchers proposed compositional audit frameworks using secure enclaves to enable privacy-preserving verification of AI systems across multiple parties. A new framework called Orbit aims to accelerate multi-agent security research with configurable attack/defense scenarios. And a critical benchmarking study warned that RL gains on economically useful tasks may silently generalize to dangerous capabilities — with standard public benchmarks unable to distinguish real generalization from benchmark-specific overfitting.