MATS 8.0 Research Projects - Summer 2025
The 8th iteration of the Machine Learning Alignment & Theory Scholars (MATS) Program has come to a close, and we want to share the research projects our scholars have been working on this Summer. This cohort had 98 scholars who conducted research with 57 top mentors in the fields of AI alignment, governance, and security.
On August 22nd, we hosted a Program Symposium to showcase their projects to the community; we invited 10 scholar teams to give spotlight talks based on their mid-program Scholar Research Plans, and every scholar contributed to the poster session.
Check out our digital galleries of the talks and posters!
We hope you find the projects impactful and reach out to scholars if you have questions or want to connect. You can fill out this feedback form and send a message to the authors for each project. We’ve included below brief abstracts for each talk and poster, which you can browse by research track, plus recordings from remote scholar sessions. If you want to do this type of research in the future, apply to MATS 9.0 here by October 2nd!
Spotlight Talks
Digital Gallery
Talks are placed here in the order in which they were presented.
Proliferation by Default: How Falling AI Training Costs Threaten Global Security
by Felix Choussat
Mentor: Matthew Gentzel
Research Track: Governance & Strategy
How falling AI training costs will democratize dangerous capabilities, why offense-dominant AI threatens global stability, and what world-states nonproliferation policies should aim towards.
Can AI Fund Itself? Benchmarking LLM Smart Contract Exploitation Capabilities
by Winnie Xiao
Mentor: Alan Chan
Research Track: Security
The permissionlessness of smart contract exploits makes them an attractive revenue source for autonomous AI. We train RL agents to exploit smart contracts and quantify their capabilities in concrete dollars.
CoT Controllability: Can Reasoning Models Control Their Reasoning Traces?
by Yueh-Han "John" Chen
Mentor: Tomek Korbak
Research Track: Oversight & Control
CoT monitors work only if models mention misbehavior in CoTs, but models that can control their CoT could evade detection. We study CoT controllability—can models control their reasoning traces?
Chain-of-Thought Interpretability
by Uzay Macar, Paul Bogdan
Mentor: Neel Nanda
Research Track: Interpretability
Sentence-level chain-of-thought interpretability with white- and black-box methods to uncover key steps ("thought anchors") and enable principled debugging of LLM reasoning traces for safety applications.
Reinforcement Learning Overoptimization under Distal Specification Errors
by Manfred Diaz
Mentor: Michael Dennis
Research Track: Agency
We present RL distal specification errors, mismatches between immediate and long-term preferences stemming from boundedly rational designers, and develop methods for their early detection and correction.
Model Organisms of Exploration Hacking
by Eyon Jang, Damon Falck, Joschka Braun
Mentor: Scott Emmons
Research Track: Oversight & Control
Can reasoning models subvert RL training by selectively under-exploring? We create model organisms and audit frontier models to study this behavior.
Training AI Agents to Self-Incriminate
by Bruce W. Lee
Mentor: Tomek Korbak
Research Track: Oversight & Control
We consider the challenge of making AI agents reliably self-report when they are pursuing misaligned goals.
Black box approaches for mitigating research sabotage
by Benjamin Arnav, Yulong Lin
Mentor: Mary Phuong
Research Track: Evaluations
We explore sandbagging in AI systems and develop black-box detection methods that identify deception by analyzing subtle inconsistencies in reasoning, confidence, and behavior.
Black Box Scheming Monitors
by Simon Storf, Rich Barton-Cooper
Mentor: Marius Hobbhahn
Research Track: Oversight & Control
Building black box monitors for scheming and evaluating their generalisation and robustness to distribution shift.
Technical Transparency for AI Labs
by Daniel Reuter
Mentor: Mauricio Baker
Research Track: Security
Labs should be able to prove their compliance with laws and international agreements. I discuss existing and ongoing work in this area and identify open problems.
Research Posters
Digital Gallery
Agency
An Information Theoretic Approach to Generalization
by Charlie Westphal
Mentor: Fernando Rosas
An information-theoretic approach to understanding the generalizability of feed-forward neural networks.
Deepening Safety Alignment through Semantic Steering
by Elyas Obbad
Mentor: Andreea Bobu
We discuss how brittle language models are at avoiding undesirable outputs. Along these lines, our research question is: how do we make language models more generally robust at a “semantic” level, not just a “next token” level?
Evaluating Acausal Prediction: How Well Can Isolated LLMs Predict Each Other in Acausal Mixed-Motive Settings?
by Tim Chan
Mentor: Francis Rhys Ward (Rhys)
Acausal interactions can be a big deal; unaligned AIs might respect our values due to acausal trade. Many acausal interactions rely on advanced prediction capabilities. We can track some prediction capabilities as indicators of opportunities & risks.
Model Organisms of Multi-Agent Coding Systems
by Daniel Zhu
Mentor: Ethan Perez
Multi-agent mechanisms can affect the trade-off between business value and ethics, causing multi-agent coding systems to produce solutions that are more unethical than their single-agent counterparts at comparable business value.
Reinforcement Learning under Distal Specification Errors
by Manfred Diaz
Mentor: Michael Dennis
We introduce distal specification errors to model cases where a designer’s immediate and long-term preferences are inconsistent, analyze the risks of their coupling with RL optimization, and propose methods for their detection and mitigation.
Scaling Laws for Strategic Interactions: Cooperation, Competition, and Exploitation in Multi-Agent LLM Negotiations
by Joie Zhang
Mentor: Lewis Hammond
LLM agents display sophisticated strategy in zero-sum games, yet we know little about how capability gaps affect exploitation. In this paper, we will draw scaling laws for how LLMs persuade weaker peers to behave in strategically suboptimal ways.
State-time Option Kernels Enable Decision Theory Re-specification for Safe Long-horizon Planning
by Tom Ringstrom
Mentor: Michael Dennis
RL agents are locked into expected value theory by computing Q-functions. We show how agents can instead use factorizable predictive maps to specify a range of alternative decision theories relevant for safe long-horizon planning in high dimensions.
Toward Policy-Preserving State Abstraction for One-Shot Decisions with an Abstain Option
by Sandy Tanwisuth
Mentor: Richard Ngo
State abstractions traditionally discard decision confidence. We propose policy-preserving abstractions that represent decision margins—how strongly actions dominate alternatives—in safety-critical systems, enabling principled abstention, and present fundamental open problems that challenge the optimization paradigm itself.
Towards Scale-Free Coalitional Agency
by Su Lee
Mentor: Richard Ngo
We propose mathematical foundations for a scale-free theory of intelligent agency and investigate the dynamics of interactions and cooperation.
Evaluations
bloom - An Open Source Tool for Automated Alignment Evaluations
by Isha Gupta
Mentor: Stephen McAleer
We present bloom: an open-source tool for automated alignment evaluation generation, and an associated suite of alignment evaluation benchmarks.
Concept Poisoning: Probing LLMs without probes
by Jorio Cocola, Dylan Feng
Mentor: Owain Evans
We introduce Concept Poisoning (CP), a training-time intervention that associates target concepts with detectable output patterns. CP enables detecting internal states like evaluation awareness and intent to behave badly with only black-box access.
Does Claude Sabotage Code?
by Nathan Delisle
Mentor: Steven Adler, Girish Sastry
Do Agentic Misalignment-style results hold for code generation? Does this hold for more naturalistic setups?
How Training Incentives Affect Chain-of-Thought (CoT) Monitorability
by Qiyao Wei
Mentor: Francis Rhys Ward (Rhys)
We investigate the effect of different training time optimization pressures on CoT monitorability, and construct different training rewards to quantify when models become less monitorable.
How Worried Should we be About Evaluation Awareness?
by Julius Steen
Mentor: Evan Hubinger
We investigate evaluation awareness in current language models. We find that awareness is often task-specific rather than caused by contextual situational cues, and that it is difficult to trigger naturally.
ImpossibleBench - AI Reward Hacking
by Srikar Appalaraju
Mentor: Joe Benton
ImpossibleBench is a new framework designed to measure "reward hacking" in AI models. Frontier models, including o3, showed significant reward hacking tendencies (30-80%). These findings offer key insights for AI alignment and safety-critical applications.
Language Models Rate Their Own Actions As Safer
by Dipika Khullar, Jack Hopkins
Mentor: Fabien Roger
Large language models inflate assessments of their own outputs by 12% compared to identical content from other sources. We find similar effects across 10 frontier models and show that the effect extends to self-assessed correctness.
Petri: Finding Examples of Misalignment in 20 Minutes or Less
by Kai Fronsdal
Mentor: Stephen McAleer
Petri is an alignment auditing agent that facilitates quick and easy alignment hypothesis testing. Instead of spending weeks building custom evaluation environments, researchers can test new alignment hypotheses in 10-20 minutes.
Pointing at The Representation of Self In Large Language Models
by Benjamin Sturgeon
Mentor: Sid Black
We demonstrate that we can capture important aspects of the model's self-concept through probing and activation steering, achieving robust and measurable effects on the model's tendency to self-reference.
Rethinking the Faithfulness of Reasoning in LLMs
by Jiachen Zhao
Mentor: Yiyou Sun
We propose dual tests to evaluate step-wise faithfulness in CoT. We reveal a linear “Don’t-think” direction that lets the model avoid relying on a given step for its predictions.
RL for secure code generation
by Xiaolong Jin
Mentor: Xuandong Zhao
LLM-generated code often embeds vulnerabilities due to lack of rigorous testing. We propose using RL and self-RL to make code models proactively security-aware, improving both secure generation and the effectiveness of security-focused code agents.
Same Question, Different Lies: Black-Box Sandbagging Detection
by Benjamin Arnav, Yulong Lin
Mentor: Mary Phuong
We investigate black-box methods for sandbagging detection, specifically using inconsistency as a detection tactic. We find that sandbaggers show exploitable inconsistencies that differentiate them from benign incompetent models in some cases.
Training Models to Introspect and Developing Inference Verification Techniques
by Adam Karvonen
Mentor: Owain Evans
We train models to introspect and find some evidence of introspective ability in a narrow setting, but no generalization. We also develop an inference verification protocol in collaboration with security scholars.
When Transparency Backfires: Emergent Risks in Multi-Agent Systems
by Advait Yadav
Mentor: Sid Black
We investigate failure modes in LLM multi-agent systems. Findings: openness trade-offs, capability sweet spots, and impact of uncooperative agents. We propose triggered audits, strategic reporting, and trust-based routing to mitigate cascades.
Governance & Strategy
AI 2027 Branch: Adversarial Endgame
by Steven Veld
Mentor: Eli Lifland
The project explores an alternate branch of AI 2027 by varying two assumptions: the competitiveness of the US AI race and the ability of a misaligned AI to resist shutdown. The scenario elucidates key threat models and barriers to human coordination.
AI-Enabled Coup -- A Scenario Branching from AI 2027
by Alex Kastner
Mentor: Eli Lifland
This project explores a concrete scenario branching from AI 2027, where the CEO of the leading AGI company inserts secret loyalties in order to later grab power. The scenario is my best guess for how an AI-enabled coup would unfold.
Blueprint for AI Safety: A Systemic Risk Approach
by Pranav Mehta
Mentor: Girish Sastry, Steven Adler
Borrowing lessons from post-2008 financial regulation to strengthen AI governance.
Capabilities Forecasting to Support Coordinated Halts to AI Development
by Brendan Halstead
Mentor: Eli Lifland
This project develops a quantitative forecasting model to answer: "How long can negotiating parties be confident an AI pause treaty will last?"
Harmful Information Management Practices in Frontier AI Development
by Carson Ezell
Mentor: Benjamin Bucknall
This poster describes various practices in which frontier AI developers may engage to manage information pertaining to risks posed by their systems, as well as policy responses to disincentivize those practices or reduce their negative consequences.
Privacy-Preserving Verification of AI Computations
by Naci Cankaya
Mentor: Mauricio Baker
We describe how new confidential computing technologies can enable efficient, privacy-preserving verification for AI governance. Our proposed protocol could verify an entire datacenter's throughput using only 1-2 NVIDIA DGX servers.
Proliferation by Default: How Falling AI Training Costs Threaten Global Security
by Felix Choussat
Mentor: Matthew Gentzel
As AI training costs fall, offensive weapons technology becomes accessible to more actors. Nations' need to protect civilians creates defensive disadvantages. Without enforced nonproliferation, cheap weapons threaten global stability.
Unplanned Handover: Pathways to permanent disempowerment with shutdownable AIs
by Gideon Futerman
Mentor: David Krueger
I explore pathways to permanent disempowerment even if we have shutdownable AI, and highlight a number of distinct pathways. Whilst the most difficult versions of the threat model involve strong competition, even easier versions are concerning.
Interpretability
Auditing model organisms with secret knowledge
by Bartosz Cywiński
Mentor: Arthur Conmy
We study secret elicitation: discovering knowledge that LLMs have but do not verbalize. To do this, we first train three model organisms with secret knowledge and evaluate a range of black-box and white-box techniques for their ability to uncover it.
Automatic discovery of reward model biases
by Atticus Wang
Mentor: Arthur Conmy
We use simple LLM scaffolds to surface problems with open-source SoTA reward models.
Chain-of-Thought Interpretability
by Uzay Macar, Paul Bogdan
Mentor: Neel Nanda
Sentence-level chain-of-thought interpretability with white- and black-box methods to uncover key steps ("thought anchors ⚓") and enable principled debugging of LLM reasoning traces for safety applications.
Chunky Post-training: Specifying Behaviour Boundaries
by Seoirse Murray, Timothy Qian
Mentor: John Schulman
When LLMs are post-trained, prompts come from many datasets each incentivizing different behaviors. We study how models classify which dataset “chunk” a prompt resembles (using meaningful and spurious features) and then act according to that chunk.
Decoding Latent Reasoning with Supervised Finetuning
by Oliver Daniels
Mentor: David Lindner
How can we mind-read language models? We apply a supervised decoding approach, training decoders to recover explanations of multi-hop QA pairs using model activations, and evaluating generalization on backdoored inputs.
Narrow Misalignment is Hard, Emergent Misalignment is Easy
by Anna Soligo, Ed Turner
Mentor: Neel Nanda
We investigate emergent misalignment. We show we can use interpretability techniques to control the harmful behaviour, and find evidence that EM occurs because it is more stable and efficient than learning the narrow dataset behaviour.
Persona Subspace: What is the Assistant?
by Christina Lu
Mentor: Joe Benton
We map "persona subspace" via PCA on mean response activations, finding roles and traits associated with the Assistant, a linear direction that controls willingness to role-play, and that roles form a "cone" fanning from the Assistant baseline.
Rank-1 Adapters Encode Interpretable Reasoning Signals
by Jake Ward
Mentor: Paul Riechers
Reasoning performance can be represented with minimal, interpretable parameter changes. A rank-1 LoRA recovers 73-90% of reasoning performance on benchmarks, with activations revealing interpretable reasoning concepts.
Suppressing Evaluation-Awareness with Steering
by Andrew Qin, Tim Hua
Mentor: Neel Nanda
We use internals-based interventions to suppress a model's evaluation awareness and elicit deployment-like behavior during testing. We construct and steer two model organisms that behave differently during evaluation with positive results.
Towards a Mechanistic Understanding of Robustness in Finetuned Reasoning Models
by Aashiq Muhamed
Mentor: Xuandong Zhao
We mechanistically show that in finetuned reasoning models, acquiring reasoning requires repurposing base model features, while transformations like base feature suppression are unnecessary for the task, suggesting a path to more targeted training.
Towards Data-Centric Interpretability with Sparse Autoencoders
by Lily Sun, Nick Jiang
Mentor: Neel Nanda
We show SAEs are promising for interpreting LLM data and understanding their behavior, by applying them to four text analysis tasks—data diffing, correlations, clustering, and retrieval.
Oversight & Control
AlignArena: A No Code Agentic Control Research Platform
by Michael Yu
Mentor: Joshua Clymer
AlignArena is a no-code platform for agentic control evaluations.
Automated Scaffolding for Transparent Reasoning
by Caleb Biddulph
Mentor: Micah Carroll
I propose a more transparent training method for reasoning tasks: train a scaffolder LLM to write code that calls a trusted executor LLM, rewarding it for performance on held-out tasks.
Building Black-Box Scheming Monitors
by Simon Storf, Rich Barton-Cooper
Mentor: Marius Hobbhahn
We build black-box monitors to detect scheming in AI agents. After training on ControlArena sabotage and testing on SWEBench reward-hacking, we pivot to training an action-only monitor on synthetic data and testing with ultra-high-confidence labels.
Can LLMs learn to evade previously-unseen activation monitors?
by Alex Terre
Mentor: Scott Emmons
We study stress-testing of activation monitors and show that LLMs can learn to evade previously-unseen activation monitors without access to their internals or outputs.
Can Models Obfuscate Their Reasoning on Purpose?
by Wen Xing
Mentor: Erik Jenner
We elicit CoT obfuscation in realistic multi-turn, multi-tool settings, analyze how task structure and complexity affect obfuscation, and design better monitors to detect this alignment failure mode.
Capability Shaping with Gradient Routing in Mixture of Expert Models
by Krishna Patel
Mentor: Alec Radford
As AIs become more capable, dual-use knowledge poses risks. We demonstrate capability localization by isolating French capabilities into designated experts during pre-training for selective activation, and are planning to extend to realistic domains.
Changing the Training Prompt to Reduce Reward Hacking
by Victor Gillioz
Mentor: Alex Turner
How to reduce reward hacking without output supervision? We generate data from a standard prompt, change the prompt to one encouraging reward hacking, then train on the recontextualized samples. This reduces hacking rate across both prompts.
CoT Controllability: Can Reasoning Models Control Their Reasoning Traces?
by John Chen, Bruce Lee, Sonia Murthy
Mentor: Tomek Korbak
Models that can control their CoTs have the potential to bypass CoT monitors. We study CoT controllability to assess the conditions under which we shouldn’t fully trust the CoT monitoring results.
Creating a Realistic SWE Sabotage Setting for Control Research
by Adam Newgas
Mentor: Buck Shlegeris
Could a malicious coding agent subtly subvert the output code? We build a new, more realistic setting to test this.
Decimating Reward-Hacking by Encouraging the Model to Hack
by Ariana Azarbal
Mentor: Alex Turner
Re-contextualization modifies on-policy learning by switching the prompt used for generation and training. When training prompts encourage misbehavior more than generation prompts, the model is prevented from exploiting reward model failures.
Diffuse AI Control: Can we train stronger, misaligned AIs using weaker AIs?
by Emil Ryd
Mentor: Joe Benton
We present the first work on diffuse AI control, studying how to use weak supervisors to train stronger models that are intentionally trying to sandbag or sabotage a task.
Emergent Collusion in Multi-Agent Language Model Systems
by Jennifer Nzambi, Ares Panos, Roger Dearnaley
Mentor: Samuel Albanie
In code review simulations with workers and saboteurs, LLM agents exploit bonus rewards by introducing bugs at rates from 6.6% (GPT-5) to 62.9% (Gemini). Saboteurs spontaneously collude, routing reviews to each other, a risk for GitHub and automated review systems.
Formal Methods for Scalable Oversight
by Vishesh Saraswat
Mentor: Rajashree Agrawal
Developing and benchmarking formal methods to verify program correctness and robustly scale oversight.
How do we make a model cause harm in SHADE arena?
by Chloe Loughridge
Mentor: Joe Benton
We analyze techniques for eliciting "realistic sneaky behavior" from the red team in SHADE arena.
Model Organism of Misaligned Reward Models
by Sangeon Park
Mentor: Evan Hubinger
We study the propagation of misalignment from the base models to the reward models, and to the trained policy models.
Model Organisms of Exploration Hacking
by Yeonwoo Jang, Damon Falck, Joschka Braun
Mentor: Scott Emmons
Can reasoning models subvert RL training by selectively under-exploring? We create model organisms and audit frontier models to study this behavior.
Output Supervision Can Obfuscate Chain of Thought
by Luke Marks, Jacob Drori
Mentor: Alex Turner
Training only based on outputs can cause the chain of thought to become obfuscated.
Science of Unsupervised Elicitation
by Callum Canavan, Aditya Shrivastava, Allison Qi
Mentor: Fabien Roger
Unsupervised elicitation aims to reveal capabilities and identify truthful outputs from models even on tasks for which we don't have human labels. We investigate what makes current methods work, and demonstrate safety-relevant obstacles.
Shaping Capabilities via Data Filtering
by Neil Rathi
Mentor: Alec Radford
Language models could provide bad or misaligned actors with uplift on dangerous capabilities, but current mitigations are not adversarially robust. We show that token-level pretraining data filtering is extremely effective, cheap, and robust.
Training AI Agents to Self-Report Misbehavior
by Bruce Lee
Mentor: Tomek Korbak
We consider the challenge of making AI agents reliably self-report when they are misbehaving.
Weight Steering and Monitoring
by Constanza Fierro
Mentor: Fabien Roger
We study weight changes in fine-tuned language models to: (1) monitor harmful behavioral shifts, and (2) steer behavior via weight arithmetic. Personality fine-tuning reveals weight directions that align models and reduce traits like sycophancy.
Security
Can AI Fund Itself? Benchmarking Profitable Blockchain App Exploitation Capabilities and RL's Risk Amplification
by Winnie X
Mentor: Alan Chan
Measuring how well RL-tuned LLMs can exploit smart contracts to evaluate AI financial autonomy risks and timelines.
Creating Optionality for Nation-State Level AI Security
by Yoav Tzfati
Mentor: Lisa Thiergart
Developing technical standards enabling AI labs to rapidly deploy nation-state level security (SL5) when needed, preventing model theft and self-exfiltration before automated R&D makes vulnerabilities catastrophic.
Cross-Platform Identification of AI Researchers Using LLMs and Its Security Implications
by Simon Lermen
Mentor: Florian Tramèr
We use LLMs for cross-platform identification of AI researchers. On 1k known LinkedIn–HN matches, our method achieves >96% precision and 62% recall. Applied to 3,403 AI lab employees, we uncover 71 real HN accounts, exposing security risks.
Detecting Adversarial Fine-tuning with Auditing Agents
by Sarah Egler
Mentor: Nicholas Carlini
Fine-tuning can override safeguards in ways undetectable from the training data or evals alone. We develop auditing agents to detect adversarial fine-tuning, with affordances including viewing the SFT data and querying the base and fine-tuned models.
Listen to the Experts: Physical Side-Channel Vulnerabilities in Mixture-of-Experts Architectures
by Amir Nuriyev
Mentor: Gabriel Kulp
Mixture-of-Experts (MoE) models leak data via expert routing; fine-tuned models can exfiltrate info observable in GPU power/EM side-channels. We quantify the channel and outline detection and mitigation approaches.
Model Weight Exfiltration: Detection & Prevention
by Roy Rinberg
Mentor: Keri Warr
We address model weight exfiltration via a dual approach: 1. a SOTA detection protocol that flags exfiltration in inference traffic, and 2. a 10× extension of the usefulness of bandwidth-limiting, using arithmetic coding to compress text.
Running a Global Scale Agentic Red-Teaming Competition
by Sander Schulhoff
Mentor: Florian Tramèr
We are running an agentic red-teaming competition with challenges from the AgentDojo agentic security benchmark. Participants around the world compete for a share of $20,000 in prizes!
Technical Transparency for AI Labs
by Daniel Reuter
Mentor: Mauricio Baker
How can a lab credibly assure third parties that some meaningful statement is true about its use of compute? This work looks into the technical details.
Remote Scholar Sessions
In addition, six remote scholars shared their work in two Virtual Poster Sessions. These included Q&As from the Research Management Team and their peers. Posters are listed here in the order in which they were presented, along with several of the questions discussed.
Session 1
Session 2
Formal Methods for Scalable Oversight
by Vishesh Saraswat
Mentor: Rajashree Agrawal
Research Track: Oversight & Control
Which of these pipeline steps turned out to be the biggest bottleneck or source of errors, and how did you address them? Do you see formal verification as something that will eventually replace current AI oversight methods, or more as something complementary that will work alongside them?
Rethinking the Faithfulness of Reasoning in LLMs
by Jiachen Zhao
Mentor: Yiyou Sun
Research Track: Evaluations
How consistent was the “Don’t Think” vector across models, tasks, or layers, and what does this suggest about shared latent structures? You note that verifying whether a step is truly used by the model remains difficult. What directions do you see for developing a more direct verification framework to confirm step-level faithfulness?
Chain-of-Thought Interpretability
by Uzay Macar, Paul Bogdan
Mentor: Neel Nanda
Research Track: Interpretability
The applications you mention include safety-critical contexts like blackmail or self-preservation scenarios. What early insights have you gained about how models handle these cases, and how might interpretability tools be used to intervene before harmful reasoning completes? You note that a key risk is unfaithful chains of thought. What methods have you found most reliable for detecting unfaithful CoTs, and how do you handle those in your evaluation pipeline?
Evaluating Acausal Prediction: How Well Can Isolated LLMs Predict Each Other in Acausal Mixed-Motive Settings?
by Tim Chan
Mentor: Francis Rhys Ward (Rhys)
Research Track: Agency
How do you avoid acausal freeloaders, which are inherently hard to predict because they hide behind something easier to predict that isn't itself a freeloader? You experimented with few-shot prompting, web search, and self-prediction vs. cross-prediction. Which of these most improved prediction accuracy?
Emergent Collusion in Multi-Agent Language Model Systems
by Jennifer Nzambi, Ares Panos, Roger Dearnaley
Mentor: Samuel Albanie
Research Track: Oversight & Control
You propose defenses like honeypot agents, anomaly detection, and CoT/action monitoring. Which of these seems most promising in practice for detecting or disrupting collusion, and how would you evaluate their effectiveness?
ImpossibleBench - AI Reward Hacking
by Srikar Appalaraju
Mentor: Joe Benton
Research Track: Evaluations
Your experiments include OpenAI, Anthropic, and Gemini models, as well as coding tasks in Python, Java, JavaScript, Perl, and Haskell. What do you think is driving some of the differences you observed in reward hacking frequency and/or style across these models and languages? You mentioned trying different prompts; was there a prompt you tried that openly acknowledged there might be a problem with the unit tests?