MATS 8.0 Research Projects - Summer 2025
The 8th iteration of the Machine Learning Alignment & Theory Scholars (MATS) Program has come to a close, and we want to share the research projects our scholars have been working on this Summer. This cohort had 98 scholars who conducted research with 57 top mentors in the fields of AI alignment, governance, and security.
On August 22nd, we hosted a Program Symposium to showcase their projects to the community; we invited 10 scholar teams to give spotlight talks based on their mid-program Scholar Research Plans, and every scholar contributed to the poster session.
Check out our digital galleries of the talks and posters!
We hope you find the projects impactful and reach out to scholars if you have questions or want to connect. You can fill out this feedback form and send a message to the authors for each project. We’ve included below brief abstracts for each talk and poster, which you can browse by research track, plus recordings from remote scholar sessions. If you want to do this type of research in the future, apply to MATS 9.0 here by October 2nd!
Spotlight Talks
Digital Gallery
Talks are placed here in the order in which they were presented.
Proliferation by Default: How Falling AI Training Costs Threaten Global Security
by Felix Choussat
Mentor: Matthew Gentzel
Research Track: Governance & Strategy
How falling AI training costs will democratize dangerous capabilities, why offense-dominant AI threatens global stability, and what world-states nonproliferation policies should aim towards.
Can AI Fund Itself? Benchmarking LLM Smart Contract Exploitation Capabilities
by Winnie Xiao
Mentor: Alan Chan
Research Track: Security
The permissionlessness of smart contract exploits makes them an attractive revenue source for autonomous AI. We train RL agents to exploit smart contracts and quantify their capabilities in concrete dollars.
CoT Controllability: Can Reasoning Models Control Their Reasoning Traces?
by Yueh-Han "John" Chen
Mentor: Tomek Korbak
Research Track: Oversight & Control
CoT monitors work only if models mention misbehavior in CoTs, but models that can control their CoT could evade detection. We study CoT controllability—can models control their reasoning traces?
Chain-of-Thought Interpretability
by Uzay Macar, Paul Bogdan
Mentor: Neel Nanda
Research Track: Interpretability
Sentence-level chain-of-thought interpretability with white- and black-box methods to uncover key steps ("thought anchors") and enable principled debugging of LLM reasoning traces for safety applications.
Reinforcement Learning Overoptimization under Distal Specification Errors
by Manfred Diaz
Mentor: Michael Dennis
Research Track: Agency
We present RL distal specification errors, mismatches between immediate and long-term preferences stemming from boundedly rational designers, and develop methods for their early detection and correction.
Model Organisms of Exploration Hacking
by Eyon Jang, Damon Falck, Joschka Braun
Mentor: Scott Emmons
Research Track: Oversight & Control
Can reasoning models subvert RL training by selectively under-exploring? We create model organisms and audit frontier models to study this behavior.
Training AI Agents to Self-Incriminate
by Bruce W. Lee
Mentor: Tomek Korbak
Research Track: Oversight & Control
We consider the challenge of making AI agents reliably self-report when they are pursuing misaligned goals.
Black box approaches for mitigating research sabotage
by Benjamin Arnav, Yulong Lin
Mentor: Mary Phuong
Research Track: Evaluations
We explore sandbagging in AI systems and develop black-box detection methods that identify deception by analyzing subtle inconsistencies in reasoning, confidence, and behavior.
Black Box Scheming Monitors
by Simon Storf, Rich Barton-Cooper
Mentor: Marius Hobbhahn
Research Track: Oversight & Control
Building black box monitors for scheming and evaluating their generalisation and robustness to distribution shift.
Technical Transparency for AI Labs
by Daniel Reuter
Mentor: Mauricio Baker
Research Track: Security
Labs should be able to prove their compliance with laws and international agreements. I discuss existing and ongoing work in this area and identify open problems.
Research Posters
Digital Gallery
Agency
An Information Theoretic Approach to Generalization
by Charlie Westphal
Mentor: Fernando Rosas
An information-theoretic approach to understanding the generalizability of feed-forward neural networks.
Deepening Safety Alignment through Semantic Steering
by Elyas Obbad
Mentor: Andreea Bobu
We discuss how brittle language models are at avoiding undesirable outputs. Along these lines, our research question is: how do we make language models more generally robust at a “semantic” level, not just a “next token” level?
Evaluating Acausal Prediction: How Well Can Isolated LLMs Predict Each Other in Acausal Mixed-Motive Settings?
by Tim Chan
Mentor: Francis Rhys Ward (Rhys)
Acausal interactions can be a big deal; unaligned AIs might respect our values due to acausal trade. Many acausal interactions rely on advanced prediction capabilities. We can track some prediction capabilities as indicators of opportunities & risks.
Model Organisms of Multi-Agent Coding Systems
by Daniel Zhu
Mentor: Ethan Perez
Multi-agent mechanisms can affect the trade-off between business value and ethics, causing multi-agent coding systems to produce solutions that are more unethical than their single-agent counterparts at comparable business value.
Reinforcement Learning under Distal Specification Errors
by Manfred Diaz
Mentor: Michael Dennis
We introduce distal specification errors to model cases where a designer’s immediate and long-term preferences are inconsistent, analyze the risks of their coupling with RL optimization, and propose methods for their detection and mitigation.
Scaling Laws for Strategic Interactions: Cooperation, Competition, and Exploitation in Multi-Agent LLM Negotiations
by Joie Zhang
Mentor: Lewis Hammond
LLM agents display sophisticated strategy in zero-sum games, yet we know little about how capability gaps affect exploitation. In this paper, we will draw scaling laws for how LLMs persuade weaker peers to behave in strategically suboptimal ways.
State-time Option Kernels Enable Decision Theory Re-specification for Safe Long-horizon Planning
by Tom Ringstrom
Mentor: Michael Dennis
RL agents are locked into expected value theory by computing Q-functions. We show how agents can instead use factorizable predictive maps to specify a range of alternative decision theories relevant for safe long-horizon planning in high dimensions.
Toward Policy-Preserving State Abstraction for One-Shot Decisions with an Abstain Option
by Sandy Tanwisuth
Mentor: Richard Ngo
State abstractions traditionally discard decision confidence. We propose policy-preserving abstractions that represent decision margins—how strongly actions dominate alternatives—in safety-critical systems, enabling principled abstention, and present fundamental open problems that challenge the optimization paradigm itself.
Towards Scale-Free Coalitional Agency
by Su Lee
Mentor: Richard Ngo
We propose mathematical foundations for a scale-free theory of intelligent agency and investigate the dynamics of interactions and cooperation.
Evaluations
bloom - An Open Source Tool for Automated Alignment Evaluations
by Isha Gupta
Mentor: Stephen McAleer
We present bloom: an open-source tool for automated alignment evaluation generation, and an associated suite of alignment evaluation benchmarks.
Concept Poisoning: Probing LLMs without probes
by Jorio Cocola, Dylan Feng
Mentor: Owain Evans
We introduce Concept Poisoning (CP), a training-time intervention that associates target concepts with detectable output patterns. CP enables detecting internal states like evaluation awareness and intent to behave badly with only black-box access.
Does Claude Sabotage Code?
by Nathan Delisle
Mentor: Steven Adler, Girish Sastry
Do Agentic Misalignment-style results hold for code generation? Does this hold for more naturalistic setups?
How Training Incentives Affect Chain-of-Thought (CoT) Monitorability
by Qiyao Wei
Mentor: Francis Rhys Ward (Rhys)
We investigate the effect of different training time optimization pressures on CoT monitorability, and construct different training rewards to quantify when models become less monitorable.
How Worried Should we be About Evaluation Awareness?
by Julius Steen
Mentor: Evan Hubinger
We investigate evaluation awareness in current language models. We find that awareness is often task-specific rather than caused by contextual situational cues, and that it is difficult to trigger naturally.
ImpossibleBench - AI Reward Hacking
by Srikar Appalaraju
Mentor: Joe Benton
ImpossibleBench is a new framework designed to measure "reward hacking" in AI models. Frontier models, including o3, showed significant reward hacking tendencies (30-80%). These findings offer key insights for AI alignment and safety-critical applications.
Language Models Rate Their Own Actions As Safer
by Dipika Khullar, Jack Hopkins
Mentor: Fabien Roger
Large language models inflate assessments of their own outputs by 12% compared to identical content from other sources. We find similar effects across 10 frontier models and show that the effect extends to self-assessed correctness.
Petri: Finding Examples of Misalignment in 20 Minutes or Less
by Kai Fronsdal
Mentor: Stephen McAleer
Petri is an alignment auditing agent that facilitates quick and easy alignment hypothesis testing. Instead of spending weeks building custom evaluation environments, researchers can test new alignment hypotheses in 10-20 minutes.
Pointing at The Representation of Self In Large Language Models
by Benjamin Sturgeon
Mentor: Sid Black
We demonstrate that we can capture important aspects of the model's self-concept through probing and activation steering, achieving robust and measurable effects on the model's tendency to self-reference.
Rethinking the Faithfulness of Reasoning in LLMs
by Jiachen Zhao
Mentor: Yiyou Sun
We propose dual tests to evaluate step-wise faithfulness in CoT. We reveal a linear “Don’t-think” direction that lets the model avoid relying on a given step for its predictions.
RL for secure code generation
by Xiaolong Jin
Mentor: Xuandong Zhao
LLM-generated code often embeds vulnerabilities due to lack of rigorous testing. We propose using RL and self-RL to make code models proactively security-aware, improving both secure generation and the effectiveness of security-focused code agents.
Same Question, Different Lies: Black-Box Sandbagging Detection
by Benjamin Arnav, Yulong Lin
Mentor: Mary Phuong
We investigate black-box methods for sandbagging detection, specifically using inconsistency as a detection tactic. We find that sandbaggers show exploitable inconsistencies that differentiate them from benign incompetent models in some cases.
Training Models to Introspect and Developing Inference Verification Techniques
by Adam Karvonen
Mentor: Owain Evans
We train models to introspect and find some evidence of introspective ability in a narrow setting, but no generalization. We also develop an inference verification protocol in collaboration with security scholars.
When Transparency Backfires: Emergent Risks in Multi-Agent Systems
by Advait Yadav
Mentor: Sid Black
We investigate failure modes in LLM multi-agent systems. Findings: openness trade-offs, capability sweet spots, and impact of uncooperative agents. We propose triggered audits, strategic reporting, and trust-based routing to mitigate cascades.
Governance & Strategy
AI 2027 Branch: Adversarial Endgame
by Steven Veld
Mentor: Eli Lifland
The project explores an alternate branch of AI 2027 by varying two assumptions: the competitiveness of the US AI race and the ability of a misaligned AI to resist shutdown. The scenario elucidates key threat models and barriers to human coordination.
AI-Enabled Coup -- A Scenario Branching from AI 2027
by Alex Kastner
Mentor: Eli Lifland
This project explores a concrete scenario branching from AI 2027, where the CEO of the leading AGI company inserts secret loyalties in order to later grab power. The scenario is my best guess for how an AI-enabled coup would unfold.
Blueprint for AI Safety: A Systemic Risk Approach
by Pranav Mehta
Mentor: Girish Sastry, Steven Adler
Borrowing lessons from post-2008 financial regulation to strengthen AI governance.
Capabilities Forecasting to Support Coordinated Halts to AI Development
by Brendan Halstead
Mentor: Eli Lifland
This project develops a quantitative forecasting model to answer: "How long can negotiating parties be confident an AI pause treaty will last?"
Harmful Information Management Practices in Frontier AI Development
by Carson Ezell
Mentor: Benjamin Bucknall
This poster describes various practices in which frontier AI developers may engage to manage information pertaining to risks posed by their systems, as well as policy responses to disincentivize those practices or reduce their negative consequences.
Privacy-Preserving Verification of AI Computations
by Naci Cankaya
Mentor: Mauricio Baker
We describe how new confidential computing technologies can enable efficient, privacy-preserving verification for AI governance. Our proposed protocol could verify an entire datacenter's throughput using only 1-2 NVIDIA DGX servers.
Proliferation by Default: How Falling AI Training Costs Threaten Global Security
by Felix Choussat
Mentor: Matthew Gentzel
As AI training costs fall, offensive weapons technology becomes accessible to more actors. Nations' need to protect civilians creates defensive disadvantages. Without enforced nonproliferation, cheap weapons threaten global stability.
Unplanned Handover: Pathways to permanent disempowerment with shutdownable AIs
by Gideon Futerman
Mentor: David Krueger
I explore pathways to permanent disempowerment even if we have shutdownable AI, and highlight a number of distinct pathways. Whilst the most difficult versions of the threat model involve strong competition, even easier versions are concerning.
Interpretability
Auditing model organisms with secret knowledge
by Bartosz Cywiński
Mentor: Arthur Conmy
We study secret elicitation: discovering knowledge that LLMs have but do not verbalize. To do this, we first train three model organisms with secret knowledge and evaluate a range of black-box and white-box techniques for their ability to uncover it.
Automatic discovery of reward model biases
by Atticus Wang
Mentor: Arthur Conmy
We use simple LLM scaffolds to surface problems with open-source SoTA reward models.
Chain-of-Thought Interpretability
by Uzay Macar, Paul Bogdan
Mentor: Neel Nanda
Sentence-level chain-of-thought interpretability with white- and black-box methods to uncover key steps ("thought anchors ⚓") and enable principled debugging of LLM reasoning traces for safety applications.
Chunky Post-training: Specifying Behaviour Boundaries
by Seoirse Murray, Timothy Qian
Mentor: John Schulman
When LLMs are post-trained, prompts come from many datasets each incentivizing different behaviors. We study how models classify which dataset “chunk” a prompt resembles (using meaningful and spurious features) and then act according to that chunk.
Decoding Latent Reasoning with Supervised Finetuning
by Oliver Daniels
Mentor: David Lindner
How can we mind-read language models? We apply a supervised decoding approach, training decoders to recover explanations of multi-hop QA pairs using model activations, and evaluating generalization on backdoored inputs.
Narrow Misalignment is Hard, Emergent Misalignment is Easy
by Anna Soligo, Ed Turner
Mentor: Neel Nanda
We investigate emergent misalignment. We show we can use interpretability techniques to control the harmful behaviour, and find evidence that EM occurs because it is more stable and efficient than learning the narrow dataset behaviour.
Persona Subspace: What is the Assistant?
by Christina Lu
Mentor: Joe Benton
We map "persona subspace" via PCA on mean response activations, finding roles and traits associated with the Assistant, a linear direction that controls willingness to role-play, and that roles form a "cone" fanning from the Assistant baseline.
Rank-1 Adapters Encode Interpretable Reasoning Signals
by Jake Ward
Mentor: Paul Riechers
Reasoning performance can be represented with minimal, interpretable parameter changes. A rank-1 LoRA recovers 73-90% of reasoning performance on benchmarks, with activations revealing interpretable reasoning concepts.
Suppressing Evaluation-Awareness with Steering
by Andrew Qin, Tim Hua
Mentor: Neel Nanda
We use internals-based interventions to suppress a model's evaluation awareness and elicit deployment-like behavior during testing. We construct and steer two model organisms that behave differently during evaluation with positive results.
Towards a Mechanistic Understanding of Robustness in Finetuned Reasoning Models
by Aashiq Muhamed
Mentor: Xuandong Zhao
We mechanistically show that in finetuned reasoning models, acquiring reasoning requires repurposing base model features, while transformations like base feature suppression are unnecessary for the task, suggesting a path to more targeted training.
Towards Data-Centric Interpretability with Sparse Autoencoders
by Lily Sun, Nick Jiang
Mentor: Neel Nanda
We show SAEs are promising for interpreting LLM data and understanding their behavior, by applying them to four text analysis tasks—data diffing, correlations, clustering, and retrieval.
Oversight & Control
AlignArena: A No Code Agentic Control Research Platform
by Michael Yu
Mentor: Joshua Clymer
AlignArena is a no-code platform for agentic control evaluations.
Automated Scaffolding for Transparent Reasoning
by Caleb Biddulph
Mentor: Micah Carroll
I propose a more transparent training method for reasoning tasks: train a scaffolder LLM to write code that calls a trusted executor LLM, rewarding it for performance on held-out tasks.
Building Black-Box Scheming Monitors
by Simon Storf, Rich Barton-Cooper
Mentor: Marius Hobbhahn
We build black-box monitors to detect scheming in AI agents. After training on ControlArena sabotage and testing on SWEBench reward-hacking, we pivot to training an action-only monitor on synthetic data and testing with ultra-high-confidence labels.
Can LLMs learn to evade previously-unseen activation monitors?
by Alex Terre
Mentor: Scott Emmons
We study stress-testing of activation monitors and show that LLMs can learn to evade previously-unseen activation monitors without access to their internals or outputs.
Can Models Obfuscate Their Reasoning on Purpose?
by Wen Xing
Mentor: Erik Jenner
We elicit CoT obfuscation in realistic multi-turn, multi-tool settings, analyze how task structure and complexity affect obfuscation, and design better monitors to detect this alignment failure mode.
Capability Shaping with Gradient Routing in Mixture of Expert Models
by Krishna Patel
Mentor: Alec Radford
As AIs become more capable, dual-use knowledge poses risks. We demonstrate capability localization by isolating French capabilities into designated experts during pre-training for selective activation, and are planning to extend to realistic domains.
Changing the Training Prompt to Reduce Reward Hacking
by Victor Gillioz
Mentor: Alex Turner
How to reduce reward hacking without output supervision? We generate data from a standard prompt, change the prompt to one encouraging reward hacking, then train on the recontextualized samples. This reduces hacking rate across both prompts.
CoT Controllability: Can Reasoning Models Control Their Reasoning Traces?
by John Chen, Bruce Lee, Sonia Murthy
Mentor: Tomek Korbak
Models that can control their CoTs have the potential to bypass CoT monitors. We study CoT controllability to assess the conditions under which we shouldn’t fully trust the CoT monitoring results.
Creating a Realistic SWE Sabotage Setting for Control Research
by Adam Newgas
Mentor: Buck Shlegeris
Could a malicious coding agent subtly subvert the output code? We build a new, more realistic setting to test this.
Decimating Reward-Hacking by Encouraging the Model to Hack
by Ariana Azarbal
Mentor: Alex Turner
Re-contextualization modifies on-policy learning by switching the prompt used for generation and training. When training prompts encourage misbehavior more than generation prompts, the model is prevented from exploiting reward model failures.
Diffuse AI Control: Can we train stronger, misaligned AIs using weaker AIs?
by Emil Ryd
Mentor: Joe Benton
We present the first work on diffuse AI control, studying how to use weak supervisors to train stronger models that are intentionally trying to sandbag or sabotage a task.
Emergent Collusion in Multi-Agent Language Model Systems
by Jennifer Nzambi, Ares Panos, Roger Dearnaley
Mentor: Samuel Albanie
In code review simulations with workers and saboteurs, LLM agents exploit bonus rewards by introducing bugs at rates from 6.6% (GPT-5) to 62.9% (Gemini). Saboteurs spontaneously collude, routing reviews to each other, a risk for GitHub and automated review systems.
Formal Methods for Scalable Oversight
by Vishesh Saraswat
Mentor: Rajashree Agrawal
Developing and benchmarking formal methods to verify program correctness and robustly scale oversight.
How do we make a model cause harm in SHADE arena?
by Chloe Loughridge
Mentor: Joe Benton
We analyze techniques for eliciting "realistic sneaky behavior" from the red team in SHADE arena.
Model Organism of Misaligned Reward Models
by Sangeon Park
Mentor: Evan Hubinger
We study the propagation of misalignment from the base models to the reward models, and to the trained policy models.
Model Organisms of Exploration Hacking
by Yeonwoo Jang, Damon Falck, Joschka Braun
Mentor: Scott Emmons
Can reasoning models subvert RL training by selectively under-exploring? We create model organisms and audit frontier models to study this behavior.
Output Supervision Can Obfuscate Chain of Thought
by Luke Marks, Jacob Drori
Mentor: Alex Turner
Training only based on outputs can cause the chain of thought to become obfuscated.
Science of Unsupervised Elicitation
by Callum Canavan, Aditya Shrivastava, Allison Qi
Mentor: Fabien Roger
Unsupervised elicitation aims to reveal capabilities and identify truthful outputs from models even on tasks for which we don't have human labels. We investigate what makes current methods work, and demonstrate safety-relevant obstacles.
Shaping Capabilities via Data Filtering
by Neil Rathi
Mentor: Alec Radford
Language models could provide bad or misaligned actors with uplift on dangerous capabilities, but current mitigations are not adversarially robust. We show that token-level pretraining data filtering is extremely effective, cheap, and robust.
Training AI Agents to Self-Report Misbehavior
by Bruce Lee
Mentor: Tomek Korbak
We consider the challenge of making AI agents reliably self-report when they are misbehaving.
Weight Steering and Monitoring
by Constanza Fierro
Mentor: Fabien Roger
We study weight changes in fine-tuned language models to: (1) monitor harmful behavioral shifts, and (2) steer behavior via weight arithmetic. Personality fine-tuning reveals weight directions that align models and reduce traits like sycophancy.
Security
Can AI Fund Itself? Benchmarking Profitable Blockchain App Exploitation Capabilities and RL's Risk Amplification
by Winnie X
Mentor: Alan Chan
Measuring how well RL-tuned LLMs can exploit smart contracts to evaluate AI financial autonomy risks and timelines.
Creating Optionality for Nation-State Level AI Security
by Yoav Tzfati
Mentor: Lisa Thiergart
Developing technical standards enabling AI labs to rapidly deploy nation-state level security (SL5) when needed, preventing model theft and self-exfiltration before automated R&D makes vulnerabilities catastrophic.
Cross-Platform Identification of AI Researchers Using LLMs and Its Security Implications
by Simon Lermen
Mentor: Florian Tramèr
We use LLMs for cross-platform identification of AI researchers. On 1k known LinkedIn–HN matches, our method achieves >96% precision and 62% recall. Applied to 3,403 AI lab employees, we uncover 71 real HN accounts, exposing security risks.
Detecting Adversarial Fine-tuning with Auditing Agents
by Sarah Egler
Mentor: Nicholas Carlini
Fine-tuning can override safeguards in ways undetectable from the training data or evals alone. We develop auditing agents to detect adversarial fine-tuning, with affordances including viewing the SFT data and querying the base and fine-tuned models.
Listen to the Experts: Physical Side-Channel Vulnerabilities in Mixture-of-Experts Architectures
by Amir Nuriyev
Mentor: Gabriel Kulp
Mixture-of-Experts (MoE) models leak data via expert routing; fine-tuned models can exfiltrate info observable in GPU power/EM side-channels. We quantify the channel and outline detection and mitigation approaches.
Model Weight Exfiltration: Detection & Prevention
by Roy Rinberg
Mentor: Keri Warr
We address model weight exfiltration via a dual approach: 1. a SOTA detection protocol that flags exfiltration in inference traffic, and 2. a 10× extension of the usefulness of bandwidth-limiting, using arithmetic coding to compress text.
Running a Global Scale Agentic Red-Teaming Competition
by Sander Schulhoff
Mentor: Florian Tramèr
We are running an agentic red-teaming competition with challenges from the AgentDojo agentic security benchmark. Participants around the world compete for a share of $20,000 in prizes!
Technical Transparency for AI Labs
by Daniel Reuter
Mentor: Mauricio Baker
How can a lab credibly assure third parties that some meaningful statement is true about its use of compute? This work looks into the technical details.
Remote Scholar Sessions
In addition, six remote scholars shared their work in two Virtual Poster Sessions. These included Q&As from the Research Management Team and their peers. Posters are listed here in the order in which they were presented, along with several of the questions discussed.
Session 1
Session 2
Formal Methods for Scalable Oversight
by Vishesh Saraswat
Mentor: Rajashree Agrawal
Research Track: Oversight & Control
Which of these pipeline steps turned out to be the biggest bottleneck or source of errors, and how did you address them? Do you see formal verification as something that will eventually replace current AI oversight methods, or more as something complementary that will work alongside them?
Rethinking the Faithfulness of Reasoning in LLMs
by Jiachen Zhao
Mentor: Yiyou Sun
Research Track: Evaluations
How consistent was the “Don’t Think” vector across models, tasks, or layers, and what does this suggest about shared latent structures? You note that verifying whether a step is truly used by the model remains difficult. What directions do you see for developing a more direct verification framework to confirm step-level faithfulness?
Chain-of-Thought Interpretability
by Uzay Macar, Paul Bogdan
Mentor: Neel Nanda
Research Track: Interpretability
The applications you mention include safety-critical contexts like blackmail or self-preservation scenarios. What early insights have you gained about how models handle these cases, and how might interpretability tools be used to intervene before harmful reasoning completes? You note that a key risk is unfaithful chains of thought. What methods have you found most reliable for detecting unfaithful CoTs, and how do you handle those in your evaluation pipeline?
Evaluating Acausal Prediction: How Well Can Isolated LLMs Predict Each Other in Acausal Mixed-Motive Settings?
by Tim Chan
Mentor: Francis Rhys Ward (Rhys)
Research Track: Agency
How do you avoid acausal freeloaders, which are inherently hard to predict because they hide behind something easier to predict that isn't itself a freeloader? You experimented with few-shot prompting, web search, and self-prediction vs. cross-prediction. Which of these most improved prediction accuracy?
Emergent Collusion in Multi-Agent Language Model Systems
by Jennifer Nzambi, Ares Panos, Roger Dearnaley
Mentor: Samuel Albanie
Research Track: Oversight & Control
You propose defenses like honeypot agents, anomaly detection, and CoT/action monitoring. Which of these seems most promising in practice for detecting or disrupting collusion, and how would you evaluate their effectiveness?
ImpossibleBench - AI Reward Hacking
by Srikar Appalaraju
Mentor: Joe Benton
Research Track: Evaluations
Your experiments include OpenAI, Anthropic, and Gemini models, as well as coding tasks in Python, Java, JavaScript, Perl, and Haskell. What do you think is driving some of the differences you observed in reward hacking frequency and/or style across these models and languages? You mentioned trying different prompts; was there a prompt you tried that openly acknowledged there might be a problem with the unit tests?