by Daniel Filan
AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.
Language: 🇺🇲 English
Publishing since: 12/11/2020
March 28, 2025
How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-interpretability.html

Topics we discuss, and timestamps:
0:00:40 - Why compact proofs
0:07:25 - Compact Proofs of Model Performance via Mechanistic Interpretability
0:14:19 - What compact proofs look like
0:32:43 - Structureless noise, and why proofs
0:48:23 - What we've learned about compact proofs in general
0:59:02 - Generalizing 'symmetry'
1:11:24 - Grading mechanistic interpretability
1:43:34 - What helps compact proofs
1:51:08 - The limits of compact proofs
2:07:33 - Guaranteed safe AI, and AI for guaranteed safety
2:27:44 - Jason and Rajashree's start-up
2:34:19 - Following Jason's work

Links to Jason:
Github: https://github.com/jasongross
Website: https://jasongross.github.io
Alignment Forum: https://www.alignmentforum.org/users/jason-gross

Links to work we discuss:
Compact Proofs of Model Performance via Mechanistic Interpretability: https://arxiv.org/abs/2406.11779
Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations: https://arxiv.org/abs/2410.07476
Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration: https://arxiv.org/abs/2412.03773
Stage-Wise Model Diffing: https://transformer-circuits.pub/2024/model-diffing/index.html
Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (aka the Apollo paper on APD): https://arxiv.org/abs/2501.14926
Towards Guaranteed Safe AI: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-45.pdf

Episode art by Hamish Doodles: https://hamishdoodles.com/
March 1, 2025
In this episode, I chat with David Duvenaud about two topics he's been thinking about: first, a paper he wrote on evaluating whether frontier models can sabotage human decision-making or human monitoring of those same models; and second, the difficult situation humans face in a post-AGI future, even if AI is aligned with human intentions.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/03/01/episode-38_8-david-duvenaud-sabotage-evaluations-post-agi-future.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube (@FARAIResearch): https://studio.youtube.com/channel/UCCV6kbjBZje3LPxRp0NHfxg
The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:
01:42 - The difficulty of sabotage evaluations
05:23 - Types of sabotage evaluation
08:45 - The state of sabotage evaluations
12:26 - What happens after AGI?

Links:
Sabotage Evaluations for Frontier Models: https://arxiv.org/abs/2410.21514
Gradual Disempowerment: https://gradual-disempowerment.ai/

Episode art by Hamish Doodles: https://hamishdoodles.com/
February 9, 2025
The Future of Life Institute is one of the oldest and most prominent organizations in the AI existential safety space, working on topics such as the AI pause open letter and how the EU AI Act can be improved. Metaculus is one of the premier forecasting sites on the internet. Behind both of them lies one man: Anthony Aguirre, who I talk with in this episode.

Patreon: https://www.patreon.com/axrpodcast
Ko-fi: https://ko-fi.com/axrpodcast
Transcript: https://axrp.net/episode/2025/02/09/episode-38_7-anthony-aguirre-future-of-life-institute.html
FAR.AI: https://far.ai/
FAR.AI on X (aka Twitter): https://x.com/farairesearch
FAR.AI on YouTube: https://www.youtube.com/@FARAIResearch
The Alignment Workshop: https://www.alignment-workshop.com/

Topics we discuss, and timestamps:
00:33 - Anthony, FLI, and Metaculus
06:46 - The Alignment Workshop
07:15 - FLI's current activity
11:04 - AI policy
17:09 - Work FLI funds

Links:
Future of Life Institute: https://futureoflife.org/
Metaculus: https://www.metaculus.com/
Future of Life Foundation: https://www.flf.org/

Episode art by Hamish Doodles: https://hamishdoodles.com/
Dwarkesh Patel
Machine Learning Street Talk (MLST)
Spencer Greenberg
Patrick McKenzie
Mercatus Center at George Mason University
swyx + Alessio
Turpentine
Hannah Fry
Russ Roberts
Conviction
Microsoft, Brad Smith
Tom Chivers and Stuart Ritchie
Sam Charrington
The 80000 Hours team