Justified Posteriors

Seth Benzell and Andrey Fradkin

Explorations into the economics of AI and innovation. Seth Benzell and Andrey Fradkin discuss academic papers and essays at the intersection of economics and technology. empiricrafting.substack.com

  1. FEB 9

    Basil Halperin: Leading Indicators for TAI, Conditions for the Singularity, and Tax Policy at the End of History

In this week’s episode of Justified Posteriors, we interview TAI expert and friend of the show Basil Halperin of the University of Virginia. There Basil is doing some of the most fascinating work on the economics of TAI with Anton Korinek and other leading researchers. The first section of our conversation covers Basil’s early career, including jobs at Uber and AQR, how he got interested in AI as a research topic, and his role in managing the Stripe Economics of AI Fellowship. We then discuss a paper we’ve already covered on the show: his work on whether the real interest rate can be interpreted as a leading indicator of the probability of TAI (or ‘doom’). Listen to our previous conversation on his paper, and view show notes, including links to that paper and blog post, here: If the Robots Are Coming, Why Aren't Interest Rates Higher? Seth was previously convinced by Basil’s arguments, but Andrey was a holdout — we discover Basil’s takes on Andrey’s reservations. Our third subject is Basil’s new paper with Anton about the relevant elasticities for a singularity in research progress, “When Does Automating Research Lead to Explosive Growth?” Basil explains how the key issues are the degree of fishing out and spillovers in/across different industries, as well as the extent to which research can be automated. We also take a step back to ask what theoretical research like this teaches us. Finally, we cover Basil’s back-and-forth over friend of the show Phil Trammell’s new blog post with Dwarkesh about Piketty and optimal taxation in the age of TAI, link below, and ask him to explain the meme he posted summarizing his arguments. Additional references: Does carbon taxation yield a double dividend (environmental plus fiscal)? We hope you enjoy the conversation! Transcript follows: [00:00] Seth Benzell: Welcome to the Justified Posteriors podcast, the podcast that updates its beliefs about the economics of AI and technology. I’m Seth Benzell, looking forward to the Basil exposition we’ll get today, coming to you from Chapman University in sunny Southern California. [00:35] Andrey Fradkin: And I’m Andrey Fradkin, looking forward to creating a new accord with Basil, coming to you from San Francisco, California. And today we’re very excited to welcome Basil Halperin to our show. Welcome to the show. [00:49] Basil Halperin: Thanks Andrey. Thanks Seth. Super excited to be here. [00:53] Andrey Fradkin: So as background, Basil is an expert on the economics of transformative AI and he’s currently... [01:00] Seth Benzell: Expert is underselling. He is one of the most interesting thinkers around on... Alright, continue. [01:07] Andrey Fradkin: Yes, he’s great. And he’s a professor at the University of Virginia. We have an exciting show for you today touching on many topics, but we first wanted to get a start with some of the biographical tidbits. In particular, Basil, how did you get interested in this topic? And it seems like you were a lot earlier than other economists. So I’m curious what drew you in before everyone else to this interesting set of topics? [01:38] Basil Halperin: I mean, not as early as you two, I don’t think. Uh, I don’t know. I was just a nerd growing up. I read a lot of sci-fi. I read Ray Kurzweil in high school when his The Singularity is Near book came out in the 2000s, just because it was popular. The idea got in my head. 
I was kind of like, “Well, this is interesting, but eventually...” I was like, “I have a few decades to work on other things before any of this becomes relevant.” And then GPT-3 came out in that long hot summer of 2020. I freaked out a little bit for a week or two. This is crazy. How is this happening so fast? So that sort of woke me up a bit. I started thinking about these issues and gradually more and more have gotten sucked into working on it. [02:20] Seth Benzell: What were your favorite sci-fi growing up? [02:23] Basil Halperin: Ender’s Game was always the classic. [02:26] Andrey Fradkin: Now I saw on your resume that you spent a stint at AQR, which is a large capital management firm. I’m curious, what did you learn working there? [02:37] Basil Halperin: Yeah. So I didn’t expect to go into finance out of college, but basically the opportunity came along. I found out that this firm seemed pretty interesting. So the background is, this firm was founded by two PhD students of Eugene Fama, the Nobel Laureate in finance. Basically taking his ideas seriously and other ideas from the asset pricing literature seriously and applying them to earn a bunch of money. So I didn’t know anything about finance going into that job. So I learned a whole bunch and some of that has been applied in my research that I think we’ll talk about today. [03:13] Seth Benzell: Ooh, wait, yeah. Pricing assets in the age of AI. Fascinating. [03:17] Basil Halperin: Yeah, yeah. Talk about it. [03:19] Andrey Fradkin: So I do think this is an interesting background because a lot of people in our field don’t have a finance background. That’s not where they’re coming from in terms of thinking about technology. So it maybe gave you this strong, prepared mind to be thinking about the asset pricing implications of transformative AI. Did you get to interact with Cliff Asness or were you too much of a, like, intern, low-level employee? [03:45] Basil Halperin: No, I was there for a year and a half or two years, but too junior. I think one time I made a bad joke to him in the elevator and he like, pretended to laugh. That was pretty much the highlight. [03:56] Andrey Fradkin: Well, he also likes to make a lot of bad jokes, so you have that in common. Some of them are good too. [04:05] Basil Halperin: [Laughs] These bad jokes are funny. [04:06] Andrey Fradkin: What about at Uber? You also spent some time there working with John List, is that right? [04:11] Basil Halperin: Yeah, yeah. John taught my first ever Econ class when I was undergrad at Chicago, Intro Micro. And he helped inspire me to become an economist plausibly. And then yeah, I worked for him when he was Chief Economist at Uber. Which, Andrey, as you well know, being an economist in tech is an interesting experience. And Uber in 2017 was a particularly interesting time because it was a controversial firm. Sort of like OpenAI is today, the firm that’s always in the headlines. [04:42] Andrey Fradkin: Were there specific perspectives that you gained there that have informed your subsequent economics career? Or was it more of just like you learned some useful skills in data science or something else? [04:55] Basil Halperin: Yeah, I don’t know how much super tangible I have to say, but it definitely was informative in general to work in the private sector before going into academia, just to see how different things are. You know, like in the private sector you’re being paid to tell your boss that he or she is wrong. 
And then in academia that’s not so much a recommended strategy. [05:19] Seth Benzell: Wait, wait, okay. So tell us about... so you’re there, it’s in 2017. Uber is one of the most evil, fast-growing companies on the planet. So you said it was interesting. So what was interesting about that? Were you pressured to write an economics report you didn’t agree with? Did you feel like you had to like wear, you know, a hoodie going into the office as people were throwing trash at you? What was it like? [05:43] Basil Halperin: No, it was just... I mean, I certainly didn’t have a negative experience or negative view of the company, though I’m sure there were negative things the company did, like any large organization. But the team I was on, this Chief Economist team, was like five people. So it was pretty small. So we just had a lot of leverage to go around the company, be sort of an internal consultancy and do a lot of crazy things, varied things that I otherwise never would have had the chance to do. Like I was sort of a software engineer for one month that I was there, which was otherwise something that never would have happened to me. Or running large scale experiments on a million riders or whatever, which... I would love to do macro experiments if any central bank wants to volunteer for some coin flips. But otherwise, as a macroeconomist now, I don’t really have that opportunity. [06:35] Andrey Fradkin: So this kind of is a, you know, is a nice segue into our next topic, which is... like a lot of people are worried about their careers these days, obviously because of AI. [06:49] Seth Benzell: Not me! Podcasting is never gonna go out of style, Andrey! [06:53] Andrey Fradkin: Fair enough. But I think that’s a very broad question and perhaps too broad to answer. But I think for people with an interest in economics—you know, you were in tech, you decided to go into academia. I’ve made the same decision in my life. But I’m curious like what advice would you have? And maybe this is a good opportunity to also speak about the efforts you’ve been doing with the Stripe Economics of AI Fellowship. [07:23] Basil Halperin: Yeah, okay. So two points here. One point is that I feel like on every good AI podcast, there’s a question of, “What do you tell young people? What they should be studying today?” And like there’s zero good answer to that question. So yeah, I don’t have any good answer to that question. [07:38] Seth Benzell: Study the Justified Posteriors podcast. Listen to every episode every day. Three times a day. [07:45] Basil Halperin: But besides that, it’s not clear. The other thing I guess I can say is that if you’re an economist, working on the economics of AI is like a really cool thing to do. There’s just like so much low hanging fruit. There’s so many insights that can be arbitraged from other fields, which is always a good place to be. You can... instead of going to have to pick the fruit yourself, you can just take the fruit out of othe

    1h 29m
  2. JAN 26

    Can an AI Interview You Better Than a Human?

    We discuss “Voice in AI Firms: A Natural Field Experiment on Automated Job Interviews” by Brian Jabarian and Luca Henkel. The paper examines a randomized experiment with call center job applicants in the Philippines who were assigned to either AI-conducted voice interviews, human interviews, or given a choice between the two. Key Findings: * AI interviews led to higher job offer rates and proportionally higher retention rates * No significant difference in involuntary terminations between groups * Applicants actually preferred AI interviews—likely due to scheduling flexibility and immediate availability * AI interviewers kept conversations more on-script with more substantive exchanges * Online applicants saw especially large gains from AI interviews Topics Discussed: * The costs of recruitment and why interview efficiency matters * Whether AI interviews find different workers or just reduce noise in screening * How human recruiters interpret AI interview transcripts differently * The “Coasean singularity” question: Will AI improve labor market matching overall? * Limitations: scheduling confounds, external validity beyond call centers, unmeasured long-tail outcomes * The coming arms race between AI interviewers and AI-coached applicants Posterior Updates: On the usefulness of current AI for job hiring: * Seth: 40% → 90% confidence AI works for call center jobs; modest update for general jobs * Andrey: 20% → 75% for call centers; 1% → 5% for general interviews (“we need to reorganize all of hiring first”) On whether AI will improve job matching significantly on net in the next 5-10 years * Andrey: 55% → No Update * Seth: “A bit more optimistic than Andrey” → +1pp update Referenced Work/Authors: * Prediction Machines * Related episode on AI and labor signaling with Bo Cowgill. Transcript: [00:00:00] INTRODUCTION Seth: Welcome to the Justified Posteriors podcast, the podcast that updates its priors about the economics of AI and technology. I’m Seth Benzell, an interviewer who will never stick to a standard script, coming to you from Chapman University in sunny Southern California. Andrey: And I’m Andrey Fradkin, counting down the days until I can use an AI to pre-interview my podcast guests to see if they deserve to be on the show. Coming to you from San Francisco, California. Seth: I don’t know. I think our filtering criteria is pretty good. Andrey: I know. Seth: Right. That’s one job we never want to automate—who becomes a friend of the podcast. That’s an un-automatable job. Andrey: But it would be nice to pre-interview our guests so that we could prepare better for the actual show. Seth: I was thinking about this, because there’s two possibilities, right? You do the pre-interview, and you get an unsurprising answer in this sort of pre-interview, and then that’s good, and then you should go with it. And then if you get a surprising one, then you would lean into it. What would you even get out of the pre-interview? Andrey: Maybe what the guests would want to talk about. Seth: Okay. Andrey: But I agree with you. Mostly, it’s just hearing the guest talk, and then thinking about, “Oh, this is something that we want to really dig into,” versus, “This is something that might be not as interesting to our audience,” and knowing that ex ante. [00:02:00] SETTING UP THE TOPIC Seth: Yeah. We’ve been... So we’re talking about interviews. You’ll remember in a recent episode, we just talked to our friend Bo, who’s doing work on how maybe job applications are changing because of AI. 
So now I think what we want to think a little bit about is how job interviews are changing because of AI. Maybe we’ve heard before about how AI is changing how people talk to the hirer. Maybe we want to hear a little bit about how AI is changing how the hirer solicits information in an interview. We’ve got a very interesting paper to talk about just about that. But do you remember the last job interview you did, Andrey? Andrey: Yes. Seth: How did it go? Did you have fun? Did you feel like you stayed on topic? Andrey: It was a very intense set of interviews that required me to fly halfway across the world, which was fun, but exhausting. Seth: So fun. So you would describe the interview as a fun experience? Did you get more excited about the job after doing the interview? Andrey: Yes, although I ultimately didn’t take it, but I did get—you know, I was impressed by the signaling value of having such an interview. Seth: So the signaling value. So in other words, the signal to you from the interviewer about the fact that they were going to invest this much time. Is that right? It’s that direction of signal? Andrey: Yes, yes. And also the sorts of people who they had talking to me, and just the fact that they were trying to pitch me so hard. Now, certain other companies lacked such efforts. Seth: Right. So it seems like one important aspect of an interview is what the interviewee learns from the interview. But what about the other side? Do you feel like your interviewer learned a lot about you, or enough to justify all that time and expense? Andrey: I’d like to think so. I mean, I’m not them, so I can’t really speak on their behalf. But it did seem like the interview process was fairly thought out for a certain set of goals, which might differ across companies. What about yourself, Seth? Seth: Thank God, it has been a long time ago that I interviewed for a job, and I can tell you exactly what happened. I was on the academic job market, but I did throw out a couple of business applications, and so I got an interview at Facebook. Headed out to their headquarters, did all of the one-on-one interviews, and then there was a code screen, and I was not grinding LeetCode for the last five months and completely bombed it. And they said, “Thank you very much for your time.” So that was an example of, I think they probably could have saved the time for the interview if they had given me the code screen first. Andrey: It’s funny, there was a time in my life where I interviewed at Facebook, too. I mean, this is probably 2014 or something. Seth: Mm-hmm, mm-hmm. Andrey: And they did do the coding screen before. Seth: Who knows? Who knows, dude? [00:05:15] THE PAPER Seth: Okay, so interviews, we do them. People seem to give information, take information from them. How can this be made more efficient with AI? That’s today’s question. In order to learn more about that, we read Voice in AI Firms: A Natural Field Experiment on Automated Job Interviews, by friend of the show, Brian Jabrian and Luca Henkel. I was interested in this paper because it’s kind of an interesting flip side of what we just saw from Bo. I guess before we talk too much about what the paper actually does, it’s time for us to go into our priors. ═══════════════════════════════════════════════════════════════════ [00:06:00] PRIORS Seth: Okay, so Andrey, when we’re thinking about AI being used in interviews, what sort of thoughts do you have about that going in? What sort of priors should we be exchanging? 
Andrey: Yeah, I mean, I think just when I first saw this paper, I was kind of surprised that we were there already, honestly. I think interviewing via voice is a pretty delicate thing, and the fact that AI is potentially able to do it already was—I hadn’t been thinking—I didn’t think we were there yet, and I think just the very existence of this paper was a bit of a surprise when I first saw it. But I guess a first natural prior that we can think about is: is using an AI to interview someone rather than using a human to interview someone, is that better or worse, or how do we think about that? So, Seth, what do you think? Seth: Well, it’s a big question, Andrey. I guess my first response is, like we always say in this podcast, context matters, partial equilibrium versus general equilibrium matters. The context that we’re going to be looking at in the paper is call center workers. So maybe I’ll give kind of a different answer for short-term call center workers than maybe longer term economy as a whole. When I think about call center workers, I think about a job that seems to be—no offense to our friends of the show out there who are call center workers—but this does seem like one of the jobs that is going to be the first to be automated with generative AI, or most at risk, especially kind of low-skilled call center work. So if there was going to be any sort of domain where you could automatically verify whether someone was good at it, intuitively, it would be the domain that you’re kind of close to automating anyway. So if it was going to work anywhere, I would say it would work here. And yet still, call center work, you might imagine, it requires a lot of personal empathy, it requires maybe some subtleties of voice and accent that an AI might not identify or even might hesitate to point out such deficits. I would say I kind of went in with the idea that for call center workers, maybe there’s a forty percent chance that AI would be better than a human interviewer. So maybe it’s slightly unlikely that it would be better. But if we were to expand out to kind of knowledge work as a whole, I would be more, even more pessimistic, maybe only a twenty-five percent chance or lower that the AI interviewer would be better. What do you think? Andrey: Well, how would you—what do you mean by better? Seth: Oh, well, better in terms of the hire is ultimately the correct match, right? That’s going to be operationalized in a specific way in this paper, what... How they’re going to measure better match, but, yeah, that’s what I would say. They hire someone who’s going to be productive and work with the firm for a long time. Andrey: Yeah. I mean, so that’s ki

    1h 1m
  3. JAN 13

    Anecdotes from AI Supercharged Science

    Anecdotes of AI Supercharged Science: Justified Posteriors reads “Early Science Acceleration Experiments with GPT-5” In this episode, Seth and Andrey break down OpenAI’s report, Early Science Acceleration Experiments with GPT-5. The paper is organized as a series of anecdotes about how top scientists used an early version of GPT-5 in their scientific investigations. The coauthors of the papers try out the model to help them with everything from Erdős’ unsolved math problems to understanding black hole symmetries to interpreting the results of a biological experiment. Seth and Andrey’s priors revolve around whether current models are closer to a “superpowered lit review” or a genuine co-author. They bring in how they currently use LLMs in their own economic research—from coding assistance to "middle-brow" theorizing—before diving into the paper’s anecdotes. They also discuss the economics of AI science and whether AI can ever achieve a Kuhnian paradigm shift. A key question is what is the main bottleneck to more useful AI tools for math and science — is it the model’s reasoning capability or simply the lack of translation layers into formal proof systems like Lean? Priors Hypothesis 1: What is the most promising paradigm for AI in Science today and 5 years from now? (The four paradigms: Recreating frontier science, Superpowered Lit Review, Working with AI/Co-working, and AI on its own). * Andrey’s View: * Today: “Working with AI” (Co-working) is the primary mode. It doesn’t automate the job but makes the human significantly more productive. * In 5 Years: “Working with AI” remains the dominant mode. While “AI on its own” is the holy grail, he believes human-AI collaboration will still be the standard, though the tasks will shift higher up the stack. * Seth’s View: * Today: “Superpowered Lit Review” is the clearest “no-downside win.” Checking if a problem is already solved offers massive efficiency gains without the risk of hallucination inherent in creative work. * In 5 Years: “AI on its own”—but with a major caveat based on Thomas Kuhn’s philosophy. Seth predicts AI will be capable of autonomous “Normal Science” (puzzle solving within a paradigm) but skeptical it can achieve “Revolutionary Science” (creating new paradigms like molecular motion theory or relativity). Hypothesis 2: How impressed will we be by the anecdotes in this report? (On a scale of 0 to 10, where 10 is “Holy Sh*t / Curing Cancer” and 0 is “Trivial”). * Andrey’s View: * Estimate: “Pretty Impressed” (Implied ~7/10). * Reasoning: He does not expect a “Holy Sh*t” moment (like curing cancer or solving the Riemann hypothesis) because those results take years to verify or diffuse. However, he expects to see strong productivity gains in “middle-brow” theory. * Seth’s View: * Estimate: 7 or 8 out of 10. * Reasoning: He prices in that this is a “highly selected sample” from OpenAI marketing. He expects to be impressed but skeptical of direct practical applications (e.g., a medical treatment we can use in the near future). Links + Shownotes * Early Science Acceleration Experiments with GPT-5 – The central paper of the episode by Sébastien Bubeck, Timothy Gowers, and others (OpenAI/arXiv, Nov 2025). * Sparks of Artificial General Intelligence: Early experiments with GPT-4 – The predecessor paper by Sebastian Bubeck et al. (for context on the “Early Experiments” series). Scholars Mentioned * Benjamin Golub – Podcast guest in a recent episode; Professor of Economics and Computer Science at Northwestern University. 
We say the episode with Golub is upcoming, but it’s already out! Check it out here. * Timothy Gowers – Fields Medalist and co-author of the paper. * Sébastien Bubeck – Lead author of the paper and researcher at OpenAI. * Terence Tao – Fields Medalist mentioned for his use of AI in mathematics. * Imre Lakatos – A philosopher of science. * Tyler Cowen – Economist mentioned regarding the concept of “Writing for the AI.” * Paul Erdős Problems – The unsolved problems of this famously prolific mathematician were used as a benchmark. Tools & Technology * Refine.ink – The AI-for-science tool co-founded by Ben Golub. * Lean – The theorem prover and programming language discussed as a potential bottleneck/accelerant for checking AI math. * Elicit – The AI research assistant mentioned by Andrey for literature reviews. * Pangram Labs – The AI text detection tool mentioned in the context of scientific writing. Concepts & Philosophy * The Structure of Scientific Revolutions – Thomas Kuhn’s foundational text on “Normal Science” vs. “Paradigm Shifts.” * The Lucas Critique – Economic theory mentioned by Seth regarding a recent economic paradigm shift. Transcript: [00:00] Seth Benzell: Welcome to the Justified Posteriors podcast, the podcast that updates its beliefs about the economics of AI and technology. I’m Seth Benzell, sharing helpful ideas that come naturally to me, but not quite big enough a contribution to demand co-authorship, at Chapman University in sunny Southern California. [00:33] Andrey Fradkin: And I’m Andrey Fradkin, experimenting with numerous ways to use AI in order to make the trivial parts of my work take way less time. But then again, maybe all parts of my work are trivial. Coming to you from San Francisco, California. [00:53] Seth: All right, Andrey. Coming out the gate against himself. [00:58] Andrey: That’s the only way I know how to be, Seth. That’s the only way. [01:03] Seth: Well, I mean, maybe that’s a good place to start. I know that you use LLMs all the time as part of your research. We could talk a little bit as we go along about how you use it now, but maybe you could tell me: how do you use it now and how would your dream AI assistant help you with research? Is your dream to completely delegate it? What would be a reasonable near-term dream? What do you have and what do you want? [01:31] Andrey: Yeah. Wow. I didn’t realize it was already Christmas. Readers, we’re recording this in November, so it’s not quite there yet. [01:41] Seth: Mariah Carey is on the way, dude. [01:44] Andrey: So, look, I use it all the time. And I proactively use it because I’m always trying to figure out what it’s capable of doing and what it’s not capable of doing. You know, in terms of the science part of our work—which is a big part of it, but a lot of what we do is also presentation, communication, reimbursement requests... [02:12] Seth: [Laughs] Reimbursement requests. [02:14] Andrey: Yeah. But in terms of science, some parts of my work require some math, right? Not very complicated math. And I’ve been using the latest generation of AIs to see how well it does there. And, you know, it’s pretty good, honestly. It definitely requires oversight. Like, I wouldn’t trust it to just do it. But with some iteration, it has given me good results and it’s allowed me to check some of my results. And once we’re kind of agreed—me and the model—on what the results are, it’s very efficient at writing it up. 
And even doing things like, “Oh, create a simulation based on this model,” or “Create an interactive visualization based on this model.” So I think that sort of work, it’s already pretty good at. [03:17] Seth: Actually, can I ask a quick question here before you go on? You’ve described it as a system that is maybe like... it guesses and then you have to check it. So you have this sort of iteration. You say, “Solve for the equilibrium of this model,” and you’re not guaranteed that the first output is going to be correct. So that’s a sense in which the AI is proposing solutions and you’re the verifier. But you also find it useful for the opposite, right? Where you have an intuition about a result and then it’s the verifier. Should I notice a contradiction there? [03:56] Andrey: I don’t think it’s a contradiction. I think as with any results or ideas, we want to battle-test it, right? And that could go in either direction. It’s kind of like when you give an academic seminar. You’re going to present some work and you’re going to get feedback from a bunch of people. Some of it might be good, some of it might be bad. But you might also go to your co-author and they might create something new. So I don’t view it as a contradiction. I guess one way to think about it is that it’s not omniscient, right? So it isn’t like doing things end-to-end without my judgment yet. I can’t just give it a prompt and then it finishes the entire task. [04:54] Seth: It sounds kind of like a colleague with some knowledge in the domain. [04:59] Andrey: Yes, exactly. [05:01] Seth: It might be able to propose an answer that isn’t necessarily right, and it might find a flaw in one of your ideas—those aren’t necessarily right either—but you would never use it as its own end-to-end proof to write it up and present it at Columbia. [05:19] Andrey: Yeah, yeah. And then the other thing is... what I’ve been talking about is more on the theoretical side. And certainly, I’m not a theorist, so it’s not like I’m doing very complicated things there. But on the empirical side, it’s also very useful. And once again, I found that it’s not giving me end-to-end results. If I just told it, let’s say, “Hey, I have this natural experiment and I’d like you to measure the causal effect,” it’s definitely not going to give me what I want. And maybe that’s underspecified. Or maybe it doesn’t have my taste for what type of evidence I like. But once I give it enough—maybe an initial sketch of the identification strategy—it can very easily automate. Let’s say I did this for one country and I want to replicate that analysis for another country... [06:30] Seth: I want you to use rainfall as an instrument. [06:32] Andrey: Yeah. “I did the analysis for

    1h 7m
  4. 12/29/2025

    Ben Golub: AI Referees, Social Learning, and Virtual Currencies

In this episode, we sit down with Ben Golub, economist at Northwestern University, to talk about what happens when AI meets academic research, social learning, and network theory. We start with Ben’s startup Refine, an AI-powered technical referee for academic papers. From there, the conversation ranges widely: how scholars should think about tooling, why “slop” is now cheap, how eigenvalues explain viral growth, and what large language models might do to collective belief formation. We get math, economics, startups, misinformation, and even cow tipping. Links & References * Refine — AI referee for academic papers * Harmonic — Formal verification and proof tooling for mathematics * Matthew O. Jackson — Stanford economist and leading scholar of networks and social learning * Cow tipping (myth) — Why you can’t actually tip a cow (physics + folklore) * The Hype Machine — Sinan Aral on how social platforms amplify misinformation * Sequential learning / information cascades / DeGroot Model * AI Village — Multi-agent AI simulations and emergent behavior experiments * Virtual currencies & Quora credits — Internal markets for attention and incentives Transcript: Seth: Welcome to Justified Posteriors, the podcast that updates its beliefs about the economics of AI and technology. Seth: I’m Seth Benzell, hoping my posteriors are half as good as the average of my erudite friends’, coming to you from Chapman University in sunny Southern California. Andrey: And I’m Andrey Fradkin coming to you from San Francisco, California, and I’m very excited that our guest for today is Ben Golub, who is a prominent economist at Northwestern University. Ben has won the Calvó-Armengol International Prize, which recognizes a top researcher in economics or social science, younger than 40 years old, for contributions to theory and comprehension of mechanisms of social interaction. Andrey: So if you want someone to analyze your social interactions, Ben is definitely the guy. Seth: If it’s in the network. Andrey: Yeah, he is. He was also a member of the Harvard Society of Fellows and had a brief stint working as an intern at Quora, and we’ve known each other for a long time. So welcome to the show, Ben. Ben: Thank you, Andrey. Thank you, Seth. It’s wonderful to be on your podcast. Refine: AI-Powered Paper Reviewing Andrey: All right. Let’s get started. I want us to get started on what’s very likely been the most on your mind thing, Ben, which is your new endeavor, Refine.Ink. Why don’t you tell us a little bit about it? Give us the three-minute spiel about what you’re doing. Seth: And tell us why you didn’t name your tech startup after a Lord of the Rings character. Ben: Man, that’s a curveball right there. All right, I’ll tell you what, I’ll put that on background processing. So, what Refine is, is it’s an AI technical referee. From a user perspective, what happens is you just give it a paper and you get the experience of a really obsessive research assistant reading for as long as it takes to get through the whole thing, probing it from every angle, asking every lawyerly question about whether things make sense. Ben: And then that feedback, hopefully the really valuable parts that an author would wanna know, are distilled and delivered. So as my co-founder Yann Calvó López puts it, obsessiveness is really the nature of the company. We just bottled it up and we give it to people. So that’s the basic product—it’s an AI tool. It uses AI obviously to do all of this thinking. 
One thing I’ll say about it is that I have long felt it was a scandal that the level of tooling for scholars is a tiny fraction of what it is for software engineers. Ben: And obviously software engineering is a much larger and more economically valuable... Seth: Boo. Ben: ...at least... Andrey: Oh, disagree. Ben: In certain immediate quantifications. But ever since I’ve been using tech, I just felt: imagine if we had really good tools. And then there was this perfect storm where my co-founder and I felt we could make a tool that was state of the art for now. So that’s how I think of it. Seth: I have to quibble with you a little bit about the user experience, because the way I went, the step zero was, first, jaw drops to the floor at the sticker price. How much do you... Ben: ...not... Seth: But then I will say I have used it myself, and on a paper I recently submitted, it really did find a technical error, and I would say a kind of error that you wouldn’t find just throwing this into ChatGPT as of a few months ago. Who knows with the latest Gemini. But it really impressed me with my limited time using it. Andrey: So. Ben: It is probably, if you think about the sticker price, if you compare that to the amount of time you’d have had to pay to find the error... Seth: Yeah. And water. If I didn’t have water, I’d die, so I should pay a million for water. Andrey: A question I had: how do you know it’s good? Isn’t this whole evals thing very tricky? Seth: Hmm. Andrey: Is there a paper review benchmark that you’ve come across, or did you develop your own? Ben: Yeah. That’s a wonderful question. As Andrey knows, he’s a super insightful person about AI and this goes to the core of the issue, because all the engineers we work with are immediately like, okay, I get what you’re doing. Ben: Give me the evals, give me the standard of quality, so we know we’re objectively doing a good job. What we have are a set of papers where we know what ground truth is. We basically know everything that’s wrong with them, and we run them on every model update, so that’s a small set of fairly manual evaluations that’s available. I think one of the things that users experience is they know their own papers well and can see over time that sometimes we find issues that they know about and then sometimes we find other issues and we can see whether they’re correct. Ben: We’re not at the point where we can make confident precision-recall type assessments. But another thing that we do, which I find cool, is whenever tools from our competitors come out. Like, Andrew Ng put out a cool paper reviewer thing targeted at CS conferences. Ben: And what we do is we just run that thing, we run our thing, we put both of them into Gemini 2.0, and we say, could you please assess these side by side as reviews of the same paper? Which one caught mistakes? We try to make it a very neutral prompt, and that’s an eval that is easy to carry out. Ben: But actually we’re in the market. We’d love to work with people who are excited about doing this for Refine. We finally have the resources to take a serious run at it as founders. The simple truth is, because my co-founder and I are researchers as well as founders, we constantly look at how it’s doing on documents we know. Ben: And it’s a very seat-of-the-pants thing for now, to tell the truth. 
Andrey: Do you think that there’s an aspect of data-driven here and that one of your friends puts their paper into it and says, well, you didn’t catch this mistake, or you didn’t catch that mistake, and then you optimize towards that. Is that a big part of your development process? Ben: Yeah, it was more. I think we’ve reached an equilibrium where of the feedback of that form we hear, there’s usually a cost to catching it. But early on that was basically, I would just tell everyone I could find, and there were a few. When I finally had the courage to tell my main academic group chat about it and I gave it, immediately people had very clear feedback and this was in the deep, I think the first reasoning model we used for the substantive feedback was DeepSeek R1 and people, we immediately felt, okay, this is 90% slop. Ben: And that’s where we started by iterating. We got to where, and one great thing about having academic friends is they’re not gonna be shy to tell you that your thought of paper. Refereeing Math and AI for Economic Theory Andrey: One thing that we wanted to dig a little bit into is how you think about refereeing math and Seth: Mm-hmm. Andrey: More generally opening it up to how are economic theorists using AI for math? Ben: So say a little more about your question. When you say math Seth: Well, we see people, Axiom, I think is the name of the company, immediately converting these written proofs into Lean. Is that the end game for your tool? Ben: I see, yes. So good. Our vision for the company is that, at least for quite a while, I think there’s gonna be this product layer between tools, the core AI models and the things that are necessary to bring your median, ambitious Seth: Middle Ben: not Seth: theorists, that’s what we call ourselves. Ben: Well, yeah. Or middle, but in a technical dimension, I think it’s almost certainly true that the median economist doesn’t use GitHub almost ever. If you told them, they set up something that, a tool that works through the terminal, think about Harmonic, right? Ben: Their tools are all, they say the first step is, go grab this from a repository and run these command line things to, they try to make it pretty easy, but it’s still a terminal tool. So a big picture vision is that we think the most sophisticated tools will be, there will be a lot of them that are not yet productized and we can just make the bundle for scholars to actually use it in their work. Ben: Now about the question of formalization per se, I have always been excited to use formalization in particular to make that product experience happen. For formalized math, my understanding is right now the coverage of the auto formalization systems is very jagged across, even across. If you compare number theory to algebraic geometry, the former is in good shape for people to start solving Erdős problems or combinatorial number theory, things like that, people can just start doing that. For algebraic geometry, there are a lot of basics that aren

    1h 16m
  5. 12/15/2025

    Are We There Yet? Evaluating METR’s Eval of AI’s Ability to Complete Tasks of Different Lengths

Seth and Andrey are back to evaluating an AI evaluation, this time discussing METR’s paper “Measuring AI Ability to Complete Long Tasks.” The paper’s central claim is that the “effective horizon” of AI agents—the length of tasks they can complete autonomously—is doubling every 7 months. Extrapolate that, and AI handles month-long projects by decade’s end. They discuss the data and the assumptions that go into this benchmark. Seth and Andrey start by walking through the tests of task length, from simple atomic actions to the 8-hour research simulations in RE-Bench. They discuss whether the paper properly measures the median task length at which models succeed with its logarithmic models. And, of course, they zoom out to ask whether “time” is even the right metric for AI capability, and whether METR applies the concept correctly. Our hosts also point out other limitations and open questions the eval leaves us with. Does the paper properly acknowledge how messy long tasks get in practice? AI still struggles with things like playing Pokémon or coordinating in AI Village—tasks that are hard to decompose cleanly. Can completing one 10-hour task really be equated with reliably completing ten 1-hour subtasks? And Seth has a bone to pick about a very important study detail omitted from the introduction. The Priors that We Update On Are: * Is evaluating AI by time (task length) more useful/robust than evaluating by economic value (as seen in OpenAI’s GDP-eval)? * How long until an AI can autonomously complete a “human-month” sized task (defined here as a solid second draft of an economics paper, given data and research question)? * Seth’s Prior: 50/50 in 5 years, >90% in 10 years. * Andrey’s Prior: 50/50 in 5 years, almost certain in 10 years. Listen to see how our perspectives change after reading! Links & Mentions: * The Paper: Measuring AI Ability to Complete Long Tasks by METR * Complementary Benchmarks: * RE-Bench (Research Engineering Benchmark) - METR’s eval for AI R&D capabilities. * H-CAST (Human-Calibrated Autonomy Software Tasks) - The benchmark of 189 tasks used in the study. * The “Other” Eval: GDP-eval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks by OpenAI * AI 2027 (A forecasting scenario discussed) * AI Village - A project where AI agents attempt to coordinate on real-world tasks. * Steve Newman on the “100 Person-Year” Project (Creator of Writely/Google Docs). * In the Beginning... Was the Command Line by Neal Stephenson * Raj Chetty Transcript: [00:14] Seth Benzell: Welcome to the Justified Posteriors podcast, the podcast that updates its beliefs about the economics of AI and technology. I’m Seth Benzell, wondering just how long a task developing an AI evaluation is, at Chapman University in sunny Southern California. Andrey Fradkin: And I’m Andrey Fradkin, becoming very sad as the rate of improvement in my ability to do tasks is nowhere near the rate at which AI is improving. Coming to you from San Francisco, California. Andrey: All right, Seth. You mentioned how long it takes to do an eval. I think this is going to be a little bit of a theme of our podcast: how, actually, evals are pretty hard and expensive to do. Recently there was a Twitter exchange involving one of the METR members talking about their eval, which we’ll be talking about today, where he says that for each new model, evaluation takes approximately 25 hours of staff time, but maybe even more like 60 hours in rougher cases. 
And that’s not even counting all the compute that’s required to do these evaluations. So, you know, evals get thrown around. I think people who know evals know how hard they are, but I think as outsiders, we take them for granted. And we shouldn’t, because it certainly takes a lot of work. But yeah, with that in mind, what do you want to say, Seth? Seth: Well, I guess I want to say that we, I think we are the leaders in changing people’s opinions about the importance of these evals. The public responded very positively to our recent eval of OpenAI’s GDP-eval, which was trying to bring Daron Acemoglu’s view of how we can evaluate the potential economic impact of AI down to an actual task-by-task look at how successful this AI system is. People loved it. Now you demanded it, we listened. We’re coming back to you to talk to you about a new eval—well not a new eval, it’s about eight months old, but it’s the Godzilla of evals. It’s the Kaiju of evals. It’s this paper called “Measuring AI Ability to Complete Long Tasks,” a study put out by METR. We’ve seen some updates or new evaluations of models since this first came out in March of 2025. Andrey, do you want to list the authors of this paper? [03:05] Andrey: As usual I don’t. There are a lot of authors of this paper. But, you know, I’ve interacted with some of the authors of this paper, I have a lot of respect for them. I have a lot of respect for the METR organization. Seth: Okay. But at a high level, just in a sentence, what this wants to do is evaluate different frontier AI models by the criterion of: “How long are the tasks that they complete?” Andrey: I guess what I would say before we get to our priors is, just as context, this, from everything I’ve seen, is the most influential evaluation of AI progress in the world right now. It is a measure that all important new models are benchmarked against. If something is above the trend, it’s news. If something is below the trend, it’s news. If something’s on the trend, it’s news. And it’s caused a lot of people to change their minds about the likely path of AI progress. So I’m very excited to discuss this. Seth: It’s been the source of many “we’re so back” memes. Yeah, I totally agree, Andrey. Am I right that this was a paper that was partly inspiring the AI 2027 scenario by favorite blogger Scott Alexander? Andrey: I don’t know if it inspired it, but I think it was used as part of the evidence in that. Just to be clear though, AI 2027 is a scenario that was proposed that seemed, to many folks, a bit too soon of a vision for AGI taking over the world. We have not done an episode on it. Seth: We haven’t done an episode on it. But it’s fair to say that people look at the results of this paper and they see, you know, they see a trend that they extrapolate. But before we get into the details of the paper, are we ready to get into our priors? Andrey: Let’s do it. [05:50] Seth: Okay, so Andrey, just based on that headline description, that instead of evaluating AI systems by trying to go occupation by occupation and trying to find tasks in those occupations that are economically valuable and then trying to see what percentage of those tasks the AI can do—that’s what the OpenAI GDPval approach that we recently reviewed did—this approach is trying to evaluate tasks again by how long they are. 
So comparing those two approaches, I guess my first prior is, before we read this paper, which of those approaches do you see as like kind of intuitively more promising? Andrey: One way of thinking about this is tasks, or things people do which could be a series of tasks, are bundles, and they’re bundles embedded in some higher-dimensional space. And what these two evals are doing, this one we’re discussing here versus GDPval, is they’re embedding them into different spaces. One of them is a time metric. And one of them is a dollar metric, right? And just by phrasing it that way, you can see what some of the issues might be with either. With the dollar metric, well, what are people getting paid for? Is it a specific deliverable or is it being on call or being the responsible party for something? So you can see how it’s kind of hard to really convert lots of things into dollar values at a systematic level. Now, you can say the same thing about how long it takes to do something. Of course, it takes different people very different times to do different tasks. And then, once again, chaining tasks together, how to think about how long it takes to do that. So I think they’re surprisingly similar. I think maybe this length-of-time one is more useful at the moment because it seems simpler to do, frankly. It seems like, yes, we can get an estimate for how long it takes to do something. It’s not going to be perfect, it’s going to be noisy, but we can get it and then we can just see whether the model does it. And that’s easier than trying to translate tasks to dollar values, in my opinion. [08:42] Seth: Right. I guess I also am tempted to reject the premise of this question and say that they’re valuable for different things. But I guess I come into this thinking about, you know, we think about AI agents as opposed to AI tools as being this next frontier of automation and potentially supercharging the economy. And it really does feel like the case that working with AI models, the rate limiter is the human. It’s how often the human has to stop and give feedback and say, “Okay, here’s the next step,” or “Hey, back up a little bit and try again.” So going in, I would say I was kind of in equipoise about which of the two is the most useful kind of as a projection for where this is going. Maybe on your side of the ledger saying that economic value is kind of a socioeconomic construct, right? That could definitely change a lot even without the tool changing. Whereas time seems more innately connected to difficulty. You can think about psychometric measures of difficulty where we think about, you know, a harder exam is a longer exam. So at least going in, I think that this has a lot of potential to even potentially surpass GDP-eval in terms of its value for projection. Andrey: Yes. Yeah, yeah. Seth: Okay. The next one I was thinking to ask you, Andrey, was, if we buy all the premises of whatever context the paper sets up for us, the question I’d like to think about is: ho

    1h 6m
  6. 12/02/2025

    Epistemic Apocalypse and Prediction Markets (Bo Cowgill Pt. 2)

We continue our conversation with Columbia professor Bo Cowgill. We start with a detour through Roman Jakobson’s six functions of language (plus two bonus functions Seth insists on adding: performative and incantatory). Can LLMs handle the referential? The expressive? The poetic? What about magic? The conversation gets properly technical as we dig into Crawford-Sobel cheap talk models, the collapse of costly signaling, and whether “pay to apply” is the inevitable market response to a world where everyone can produce indistinguishable text. Bo argues we’ll see more referral hiring (your network as the last remaining credible signal), while Andrey is convinced LinkedIn Premium’s limited signals are just the beginning of mechanism design for application markets. We take a detour into Bo’s earlier life running Google’s internal prediction markets (once the largest known corporate prediction market), why companies still don’t use them for decision-making despite strong forecasting performance, and whether AI agents participating in prediction markets will have correlated errors if they all derive from the same foundation models. We then discuss whether AI-generated content will create demand for cryptographic proof of authenticity, whether “proof of humanity” protocols can scale, and whether Bo’s 4-year-old daughter’s exposure to AI-generated squirrel videos constitutes evidence of aggregate information loss. Finally: the superhuman persuasion debate. Andrey clarifies he doesn’t believe in compiler-level brain hacks (sorry, Snow Crash fans), Bo presents survey evidence that 85% of GenAI usage involves content meant for others, and Seth closes with the contrarian hot take that information transmission will actually improve on net. General equilibrium saves us all—assuming a spherical cow. Topics Covered: * Jakobson’s functions of language (all eight of them, apparently) * Signaling theory and the pooling equilibrium problem * Crawford-Sobel cheap talk games and babbling equilibria * “Pay to apply” as incentive-compatible mechanism design * Corporate prediction markets and conflicts of interest * The ABC conjecture and math as a social enterprise * Cryptographic verification and proof of humanity * Why live performance and in-person activities may increase in economic value * The Coasean singularity * Robin Hanson’s “everything is signaling” worldview Papers & References: * Crawford & Sobel (1982), “Strategic Information Transmission” * Cowgill and Zitzewitz (2015), “Corporate Prediction Markets: Evidence from Google, Ford, and Firm X” * Jakobson, “Linguistics and Poetics” (1960) * Binet, The Seventh Function of Language * Stephenson, Snow Crash Transcript: Andrey: Well, let’s go to speculation mode. Seth: All right. Speculation mode. I have a proposal that I’m gonna ask you guys to indulge me in as we think about how AI will affect communication in the economy. For my book club, I’ve been recently reading some postmodern fiction. In particular, a book called The Seventh Function of Language. The book is a reference to Jakobson’s six famous functions of language. He is a semioticist who is interested in how language functions in society, and he says language functions in six ways. I’m gonna add two bonus ones to that, because of course there are seven functions of language, not just six. Maybe this will be a good framework for us to think about how AI will change different functions of language. All right. Are you ready for me? Bo Cowgill: Yes. Seth: Bo’s ready. Okay. 
Bo Cowgill: Remember all six when you... Seth: No, we’re gonna do ‘em one by one. Okay. The first is the Referential or Informational function. This is just: is the language conveying facts about the world or not? Object level first. No Straussian stuff. Just very literally telling you a thing. When I think about how LLMs will do at this task, we think that LLMs at least have the potential to be more accurate, right? If we’re thinking about cover letters, the LLMs should maybe do a better job at choosing which facts to describe. Clearly there might be an element of choosing which facts to report as being the most relevant. We can think about, maybe that’s in a different function. If we ask about how LLMs change podcasts? Well, presumably an LLM-based podcast, if the LLM was good enough, would get stuff right more often. I’m sure I make errors. Andrey doesn’t make errors. So restricting attention to this object-level, “is the language conveying the facts it needs to convey,” how do you see LLMs changing communication? Bo Cowgill: Do I go first? Seth: Yeah, of course Bo, you’re the guest. Bo Cowgill: Of course. Sorry, I should’ve known. Well, it sounds like you’re optimistic that it’ll improve. Is that right? Seth: I think that if we’re talking about hallucinations, those will be increasingly fixed and be a non-issue for things like CVs and resumes in the next couple of years. And then it becomes the question of: would an LLM be less able to correctly report on commonly agreed-upon facts than a human? I don’t know. The couple-years-out LLM, you gotta figure, is gonna be pretty good at reliably reproducing facts that are agreed upon. Bo Cowgill: Yeah, I see what you mean. So, I’m gonna say “it depends,” but I’ll tell you exactly what I think it depends on. I think in instances where the sender and the receiver are basically playing a zero-sum game, I don’t think that the LLM is gonna help. And arguably, nothing is gonna help. Maybe costly signaling could help, but... Seth: Sender and the receiver are playing a zero-sum game? If I wanna hire someone, that’s a positive-sum game, I thought. Andrey: Two senders are playing a zero-sum game. Seth: Oh, two senders. Yes. Two senders are zero-sum with each other. Okay. Bo Cowgill: Right. This is another domain-specific answer, but I think that it depends on what game the two parties are playing. Are they trying to coordinate on something? Is it a zero-sum game where they have total opposite objectives? If all costly signaling has been destroyed, then I don’t think that the LLM is gonna help overcome that total separation. On the other hand, if there’s some alignment between sender and receiver—even in a cheap talk world—we know from the Crawford and Sobel literature that you can have communication happen even without the cost of a signal. I do think that in those Crawford and Sobel games, you have these multiple equilibria ranging from the babbling equilibrium to the much more precise one. And it seems like, if I’m trying to communicate with Seth costlessly, and all costly signal has been destroyed so we only have cheap talk, the LLM could put us on a more communicative equilibrium. Seth: We could say more if we’re at the level where you trust me. The LLM can tell you more facts than I ever could. Bo Cowgill: Right. Put us into those more fine partitions in the cheap talk literature. At least that’s how I think the potential for it to help would go. 
Andrey: I wanna jump in a little bit because I’m a little bit worried for our listeners if we have to go through eight... Seth: You’re gonna love these functions, dude. They’re gonna love... this is gonna be the highlight of the episode. Andrey: I guess rather than having a discussion after every single one, I think it’s just good to list them and then we can talk. Seth: Okay. That’ll help Bo at least. I don’t know if the audience needs this; the audience is up to date with all the most lame postmodern literature. So for the sake of Bo, though, I’ll give you the six functions plus two bonus functions.
* Informational: Literal truth.
* Expressive (or Emotive): Expressing something about the sender. This is what actually seems to break in your paper: I can’t express that I’m a good worker bee if now everybody can easily express they’re good worker bees.
* Conative (or Directive): The rhetorical element. That’s the “I am going to figure out how to flatter you and persuade you,” not necessarily on a factual level. That’s the zero-sum game maybe you were just talking about.
* Phatic: This is funny. This is the language used to just maintain communications. So the way I’m thinking about this is if we’re in an automated setting, you know how they have those “dead man’s switches” where it’s like, “If I ever die, my lawyer will send the information to the federal government.” And so you might have a message from your heart being like, “Bo’s alive. Bo’s alive. Bo’s alive.” And then the problem is when the message doesn’t go.
* Metalingual (or Metalinguistic): Language to talk about language. You can tell me if you think LLMs have anything to help us with there.
* Poetic: Language as beautiful for the sake of language. Maybe LLMs will change how beautiful language is.
* Performative: This comes to us from John Searle, who talks about, “I now pronounce you man and wife.” That’s a function of language that is different than conveying information. It’s an act. And maybe LLMs can or can’t do those acts.
* Incantatory (Magic): The most important function. Doing magic. You can come back to us about whether or not LLMs are capable of magic.
Okay? So there’s eight functions of language for you. LLMs gonna change language? All right. Take any of them, Bo. Andrey: Seth, can I reframe the question? I try to be more grounded in what might be empirically falsifiable. We have these ideas that in certain domains—and we can focus on the jobs one—LLMs are going to be writing a lot of the language that was previously written by humans, presumably by the human that was sending the signal. So how is that going to affect how people find jobs in the future? And how do we think this market is gonna adjust as a result? Do you have any thoughts on that? Bo Cowgill: Yeah. So I guess the reframing is about how

    1h 2m
  7. Does AI Cheapen Talk? (Bo Cowgill Pt. 1)

    11/18/2025

    Does AI Cheapen Talk? (Bo Cowgill Pt. 1)

    In this episode, we brought on our friend Bo Cowgill to dissect his forthcoming Management Science paper, Does AI Cheapen Talk? The core question is one economists have been circling since Spence drew a line on the blackboard: What happens when a technology makes costly signals cheap? If GenAI allows anyone to produce polished pitches, résumés, and cover letters, what happens to screening, hiring, and the entire communication equilibrium? Bo’s answer: it depends. Under some conditions, GenAI induces an epistemic apocalypse, flattening signals and confusing recruiters. In others, it reveals skill even more sharply, giving high-types superpowers. The episode walks through the theory, the experiment, and the implications. Transcript: Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its priors about the economics of AI and technology. I’m Seth Benzell, certifying my humanity with takes so implausible that no softmax could ever select them at Chapman University in sunny Southern California. Andrey: And I am Andrey Fradkin, collecting my friends in all sorts of digital media formats, coming to you from San Francisco, California. Today we’re very excited to have Bo Cowgill with us. Bo is a friend of the show and a listener of the show, so it’s a real treat to have him. He is an assistant professor at Columbia Business School and has done really important research on hiring, on prediction markets, and now on AI and the intersection of those topics. And he’s also won some very cool prizes. I’ll mention that he was on the list of the best 40 business school professors. So he is one of those professors that’s really captivating for his students. So yeah. Welcome, Bo. Bo Cowgill: Thank you so much. It’s awesome to be here. Thanks so much for having me on the podcast. Seth: What do you value about the podcast? That’s something I’ve been trying to figure out because I just do the podcast for me. I’m just having a lot of fun here with Andrey. Anything I can do to get this guy’s attention to talk about interesting stuff for 10 minutes? Why do you like the podcast? What can we do to make this an even better podcast for assistant professors at Columbia? Bo Cowgill: Well, I don’t wanna speak for all assistant professors at Columbia, but one thing it does well is aggregate papers about AI that are coming out from around the ecosystem and random places. I think it’s hard for anybody to catch all of these, so you guys do a great job. I did learn about new papers from the podcast sometimes. Another cool thing I think is there is some continuity across podcast episodes about themes and arbitrage between different topics and across even different disciplines and domains. So I think this is another thing you don’t get necessarily just kind of thumbing around papers yourself. Seth: So flattering. So now I can ask you a follow-up question, which is: obviously you’re enjoying our communication to you. A podcast is kind of a one-dimensional communication. Now we’ve got the interview going, we’ve got this back and forth. How would you think about the experience of the podcast changing if a really, really, really good AI that had read all of my papers and all of Andrey’s papers went and did the same podcast, same topics? How would that experience change for you? Would it have as much informative content? Would it have as much experiential value? How do you think about that? Bo Cowgill: Well, first of all, I do enjoy y’all’s banter back and forth. I don’t know how well an AI would do that. 
Maybe it would do a perfectly good job with that. I do enjoy the fact that—this is personal to me—but we know a lot of the same people. And in addition to other guests and other paper references, I like to follow some of the inside jokes and whatnot. I don’t know if that’s all that big of a deal for the average person. But I have listened to at least the latest version of NotebookLM and its ability to do a quote-unquote “deep dive podcast” on anything. And at least recently I’ve been pleased with those. I don’t know if you’ve ever tried putting in like a bad paper in theirs, and then it will of course just say, “Oh, this is the greatest paper. It’s so interesting.” Seth: Right. Bo Cowgill: You can. Seth: So that’s a little bit different, maybe slightly different than our approach. Bo Cowgill: Well, yeah, for sure. Although you can also tell NotebookLM to try to find problems and be a little bit more critical. And that I think works well too. But yeah, I don’t think we should try to replace you guys with robots just yet. Seth: We’re very highly compensated though. The opportunity cost of Andrey’s time, he could be climbing a mountain right now. Andrey, you take it up. Why are we doing this ourselves? Why isn’t an LLM doing this communication for us? Andrey: Well, mostly it’s because we have fun doing it, and so if the LLM was doing it, then we wouldn’t be having the fun. Seth: There you go. Well put. Experiential value of the act itself. Now, Bo, I did not bring up this question randomly. The reason I raised this question of how does AI modify communication... yeah, I used a softmax process, so it was not random. The reason I’m asking this question about how AI changes communication is because you have some recently accepted, forthcoming work at Management Science trying to bring some theory and empirics to the question of how LLMs change human communication, but now in the context of resumes and job search and job pitches. Do you want to briefly introduce the paper “Does AI Cheapen Talk?” and tell us about your co-authors? Bo Cowgill: Yeah, most definitely. So the paper is called “Does AI Cheapen Talk?”. It is with Natalia Berg-Wright, also at Columbia Business School, and with Pablo Hernandez Lagos, who is a professor at Yeshiva University. And what we’re looking at in this paper is the way people screen job candidates or screen entrepreneurs or, more abstractly, how they kind of screen generally. You could apply our model, I think, to lots of different things. But the core idea behind it kind of goes back to these models from Spence in the 1970s saying that costly signals are more valuable to try to separate types. Seth: Right. If I wanna become a full member of the tribe, I have to go kill a lion. Why is it important for me to kill a lion? It’s not important. The important part is I do a hard thing. Bo Cowgill: Exactly. Yeah. So maybe part of the key to this Spence idea that appears in our paper too is that it’s not just that the signal has to be costly, it has to be kind of differentially costly for different types of people. So maybe in your tribe, killing a lion is easy for tough guys like you, but for wimpier people or something, it’s prohibitively high. And so it’s like a test of your underlying cost parameter for killing lions or for being tough in general. So they go and do this. And I guess what you’re alluding to, which appears in a lot of cases, is the actual value of killing the lion is kind of irrelevant. It was just a test. 
And maybe one of the more potentially depressing implications of that is the idea that what we send our students to do in four-year degrees or even degrees like ours is really just as valuable as killing a lion, which is to say, you’re mainly revealing something about your own costs and your own type and your own skills, and the actual work doesn’t generate all that much value. Seth: Is education training or screening? Bo Cowgill: Right, right, right. Yes. I do think a good amount of it these days is probably screening, and maybe that’s especially true at the MBA level. Andrey: I would just say that, given the rate of hiring for MBAs, I’m not sure that the screening is really happening either. Maybe the screening is happening to get in. Bo Cowgill: What the screening function is now is like, can you get in as the ultimate thing? Seth: Right. And I think as you already suggest, the way this works can flip if there’s a change in opportunity costs, right? So maybe in the past, “Oh, I’m the high type. I go to college.” In the present, “I’m the high type. I’m gonna skip college, I’m gonna be an entrepreneur,” and now going to college is a low signal. Bo Cowgill: Yes. Exactly. So that’s kind of what’s going on in our model too. How are we applying this to job screening and AI? Well, you apply for a job, you have a resume, possibly a cover letter or, if you don’t have an old-fashioned cover letter, you probably have a pitch to a recruiter or to your friend who works at the company. And there are kind of elements of costly signaling in those pitches. So some people could have really smart-sounding pitches that use the right jargon and are kind of up to speed with regards to the latest developments in the industry or in the underlying technology or whatever. And those could actually be really useful signals because the only sort of person who would be up to speed is the one who finds it easy to follow all this information. Seth: Can I pause you for a second? Back before LLMs, when I was in high school, they helped me make a CV or a resume. It’s not like there was ever any monitoring that people had to write their own cover letters. Bo Cowgill: That’s really true. No, some people have said about our paper that this is a more general model of signal dilution, which was happening before AI and the internet and everything. And so one example of this might be SAT tutoring or other forms of help for high school students, like writing your resume for you. Where if something comes along—and this is where GenAI is gonna come in—but if anything comes along that makes it cheaper to produce signals that were once more expensive, at least for some groups, then that changes the informational content of the signal. Seth: If the tribe gets guns, it’s too easy to kill a lion.
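The “signal dilution” logic Bo describes can be written down in a few lines. The sketch below is a toy two-type Spence setup of our own (it is not the model in “Does AI Cheapen Talk?”), with purely hypothetical wages, cost parameters, and function names, just to show how separation depends on the signal being differentially costly and how it can collapse when a tool lowers the low type’s cost:

```python
# A toy two-type Spence sketch (our illustration with hypothetical numbers,
# not the model in "Does AI Cheapen Talk?"). A worker is high (H) or low (L)
# productivity. Producing a polished signal of intensity e (a sharp pitch, a
# degree, a dead lion) costs c_H * e for high types and c_L * e for low types.
# Employers pay w_H to workers who send the signal and w_L to those who do not.
# A separating equilibrium needs the signal to be differentially costly:
#   w_H - c_L * e <= w_L   (low types do not bother mimicking)
#   w_H - c_H * e >= w_L   (high types still find signaling worthwhile)


def signal_separates(w_H: float, w_L: float, c_H: float, c_L: float, e: float) -> bool:
    low_wont_mimic = w_H - c_L * e <= w_L
    high_still_signals = w_H - c_H * e >= w_L
    return low_wont_mimic and high_still_signals


w_H, w_L, e = 100.0, 60.0, 5.0  # hypothetical wages and signal level

# Before GenAI: the polished pitch is much cheaper for high types to produce.
print(signal_separates(w_H, w_L, c_H=2.0, c_L=12.0, e=e))  # True  -> the signal separates types

# After GenAI: the tool slashes the low type's cost of producing the same polish.
print(signal_separates(w_H, w_L, c_H=1.0, c_L=3.0, e=e))   # False -> pooling; the signal is diluted
```

In this toy version, separation can only be restored by escalating the signal (a longer pitch, a paid application, an in-person interview) so long as some cost gap between types remains; if the tool equalizes costs entirely, no signal level separates the types, which is the pooling outcome the signal-dilution discussion above points to.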

    53 min
  8. Evaluating GDPVal, OpenAI's Eval for Economic Value

    11/04/2025

    Evaluating GDPVal, OpenAI's Eval for Economic Value

    In this episode of the Justified Posteriors podcast, Seth and Andrey discuss “GDPVal,” a new set of AI evaluations, really a novel approach to AI evaluation, from OpenAI. The metric is debuted in a new OpenAI paper, “GDPVal: Evaluating AI Model Performance on Real-World, Economically Valuable Tasks.” We discuss this “bottom-up” approach to the possible economic impact of AI (which evaluates hundreds of specific tasks and weights each by its estimated economic value in the economy), and contrast it with Daron Acemoglu’s “top-down” “Simple Macroeconomics of AI” paper (which does the same calculation, but only with aggregate averages), as well as with measures of AI’s use and potential that are less directly tethered to economic value (like Anthropic's AI Economic Value Index and GPTs are GPTs). Unsurprisingly, the company pouring hundreds of billions into AI thinks that AI already can do a lot. Perhaps trillions of dollars in knowledge work tasks annually. More surprisingly, OpenAI claims the leading Claude model is better than their own! Do we believe that analysis? Listen to find out!

    Key Findings & Results Discussed
    * AI Win Rate vs. Human Experts:
      * The Prior: We went in with a prior that a generic AI (like GPT-5 or Claude) would win against a paid human expert in a head-to-head task only about 10% of the time.
      * The Headline Result: The paper found a 47.6% win rate for Claude Opus (near human parity) and a 38.8% win rate for GPT-5 High. This was the most shocking finding for the hosts.
    * Cost and Speed Improvements:
      * The paper provides a prototype for measuring economic gains. It found that using GPT-5 in a collaborative “N-shot” workflow (where the user can prompt it multiple times) resulted in a 39% speed improvement and a 63% cost improvement over a human working alone.
    * The “Catastrophic Error” Rate:
      * A significant caveat is that in 2.7% of the tasks the AI lost, it was due to a “catastrophic error,” such as insulting a customer, recommending fraud, or suggesting physical harm. This is presumed to be much higher than the human error rate.
    * The “Taste” Problem (Human Agreement):
      * A crucial methodological finding was that inter-human agreement on which work product was “better” was only 70%. This suggests that “taste” and subjective preferences are major factors, making it difficult to declare an objective “winner” in many knowledge tasks.

    Main Discussion Points & Takeaways
    * The “Meeting Problem” (Why AI Can’t Take Over):
      * Andrey argues that even if AI can automate artifact creation (e.g., writing a report, making a presentation), it cannot automate the core of many knowledge-work jobs.
      * He posits that much of this work is actually social coordination, consensus-building, and decision-making—the very things that happen in meetings. AI cannot yet replace this social function.
    * Manager of Agents vs. “By Hand”:
      * The Prior: We believed 90-95% of knowledge workers would still be working “by hand” (not just managing AI agents) in two years.
      * The Posterior: We did not significantly change this belief. We distinguish between “1-shot” delegation (true agent management) and “N-shot” iterative collaboration (which we still classify as working “by hand”). We believe most AI-assisted work will be the iterative kind for the foreseeable future.
    * Prompt Engineering vs. Model Size:
      * We noted that the models were not used “out-of-the-box” but benefited from significant, expert-level prompt engineering.
      * However, we were surprised that the data seemed to show that prompt tuning only offered a small boost (e.g., ~5 percentage points) compared to the massive gains from simply using a newer, larger, and more capable model.
    * Final Posterior Updates:
      * AI Win Rate: We updated our 10% prior to 25-30%. We remain skeptical of the 47.6% figure.

    PS — Should our thumbnails have anime girls in them, or Andrey with giant eyes? Let us know in the comments!

    Timestamps:
    * (00:45) Today’s Topic: A new OpenAI paper (“GDPVal”) that measures AI performance on real-world, economically valuable tasks.
    * (01:10) Context: How does this new paper compare to Acemoglu’s “Simple Macroeconomics of AI”?
    * (04:45) Prior #1: What percentage of knowledge tasks will AI win head-to-head against a human? (Seth’s prior: 10%).
    * (09:45) Prior #2: In two years, what share of knowledge workers will be “managers of AI agents” vs. doing work “by hand”?
    * (19:25) The Methodology: This study uses sophisticated prompt engineering, not just out-of-the-box models.
    * (25:20) Headline Result: AI (Claude Opus) achieves a 47.6% win rate against human experts, nearing human parity. GPT-5 High follows at 38.8%.
    * (33:45) Cost & Speed Improvements: Using GPT-5 in a collaborative workflow can lead to a 39% speed improvement and a 63% cost improvement.
    * (37:45) The “Catastrophic Error” Rate: How often does the AI fail badly? (Answer: 2.7% of the time).
    * (39:50) The “Taste” Problem: Why inter-human agreement on task quality (at only 70%) is a major challenge for measuring AI.
    * (53:40) The Meeting Problem: Why AI can’t (yet) automate key parts of knowledge work like consensus-building and coordination.
    * (58:00) Posteriors Updated: Seth and Andrey update their “AI win rate” prior from 10% to 25-30%.

    Transcript: Seth: Welcome to the Justified Posteriors Podcast, the podcast that updates its priors on the economics of AI and technology. I’m Seth Benzell, highly competent at many real-world tasks, just not the most economically valuable ones, coming to you from Chapman University in sunny Southern California. Andrey: And I’m Andrey Fradkin, making sure to never use the Unicode character 2011, since it will not render properly on people’s computers. Coming to you from San Francisco, California. Seth: Amazing, Andrey. Amazing to have you here in the “state of the future.” And today we’re kind of reading about those AI companies that are bringing the future here today and are gonna, I guess, automate all knowledge work. And here they are today, with some measures about how many jobs—how much economic value of jobs—they think current generation chatbots can replace. We’ll talk about to what extent we believe those economic extrapolations. But before we go into what happens in this paper from our friends at OpenAI, do you remember one of our early episodes, that macroeconomics of AI episode we did about Daron Acemoglu’s paper? Andrey: Well, the only thing I remember, Seth, is they were quite simple, those macroeconomics... it was the... Seth: “Simple Macroeconomics of AI.” So you remembered the title. And if I recall correctly, the main argument of that paper was you can figure out the productivity of AI in the economy by multiplying together a couple of numbers. How many jobs can be automated? Then you multiply it by, if you automate the job, how much less labor do you need? Then you multiply that by, if it’s possible to automate, is it economically viable to automate? 
And you multiply those three numbers together and Daron concludes that if you implement all current generation AI, you’ll raise GDP by one percentage point. If you think that’s gonna take 10 years, he concludes that’s gonna be 0.1 additional percentage point of growth a year. You can see why people are losing their minds over this AI boom, Andrey. Andrey: Yeah. Yeah. I mean, you know, I think with so much hype, you know, they should probably just stop investing altogether. That’s kind of what I would think from Daron’s paper. Yeah. Seth: Well, Andrey, why don’t I tell you how I see it: the way I see this paper that we just read is that OpenAI has actually taken on the challenge and said, “Okay, you can multiply three numbers together and tell me the economic value of AI. I’m gonna multiply 200 numbers together and tell you the economic value of AI.” And in particular, rather than just try to take the sort of global aggregate of efficiency from automation, they’re gonna go task by task by task and try to measure: Can AI speed you up? Can it do the job by itself? This is the sort of real-world, rubber-hits-the-road economics that you don’t see in macroeconomics papers. Andrey: Yeah. Yeah. I mean, it is, it is in many ways a very micro study, but I guess micro... Seth: Macro. Andrey: Micro, macro. That was the best, actually my favorite. Seth: Yeah. Andrey: I guess maybe we should start with our prior, Seth, before we get deeper. Seth: Well, let’s say the name of the paper and the authors maybe. Andrey: There are so many authors, so OpenAI... I’m sorry guys. You gotta have fewer co-authors. Seth: We will not list the authors. Andrey: But the paper is called “GDPVal: Evaluating AI Model Performance on Real-World, Economically Valuable Tasks.” Seth: And we’re sure it’s written by humans. Andrey: We’re sure that it’s not fully written by humans because they’ve disclosed that they use AI. They have an acknowledgement—they have an AI acknowledgement section. Seth: They used AI “as per usual”? Yeah. In the “ordinary course of coding...” Andrey: And writing. Seth: And writing. And for “minor improvements.” Yes. They wanted to be clear. Okay. Andrey: Not, not the major ones. Yes. Seth: Because, you know, base... so, all right. You gave us the name of the paper. The paper is going to... just in one sentence, what the paper is about is them going through lots of different tasks and trying to figure out if they can be automated. What are the priors? Before we go into this, what are you thinking about, Andrey? Andrey: Well, what they’re gonna do is they’re gonna create a work product, let’s say a presentation or schematic or a document, and then they’re gonna have people rate which one is better, the one created by the AI, or the one created by a professional human being. And so the first prior
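For readers who want to see the two accounting approaches side by side, here is a back-of-envelope sketch of the “multiply a few numbers together” top-down calculation Seth paraphrases above, next to a toy GDPVal-style bottom-up aggregation. Every number, task name, and variable name in it is an illustrative placeholder of ours, not a figure from Acemoglu’s paper or from the GDPVal paper:

```python
# Back-of-envelope sketch of the two accounting approaches discussed above.
# Every number is an illustrative placeholder, not a figure from Acemoglu's
# paper or from the GDPVal paper.

# Top-down ("multiply a few numbers together"): three aggregate shares.
share_of_work_exposed = 0.20   # fraction of labor tasks AI could in principle perform
share_cost_effective  = 0.25   # of those, the fraction actually worth automating today
savings_where_adopted = 0.25   # average labor-cost saving on the automated tasks

gdp_level_gain = share_of_work_exposed * share_cost_effective * savings_where_adopted
print(f"top-down GDP level gain: {gdp_level_gain:.2%}")
print(f"spread over 10 years:    {gdp_level_gain / 10:.2%} per year")

# Bottom-up (GDPVal-style): score many concrete tasks, then weight by wage bill.
# Each entry: (task, annual wage bill in $bn, AI win rate vs. a human expert).
tasks = [
    ("draft legal memo",       30.0, 0.45),
    ("build sales deck",       20.0, 0.50),
    ("reconcile spreadsheets", 15.0, 0.35),
    ("write marketing copy",   10.0, 0.60),
]
weighted_wins   = sum(wage_bill * win_rate for _, wage_bill, win_rate in tasks)
total_wage_bill = sum(wage_bill for _, wage_bill, _ in tasks)
print(f"wage-bill-weighted win rate: {weighted_wins / total_wage_bill:.1%}")
print(f"wage bill where graders would prefer the AI artifact: ${weighted_wins:.1f}bn")
```

The contrast is the point: the top-down estimate stands or falls with three aggregate guesses, while the bottom-up estimate stands or falls with how representative the task list is and how much you trust the graders’ win rates.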

    1h 4m
