The Inference Layer

inferencelayer.ai

A new podcast covering the intricate systems, chips, and stacks that define the inference layer, and the complexities of moving AI models from training to real-world deployment. The Inference Layer operates as a community-driven initiative, linking a diverse network of university labs, AI specialists, and supporting partners through a volunteer-led framework coordinated by the Princeton School of AI meetup group and the producers of the Humanitarian AI Today podcast and the University of Pittsburgh's Health and Explainable AI podcast.

Episodes

  1. MAR 18

    Federico Pierucci on Multi-Agent Risks in Humanitarian Aid at The Inference Layer

    This third pilot episode of The Inference Layer bridges the technical complexities of AI deployment with the realities of humanitarian operations, featuring a deep dive into the transition from static models to autonomous agentic systems. On behalf of the Humanitarian AI Today podcast, guest host Patrick Hassan, an AI policy lead with a background in disaster response, interviews Federico Pierucci, Scientific Director of the Icaro Lab, to explore how the inference layer is becoming a site of significant systemic risk. The discussion provides a unique look at inference-time failures such as alignment drift and steganographic coordination that emerge only when multiple agents interact in production environments.

    For humanitarian actors, the episode raises concerns about operating in an era of assistance automated by layers of AI agents. The dialogue highlights how multi-agent chains used, for example, for beneficiary selection or resource allocation can degrade, develop invisible biases, or be weaponized or politicized by parties to a conflict. Federico explains that these risks can be compounded by a lack of safety benchmarks for underrepresented languages and dialects, which can lead to unpredictable jailbreaks or administrative failures in the field.

    The episode provides an inside look at pioneering research carried out by the Icaro Lab, a Rome-based laboratory specializing in AI safety in collaboration with Sapienza University. The lab focuses on mechanistic interpretability, a technical field dedicated to understanding the internal attention heads and decision-making units of an AI model in order to decipher how it actually processes information (a minimal illustration of attention-head inspection appears after the episode list). The discussion introduces the concept of Institutional AI, a proposed framework for managing these emerging "xeno-behaviors" through a governance graph. Rather than relying solely on prompt engineering or model-level alignment, Federico argues for a protocol-level solution that can manage misbehaving agents during inference. The episode is informative for professionals seeking to understand why AI safety must evolve from a localized technical challenge into a global institutional design problem, particularly in regions where traditional governance has broken down.

    39 min
  2. FEB 20

    Alexandre Marques from Red Hat on Tackling the Hardest Problems in Open Source Inference

    Alexandre Marques, Engineering Manager and Team Lead of Machine Learning Research at Red Hat and former Manager of Machine Learning Research at Neural Magic, speaks with the University of Pittsburgh’s Health and Explainable AI podcast producer, Brent Phillips, about Red Hat and his team’s work building and maintaining the platforms that power open-source AI inference at scale.

    In this pilot episode of The Inference Layer, Alexandre discusses his transition from aerospace engineering to leading a research team focused on making large AI models faster, cheaper, and more deployable. He explains that while large labs have proven model capabilities, the current challenge lies in moving these models into production. To bridge the gap between research demos and real-world scaling, he emphasizes the need for a deep understanding of how architectural decisions influence performance, and the ability to translate research into high-quality code. The conversation delves into the technical definition of the inference layer, which Alexandre describes as the entire stack, including runtime, hardware, memory management, and batching strategies, that sits between a trained model and the end-user experience. He highlights the important role of open source and open research at Red Hat and speaks about his team’s search for a Senior Machine Learning Research Engineer to work on post-training optimization for large language models and conduct applied research on state-of-the-art inference optimization techniques, including quantization, pruning, knowledge distillation, and speculative decoding (a minimal quantization sketch appears after the episode list).

    In the interview, Alexandre highlights two ambitious areas he is eager to explore that he believes will define the future of the field. First, he is interested in systematically studying how different optimization techniques compound, specifically how speculative decoding interacts with compression methods like quantization in production environments. Second, he aims to tackle the evolution of inference from single, independent models toward the orchestration of multiple models across distributed environments. This shift introduces new layers of complexity in scheduling and systems design, representing the kind of "hard problem" Alexandre believes will define the next few years of AI deployment.

    The Inference Layer podcast is a collaborative initiative linking university AI labs, researchers, volunteers, and supporting partners to explore the complexities of moving models from training to real-world deployment. By highlighting advanced research and frontier challenges, the podcast provides a platform for experts to discuss the cutting-edge developments driving the future of AI.

    14 min
  3. JAN 29

    Manuela Nayantara Jeyaraj Discusses Explainability at the Inference Layer

    Manuela Nayantara Jeyaraj, a PhD student and researcher at the Applied Intelligence Research Centre (AIRC) within Technological University Dublin, speaks with the University of Pittsburgh’s Health and Explainable AI podcast producer, Brent Phillips, about explainability at the inference layer.

    In this pilot episode of The Inference Layer, Manuela discusses her award-winning work on identifying cognitive bias in language models. She explains that while explicit bias is well studied, her research focuses on the implicit, subtle "cognitive biases" that models learn from human patterns, such as gender stereotypes in job recruitment or political descriptions. To address this, Manuela developed an algorithm that combines model-agnostic and model-specific explainability approaches to provide high-confidence justifications for AI decisions. She also highlights the creation of a massive, modern lexicon that captures gendered associations across a wide range of English, from archaic terms to contemporary slang found on TikTok and Instagram. The conversation delves into the technical challenges of maintaining explainability at the inference layer, particularly when transitioning from high-compute cloud environments to resource-constrained edge devices like phones or wearables. Manuela emphasizes that for real-time applications such as clinical decision-making, explainability cannot be an "afterthought" and must be lightweight enough to run locally to ensure user privacy and trust.

    In the interview, Manuela highlights two ambitious areas she is eager to explore that connect the technical and human sides of AI. First, she is interested in developing high-confidence, real-time explainability for streaming data, where decisions must be justified in milliseconds without slowing down the model. This includes providing "counterfactual" explanations: identifying exactly what would need to change for a different outcome to occur, such as a patient's risk level shifting from high to low (a toy counterfactual sketch appears after the episode list). Second, she wants to tackle the "storytelling" aspect of explainable AI (XAI), creating systems that can tailor the complexity and detail of an explanation to different stakeholders. For instance, in a recruitment scenario, she envisions a model that provides a deep technical justification for a recruiter while offering a more abstracted, helpful level of feedback for the job applicant.

    The Inference Layer podcast is a collaborative initiative linking university AI labs, researchers, and supporting partners to explore the complexities of moving models from training to real-world deployment. Managed by volunteers, the series focuses on the intricate systems, chips, and stacks that define the inference layer. By highlighting advanced research and frontier challenges, the podcast provides a platform for experts to discuss the cutting-edge developments driving the future of AI.

    23 min
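Code sketches

The episodes above touch on techniques that are easier to grasp with a few lines of code. The sketches below are illustrative only: the models, names, and parameters are our assumptions, not material from the guests or their labs.

Mechanistic interpretability, the Icaro Lab's focus in the March episode, typically begins by reading out a model's internal attention patterns. Here is a minimal sketch using the Hugging Face transformers library; the model (gpt2) and the chosen layer and head are arbitrary stand-ins, not anything specific to the lab's research:

```python
# Minimal sketch: inspect per-head attention weights in a small transformer.
# The model name and the layer/head choice are illustrative; real
# mechanistic-interpretability work goes far beyond this starting point.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # small model, chosen only for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

text = "Aid allocation depends on verified need."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
layer, head = 5, 3  # arbitrary layer/head picked for illustration
attn = outputs.attentions[layer][0, head]

# For each position, show which token this head attends to most strongly.
for i, tok in enumerate(tokens):
    j = int(attn[i].argmax())
    print(f"{tok:>12} -> {tokens[j]}")
```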
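Of the optimization techniques Alexandre lists in the February episode, quantization is the most self-contained to demonstrate. Below is a minimal NumPy sketch of symmetric per-tensor int8 weight quantization; this generic scheme is assumed for illustration and is not Red Hat's implementation:

```python
# Minimal sketch of symmetric per-tensor int8 quantization, one of the
# compression techniques mentioned in the episode. Production systems
# (per-channel scales, activation quantization, calibration-based methods)
# are considerably more involved.
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map float weights onto int8 with a single shared scale."""
    scale = np.abs(w).max() / 127.0  # largest magnitude maps to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(1024, 1024)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32; the price is rounding error.
err = np.abs(w - w_hat).max()
print(f"scale={scale:.3e}  max abs error={err:.3e}")
```

Speculative decoding, by contrast, operates at runtime: a small draft model proposes several tokens that the large model verifies in a single batched forward pass, which is why its interaction with compression methods like the one above is non-trivial to predict.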
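The counterfactual explanations Manuela describes in the January episode can be made concrete with a toy example: search for the smallest change to one input feature that flips a model's decision. Everything here, the linear risk scorer, the feature names, and the threshold, is invented for illustration:

```python
# Toy sketch of a counterfactual explanation: find the smallest change to a
# single feature that flips the model's decision. The "risk model" here is
# an invented linear scorer; a real system would wrap an actual trained model.
import numpy as np

FEATURES = ["age", "systolic_bp", "cholesterol"]  # illustrative names
WEIGHTS = np.array([0.02, 0.03, 0.01])
THRESHOLD = 6.0

def risk(x: np.ndarray) -> str:
    return "high" if WEIGHTS @ x > THRESHOLD else "low"

def counterfactual(x, feature, step=0.5, max_steps=200):
    """Nudge one feature until the predicted label flips; report the delta."""
    base = risk(x)
    for direction in (-1.0, 1.0):
        x_cf = x.astype(float).copy()
        for _ in range(max_steps):
            x_cf[feature] += direction * step
            if risk(x_cf) != base:
                delta = x_cf[feature] - x[feature]
                return (f"{FEATURES[feature]}: change by {delta:+.1f} "
                        f"to flip {base} -> {risk(x_cf)}")
    return "no flip found in range"

patient = np.array([55.0, 150.0, 210.0])
print("prediction:", risk(patient))        # high
print(counterfactual(patient, feature=1))  # e.g. lower systolic_bp by ~57
```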
