Generative AI in the Real World

O'Reilly

In 2023, ChatGPT put AI on everyone’s agenda. Now, the challenge will be turning those agendas into reality. In Generative AI in the Real World, Ben Lorica interviews leaders who are building with AI. Learn from their experience to help put AI to work in your enterprise.

  1. 1D AGO

    Interactions Between Humans and AI with Rajeshwari Ganesan

    In this edition of Generative AI in the Real World, Ben Lorica and Rajeshwari Ganesan talk about how to put generative AI in closer touch with human needs and requirements. AI isn’t all about building bigger models and benchmarks. To use it effectively, we need better interfaces; we need contexts that support groups rather than individuals; we need applications that allow people to explore the space they’re working in. Ever since ChatGPT, we’ve assumed that chat is the best interface for AI. We can do better.

    Points of Interest

    0:17: We’re both builders and consumers of AI. How does this dual relationship affect how we design interfaces?
    0:41: A lot of advances happen in the large language models. But when we step back, are these models consumable by users? We lack the kind of user interface we need. With ChatGPT, conversations can go round and round, turn by turn. If you don’t give the right context, you don’t get the right answer. This isn’t good enough.
    1:47: Model providers go out of their way to coach users, telling them how to prompt new models. All the providers have coaching tips. What alternatives should we be exploring?
    2:50: We’ve made certain initial starts. GitHub Copilot and mail applications with typeahead don’t require heavy-duty prompting. The AI coinhabits the same workspace as the user. The context is derived from the workspace. The second part is that generative interfaces are emerging. It’s not the content but the experience that’s generated by the machine.
    5:22: Interfaces are experiences. Generate the interface based on what the user needs at any given point. At Infosys, we do a lot of legacy modernization—that’s where you really need good interfaces. We have been able to create interfaces where the user is able to walk into a latent space—an area that gives them an understanding of what they want to explore.
    7:11: A latent space is an area that is meaningful for the user’s interaction. A space that’s relatable and semantically understandable. The user might say, “Tell me all the modules dealing with fraud detection.” Exploring the space that the user wants is possible. Let’s say I describe various aspects of a project I’m launching. The machine looks at my thought process. It looks at my answers, breaks them up part by part, judges the quality of each response, and gets into the pieces that need to be better.
    9:44: One of the things people struggle with is evaluation. Not of a single agent—most tasks require multiple agents because there are different skills and tasks involved. How do we address evaluation and transparency?
    10:42: When it comes to evaluation, I think in terms of trustworthy systems. A lot of focus on evaluation comes from model engineering. But one critical piece of building trustworthy systems is the interface itself. A human has an intent and is requesting a response. There is a shared context—and if the context isn’t shared properly, you won’t get the right response. Prompt engineering is difficult; if you don’t give the right context, you go in a loop.
    12:26: Trustworthiness breaks because you’re dependent on the prompt. The coinhabited workspace that takes the context from the environment plays a big role.
    12:46: Once you give the questions to the machine, the machine gives a response. But if you don’t make a response that is consumable by the user, that’s a problem.
    13:18: Trustworthiness of systems in the context of agent frameworks is much more complex. Humans don’t just have factual knowledge. We have beliefs. Humans have a belief state, and if an agent doesn’t have access to that belief state, it will get into something called reasoning derailment. If the interface can’t bring belief states to life, you will have a problem.

    33 min
  2. 2D AGO

    Getting Beyond the Demo with Hamel Husain

    In this episode, Ben Lorica and Hamel Husain talk about how to take the next steps with artificial intelligence. Developers don’t need to build their own models—but they do need basic data skills. It’s important to look at your data, to discover your model’s weaknesses, and to use that information to develop test suites and evals that show whether your model is behaving well.

    Links to Resources

    Hamel’s upcoming course on evaluating LLMs
    Hamel’s O’Reilly publications: “AI Essentials for Tech Executives” and “What We Learned from a Year of Building with LLMs”
    Hamel’s website

    Points of Interest

    0:39: What inspired you and your coauthors to create a series on practical uses of foundation models? What gaps in existing resources did you aim to address?
    0:56: We’re publishing “AI Essentials for Tech Executives” now; last year, we published “What We Learned from a Year of Building with LLMs.” Coming from the perspective of a machine learning engineer or data scientist—you don’t need to build or train models. You can use an API. But there are skills and practices from data science that are crucial.
    2:16: There are core skills around data analysis, error analysis, and basic data literacy that you need to get beyond a demo.
    2:43: What are some crucial shifts in mindset that you’ve written about on your blog?
    3:24: The phrase we keep repeating is “look at your data.” What does “look at your data” mean?
    3:51: There’s a process that you should use. Machine learning systems have a lot in common with modern AI. How do you test those? Debug them? Improve them? Look at your data; people fail on this. They do vibe checks, but they don’t really know what to do next.
    4:56: Looking at your data helps ground everything. Look at actual logs of user interactions. If you don’t have users, generate interactions synthetically. See how your AI is behaving and write detailed notes about failure modes. Do some analysis on those notes: Categorize them. You’ll start to see patterns and your biggest failure modes. This will give you a sense of what to prioritize.
    6:08: A lot of people are missing that. People aren’t familiar with the rich ecosystem of data tools, so they get stuck. We know that it’s crucial to sample some data and look at it.
    7:08: It’s also important that the domain expert do this with the engineers. On a lot of teams, the domain expert isn’t an engineer.
    7:44: Another thing is focusing on processes, not tools. Tools aren’t the problem—the problem is that your AI isn’t working. The tools won’t take care of it for you. There’s a process: how to debug, look at, and measure AI. Those are the main mind shifts.
    9:32: Most people aren’t building models (pretraining); they might be doing posttraining on a base model. But there are a lot of experiments that you still have to run. There are knobs you have to turn, and without the ability to do it systematically and measure, you’re just mindlessly turning knobs without learning much.
    10:29: I’ve held open office hours for people to ask questions about evals. What people ask most is what to eval. There are many components. You can’t and shouldn’t test everything. You should be grounded in your actual failure modes. Prioritize your tests on that.
    11:30: Another topic is what I call prototype purgatory. A lot of people have great demos. The demos work, and might even be deployable. But people struggle with pulling the trigger.
    12:15: A lot of people don’t know how to evaluate their AI systems if they don’t have any users. One way to help yourself is to generate synthetic data. Have an LLM generate realistic user inputs, and brainstorm different personas and scenarios. That bootstraps you significantly towards production. (A sketch of this approach follows these notes.)
    13:57: There’s a new open source tool that does something like this for agents. It’s called IntelAgent. It generates synthetic data that you might not come up with yourself.
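    A minimal sketch of the synthetic-data bootstrapping Hamel describes: brainstorm personas and scenarios, then have an LLM write realistic user inputs for each pair, and run those through your system while taking notes on failure modes. The OpenAI client, model name, personas, and prompt wording below are illustrative assumptions, not details from the episode; any LLM API would do.

```python
import itertools
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Hypothetical personas and scenarios; in practice, brainstorm these
# with your domain expert.
personas = ["first-time user", "power user", "frustrated customer"]
scenarios = ["billing question", "feature request", "bug report"]

def synthetic_inputs(persona: str, scenario: str, n: int = 3) -> list[str]:
    """Ask an LLM for n realistic user messages for one persona/scenario pair."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} realistic messages a {persona} might send to a "
                f"support assistant about a {scenario}. One message per line."
            ),
        }],
    )
    return resp.choices[0].message.content.splitlines()

# Generate inputs for every persona x scenario combination, then feed
# them to your system and categorize the failures you observe.
dataset = {
    (p, s): synthetic_inputs(p, s)
    for p, s in itertools.product(personas, scenarios)
}
```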

    32 min
  3. 3D AGO

    Agents—The Next Step in AI with Shelby Heinecke

    Join Shelby Heinecke, senior research manager at Salesforce, and Ben Lorica as they talk about agents, AI models that can take action on behalf of their users. Are they the future—or at least the hot topic for the coming year? Where are we with smaller models? And what do we need to improve the agent stack? How do you evaluate the performance of models and agents?

    Points of Interest

    0:29: Introduction—Our guest is Shelby Heinecke, senior research manager at Salesforce.
    0:43: The hot topic of the year is agents. Agents are increasingly capable of GUI-based interactions. Is this my imagination?
    1:20: The research community has made tremendous progress to make this happen. We’ve made progress on function calling. We’ve trained LLMs to call the correct functions to perform tasks like sending emails. My team has built large action models that, given a task, write a plan and the API calls to execute it. (A minimal function-calling sketch follows these notes.) That’s one piece. The second piece, for when you don’t know the functions a priori, is giving the agent the ability to reason about images and video.
    3:07: We released multimodal action models. They take an image and text and produce API calls. That makes navigating GUIs a reality.
    3:34: A lot of knowledge work relies on GUI interactions. Is this just robotic process automation rebranded?
    4:05: We’ve been automating forever. What’s special is that the automation is driven by LLMs, and that combination is particularly powerful.
    4:32: The earlier generation of RPA was very tightly scripted. With multimodal models that can see the screen, agents can really understand what’s happening. Now we’re beginning to see reasoning-enhanced models. Inference scaling will be important.
    5:52: Multimodality and reasoning-enhanced models will make agents even more powerful.
    6:00: I’m very interested in how much reasoning we can pack into a smaller model. Just this week, DeepSeek also released smaller distilled versions.
    7:08: Every month, the capability of smaller models gets pushed further. Smaller models right now may not compare to large models. But this year, we can push the boundaries.
    7:38: What’s missing from the agent stack? You have the model and some notion of memory. You have tools that the agent can call. There are agent frameworks. You need monitoring and observability. Everything depends on the model’s capabilities. There’s a lot of fragmentation, and the vocabulary is still unclear. Where do agents usually fall short?
    9:00: There’s a lot of room for improvement with function calling and multistep function calling. Earlier in the year, it was just single step. Now there’s multistep. That expands our horizons.
    9:59: We need to think about deploying agents that solve complex tasks that take multiple steps. We will need to think more about efficiency and latency. With increased reasoning abilities, latency increases.
    10:45: This year, we’ll see small language models and agents come together.
    10:58: At the end of the day, this is an empirical discipline, and you need to come up with your own benchmarks and eval tools. What are you doing in terms of benchmarks and evals?
    11:36: This is the most critical piece of applied research. You’re deploying models for a purpose. You still need an evaluation set for that use case. As we work with a variety of products, we cocreate evaluation sets with our partners.
    12:38: We’ve released the CRM benchmark. It’s open. We’ve created CRM-style datasets with CRM-type tasks. You can see how open source models and small models perform on these leaderboards.
    13:16: How big do these datasets have to be?
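    The episode doesn’t walk through code, so here is a minimal function-calling sketch under stated assumptions: describe a tool (“send an email”) with a JSON schema, let the model choose the function and its arguments, and dispatch the call. The OpenAI tools API is used as one concrete example of the pattern; Salesforce’s large action models are trained to emit such calls directly. The model name and the send_email helper are assumptions, not from the episode.

```python
import json
from openai import OpenAI

client = OpenAI()

def send_email(to: str, subject: str, body: str) -> str:
    """Stand-in for a real email integration."""
    print(f"Sending to {to}: {subject}\n{body}")
    return "sent"

# JSON-schema description of the tool the model is allowed to call.
tools = [{
    "type": "function",
    "function": {
        "name": "send_email",
        "description": "Send an email to a recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "subject": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "subject", "body"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any tool-calling-capable model
    messages=[{"role": "user",
               "content": "Email alice@example.com that the demo moved to 3pm."}],
    tools=tools,
)

# Dispatch whatever call(s) the model decided to make.
for call in resp.choices[0].message.tool_calls or []:
    if call.function.name == "send_email":
        print(send_email(**json.loads(call.function.arguments)))
```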

    27 min
  4. 4D AGO

    Measuring Skills with Kian Katanforoosh

    How do we measure skills in an age of AI? That question has an effect on everything from hiring to productive teamwork. Join Kian Katanforoosh, founder and CEO of Workera, and Ben Lorica for a discussion of how we can use AI to assess skills more effectively. How do we get beyond pass/fail exams to true measures of a person’s ability?

    Points of Interest

    0:28: Can you give a sense of how big the market for skills verification is?
    0:42: It’s extremely large. Anything that touches skills data is on the rise. When you extrapolate from university admissions to someone’s career, you realize that there are many times when they need to validate their skills.
    1:59: Roughly, what’s the breakdown between B2B and B2C?
    2:04: Workera is exclusively B2B and federal. However, there are also assessments focused on B2C. Workera has free assessments for consumers.
    3:00: Five years ago, there were tech companies working on skill assessment. What were the prior solutions, before the rise of generative AI?
    3:27: Historically, assessments have been used for summative purposes: pass/fail, high stakes, where the goal is to admit or reject you. We offered assessments so that people could know where they stand, compare themselves to the market, and decide what to study next. That takes different technology.
    4:50: Generative AI became much more prominent with the rise of ChatGPT. What changed?
    5:09: Skills change faster than ever. You need to update skills much more frequently. The half-life of skills used to be over 10 years. Today, it’s estimated to be around 2.5 years in the digital arena. Writing a quiz is easy. Writing a good assessment is extremely hard. Validity is the concept that what you intend to measure is what you are actually measuring. AI can help.
    6:39: AI can help with modeling the competencies you want to measure.
    6:57: AI can help streamline the creation of an assessment.
    7:22: AI can help test the assessment with synthetic users.
    7:42: AI can help with monitoring postassessment. There are a lot of things that can go wrong.
    8:25: Five years ago in programming, people used tests to filter people out. That has changed; people will use coding assistants on the job. Why shouldn’t I be able to use a coding assistant when I’m doing an assessment?
    9:16: You should be able to use it. The assessment has to change. The previous generation of assessments focused on syntax. Do you care if you forgot a semicolon? Assessments should focus on other cognitive levels, such as analyzing and synthesizing information.
    10:06: Because of generative models, it’s become easier to build an impressive prototype. Evaluation is the hard part. Assessment is all about evaluation, so the bar is much higher for you.
    10:48: Absolutely. We have a study that calculates the number of skills needed to prototype versus deploy AI. You need about 1,000 skills to prototype AI. You need about 10,000 skills for production AI.
    12:39: If I want to do skills assessment on an unfamiliar workflow, say full-stack web development, what’s your process for onboarding?
    13:17: We have one agent that’s responsible for competency modeling. You can have a subject-matter expert (SME) share a job description, task analysis, or job architecture. We take that information and granularize the tasks worth measuring. At that point, there’s a human in the loop.
    14:27: Where does AI help? What does the AI need? What would you like to see from people using your tool?
    15:04: Language models have been trained on pretty much everything online. You can get a pretty good answer from AI. The SME takes that from 80% to 100%. Now, there are issues with that process. We separate the core catalog of skills from the custom catalog, where customers create custom assessments. A standardized assessment lets you benchmark against other people or companies.
    16:32: If you take a custom assessment, it’s highly relevant to your needs, even though comparisons aren’t possible.
    16:41: It’s obviously anonymized, right?

    31 min
  5. 5D AGO

    Chloé Messdaghi on AI Security, Policy, and Regulation

    Chloé Messdaghi and Ben Lorica discuss AI security—a subject of increasing importance as AI-driven applications roll out into the real world. There’s a knowledge gap: Security workers don’t understand AI, and AI developers don’t understand security. Make sure to bring everyone together, including AI developers and experts, to develop AI security policies and playbooks. And be aware of the resources that are available; we expect to see AI security certifications and training become available in the coming year.

    Points of Interest

    0:24: How does AI security differ from traditional cybersecurity?
    0:44: AI is a black box: We don’t have transparency to show how AI works or explainability to show how it makes decisions. Black boxes are hard to secure.
    2:12: There’s a huge knowledge gap. Companies aren’t doing what is needed.
    2:24: When you talk to executives, do you distinguish between traditional AI and ML and the new generative AI models?
    2:43: We talk about older models as well. But much of security is about “What am I supposed to do?” We’ve had AI for a while, but for some time, security has not been part of that conversation.
    3:26: Where do security folks go to learn how to secure AI? There are no certifications. We’re playing a massive catch-up game.
    3:53: What’s the state of awareness about incident response strategies for AI?
    4:15: Even in traditional cybersecurity, we’ve always had an issue making sure incident response plans aren’t ad hoc or expired. A lot of it is being aware of all the technologies and products that the company has been using. It’s hard to protect if you don’t know everything in your environment.
    5:19: The AI Threat Landscape report found that 77% of companies reported breaches in their AI systems.
    5:40: Last year, a statistic came out about the adoption of AI-related cybersecurity measures. For North America, 70% of organizations said they had adopted one or two of five security measures; 24% had adopted two to four.
    6:35: What are some of the first things I should be thinking about to update my incident response playbook?
    6:51: Make sure you have all the right people in the room. We still have issues with department silos. CISOs can be dismissed or not even be in the room when it comes to decisions. There are concerns about restricting innovation or product launch dates. You have to have CTOs, data scientists, ML developers, and all the right people to ensure that there is safety and that everyone has taken precautions.
    7:48: For companies with a mature cybersecurity incident playbook that they want to update for AI, what AI brings is that you have to include more people.
    8:17: You have to realize that there’s an AI knowledge gap and that there’s insufficient security training for data scientists. Security folks don’t know where to turn for education. There aren’t a lot of courses or programs out there. We’ll see a lot of that develop this year.
    10:13: You’d think we’d have addressed communication silos by now, but AI has ripped the bandaids off. There are resources out there. I recommend the Databricks AI Security Framework (DASF); it’s mapped to MITRE ATLAS. Also be familiar with the NIST AI Risk Management Framework and the OWASP AI Exchange.
    11:40: This knowledge gap is on both sides. What are some of the best practices for addressing this two-sided knowledge gap?
    12:20: Be honest about where your company stands. Where are we right now? Are we doing a good job of governance? Am I doing a good enough job as a leader? Is there something I don’t know about the environment? Be the leader who’s a bridge, who breaks down silos, and who knows who owns what and who’s responsible for what.
    13:24: One issue is the notion of shadow AI: Knowledge workers go home and use things that aren’t sanctioned by their companies. Are there specific things that companies should be doing about shadow AI?

    30 min
  6. AUG 29

    Tom Smoker on Getting Started with GraphRAG

    Join Ben Lorica and Tom Smoker for a discussion of GraphRAG, one of the hottest topics of the last few months. GraphRAG goes a step beyond RAG to make the output of language models more consistent, accurate, and explainable. But what is a graph? A graph is a way of structuring data. In the end, it’s the structure that’s important, along with the work you do to create that structure.

    Points of Interest

    0:15: GraphRAG is RAG with a knowledge graph. Do you have a stricter definition?
    1:00: A lot of what I do is the R in RAG: retrieval. Retrieval is better if you have structured data. I’ve yet to find a settled definition for GraphRAG. You want to bring in structured data.
    2:03: At the end of the day, the lesson is structure. Sometimes structure is a SQL database. Don’t lose hope if you don’t have a knowledge graph.
    2:49: A knowledge graph is a knowledge base plus a list of axioms (rules). The knowledge base is just a word connected to another word through a third word. Fundamentally, the benefit comes from the list of triples; the value is in having extracted and defined those triples. (A minimal sketch of this structure follows these notes.)
    4:01: Knowledge graphs are cool again. What are your two favorite examples of GraphRAG in production?
    4:57: My examples are people who are structuring their data so that it’s consistent. Then you can bring it into a context window and do something with it.
    5:18: LinkedIn and Pinterest are the best examples of existing graph structures that work.
    5:35: A new application is a veterinary radiology example. Without GraphRAG, the LLM kept recommending conditions specific to Labradors, not bulldogs. GraphRAG controlled the problem.
    6:37: The underlying data was almost exclusively text. It’s difficult to build up a consistent dataset for veterinary radiology because animals move.
    7:12: My favorite examples: Google uses their Data Commons to build a Q&A application. Metaphor Data: the starting point is structured data; then they create a second graph from the first graph that maps technical terms to business terms; then they construct a social graph based on who is using the data.
    9:41: Structured data can be the basis for a graph.
    10:06: Unstructured data is valuable, but you need a way to navigate and categorize it.
    11:04: Where are we on GraphRAG? Do you still have to explain what GraphRAG is?
    11:28: More people know about it, but I have to explain it more than I did previously. Exactly what are we referring to? Most people want accuracy in the beginning; the value is often that it is more explainable. People may have seen a fantastic example, but what they haven’t seen is the iterative process of schema design. The upfront cost of these systems is nontrivial.
    13:13: What are the key bottlenecks? How do I get a knowledge graph?
    13:23: The biggest question is: Do you need a graph in the first place? There’s a whole spectrum. It’s in most people’s interest to stop before they get to the end.
    14:01: For people who come to us brand-new, we say, “You should try vector RAG first. If that doesn’t work, there’s a lot of good that structuring data can provide.”
    15:01: If the chunks are structured, and a lot of the work is done up front, then it’s possible to navigate through structured information. At that point, you get value out of vector RAG. Academic papers have to follow a certain structure. If you spend time making sure you know what the chunks are, where they’re split and why, and that they’re labeled, you can get a lot of value.
    16:43: What are some of your pointers about how to get started?
    16:47: The knowledge base is often a compressed representation. That means fewer tokens, which means better rate limits and lower cost. So some people want a graph to help scale. That’s one starting point. Another is the desire for a system to be explainable. Getting information into a structured representation, and being able to trace back through that structured representation, can be very useful.
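    A minimal sketch of the structure Tom describes: the knowledge base is a list of (subject, predicate, object) triples, and GraphRAG retrieval is a lookup over those triples whose results go into the LLM’s context window. The triples here are invented for illustration.

```python
# A toy knowledge base: "a word connected to another word through a
# third word," i.e., (subject, predicate, object) triples.
triples = [
    ("payments_service", "handles", "fraud_detection"),
    ("payments_service", "written_in", "COBOL"),
    ("fraud_detection", "flags", "suspicious_transactions"),
]

def retrieve(topic: str) -> list[tuple[str, str, str]]:
    """Return every triple that mentions the topic as subject or object."""
    return [t for t in triples if topic in (t[0], t[2])]

# "Tell me all the modules dealing with fraud detection" becomes a
# structured lookup; the matching triples get placed in the LLM's context.
for s, p, o in retrieve("fraud_detection"):
    print(f"{s} --{p}--> {o}")
```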

    35 min
  7. AUG 28

    Robert Nishihara on AI and the Future of Data

    Robert Nishihara is one of the creators of Ray and a cofounder of Anyscale, a platform for high-performance distributed data analysis and artificial intelligence. Ben Lorica and Robert discuss the need for data for the next generation of AI, which will be multimodal. What kinds of data will we need to develop models for video and multimodal data? And what kinds of tools will we use to prepare that data?

    Points of Interest

    1:06: Are we running out of data?
    1:35: There is a paradigm shift in how ML is thinking about AI. The innovation is on the data side: finding data, evaluating sources of data, curating data, creating synthetic data, filtering low-quality data. People are curating and processing data using AI. Filtering out low-quality data or unimportant image data is an AI task.
    5:02: A lot of the tools were aimed at warehouses and lakehouses. Now we increasingly have more unstructured multimodal data. What’s the challenge for tooling?
    5:44: Lots of companies have lots of data. They get value out of data by running SQL queries on structured data, but structured data is limited. The real insight is in unstructured data, which will be analyzed using AI. Data will shift from SQL-centric to AI-centric. And tooling for multimodal data processing is almost nonexistent.
    8:23: In part of the pipeline, you might be able to use CPUs instead of GPUs.
    8:44: Data processing is not just running inference with an LLM. You might want to decompress video, re-encode video, find scene changes, transcribe, or classify. Some stages will be GPU bound, some will be memory bound, some will be CPU bound. You will want to be able to aggregate these different resources. (A sketch of such a pipeline follows these notes.)
    10:03: Most likely, with this kind of data, it’s assumed you will have to go distributed and scale out. There is no choice but to scale the computation.
    10:46: In the past, we were only using structured data. Now we have multimodal data. We are only scratching the surface of what we can do with video—so people weren’t collecting it as much. We will now collect more data.
    11:41: We need to enable training on 100 times more data.
    12:43: ML infrastructure teams are now on the critical path.
    13:52: Companies at the cutting edge have been doing this, but nearly every company has its own data about its specific business that it can use to improve its platform. The value is there. The challenge is the tooling and the infrastructure.
    15:15: There’s another interesting angle around data and scale: experimentation. You will have to run experiments. Data processing is part of experimentation.
    16:18: Customization isn’t just at the level of the model. There are decisions to be made at every stage of the pipeline. What to collect, how to chunk, how to embed, how to do retrieval, what model to use, what data to use to fine-tune—there are so many decisions to make. To iterate quickly, you need to try different choices and evaluate how they work. Companies should overinvest in evals early.
    17:29: If you don’t have the right foundation, these experiments will be impossible.
    18:23: What’s the next data type to get popular?
    18:42: Image data will be ubiquitous. People will do a lot with PDFs. Video will be the most challenging. Video combines images and audio; text can be in video too. But the data size is enormous. There are modeling challenges around video understanding. There’s so much information in video that isn’t being mined.
    22:50: Companies aren’t saying that scaling laws are over, but scaling is slowing down. What’s happening?
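    A hedged sketch of the kind of heterogeneous pipeline Robert describes, written with Ray Data since Robert cocreated Ray. The stage functions are stubs and the bucket paths are hypothetical; the point is that each stage can request different resources, so a CPU-bound decoding stage and a GPU-bound transcription stage scale independently.

```python
import ray

def decode_video(batch):
    # Stub: a real stage would decompress/re-encode frames here (CPU bound).
    return batch

def transcribe_audio(batch):
    # Stub: a real stage would run a speech-to-text model here (GPU bound).
    return batch

# Hypothetical input path; each record holds one video file's bytes.
ds = ray.data.read_binary_files("s3://my-bucket/videos/")

# Each map_batches stage can request its own resources, letting Ray
# schedule CPU-heavy and GPU-heavy work onto different workers.
ds = ds.map_batches(decode_video, num_cpus=2)
ds = ds.map_batches(transcribe_audio, num_gpus=1)

ds.write_parquet("s3://my-bucket/processed/")  # hypothetical output path
```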

    30 min
  8. AUG 27

    Getting Ahead of the Curve with Claire Vo

    In this episode, Ben Lorica talks with Claire Vo, chief product officer at LaunchDarkly and founder of ChatPRD. AI gives us a new set of tools that make everyone more productive and efficient. Those tools will allow more experimentation; they will allow more people to participate in product development; and they will create new opportunities for startups. As Claire says, this new tooling lets everyone get more ambitious—and if you start now, you’re on the leading edge. Lean in to the opportunities.

    Points of Interest

    0:25: ChatPRD is an AI copilot for product managers and people who build products. The goal is to make people who need to generate ideas and build out requirements more efficient.
    1:15: It improves the quality of product work: It’s an on-demand coach or colleague.
    2:05: In a hybrid world, there needs to be some kind of artifact describing what we want to build. No matter the culture, you should try to make high-quality documents to improve the thinking.
    3:44: We ingest your product documents for two reasons: to have context on what you’ve built and what matters, and to inform style and quality.
    5:13: To become a 100x PM, you need to embrace tools and accelerate your work. It’s learning how to scale and do your best in a highly efficient way—getting 2–3 days back in your week.
    7:17: Will the programming language of the future be natural language? You will still have to think and describe things as a software engineer or a product manager.
    7:54: My favorite users are engineers who don’t have product managers, salespeople who get customer requests, and even founders who can’t afford a product manager.
    8:41: In frontier models, I’d like to see up-to-date training data. The killer feature is performance. The models need to support a workflow that requires speed. Models need more control over output mechanisms than they have now, so users don’t have to massage output.
    10:38: There isn’t capability parity between the models, so you have to make trade-offs between performance, features, API support, latency, user experience, and streaming.
    11:05: Always design your application to be model agnostic. LaunchDarkly allows engineers to decouple the configuration and release of their code from deploying it in production.
    12:14: With AI, prompts become feature flags. You can measure things like latency and token count and make informed decisions about what works best. (A sketch of this idea follows these notes.)
    13:21: It’s important to have the ability to experiment in classic software development. That matters even more with nondeterministic software, because the ability to predict output goes down. You need to think about instrumentation from the beginning.
    14:37: I have been through a couple of technology waves, but this one has stopped me in my tracks. The difference between what is possible and what is not possible is unbelievable. I could have built the product from my startup of 10 years ago before lunchtime.
    16:01: People need to prepare to be expected to do more, because the ability to do more is powered by these tools and automations. People should educate themselves on how to automate tasks in their current job, and they should add additional skills like the ability to code.
    16:42: The shape of organizations will change. The triad of the product manager, engineering lead, and design lead will collapse into an individual. Individual contributors will become more efficient.
    17:35: Everyone can get more ambitious. There won’t be less to do. More people will be empowered to do more things and have bigger impact.
    18:44: Everything requires a radical cultural shift inside companies. It can feel scary. You need to set the aspiration and explain why it matters; you need to organize among motivated individuals and reward the behavior you want to see; new organizations will fall out of the centers of gravity around people who are operating in an AI-native way.
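    A minimal sketch, under stated assumptions, of “prompts become feature flags”: serve prompt variants from a flag-style lookup, record latency and a token count per variant, and compare. In production, a flag service such as LaunchDarkly would replace the inline dict; all names, prompts, and the model stub below are invented for illustration.

```python
import time
import zlib

# Stand-in for a feature-flag service: one flag, two prompt variants.
PROMPT_FLAGS = {
    "summary-prompt": {
        "control":   "Summarize this document in three sentences.",
        "treatment": "You are a concise editor. Summarize this document in three sentences.",
    }
}

metrics: list[dict] = []

def get_prompt(flag: str, user_id: str) -> tuple[str, str]:
    """Deterministically bucket a user into a prompt variant."""
    variant = "treatment" if zlib.crc32(user_id.encode()) % 2 else "control"
    return variant, PROMPT_FLAGS[flag][variant]

def call_llm(prompt: str, doc: str) -> str:
    """Stand-in for a real model call."""
    return doc[:120]

def summarize(user_id: str, doc: str) -> str:
    variant, prompt = get_prompt("summary-prompt", user_id)
    start = time.perf_counter()
    output = call_llm(prompt, doc)
    metrics.append({
        "variant": variant,
        "latency_s": time.perf_counter() - start,
        "tokens": len(output.split()),  # crude proxy for token count
    })
    return output
```

    With metrics keyed by variant, comparing the two prompts on latency, cost, and output quality becomes an ordinary experiment, which is the decoupling of configuration from release that the episode describes.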

    27 min
