340 episodes

Weekly talks and fireside chats about everything that has to do with the new space emerging around DevOps for Machine Learning aka MLOps aka Machine Learning Operations.

MLOps.community Demetrios Brinkmann

    • Technology

    Uber's Michelangelo: Strategic AI Overhaul and Impact // Demetrios Brinkmann // #239

    Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

    Uber's Michelangelo: Strategic AI Overhaul and Impact // MLOps podcast #239 with Demetrios Brinkmann.

    Huge thank you to Weights & Biases for sponsoring this episode. WandB Free Courses - http://wandb.me/courses_mlops

    // Abstract
    Uber's Michelangelo platform has evolved significantly through three major phases, enhancing its capabilities from basic ML predictions to sophisticated uses in deep learning and generative AI. Initially, Michelangelo 1.0 faced several challenges such as a lack of deep learning support and inadequate project tiering. To address these issues, Michelangelo 2.0 and subsequently 3.0 introduced improvements like support for PyTorch, enhanced model training, and integration of new technologies like NVIDIA's Triton and Kubernetes. The platform now includes advanced features such as a GenAI gateway, robust compliance guardrails, and a system for monitoring model performance to streamline and secure AI operations at Uber.

    // Bio
    At the moment Demetrios is immersing himself in Machine Learning by interviewing experts from around the world in the weekly MLOps.community meetups. Demetrios constantly learns and engages in new activities to get uncomfortable and learn from his mistakes. He tries to bring creativity into every aspect of his life, whether analyzing the best paths forward, overcoming obstacles, or building Lego houses with his daughter.

    // MLOps Jobs board
    https://mlops.pallet.xyz/jobs

    // MLOps Swag/Merch
    https://mlops-community.myshopify.com/

    // Related Links
    From Predictive to Generative – How Michelangelo Accelerates Uber's AI Journey blog post: https://www.uber.com/en-JP/blog/from-predictive-to-generative-ai/
    Uber's Michelangelo: https://www.uber.com/en-JP/blog/michelangelo-machine-learning-platform/
    The Future of Feature Stores and Platforms // Mike Del Balso & Josh Wills // MLOps Podcast #186: https://youtu.be/p5F7v-w4EN0
    Machine Learning Education at Uber // Melissa Barr & Michael Mui // MLOps Podcast #156: https://youtu.be/N6EbBUFVfO8

    --------------- ✌️Connect With Us ✌️ -------------
    Join our slack community: https://go.mlops.community/slack
    Follow us on Twitter: @mlopscommunity
    Sign up for the next meetup: https://go.mlops.community/register
    Catch all episodes, blogs, newsletters, and more: https://mlops.community/

    Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/

    Timestamps:
    [00:00] Uber's Michelangelo platform evolution analyzed in podcast
    [03:51 - 04:50] Weights & Biases Ad
    [05:57] Uber creates Michelangelo to streamline machine learning
    [07:44] Michelangelo platform's tech and flexible system
    [11:49] Uber Michelangelo platform adapted for deep learning
    [16:48] Uber invests in ML training for employees
    [19:08] Explanation of blog content, ML quality metrics
    [22:38] Michelangelo 2.0 prioritizes serving latency and Kubernetes
    [26:30] GenAI gateway manages model routing and costs
    [31:35] ML platform evolution, legacy systems, and maintenance
    [33:22] Team debates maintaining outdated tools or moving on
    [34:41] Please like, share, leave feedback, and subscribe to our MLOps channels!
    [34:57] Wrap up

    • 35 min
    AWS Trainium and Inferentia // Kamran Khan and Matthew McClean // #238

    Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

    Matthew McClean is a Machine Learning Technology Leader at Amazon Web Services (AWS). He leads the customer engineering teams at Annapurna ML, helping customers adopt AWS Trainium and Inferentia for their Gen AI workloads.

    Kamran Khan is a Sr. Technical Business Development Manager for AWS Inferentia/Trainium at AWS. He has over a decade of experience helping customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium.

    AWS Trainium and Inferentia // MLOps podcast #238 with Kamran Khan, BD, Annapurna ML and Matthew McClean, Annapurna Labs Lead Solution Architecture at AWS.

    Huge thank you to AWS for sponsoring this episode. AWS - https://aws.amazon.com/

    // Abstract
    Unlock unparalleled performance and cost savings with AWS Trainium and Inferentia! These powerful AI accelerators offer MLOps community members enhanced availability, compute elasticity, and energy efficiency. Seamlessly integrate with PyTorch, JAX, and Hugging Face, and enjoy robust support from industry leaders like W&B, Anyscale, and Outerbounds. Perfectly compatible with AWS services like Amazon SageMaker, getting started has never been easier. Elevate your AI game with AWS Trainium and Inferentia!

    // Bio
    Kamran Khan
    Helping developers and users achieve their AI performance and cost goals for almost 2 decades.

    Matthew McClean
    Leads the Annapurna Labs Solution Architecture and Prototyping teams helping customers train and deploy their Generative AI models with AWS Trainium and AWS Inferentia

    // MLOps Jobs board
    https://mlops.pallet.xyz/jobs

    // MLOps Swag/Merch
    https://mlops-community.myshopify.com/

    // Related Links
    AWS Trainium: https://aws.amazon.com/machine-learning/trainium/
    AWS Inferentia: https://aws.amazon.com/machine-learning/inferentia/

    --------------- ✌️Connect With Us ✌️ -------------
    Join our slack community: https://go.mlops.community/slack
    Follow us on Twitter: @mlopscommunity
    Sign up for the next meetup: https://go.mlops.community/register
    Catch all episodes, blogs, newsletters, and more: https://mlops.community/

    Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
    Connect with Kamran on LinkedIn: https://www.linkedin.com/in/kamranjk/
    Connect with Matt on LinkedIn: https://www.linkedin.com/in/matthewmcclean/

    Timestamps:
    [00:00] Matt's & Kamran's preferred coffee
    [00:53] Takeaways
    [01:57] Please like, share, leave a review, and subscribe to our MLOps channels!
    [02:22] AWS Trainium and Inferentia rundown
    [06:04] Inferentia vs GPUs: Comparison
    [11:20] Using Neuron for ML
    [15:54] Should Trainium and Inferentia go together?
    [18:15] ML Workflow Integration Overview
    [23:10] The EC2 instance
    [24:55] Bedrock vs SageMaker
    [31:16] Shifting mindset toward open source in enterprise
    [35:50] Fine-tuning open-source models, reducing costs significantly
    [39:43] Model deployment cost can be reduced innovatively
    [43:49] Benefits of using Inferentia and Trainium
    [45:03] Wrap up

    • 45 min
    Build Reliable Systems with Chaos Engineering // Benjamin Wilms // #237

    Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/.

    Benjamin Wilms is a developer and software architect at heart, with 20 years of experience, who fell in love with chaos engineering. Benjamin now spreads his enthusiasm and knowledge as a speaker and author, especially in the field of chaos and resilience engineering.

    Build Reliable Systems with Chaos Engineering // MLOps podcast #237 with Benjamin Wilms, CEO & Co-Founder of Steadybit.

    Huge thank you to Amazon Web Services for sponsoring this episode. AWS - https://aws.amazon.com/

    // Abstract
    How to build reliable systems under unpredictable conditions with Chaos Engineering.

    // Bio
    Benjamin has over 20 years of experience as a developer and software architect. He fell in love with chaos engineering 7 years ago and shares his knowledge as a speaker and author. In October 2019, he founded the startup Steadybit with two friends, focusing on developers and teams embracing chaos engineering. He relaxes by mountain biking when he's not knee-deep in complex and distributed code.

    // MLOps Jobs board
    https://mlops.pallet.xyz/jobs

    // MLOps Swag/Merch
    https://mlops-community.myshopify.com/

    // Related Links
    Website: https://steadybit.com/

    --------------- ✌️Connect With Us ✌️ -------------
    Join our slack community: https://go.mlops.community/slack
    Follow us on Twitter: @mlopscommunity
    Sign up for the next meetup: https://go.mlops.community/register
    Catch all episodes, blogs, newsletters, and more: https://mlops.community/

    Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
    Connect with Benjamin on LinkedIn: https://www.linkedin.com/in/benjamin-wilms/

    Timestamps:
    [00:00] Benjamin's preferred coffee
    [00:28] Takeaways
    [02:10] Please like, share, leave a review, and subscribe to our MLOps channels!
    [02:53] Chaos Engineering tldr
    [06:13] Complex Systems for smaller Startups
    [07:21] Chaos Engineering benefits
    [10:39] Data Chaos Engineering trend
    [15:29] Chaos Engineering vs ML Resilience
    [17:57 - 17:58] AWS Trainium and AWS Inferentia Ad
    [19:00] Chaos engineering tests system vulnerabilities and solutions
    [23:24] Data distribution issues across different time zones
    [27:07] Expertise is essential in fixing systems
    [31:01] Chaos engineering integrated into machine learning systems
    [32:25] Pre-CI/CD steps and automating experiments for deployments
    [36:53] Chaos engineering emphasizes tool over value
    [38:58] Strong integration into observability tools for repeatable experiments
    [45:30] Invaluable insights on chaos engineering
    [46:42] Wrap up

    • 46 min
    Managing Small Knowledge Graphs for Multi-agent Systems // Tom Smoker // #236

    Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

    Tom Smoker is the cofounder of an early-stage tech company empowering developers to create knowledge graphs within their RAG pipelines. Tom is a technical founder and owns the research and development of knowledge graph tooling for the company.

    Managing Small Knowledge Graphs for Multi-agent Systems // MLOps podcast #236 with Tom Smoker, Technical Founder of whyhow.ai.

    A big thank you to @latticeflow for sponsoring this episode! LatticeFlow - https://latticeflow.ai/

    // Abstract
    RAG is one of the more popular use cases for generative models, but there can be issues with repeatability and accuracy. This is especially applicable when it comes to using many agents within a pipeline, as the uncertainty propagates. For some multi-agent use cases, knowledge graphs can be used to structurally ground the agents and selectively improve the system to make it reliable end to end.

    // Bio
    Technical Founder of WhyHow.ai. Holds a Master's and PhD in CS, specializing in knowledge graphs, embeddings, and NLP. Worked in roles from data scientist to senior machine learning engineer at large resource companies and startups.

    // MLOps Jobs board
    https://mlops.pallet.xyz/jobs

    // MLOps Swag/Merch
    https://mlops-community.myshopify.com/

    // Related Links

    A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models: https://arxiv.org/abs/2401.01313
    Understanding the type of Knowledge Graph you need — Fixed vs Dynamic Schema/Data: https://medium.com/enterprise-rag/understanding-the-type-of-knowledge-graph-you-need-fixed-vs-dynamic-schema-data-13f319b27d9e

    --------------- ✌️Connect With Us ✌️ -------------
    Join our slack community: https://go.mlops.community/slack
    Follow us on Twitter: @mlopscommunity
    Sign up for the next meetup: https://go.mlops.community/register
    Catch all episodes, blogs, newsletters, and more: https://mlops.community/

    Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
    Connect with Tom on LinkedIn: https://www.linkedin.com/in/thomassmoker/

    Timestamps:
    [00:00] Tom's preferred coffee
    [00:33] Takeaways
    [03:04] Please like, share, leave a review, and subscribe to our MLOps channels!
    [03:23] Academic Curiosity and Knowledge Graphs
    [05:07] Logician
    [05:53] Knowledge graphs incorporated into RAGs
    [07:53] Graphs & Vectors Integration
    [10:49] "Exactly wrong"
    [12:14] Data Integration for Robust Knowledge Graph
    [14:53] Structured and Dynamic Data
    [21:44] Scoped Knowledge Retrieval Strategies
    [28:01 - 29:32] LatticeFlow Ad
    [29:33] RAG Limitations and Solutions
    [36:10] Working on multi-agents, questioning agent definition
    [40:01] Concerns about performance of agent information transfer
    [43:45] Anticipating agent-based systems with modular processes
    [52:04] Balancing risk tolerance in company operations and control
    [54:11] Using AI to generate high-quality, efficient content
    [01:03:50] Wrap up

    • 1 hr 4 min
    Just when we Started to Solve Software Docs, AI Blew Everything Up // Dave Nunez // #235

    Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/

    David Nunez, based in Santa Barbara, CA, US, is currently a Co-Founder and Partner at Abstract Group, bringing experience from previous roles at First Round Capital, Stripe, and Slab.

    Just when we Started to Solve Software Docs, AI Blew Everything Up // MLOps Podcast #235 with Dave Nunez, Partner of Abstract Group co-hosted by Jakub Czakon.

    Huge thank you to Zilliz for sponsoring this episode. Zilliz - https://zilliz.com/.

    // Abstract
    Over the previous decade, the recipe for making excellent software docs mostly converged on a set of core goals:

    Create high-quality, consistent content
    Use different content types depending on the task
    Make the docs easy to find

    For AI-focused software and products, the entire developer education playbook needs to be rewritten.

    // Bio
    Dave lives in Santa Barbara, CA with his wife and four kids.

    He started his tech career at various startups in Santa Barbara before moving to San Francisco to work at Salesforce. After Salesforce, he spent 2+ years at Uber and 5+ years at Stripe leading internal and external developer documentation efforts.

    In 2021, he co-authored Docs for Developers to help engineers become better writers. He's now a consultant, advisor, and angel investor for fast-growing startups. He typically invests in early-stage startups focusing on developer tools, productivity, and AI.

    He's a reading nerd, Lakers fan, and golf masochist.

    // MLOps Jobs board
    https://mlops.pallet.xyz/jobs

    // MLOps Swag/Merch
    https://mlops-community.myshopify.com/

    // Related Links
    Website: https://www.abstractgroup.co/
    Book: docsfordevelopers.com
    About Dave: https://gamma.app/docs/Dave-Nunez-about-me-002doxb23qbblme?mode=doc
    https://review.firstround.com/investing-in-internal-documentation-a-brick-by-brick-guide-for-startups
    https://increment.com/documentation/why-investing-in-internal-docs-is-worth-it/

    Writing to Learn paper by Peter Elbow: https://peterelbow.com/pdfs/Writing_for_Learning-Not_just_Demonstrating.PDF


    --------------- ✌️Connect With Us ✌️ -------------
    Join our slack community: https://go.mlops.community/slack
    Follow us on Twitter: @mlopscommunity
    Sign up for the next meetup: https://go.mlops.community/register
    Catch all episodes, blogs, newsletters, and more: https://mlops.community/

    Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
    Connect with Dave on LinkedIn: https://www.linkedin.com/in/djnunez/
    Connect with Kuba on LinkedIn: https://www.linkedin.com/in/jakub-czakon/?locale=en_US

    Timestamps:
    [00:00] Dave's preferred coffee
    [00:13] Introducing this episode's co-host, Kuba
    [00:36] Takeaways
    [02:55] Please like, share, leave a review, and subscribe to our MLOps channels!
    [03:23] Good docs, bad docs, and how to feel them
    [06:51] Inviting Dev docs and checks
    [10:36] Stripe's writing culture
    [12:42] Engineering team writing culture
    [14:15] Bottom-up tech writer change
    [18:31] Stripe docs cult following
    [24:40] TriDocs Smart API Injection
    [26:42] User research for documentation
    [29:51] Design cues
    [32:15] Empathy-driven docs creation
    [34:28 - 35:35] Zilliz Ad
    [35:36] Foundational elements in documentation
    [38:23] Minimal infrastructure of information in "Read Me"
    [40:18] Measuring documentation with OKRs
    [43:58] Improve pages with Analytics
    [47:33] Google branded doc searches
    [48:35] Time to First Action
    [52:52] Dave's day in and day out and what excites him
    [56:01] Exciting internal documentation
    [59:55] Wrap up

    • 1 hr 1 min
    Open Standards Make MLOps Easier and Silos Harder // Cody Peterson // #234

    Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com/


    Cody Peterson has diverse work experience in product management and engineering. He is currently working as a Technical Product Manager at Voltron Data, starting in May 2023. Previously, he worked as a Product Manager at dbt Labs from July 2022 to March 2023.

    MLOps podcast #234 with Cody Peterson, Senior Technical Product Manager at Voltron Data | Ibis project // Open Standards Make MLOps Easier and Silos Harder.

    Huge thank you to Weights & Biases for sponsoring this episode. WandB Free Courses - http://wandb.me/courses_mlops

    // Abstract
    MLOps is fundamentally a discipline of people working together on a system with data and machine learning models. These systems are already built on open standards we may not notice -- Linux, git, scikit-learn, etc. -- but are increasingly hitting walls with respect to the size and velocity of data.

    Pandas, for instance, is the tool of choice for many Python data scientists -- but its scalability is a known issue. Many tools make the assumption of data that fits in memory, but most organizations have data that will never fit in a laptop. What approaches can we take?

    One emerging approach with the Ibis project (created by the creator of pandas, Wes McKinney) is to leverage existing "big" data systems to do the heavy lifting on a lightweight Python data frame interface. Alongside other open source standards like Apache Arrow, this can allow data systems to communicate with each other and users of these systems to learn a single data frame API that works across any of them.

    Open standards like Apache Arrow, Ibis, and more in the MLOps tech stack enable freedom for composable data systems, where components can be swapped out allowing engineers to use the right tool for the job to be done. It also helps avoid vendor lock-in and keep costs low.

    // Bio
    Cody is a Senior Technical Product Manager at Voltron Data, a next-generation data systems builder that recently launched an accelerator-native GPU query engine for petabyte-scale ETL called Theseus. While Theseus is proprietary, Voltron Data takes an open periphery approach -- it is built on and interfaces through open standards like Apache Arrow, Substrait, and Ibis. Cody focuses on the Ibis project, a portable Python dataframe library that aims to be the standard Python interface for any data system, including Theseus and over 20 other backends.

    Prior to Voltron Data, Cody was a product manager at dbt Labs focusing on the open source dbt Core and launching Python models (note: models is a confusing term here). Later, he led the Cloud Runtime team and drastically improved the efficiency of engineering execution and product outcomes.

    Cody started his career as a Product Manager at Microsoft working on Azure ML. He spent about 2 years on the dedicated MLOps product team, and 2 more years on various teams across the ML lifecycle including data, training, and inferencing.

    He is now passionate about using open source standards to break down the silos and challenges facing real world engineering teams, where engineering increasingly involves data and machine learning.

    // MLOps Jobs board
    https://mlops.pallet.xyz/jobs

    // MLOps Swag/Merch
    https://mlops-community.myshopify.com/

    // Related Links
    Ibis Project: https://ibis-project.org
    Apache Arrow and the “10 Things I Hate About pandas”: https://wesmckinney.com/blog/apache-arrow-pandas-internals/

    --------------- ✌️Connect With Us ✌️ -------------
    Join our slack community: https://go.mlops.community/slack
    Follow us on Twitter: @mlopscommunity
    Sign up for the next meetup: https://go.mlops.community/register
    Catch all episodes, blogs, newsletters, and more: https://mlops.community/

    Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
    Connect with Cody on LinkedIn: https://linkedin.com/in/codydkdc

    • 46 min
