54 min

S2E29 - "Synthetic Data in AI: Challenges, Techniques & Use Cases" with Andrew Clark and Sid Mangalik (Monitaur)
The Shifting Privacy Left Podcast

    • Technology

This week I welcome Dr. Andrew Clark, Co-founder & CTO of Monitaur, a trusted domain expert on the topic of machine learning, auditing and assurance; and Sid Mangalik, Research Scientist at Monitaur and PhD student at Stony Brook University. I discovered Andrew and Sid's new podcast show, The AI Fundamentalists Podcast. I very much enjoyed their lively episode on Synthetic Data & AI, and am delighted to introduce them to my audience of privacy engineers.

In our conversation, we explore why data scientists must stress test their model validations, especially for consequential systems that affect human safety and reliability. In fact, we have much to learn from the aerospace engineering field, which has been using ML/AI since the 1960s. We discuss the best and worst use cases for synthetic data; problems with LLM-generated synthetic data; what can go wrong when your AI models lack diversity; how to build fair, performant systems; & synthetic data techniques for use with AI.

Topics Covered:
• What inspired Andrew to found Monitaur and focus on AI governance
• Sid’s career path and his current PhD focus on NLP
• What motivated Andrew & Sid to launch their podcast, The AI Fundamentalists
• Defining 'synthetic data' & why academia takes a more rigorous approach to synthetic data than industry
• Whether the output of LLMs is synthetic data & the problem with training LLM base models with this data
• The best and worst 'synthetic data' use cases for ML/AI
• Why the 'quality' of input data is so important when training AI models
• Thoughts on OpenAI's announcement that it will use LLM-generated synthetic data; and critique of OpenAI's approach, the AI hype machine, and the problems with 'growth hacking' corner-cutting
• The importance of diversity when training AI models; using 'multi-objective modeling' for building fair & performant systems
• Andrew unpacks the "fairness through unawareness fallacy"
• How 'randomized data' differs from 'synthetic data'
• 4 techniques for using synthetic data with ML/AI: 1) the Monte Carlo method; 2) Latin hypercube sampling; 3) Gaussian copulas; & 4) random walking
• What excites Andrew & Sid about synthetic data and how it will be used with AI in the future

Resources Mentioned:
• Check out Podchaser
• Listen to The AI Fundamentalists Podcast
• Check out Monitaur

Guest Info:
• Follow Andrew on LinkedIn
• Follow Sid on LinkedIn

Privado.ai
Privacy assurance at the speed of product development. Get instant visibility w/ privacy code scans.

Shifting Privacy Left Media
Where privacy engineers gather, share, & learn

Disclaimer: This post contains affiliate links. If you make a purchase, I may receive a commission at no extra cost to you.

Copyright © 2022 - 2024 Principled LLC. All rights reserved.
