In this episode of Big Ideas Only, host Mikkel Svold takes a theoretical deep dive into how computers “see” with Andreas Møgelmose (Associate Professor of AI, Aalborg University; Visual Analysis & Perception Lab). We unpack the neural-network ideas behind modern vision, why 2012 was a turning point, how convolutional networks work, the difference between training, fine-tuning, and adding context, plus explainability, bias traps, multimodality, and what still needs solving.

In this episode, you’ll learn about:

1. How a 2012 vision breakthrough reshaped speech and language research
2. Neural networks explained simply — how they learn patterns from data
3. CNNs: how computers spot shapes and textures in images
4. Training, fine-tuning, and adding context to make models smarter
5. From hand-crafted features to fully data-driven learning
6. Explainability: the “ruler in skin-cancer photos” bias trap and what it teaches us
7. Multimodal systems: models combining text, images, and tools
8. Depth sensing with stereo, lidar, radar, and time-of-flight — and when 3D is essential
9. Privacy and governance: why real risk lies in implementation, not vision itself
10. Open challenges: fine-grained recognition, explainability, and machine unlearning
11. The pace of progress: steady research with headline-making leaps

Episode Content

01:09 How computer vision differs from other AI fields
01:16 The 2012 breakthrough: neural networks in vision that spread to speech and text
04:05 Neural networks 101: neurons, weights, and simple math scaled up to complex decisions
07:06 Training at scale: millions of images, pretraining, and fine-tuning for specific tasks
10:39 Fine-tuning vs. adding context in large language models; backpropagation explained
16:52 Layered learning: from edges to shapes, faces, and full objects
18:22 Before deep learning: feature engineering and why it hit its limits
20:44 How it’s built: data collection, architecture design, training loops, and learning plateaus
22:54 Bias pitfalls: the “ruler in skin-cancer photos” example and why explainability matters
25:23 Regulation and trust: high-risk uses and the demand for transparency
26:13 Connecting vision to action: from black-box outputs to robots with “vision in the loop”
27:41 Ensemble systems: language models coordinating other models (e.g., text-to-image)
29:03 True multimodality: training models jointly on text and images
30:17 AGI reflections: embodiment, experience, and the limits of data
32:44 Human vision vs. computer vision: depth of field, aperture, and why machines see everything in focus
34:40 Is progress slowing or steady? Research milestones versus quiet, continuous work
36:43 Public perception: many versions, but most still see “just ChatGPT”
37:41 Why the research pace feels natural — more people means faster progress

This podcast is produced by Montanus.