AI Post Transformers

mcgrof

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.

  1. Automating DNN Compilation for FPGA Accelerators

    This episode explores FP-DNN, a 2017 framework that aims to compile TensorFlow-era neural networks onto FPGAs automatically, reducing the need for hand-designed accelerators for each model. It explains how the system maps convolutional layers, fully connected layers, and parts of LSTM computation into a shared matrix-multiplication core, while combining hand-tuned RTL for performance-critical components with HLS-generated logic for orchestration and layer-specific handling. The discussion highlights why this hybrid design matters for performance-per-watt, latency, and communication efficiency, especially as deeper CNNs and recurrent models were pushing hardware limits. Listeners would find it interesting for its clear look at an early attempt to turn FPGA deployment from an expert-only craft into a more reusable compiler-driven workflow, while also showing where the paper’s claims about broad model coverage may be too optimistic.

    Sources:

    1. Automating DNN Compilation for FPGA Accelerators https://ceca.pku.edu.cn/media/lw/e3d0e0cd92452e0504b148220d442b9a.pdf
    2. A Survey of FPGA-based Neural Network Inference Accelerators — Kaiyuan Guo, Shulin Zeng, Jincheng Yu, Yu Wang, Huazhong Yang, 2019 https://scholar.google.com/scholar?q=A+Survey+of+FPGA-based+Neural+Network+Inference+Accelerators
    3. DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family — Ying Wang, Jie Xu, Yudeng Sun, Baohua Cao, Chunyuan Xu, Yibo Kong, Chundao Han, Xuan Wang, 2016 https://scholar.google.com/scholar?q=DeepBurning:+Automatic+Generation+of+FPGA-based+Learning+Accelerators+for+the+Neural+Network+Family
    4. FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates — Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, Jason Cong, 2017 https://scholar.google.com/scholar?q=FP-DNN:+An+Automated+Framework+for+Mapping+Deep+Neural+Networks+onto+FPGAs+with+RTL-HLS+Hybrid+Templates
    5. DNNBuilder: An Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs — Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-Mei Hwu, Deming Chen, 2018 https://scholar.google.com/scholar?q=DNNBuilder:+An+Automated+Tool+for+Building+High-Performance+DNN+Hardware+Accelerators+for+FPGAs
    6. From High-Level Deep Neural Models to FPGAs — Hardik Sharma, Jongse Park, Emmanuel Amaro, Bradley Thwaites, Priyanka Kotha, Anmol Gupta, Joon Kyung Kim, Asit Mishra, and Hsien-Hsin S. Lee, 2016 https://scholar.google.com/scholar?q=From+High-Level+Deep+Neural+Models+to+FPGAs
    7. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks — Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong, 2016 https://scholar.google.com/scholar?q=Caffeine:+Towards+Uniformed+Representation+and+Acceleration+for+Deep+Convolutional+Neural+Networks
    8. Throughput-Optimized OpenCL-Based FPGA Accelerator for Large-Scale Convolutional Neural Networks — Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarita Vrudhula, Jae-sun Seo, and Yu Cao, 2016 https://scholar.google.com/scholar?q=Throughput-Optimized+OpenCL-Based+FPGA+Accelerator+for+Large-Scale+Convolutional+Neural+Networks
    9. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network — Jiantao Qiu, Jie Wang, Song Yao, Kai Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, and Song Wang, 2016 https://scholar.google.com/scholar?q=Going+Deeper+with+Embedded+FPGA+Platform+for+Convolutional+Neural+Network
    10. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks — Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong, 2015 https://scholar.google.com/scholar?q=Optimizing+FPGA-Based+Accelerator+Design+for+Deep+Convolutional+Neural+Networks
    11. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding — Song Han, Huizi Mao, and William J. Dally, 2015 https://scholar.google.com/scholar?q=Deep+Compression:+Compressing+Deep+Neural+Networks+with+Pruning,+Trained+Quantization+and+Huffman+Coding
    12. BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach — Zhen Zheng et al., 2023 https://scholar.google.com/scholar?q=BladeDISC:+Optimizing+Dynamic+Shape+Machine+Learning+Workloads+via+Compiler+Approach
    13. TSCompiler: efficient compilation framework for dynamic-shape models — Xiang Luo, Chen Zhang, Chenbo Geng, Yanzhi Yi, Jiahui Hu, Renwei Zhang, Zhen Zhang, Gianpietro Consolaro, Fan Yang, Tun Lu, Ning Gu, Li Shang, 2024 https://scholar.google.com/scholar?q=TSCompiler:+efficient+compilation+framework+for+dynamic-shape+models
    14. TATAA: Programmable Mixed-Precision Transformer Acceleration with a Transformable Arithmetic Architecture — Jiajun Wu, Mo Song, Jingmin Zhao, Yizhao Gao, Jia Li, Hayden Kwok-Hay So, 2024 https://scholar.google.com/scholar?q=TATAA:+Programmable+Mixed-Precision+Transformer+Acceleration+with+a+Transformable+Arithmetic+Architecture
    15. FPGA Acceleration With Hessian-Based Comprehensive Intra-Layer Mixed-Precision Quantization for Transformer Models — Woohong Byun, Jongseok Woo, Saibal Mukhopadhyay, 2025 https://scholar.google.com/scholar?q=FPGA+Acceleration+With+Hessian-Based+Comprehensive+Intra-Layer+Mixed-Precision+Quantization+for+Transformer+Models
    16. Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference — Zifan He, Rui Ma, Yizhou Sun, Jason Cong, 2026 https://scholar.google.com/scholar?q=Understand+and+Accelerate+Memory+Processing+Pipeline+for+Disaggregated+LLM+Inference
    17. AI Post Transformers: FPGA Neural Network Accelerators for Space — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-26-fpga-neural-network-accelerators-for-spa-3087ae.mp3
    18. AI Post Transformers: Continuous Batching for LLM Inference: Throughput and Latency Gains — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/continuous-batching-for-llm-inference-throughput-and-latency-gains/
    19. AI Post Transformers: Advancements in Efficient KV Cache Quantization and Management — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/advancements-in-efficient-kv-cache-quantization-and-management/

    Interactive Visualization: Automating DNN Compilation for FPGA Accelerators
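The shared matrix-multiplication core at the heart of this episode can be illustrated in plain Python: a convolutional layer is lowered to the same matrix multiply that a fully connected layer already is. This is a minimal software sketch of the general technique, not FP-DNN's actual RTL/HLS templates; the `im2col` helper, the toy 4x4 image, and the all-ones weights are illustrative assumptions.

```python
def matmul(a, b):
    """The single shared compute core: a plain matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def im2col(img, k):
    """Unfold each k x k patch of a 2-D image into one column of a matrix."""
    h, w = len(img), len(img[0])
    patches = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patches.append([img[i + di][j + dj] for di in range(k) for dj in range(k)])
    return [list(col) for col in zip(*patches)]  # shape: (k*k) x num_patches

img = [[float(4 * r + c) for c in range(4)] for r in range(4)]
filt = [[1.0] * 9]                       # one 3x3 all-ones filter, flattened
conv_out = matmul(filt, im2col(img, 3))  # convolution, run on the shared core
fc_w = [[1.0] * 16, [1.0] * 16]          # a 2x16 fully connected layer
flat = [[v] for row in img for v in row]
fc_out = matmul(fc_w, flat)              # FC layer, run on the very same core
```

Both layer types reduce to one `matmul` call, which is the reuse argument the framework makes in hardware terms.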

  2. Boosted Decision Trees for CMS Muon Triggers

    This episode explores how the CMS experiment uses machine learning inside its Level-1 endcap muon trigger, where hardware must estimate muon momentum within roughly 500 nanoseconds while filtering an enormous stream of proton-collision data. It explains why boosted decision trees were chosen over neural networks: not because they were fashionable, but because they fit strict FPGA constraints around deterministic latency, fixed-point arithmetic, and bounded memory. A central finding is that the online system does not run the trees directly; instead, the model is trained offline and compiled into a massive precomputed lookup table, turning inference into a single fast memory access. The discussion is especially interesting because it shows machine learning as a systems-and-hardware co-design problem, grounded in detector physics, feature engineering, and the practical realities of deploying learned functions in one of the harshest real-time environments in science.

    Sources:

    1. Boosted Decision Trees for CMS Muon Triggers https://indico.cern.ch/event/567550/papers/2629686/files/6172-acat_bdt_l1t.pdf
    2. Applications and Techniques for Fast Machine Learning in Science — Allison McCarn Deiana, Nhan Tran, Joshua Agar, Michaela Blott, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Mia Liu, Mark Neubauer, Jennifer Ngadiuba, Maurizio Pierini and many others, 2022 https://scholar.google.com/scholar?q=Applications+and+Techniques+for+Fast+Machine+Learning+in+Science
    3. Fast inference of Boosted Decision Trees in FPGAs for particle physics — Sioni Summers, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Duc Hoang, Sergo Jindariani, Edward Kreinar, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Dylan Rankin, Nhan Tran and Zhenbin Wu, 2020 https://scholar.google.com/scholar?q=Fast+inference+of+Boosted+Decision+Trees+in+FPGAs+for+particle+physics
    4. Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors — Claudionor N. Coelho Jr, Aki Kuusela, Shan Li, Hao Zhuang, Jennifer Ngadiuba, Thea Klaeboe Aarrestad, Vladimir Loncar, Maurizio Pierini, Adrian Alan Pol and Sioni Summers, 2021 https://scholar.google.com/scholar?q=Automatic+heterogeneous+quantization+of+deep+neural+networks+for+low-latency+inference+on+the+edge+for+particle+detectors
    5. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave — Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Mahdi Ghandi, Daniel Lo and others, 2018 https://scholar.google.com/scholar?q=Serving+DNNs+in+Real+Time+at+Datacenter+Scale+with+Project+Brainwave
    6. The CMS Trigger System — CMS Collaboration, year not specified in excerpt https://scholar.google.com/scholar?q=The+CMS+Trigger+System
    7. The CMS Endcap Muon Track Finder — CMS Collaboration or EMTF-related authors, year not specified in excerpt https://scholar.google.com/scholar?q=The+CMS+Endcap+Muon+Track+Finder
    8. TMVA: Toolkit for Multivariate Data Analysis — Andreas Hoecker and collaborators, year not specified in excerpt https://scholar.google.com/scholar?q=TMVA:+Toolkit+for+Multivariate+Data+Analysis
    9. Fast Machine Learning for Science: how accelerated hardware and software are enabling real-time data analysis at the edge — Javier Duarte and collaborators, 2022 https://scholar.google.com/scholar?q=Fast+Machine+Learning+for+Science:+how+accelerated+hardware+and+software+are+enabling+real-time+data+analysis+at+the+edge
    10. hls4ml: An Open-Source Codesign Workflow to Empower Scientific Low-Power Machine Learning Devices — Giuseppe Di Guglielmo, Javier Duarte and collaborators, 2021 https://scholar.google.com/scholar?q=hls4ml:+An+Open-Source+Codesign+Workflow+to+Empower+Scientific+Low-Power+Machine+Learning+Devices
    11. End-to-end codesign of Hessian-aware quantized neural networks for FPGAs and ASICs — Javier Campos, Zhen Dong, Javier Duarte, Nhan Tran, et al., 2023 https://scholar.google.com/scholar?q=End-to-end+codesign+of+Hessian-aware+quantized+neural+networks+for+FPGAs+and+ASICs
    12. FPGA-QNN: Quantized Neural Network Hardware Acceleration on FPGAs — Mustafa Tasci, Ayhan Istanbullu, Vedat Tumen, Selahattin Kosunalp, 2025 https://scholar.google.com/scholar?q=FPGA-QNN:+Quantized+Neural+Network+Hardware+Acceleration+on+FPGAs
    13. An FPGA-Based Time-to-Digital Converter with Online Dual-Chain Calibration — Zhengsen Jia, Yuzhuo Wang, Jie Ding, Qian Xu, et al., 2025 https://scholar.google.com/scholar?q=An+FPGA-Based+Time-to-Digital+Converter+with+Online+Dual-Chain+Calibration
    14. A Novel FPGA-based Time-to-Digital Converter featuring Machine Learning-Aided Self-Calibration — Arash Amini Bardpareh, Eleonora Vacca, Davide Nicolini, Luca Sterpone, et al., 2026 https://scholar.google.com/scholar?q=A+Novel+FPGA-based+Time-to-Digital+Converter+featuring+Machine+Learning-Aided+Self-Calibration
    15. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3
    16. AI Post Transformers: SolidAttention: Co-Designing Sparse Attention and SSD I/O — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-18-solidattention-co-designing-sparse-atten-5a8622.mp3

    Interactive Visualization: Boosted Decision Trees for CMS Muon Triggers
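The compile-the-model-into-a-lookup-table idea from this episode can be sketched in a few lines: quantize each input feature into address bits, precompute the ensemble's output for every possible address offline, and reduce online inference to one table read. This is a toy stand-in under stated assumptions, not the CMS trigger's actual model or address layout; the two lambda "trees", the 2-bit quantization, and the [0, 1] feature ranges are invented for illustration.

```python
from itertools import product

def boosted_predict(trees, features):
    """Reference ensemble output: the sum of per-tree contributions."""
    return sum(tree(features) for tree in trees)

def compile_to_lut(trees, bits_per_feature, n_features):
    """Offline step: precompute the model for every quantized input pattern."""
    levels = 1 << bits_per_feature
    return {combo: boosted_predict(trees, combo)
            for combo in product(range(levels), repeat=n_features)}

def quantize(value, lo, hi, bits):
    """Map a raw detector quantity onto its fixed-point address bits."""
    levels = (1 << bits) - 1
    v = min(max(value, lo), hi)
    return round((v - lo) / (hi - lo) * levels)

# Two toy "trees" over a 2-feature quantized input (illustrative only).
trees = [lambda f: 2.0 if f[0] > 2 else 0.5,
         lambda f: 1.0 if f[1] > 1 else -0.5]
lut = compile_to_lut(trees, bits_per_feature=2, n_features=2)

# Online step: quantize the inputs, then do a single table lookup.
addr = (quantize(0.9, 0.0, 1.0, 2), quantize(0.2, 0.0, 1.0, 2))
estimate = lut[addr]
```

The table has 2^(bits per feature x features) entries, which is why the real system has to budget address bits carefully against block-RAM capacity.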

  3. Caffe and the Rise of CNN Frameworks

    This episode explores why Caffe mattered as a systems breakthrough for the early CNN era, even though it did not introduce a new learning algorithm. It explains how the framework helped researchers and engineers move from handcrafted vision features to learned feature embeddings, and why separating model definition from implementation made experimentation and deployment far more practical. The discussion highlights Caffe’s use of declarative Protocol Buffers configurations, directed acyclic graph model structure, and the blob abstraction that hid CPU versus GPU details while supporting modular extensions. Listeners would find it interesting for its clear account of how deep learning became usable at scale in 2014, and for its nuanced take on Caffe’s evidence: strong engineering promises, impressive throughput figures, and a major role in shaping the emerging model-development ecosystem.

    Sources:

    1. Caffe: Convolutional Architecture for Fast Feature Embedding — Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell, 2014 http://arxiv.org/abs/1408.5093
    2. ImageNet Classification with Deep Convolutional Neural Networks — Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012 https://scholar.google.com/scholar?q=ImageNet+Classification+with+Deep+Convolutional+Neural+Networks
    3. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks — Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun, 2013 https://scholar.google.com/scholar?q=OverFeat:+Integrated+Recognition,+Localization+and+Detection+using+Convolutional+Networks
    4. Decaf: A Deep Convolutional Activation Feature for Generic Visual Recognition — Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell, 2014 https://scholar.google.com/scholar?q=Decaf:+A+Deep+Convolutional+Activation+Feature+for+Generic+Visual+Recognition
    5. cuda-convnet — Alex Krizhevsky, 2012 https://scholar.google.com/scholar?q=cuda-convnet
    6. Theano: new features and speed improvements — Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, Yoshua Bengio and others, 2012 https://scholar.google.com/scholar?q=Theano:+new+features+and+speed+improvements
    7. Pylearn2: a machine learning research library — Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio, 2013 https://scholar.google.com/scholar?q=Pylearn2:+a+machine+learning+research+library
    8. Torch7 — Ronan Collobert, Koray Kavukcuoglu, Clément Farabet and others, 2011 https://scholar.google.com/scholar?q=Torch7
    9. Efficient inference of Vision Transformer with structural pruning and operator fusion on GPU — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=Efficient+inference+of+Vision+Transformer+with+structural+pruning+and+operator+fusion+on+GPU
    10. I-ViT: Integer-only quantization for efficient vision transformer inference — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=I-ViT:+Integer-only+quantization+for+efficient+vision+transformer+inference
    11. DeViT: Decomposing vision transformers for collaborative inference in edge devices — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=DeViT:+Decomposing+vision+transformers+for+collaborative+inference+in+edge+devices
    12. Raman: A reconfigurable and sparse TinyML accelerator for inference on edge — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=Raman:+A+reconfigurable+and+sparse+TinyML+accelerator+for+inference+on+edge
    13. Hardware accelerator design for sparse DNN inference and training: A tutorial — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=Hardware+accelerator+design+for+sparse+DNN+inference+and+training:+A+tutorial
    14. Inference serving with end-to-end latency SLOs over dynamic edge networks — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=Inference+serving+with+end-to-end+latency+SLOs+over+dynamic+edge+networks
    15. Training data attribution via approximate unrolling — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=Training+data+attribution+via+approximate+unrolling
    16. Exploring Training Data Attribution under Limited Access Constraints — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=Exploring+Training+Data+Attribution+under+Limited+Access+Constraints
    17. DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models — authors not identified in excerpt, recent https://scholar.google.com/scholar?q=DATE-LM:+Benchmarking+Data+Attribution+Evaluation+for+Large+Language+Models
    18. AI Post Transformers: Speculative Decoding in Real vLLM Serving — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-speculative-decoding-in-real-vllm-servin-6f4e2b.mp3
    19. AI Post Transformers: GPT-NeoX: Large-Scale Autoregressive Language Modeling in PyTorch — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/gpt-neox-large-scale-autoregressive-language-modeling-in-pytorch/
    20. AI Post Transformers: NVMe Offload on Colossal AI: Breaking the GPU Memory Wall — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/nvme-offload-on-colossal-ai-breaking-the-gpu-memory-wall/
    21. AI Post Transformers: Mamba-3 for Efficient Sequence Modeling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-16-mamba-3-for-efficient-sequence-modeling-97a22a.mp3

    Interactive Visualization: Caffe and the Rise of CNN Frameworks
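The separation of declarative model definition from implementation discussed in this episode can be sketched with a tiny layer registry: the network is data (a list of layer specs), and an interpreter wires registered layer implementations together. This is a loose analogy to Caffe's prototxt-plus-blob design, not its actual API; `Blob`, `run_net`, and the layer spec format here are hypothetical names invented for the sketch.

```python
class Blob:
    """Minimal stand-in for Caffe's blob: a named data holder whose
    storage location (CPU vs GPU) would be hidden behind this interface."""
    def __init__(self, data):
        self.data = data

LAYERS = {}  # registry mapping a layer type name to its forward function

def layer(name):
    def register(fn):
        LAYERS[name] = fn
        return fn
    return register

@layer("InnerProduct")
def inner_product(bottom, weights):
    return Blob([sum(x * w for x, w in zip(bottom.data, row)) for row in weights])

@layer("ReLU")
def relu(bottom):
    return Blob([max(0.0, x) for x in bottom.data])

def run_net(net_def, blob):
    """Interpret a declarative net definition (a stand-in for the protobuf
    prototxt) as a chain of registered layers; swapping the registry would
    swap the implementation without touching the model definition."""
    for spec in net_def:
        blob = LAYERS[spec["type"]](blob, **spec.get("params", {}))
    return blob

net_def = [
    {"type": "InnerProduct", "params": {"weights": [[1.0, -1.0], [0.5, 0.5]]}},
    {"type": "ReLU"},
]
out = run_net(net_def, Blob([2.0, 3.0]))
```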

  4. Caffeine: A Unified FPGA for CNNs

    This episode explores the 2016 Caffeine FPGA accelerator and its central claim that a single FPGA design can handle an entire CNN efficiently, rather than excelling at convolutions while bottlenecking on fully connected layers. It explains why that mattered in the AlexNet-to-VGG era, when convolutional layers were compute-bound but dense layers often became communication-bound because moving weights and activations through memory was the real constraint. The discussion focuses on Caffeine’s main technical idea: a unified matrix-multiplication-oriented representation that supports both convolution and fully connected layers without the heavy data expansion of standard `im2col` approaches, plus memory-access scheduling choices such as weight-major mapping to improve reuse and burst efficiency. Listeners would find it interesting because the episode makes a precise systems argument about how hardware performance depends not just on arithmetic throughput, but on matching dataflow, buffering, and bandwidth to the structure of the network.

    Sources:

    1. Caffeine: A Unified FPGA for CNNs https://ceca.pku.edu.cn/media/lw/83b308c75c56a94fbf706b92dbe57917.pdf
    2. Gradient-Based Learning Applied to Document Recognition — Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, 1998 https://scholar.google.com/scholar?q=Gradient-Based+Learning+Applied+to+Document+Recognition
    3. ImageNet Classification with Deep Convolutional Neural Networks — Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, 2012 https://scholar.google.com/scholar?q=ImageNet+Classification+with+Deep+Convolutional+Neural+Networks
    4. Very Deep Convolutional Networks for Large-Scale Image Recognition — Karen Simonyan, Andrew Zisserman, 2014 https://scholar.google.com/scholar?q=Very+Deep+Convolutional+Networks+for+Large-Scale+Image+Recognition
    5. Deep Residual Learning for Image Recognition — Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, 2016 https://scholar.google.com/scholar?q=Deep+Residual+Learning+for+Image+Recognition
    6. DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning — Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, Olivier Temam, 2014 https://scholar.google.com/scholar?q=DianNao:+A+Small-Footprint+High-Throughput+Accelerator+for+Ubiquitous+Machine-Learning
    7. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks — Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, Vivienne Sze, 2016 https://scholar.google.com/scholar?q=Eyeriss:+An+Energy-Efficient+Reconfigurable+Accelerator+for+Deep+Convolutional+Neural+Networks
    8. Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks — Chen Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, Jason Cong, 2016 https://scholar.google.com/scholar?q=Caffeine:+Towards+Uniformed+Representation+and+Acceleration+for+Deep+Convolutional+Neural+Networks
    9. In-Datacenter Performance Analysis of a Tensor Processing Unit — Norman P. Jouppi and many colleagues at Google, 2017 https://scholar.google.com/scholar?q=In-Datacenter+Performance+Analysis+of+a+Tensor+Processing+Unit
    10. Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks — Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, Jason Cong, 2015 https://scholar.google.com/scholar?q=Optimizing+FPGA-Based+Accelerator+Design+for+Deep+Convolutional+Neural+Networks
    11. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network — Jiantao Qiu, Jingsheng Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, Huazhong Yang, 2016 https://scholar.google.com/scholar?q=Going+Deeper+with+Embedded+FPGA+Platform+for+Convolutional+Neural+Network
    12. fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs — Stylianos I. Venieris, Christos-Savvas Bouganis, 2016 https://scholar.google.com/scholar?q=fpgaConvNet:+A+Framework+for+Mapping+Convolutional+Neural+Networks+on+FPGAs
    13. A high-performance FPGA-based depthwise separable convolution accelerator — authors not specified in excerpt, recent https://scholar.google.com/scholar?q=A+high-performance+FPGA-based+depthwise+separable+convolution+accelerator
    14. FPGA-based acceleration for convolutional neural networks: A comprehensive review — authors not specified in excerpt, recent https://scholar.google.com/scholar?q=Fpga-based+acceleration+for+convolutional+neural+networks:+A+comprehensive+review
    15. Mobile-X: Dedicated FPGA implementation of the MobileNet accelerator optimizing depthwise separable convolution — authors not specified in excerpt, recent https://scholar.google.com/scholar?q=Mobile-X:+Dedicated+FPGA+implementation+of+the+MobileNet+accelerator+optimizing+depthwise+separable+convolution
    16. Design of a convolutional neural network accelerator based on on-chip data reordering — authors not specified in excerpt, recent https://scholar.google.com/scholar?q=Design+of+a+convolutional+neural+network+accelerator+based+on+on-chip+data+reordering
    17. Energy-efficient and high-throughput CNN inference engine based on memory-sharing and data-reusing for edge applications — authors not specified in excerpt, recent https://scholar.google.com/scholar?q=Energy-efficient+and+high-throughput+CNN+inference+engine+based+on+memory-sharing+and+data-reusing+for+edge+applications
    18. An efficient sparse CNN inference accelerator with balanced intra- and inter-PE workload — authors not specified in excerpt, recent https://scholar.google.com/scholar?q=An+efficient+sparse+CNN+inference+accelerator+with+balanced+intra-and+inter-PE+workload
    19. Hardware accelerator design for sparse DNN inference and training: A tutorial — authors not specified in excerpt, recent https://scholar.google.com/scholar?q=Hardware+accelerator+design+for+sparse+DNN+inference+and+training:+A+tutorial
    20. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3
    21. AI Post Transformers: RFNoC SISO Processor via High-Level Synthesis — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-06-rfnoc-siso-processor-via-high-level-synt-c892f3.mp3
    22. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
    23. AI Post Transformers: Los Alamos: overcoming the memory wall fighting sparse memory access — Hal Turing & Dr. Ada Shannon, 2025 https://podcast.do-not-panic.com/episodes/los-alamos-overcoming-the-memory-wall-fighting-sparse-memory-access/

    Interactive Visualization: Caffeine: A Unified FPGA for CNNs
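Caffeine's argument against materializing the `im2col` matrix can be sketched in software: compute the convolution in matrix-multiply order while reading patch elements straight out of the original feature map, and separately measure the data expansion that a materialized `im2col` matrix would incur. This is a simplified illustration (single channel, stride 1, no padding), not the paper's hardware dataflow; the toy 4x4 input and all-ones filter are assumptions.

```python
def conv_as_gemm_no_im2col(img, filt):
    """Convolution evaluated in GEMM order, but with patch elements
    fetched from the original feature map on the fly instead of from a
    materialized (and duplicated) im2col matrix."""
    h, w, k = len(img), len(img[0]), len(filt)
    oh, ow = h - k + 1, w - k + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += filt[di][dj] * img[i + di][j + dj]  # direct indexing
            out[i][j] = acc
    return out

def im2col_expansion(h, w, k):
    """Ratio of the materialized im2col matrix's size to the input's size:
    the data-movement overhead the unified representation avoids."""
    oh, ow = h - k + 1, w - k + 1
    return (k * k * oh * ow) / (h * w)

img = [[float(4 * r + c) for c in range(4)] for r in range(4)]
out = conv_as_gemm_no_im2col(img, [[1.0] * 3] * 3)
factor = im2col_expansion(4, 4, 3)  # even a tiny input duplicates 2.25x
```

For a realistic feature map the duplication factor approaches k squared, which is why avoiding the expansion matters for bandwidth-bound layers.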

  5. Fast FPGA BDT Inference for LHC Triggers

    This episode explores how boosted decision trees can be compiled directly into FPGA firmware for ultra-low-latency particle-physics triggers at the Large Hadron Collider. It explains why this setting favors shallow, quantized tree ensembles over larger neural networks: trigger decisions must happen within a tiny hardware budget, with strict limits on latency, power, and on-chip resources. The discussion focuses on a concrete benchmark where a 100-tree, depth-4 gradient-boosted model for five-class jet tagging is mapped to a Xilinx VU9P FPGA and compared against a similarly deployed multilayer perceptron. Listeners would find it interesting because it shows how model choice changes when every nanosecond matters, and how familiar ML methods can become hardwired decision circuits rather than conventional software inference.

    Sources:

    1. Fast inference of Boosted Decision Trees in FPGAs for particle physics — Sioni Summers, Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Duc Hoang, Sergo Jindariani, Edward Kreinar, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Dylan Rankin, Nhan Tran, Zhenbin Wu, 2020 http://arxiv.org/abs/2002.02534
    2. Greedy Function Approximation: A Gradient Boosting Machine — Jerome H. Friedman, 2001 https://scholar.google.com/scholar?q=Greedy+Function+Approximation:+A+Gradient+Boosting+Machine
    3. XGBoost: A Scalable Tree Boosting System — Tianqi Chen, Carlos Guestrin, 2016 https://scholar.google.com/scholar?q=XGBoost:+A+Scalable+Tree+Boosting+System
    4. LightGBM: A Highly Efficient Gradient Boosting Decision Tree — Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu, 2017 https://scholar.google.com/scholar?q=LightGBM:+A+Highly+Efficient+Gradient+Boosting+Decision+Tree
    5. CatBoost: unbiased boosting with categorical features — Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin, 2018 https://scholar.google.com/scholar?q=CatBoost:+unbiased+boosting+with+categorical+features
    6. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference — Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko, 2018 https://scholar.google.com/scholar?q=Quantization+and+Training+of+Neural+Networks+for+Efficient+Integer-Arithmetic-Only+Inference
    7. Quantizing deep convolutional networks for efficient inference: A whitepaper — Raghuraman Krishnamoorthi, 2018 https://scholar.google.com/scholar?q=Quantizing+deep+convolutional+networks+for+efficient+inference:+A+whitepaper
    8. Post-training 4-bit quantization of convolution networks for rapid-deployment — Ron Banner, Yury Nahshan, Elad Hoffer, Daniel Soudry, 2019 https://scholar.google.com/scholar?q=Post-training+4-bit+quantization+of+convolution+networks+for+rapid-deployment
    9. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, 2022 https://scholar.google.com/scholar?q=GPTQ:+Accurate+Post-Training+Quantization+for+Generative+Pre-trained+Transformers
    10. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, Song Han, 2023 https://scholar.google.com/scholar?q=AWQ:+Activation-aware+Weight+Quantization+for+LLM+Compression+and+Acceleration
    11. Fast inference of deep neural networks in FPGAs for particle physics — Javier Duarte, Song Han, Philip Harris, Sergo Jindariani, Edward Kreinar, Benjamin Kreis, Jennifer Ngadiuba, Maurizio Pierini, Ryan Rivera, Nhan Tran, Zhenbin Wu, 2018 https://scholar.google.com/scholar?q=Fast+inference+of+deep+neural+networks+in+FPGAs+for+particle+physics
    12. Efficient, reliable and fast high-level triggering using a bonsai boosted decision tree — V. V. Gligorov, M. Williams, 2013 https://scholar.google.com/scholar?q=Efficient,+reliable+and+fast+high-level+triggering+using+a+bonsai+boosted+decision+tree
    13. Boosted Decision Trees in the Level-1 Muon Endcap Trigger at CMS — CMS Collaboration, 2018 https://scholar.google.com/scholar?q=Boosted+Decision+Trees+in+the+Level-1+Muon+Endcap+Trigger+at+CMS
    14. Scalable inference of decision tree ensembles: Flexible design for CPU-FPGA platforms — Muhsen Owaida, Hantian Zhang, Ce Zhang, Gustavo Alonso, 2017 https://scholar.google.com/scholar?q=Scalable+inference+of+decision+tree+ensembles:+Flexible+design+for+CPU-FPGA+platforms
    15. Machine learning at the energy and intensity frontiers of particle physics — A. Radovic et al., 2018 https://scholar.google.com/scholar?q=Machine+learning+at+the+energy+and+intensity+frontiers+of+particle+physics
    16. Low latency transformer inference on FPGAs for physics applications with hls4ml — Zhixing Jiang et al., 2025 https://scholar.google.com/scholar?q=Low+latency+transformer+inference+on+FPGAs+for+physics+applications+with+hls4ml
    17. Ultrafast jet classification at the HL-LHC — Patrick Odagiu et al., 2024 https://scholar.google.com/scholar?q=Ultrafast+jet+classification+at+the+HL-LHC
    18. AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3
    19. AI Post Transformers: Computation-Bandwidth-Memory Trade-offs for AI Infrastructure — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-09-computation-bandwidth-memory-trade-offs-a83f2b.mp3
    20. AI Post Transformers: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-03-25-turboquant-online-vector-quantiz-1967b7.mp3

    Interactive Visualization: Fast FPGA BDT Inference for LHC Triggers
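The deterministic-latency property that makes shallow tree ensembles attractive for trigger firmware can be sketched as follows: a complete fixed-depth tree stored as flat arrays performs exactly `depth` threshold comparisons per input, no matter what the input is. The flat heap-order layout, the two toy depth-2 trees, and the integer thresholds are assumptions for illustration; in firmware each tree (and each depth level) becomes a parallel bank of comparators rather than a sequential loop.

```python
def eval_tree(feat_idx, thresh, leaf, x, depth):
    """Evaluate one complete binary decision tree of fixed depth.
    feat_idx[n] and thresh[n] describe internal node n in heap order
    (root = 0); leaf[i] is the value of the i-th leaf. Exactly `depth`
    comparisons run for every input: a fixed, data-independent latency."""
    node = 0
    for _ in range(depth):
        go_right = x[feat_idx[node]] > thresh[node]
        node = 2 * node + 1 + int(go_right)
    return leaf[node - (2 ** depth - 1)]

def eval_ensemble(trees, x, depth):
    """Sum the trees' outputs. In hardware every tree evaluates in
    parallel, so total latency is one tree's depth plus an adder tree,
    not a function of the number of trees."""
    return sum(eval_tree(fi, th, lf, x, depth) for fi, th, lf in trees)

# Two toy depth-2 trees over two quantized (integer) input features.
trees = [
    ([0, 1, 1], [5, 2, 7], [0.1, 0.4, -0.2, 0.8]),
    ([1, 0, 0], [3, 1, 6], [0.0, 0.3, 0.2, -0.1]),
]
score = eval_ensemble(trees, x=[4, 6], depth=2)
```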

  6. Fast FPGA Inference for LHC Triggers

    This episode explores how a 2018 paper brings neural network inference into the Level-1 trigger at the Large Hadron Collider, where event decisions must be made under sub-microsecond latency constraints. It explains why FPGAs are a natural fit for this setting, emphasizing batch-one, deterministic inference and the hardware realities that make model size, timing, memory use, and routing just as important as accuracy. The discussion centers on a compact dense network for jet substructure classification, using 16 engineered features to distinguish quark, gluon, W, Z, and top jets while preserving rare physics signals. It also highlights the paper’s broader argument: tools like High-Level Synthesis and hls4ml can let physicists deploy hardware-aware ML workflows directly, making real-time AI a practical part of scientific instrumentation rather than just a benchmark exercise.

    Sources:

    1. Fast inference of deep neural networks in FPGAs for particle physics — Javier Duarte, Song Han, Philip Harris, Sergo Jindariani, Edward Kreinar, Benjamin Kreis, Jennifer Ngadiuba, Maurizio Pierini, Ryan Rivera, Nhan Tran, Zhenbin Wu, 2018 http://arxiv.org/abs/1804.06913
    2. A Survey on Performance Optimization of High-Level Synthesis Tools — Lan Huang, Da-Lin Li, Kang-Ping Wang, Teng Gao, Adriano Tavares, 2020 https://scholar.google.com/scholar?q=A+Survey+on+Performance+Optimization+of+High-Level+Synthesis+Tools
    3. FINN-R: An End-to-End Deep-Learning Framework for Fast Exploration of Quantized Neural Networks — Michaela Blott, Thomas B. Preusser, Nicholas J. Fraser, Giulio Gambardella, Kenneth O'Brien, Yaman Umuroglu, Miriam Leeser, Kees Vissers, 2018 https://scholar.google.com/scholar?q=FINN-R:+An+End-to-End+Deep-Learning+Framework+for+Fast+Exploration+of+Quantized+Neural+Networks
    4. Fast inference of deep neural networks in FPGAs for particle physics — Javier Duarte, Song Han, Philip Harris, Sergo Jindariani, Edward Kreinar, Jennifer Ngadiuba, Maurizio Pierini, Nhan Tran, Zhenbin Wu, et al., 2018 https://scholar.google.com/scholar?q=Fast+inference+of+deep+neural+networks+in+FPGAs+for+particle+physics
    5. Fast convolutional neural networks on FPGAs with hls4ml — Thea Aarrestad, Vladimir Loncar, Nicolo Ghielmetti, Maurizio Pierini, Sioni Summers, Jennifer Ngadiuba, Javier Duarte, Philip Harris, et al., 2021 https://scholar.google.com/scholar?q=Fast+convolutional+neural+networks+on+FPGAs+with+hls4ml
    6. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference — Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip H. W. Leong, Magnus Jahre, Kees Vissers, 2017 https://scholar.google.com/scholar?q=FINN:+A+Framework+for+Fast,+Scalable+Binarized+Neural+Network+Inference
    7. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave — Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Doug Burger, et al., 2018 https://scholar.google.com/scholar?q=Serving+DNNs+in+Real+Time+at+Datacenter+Scale+with+Project+Brainwave
    8. Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors — Claudionor N. Coelho Jr., Aki Kuusela, Shan Li, Hao Zhuang, Jennifer Ngadiuba, Thea K. Aarrestad, Vladimir Loncar, Maurizio Pierini, et al., 2021 https://scholar.google.com/scholar?q=Automatic+heterogeneous+quantization+of+deep+neural+networks+for+low-latency+inference+on+the+edge+for+particle+detectors
    9. Learning both Weights and Connections for Efficient Neural Network — Song Han, Jeff Pool, John Tran, William J. Dally, 2015 https://scholar.google.com/scholar?q=Learning+both+Weights+and+Connections+for+Efficient+Neural+Network
    10. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding — Song Han, Huizi Mao, William J. Dally, 2016 https://scholar.google.com/scholar?q=Deep+Compression:+Compressing+Deep+Neural+Networks+with+Pruning,+Trained+Quantization+and+Huffman+Coding
    11. Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning — Andrew J. Larkoski, Ian Moult, Benjamin Nachman, 2020 https://scholar.google.com/scholar?q=Jet+Substructure+at+the+Large+Hadron+Collider:+A+Review+of+Recent+Advances+in+Theory+and+Machine+Learning
    12. Deep-learning Top Taggers or The End of QCD? — Gregor Kasieczka, Tilman Plehn, Michael Russell, Torben Schell, 2017 https://scholar.google.com/scholar?q=Deep-learning+Top+Taggers+or+The+End+of+QCD?
    13. From High-Level Deep Neural Models to FPGAs — Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, Hadi Esmaeilzadeh, 2016 https://scholar.google.com/scholar?q=From+High-Level+Deep+Neural+Models+to+FPGAs
    14. Distance-Weighted Graph Neural Networks on FPGAs for Real-Time Particle Reconstruction in High Energy Physics — Yutaro Iiyama, Gianluca Cerminara, Abhijay Gupta, Jan Kieseler, Vladimir Loncar, Maurizio Pierini, Shah Rukh Qasim and collaborators, 2021 https://scholar.google.com/scholar?q=Distance-Weighted+Graph+Neural+Networks+on+FPGAs+for+Real-Time+Particle+Reconstruction+in+High+Energy+Physics
    15. Low latency transformer inference on FPGAs for physics applications with hls4ml — authors not confirmed in excerpt; likely the hls4ml/particle-physics collaboration, recent https://scholar.google.com/scholar?q=Low+latency+transformer+inference+on+FPGAs+for+physics+applications+with+hls4ml
    16.
Optimizing transformer models for low-latency inference: techniques, architectures, and code implementations — not confirmed from snippet, recent https://scholar.google.com/scholar?q=Optimizing+transformer+models+for+low-latency+inference:+techniques,+architectures,+and+code+implementations 17. Low-bit mixed-precision quantization and acceleration of CNN for FPGA deployment — not confirmed from snippet, recent https://scholar.google.com/scholar?q=Low-bit+mixed-precision+quantization+and+acceleration+of+CNN+for+FPGA+deployment 18. MPQA: Mixed-Precision Quantization Accelerator for CNN Inference — not confirmed from snippet, recent https://scholar.google.com/scholar?q=MPQA:+Mixed-Precision+Quantization+Accelerator+for+CNN+Inference 19. Fine-grained structured sparse computing for FPGA-based AI inference — not confirmed from snippet, recent https://scholar.google.com/scholar?q=Fine-grained+structured+sparse+computing+for+FPGA-based+AI+inference 20. Efficient CNN inference acceleration on FPGAs: a pattern pruning-driven approach — not confirmed from snippet, recent https://scholar.google.com/scholar?q=Efficient+CNN+inference+acceleration+on+FPGAs:+a+pattern+pruning-driven+approach 21. Online Learning Extreme Learning Machine with Low-Complexity Predictive Plasticity Rule and FPGA Implementation — not confirmed from snippet, recent https://scholar.google.com/scholar?q=Online+Learning+Extreme+Learning+Machine+with+Low-Complexity+Predictive+Plasticity+Rule+and+FPGA+Implementation 22. An FPGA architecture for online learning using the Tsetlin machine — not confirmed from snippet, recent https://scholar.google.com/scholar?q=An+FPGA+architecture+for+online+learning+using+the+Tsetlin+machine 23. AI Post Transformers: FPGA Neural Network Accelerators for Space — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-26-fpga-neural-network-accelerators-for-spa-3087ae.mp3
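The batch-one, fixed-latency inference the episode emphasizes rests on quantizing a small dense network to fixed-point arithmetic. As a rough illustration only (toy random weights and an ap_fixed-style rounding model of my own, not the paper's trained network or hls4ml's actual API), the idea can be sketched in Python:

```python
import numpy as np

def to_fixed(x, total_bits=16, frac_bits=10):
    """Quantize to signed fixed-point, loosely mimicking an HLS ap_fixed type.

    Values are rounded to multiples of 2**-frac_bits and saturated to the
    representable range, so every operand has a known, bounded precision."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1)) / scale
    hi = (2 ** (total_bits - 1) - 1) / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

def dense_relu(x, w, b):
    """One fully connected layer with ReLU, all operands fixed-point."""
    return np.maximum(to_fixed(to_fixed(x) @ to_fixed(w) + to_fixed(b)), 0.0)

# Toy stand-in for the jet tagger's shape: 16 inputs -> 5 class scores.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(0, 0.3, (16, 32)), np.zeros(32)
w2, b2 = rng.normal(0, 0.3, (32, 5)), np.zeros(5)

features = rng.normal(0, 1, 16)  # one event: batch size 1, no batching queue
scores = dense_relu(features, w1, b1) @ to_fixed(w2) + to_fixed(b2)
print(scores.shape)  # (5,)
```

Saturating rounding of every operand is what makes the arithmetic deterministic: with widths fixed at synthesis time, latency and resource usage can be pinned down before the design ever runs, which is exactly the property a Level-1 trigger needs.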

  7. 1D AGO

    Synchronous Data Flow for Signal Processing

    This episode explores the 1987 paper on synchronous data flow and how it turns stream-processing programs into analyzable graphs with fixed token production and consumption rates. It explains how those fixed rates let a compiler precompute a repeating execution schedule, prove steady-state consistency through balance equations, and allocate bounded buffers ahead of time instead of relying on expensive runtime scheduling. The discussion highlights why that tradeoff works so well for digital signal processing workloads like filtering, resampling, and codecs, while also showing why the model is too restrictive for messier software with irregular control flow. Listeners would find it interesting because it shows how a carefully limited programming model can unlock strong guarantees about performance, memory use, and parallel execution. Sources: 1. Synchronous Data Flow for Signal Processing https://ptolemy.berkeley.edu/publications/papers/87/synchdataflow/synchdataflow.pdf 2. Synchronous Data Flow — Edward A. Lee and David G. Messerschmitt, 1987 https://scholar.google.com/scholar?q=Synchronous+Data+Flow 3. Dataflow Process Networks — Edward A. Lee and Thomas M. Parks, 1995 https://scholar.google.com/scholar?q=Dataflow+Process+Networks 4. Cycle-Static Dataflow — Greet Bilsen, Marc Engels, Rudy Lauwereins, and Jean Peperstraete, 1996 https://scholar.google.com/scholar?q=Cycle-Static+Dataflow 5. Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing — Edward A. Lee and David G. Messerschmitt, 1987 https://scholar.google.com/scholar?q=Static+Scheduling+of+Synchronous+Data+Flow+Programs+for+Digital+Signal+Processing 6. Bounded Scheduling of Process Networks — Thomas M. Parks, 1995 https://scholar.google.com/scholar?q=Bounded+Scheduling+of+Process+Networks 7. Synthesis of Embedded Software from Synchronous Dataflow Specifications — Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. 
Lee, 1999 https://scholar.google.com/scholar?q=Synthesis+of+Embedded+Software+from+Synchronous+Dataflow+Specifications 8. StreamIt: A Language for Streaming Applications — William Thies, Michal Karczmarek, and Saman Amarasinghe, 2002 https://scholar.google.com/scholar?q=StreamIt:+A+Language+for+Streaming+Applications 9. Memory Management for Dataflow Programming of Multirate Signal Processing Algorithms — Shuvra S. Bhattacharyya and Edward A. Lee, 1994 https://scholar.google.com/scholar?q=Memory+Management+for+Dataflow+Programming+of+Multirate+Signal+Processing+Algorithms 10. Joint Minimization of Code and Data for Synchronous Dataflow Programs — Praveen K. Murthy, Shuvra S. Bhattacharyya, and Edward A. Lee, 1994 https://scholar.google.com/scholar?q=Joint+Minimization+of+Code+and+Data+for+Synchronous+Dataflow+Programs 11. Buffer Merging: A Powerful Technique for Reducing Memory Requirements of Synchronous Dataflow Specifications — Praveen K. Murthy and Shuvra S. Bhattacharyya, 2000 https://scholar.google.com/scholar?q=Buffer+Merging:+A+Powerful+Technique+for+Reducing+Memory+Requirements+of+Synchronous+Dataflow+Specifications 12. Pipeline Interleaved Programmable DSP's: Synchronous Data Flow Programming — Edward A. Lee and David G. Messerschmitt, 1987 https://scholar.google.com/scholar?q=Pipeline+Interleaved+Programmable+DSP's:+Synchronous+Data+Flow+Programming 13. Multirate Digital Filters, Filter Banks, Polyphase Networks, and Applications: A Tutorial — P. P. Vaidyanathan, 1990 https://scholar.google.com/scholar?q=Multirate+Digital+Filters,+Filter+Banks,+Polyphase+Networks,+and+Applications:+A+Tutorial 14. The Semantics of a Simple Language for Parallel Programming — Gilles Kahn, 1974 https://scholar.google.com/scholar?q=The+Semantics+of+a+Simple+Language+for+Parallel+Programming 15. First Version of a Data Flow Procedure Language — Jack B. Dennis, 1974 https://scholar.google.com/scholar?q=First+Version+of+a+Data+Flow+Procedure+Language 16. 
On the Boundedness of Process Networks — Gilles Kahn and David B. MacQueen, 1977 https://scholar.google.com/scholar?q=On+the+Boundedness+of+Process+Networks 17. Algorithm Design for Signal Processing — Charles S. Burrus, 1982 https://scholar.google.com/scholar?q=Algorithm+Design+for+Signal+Processing 18. DynVec: An End-to-End Framework for Efficient Vector-Dataflow Execution — approximate; recent systems/compiler authors, recent https://scholar.google.com/scholar?q=DynVec:+An+End-to-End+Framework+for+Efficient+Vector-Dataflow+Execution 19. Compiler discovered dynamic scheduling of irregular code in high-level synthesis — approximate; recent HLS/compiler authors, recent https://scholar.google.com/scholar?q=Compiler+discovered+dynamic+scheduling+of+irregular+code+in+high-level+synthesis 20. Dataflow Models of computation for programming heterogeneous multicores — approximate; recent embedded/parallel-systems authors, recent https://scholar.google.com/scholar?q=Dataflow+Models+of+computation+for+programming+heterogeneous+multicores 21. Heuristic & Expert-Guided Buffer Sizing for Neural Network Inference Applications on FPGAs — approximate; recent FPGA/dataflow authors, recent https://scholar.google.com/scholar?q=Heuristic+&+Expert-Guided+Buffer+Sizing+for+Neural+Network+Inference+Applications+on+FPGAs 22. Sgcn: Exploiting compressed-sparse features in deep graph convolutional network accelerators — approximate; recent accelerator authors, recent https://scholar.google.com/scholar?q=Sgcn:+Exploiting+compressed-sparse+features+in+deep+graph+convolutional+network+accelerators 23. Safe shared state in dataflow systems — approximate; recent programming-systems authors, recent https://scholar.google.com/scholar?q=Safe+shared+state+in+dataflow+systems 24. An Intermediate Representation for Stateful Dataflows — approximate; recent systems authors, recent https://scholar.google.com/scholar?q=An+Intermediate+Representation+for+Stateful+Dataflows 25. 
AI Post Transformers: FlatAttention for Tile-Based Accelerator Inference — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-04-flatattention-for-tile-based-accelerator-56e6ca.mp3 26. AI Post Transformers: Caffeine: A Unified FPGA for CNNs — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-06-caffeine-a-unified-fpga-for-cnns-e8acbe.mp3 27. AI Post Transformers: Caffe and the Rise of CNN Frameworks — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-05-06-caffe-and-the-rise-of-cnn-frameworks-cf15f3.mp3
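The balance equations the episode describes have a direct mechanical reading: for every edge, the producer's repetition count times its token production rate must equal the consumer's count times its consumption rate. A minimal sketch (a hypothetical `repetition_vector` helper, assuming a connected graph) of solving them for the smallest integer schedule:

```python
from fractions import Fraction
from math import lcm

def repetition_vector(edges, actors):
    """Solve SDF balance equations: r[src] * prod == r[dst] * cons per edge.

    edges: list of (src, dst, prod, cons) tuples; actors: actor names.
    Returns the smallest positive integer repetition vector, or raises
    ValueError when the fixed rates are inconsistent (no bounded schedule)."""
    r = {actors[0]: Fraction(1)}      # pin one actor, propagate the rest
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in edges:
            if src in r and dst not in r:
                r[dst] = r[src] * prod / cons
                changed = True
            elif dst in r and src not in r:
                r[src] = r[dst] * cons / prod
                changed = True
            elif src in r and dst in r and r[src] * prod != r[dst] * cons:
                raise ValueError("inconsistent rates: no bounded schedule")
    # Scale the rational solution up to the smallest all-integer vector.
    scale = lcm(*(f.denominator for f in r.values()))
    return {a: int(f * scale) for a, f in r.items()}

# Chain A -(produces 2, consumes 3)-> B -(produces 1, consumes 2)-> C:
print(repetition_vector([("A", "B", 2, 3), ("B", "C", 1, 2)], ["A", "B", "C"]))
# {'A': 3, 'B': 2, 'C': 1} -- one period fires A 3 times, B twice, C once
```

Firing A three times produces six tokens, exactly what two firings of B consume; that steady-state consistency is what lets a compiler fix buffer sizes and the firing schedule ahead of time instead of scheduling at runtime.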

  8. 1D AGO

    Why LightGBM Made Boosted Trees Fast

    This episode explores why LightGBM became a dominant tool for tabular machine learning by unpacking the algorithmic and systems ideas behind its speed. It explains how gradient boosting decision trees work, why split search becomes expensive on massive sparse datasets, and how LightGBM differs from neural-network-style training despite using gradient information. The discussion focuses on two core contributions: Gradient-based One-Side Sampling, which keeps high-gradient examples while subsampling easier ones without badly distorting split-gain estimates, and Exclusive Feature Bundling, which compresses sparse features by grouping columns that rarely activate together. Listeners would find it interesting for its clear account of how classical ideas like histograms, greedy tree growth, and graph coloring were combined into a highly practical system that reshaped real-world applications such as ranking, fraud detection, credit scoring, and forecasting. Sources: 1. Why LightGBM Made Boosted Trees Fast https://proceedings.neurips.cc/paper_files/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf 2. Greedy Function Approximation: A Gradient Boosting Machine — Jerome H. Friedman, 2001 https://scholar.google.com/scholar?q=Greedy+Function+Approximation:+A+Gradient+Boosting+Machine 3. XGBoost: A Scalable Tree Boosting System — Tianqi Chen and Carlos Guestrin, 2016 https://scholar.google.com/scholar?q=XGBoost:+A+Scalable+Tree+Boosting+System 4. LightGBM: A Highly Efficient Gradient Boosting Decision Tree — Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu, 2017 https://scholar.google.com/scholar?q=LightGBM:+A+Highly+Efficient+Gradient+Boosting+Decision+Tree 5. CatBoost: Unbiased Boosting with Categorical Features — Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, Andrey Gulin, 2018 https://scholar.google.com/scholar?q=CatBoost:+Unbiased+Boosting+with+Categorical+Features 6. 
Feature Hashing for Large Scale Multitask Learning — Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, Alex Smola, 2009 https://scholar.google.com/scholar?q=Feature+Hashing+for+Large+Scale+Multitask+Learning 7. An Upper Bound for the Chromatic Number of a Graph and Its Application to Timetabling Problems — D. J. A. Welsh and M. B. Powell, 1967 https://scholar.google.com/scholar?q=An+Upper+Bound+for+the+Chromatic+Number+of+a+Graph+and+Its+Application+to+Timetabling+Problems 8. New Methods to Color the Vertices of a Graph — Daniel Brélaz, 1979 https://scholar.google.com/scholar?q=New+Methods+to+Color+the+Vertices+of+a+Graph 9. Worst Case Behavior of Graph Coloring Algorithms — David S. Johnson, 1974 https://scholar.google.com/scholar?q=Worst+Case+Behavior+of+Graph+Coloring+Algorithms 10. A Communication-Efficient Parallel Algorithm for Decision Tree — Qi Meng, Guolin Ke, Taifeng Wang, Wei Chen, Qiwei Ye, Zhi-Ming Ma, Tie-Yan Liu, 2016 https://scholar.google.com/scholar?q=A+Communication-Efficient+Parallel+Algorithm+for+Decision+Tree 11. Stochastic Gradient Boosting — Jerome H. Friedman, 2002 https://scholar.google.com/scholar?q=Stochastic+Gradient+Boosting 12. Parallel Boosted Regression Trees for Web Search Ranking — Stephen Tyree, Kilian Q. Weinberger, Kunal Agrawal, and Jennifer Paykin, 2011 https://scholar.google.com/scholar?q=Parallel+Boosted+Regression+Trees+for+Web+Search+Ranking 13. Best-First Decision Tree Learning — Haijian Shi, 2007 https://scholar.google.com/scholar?q=Best-First+Decision+Tree+Learning 14. GPU-Acceleration for Large-Scale Tree Boosting — Huan Zhang, Si Si, and Cho-Jui Hsieh, 2017 https://scholar.google.com/scholar?q=GPU-Acceleration+for+Large-Scale+Tree+Boosting 15. 
Implementing machine learning methods with complex survey data: Lessons learned on the impacts of accounting sampling weights in gradient boosting — authors not identified in the provided snippet, recent (2020s) https://scholar.google.com/scholar?q=Implementing+machine+learning+methods+with+complex+survey+data:+Lessons+learned+on+the+impacts+of+accounting+sampling+weights+in+gradient+boosting 16. Explainable boosting algorithms: sparse-group and interaction-aware variable selection in complex data — authors not identified in the provided snippet, recent (2020s) https://scholar.google.com/scholar?q=Explainable+boosting+algorithms:+sparse-group+and+interaction-aware+variable+selection+in+complex+data 17. Multi-objective optimization of performance and interpretability of tabular supervised machine learning models — authors not identified in the provided snippet, recent (2020s) https://scholar.google.com/scholar?q=Multi-objective+optimization+of+performance+and+interpretability+of+tabular+supervised+machine+learning+models 18. AI Post Transformers: Breiman's Two Cultures of Statistical Modeling — Hal Turing & Dr. Ada Shannon, 2026 https://podcast.do-not-panic.com/episodes/2026-04-24-breimans-two-cultures-of-statistical-mod-71e49f.mp3
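Gradient-based One-Side Sampling, as summarized in the episode, keeps every large-gradient example, uniformly samples a fraction of the small-gradient rest, and up-weights the sampled ones by (1 - a) / b so split-gain estimates stay roughly unbiased. A toy sketch of that sampling step (an illustration of the idea, not LightGBM's internal code):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Gradient-based One-Side Sampling over a batch of per-example gradients.

    Keeps the top a-fraction of examples by |gradient|, uniformly samples a
    b-fraction of the remainder, and up-weights the sampled ones by (1-a)/b
    so the total weight still approximates the full dataset."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))   # largest |gradient| first
    top_k = int(a * n)
    keep = order[:top_k]                     # always keep under-trained examples
    rest = order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([keep, sampled])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b            # compensate for the subsampling
    return idx, weights

g = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(g)
print(len(idx), w.max())  # 300 examples kept; sampled ones carry weight 8.0
```

With a = 0.2 and b = 0.1 only 30% of the data feeds each split search, yet the weights sum back to the original example count (200 × 1 + 100 × 8 = 1000), which is why gain estimates are not badly distorted.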

Ratings & Reviews

3.7
out of 5
3 Ratings

About

AI-generated podcast where hosts Hal Turing and Dr. Ada Shannon discuss the latest research papers and reports in machine learning, AI systems, and optimization. Featuring honest critical analysis, proper citations, and nerdy humor.
