AI: post transformers

mcgrof

The transformer architecture revolutionized the world of neural networks. It was a springboard for what we know today as modern artificial intelligence. This podcast focuses on reviews of modern, state-of-the-art research papers, starting from the transformer onward.

  1. Google: R&D inference value on HBF + PNM + low latency interconnect

    3 DAYS AGO

    Google: R&D inference value on HBF + PNM + low latency interconnect

    To address the hardware bottlenecks of LLM inference, Google researchers Ma and Patterson propose, in their paper "Challenges and Research Directions for Large Language Model Inference Hardware" (published January 8, 2026), a few focus areas of research: High Bandwidth Flash (HBF), Processing-Near-Memory (PNM), and low-latency interconnects. **HBF** addresses the "Memory Wall" by stacking flash dies to achieve **10X the capacity** of HBM, making it ideal for storing model weights and long contexts despite its write endurance limitations. **PNM** is advocated over Processing-In-Memory (PIM) for datacenters because placing logic on separate but nearby dies (e.g., 3D stacking) allows for larger software shards (avoiding fine-grained partitioning), uses standard high-performance logic processes, and offers better thermal management than integrating logic directly into memory dies. Finally, arguing that **latency trumps bandwidth** for the frequent small messages in inference, the authors suggest optimizing interconnects through high-connectivity topologies (like dragonfly or trees) and **processing-in-network** to accelerate communication collectives.

    Modern large language model (LLM) inference faces a critical memory wall, where hardware compute power outpaces the growth of data transfer speeds (a back-of-envelope sketch of this gap follows the episode list). Research suggests addressing these bottlenecks through **3D memory-logic stacking**, near-memory processing, and specialized **interconnect strategies** to reduce latency. Optimization techniques for **Mixture-of-Experts (MoE)** architectures involve balancing **tensor and expert parallelism** across devices to ensure efficient data handling. While high-bandwidth memory remains expensive, alternative storage solutions like **flash memory** are being explored to expand capacity in data centers. Historical data further illustrates the evolving **cost and density** of memory, underscoring the long-term economic shifts in hardware development. Together, these sources outline a roadmap for evolving **AI hardware** to meet the rigorous demands of real-time model decoding.

    Source: January 8, 2026, Challenges and Research Directions for Large Language Model Inference Hardware, Google, https://arxiv.org/pdf/2601.05047

    18 min
  2. Meta's solution to massive DLRM inference through software defined memory

    5 DAYS AGO

    Meta's solution to massive DLRM inference through software defined memory

    In October 2021, Meta (then Facebook), in collaboration with George Mason University and the University of Illinois Chicago, published the paper "Supporting Massive DLRM Inference Through Software Defined Memory". Meta addressed the infrastructure challenge of serving massive Deep Learning Recommendation Models by extending the memory hierarchy to include NVMe Storage Class Memory. Because standard storage devices read large data blocks that far exceed the small size of embedding rows, the company faced significant read amplification and wasted bandwidth. To resolve this, the engineering team implemented a solution using the NVMe SGL Bit Bucket feature within a software-defined memory stack. This modification to the Linux kernel and drivers allows applications to issue direct I/O requests for specific data chunks, down to four bytes, rather than transferring full logical blocks. Bit buckets let the system transfer only the requested portion of a data block, which significantly optimizes link bandwidth and reduces memory utilization (the arithmetic is sketched after the episode list). This granular approach saves approximately 75 percent of bus bandwidth and lowers individual read latency by 3 to 5 percent by eliminating unnecessary data transfers and memory copies. In production environments, this architecture allows data centers to replace expensive DRAM with efficient flash storage for specific model components. These optimizations result in up to 20 percent power savings on simpler hardware and a projected 29 percent increase in performance per watt for multi-tenant serving scenarios.

    Sources:
    https://arxiv.org/pdf/2110.11489
    https://lore.kernel.org/linux-nvme/20220630204212.1265638-1-kbusch@fb.com/

    17 min
  3. Storage-next: Do We Need New Hardware for AI Storage, or Just Better Layouts?

    5 DAYS AGO

    Storage-next: Do We Need New Hardware for AI Storage, or Just Better Layouts?

    We review the "Storage-Next" paper, published in November 2025, which argues that a fundamental hardware architectural shift is required to elevate NAND flash from a passive storage tier to an active memory tier capable of "seconds-scale" caching. The authors contend that standard SSDs impose a "channel-side ceiling" on IOPS because they are optimized for 4KB blocks, creating massive bandwidth waste when AI applications demand fine-grained access to small items, such as 128-byte embedding vectors. To solve this, they propose specialized "Storage-Next" drives capable of scalable IOPS at small block sizes (e.g., 50M IOPS at 512B), arguing this hardware is necessary to simplify software stacks and enable high-throughput random access without the read amplification penalties inherent in current technology. However, the episode explores how concurrent research largely rebuts the strict need for this new hardware by demonstrating that intelligent software and driver modifications can mask these inefficiencies on standard drives. Systems like PageANN and FusionANNS show that aggregating topologically related vectors into 4KB pages allows existing SSDs to handle billion-scale search efficiently (a rough IOPS comparison of the two layouts follows the episode list), while Strata utilizes GPU-assisted I/O to bundle fragmented LLM token pages. Furthermore, for workloads specifically requiring fine-grained access, like DLRM, Meta researchers successfully implemented a "software-defined memory" solution using the NVMe SGL Bit Bucket feature to strip unwanted data at the driver level, reducing PCIe bandwidth consumption by 75% on standard hardware. These innovations suggest that, aside from the specific niche of random hash-based lookups where locality is mathematically impossible, software optimization remains a viable alternative to a physical overhaul of storage media.

    We've previously covered some of these papers individually:
    Meta's massive DLRM Linux NVMe SGL bit bucket solution: https://open.spotify.com/episode/7fPOvegGpWWYqChIVYGfwx?si=uxNPv4hZQvumhwwPGowwTA&context=spotify%3Ashow%3A48ygM4upvm6noxCbmhlz8i
    PageANN: https://open.spotify.com/episode/5rrXWA4KJxGHp4xckirlZ2?si=_Qhzy_g1SZyPrBFmHvlY5g
    FusionANNS: https://open.spotify.com/episode/6Ys51jB54GilRlYsvz4yXR?si=yI8KwDE1QpS6BbnFsinl6g
    Strata: https://open.spotify.com/episode/18kCgDcrOsQ5nw58V2HGBB?si=4Rr4ZfqIR-SzaVxyS8hOWA

    Sources:
    November 2025, From Minutes to Seconds: Redefining the Five-Minute Rule for AI-Era Memory Hierarchies, ScaleFlux and NVIDIA and Stanford University, https://arxiv.org/pdf/2511.03944
    September 2025, Scalable Disk-Based Approximate Nearest Neighbor Search with Page-Aligned Graph, University of Texas at Dallas and Rutgers University, https://arxiv.org/pdf/2509.25487
    August 2025, Strata: Hierarchical Context Caching for Long Context Language Model Serving, Stanford University and NVIDIA, https://arxiv.org/pdf/2508.18572
    September 2024, FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search, Huazhong University of Science and Technology and Huawei Technologies, https://arxiv.org/pdf/2409.16576
    October 2021, Supporting Massive DLRM Inference Through Software Defined Memory, Facebook, https://arxiv.org/pdf/2110.11489

    15 min
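
A back-of-envelope sketch of the memory wall discussed in the first episode: the Python below compares how fast a single decode stream could run if it were limited only by memory bandwidth versus only by compute. The model size, weight precision, bandwidth, and FLOPS figures are illustrative assumptions, not numbers from the paper.

```python
# Back-of-envelope sketch of the LLM inference "memory wall".
# All hardware and model numbers here are illustrative assumptions.

def decode_limits(params_b, bytes_per_param, hbm_bw_gbs, peak_tflops):
    """Estimate per-token decode limits for a weight-bound workload.

    During autoregressive decode at batch size 1, each generated token
    must stream roughly all model weights from memory once, while doing
    about 2 FLOPs per parameter.
    """
    weight_bytes = params_b * 1e9 * bytes_per_param        # bytes read per token
    flops_per_token = 2 * params_b * 1e9                   # ~2 FLOPs per parameter

    bandwidth_bound = hbm_bw_gbs * 1e9 / weight_bytes      # tokens/s if memory-bound
    compute_bound = peak_tflops * 1e12 / flops_per_token   # tokens/s if compute-bound
    return bandwidth_bound, compute_bound

# Hypothetical 70B-parameter model in 8-bit weights on an accelerator with
# 3 TB/s of HBM bandwidth and 1000 TFLOPS of low-precision compute.
bw_tok, fl_tok = decode_limits(params_b=70, bytes_per_param=1,
                               hbm_bw_gbs=3000, peak_tflops=1000)
print(f"bandwidth-bound: {bw_tok:8.1f} tok/s")   # ~42.9 tok/s
print(f"compute-bound:   {fl_tok:8.1f} tok/s")   # ~7142.9 tok/s
# Compute outruns memory delivery by two orders of magnitude, which is why
# the paper focuses on the memory side: HBF for capacity, PNM for locality.
```

Long contexts make the gap worse, since KV-cache reads add to the per-token byte count; that is part of the capacity argument for High Bandwidth Flash.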
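
For the second episode, the bus-bandwidth arithmetic behind the NVMe SGL bit-bucket approach can be sketched as follows. The embedding-row size is an illustrative assumption, chosen so the savings line up with the roughly 75 percent figure quoted in the episode (about 1 KiB of useful data per 4 KiB logical block); it is not a number taken from the paper.

```python
# Sketch of read amplification for small embedding-row reads on NVMe, and
# the bus-bandwidth savings when the drive can discard the unwanted bytes
# of a block (NVMe SGL bit bucket). The row size below is an illustrative
# assumption; a ~75% saving corresponds to ~1 KiB useful data per 4 KiB block.

LOGICAL_BLOCK = 4096  # bytes per read on a conventional 4 KiB logical-block drive

def bus_traffic(row_bytes, lookups, bit_bucket):
    """Bytes that actually cross the PCIe link for `lookups` row reads."""
    if bit_bucket:
        # Only the requested chunk is DMA'd to host memory; the remainder
        # of the logical block is dropped into the bit bucket.
        return lookups * row_bytes
    # Whole logical blocks are transferred, then trimmed in software.
    return lookups * LOGICAL_BLOCK

row = 1024          # hypothetical embedding-row size in bytes
n = 1_000_000       # number of row lookups

full = bus_traffic(row, n, bit_bucket=False)
trimmed = bus_traffic(row, n, bit_bucket=True)
print(f"without bit bucket: {full / 2**30:.2f} GiB over the bus")
print(f"with bit bucket:    {trimmed / 2**30:.2f} GiB over the bus")
print(f"bandwidth saved:    {100 * (full - trimmed) / full:.0f}%")   # 75%
```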
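
For the third episode, the sketch below contrasts one I/O per vector (the fine-grained access pattern that motivates "Storage-Next" drives) with packing related vectors into ordinary 4 KiB pages (a PageANN/FusionANNS-style layout). Vector size, query fan-out, and query rate are illustrative assumptions, and the page-packed case assumes ideal clustering.

```python
# Sketch of why data layout matters for AI storage: one I/O per 128-byte
# vector (the fine-grained pattern that motivates "Storage-Next" drives)
# versus packing related vectors into ordinary 4 KiB pages (a PageANN /
# FusionANNS style layout). All workload numbers are illustrative.

VECTOR = 128                       # bytes per embedding vector
PAGE = 4096                        # standard SSD read granularity
VECTORS_PER_PAGE = PAGE // VECTOR  # 32 vectors per page

def iops_needed(vectors_per_query, qps, vectors_per_io):
    """I/O operations per second the drive must sustain."""
    ios_per_query = -(-vectors_per_query // vectors_per_io)  # ceiling division
    return ios_per_query * qps

# Hypothetical search workload: each query touches 2000 vectors at 10k QPS.
fine_grained = iops_needed(2000, 10_000, vectors_per_io=1)
page_packed = iops_needed(2000, 10_000, vectors_per_io=VECTORS_PER_PAGE)

print(f"one I/O per vector: {fine_grained / 1e6:.1f}M IOPS")  # 20.0M IOPS
print(f"packed 4 KiB pages: {page_packed / 1e6:.2f}M IOPS")   # 0.63M IOPS, ideal clustering
# If graph neighbors can be co-located on pages, ordinary drives can keep up;
# for random hash-based lookups, locality is impossible and the per-item
# IOPS demand is what motivates "Storage-Next" class hardware.
```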
