AI: post transformers

mcgrof

The transformer architecture revolutionized the world of neural networks. It was a springboard for what we know today as modern artificial intelligence. This podcast focuses on reviews of modern, state-of-the-art research papers, from the transformer onward.

  1. 2 HR AGO

    Meta's solution to massive DLRM inference through software defined memory

    In October 2021, Meta (then Facebook), in collaboration with George Mason University and the University of Illinois Chicago, published the paper "Supporting Massive DLRM Inference through Software Defined Memory". Meta addressed the infrastructure challenge of serving massive Deep Learning Recommendation Models by extending the memory hierarchy to include NVMe Storage Class Memory. Because standard storage devices read data in blocks far larger than an embedding row, the company faced significant read amplification and wasted bandwidth. To resolve this, the engineering team implemented a solution using the NVMe SGL Bit Bucket feature within a software defined memory stack. This modification to the Linux kernel and drivers allows applications to issue direct I/O requests for specific data chunks, down to four bytes, rather than transferring full logical blocks. Bit buckets let the system transfer only the requested portion of a data block, which significantly optimizes link bandwidth and reduces memory utilization (a conceptual sketch follows these notes). This granular approach saves approximately 75 percent of bus bandwidth and lowers individual read latency by 3 to 5 percent by removing unnecessary data transfers and memory copies. In production environments, this architecture allows data centers to replace expensive DRAM with efficient flash storage for specific model components. These optimizations yield up to 20 percent power savings on simpler hardware and a projected 29 percent increase in performance per watt for multi-tenant serving scenarios.
    Sources:
    https://arxiv.org/pdf/2110.11489
    https://lore.kernel.org/linux-nvme/20220630204212.1265638-1-kbusch@fb.com/

    17 min
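
    A conceptual sketch of the bit-bucket idea, in C. It builds a scatter-gather list following the NVMe spec's 16-byte SGL descriptor layout (Bit Bucket is descriptor type 1h); the struct and helper names are illustrative placeholders, not the Linux kernel's actual internal types or the patch's code.

```c
/* Sketch: describe a read that keeps one 64-byte embedding row out of
 * a 4096-byte logical block. The device reads the whole block
 * internally, but only the wanted bytes cross the PCIe link; the rest
 * land in "bit buckets" and are discarded. Layout per the NVMe spec;
 * names are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define NVME_SGL_TYPE_DATA_BLOCK 0x0
#define NVME_SGL_TYPE_BIT_BUCKET 0x1

struct nvme_sgl_desc {
    uint64_t addr;    /* host buffer address (ignored for bit buckets) */
    uint32_t length;  /* bytes covered by this descriptor */
    uint8_t  rsvd[3];
    uint8_t  type;    /* descriptor type in the high nibble */
};

static void build_sgl(struct nvme_sgl_desc sgl[3], void *buf,
                      uint32_t row_off, uint32_t row_len, uint32_t blk)
{
    /* discard bytes before the row */
    sgl[0] = (struct nvme_sgl_desc){ .addr = 0, .length = row_off,
                .type = NVME_SGL_TYPE_BIT_BUCKET << 4 };
    /* keep the embedding row itself */
    sgl[1] = (struct nvme_sgl_desc){ .addr = (uint64_t)(uintptr_t)buf,
                .length = row_len,
                .type = NVME_SGL_TYPE_DATA_BLOCK << 4 };
    /* discard bytes after the row */
    sgl[2] = (struct nvme_sgl_desc){ .addr = 0,
                .length = blk - row_off - row_len,
                .type = NVME_SGL_TYPE_BIT_BUCKET << 4 };
}

int main(void)
{
    struct nvme_sgl_desc sgl[3];
    uint8_t row[64];

    build_sgl(sgl, row, 1024, sizeof(row), 4096);
    printf("transferred %u of 4096 bytes (%.1f%% bus savings)\n",
           sgl[1].length, 100.0 * (4096 - sgl[1].length) / 4096);
    return 0;
}
```
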
  2. 3 HR AGO

    Storage-next: Do We Need New Hardware for AI Storage, or Just Better Layouts?

    We review the "Storage-Next" paper, published in November 2025, which argues that a fundamental hardware architectural shift is required to elevate NAND flash from a passive storage tier to an active memory tier capable of "seconds-scale" caching. The authors contend that standard SSDs impose a "channel-side ceiling" on IOPS because they are optimized for 4KB blocks, creating massive bandwidth waste when AI applications demand fine-grained access to small items, such as 128-byte embedding vectors (a back-of-the-envelope sketch of that amplification follows these notes). To solve this, they propose specialized "Storage-Next" drives capable of scalable IOPS at small block sizes (e.g., 50M IOPS at 512B), arguing this hardware is necessary to simplify software stacks and enable high-throughput random access without the read amplification penalties inherent in current technology. However, the episode explores how concurrent research largely rebuts the strict need for this new hardware by demonstrating that intelligent software and driver modifications can mask these inefficiencies on standard drives. Systems like PageANN and FusionANNS show that aggregating topologically related vectors into 4KB pages allows existing SSDs to handle billion-scale search efficiently, while Strata uses GPU-assisted I/O to bundle fragmented LLM token pages. Furthermore, for workloads specifically requiring fine-grained access, like DLRM, Meta researchers implemented a "software-defined memory" solution using the NVMe SGL Bit Bucket feature to strip unwanted data at the driver level, reducing PCIe bandwidth consumption by 75% on standard hardware. These innovations suggest that, aside from the specific niche of random hash-based lookups where locality is mathematically impossible, software optimization remains a viable alternative to a physical overhaul of storage media.
    We've previously covered some of these papers individually:
    Meta's massive DLRM Linux NVMe SGL bit bucket solution: https://open.spotify.com/episode/7fPOvegGpWWYqChIVYGfwx?si=uxNPv4hZQvumhwwPGowwTA&context=spotify%3Ashow%3A48ygM4upvm6noxCbmhlz8i
    PageANN: https://open.spotify.com/episode/5rrXWA4KJxGHp4xckirlZ2?si=_Qhzy_g1SZyPrBFmHvlY5g
    FusionANNS: https://open.spotify.com/episode/6Ys51jB54GilRlYsvz4yXR?si=yI8KwDE1QpS6BbnFsinl6g
    Strata: https://open.spotify.com/episode/18kCgDcrOsQ5nw58V2HGBB?si=4Rr4ZfqIR-SzaVxyS8hOWA
    Sources:
    November 2025, From Minutes to Seconds: Redefining the Five-Minute Rule for AI-Era Memory Hierarchies, ScaleFlux, NVIDIA, and Stanford University, https://arxiv.org/pdf/2511.03944
    September 2025, Scalable Disk-Based Approximate Nearest Neighbor Search with Page-Aligned Graph, University of Texas at Dallas and Rutgers University, https://arxiv.org/pdf/2509.25487
    August 2025, Strata: Hierarchical Context Caching for Long Context Language Model Serving, Stanford University and NVIDIA, https://arxiv.org/pdf/2508.18572
    September 2024, FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search, Huazhong University of Science and Technology and Huawei Technologies, https://arxiv.org/pdf/2409.16576
    October 2021, Supporting Massive DLRM Inference Through Software Defined Memory, Facebook, https://arxiv.org/pdf/2110.11489

    15 min
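
    To make the "channel-side ceiling" concrete, here is a back-of-the-envelope C sketch of the read amplification arithmetic, assuming the 128-byte embedding vectors and 4KB blocks cited above; the co-location factor k, standing in for PageANN-style page packing, is illustrative.

```c
/* Read amplification for small random reads on block storage: the
 * device always moves a whole block, even when only `item` bytes are
 * wanted. Co-locating k related items per page (PageANN-style) lets
 * one read serve k items. */
#include <stdio.h>

int main(void)
{
    const double block = 4096.0;  /* bytes fetched per device read  */
    const double item  = 128.0;   /* bytes actually wanted per item */

    /* k = 1 is the naive layout (32x waste); k = 32 packs the 4 KiB
     * page perfectly (1x, no waste). */
    for (int k = 1; k <= 32; k *= 2)
        printf("k=%2d: %5.1fx read amplification, %4.1f%% of bus wasted\n",
               k, block / (k * item),
               100.0 * (1.0 - (k * item) / block));
    return 0;
}
```
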
  3. 12 HR AGO

    LeCun's AMI Energy-Based Models and the Path to Autonomous Intelligence

    These sources collectively explore the current landscape and future trajectory of artificial intelligence, specifically focusing on the transition toward human-level reasoning. Renowned scientist Yann LeCun argues that current Large Language Models lack a fundamental understanding of the physical world and proposes a shift toward **objective-driven AI** that utilizes **world models** for better planning and common sense (a rough formalization of this framing follows these notes). This shift is supported by recent industry developments, such as the launch of **AMI Labs**, a high-valuation startup dedicated to these advanced architectures. Additionally, the materials emphasize the necessity of **open-source platforms** to ensure that the future of digital assistance remains transparent and culturally diverse. While addressing technical limitations, the documents maintain an optimistic view of **super-human intelligence** as a tool that will eventually amplify human potential under safe **guardrail objectives**.
    Sources:
    https://arxiv.org/pdf/2306.02572
    https://cmsa.fas.harvard.edu/media/lecun-20240328-harvard_reduced.pdf
    https://www.lesswrong.com/posts/C5guLAx7ieQoowv3d/lecun-s-a-path-towards-autonomous-machine-intelligence-has-1
    https://www.linkedin.com/mwlite/feed/posts/warrenbpowell_my-response-to-dimitri-bertsekass-thoughtful-activity-7394449098789261312-nXH3
    https://techcrunch.com/2025/12/19/yann-lecun-confirms-his-new-world-model-startup-reportedly-seeks-5b-valuation/

    14 min
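
    As a rough formalization of the objective-driven framing discussed above, in the spirit of LeCun's energy-based presentations; the symbols $E_\theta$, $f_\theta$, $C$, and $G$ are our own shorthand, not any single paper's notation.

```latex
% Energy-based inference: rather than emitting tokens, the model
% scores compatibility between an observation x and a candidate y;
% prediction is energy minimization.
\[
  \hat{y} = \arg\min_{y} E_\theta(x, y)
\]

% Objective-driven planning with a learned world model f_theta:
% roll candidate actions a_1..a_T forward and keep the sequence
% that minimizes task cost C plus guardrail cost G.
\[
  a_{1..T}^{*} = \arg\min_{a_{1..T}} \sum_{t=1}^{T}
    \left[ C(s_t) + G(s_t, a_t) \right],
  \qquad s_{t+1} = f_\theta(s_t, a_t)
\]
```
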
  4. 1 DAY AGO

H-Net: End-to-End Hierarchical Sequence Modeling via Dynamic Chunking

    In this July 15, 2025 collaboration between Carnegie Mellon University and Cartesia AI, researchers introduce H-Net in the paper "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling". H-Net is a hierarchical, **tokenizer-free** large language model that processes raw data like **bytes** or **DNA sequences** directly. Unlike traditional models that rely on predefined subword chunks, H-Net employs a **dynamic chunking (DC)** mechanism to learn semantically meaningful boundaries end-to-end through a differentiable **smoothing module** (a toy sketch of the boundary scoring follows these notes). The architecture uses efficient **encoder-decoder** stages, often powered by **Mamba-2**, to compress sequences for a high-capacity main network. This design addresses the inherent flaws of fixed tokenization, such as **multilingual unfairness** and fragility to **textual perturbations**. Experimental results demonstrate that H-Net achieves competitive performance and superior **robustness** compared to standard subword-based Transformers. By enabling **recursive hierarchy**, the model scales effectively across diverse modalities, including **text, code, and genomic data**. H-Net excels at long-context processing through its **hierarchical architecture**, which progressively compresses raw inputs into significantly shorter sequences ($L_S \ll L_0$), allowing the heavy computational work to be performed on compact, high-level abstractions rather than long streams of raw bytes. This efficiency is driven by **dynamic chunking** and the integration of **State Space Models (Mamba-2)** in the encoder and decoder layers, which are specifically selected for their ability to handle long, uncompressed sequences with linear computational scaling. By recursively compressing sequence length, H-Net creates a global structure that mitigates the information-retrieval limitations common in long sequences, allowing the model to maintain a logarithmic state size while reasoning over extended contexts.
    Sources:
    July 15, 2025, Dynamic Chunking for End-to-End Hierarchical Sequence Modeling, https://arxiv.org/pdf/2507.07955
    Project tracking general advancements in this space: https://github.com/zjysteven/Awesome-Byte-LLM

    17 min
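
    A toy C sketch of the dynamic-chunking boundary score, following the paper's idea of flagging a chunk boundary wherever adjacent representations disagree. The cosine-based probability and the 0.5 threshold mirror the routing module's spirit, but the hand-built hidden states, tiny dimensions, and the omission of learned projections and the smoothing module are all simplifications.

```c
/* H-Net-style boundary scoring: p_t = (1 - cos(h_t, h_{t-1})) / 2,
 * with the first position always a boundary. Real H-Net scores
 * learned q/k projections and trains the whole thing end-to-end via
 * a differentiable smoothing module, both omitted here. */
#include <math.h>
#include <stdio.h>

#define T 8   /* sequence length */
#define D 4   /* hidden size    */

static double cosine(const double *a, const double *b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < D; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb) + 1e-12);
}

int main(void)
{
    /* Toy hidden states: two plateaus with a shift at t = 4,
     * standing in for a semantic boundary in the byte stream. */
    double h[T][D];
    for (int t = 0; t < T; t++)
        for (int i = 0; i < D; i++)
            h[t][i] = (t < 4) ? 1.0 + 0.01 * i : -1.0 + 0.01 * i;

    /* Similar neighbors give p near 0 (stay in chunk); dissimilar
     * neighbors give p near 1 (start a new chunk), so 8 positions
     * compress to 2 chunks here. */
    for (int t = 0; t < T; t++) {
        double p = (t == 0) ? 1.0 : (1.0 - cosine(h[t], h[t - 1])) / 2.0;
        printf("t=%d p=%.2f %s\n", t, p, p >= 0.5 ? "<- boundary" : "");
    }
    return 0;
}
```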
