AI: post transformers

mcgrof

The transformer architecture revolutionized the world of neural networks. It was a springboard for what we know today as modern artificial intelligence. This podcast focuses on reviews of modern, state-of-the-art research papers, starting from the transformer onward.

  1. Google: R&D inference value on HBF + PNM + low latency interconnect

    3 DAYS AGO

    Google: R&D inference value on HBF + PNM + low latency interconnect

    To address the hardware bottlenecks of LLM inference, Google researchers Ma and Patterson propose, in their paper "Challenges and Research Directions for Large Language Model Inference Hardware" (published January 8, 2026), a few focus areas of research: High Bandwidth Flash (HBF), Processing-Near-Memory (PNM), and low-latency interconnects. **HBF** addresses the "Memory Wall" by stacking flash dies to achieve **10X the capacity** of HBM, making it ideal for storing model weights and long contexts despite its write endurance limitations. **PNM** is advocated over Processing-In-Memory (PIM) for datacenters because placing logic on separate but nearby dies (e.g., 3D stacking) allows for larger software shards (avoiding fine-grained partitioning), uses standard high-performance logic processes, and offers better thermal management than integrating logic directly into memory dies. Finally, arguing that **latency trumps bandwidth** for the frequent small messages in inference, the authors suggest optimizing interconnects through high-connectivity topologies (like dragonfly or trees) and **processing-in-network** to accelerate communication collectives.

    Modern large language model (LLM) inference faces a critical memory wall, where hardware compute power outpaces the growth of data transfer speeds (a back-of-envelope sketch of this gap follows the episode list). Research suggests addressing these bottlenecks through **3D memory-logic stacking**, near-memory processing, and specialized **interconnect strategies** to reduce latency. Optimization techniques for **Mixture-of-Experts (MoE)** architectures involve balancing **tensor and expert parallelism** across devices to ensure efficient data handling. While high-bandwidth memory remains expensive, alternative storage solutions like **flash memory** are being explored to expand capacity in data centers. Historical data further illustrates the evolving **cost and density** of memory, underscoring the long-term economic shifts in hardware development. Together, these sources outline a roadmap for evolving **AI hardware** to meet the rigorous demands of real-time model decoding.

    Source: January 8, 2026, Challenges and Research Directions for Large Language Model Inference Hardware, Google, https://arxiv.org/pdf/2601.05047

    18 min
  2. Meta's solution to massive DLRM inference through software defined memory

    5 DAYS AGO

    Meta's solution to massive DLRM inference through software defined memory

    In October 2021, Meta (then Facebook), in collaboration with George Mason University and the University of Illinois Chicago, published the paper "Supporting Massive DLRM Inference Through Software Defined Memory". Meta addressed the infrastructure challenge of serving massive Deep Learning Recommendation Models by extending the memory hierarchy to include NVMe Storage Class Memory. Because standard storage devices read large data blocks that far exceed the small size of embedding rows, the company faced significant read amplification and wasted bandwidth. To resolve this, the engineering team implemented a solution using the NVMe SGL Bit Bucket feature within a software-defined memory stack. This modification to the Linux kernel and drivers allows applications to issue direct I/O requests for specific data chunks, down to four bytes, rather than transferring full logical blocks. Bit buckets let the system transfer only the requested portion of a data block, which significantly optimizes link bandwidth and reduces memory utilization (the arithmetic is sketched after the episode list). This granular approach saves approximately 75 percent of bus bandwidth and lowers individual read latency by 3 to 5 percent by eliminating unnecessary data transfers and memory copies. In production environments, this architecture allows data centers to replace expensive DRAM with efficient flash storage for specific model components. These optimizations result in up to 20 percent power savings on simpler hardware and a projected 29 percent increase in performance per watt for multi-tenant serving scenarios.

    Sources:
    https://arxiv.org/pdf/2110.11489
    https://lore.kernel.org/linux-nvme/20220630204212.1265638-1-kbusch@fb.com/

    17 min
  3. Storage-next: Do We Need New Hardware for AI Storage, or Just Better Layouts?

    5 DAYS AGO

    Storage-next: Do We Need New Hardware for AI Storage, or Just Better Layouts?

    We review the "Storage-Next" paper, published in November 2025, which argues that a fundamental hardware architectural shift is required to elevate NAND flash from a passive storage tier to an active memory tier capable of "seconds-scale" caching. The authors contend that standard SSDs impose a "channel-side ceiling" on IOPS because they are optimized for 4KB blocks, creating massive bandwidth waste when AI applications demand fine-grained access to small items, such as 128-byte embedding vectors. To solve this, they propose specialized "Storage-Next" drives capable of scalable IOPS at small block sizes (e.g., 50M IOPS at 512B), arguing this hardware is necessary to simplify software stacks and enable high-throughput random access without the read amplification penalties inherent in current technology. However, the episode explores how concurrent research largely rebuts the strict need for this new hardware by demonstrating that intelligent software and driver modifications can mask these inefficiencies on standard drives. Systems like PageANN and FusionANNS show that aggregating topologically related vectors into 4KB pages allows existing SSDs to handle billion-scale search efficiently (a rough IOPS comparison of the two layouts follows the episode list), while Strata utilizes GPU-assisted I/O to bundle fragmented LLM token pages. Furthermore, for workloads specifically requiring fine-grained access, like DLRM, Meta researchers successfully implemented a "software-defined memory" solution using the NVMe SGL Bit Bucket feature to strip unwanted data at the driver level, reducing PCIe bandwidth consumption by 75% on standard hardware. These innovations suggest that, aside from the specific niche of random hash-based lookups where locality is mathematically impossible, software optimization remains a viable alternative to a physical overhaul of storage media.

    We've previously covered some of these papers individually:
    Meta's massive DLRM Linux NVMe SGL bit bucket solution: https://open.spotify.com/episode/7fPOvegGpWWYqChIVYGfwx?si=uxNPv4hZQvumhwwPGowwTA&context=spotify%3Ashow%3A48ygM4upvm6noxCbmhlz8i
    PageANN: https://open.spotify.com/episode/5rrXWA4KJxGHp4xckirlZ2?si=_Qhzy_g1SZyPrBFmHvlY5g
    FusionANNS: https://open.spotify.com/episode/6Ys51jB54GilRlYsvz4yXR?si=yI8KwDE1QpS6BbnFsinl6g
    Strata: https://open.spotify.com/episode/18kCgDcrOsQ5nw58V2HGBB?si=4Rr4ZfqIR-SzaVxyS8hOWA

    Sources:
    November 2025, From Minutes to Seconds: Redefining the Five-Minute Rule for AI-Era Memory Hierarchies, ScaleFlux and NVIDIA and Stanford University, https://arxiv.org/pdf/2511.03944
    September 2025, Scalable Disk-Based Approximate Nearest Neighbor Search with Page-Aligned Graph, University of Texas at Dallas and Rutgers University, https://arxiv.org/pdf/2509.25487
    August 2025, Strata: Hierarchical Context Caching for Long Context Language Model Serving, Stanford University and NVIDIA, https://arxiv.org/pdf/2508.18572
    September 2024, FusionANNS: An Efficient CPU/GPU Cooperative Processing Architecture for Billion-scale Approximate Nearest Neighbor Search, Huazhong University of Science and Technology and Huawei Technologies, https://arxiv.org/pdf/2409.16576
    October 2021, Supporting Massive DLRM Inference Through Software Defined Memory, Facebook, https://arxiv.org/pdf/2110.11489

    15 min
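
A back-of-envelope sketch of the memory wall discussed in the first episode: the Python below compares how fast a single decode stream could run if it were limited only by memory bandwidth versus only by compute. The model size, weight precision, bandwidth, and FLOPS figures are illustrative assumptions, not numbers from the paper.

```python
# Back-of-envelope sketch of the LLM inference "memory wall".
# All hardware and model numbers here are illustrative assumptions.

def decode_limits(params_b, bytes_per_param, hbm_bw_gbs, peak_tflops):
    """Estimate per-token decode limits for a weight-bound workload.

    During autoregressive decode at batch size 1, each generated token
    must stream roughly all model weights from memory once, while doing
    about 2 FLOPs per parameter.
    """
    weight_bytes = params_b * 1e9 * bytes_per_param        # bytes read per token
    flops_per_token = 2 * params_b * 1e9                   # ~2 FLOPs per parameter

    bandwidth_bound = hbm_bw_gbs * 1e9 / weight_bytes      # tokens/s if memory-bound
    compute_bound = peak_tflops * 1e12 / flops_per_token   # tokens/s if compute-bound
    return bandwidth_bound, compute_bound

# Hypothetical 70B-parameter model in 8-bit weights on an accelerator with
# 3 TB/s of HBM bandwidth and 1000 TFLOPS of low-precision compute.
bw_tok, fl_tok = decode_limits(params_b=70, bytes_per_param=1,
                               hbm_bw_gbs=3000, peak_tflops=1000)
print(f"bandwidth-bound: {bw_tok:8.1f} tok/s")   # ~42.9 tok/s
print(f"compute-bound:   {fl_tok:8.1f} tok/s")   # ~7142.9 tok/s
# Compute outruns memory delivery by two orders of magnitude, which is why
# the paper focuses on the memory side: HBF for capacity, PNM for locality.
```

Long contexts make the gap worse, since KV-cache reads add to the per-token byte count; that is part of the capacity argument for High Bandwidth Flash.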
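
For the second episode, the bus-bandwidth arithmetic behind the NVMe SGL bit-bucket approach can be sketched as follows. The embedding-row size is an illustrative assumption, chosen so the savings line up with the roughly 75 percent figure quoted in the episode (about 1 KiB of useful data per 4 KiB logical block); it is not a number taken from the paper.

```python
# Sketch of read amplification for small embedding-row reads on NVMe, and
# the bus-bandwidth savings when the drive can discard the unwanted bytes
# of a block (NVMe SGL bit bucket). The row size below is an illustrative
# assumption; a ~75% saving corresponds to ~1 KiB useful data per 4 KiB block.

LOGICAL_BLOCK = 4096  # bytes per read on a conventional 4 KiB logical-block drive

def bus_traffic(row_bytes, lookups, bit_bucket):
    """Bytes that actually cross the PCIe link for `lookups` row reads."""
    if bit_bucket:
        # Only the requested chunk is DMA'd to host memory; the remainder
        # of the logical block is dropped into the bit bucket.
        return lookups * row_bytes
    # Whole logical blocks are transferred, then trimmed in software.
    return lookups * LOGICAL_BLOCK

row = 1024          # hypothetical embedding-row size in bytes
n = 1_000_000       # number of row lookups

full = bus_traffic(row, n, bit_bucket=False)
trimmed = bus_traffic(row, n, bit_bucket=True)
print(f"without bit bucket: {full / 2**30:.2f} GiB over the bus")
print(f"with bit bucket:    {trimmed / 2**30:.2f} GiB over the bus")
print(f"bandwidth saved:    {100 * (full - trimmed) / full:.0f}%")   # 75%
```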
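
For the third episode, the sketch below contrasts one I/O per vector (the fine-grained access pattern that motivates "Storage-Next" drives) with packing related vectors into ordinary 4 KiB pages (a PageANN/FusionANNS-style layout). Vector size, query fan-out, and query rate are illustrative assumptions, and the page-packed case assumes ideal clustering.

```python
# Sketch of why data layout matters for AI storage: one I/O per 128-byte
# vector (the fine-grained pattern that motivates "Storage-Next" drives)
# versus packing related vectors into ordinary 4 KiB pages (a PageANN /
# FusionANNS style layout). All workload numbers are illustrative.

VECTOR = 128                       # bytes per embedding vector
PAGE = 4096                        # standard SSD read granularity
VECTORS_PER_PAGE = PAGE // VECTOR  # 32 vectors per page

def iops_needed(vectors_per_query, qps, vectors_per_io):
    """I/O operations per second the drive must sustain."""
    ios_per_query = -(-vectors_per_query // vectors_per_io)  # ceiling division
    return ios_per_query * qps

# Hypothetical search workload: each query touches 2000 vectors at 10k QPS.
fine_grained = iops_needed(2000, 10_000, vectors_per_io=1)
page_packed = iops_needed(2000, 10_000, vectors_per_io=VECTORS_PER_PAGE)

print(f"one I/O per vector: {fine_grained / 1e6:.1f}M IOPS")  # 20.0M IOPS
print(f"packed 4 KiB pages: {page_packed / 1e6:.2f}M IOPS")   # 0.63M IOPS, ideal clustering
# If graph neighbors can be co-located on pages, ordinary drives can keep up;
# for random hash-based lookups, locality is impossible and the per-item
# IOPS demand is what motivates "Storage-Next" class hardware.
```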
