Shreya Shankar from UC Berkeley joins the Weaviate Podcast to discuss data agents, the Data Agent Benchmark, and DocETL. The conversation opens by defining what a data agent actually is: not just text-to-SQL over a single table, but an AI system that can reason across dozens of heterogeneous databases, flat files, and knowledge repositories to answer complex organizational questions. Shreya explains why this multi-database reality makes existing benchmarks insufficient, motivating the Data Agent Benchmark, where the best-performing agent achieves only 34–37% pass@1 accuracy. From there, the discussion dives into where agents fail: they don't explore data properly, they generate broken regex patterns, they struggle with different SQL dialects, and they give up when datasets get large. Interestingly, agents tend to pull data into Pandas rather than use database operators directly, likely because LLMs are more fluent in Python than in the nuances of each SQL dialect. The conversation then moves to semantic operators, natural language variants of the relational algebra operators (filter, map, join, aggregation) in which predicates like "Is this a sports article?" replace handwritten regex, with implementations ranging from per-row LLM calls to synthesized code. Shreya then presents DocETL, a declarative system for processing unstructured data that uses LLM agents to propose query rewrite strategies like chunking, splitting, and map-then-reduce decompositions, optimizing for both accuracy and cost on long documents. This leads into a broader discussion of declarative versus imperative agent design: the tradeoff between letting agents write arbitrary Python and constraining them within frameworks that handle optimization and caching. The conversation also explores tribal knowledge, that is, structuring learned facts about data quality into retrievable tables so agents can reuse discoveries across queries, and connects this to recent work on using LLMs to discover new database query rewrite rules.
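To make the semantic filter idea concrete, here is a minimal sketch of the per-row variant: a natural language predicate is evaluated once per row by a pluggable judge. Everything named here (`sem_filter`, `Judge`, `keyword_judge`, the sample articles) is hypothetical and for illustration only; it is not DocETL's API, and the keyword-based judge is only an offline stand-in for what would be an LLM call in practice.

```python
from typing import Callable, Iterable

# Hypothetical signature: a judge takes (natural-language predicate, row text)
# and returns True when the predicate holds for that text.
Judge = Callable[[str, str], bool]


def sem_filter(rows: Iterable[dict], predicate: str, judge: Judge,
               field: str = "text") -> list[dict]:
    """Keep the rows for which the judge says the predicate holds.

    This is the per-row strategy: one judge call for every row.
    """
    return [row for row in rows if judge(predicate, row[field])]


def keyword_judge(predicate: str, text: str) -> bool:
    """Offline stand-in for an LLM call; NOT a real model.

    It ignores the predicate and only checks for a few sports-related words,
    so it works solely for the example predicate used below.
    """
    sports_terms = {"match", "league", "goal", "tournament"}
    words = {w.strip(".,!?").lower() for w in text.split()}
    return bool(words & sports_terms)


# Made-up sample rows for illustration.
articles = [
    {"id": 1, "text": "The league announced a new tournament format."},
    {"id": 2, "text": "Central bank holds interest rates steady."},
    {"id": 3, "text": "A late goal decided the match."},
]

sports = sem_filter(articles, "Is this a sports article?", keyword_judge)
print([a["id"] for a in sports])  # prints [1, 3]
```

Because the judge is just a callable, the same `sem_filter` could in principle be backed by a per-row model call or by synthesized code, which is the range of implementations the episode describes.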
The episode closes with a reflection on how classical database principles like query optimization and cardinality estimation are finding new life in the age of LLM-powered data systems.

0:05 What are Data Agents?
2:10 Multi-Database Systems
9:44 Semantic Operators
13:18 Querying Databases with Python
17:05 DocETL
24:34 Advanced Text-to-SQL
29:30 Claude Code and Databases
34:34 Self-Driving Databases
42:00 Agent Memory for Querying Databases
53:48 Exciting Directions for AI