PaperLedge

Computation and Language - Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Hey PaperLedge learning crew, Ernis here, ready to dive into some seriously fascinating research! Today we’re talking about how computers understand our questions about business data, and I promise, it's way cooler than it sounds!

Think about it: businesses are swimming in data. Sales figures, customer reviews, inventory levels... mountains of information. Wouldn't it be awesome if anyone could just ask a question like, "What marketing campaign led to the biggest increase in sales last quarter?" and get a straight answer from the database, without needing to be a SQL wizard? That's where "text-to-SQL" comes in. It's basically like having a super-smart translator that turns your everyday language into the special code (SQL) needed to pull information from a database.
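Just to make that concrete, here's a rough sketch of the kind of SQL a text-to-SQL system might produce for that marketing question. The tables and columns (sales, campaigns, amount, order_date) are made up purely for illustration, and "last quarter" is hardcoded as Q2 2024; a real system would have to figure all of that out from the actual database schema.

```sql
-- Hypothetical schema: sales(order_id, campaign_id, amount, order_date),
-- campaigns(campaign_id, name). Names and dates are illustrative only.
WITH quarterly AS (
  SELECT campaign_id,
         -- sales attributed to each campaign in "last quarter" (assumed Q2 2024)
         SUM(CASE WHEN order_date >= DATE '2024-04-01' AND order_date < DATE '2024-07-01'
                  THEN amount ELSE 0 END) AS last_q,
         -- sales in the quarter before that, for the comparison
         SUM(CASE WHEN order_date >= DATE '2024-01-01' AND order_date < DATE '2024-04-01'
                  THEN amount ELSE 0 END) AS prior_q
  FROM sales
  GROUP BY campaign_id
)
SELECT c.name,
       q.last_q - q.prior_q AS sales_increase
FROM quarterly q
JOIN campaigns c ON c.campaign_id = q.campaign_id
ORDER BY sales_increase DESC   -- campaign with the biggest quarter-over-quarter jump
LIMIT 1;
```

The point is that the translator has to map fuzzy everyday phrases like "biggest increase" and "last quarter" onto precise joins, date ranges, and aggregations.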

Now, Large Language Models (LLMs), the brains behind today's AI tools, are getting really good at generating code, including SQL. But here's the catch: the tests used to measure how well these LLMs handle text-to-SQL are often too simple. They're like asking a chef to only make toast when they could be preparing a gourmet meal! Most existing benchmarks are geared toward retrieving existing facts, like "How many customers ordered pizza last Tuesday?"

That's why some researchers created CORGI, a new benchmark designed to push these LLMs to the limit in a realistic business setting. Forget simple fact retrieval – CORGI throws problems at the AI that require actual business intelligence, like predicting future trends or recommending actions.

"CORGI is about 21% more difficult than the BIRD benchmark."

CORGI's databases are modeled on real-world companies like DoorDash, Airbnb, and Lululemon, and its questions span four levels of difficulty:

  • Descriptive: Simply describing what happened. Think "What were the average delivery times on Saturday nights?" (I'll sketch what that looks like in SQL right after this list.)
  • Explanatory: Figuring out why something happened. "Why did our Airbnb bookings drop in July compared to June?"
  • Predictive: Forecasting future trends. "Based on current trends, how many yoga pants will Lululemon sell next quarter?"
  • Recommendational: Recommending actions to take. "What promotions should DoorDash run to increase orders during slow hours?"
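To see why the descriptive level is the "toast" of this menu, here's roughly what a query for that Saturday-night delivery question could look like. Again, the deliveries table and its columns are invented for this sketch, and the date functions assume PostgreSQL; the real CORGI schemas will differ.

```sql
-- Hypothetical table: deliveries(delivery_id, created_at, delivered_at).
-- Average delivery time, in minutes, for orders placed on Saturday evenings.
SELECT AVG(EXTRACT(EPOCH FROM (delivered_at - created_at)) / 60.0) AS avg_delivery_minutes
FROM deliveries
WHERE EXTRACT(DOW FROM created_at) = 6      -- Saturday (PostgreSQL: 0 = Sunday)
  AND EXTRACT(HOUR FROM created_at) >= 18;  -- treating "night" as 6 PM onward
```

A single query like this answers the descriptive question completely. The explanatory, predictive, and recommendational levels can't be settled by one query at all, which is exactly where the benchmark starts to bite.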

See how that gets progressively more complex? It's not just about pulling data; it's about causal reasoning, temporal forecasting, and strategic recommendation, the kind of work that requires multi-step thinking!

The researchers found that LLMs struggled with the higher-level questions. They could handle the simple "what happened" queries, but when it came to predicting the future or recommending actions, their performance dropped significantly. Overall, CORGI proved about 21% harder than the BIRD benchmark, exposing a gap between current LLM capabilities and true business intelligence needs.

This is important because it highlights the need for AI tools that can actually understand the complexities of the business world, not just regurgitate data. Think about the possibilities: imagine an AI assistant that can not only answer your questions about your business data but also proactively suggest strategies to improve your bottom line!

The researchers have released the CORGI dataset and evaluation framework publicly, so anyone can test their AI models and contribute to this exciting field.

So, here are a couple of things that popped into my head as I was reading this paper:

  • If LLMs are struggling with predictive questions, what are the implications for businesses currently relying on AI-powered forecasting tools? Are they making decisions based on potentially flawed insights?
  • How can we better train LLMs to understand causal relationships in business data, so they can provide more accurate and reliable recommendations? Is it just more data, or do we need fundamentally different AI architectures?

This is such a fascinating area, and I can’t wait to see how it develops. What do you think, learning crew? Share your thoughts in the comments! Until next time, keep learning and keep questioning!



Credit to Paper authors: Yue Li, Ran Tao, Derek Hommel, Yusuf Denizay Dönder, Sungyong Chang, David Mimno, Unso Eun Seo Jo