[QA] Evaluation of Large Language Models via Coupled Token Generation

Arxiv Papers

This paper argues that evaluations of large language models should control for the randomness in their sampling, showing that coupled autoregressive generation needs fewer samples than vanilla autoregressive generation yet can yield different model rankings.

https://arxiv.org/abs/2502.01754
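
For listeners who want a concrete picture of "coupled" generation, here is a minimal sketch of the idea of letting two models sample with the same source of randomness. It uses the Gumbel-max trick with a shared seed, which is one standard way to do this and is an assumption here, not necessarily the paper's exact construction. The toy models `model_a` and `model_b`, the 4-token vocabulary, and the helper names are all hypothetical.

```python
import math
import random

def gumbel_noise(rng, vocab_size):
    # One Gumbel(0, 1) draw per vocabulary entry.
    # The clamp keeps u away from 0 so both logarithms stay finite.
    noise = []
    for _ in range(vocab_size):
        u = max(rng.random(), 1e-12)
        noise.append(-math.log(-math.log(u)))
    return noise

def sample_token(logits, noise):
    # Gumbel-max trick: the argmax of (logit + Gumbel noise) is distributed
    # as a sample from softmax(logits).
    return max(range(len(logits)), key=lambda i: logits[i] + noise[i])

def generate(model, seed, steps, vocab_size):
    # Autoregressive loop: each step draws a fresh noise vector from the
    # seeded generator and picks the next token from the model's logits.
    rng = random.Random(seed)
    tokens = []
    for _ in range(steps):
        noise = gumbel_noise(rng, vocab_size)
        tokens.append(sample_token(model(tokens), noise))
    return tokens

# Hypothetical toy "models": functions from a token prefix to logits.
def model_a(prefix):
    return [1.0, 0.5, 0.0, -0.5]

def model_b(prefix):
    return [0.9, 0.6, 0.0, -0.5]

VOCAB_SIZE = 4
STEPS = 10

# Coupled: both models consume the same noise sequence (same seed).
coupled_a = generate(model_a, seed=0, steps=STEPS, vocab_size=VOCAB_SIZE)
coupled_b = generate(model_b, seed=0, steps=STEPS, vocab_size=VOCAB_SIZE)

# Vanilla: each model samples with its own independent randomness.
vanilla_a = generate(model_a, seed=1, steps=STEPS, vocab_size=VOCAB_SIZE)
vanilla_b = generate(model_b, seed=2, steps=STEPS, vocab_size=VOCAB_SIZE)
```

Because every step draws exactly `VOCAB_SIZE` noise values regardless of which token is chosen, two runs with the same seed see identical noise at every position. In this sketch, any disagreement between `coupled_a` and `coupled_b` therefore comes from the difference in logits rather than from sampling luck.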

YouTube: https://www.youtube.com/@ArxivPapers

TikTok: https://www.tiktok.com/@arxiv_papers

Apple Podcasts: https://podcasts.apple.com/us/podcast/arxiv-papers/id1692476016

Spotify: https://podcasters.spotify.com/pod/show/arxiv-papers
