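Time: Tuesday, June 23, 2026, 10:00 - 11:00
Venue: Room A1514, Science Building, Putuo Campus
Speaker: Chengchun Shi (史成春), Associate Professor, London School of Economics and Political Science
Host: Yingying Zhang (章迎莹), Associate Professor, East China Normal University
Abstract: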
Group relative policy optimization (GRPO), a core methodological component of DeepSeekMath and DeepSeek-R1, has emerged as a cornerstone for scaling the reasoning capabilities of large language models. Despite its widespread adoption and the proliferation of follow-up works, the theoretical properties of GRPO remain less studied. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics. We demonstrate that the GRPO policy gradient is inherently a U-statistic, allowing us to characterize its mean squared error (MSE) and to derive the finite-sample error bound and asymptotic distribution of the suboptimality gap for its learned policy. Our findings reveal that GRPO is asymptotically equivalent to an oracle policy gradient algorithm – one with access to a value function that quantifies the goodness of its learning policy at each training iteration – and achieves asymptotically optimal performance within a broad class of policy gradient algorithms. Furthermore, we establish a universal scaling law that offers principled guidance for selecting the optimal group size. Empirical experiments validate our theoretical findings, demonstrating that the optimal group size is universal, and verify the oracle property of GRPO.
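About the speaker:
Dr. Chengchun Shi is an Associate Professor in the Department of Statistics at the London School of Economics and Political Science. He received his PhD in Statistics from North Carolina State University. His research focuses on reinforcement learning, particularly its application and optimization in areas such as policy evaluation, causal inference, and semi-supervised learning. Dr. Shi has received honors including the Institute of Mathematical Statistics (IMS) Tweedie Award and the Royal Statistical Society (RSS) Research Prize.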