Can GRPO be used for multi-turn RL?
https://arxiv.org/abs/2402.03300
Some of you have probably seen Group Relative Policy Optimization (GRPO), the RL alternative to PPO where, instead of training a value model, you sample the policy several times on the same prompt, take the group's mean reward as a baseline, and use that to estimate each sample's advantage.
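For reference, here's a minimal sketch of my understanding of that advantage step (the normalization details are my own approximation of the paper's outcome-supervision setup, not the actual DeepSeekMath code):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one prompt.

    rewards: one scalar reward per sampled completion of the same prompt.
    The group mean acts as the baseline a value model would otherwise provide.
    """
    r = np.asarray(rewards, dtype=np.float64)
    baseline = r.mean()
    # Normalize by the group std so advantages are comparable across prompts
    return (r - baseline) / (r.std() + eps)

# e.g. 4 completions of the same math problem, 1 = correct, 0 = wrong
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```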
From reviewing the implementation, it looks like there is only a single turn in the dialogue: the LLM either correctly solves the math problem or it fails. In that case the reward and the value are the same, since the expected future reward is just the terminal reward.
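To spell out what I mean: with a one-step episode the return collapses to the terminal reward, so the group-mean baseline plays exactly the role a value estimate would (my notation, not the paper's):

```latex
% Single-turn case: the episode ends after one response, so the return is the reward.
G_i = r_i, \qquad
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}
```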
Could GRPO be applied to multi-turn RL or longer-horizon tasks, where the policy interacts with the environment multiple times before the episode ends?