Studying Sutton and Barto's RL book and its connections to RL for LLMs (e.g., tool use, math reasoning, agents, and so on)? [D]

Reddit r/MachineLearning / 4/9/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • A Reddit user asks for study guidance on how to learn RL (using Sutton & Barto) and connect it to modern “RL for LLMs,” especially topics like tool use, agent behavior, and math reasoning.
  • The user proposes an LLM-chosen reading path focused on foundational RL chapters (Intro, finite MDPs, TD learning, on-policy prediction/control, and policy gradients) and seeks feedback on whether these selections are appropriate.
  • They note Sutton & Barto predates widely discussed modern RL methods (e.g., PPO/GRPO) and question whether to supplement with other resources such as an online RL course or Joseph Suarez’s RL guide.
  • The thread implicitly frames the challenge of bridging classical RL theory (MDPs, value/policy learning) with current research directions in RL-driven LLM alignment and agentic tool use.

Hi everyone,

I graduated from a master's program in mathematics last summer. In recent months, I have been trying to understand more about ML/DL and LLMs, so I have been reading books and the occasional paper on LLMs and their reasoning capabilities (I'm especially interested in AI for math). When I read about RL on Wikipedia, I found it really interesting too, so I wanted to learn more about RL and its connections to LLMs.

The canonical book on RL, "Sutton and Barto", was published in 2020, before LLMs became really popular, so it does not mention things like PPO, GRPO, and so on. I asked LLMs to select relevant chapters from the book so that I could study in a more focused way, and they selected Chapters 1 (Intro), 3 (Finite MDP), 6 (TD Learning), and then 9 (On-policy prediction with approx), 10 (on-policy ...), 11 (on-policy control with approx), and 13 (Policy gradient methods).

So I have the following questions that I was hoping you could help me with:

  • What do you think of their selections, and do you have better recommendations?
  • Do you think these chapters are a good first step for understanding the landscape before reading and experimenting with modern RL-for-LLM papers? Or should I just go with Alberta's online RL course?
  • Joseph Suarez wrote "An Ultra Opinionated Guide to Reinforcement Learning", but I think it's mostly about non-LLM RL?

Thank you a lot for your time!

submitted by /u/hedgehog0