Value iteration

rasoolfa · December 3, 2022, 7:42am

https://d2l.ai/chapter_reinforcement-learning/value-iter.html

Ritsuki_YAMADA · April 9, 2023, 6:53pm

Typo: the left hand side of the equation (17.2.9) should be pi*(s).

leo_leng · October 29, 2023, 1:15pm

Great article that clarifies some basic ideas behind RL. I think it’s a really good starting point for learning RL.
But there are some small errors:

The expectation over $r(s_0, a_0)$ is also needed for Eq. 17.2.2;
In Eq. 17.2.9, it should be max rather than arg_max.

Ashutosh_Nirala · April 25, 2024, 6:07pm

Seems like there is a typo in Equation 17.2.2.

Shouldn’t the expectation over a_0 also include the first term?

eTimber_lan · July 31, 2024, 9:55am

Do these code blocks still work? Or do you need to follow the order of the book for the code to work? I haven’t been able to get the first chunk running for this, and I’d really like to try out these exercises for reinforcement learning.

cddc · September 29, 2024, 6:03am

There are two problems with this article here.
The first is equation 17.2.9, where ‘argmax’ should be corrected to ‘max’.
The second is that in the code implementation of the value_iteration function, the comment says “Calculate \sum_{s'} p(s'\mid s,a) [r + \gamma v_k(s')]”, which should be fixed to “Calculate [r + \sum_{s'} p(s'\mid s,a) \gamma v_k(s')]”, based on the previous equation 17.2.13.

Q[k,s,a] += pr * (reward + gamma * V[k - 1, nextstate])

should be fixed to:

Q[k,s,a] += (reward + pr * gamma * V[k - 1, nextstate])
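A quick numerical check may help when comparing the two update lines. The sketch below assumes the transitions for a single (s, a) pair are stored as (pr, nextstate, reward) tuples, loosely mirroring the loop in the book's value_iteration function; the transition list, `gamma`, `V_prev`, and the three helper functions are made up for illustration. Under the assumption that the probabilities sum to 1 and the reward is the same for every next state, accumulating `pr * (reward + gamma * V)` appears to agree with r + \gamma \sum_{s'} p(s'\mid s,a) v_k(s'), while adding `reward` unweighted on every pass of the loop counts it once per transition.

```python
import numpy as np

gamma = 0.9
# Hypothetical value estimates v_k for three states (made-up numbers).
V_prev = np.array([0.0, 1.0, 2.0])
# Hypothetical transitions for one (s, a) pair: (probability, next state, reward).
# The probabilities sum to 1 and the reward is identical for every next state.
transitions = [(0.5, 0, 1.0), (0.3, 1, 1.0), (0.2, 2, 1.0)]


def q_weighted(transitions, V_prev, gamma):
    """Accumulate pr * (reward + gamma * V[nextstate]) over all transitions."""
    q = 0.0
    for pr, nextstate, reward in transitions:
        q += pr * (reward + gamma * V_prev[nextstate])
    return q


def q_unweighted_reward(transitions, V_prev, gamma):
    """Accumulate reward + pr * gamma * V[nextstate] over all transitions."""
    q = 0.0
    for pr, nextstate, reward in transitions:
        q += reward + pr * gamma * V_prev[nextstate]
    return q


def q_closed_form(transitions, V_prev, gamma):
    """r + gamma * sum_{s'} p(s'|s,a) V[s'], assuming one shared reward r."""
    r = transitions[0][2]
    expected_next = sum(pr * V_prev[ns] for pr, ns, _ in transitions)
    return r + gamma * expected_next


print("weighted:         ", q_weighted(transitions, V_prev, gamma))
print("unweighted reward:", q_unweighted_reward(transitions, V_prev, gamma))
print("closed form:      ", q_closed_form(transitions, V_prev, gamma))
```

In this toy setting the two ways of writing the formula describe the same quantity, so the choice between them mainly matters for how the loop body is written, not for the math.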

MarianSlassi · July 4, 2026, 7:16pm

Created a PR for fixing equation 17.2.2:

github.com/d2l-ai/d2l-en

Clarify value function decomposition in reinforcement learning

d2l-ai:master ← MarianSlassi:patch-1

opened 06:51PM - 04 Jul 26 UTC

Revised the mathematical representation of the value function to clarify the decomposition of state value into immediate reward and expected future value.

Move the immediate reward term inside the expectation over a_0, since r(s_0, a_0) depends on the action sampled from the policy.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Resolves #2728

Current formula:

V^pi(s_0) = r(s_0, a_0) + gamma * E_{a_0 ~ pi(s_0)}[
    E_{s_1 ~ P(s_1 | s_0, a_0)}[V^pi(s_1)]
]

Proposed formula:

V^pi(s_0) = E_{a_0 ~ pi(s_0)}[
    r(s_0, a_0) + gamma * E_{s_1 ~ P(s_1 | s_0, a_0)}[V^pi(s_1)]
]

Reason: the immediate reward term r(s_0, a_0) depends on a_0. Since a_0 is sampled from the policy pi(s_0), this reward term should be inside the expectation over a_0. Otherwise the formula looks as if a_0 is already fixed in r(s_0, a_0), while the following expectation still averages over a_0.
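The proposed form can be sanity-checked numerically. The sketch below is only an illustration: the two-state, two-action MDP (`P`, `r`), the stochastic policy `pi`, and `gamma` are all made-up values, not taken from the book. It solves for V^pi exactly from the policy-averaged reward and transition matrix, then evaluates the right-hand side of the proposed formula, with r(s_0, a_0) inside the expectation over a_0, and compares the two.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP with 2 states and 2 actions (all numbers made up).
# P[s, a, s'] is the transition probability, r[s, a] the immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
# Stochastic policy: pi[s, a] is the probability of taking action a in state s.
pi = np.array([[0.3, 0.7],
               [0.6, 0.4]])

# Policy-averaged reward and transition matrix.
r_pi = (pi * r).sum(axis=1)            # r_pi[s] = E_{a ~ pi(s)}[r(s, a)]
P_pi = np.einsum("sa,sat->st", pi, P)  # P_pi[s, s'] = E_{a ~ pi(s)}[P(s' | s, a)]

# Solve the linear system V = r_pi + gamma * P_pi V exactly.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Right-hand side of the proposed formula, evaluated for every start state:
# E_{a_0 ~ pi(s_0)}[ r(s_0, a_0) + gamma * E_{s_1 ~ P(. | s_0, a_0)}[V(s_1)] ]
rhs = (pi * (r + gamma * (P @ V))).sum(axis=1)

print("V^pi:", V)
print("rhs: ", rhs)
print("match:", np.allclose(V, rhs))
```

Because the reward is averaged with the same weights pi(a_0 | s_0) as the future value, the check works for any stochastic policy, which is the point of moving the reward term inside the expectation.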

MarianSlassi · July 6, 2026, 8:54am

upd: also formula 17.2.7. On the left-hand side we have a stochastic policy, but argmax seems to return a set of actions, and those are different mathematical objects. If we assume that a stochastic policy can be described as a series of actions taken from every state, then we are safe. But I am afraid it’s easy to get puzzled there.

upd 2: the definition in 17.2.9 is the same as in 17.2.7, which almost defines a stochastic/deterministic policy as a value function. Be careful. Other contributors have pointed out that you need to pay attention to where to use argmax and where to use plain ‘max’: despite their similarity, the two operations return different objects.
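To make the max versus argmax distinction concrete, here is a small sketch on a made-up table of action values. The array `Q_star` and the names `V_star`, `pi_star`, and `greedy_sets` are hypothetical and chosen only for illustration. Taking the max over actions yields a value per state, taking the argmax yields an action per state, and when two actions tie the set of maximizers has more than one element.

```python
import numpy as np

# Hypothetical action values Q*(s, a) for 3 states and 2 actions (made-up numbers).
Q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 4.0]])  # state 2 has a tie between both actions

# max over actions: one real number per state (a value function).
V_star = Q_star.max(axis=1)

# argmax over actions: one action index per state (a deterministic policy).
# np.argmax returns the first maximizer when there is a tie.
pi_star = Q_star.argmax(axis=1)

# The full set of maximizing actions per state, which can hold several actions.
greedy_sets = [np.flatnonzero(np.isclose(q, q.max())).tolist() for q in Q_star]

print("V*(s)          =", V_star)
print("pi*(s)         =", pi_star)
print("argmax sets    =", greedy_sets)
```

A deterministic greedy policy picks one element from each set, while a stochastic policy could spread probability over all tied actions, which is one way to read the concern about the two kinds of objects.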