Deriving the Bellman equation for value and Q functions
Now let us see how to derive Bellman equations for value and Q functions.
You can skip this section if you are not interested in mathematics; however, the math will be super intriguing.
First, we define $\mathcal{P}_{ss'}^{a}$ as the transition probability of moving from state $s$ to state $s'$ while performing an action $a$:

$$\mathcal{P}_{ss'}^{a} = pr(s_{t+1} = s' \mid s_t = s, a_t = a) \tag{1}$$
We define $\mathcal{R}_{ss'}^{a}$ as the reward probability received by moving from state $s$ to state $s'$ while performing an action $a$:

$$\mathcal{R}_{ss'}^{a} = \mathbb{E}\left(r_{t+1} \mid s_t = s, s_{t+1} = s', a_t = a\right) \tag{2}$$

Recall that the return $R_t$ is the total discounted reward from time step $t$:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \tag{3}$$
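To make these two quantities concrete, here is a minimal sketch of how $\mathcal{P}_{ss'}^{a}$ and $\mathcal{R}_{ss'}^{a}$ could be stored for a tiny, made-up two-state MDP; the state names, action names, and numbers are purely illustrative and are not taken from the text:

```python
# Hypothetical two-state MDP used only to illustrate the notation.
# P[s][a][s2] is the transition probability P_ss'^a,
# R[s][a][s2] is the expected reward R_ss'^a.

states = ["s0", "s1"]
actions = ["left", "right"]

P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.5, "s1": 0.5}, "right": {"s0": 0.0, "s1": 1.0}},
}

R = {
    "s0": {"left": {"s0": 0.0, "s1": 1.0}, "right": {"s0": 0.0, "s1": 2.0}},
    "s1": {"left": {"s0": 0.0, "s1": 0.0}, "right": {"s0": 0.0, "s1": 5.0}},
}

# Each row of P must sum to 1 over the next states s'.
for s in states:
    for a in actions:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```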
We know that the value function can be represented as:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s\right] \tag{4}$$

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right] \tag{5}$$
We can rewrite our value function by taking the first reward out:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right] \tag{6}$$
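This step of pulling the first reward out of the sum is easy to check numerically. The following sketch uses a made-up finite reward sequence and a discount factor of 0.9 (both purely illustrative) to confirm that $\sum_{k}\gamma^k r_{t+k+1} = r_{t+1} + \gamma\sum_{k}\gamma^k r_{t+k+2}$:

```python
# A made-up reward sequence r_{t+1}, r_{t+2}, ... from one finite episode.
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]
gamma = 0.9

# Full discounted return: sum_k gamma^k * r_{t+k+1}
full_return = sum(gamma ** k * r for k, r in enumerate(rewards))

# The same return with the first reward taken out:
# r_{t+1} + gamma * sum_k gamma^k * r_{t+k+2}
rest = sum(gamma ** k * r for k, r in enumerate(rewards[1:]))
decomposed = rewards[0] + gamma * rest

print(full_return, decomposed)  # both print the same number
assert abs(full_return - decomposed) < 1e-12
```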
The expectation in the value function specifies the expected return if we are in state $s$ and choose actions according to the policy π.
So, we can rewrite the expectation explicitly by summing over all possible actions $a$ and next states $s'$ as follows:
$$\mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a}\, \mathbb{E}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s, a_t = a, s_{t+1} = s'\right]$$
In the right-hand term, the discounted sum of future rewards depends only on the next state $s'$, so we can substitute from equation (5): its expectation is simply the value of that next state:
$$\mathbb{E}_{\pi}\left[\gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_{t+1} = s'\right] = \gamma V^{\pi}(s')$$
Similarly, in the left-hand term, we substitute the value of $r_{t+1}$ from equation (2):
$$\mathbb{E}\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right] = \mathcal{R}_{ss'}^{a}$$
So, our final expectation equation becomes:
$$\mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a} \left[\mathcal{R}_{ss'}^{a} + \gamma V^{\pi}(s')\right] \tag{7}$$
Now we will substitute our expectation (7) into the value function (6) as follows:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a} \left[\mathcal{R}_{ss'}^{a} + \gamma V^{\pi}(s')\right]$$
Dropping the intermediate expectation, our final value function looks like the following:
$$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}_{ss'}^{a} \left[\mathcal{R}_{ss'}^{a} + \gamma V^{\pi}(s')\right] \tag{8}$$
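The Bellman equation (8) translates directly into an iterative policy-evaluation update: repeatedly replace $V^{\pi}(s)$ with the right-hand side until the values stop changing. Below is a minimal sketch, reusing the hypothetical `P`, `R`, `states`, and `actions` from the earlier toy-MDP snippet and assuming an illustrative uniform-random policy and a discount factor of 0.9:

```python
gamma = 0.9

# Illustrative policy pi(s, a): uniform over both actions in every state.
pi = {s: {a: 1.0 / len(actions) for a in actions} for s in states}

# Iterate the Bellman expectation backup of equation (8) until convergence;
# P, R, states, and actions are the toy MDP from the earlier sketch.
V = {s: 0.0 for s in states}
for _ in range(10_000):
    new_V = {
        s: sum(
            pi[s][a]
            * sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
            for a in actions
        )
        for s in states
    }
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-10:
        V = new_V
        break
    V = new_V

print(V)  # converged state values under the uniform random policy
```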
In very similar fashion, we can derive a Bellman equation for the Q function; the final equation is as follows:
$$Q^{\pi}(s,a) = \sum_{s'} \mathcal{P}_{ss'}^{a} \left[\mathcal{R}_{ss'}^{a} + \gamma \sum_{a'} \pi(s',a')\, Q^{\pi}(s',a')\right] \tag{9}$$
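Under the same assumptions as before, the Q function version can be evaluated the same way: for each state-action pair, average over next states and plug in the policy-weighted Q values of the successor state. A sketch, again reusing the hypothetical toy MDP, policy `pi`, and `gamma` from the snippets above:

```python
# Iterate the Bellman backup of equation (9) for the Q function, reusing
# P, R, states, actions, pi, and gamma from the sketches above.
Q = {s: {a: 0.0 for a in actions} for s in states}
for _ in range(10_000):
    new_Q = {
        s: {
            a: sum(
                P[s][a][s2]
                * (
                    R[s][a][s2]
                    + gamma * sum(pi[s2][a2] * Q[s2][a2] for a2 in actions)
                )
                for s2 in states
            )
            for a in actions
        }
        for s in states
    }
    diff = max(abs(new_Q[s][a] - Q[s][a]) for s in states for a in actions)
    Q = new_Q
    if diff < 1e-10:
        break

print(Q)  # consistent with V: V(s) = sum_a pi(s, a) * Q(s, a)
```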
Now that we have Bellman equations for both the value function and the Q function, we will see how to find the optimal policies.