Consider the MDP randomly generated using the code below.
import numpy as np
np.random.seed(1)
S = 10 # number of states
A = 2 # number of actions
gamma = 0.95 # discount factor
# transition model: T[s, a, s'] = Prob(s' | s, a), shape (S, A, S)
T = np.random.dirichlet(np.ones(S), (S, A))
# reward function R[s, a] = Reward(s, a)
R = np.random.rand(S, A)
(a) Implement value iteration. Terminate the algorithm when the $\ell_{2}$ norm of the difference between two successive estimates of the value function is less than 1E-6. How many iterations does value iteration take before termination?
Answer. [Write your solution here. Add cells as needed.]
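A minimal sketch of synchronous value iteration under the stopping rule above, assuming the arrays S, A, gamma, T, and R from the setup cell; the names V, Q, and deltas are illustrative, not prescribed.
import numpy as np
# Assumes S, A, gamma, T (shape S x A x S), R (shape S x A) from the setup cell.
V = np.zeros(S)              # initial value estimate
deltas = []                  # l2 norms of successive differences (reused in part (b))
while True:
    # Q[s, a] = R(s, a) + gamma * sum_{s'} T(s' | s, a) * V(s')
    Q = R + gamma * T @ V    # T @ V contracts the last axis, giving shape (S, A)
    V_new = Q.max(axis=1)    # Bellman optimality backup
    deltas.append(np.linalg.norm(V_new - V))
    V = V_new
    if deltas[-1] < 1e-6:
        break
print(f"terminated after {len(deltas)} iterations")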
(b) Plot the $\ell_{2}$-norm of the differences against the number of updates. Comment on the convergence rate of the differences.
Answer. [Write your solution here. Add cells as needed.]
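A possible plotting sketch, assuming the deltas list recorded in the sketch for part (a). Because the Bellman backup is a $\gamma$-contraction, the successive differences should shrink roughly geometrically, i.e. appear close to a straight line on a log-scale plot.
import matplotlib.pyplot as plt
# Assumes `deltas` from the value iteration sketch above.
plt.semilogy(range(1, len(deltas) + 1), deltas)
plt.xlabel("update")
plt.ylabel(r"$\|V_{k+1} - V_k\|_2$")
plt.title("Value iteration convergence")
plt.show()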
(c) What is the final value function?
Answer. [Write your solution here. Add cells as needed.]
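If value iteration is implemented as in the sketch for part (a), reporting the final value function is just a matter of printing the converged array, e.g.:
print(np.round(V, 4))  # converged value estimate for each state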
(d) What is the policy obtained using the final value function?
Here, given a value function $V$, the corresponding policy is defined as
\begin{align*}
\pi(s) = \arg\max_{a} (R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) V(s')).
\end{align*}
Answer. [Write your solution here. Add cells as needed.]
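A sketch of this greedy policy extraction, assuming V is the converged value function from part (a):
# Greedy policy with respect to the final value function V (shape (S,)).
Q = R + gamma * T @ V        # Q[s, a] = R(s, a) + gamma * sum_{s'} T(s' | s, a) * V(s')
policy = Q.argmax(axis=1)    # pi(s) = argmax_a Q[s, a]
print(policy)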