STAT3007/7007 Deep Learning, Prac 12

2022 Semester 1

Q1. Value iteration

Consider the MDP randomly generated using the code below.

import numpy as np

np.random.seed(1)

S = 10 # number of states
A = 2 # number of actions
gamma = 0.95 # discount factor

# transition model: T[s1, a, s2] = Prob(s2 | s1, a)
T = np.random.dirichlet(np.ones(S), (S, A)) 

# reward function R[s, a] = Reward(s, a)
R = np.random.rand(S, A)

(a) Implement value iteration. Terminate the algorithm when the $\ell_{2}$ norm of the difference between two successive estimates of the value function is less than $10^{-6}$. How many iterations does value iteration take before termination?

Answer. [Write your solution here. Add cells as needed.]
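As a starting point, here is one possible sketch of value iteration. It assumes the setup cell above has been run (so S, A, gamma, T and R are defined); the names V and diffs are illustrative.

import numpy as np

V = np.zeros(S)   # initial value estimate
diffs = []        # l2 norms of successive differences

while True:
    # Bellman optimality update: Q[s, a] = R[s, a] + gamma * sum_{s'} T[s, a, s'] V[s']
    Q = R + gamma * T @ V  # T @ V contracts the last axis of T, giving shape (S, A)
    V_new = Q.max(axis=1)
    diffs.append(np.linalg.norm(V_new - V))
    V = V_new
    if diffs[-1] < 1e-6:
        break

print(f'{len(diffs)} iterations')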

(b) Plot the $\ell_{2}$-norm of the differences against the number of updates. Comment on the convergence rate of the differences.

Answer. [Write your solution here. Add cells as needed.]
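Assuming the diffs list recorded in the sketch for (a), a log-scale plot makes the convergence behaviour easy to read:

import matplotlib.pyplot as plt

plt.semilogy(range(1, len(diffs) + 1), diffs)
plt.xlabel('update')
plt.ylabel(r'$\ell_2$ norm of successive difference')
plt.show()

Since the Bellman optimality operator is a $\gamma$-contraction, the differences should shrink by roughly a factor of $\gamma$ per update (geometric convergence), which appears as an approximately straight line on the log scale.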

(c) What is the final value function?

Answer. [Write your solution here. Add cells as needed.]
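If the sketch from (a) has been run, the converged estimate can simply be inspected (rounding is only for readability):

print(np.round(V, 4))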

(d) What is the policy obtained using the final value function? Here, given a value function $V$, the corresponding policy is defined as
\begin{align*} \pi(s) = \arg\max_{a} (R(s, a) + \gamma \sum_{s'} T(s' \mid s, a) V(s')). \end{align*}

Answer. [Write your solution here. Add cells as needed.]
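One way to extract the greedy policy, mirroring the definition above (again assuming V from the sketch in (a)):

# Q[s, a] = R(s, a) + gamma * sum_{s'} T(s' | s, a) V(s')
Q = R + gamma * T @ V
pi = Q.argmax(axis=1)  # greedy action in each state
print(pi)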