Q-learning

https://d2l.ai/chapter_reinforcement-learning/qlearning.html

1 Like

https://d2l.ai/chapter_reinforcement-learning/qlearning.html

When I click on this URL I get redirected to the homepage, does it mean it is not published yet? Is there a way I can see it anyway?

Found it at: http://preview.d2l.ai/d2l-en/master/chapter_reinforcement-learning/qlearning.html

1 Like

Looking forward to the DQN chapter.

Coming from the MITx’s McroMasters series on Stats and Probability a year ago, I can still understand (mostly) what’s going on.

Great work. Thank you.

1 Like

For convenience, these 2 functions should make it easy to switch into 8x8. Param “size” expects either “4x4” or “8x8”. These 2 functions exist within torch: i just copied them out and adapted for my own use.

Everything else kept the same: A 4 times increase in lake area resulted in an increase of iterations (to reach something usable), from 256 to about 17000, a 66 times increase. Even then, the bottom left area is clearly not optimal.

def frozen_lake_sizeable(seed, size):

    """Defined in :numref:`sec_utils`"""
    # See https://www.gymlibrary.dev/environments/toy_text/frozen_lake/ to learn more about this env
    # How to process env.P.items is adpated from https://sites.google.com/view/deep-rl-bootcamp/labs

    import gym
    env = gym.make('FrozenLake-v1', is_slippery=False, map_name=size)
    env.seed(seed)
    env.action_space.np_random.seed(seed)
    env.action_space.seed(seed)
    env_info = {}
    env_info['desc'] = env.desc  # 2D array specifying what each grid item means
    env_info['num_states'] = env.nS  # Number of observations/states or obs/state dim
    env_info['num_actions'] = env.nA  # Number of actions or action dim

    # Define indices for (transition probability, nextstate, reward, done) tuple
    env_info['trans_prob_idx'] = 0  # Index of transition probability entry
    env_info['nextstate_idx'] = 1  # Index of next state entry
    env_info['reward_idx'] = 2  # Index of reward entry
    env_info['done_idx'] = 3  # Index of done entry
    env_info['mdp'] = {}
    env_info['env'] = env

    for (s, others) in env.P.items():
        # others(s) = {a0: [ (p(s'|s,a0), s', reward, done),...], a1:[...], ...}
        for (a, pxrds) in others.items():
            # pxrds is [(p1,next1,r1,d1),(p2,next2,r2,d2),..].
            # e.g. [(0.3, 0, 0, False), (0.3, 0, 0, False), (0.3, 4, 1, False)]
            env_info['mdp'][(s,a)] = pxrds
    return env_info

def show_Q_function_progress(env_desc, V_all, pi_all, size):

    """Defined in :numref:`sec_utils`"""
    # This function visualizes how value and policy changes over time.
    # V: [num_iters, num_states]
    # pi: [num_iters, num_states]

    from matplotlib import pyplot as plt
    if size == "4x4":
        side = 4
    if size == "8x8":
        side = 8
    # We want to only shows few values
    num_iters_all = V_all.shape[0]
    num_iters = num_iters_all // 10
    vis_indx = np.arange(0, num_iters_all, num_iters).tolist()
    vis_indx.append(num_iters_all - 1)
    V = np.zeros((len(vis_indx), V_all.shape[1]))
    pi = np.zeros((len(vis_indx), V_all.shape[1]))
    for c, i in enumerate(vis_indx):
        V[c]  = V_all[i]
        pi[c] = pi_all[i]
    num_iters = V.shape[0]
    fig, ax = plt.subplots(figsize=(15, 15))
    for k in range(V.shape[0]):
        plt.subplot(4, 4, k + 1)
        plt.imshow(V[k].reshape(side, side), cmap="bone")
        ax = plt.gca()
        ax.set_xticks(np.arange(0, side+1)-.5, minor=True)
        ax.set_yticks(np.arange(0, side+1)-.5, minor=True)
        ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
        ax.tick_params(which="minor", bottom=False, left=False)
        ax.set_xticks([])
        ax.set_yticks([])

        # LEFT action: 0, DOWN action: 1
        # RIGHT action: 2, UP action: 3
        action2dxdy = {0:(-.25, 0),1:(0, .25),
                       2:(0.25, 0),3:(-.25, 0)}

        for y in range(side):
            for x in range(side):
                action = pi[k].reshape(side, side)[y, x]
                dx, dy = action2dxdy[action]
                if env_desc[y,x].decode() == 'H':
                    ax.text(x, y, str(env_desc[y,x].decode()),
                       ha="center", va="center", color="y",
                         size=20, fontweight='bold')
                elif env_desc[y,x].decode() == 'G':
                    ax.text(x, y, str(env_desc[y,x].decode()),
                       ha="center", va="center", color="w",
                         size=20, fontweight='bold')
                else:
                    ax.text(x, y, str(env_desc[y,x].decode()),
                       ha="center", va="center", color="g",
                         size=15, fontweight='bold')

                # No arrow for cells with G and H labels
                if env_desc[y,x].decode() != 'G' and env_desc[y,x].decode() != 'H':
                    ax.arrow(x, y, dx, dy, color='r', head_width=0.2, head_length=0.15)
        ax.set_title("Step = "  + str(vis_indx[k] + 1), fontsize=20)
    fig.tight_layout()
    plt.show()
1 Like

Thank you for the feedback and glad that you found our RL chapters useful.
Sure, DQN will be coming in a few weeks.

Yes, the number of training steps increases as state space gets larger. epsilon and alpha might need to be changed for a larger version of this problem. In general, model-free methods have very high sample complexity.

We’ll definitely make our codes more flexible so it can better visualize larger size grids. Also, thank you for your code snippet.

Rasool

1 Like