I am learning about Reinforcement Learning and came across a YouTube video series to get me started. It is definitely a little difficult for someone just starting out, but I was intrigued and decided to see it through. I am most excited about learning how to capture metrics and generate graphs from a Python script. The series has generally been useful for understanding the concept of training an algorithm, and the full series is likely the best route for anyone wanting to try this out. These are just my collected notes and points to remember to help me keep up with the YouTube series.

```python
import gym

env = gym.make("MountainCar-v0")
env.reset()

done = False
while not done:
    action = 2  # always push right
    new_state, reward, done, _ = env.step(action)
    env.render()
env.close()
```

This piece of script can move the car, but the car does not have enough momentum to make it up the hill to the yellow flag.

The equation that is needed to make this work is:
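The equation itself did not survive in my notes (it was an image), but it is the standard Q-learning update rule, reconstructed here from the usual form of the algorithm rather than copied from the video:

```latex
Q_{\text{new}}(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) \right)
```

where \(\alpha\) is the learning rate and \(\gamma\) is the discount factor (0.1 and 0.95 in the settings used later).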

This made little sense to me at first, but the course then translated it into Python, which made an inch more sense. We want to create a large table so that, given any combination of state (position, velocity), we can look up the 3 Q values and pick the action with the largest Q value from the Q table. Initially the table is random, so the agent will explore at first and then slowly update the values as time goes on.
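As a concrete toy illustration of that update: the reward and best future Q value below are made-up numbers, and the real code indexes the table by the discretized state rather than using a single row.

```python
import numpy as np

LEARNING_RATE = 0.1
DISCOUNT = 0.95

# Tiny stand-in Q "table": 3 Q values (one per action) for a single state bucket
q_table = np.array([-1.0, -0.5, -2.0])

action = int(np.argmax(q_table))  # pick the action with the largest Q value
reward = -1.0                     # MountainCar gives -1 per step
max_future_q = -0.4               # made-up best Q value of the next state
current_q = q_table[action]

# The Q-learning update: blend the current Q value with the
# reward plus the discounted best future Q value
new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
q_table[action] = new_q

print(action, round(new_q, 3))  # 1 -0.588
```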

First, we need to build the Q table.

If we print the new_state variable and run the script, we will see output like the following.

Every combination out to 8 decimal places would take forever to learn on a table of that size. We need to convert these continuous values into discrete values; in other words, we want to bucket the information into a more manageable size. This will need to be tweaked as we go along.

We can query the environment and print the following from the script (run it in the terminal, or press F7 in Sublime on Windows) to get the output below.

```python
print(env.observation_space.high)
print(env.observation_space.low)
print(env.action_space.n)
```

Next, we need to figure out how many chunks lie between each step: what is the discrete window size?

```python
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)
discrete_os_win_size = (env.observation_space.high - env.observation_space.low) / DISCRETE_OS_SIZE
print(discrete_os_win_size)
```

Now we can create our Q table, initialized with random values.

```python
import numpy as np

q_table = np.random.uniform(low=-2, high=0, size=(DISCRETE_OS_SIZE + [env.action_space.n]))
print(q_table.shape)
print(q_table)
```

The table's values are now updated with rewards and new states as actions are completed. The car builds momentum as it learns and is rewarded over time.
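The notes don't show how a continuous observation gets mapped to Q-table indices. A minimal sketch of that bucketing, with the MountainCar-v0 bounds hardcoded from what `observation_space` prints (roughly position in [-1.2, 0.6] and velocity in [-0.07, 0.07]), could look like:

```python
import numpy as np

# Bounds hardcoded from env.observation_space for MountainCar-v0
OBS_HIGH = np.array([0.6, 0.07])
OBS_LOW = np.array([-1.2, -0.07])
DISCRETE_OS_SIZE = [20] * 2
discrete_os_win_size = (OBS_HIGH - OBS_LOW) / DISCRETE_OS_SIZE

def get_discrete_state(state):
    """Map a continuous (position, velocity) state to Q-table indices."""
    discrete_state = (state - OBS_LOW) / discrete_os_win_size
    return tuple(int(v) for v in discrete_state)  # truncate each axis to its bucket

print(get_discrete_state(np.array([-0.5, 0.01])))  # (7, 11)
```

Indexing `q_table[get_discrete_state(state)]` then returns the 3 Q values for that bucket.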

The best way to track metrics is simply by the reward.

```python
ep_rewards = []
aggr_ep_rewards = {'ep': [], 'avg': [], 'min': [], 'max': []}
```

You might have cases where the average is the best model to keep; it just depends on the task you are trying to achieve.
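To make the bookkeeping concrete, here is a runnable sketch of the aggregation; a random stand-in reward replaces the real training episode, which is the only invented part.

```python
import random

SHOW_EVERY = 500
ep_rewards = []
aggr_ep_rewards = {'ep': [], 'avg': [], 'min': [], 'max': []}

random.seed(0)  # reproducible stand-in data
for episode in range(2000):
    episode_reward = -random.randint(150, 200)  # stand-in for one episode's total reward
    ep_rewards.append(episode_reward)
    if not episode % SHOW_EVERY:
        # Aggregate over the most recent window of episodes
        window = ep_rewards[-SHOW_EVERY:]
        aggr_ep_rewards['ep'].append(episode)
        aggr_ep_rewards['avg'].append(sum(window) / len(window))
        aggr_ep_rewards['min'].append(min(window))
        aggr_ep_rewards['max'].append(max(window))

print(aggr_ep_rewards['ep'])  # [0, 500, 1000, 1500]
```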

When we run the script, collect the average data, and plot it, this is the output for the following settings.

```python
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 2000
SHOW_EVERY = 500

epsilon = 0.5
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

plt.legend(loc=4)
```

Let's play around with epsilon, the number of episodes the algorithm trains for, and the show count, and see what difference this makes.
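For context, epsilon (the chance of taking a random exploratory action) only shrinks between the two decay bounds. The training-episode body is elided here so the schedule itself can run standalone:

```python
EPISODES = 2000
epsilon = 0.5  # probability of taking a random (exploratory) action
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES // 2
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

for episode in range(EPISODES):
    # ... run one training episode, acting randomly with probability epsilon ...
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_decay_value

print(round(epsilon, 4))  # essentially zero once the decay window has passed
```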

```python
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 10000
SHOW_EVERY = 2000

epsilon = 1.5
START_EPSILON_DECAYING = 3
END_EPSILON_DECAYING = EPISODES // 4
epsilon_decay_value = epsilon / (END_EPSILON_DECAYING - START_EPSILON_DECAYING)

plt.legend(loc=6)
```

Let's change DISCRETE_OS_SIZE to 40 and see what difference this makes.

Let's do a final run of 25,000 episodes at the original settings, which I felt performed best.

I also tried the model with the same settings but for only 2,000 episodes, showing every 100.

My final run was to capture the average, min, and max at 25,000 episodes.

Seeing this, it looks like we would want the model from around 17K-20K episodes, since that range had high overall rewards. Also, since we will probably want to reuse trained models, we need to save the Q tables along the way.

```python
if not episode % 10:
    np.save(f"qtables/{episode}-qtable.npy", q_table)
```

We now run a script:

```python
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np

style.use('ggplot')

def get_q_color(value, vals):
    # Highlight the best action for a bucket in green, the rest in faint red
    if value == max(vals):
        return "green", 1.0
    else:
        return "red", 0.3

fig = plt.figure(figsize=(12, 9))

ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)

i = 1990
q_table = np.load(f"qtables/{i}-qtable.npy")

for x, x_vals in enumerate(q_table):
    for y, y_vals in enumerate(x_vals):
        ax1.scatter(x, y, c=get_q_color(y_vals[0], y_vals)[0], marker="o", alpha=get_q_color(y_vals[0], y_vals)[1])
        ax2.scatter(x, y, c=get_q_color(y_vals[1], y_vals)[0], marker="o", alpha=get_q_color(y_vals[1], y_vals)[1])
        ax3.scatter(x, y, c=get_q_color(y_vals[2], y_vals)[0], marker="o", alpha=get_q_color(y_vals[2], y_vals)[1])

ax1.set_ylabel("Action 0")
ax2.set_ylabel("Action 1")
ax3.set_ylabel("Action 2")

plt.show()
```

This will graph our Q table for each action, giving us:

Now we can graph all, or at least many, of the episodes. With this, we can see the Q values changing over time and how the model "learns." So let's iterate over every 10th Q table, create the chart, and save it.

The code is now:

```python
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np

style.use('ggplot')

def get_q_color(value, vals):
    if value == max(vals):
        return "green", 1.0
    else:
        return "red", 0.3

fig = plt.figure(figsize=(12, 9))

for i in range(0, 25000, 10):
    print(i)
    ax1 = fig.add_subplot(311)
    ax2 = fig.add_subplot(312)
    ax3 = fig.add_subplot(313)

    q_table = np.load(f"qtables/{i}-qtable.npy")

    for x, x_vals in enumerate(q_table):
        for y, y_vals in enumerate(x_vals):
            ax1.scatter(x, y, c=get_q_color(y_vals[0], y_vals)[0], marker="o", alpha=get_q_color(y_vals[0], y_vals)[1])
            ax2.scatter(x, y, c=get_q_color(y_vals[1], y_vals)[0], marker="o", alpha=get_q_color(y_vals[1], y_vals)[1])
            ax3.scatter(x, y, c=get_q_color(y_vals[2], y_vals)[0], marker="o", alpha=get_q_color(y_vals[2], y_vals)[1])

    ax1.set_ylabel("Action 0")
    ax2.set_ylabel("Action 1")
    ax3.set_ylabel("Action 2")

    # plt.show()
    plt.savefig(f"qtable_charts/{i}.png")
    plt.clf()
```

This will make all of our images, and now we can make videos from them with:

```python
import cv2
import os

def make_video():
    # windows:
    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    # Linux:
    # fourcc = cv2.VideoWriter_fourcc('M', 'J', 'P', 'G')
    out = cv2.VideoWriter('qlearn.avi', fourcc, 60.0, (1200, 900))

    for i in range(0, 14000, 10):
        img_path = f"qtable_charts/{i}.png"
        print(img_path)
        frame = cv2.imread(img_path)
        out.write(frame)

    out.release()

make_video()
```

The end result is: