
This post will help you write a gaming bot for games with unhelpful reward signals, like MountainCar, using OpenAI Gym and Keras/TensorFlow.
After I built a model for playing the CartPole game, I felt confident and wanted to write code for one more game. MountainCar looked interesting, so I thought, why not write for it.
Once I started writing it, I realized it's not an easy task. The biggest problem is that the environment always gives a negative reward: no matter what random actions I took, I ended up with a total score of -200 and lost the game. I checked different articles and tried different approaches but didn't find a proper answer.
After reading in many places, I realized that instead of relying on the reward given by the game, I could create my own reward based on a specific condition, and this solved my problem. I want to share it with everyone so that nobody else has to go through the pain I went through.
Without wasting much time, let's start coding. If you are trying OpenAI Gym for the first time, please read my previous article here.
First, let’s import the packages we need to implement this
import gym
import random
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
Let’s create the environment and initialize the variables
env = gym.make('MountainCar-v0')
env.reset()

goal_steps = 200          # maximum steps per episode
score_requirement = -198  # minimum (shaped) score for a game to be kept as training data
intial_games = 10000      # number of random games to play while collecting data
Before we start writing the real code, let's first understand what we are getting into.
def play_a_random_game_first():
    for step_index in range(goal_steps):
        env.render()
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        print("Step {}:".format(step_index))
        print("action: {}".format(action))
        print("observation: {}".format(observation))
        print("reward: {}".format(reward))
        print("done: {}".format(done))
        print("info: {}".format(info))
        if done:
            break
    env.reset()

play_a_random_game_first()
You will get output like this if you execute this code
Step 0:
action: 0
observation: [-0.55321127 -0.00078406]
reward: -1.0
done: False
info: {}
Step 1:
action: 1
observation: [-0.55377353 -0.00056225]
reward: -1.0
done: False
info: {}
...
Step 198:
action: 0
observation: [-0.40182971  0.01383677]
reward: -1.0
done: False
info: {}
Step 199:
action: 1
observation: [-0.38888603  0.01294368]
reward: -1.0
done: True
info: {}
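The observation is [position, velocity] and the action is one of three discrete choices. You can confirm this with a quick check of the environment's spaces (these are standard Gym attributes, nothing specific to our bot):

print(env.observation_space)       # 2-D Box: [position, velocity]
print(env.observation_space.low)   # [-1.2  -0.07]
print(env.observation_space.high)  # [ 0.6   0.07]
print(env.action_space)            # Discrete(3): 0 = push left, 1 = no push, 2 = push right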
According to the documentation “-1 for each time step, until the goal position of 0.5 is reached. As with MountainCarContinuous v0, there is no penalty for climbing the left hill, which upon reached acts as a wall.”
The episode ends when you reach the 0.5 (top) position, or when 200 iterations are reached. I played 10000 random games several times, but the car never reached the top position. So at the time of data population, I changed a small piece of logic, and that finally gave me the solution.
The code for data population is below.
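Here is a sketch of that data-population code, matching the step-by-step explanation that follows (I call the function model_data_preparation here; the exact version is in the notebook linked at the end of this post):

def model_data_preparation():
    training_data = []
    accepted_scores = []
    for game_index in range(intial_games):
        score = 0
        game_memory = []
        previous_observation = []
        for step_index in range(goal_steps):
            # Pick one of the 3 actions at random: 0 = push left, 1 = no push, 2 = push right
            action = random.randrange(0, 3)
            observation, reward, done, info = env.step(action)

            # Skip the very first step, since there is no previous observation yet
            if len(previous_observation) > 0:
                game_memory.append([previous_observation, action])
            previous_observation = observation

            # The tweak: ignore the environment's constant -1 and hand out +1
            # whenever the car's position climbs above -0.2
            if observation[0] > -0.2:
                reward = 1

            score += reward
            if done:
                break

        # Keep only games that scored at least -198 with the shaped reward
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # One-hot encode the action, since it is categorical data
                output = [0, 0, 0]
                output[data[1]] = 1
                training_data.append([data[0], output])

        env.reset()

    print(accepted_scores)
    return training_data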
The key part lies in the above code. Let's understand it line by line, and along the way I will explain the tweak that helped me solve this problem.
- We initialize the training_data and accepted_scores lists.
- We need to play many games to collect enough data, so we play 10000 games. The line for game_index in range(intial_games): takes care of that.
- We initialize the score, game_memory and previous_observation variables, where we store the current game's total score, and each previous step's observation (the car's position and velocity) together with the action we took from it.
- for step_index in range(goal_steps): plays each game for up to 200 steps, because the episode ends when you reach the 0.5 (top) position or when 200 iterations are reached.
- We take random actions so we can play the game, which may lead to successfully completing a step or losing the game. Only 3 actions are allowed: push left (0), no push (1) and push right (2), so random.randrange(0, 3) picks one of them at random.
- We take that action/step. Then, if it is not the first step, we store the previous observation and the action we took from it.
- Then we check whether the car's position (observation[0]) is greater than -0.2. If it is, instead of keeping the reward given by the game environment, I set the reward to 1, because a position above -0.2 means the car has climbed well up the right hill toward the goal (it starts near the bottom of the valley, around -0.5), so our random actions are giving somewhat fruitful results.
- We add the reward to the score and check whether the game is over; if it is, we stop playing it.
- We check whether this game fulfils our minimum requirement, i.e. whether we got a score greater than or equal to -198.
- If the score is greater than or equal to -198, we add it to accepted_scores, which we print later to see how many games (and with what scores) we are feeding to our model.
- Then we one-hot encode the action, because its values 0 (push left), 1 (no push) and 2 (push right) represent categorical data.
- Then we add the observation and encoded action to our training_data.
- We reset the environment to make sure everything is clear before starting the next game.
- print(accepted_scores) shows how many games (and with what scores) we are feeding to our model. Finally, we return the training data.
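Calling the function runs the 10000 random games, prints the accepted scores and hands back the training data (using the model_data_preparation name from the sketch above):

training_data = model_data_preparation()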
We will get some reasonable game scores, like below:
[-158.0, -172.0, -188.0, -196.0, -168.0, -182.0, -180.0, -184.0, -184.0, -184.0, -168.0, -184.0, -176.0, -182.0, -182.0, -196.0, -184.0, -194.0, -178.0, -176.0, -170.0, -190.0, -182.0, -184.0, -184.0, -188.0, -184.0, -192.0, -172.0, -186.0, -174.0, -166.0, -188.0, -186.0, -174.0, -190.0, -178.0, -170.0, -164.0, -180.0, -184.0, -172.0, -168.0, -174.0, -172.0, -174.0, -186.0]
So our data is ready. It's time to build our neural network.
def build_model(input_size, output_size):
    model = Sequential()
    model.add(Dense(128, input_dim=input_size, activation='relu'))
    model.add(Dense(52, activation='relu'))
    model.add(Dense(output_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam())
    return model
Here we use a Keras Sequential model: two hidden Dense layers (128 and 52 units with ReLU activation) and a linear output layer, compiled with mean squared error loss and the Adam optimizer.
def train_model(training_data):
    X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]))
    y = np.array([i[1] for i in training_data]).reshape(-1, len(training_data[0][1]))
    model = build_model(input_size=len(X[0]), output_size=len(y[0]))

    model.fit(X, y, epochs=5)
    return model
We have the training data, so from it we create the features (the observations) and the labels (the one-hot encoded actions).
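As a quick, optional sanity check, you can print the shapes before training: each feature is the 2-element observation [position, velocity] and each label is the 3-element one-hot action, so X ends up with shape (N, 2) and y with shape (N, 3):

X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]))
y = np.array([i[1] for i in training_data]).reshape(-1, len(training_data[0][1]))
print(X.shape, y.shape)  # e.g. (9353, 2) (9353, 3)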
Then we will start the training
trained_model = train_model(training_data)
The output will look like this:
Epoch 1/5
9353/9353 [==============================] - 1s 90us/step - loss: 0.2262
Epoch 2/5
9353/9353 [==============================] - 1s 66us/step - loss: 0.2217
Epoch 3/5
9353/9353 [==============================] - 1s 65us/step - loss: 0.2209
Epoch 4/5
9353/9353 [==============================] - 1s 64us/step - loss: 0.2201
Epoch 5/5
9353/9353 [==============================] - 1s 61us/step - loss: 0.2199
It’s time for our gaming bot to play the game for us.
scores = []
choices = []
for each_game in range(100):
    score = 0
    game_memory = []
    prev_obs = []
    for step_index in range(goal_steps):
        env.render()
        if len(prev_obs) == 0:
            action = random.randrange(0, 2)
        else:
            action = np.argmax(trained_model.predict(prev_obs.reshape(-1, len(prev_obs)))[0])

        choices.append(action)
        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score += reward
        if done:
            break

    env.reset()
    scores.append(score)

print(scores)
print('Average Score:', sum(scores)/len(scores))
print('choice 1:{} choice 0:{} choice 2:{}'.format(choices.count(1)/len(choices), choices.count(0)/len(choices), choices.count(2)/len(choices)))
Here you can see I didn't touch the reward part at all. Our model has learned which action to take in each state to reach the top of the hill, so it automatically performs well. After executing this code you will get scores like this:
[-164.0, -92.0, -162.0, -107.0, -105.0, -93.0, -97.0, -90.0, -96.0, -170.0, -99.0, -200.0, -164.0, -91.0, -200.0, -92.0, -195.0, -166.0, -104.0, -93.0, -164.0, -200.0, -200.0, -164.0, -179.0, -176.0, -122.0, -101.0, -91.0, -162.0, -99.0, -164.0, -190.0, -199.0, -101.0, -200.0, -186.0, -185.0, -170.0, -128.0, -164.0, -164.0, -166.0, -101.0, -167.0, -89.0, -105.0, -168.0, -166.0, -100.0, -100.0, -91.0, -90.0, -163.0, -165.0, -167.0, -165.0, -105.0, -88.0, -134.0, -95.0, -90.0, -166.0, -166.0, -89.0, -167.0, -162.0, -165.0, -164.0, -171.0, -163.0, -127.0, -95.0, -159.0, -89.0, -89.0, -96.0, -168.0, -96.0, -163.0, -89.0, -90.0, -183.0, -166.0, -164.0, -163.0, -171.0, -167.0, -163.0, -97.0, -171.0, -166.0, -89.0, -200.0, -162.0, -175.0, -198.0, -93.0, -200.0, -106.0]
Average Score: -141.12
choice 1:0.007936507936507936 choice 0:0.5136054421768708 choice 2:0.47845804988662133
Great job! Your bot did very well.
Congrats!!! You now understand the reward mechanism well, and you also know how to design a solution when your game is not friendly with its rewards.
You will find Jupyter notebook for this implementation here.
Peace. Happy Coding.