LIACS Robotics 2023

Reinforcement Learning Workshop


This workshop reviews the basics of deep reinforcement learning by training agents in several OpenAI gym environments.

For Mac or Linux

Open the terminal and go to the RL-workshop directory (Note: use Python 3.8.10 - 3.8.16):

# first install swig:
sudo apt install swig

# Skip the following steps if you already have python3.8.* available!
# Otherwise, you should install for example python3.8.16
sudo apt update && sudo apt upgrade
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.8
sudo apt install python3.8-distutils
sudo apt install python3.8-dev

# If you do not have the code already, get the workshop code and after unpacking go to the directory RL_Workshop
wget 'https://liacs.leidenuniv.nl/~bakkerem2/robotics/RL_Workshop.zip'
unzip RL_Workshop.zip

# create a virtual environment 'env'
virtualenv env --python=python3.8
source ./env/bin/activate

# Always upgrade pip!
pip install --upgrade pip

# Install the necessary packages.
chmod u+x install.sh
./install.sh

# Do a 'pip list' to check the packages installed in your virtual environment.
# If Box2D is not in the list, you should build and install the wheel for Box2D-2.3.10:
pip install https://github.com/pybox2d/pybox2d/archive/refs/tags/2.3.10.tar.gz

# start the workshop
python src/RLWorkshop.py

For Windows

Run Windows PowerShell as Administrator and execute:

Set-ExecutionPolicy Unrestricted

Select the answer [Y] and cd to the RL-workshop directory

Note: it is assumed that the following programs are installed:
- Python 3.8.10
- Swig 4.0.2: download swigwin-4.0.2 and add the directory containing swig.exe to the PowerShell PATH using

$env:Path += ";PathtoSwigExe"

Then setup the virtual environment and install the necessary packages:

# If you do not have the code already, get the workshop code:
wget 'https://liacs.leidenuniv.nl/~bakkerem2/robotics/RL_Workshop.zip'
unzip RL_Workshop.zip
python -m venv env
env\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install https://github.com/pybox2d/pybox2d/archive/refs/tags/2.3.10.tar.gz
# Install required packages (this can take a while)
.\install.bat
# start the workshop
python src/RLWorkshop.py

Reinforcement learning theory basics

Reinforcement learning is a framework for learning sequences of optimal actions. The main goal is to maximize the cumulative reward that the agent receives over multiple timesteps.


Reinforcement learning can be understood using the concepts of agents, environments, states, actions and rewards, all of which will be explained below. Capital letters tend to denote sets of things, and lower-case letters denote a specific instance of that thing; e.g. A is all possible actions, while a is a specific action contained in the set.

  1. Agent: An agent takes actions in an environment. The RL algorithm itself can also be called the agent.
  2. Action (A): A is the set of all possible moves the agent can make. The set of actions can be either discrete or continuous, e.g. discrete - [turn left, turn right]; continuous - [turn left by 2.0232 degrees, turn right by -0.023 degrees]. Most robotics and real world reinforcement learning formulations are continuous.
  3. Discount factor γ: The discount factor is multiplied by future rewards and acts as a parameter that controls how strongly the agent prioritizes short-term over long-term rewards (a short numeric example follows this list).
  4. Environment: The world through which the agent moves. The environment takes the agent’s current state and action as input, and returns as output the agent’s reward and its next state. If you are the agent, the environment could be the laws of physics (real world) or the rules of the simulation. The agent is also considered as part of the environment.
  5. State (S): A state is a concrete current configuration of the environment that the agent is in. Usually it is represented by a vector of a specific length that includes the relevant descriptors that the agent can use to make decisions.
  6. Reward (R): A reward is the feedback by which we measure the success or failure of an agent’s actions. Rewards can be immediate or delayed. They effectively evaluate the agent’s action and are represented by a single scalar value.
  7. Policy (π): The policy is the strategy that the agent employs to determine the next action based on the current state. It maps states to actions, and its goal is to find the sequence of actions that maximizes the discounted cumulative reward.
  8. Value (V): The expected long-term return with discount, as opposed to the short-term reward R. Vπ(s) is defined as the expected long-term return of the current state under policy π. We discount rewards, or lower their estimated value, the further into the future they occur.
  9. Episode: a sequence of transitions [state_1, action_1, state_2, action_2, ..., state_n, action_n] that continues until the agent exceeds the time limit, achieves the goal, or fails in some critical way.
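
As a small illustration of the discount factor (3) and value (8) definitions, the snippet below computes the discounted return of a short, made-up reward sequence. The reward values and the choice of γ = 0.99 are purely illustrative.

# Illustration only: discounted return G = r_0 + γ*r_1 + γ^2*r_2 + ...
# The reward values below are arbitrary example numbers, not taken from any real environment.
gamma = 0.99
rewards = [-1.0, -1.0, -1.0, 10.0]   # rewards collected during one short episode

discounted_return = 0.0
for t, r in enumerate(rewards):
    discounted_return += (gamma ** t) * r

print(discounted_return)   # ≈ 6.73: the large delayed reward still dominates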

So environments are functions that transform an action taken in the current state into the next state and a reward; agents are functions that transform the new state and reward into the next action. We can know the agent’s function, but we cannot know the function of the environment. It is a black box where we only see the inputs and outputs. Reinforcement learning represents an agent’s attempt to approximate the environment’s function, such that we can send actions into the black-box environment that maximize the rewards it gives out.
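
The sketch below makes this loop concrete for one of the gym environments used in this workshop. It assumes the classic OpenAI gym API (env.step returning four values; newer gym/gymnasium versions return five), and the "agent" here simply samples random actions instead of learning.

import gym

# Minimal agent-environment loop with a random "agent" (classic gym API assumed).
env = gym.make("MountainCarContinuous-v0")

state = env.reset()                                # initial state s_0
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()             # pick a random action a_t
    state, reward, done, info = env.step(action)   # environment returns s_{t+1} and r_t
    total_reward += reward

print("episode return:", total_reward)
env.close()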

Deep reinforcement learning

Most robotics control tasks have continuous state and action spaces; therefore the Markov Decision Processes that define them are essentially infinite. Since there is no way to exhaustively sample this infinite space of state-action transitions, we need some form of approximation function to get reasonable performance, and currently most modern methods use deep neural networks to achieve this.

The reinforcement learning algorithm that you are going to be using today is Proximal Policy Optimization (PPO), which is one of the best-performing RL algorithms to date. It is widely used in various robotics control tasks and has had many successes when applied to complicated environments:

  1. OpenAI Five - Dota 2
  2. Various simulated robot control tasks

This algorithm uses two neural networks:

  1. Actor (policy network) - takes the environment state as input and produces appropriate actions as outputs. (Look at the policy (π) definition 7. above)
  2. Critic (value network) - takes the environment state as input and outputs a single scalar value: the estimated cumulative discounted reward that the agent is going to acquire from this point onwards as it takes further actions with the current policy. This output is then used as part of the loss function for both networks. (Look at the value (V) definition 8. above.)

When trained together, these networks can solve a wide variety of tasks and are well suited to continuous action and state spaces. The policies produced by this algorithm are stochastic: instead of learning a specific action for a given state, the agent learns the parameters of a distribution over actions from which actions are sampled. Therefore the actions that your agent produces will most likely be different each time you retrain the agent, even when using a constant random seed for network weight initialization.
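
To make the two-network structure and the stochastic policy more concrete, here is a minimal sketch of a Gaussian actor and a critic built with tf.keras. It is not the workshop's internal implementation: the layer sizes, the state/action dimensions (chosen to match MountainCarContinuous-v0) and the fixed log standard deviation are all illustrative assumptions.

import numpy as np
import tensorflow as tf

state_dim, action_dim = 2, 1                      # illustrative dimensions

# Actor: maps a state to the mean of a Gaussian distribution over actions.
actor = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(action_dim),            # mean of the action distribution
])
log_std = tf.Variable(np.zeros(action_dim, dtype=np.float32))   # log std of the Gaussian (not trained in this sketch)

# Critic: maps a state to a single scalar value estimate V(s).
critic = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(state_dim,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

state = np.zeros((1, state_dim), dtype=np.float32)              # dummy state
mean = actor(state)
action = mean + tf.exp(log_std) * tf.random.normal(mean.shape)  # sample a stochastic action
value = critic(state)
print(action.numpy(), value.numpy())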

Interface


For your convenience you are provided with an interface that makes it easy to control the internal tensorflow training code and set up the neural networks and the reinforcement learning parameters to solve the problems. To run it:

python3 src/RLWorkshop.py

Interface guidelines:

  1. Create environment - initializes the agent in the environment selected in the drop-down list at the top, with the neural network architecture as configured in the 'Network' table.
  2. Train - the agent runs the environment on an episode basis (until it exceeds the time limit, achieves the goal, or fails in some critical way). During training you are shown only the last frame of each episode, and the neural networks are updated every n episodes, as indicated by the batch_size parameter. On the left you can see plots of the average reward per batch and the loss of the policy network.
  3. Test - runs the current policy of the agent in the environment step-by-step. The bottom-left plot shows the output of the Actor (policy network). You can pause training at any time and use this mode to check what exactly your agent is doing during an episode in between updates.
  4. Reset - destroys the agent and lets you rerun it with a different architecture or create a different environment. (Note, always do a Reset before starting a new task.)
  5. Record - when 'Test' mode is on, you can start recording the policy of your agent by pressing 'Record'. Press the same button again and a gif of the recording will be saved in the current directory.

Apart from the neural network architectures, the other parameters of the environments can be changed at run-time, so you can experiment to achieve better (or worse) performance. Each parameter has a tooltip that explains its use and gives general guidelines on how it should be configured depending on the complexity of the problem.

Tasks

1. Solving the MountainCarContinuous-v0 environment.


This OpenAI gym environment is a great illustration of a simple reinforcement learning problem in which the agent has to take detrimental actions that give negative rewards in the short term in order to get a big reward for completing the task (a primitive form of planning). It has a very simple state and action space: one continuous action in [-1, 1] indicating the force pushing the car to the left or right, and a state consisting of the vector [position, velocity]. As the agent moves towards the goal (flagpole) it receives a positive reward; as it moves away from it, it receives a negative reward. The agent does not have enough torque to simply drive uphill straight away.
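
Before tuning parameters it can help to inspect the state and action spaces yourself. A quick check (classic gym API assumed; the printed values may differ slightly between gym versions) looks like this:

import gym

env = gym.make("MountainCarContinuous-v0")
print(env.observation_space)   # Box(2,): [position, velocity]
print(env.action_space)        # Box(1,): one continuous action in [-1, 1]

state = env.reset()
print(state)                   # e.g. [-0.52, 0.0] - the car starts near the bottom of the valley
env.close()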

Task: Try to find good learning parameters and neural network architectures that will solve the environment (consistently reaching the flagpole) with a reward around 90. Note: Given the right parameters the environment can be solved in 1-2 network updates.

Hints:

  1. The problem is very simple, therefore the neural networks required should be small (a couple of hidden layers with ~10 units each).
  2. Read the tool tips of the parameters to guide you.
  3. If the output of the agent is in the range [-1:1] what is the required activation function for the actor network? (Look up the functions online if you are not sure.)

2. Solving the LunarLanderContinuous-v2 environment.


This OpenAI gym environment features a slightly more complicated agent. The landing pad is always at coordinates (0,0), and the coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to landing on the landing pad at zero speed is about 100-140 points. If the lander moves away from the landing pad it loses part of the reward. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg-ground contact is worth +10 points. Firing the main engine costs -0.3 points per frame. Solving the problem means reaching 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. The action is a vector of two real values in [-1, 1]. The first value controls the main engine: in [-1, 0] the engine is off, in [0, 1] the throttle scales from 50% to 100% power (the engine cannot run below 50% power). The second value fires the left or right engine: [-1, -0.5] fires the left engine, [0.5, 1] fires the right engine, and [-0.5, 0.5] leaves both off.
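
The sketch below (classic gym API assumed) shows how such a two-dimensional action can be constructed by hand, e.g. to sanity-check the engine thresholds described above. The chosen action values are arbitrary examples, not a useful landing strategy.

import gym
import numpy as np

env = gym.make("LunarLanderContinuous-v2")
print(env.observation_space)   # Box(8,): x, y, velocities, angle, angular velocity, leg contacts
print(env.action_space)        # Box(2,): [main engine, left/right engine]

state = env.reset()
for _ in range(3):
    action = np.array([0.8, 0.0])                  # main engine at high throttle, side engines off
    state, reward, done, info = env.step(action)
    print(reward)
env.close()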

Task: Try to propose good learning parameters for which you expect to achieve a good reward on average.

Hints:

  1. When the action and state spaces grow, you need to increase the sizes of hidden layers.
  2. Learning rates should decrease as complexity increases.
  3. Note: Try to first define a strategy for finding the optimal learning parameters for this problem.
  4. Do not spend more than 1 hour trying various proposed parameters and report your best results.

Submission

Write a small PDF report (max 1 page including images) of your findings and submit it to Brightspace. The focus here is on explaining what you did and why, what the overall idea/strategy of your approach is, and how you assess it. Try to give some experimental evidence (a reward screenshot or a gif of the agent) for your conclusions.

Questions

If you have any problems running the environments, spot bugs in the code, or have questions regarding the reinforcement learning workshop in general, don't hesitate to contact us. There are several time-slots available on machines with the workshop installed. Contact: erwin@liacs.nl