Reinforcement learning as message passing

In theory, Reinforcement Learning is a discipline concerned with algorithms for decision-making in opaque environments that maximize cumulative reward. In practice, however (and yours truly is guilty of this as much as everyone), reinforcement learning literature mostly discusses solving Markov Decision Processes. Yet some decision-making settings are fairly hard to model as a Markov Decision Process with discrete timesteps. In this post, I would like to propose an alternative representation of reinforcement learning problems based on message passing and discuss how such problems can be solved (tl;dr: by reducing them to MDPs, but this reduction is a non-trivial problem in its own right).

Partially Observable Markov Decision Process

In a Partially Observable Markov Decision Process, a reinforcement learning episode is a sequence of observations $o_1, o_2, \dots$ from the observation space ($o_n \in O\ \forall n$), rewards $r_1, r_2, \dots$ from the reward range ($r_n \in R\ \forall n$) and actions $a_1, a_2, \dots$ from the action space ($a_n \in A\ \forall n$). Observations and rewards are sampled from the environment, conditioned on the prior trajectory (the history of observations, rewards, and actions):

$$ o_n, r_n \sim p(o_n, r_n \mid o_1, o_2, \dots, o_{n-1}, r_1, r_2, \dots, r_{n-1}, a_1, a_2, \dots, a_{n-1}) $$

The actions are sampled from the agent, also conditioned on the prior trajectory:

$$ a_n \sim \pi(a_n \mid o_1, o_2, \dots, o_{n-1}, r_1, r_2, \dots, r_{n-1}, a_1, a_2, \dots, a_{n-1}) $$

The agent’s distribution $\pi$ is also known as the policy. The environment and the agent can be represented by any conceivable algorithm; they are modeled as probability distributions rather than functions because it is very useful to let these algorithms use randomness.
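For concreteness, here is a minimal sketch of this interaction loop in Python. The `environment_step` and `policy` callables are hypothetical stand-ins for $p$ and $\pi$; both receive the trajectory so far, matching the conditioning above.

```python
def run_episode(environment_step, policy, n_steps):
    """Roll out one episode; both distributions see the same prior trajectory."""
    trajectory = []
    for _ in range(n_steps):
        # o_n, r_n ~ p( . | trajectory so far)
        observation, reward = environment_step(trajectory)
        # a_n ~ pi( . | trajectory so far); note it is not conditioned on o_n, r_n,
        # exactly as in the formulation above
        action = policy(trajectory)
        trajectory.append((observation, reward, action))
    return trajectory
```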

Please note that this is not the traditional way to formulate a POMDP. In the traditional formulation, the distribution $p(o_n, r_n)$ is conditioned on the hidden state $s$ of the environment:

$$ o_n, r_n \sim p(o_n, r_n \mid s_{n-1}) $$

However, the hidden state $s_{n-1}$ is ultimately a function of $o_1, o_2, \dots, o_{n-1}, r_1, r_2, \dots, r_{n-1}, a_1, a_2, \dots, a_{n-1}$, so the two formulations are equivalent.

The main limitations of this model, also known as desiderata for a better model, are:

  • Some decision-making scenarios legitimately have a discrete flow of time: a CPU, for example, makes a decision every $\frac{1}{\text{clock frequency, Hz}}$ seconds. However, most real-life decision-making settings unfold in continuous time, with no limits on how often or how rarely actions and observations occur.
  • The agent cannot respond to an observation with fewer than one (i.e. zero) or more than one action. This can be fixed by modeling $a_n$ as a set of actions.

Both of these problems have been successfully addressed with artificial time discretization and complex action spaces $A$; however, this is done on a case-by-case basis for each particular environment. A general model that addresses these challenges would be very useful.

Message Passing Decision Process

(working title; alternatives include the simpler but somewhat already taken Continuous Decision Process, the software-jargony Event Loop Decision Process and the insufferably vain Liventsev Decision Process)

Let us attach a timestamp $t$ to every observation, reward and action in the decision process, making them 2-tuples. An action $\langle a, t \rangle$ is a message from the agent to the environment; an observation $\langle o, t \rangle$ or a reward $\langle r, t \rangle$ is a message from the environment to the agent. Messages can be sent as often or as rarely as needed:

[Figure: MPDP message flow between agent and environment]

Observations are sampled from the environment, conditioned on the timestamp and all actions of the agent before that time:

$$ o_t \sim p(o_t \mid t, \{ \langle a, t_a \rangle \mid t_a < t \}) $$

The actions are sampled from the agent, also conditioned on the timestamp and all observations before that time:

$$ a_t \sim \pi(a_t \mid t, \{ \langle o, t_o \rangle \mid t_o < t \}) $$

where $o_t \in O^{+}$ and $a_t \in A^{+}$

$$ O^{+} = \{\text{no observation}\} \cup O $$

$$ A^{+} = \{\text{no action}\} \cup A $$
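As an aside, these timestamped messages map naturally onto plain data structures. Here is a minimal sketch in Python, with class and field names of my own choosing (nothing here is an established API):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Observation:
    value: object       # element of O, or None for "no observation"
    timestamp: float

@dataclass(frozen=True)
class Reward:
    value: float        # element of the reward range R
    timestamp: float

@dataclass(frozen=True)
class Action:
    value: object       # element of A, or None for "no action"
    timestamp: float

# A message from the environment to the agent is an Observation or a Reward;
# a message from the agent to the environment is an Action.
EnvironmentMessage = Union[Observation, Reward]
AgentMessage = Action
```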

Discretization

Now that we’re done with the theory, it is a good time to remember that we live in the real world. In the real world, unless you plan to run reinforcement learning algorithms on a liquid computer (in which case, please let us know how it goes!), evaluation of $\pi(a_t)$ will be done by a regular computer that can only evaluate it a finite number of times per second, or by something even slower than that. Hence, discretization is still desirable. However, we need a discretization strategy that makes the resulting discrete decision process equivalent (or at least as equivalent as possible) to the continuous MPDP.

Decision schedules

So we’ve established that the practicalities of implementing a reinforcement learning algorithm mean that in any timeframe $\langle t, t + \Delta t \rangle$ only a finite number of decisions can be taken. This can be achieved by a decision schedule that is some combination of:

  • making a decision every time any message from the environment (observation or reward) is received
  • making a decision at regular intervals $\Delta t$
  • after making a decision, scheduling the next decision $t_d$ later in the future

At every decision point, the agent receives information about the observations that have recently occurred and outputs some number of actions (possibly zero). A sketch of such a schedule is given below.
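Here is a minimal sketch of a schedule that combines these ingredients; the function and its parameters are illustrative, not a fixed API.

```python
import heapq

def decision_times(message_times, t_end, interval=None, delay_after_decision=None):
    """Yield decision points combining the three schedule ingredients:
    react to every incoming message, wake up at regular intervals,
    and optionally schedule a follow-up decision after each one."""
    pending = list(message_times)  # decide whenever an environment message arrives
    if interval is not None:
        pending += [k * interval for k in range(1, int(t_end / interval) + 1)]
    heapq.heapify(pending)
    while pending:
        t = heapq.heappop(pending)
        if t > t_end:
            break
        yield t
        if delay_after_decision is not None and t + delay_after_decision <= t_end:
            heapq.heappush(pending, t + delay_after_decision)
```

For example, `list(decision_times([0.3, 0.9], t_end=1.0, interval=0.5))` evaluates to `[0.3, 0.5, 0.9, 1.0]`.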

Observation space

When an observation is modeled as $\langle o, t \rangle$, the most faithful way to represent the history of observations to the agent is

$$ \overrightarrow{o} = \langle o_1, t_1, o_2, t_2, \dots \rangle $$

This representation has two issues:

  • It has a variable size. There are, of course, machine learning algorithms that can work with variable-size inputs; however, most traditional RL approaches cannot, and compatibility with them would be an advantage.
  • There is a common type of observation that makes older observations obsolete. For example, in a thermostat a new temperature measurement for all intents and purposes overrides the old one. In a navigation task, a new “location observed” event means you are no longer in the previous location. This has to be taken into account, lest a vast array of outdated information be fed to the agent at every decision point.

A solution (probably the solution?) to both issues is to sort observations into observation classes $O_1 \cup O_2 \cup \dots \cup O_n = O$ such that, if several observations from the same class have been made, only the last of them is important. Then $\overrightarrow{o}$ should be a vector of the latest observation in each class:

$$ \overrightarrow{o} = \langle o_1 \in O_1, \exp(t_1 - t), \dots, o_n \in O_n, \exp(t_n - t) \rangle $$

where $t$ is the decision time. $\exp(t_n - t)$ is preferable to the more naive $t - t_n$ because, if no observation in the observation class $O_n$ has occurred yet, $t_n = -\infty$, which creates all kinds of problems for actually solving the MDP downstream. $\exp(t_n - t)$ in this case is simply zero, and observations that have occurred carry an exponentially decaying relevance factor attached to them - a more directly useful value for decision-making than “time since event”.
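A minimal sketch of building this fixed-size vector, assuming a `latest` dictionary that keeps, per observation class, the most recent value and its timestamp (all names here are illustrative):

```python
import math

def observation_vector(latest, classes, t):
    """Build the fixed-size vector <o_1, exp(t_1 - t), ..., o_n, exp(t_n - t)>.

    `latest` maps an observation class to the (value, timestamp) of its most
    recent observation; classes never observed so far are simply absent.
    """
    vec = []
    for cls in classes:
        if cls in latest:
            value, t_cls = latest[cls]
            vec.extend([value, math.exp(t_cls - t)])  # relevance decays toward 0
        else:
            # exp(-inf) = 0: a zero relevance factor marks "never observed";
            # the 0.0 placeholder for the value itself is an arbitrary choice.
            vec.extend([0.0, 0.0])
    return vec
```

Whenever a message $\langle o, t_o \rangle$ arrives, `latest` is updated in place, overwriting the previous entry for that observation class.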

Action space

Using one of the decision schedules and the observation system described above, it is fairly trivial to support a variable number of actions: the agent outputs an action $a \in A^{+}$ and, as long as that action is not a “no action”, sampling is repeated.
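A minimal sketch of that loop, under the assumption (mine, not the formalism’s) that the policy signals “no action” by returning `None`:

```python
def sample_actions(policy, observation_vector, max_actions=100):
    """Repeatedly sample from the policy until it chooses "no action" (None)."""
    actions = []
    while len(actions) < max_actions:            # safety cap on runaway policies
        a = policy(observation_vector, actions)  # may condition on actions so far
        if a is None:                            # "no action" ends the decision point
            break
        actions.append(a)
    return actions
```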

Conclusion

This post has introduced the Message Passing Decision Process (MPDP), a formulation of reinforcement learning that covers a much broader class of real-world decision-making tasks than the traditional Markov Decision Process, and proposed several methods for reducing an MPDP to an MDP in order to use the existing selection of reinforcement learning algorithms. The next step in developing this idea would probably be to implement it in software, to be able to do something like

openai_gym_env = MPDPtoMDPWrapper(MDPEnv, decision_schedule=, observation_space=, action_space=)

and test the approach experimentally.

Confessions

This work is taxpayer-funded.