# Reinforcement learning as message passing

In theory, Reinforcement Learning is a discipline concerned with algorithms for decision-making in opaque environments that maximize cumulative reward. In practice, however, (and yours truly is guilty of this as much as everyone) reinforcement learning literature mostly discusses solving Markov Decision Processes. However, some decision making settings are fairly hard to model as a Markdov Decision Process with discrete timesteps. In this post, I would like to propose an alternative representation of reinforcement learning problems using message passing and discuss how they can be solved (tl;dr by reducing them to MDPs, but this reduction is a non-trivial problem in its own right).

## Partially Observable Markov Decision Process

In a Partially Observable Markov Decision Process a reinforcement learning episode is a sequence of observations $$o_1,o_2,\dots$$ from the observation space $$o_n \in O \forall n$$, rewards $$r_1,r_2,\dots$$ from the reward range $$r_n \in R \forall n$$ and actions $$a_1,a_2,\dots$$ from the action space $$a_n \in A \forall n$$. Observations and actions are sampled from the environment and conditioned on the prior trajectory (a history of observations, actions, and rewards):

$o_n, r_n \sim p(o_n,r_n|o_1,o_2,\dots,o_{n-1},r_1,r_2,\dots,r_{n-1},a_1,a_2,\dots,a_{n-1})$

The actions are sampled from the agent, also conditioned on prior trajectory:

$a_n \sim \pi(a_n|o_1,o_2,\dots,o_{n-1},r_1,r_2,\dots,r_{n-1},a_1,a_2,\dots,a_{n-1})$

The agent’s distribution $$\pi$$ is also known as the policy. The environment and the agent can be represented with any conceiavble algorithm and are modeled as probability distributions rather than functions since it’s very useful to let these algorithms use randomness.

Please note that this is not a traditional way to formulate POMDP - in a traditional formulation distribution $$p(o_n,r_n)$$ is conditioned on the hidden state $$s$$ of the environment:

$o_n, r_n \sim p(o_n,r_n|s_{n-1})$

However, the hidden state $$s_{n-1}$$ is ultimately a function of $$o_1,o_2,\dots,o_{n-1},r_1,r_2,\dots,r_{n-1},a_1,a_2,\dots,a_{n-1}$$, so the formulations are equivalent.

The main limitations of this model, also known as desiderata for a better model, are:

• Some decision making scenarios legitimately have a discrete flow of time: a CPU, for example, makes a decision every $$\frac{1}{\text{clock frequency, Hz}}$$ of a second. However, most real-life decision making settings happen in a continuous time flow with no limits on how often or how rare actions and observations occur.
• The agent cannot respond to an observation with less than one (i.e. zero) or more than one action. This can be fixed by modeling $$a_n$$ as a set of actions.

Both of these problems have been successfully addresed with artificial time discretization and complex action spaces $$A$$, however this is done on a case by case basis for each particular environment. A general model that addresses these challenges would be very useful.

## Message Passing Decision Process

(working title, alternatives include the simpler but somewhat already taken Continuous Decision Process, the software-jargony Event Loop Decision Process and the insufferably vane Liventsev Decision Process)

Let us attach a timestamp $$t$$ to every observation, reward and action in the decision process, making them 2-tuples. An action $$\langle a, t \rangle$$ is a message from the agent to the environment, an observation $$\langle o, t \rangle$$ or a reward $$\langle r, t \rangle$$ is a message from the environment to the agent. Messages can be sent as often or as rare as needed: Observations and actions are sampled from the environment and conditioned on the timestamp and all actions of the agent before that time:

$o_t \sim p(o_t|t, \{ a, t_a | t_a < t \})$

The actions are sampled from the agent, also conditioned on the timestamp and all observations before that time:

$a_t \sim \pi(a_t|t, \{ o, t_o | t_o < t \})$

where $$o_t \in O^{+}$$ and $$a_t \in A^{+}$$

$O^{+} = \{\text{no observation} \} \cup O$

$A^{+} = \{\text{no action} \} \cup A$

## Discretization

Now that we’re done with the theory, it is a good time to remember that we live in the real world and, in the real world, unless you plan to run reinforcement learning algorithms on a liquid computer (in which case, please let us know how it goes!), evaluation of $$\pi(a_t)$$ will be done either by a regular computer that can only evaluate it a finite number of times a second or something even slower than that. Hence, discretization is still desirable. However, we need a discretization strategy that will make the resulting discrete decision process equivalent (or at least as equivalent as possible) to the continuous MPDP.

### Decision schedules

So we’ve established that the practicalities of implementing a reinforcement learning algorithm mean that in any timeframe $$\langle t, t + \Delta t \rangle$$ a finite number of decisions should be taken. This can be achieved by a decision schedule that is some combination of:

• making a decision every time any message from the environment (observation or reward) is received
• making a decision at regular intervals $$\Delta t$$
• after making a decision, scheduling a decision time $$t_d$$ later in the future

At every decision point the agent has to receive information about the observations that recently occured and output some number of actions (may be zero)

### Observation space

When an observation is modeled as $$\langle o, t \rangle$$, the most faithful way to represent the history of observations to the agent is

$\overrightarrow{o} = \langle o_1, t_1, o_2, t_2, \dots \rangle$

This representation has 2 issues:

• It has a variable size. There are, of course, machine learning algorithms that can work with variable size inputs, however, most traditional RL approaches cannot and compatibility with them would be an advantage
• There is a common type of observation that makes older observations obsolete. For example, in a thermostat a new temperature measurement for all intents and purposes overrides the old one. In a navigation task, a new “location observed” event means you are no longer in the previous location. This has to be taken into account, lest a vast array of outdated information will be fed to the agent at every decision point.

A solution (probably the solution?) to these is to sort observations into observation classes $$O_1 \cup O_2 \cup \dots \cup O_n = O$$ such that if several observations from the same class has been made, only the last of them is important. Then $$\overrightarrow{o}$$ should be a vector of the latest observation in each class

$\overrightarrow{o} = \langle o_1 \in O_1, \exp(t_1-t), \dots, o_n \in O_n, \exp(t_n-t), \rangle$

where $$t$$ is the decision time. $$\exp(t_n-t)$$ is preferable to the more naive approach of $$t-t_n$$, because if no observation in the observation class $$O_n$$ has occured yet, $$t=-\infty$$ which creates all kinds of problems for actually solving the MDP downstream. $$\exp(t_n-t)$$ in this case would be simply zero and observations that have occured will have an exponentially decaying relevance factor attached to them - a more directly useful value for decision-making then “time since event”.

### Action space

Using one of the decision schedules and the observation system described above, it is fairly trivial to support a variable number of actions. The agent has to output an action $$a \in A^{+}$$ and if that action is not a “no action”, action sampling repeats again.

## Conclusion

This post has introduced Message Passing Decision Process for reinforcement learning that represents a much larger scope of real world decision making tasks than the traditional approach, Markov Decision Process, and proposed several methods to reduce MPDP to MDP to use the existing selection of reinforcement learning algorithms. The next step in developing this idea would probably be to implement it in software, to be able to do something like

openai_gym_env = MPDPtoMDPWrapper(MDPEnv, decision_schedule=, observation_space=, action_space=)


and test the approach experimentally

## Confessions

This work is taxpayer funded