Tishby, N. and Polani, D.
in: Perception-Action Cycle, Springer New York, pp. 601–636, 2011
URL, Google Scholar
The perception–action cycle is often defined as “the circular flow of information between an organism and its environment in the course of a sensory guided sequence of actions towards a goal” (Fuster, Neuron 30:319–333, 2001; International Journal of Psychophysiology 60(2):125–132, 2006). The question we address in this chapter is in what sense this “flow of information” can be described by Shannon’s measures of information introduced in his mathematical theory of communication. We provide an affirmative answer to this question using an intriguing analogy between Shannon’s classical model of communication and the perception–action cycle. In particular, decision and action sequences turn out to be directly analogous to codes in communication, and their complexity – the minimal number of (binary) decisions required for reaching a goal – directly bounded by information measures, as in communication. This analogy allows us to extend the standard reinforcement learning framework. The latter considers the future expected reward in the course of a behaviour sequence towards a goal (value-to-go). Here, we additionally incorporate a measure of information associated with this sequence: the cumulated information processing cost or bandwidth required to specify the future decision and action sequence (information-to-go). Using a graphical model, we derive a recursive Bellman optimality equation for information measures, in analogy to reinforcement learning; from this, we obtain new algorithms for calculating the optimal trade-off between the value-to-go and the required information-to-go, unifying the ideas behind the Bellman and the Blahut–Arimoto iterations. This trade-off between value-to-go and information-to-go provides a complete analogy with the compression–distortion trade-off in source coding. The present new formulation connects seemingly unrelated optimization problems. The algorithm is demonstrated on grid world examples.
Peter Dayan pointed me to this paper (which is actually a book chapter) when I told him that I find the continuous interaction between perception and action important and that Friston’s free energy framework is one of the few which covers this case. Now, this paper covers only discrete time (and states and actions), but certainly it addresses the issue that perception and action influence each other.
The main idea of the paper is to take the informational effort (they call it information-to-go) into account when finding a policy for a Markov decision process. A central finding is a recursive equation analogous to the (Bellman) equation for the Q-function in reinforcement learning which captures the expected (over all possible future state-action trajectories) informational effort of a certain state-action pair. Informational effort is defined as the KL-divergence between a factorising prior distribution over future states and actions (making them independent across time) and their true distribution. This means that the informational effort is the expected number of bits of information that you have to consider in addition to your prior when moving through the future. They then propose a free energy (also a recursive equation) which combines the informational effort with the Q-function of the underlying MDP and thus allows simultaneous optimisation of informational effort and reward where the two are traded off against each other.
Practically, this leads to “soft vs. sharp policies”: sharp policies which always choose the action with highest expected reward and soft policies which choose actions probabilistically with an associated penalty on reward compared to sharp policies. The softness of the resulting policy is controlled by the tradeoff parameter between informational effort and reward which can be interpreted as the informational capacity of the system under consideration. I understand it this way: the tradeoff parameter stands for the informational complexity/capacity of the distributions representing the internal model of the world in the agent and the optimal policy with a particular setting of tradeoff parameter is the optimal policy with respect to reward alone that a corresponding agent can achieve. This is easily seen when considering that informational effort depends on the prior for future state-action trajectories. For a given prior, tradeoff parameter and resulting policy you can find the corresponding more complex prior for which the same policy can be found for 0 informational effort. The prior here obviously corresponds to the internal model of the agent. Consequently, the authors present a general framework with which you can ask questions such as: “How much informational capacity does my agent need to solve a given task with a desired level of performance?” Or, in other words: “How complex does my agent need to be in order to solve the given task?” Or: “How well can my agent solve the given task?” Although this latter question is the standard question in RL. In particular, my intuition tells me that for every setting of the tradeoff parameter there probably is an equivalent POMDP formulation (which makes the corresponding difference between world and agent model explicit).
A particularly interesting discussion is that about “perfectly adapted environments” which seems to be directed towards Friston without mentioning him, though. The discussion results from the ability to optimise their free energy combined from informational effort and reward not only with respect to the policy, but also with respect to the (true) transition probabilities. The outcome of such an optimisation is an environment in which transition probabilities are directly related to rewards, or, in other words, an environment in which informational effort is equal to something like negative reward. In such an environment “minimizing the statistical surprise or maximizing the predictive information is equivalent to maximizing reward” which is what Friston argues (see also the associated discussion on hunch.net). Needless to say that they consider this as a very special case while in most other cases the environment contains information that is irrelevant in terms of reward. Nevertheless, they consider the possibility that the environments of living organisms are indeed perfectly or at least well adapted through millions of years of coevolution and they suggest to direct future research towards this issue. The question really is what is reward in this general sense? What is it that living organisms try to achieve? The more concrete reward is, for example, reward for a particular task, the less relevant most information in the environment will be. I’m tempted to say that the combined optimisation of informational effort and rewards, as presented here, will then lead to policies which particularly seak out relevant information, but I’m not sure whether this is a correct interpretation.
To sum up Tishby and Polani present a new theoretical framework which generalises reinforcement learning by incorporating ideas from information theory. They provide an interesting new perspective which is presented in a pleasingly accessible way. I do not think that they solved any particular problem in reinforcement learning, but they broadened the view by postulating that agents tradeoff informational effort (capacity?) and reward. Practically, computations derived from their framework may not be feasible in most cases, because original reinforcement learning is already hard and here a few expectations have been added. Or, maybe it’s not so bad, because you can do them together.