Action understanding and active inference.

Friston, K., Mattout, J., and Kilner, J.
Biol Cybern, 104:137–160, 2011
DOI, Google Scholar

Abstract

We have suggested that the mirror-neuron system might be usefully understood as implementing Bayes-optimal perception of actions emitted by oneself or others. To substantiate this claim, we present neuronal simulations that show the same representations can prescribe motor behavior and encode motor intentions during action-observation. These simulations are based on the free-energy formulation of active inference, which is formally related to predictive coding. In this scheme, (generalised) states of the world are represented as trajectories. When these states include motor trajectories they implicitly entail intentions (future motor states). Optimizing the representation of these intentions enables predictive coding in a prospective sense. Crucially, the same generative models used to make predictions can be deployed to predict the actions of self or others by simply changing the bias or precision (i.e. attention) afforded to proprioceptive signals. We illustrate these points using simulations of handwriting to illustrate neuronally plausible generation and recognition of itinerant (wandering) motor trajectories. We then use the same simulations to produce synthetic electrophysiological responses to violations of intentional expectations. Our results affirm that a Bayes-optimal approach provides a principled framework, which accommodates current thinking about the mirror-neuron system. Furthermore, it endorses the general formulation of action as active inference.

Review

In this paper the authors try to convince the reader that the function of the mirror neuron system may be to provide amodal expectations of how an agent's body will change or interact with the world. In other words, they propose that the mirror neuron system represents more or less abstract intentions of an agent. This interpretation results from identifying the mirror neuron system with hidden states in a dynamic model within Friston's active inference framework. I will first comment on the active inference framework and the particular model used, and then discuss the biological interpretation.

Active inference framework:

Active inference has been described by Friston elsewhere (Friston et al., PLoS One, 2009; Friston et al., Biol Cybern, 2010). Note that all variables are continuous. The main idea is that an agent maximises the evidence for its internal model of the world, as experienced by its sensors, by (1) updating the hidden states of this model and (2) producing actions on the world. Under the Gaussian assumptions made by Friston, both routes to maximising model evidence reduce to minimising the precision-weighted prediction errors defined by the model. In principle the models are hierarchical, but here only a single layer is used, consisting of sensory states and hidden states. The prediction errors on sensory states are simply defined as the difference between sensory observations and the sensory predictions of the model, just as one would intuitively define them. The model also defines prediction errors on hidden states (*). Both types of prediction errors are used to infer hidden states (1) that explain sensory observations, but action is produced (2) from sensory prediction errors alone, because action is not part of the agent's model and only affects the sensory observations produced by the world.

Well, actually the agent needs a whole extra model for action, one which implements the gradient of sensory observations with respect to action, i.e., which tells the agent how its sensory observations change when it acts. However, Friston restricts sensory observations in this context to proprioceptive observations, i.e., muscle feedback, and argues that the corresponding gradient may be sufficiently simple to learn and represent that we need not worry about it (in the simulation he just provides the gradient to the agent). Therefore, action solely tries to fulfil proprioceptive predictions. On the other hand, proprioceptive predictions may be coupled to predictions in other modalities (e.g. vision) through the agent's model, which allows the agent to execute (seemingly) higher-level actions. For example, if an agent sees its hand move from a cup to a glass on a table in front of it, its generative model must also represent the corresponding proprioceptive signals. If the agent then predicts this movement of its hand in visual space, the generative model automatically predicts the corresponding proprioceptive signals, because they have always accompanied the seen movement. Action then minimises the resulting precision-weighted proprioceptive prediction error and so implements the hand movement from cup to glass.
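To make the two updates concrete, here is a minimal one-state sketch of this scheme; it is my own toy under linear-Gaussian assumptions, not the paper's model, and all names and parameter values are mine:

```python
import numpy as np

# Minimal one-state active inference toy (my illustration, not the paper's
# model). The agent predicts proprioceptive and visual input from a single
# hidden state x and expects x to drift towards a goal (its "intention").

goal = 1.0
dt = 0.01
x, a = 0.0, 0.0                    # inferred hidden state; action
pi_p, pi_v, pi_x = 1.0, 1.0, 1.0   # proprioceptive, visual, dynamics precisions

p = 0.0                            # true arm position in the world
for _ in range(2000):
    p += dt * (a - p)              # toy physics: action drives the arm
    s_p, s_v = p, p                # proprioception and vision both report p

    eps_p = s_p - x                # proprioceptive prediction error (g_p(x) = x)
    eps_v = s_v - x                # visual prediction error         (g_v(x) = x)
    eps_x = 0.0 - 0.5 * (goal - x) # dynamics error: x' (taken as 0) minus f(x)

    # (1) perception: gradient descent on the precision-weighted squared errors
    x += dt * (pi_p * eps_p + pi_v * eps_v - 0.5 * pi_x * eps_x)
    # (2) action: driven by proprioceptive errors alone, via ds_p/da (here ~1)
    a -= dt * pi_p * eps_p

print(round(x, 2), round(p, 2))    # both end up near the goal of 1.0
```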

Notice that the agent minimises the *precision-weighted* prediction errors. Precision here means the inverse *prior* covariance, i.e., it is a measure of how certain the agent *expects* to be about its observations. By changing the precisions, qualitatively very different results can be obtained within the active inference framework. Indeed, here they implement the switch from action generation to action observation by heavily reducing the precision of the proprioceptive observations. This makes the agent ignore proprioceptive prediction errors both when updating hidden states (1) and when generating action (2). This leads to an interesting prediction: when you observe an action by somebody else, you shouldn't notice when the corresponding body part is moved externally; or alternatively, when you observe somebody else's movement, you shouldn't be able to move the corresponding body part yourself (in a way different from the observed one). In this strict form the prediction appears very unlikely to hold, but a softer version, namely that you should see interference effects in these situations, may well find empirical support.
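In the toy sketch above, this switch is a one-line change (again my illustration, not the paper's implementation):

```python
# Action observation in the same toy loop: deflate proprioceptive precision.
pi_p = 1e-4
# The action update a -= dt * pi_p * eps_p is now effectively gated off, while
# x is still inferred from the (high-precision) visual channel, i.e. the agent
# recognises the movement without reproducing it.
```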

These considerations also point to the general problem of finding suitable precisions: how do you strike a balance between action (2) and perception (1)? Because both are trying to reduce the same prediction errors, the agent has to trade off recognising the world as it is (1) against changing it so that it corresponds to its expectations (2). This dichotomy is not easily resolved. When asked about it, Friston usually points to empirical priors, i.e., that the agent has learnt to choose suitable precisions based on its past experience (not very helpful if you want to know how they are chosen). I guess it's really a question of how strongly the agent expects (wants) a certain outcome. A useful practical consideration is also that action is constrained, e.g., an agent can't move infinitely fast, which means that enough prediction error should be left over for perceiving changes in the world (1), in particular those that are not within reach of the agent's actions on the expected time scale.

I do not discuss the most common reservation against Friston's free-energy principle / active inference framework (that people also seem to have an intrinsic curiosity towards new things), because it has been covered elsewhere (John Langford's blog, Nature Neuroscience).

Handwriting model:

The particular model used in this paper is interpreted as a model of handwriting, although neither a hand nor actual writing is modelled. Rather, a two-joint system (arm) is used where the movement of the end-effector position (tip) is designed to be qualitatively similar to handwriting without actually producing common letters. The dynamic model of the agent consists of two parts: (a) a stable heteroclinic channel (SHC) which produces a periodic sequence of 6 continuously changing states, and (b) a linear attractor dynamics in the joint angle space of the arm which is attracted to a rest position, but modulated by the distance of the tip to a desired point in Cartesian space determined by the SHC state. Thus, the agent expects the tip of its arm to move along a sequence of 6 desired points, where the dynamics of the arm movement is determined by the linear attractor (see the sketch below). The agent observes the joint angle positions and velocities (proprioceptive) and the Cartesian positions of the elbow joint and tip (visual). The dynamic model of the world (implementing, so to speak, the underlying physics) lacks the SHC dynamics and only defines the linear attractor in joint space, which is modulated by action and by some (unspecified) external variables that can be used to perturb the system. Interestingly, the arm is more strongly attracted to its rest position in the world model than in the agent model. The reason for this is not clear to me, but it might not be important, because action could correct for it.
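For intuition, here is a rough sketch of the two ingredients as I read them; the Lotka-Volterra form of the SHC, the parameter values, and the Jacobian-transpose coupling of tip to joints are my own guesses, not the paper's equations:

```python
import numpy as np

K = 6                                  # number of SHC states / target points
targets = np.random.randn(K, 2)        # desired tip positions (placeholders)

# (a) stable heteroclinic channel: Lotka-Volterra winnerless competition with
# an asymmetric inhibition matrix that visits the 6 states in sequence
rho = np.full((K, K), 1.5)             # strong inhibition between most states
np.fill_diagonal(rho, 1.0)             # self-inhibition
for i in range(K):
    rho[i, (i - 1) % K] = 0.5          # weak inhibition from the predecessor
x = np.full(K, 0.1) + 0.01 * np.random.rand(K)

def shc_step(x, dt=0.01):
    # small noise keeps the itinerant sequence going
    return np.clip(x + dt * x * (1.0 - rho @ x)
                   + np.sqrt(dt) * 0.001 * np.random.randn(K), 1e-6, None)

# (b) linear attractor in joint space, modulated by the distance of the tip
# to the target point selected by the currently dominant SHC state
def tip(theta, l1=1.0, l2=1.0):        # forward kinematics of the 2-link arm
    return np.array([l1 * np.cos(theta[0]) + l2 * np.cos(theta[0] + theta[1]),
                     l1 * np.sin(theta[0]) + l2 * np.sin(theta[0] + theta[1])])

def jacobian(theta, l1=1.0, l2=1.0):   # d tip / d theta
    s1, s12 = np.sin(theta[0]), np.sin(theta[0] + theta[1])
    c1, c12 = np.cos(theta[0]), np.cos(theta[0] + theta[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [ l1 * c1 + l2 * c12,  l2 * c12]])

def arm_step(theta, x, dt=0.01, k_rest=0.1, k_tgt=1.0, rest=np.zeros(2)):
    target = targets[np.argmax(x)]     # point prescribed by the SHC winner
    pull = jacobian(theta).T @ (target - tip(theta))  # map tip error to joints
    return theta + dt * (k_rest * (rest - theta) + k_tgt * pull)
```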

Biological interpretation:

The system is set up such that the agent's model contains additional hidden states compared to the world, which may be interpreted as intentions of the agent, because they determine the order of the points the tip moves to. In simulations the authors show that the described models within the active inference framework indeed lead to actions of the agent which implement a “writing” movement, even though the world model did not know anything about “writing” at all. This effect has already been shown in the previously mentioned publications.

What is new here is that they show that the same model can be used to observe an action without simultaneously generating it. As mentioned before, they achieve this simply by reducing the precision of the proprioceptive observations. They then replay the previously recorded actions of the agent in the world by providing them via the external variables. This produces an equivalent movement of the arm in the world without any action being exerted by the agent. Instead of generating its own movement, the agent then has the task of recognising a movement executed by somebody/something else. This works because the precision of the visual observations was kept high, such that the hidden SHC states can be inferred correctly (1). The authors mention a delay before the SHC states catch up with the equivalent trajectory under action. This should not be over-interpreted, because, contrary to what is stated in the text, the initial conditions for the two simulations were not the same (see figures and code). The important argument the authors try to make here is that the same set of variables (SHC states) is equally active during action as well as action observation and, therefore, provides a potential functional explanation for activity in the mirror neuron system.

Furthermore, the authors argue that the SHC states represent the intentions of the agent, or, equivalently, the intentions of the observed agent, by noting that the desired tip positions specified by the SHC states are only (approximately) reached at a later point in time in the world. This probably results from the inertia built into the joint angle dynamics. There are probably dynamic models for which this effect disappears, but it sounds plausible to me that when one dynamical system d1 influences the parameters of another dynamical system d2 (as here), d2 first needs to catch up with the new parameter setting. So these delays would be expected for most hierarchical dynamical systems.

Another line of argument of the authors is to relate prediction errors in the model to electrophysiological (EEG) findings. This is based on Friston's previous suggestion that superficial pyramidal cells are likely candidates for implementing prediction error units. At the same time, the activity of these cells is thought to dominate EEG signals. I cannot judge the validity of either hypothesis, although the former seems to have less experimental support than the latter. In any case, I find the corresponding arguments in this paper quite weak. The problem is that results from exactly one run, with one particular setting of parameters of one particular model, are used to make very general statements based on a mere qualitative fit of parts of the data to general experimental findings. In other words, I'm not confident that similar (desired) patterns would be seen in the prediction errors if other settings of the precisions, or other parameters of the dynamical systems, were chosen.

Conclusion:

The authors suggest how the mirror neuron system can be understood within Friston's active inference framework. These conceptual considerations make sense. In general, the active inference framework provides large explanatory power and many phenomena may be understood in its context. However, in my view, it is an entirely open question how the functional considerations of the active inference framework may be implemented in the neurobiological substrate. The superficial arguments based on prediction errors generated by the model, as presented in the paper, are not convincing. More evidence needs to be found that robustly links variables in an active inference model to neuroscientific measurements.

Conceptually, too, it is not clear whether the active inference solution correctly describes the computations of the brain. On the one hand, it potentially explains many important and otherwise disparate phenomena under a common principle (e.g. perception, action, learning, computing with noise, dynamics, internal models, prediction; this paper adds action understanding). On the other hand, we don't know whether all brain functions actually follow a common principle, and whether functionally equivalent solutions for subsets of these phenomena may be better descriptions of the underlying computations.

An important issue for future studies aiming to discern these possibilities is that active inference is a general framework which needs to be instantiated with a particular model before its properties can be compared to experimental data. However, little is known about the hierarchical, dynamic, functional models themselves that must serve as generative models for active inference. As in this paper, it is then hard to disentangle the properties of the chosen model from the properties imposed by the active inference framework. Therefore, great care has to be taken in the interpretation of corresponding results, but it would be exciting to learn which properties of the active inference framework are crucial in brain function and which would need to be added, adapted, or dropped in a faithful description of (subsets of) brain function.

(*) Hidden state prediction errors result from Friston's special treatment of dynamical systems: states are extended by their temporal derivatives to obtain generalised states which represent a local trajectory of the states through time. Intuitively, the hidden state prediction errors can thus be seen as the difference between the velocity of the (previously inferred) hidden states, as represented by the trajectory in generalised coordinates, and the velocity predicted by the dynamic model.
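In symbols (my notation, following Friston's usual conventions rather than any particular equation in this paper): with generalised states and the block-derivative operator that shifts each order up by one, the hidden state prediction error reads

```latex
\tilde{x} = (x, x', x'', \dots), \qquad
\mathcal{D}\tilde{x} = (x', x'', x''', \dots), \qquad
\varepsilon_x = \mathcal{D}\tilde{x} - f(\tilde{x}),
```

i.e. the represented velocity of the trajectory minus the velocity that the dynamic model f predicts for it.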

BM: An iterative algorithm to learn stable non-linear dynamical systems with Gaussian mixture models.

Khansari-Zadeh, S. M. and Billard, A.
in: Proc. IEEE Int Robotics and Automation (ICRA) Conf, pp. 2381–2388, 2010
DOI, Google Scholar

Abstract

We model the dynamics of non-linear point-to-point robot motions as a time-independent system described by an autonomous dynamical system (DS). We propose an iterative algorithm to estimate the form of the DS through a mixture of Gaussian distributions. We prove that the resulting model is asymptotically stable at the target. We validate the accuracy of the model on a library of 2D human motions and to learn a control policy through human demonstrations for two multi-degrees of freedom robots. We show the real-time adaptation to perturbations of the learned model when controlling the two kinematically-driven robots.

Review

The authors describe a system for learning nonlinear, multivariate dynamical systems based on Gaussian mixture regression (GMR). The difference from previous approaches using GMR (e.g. Gribovskaya2010) is that the GMR is done by pruning a Gaussian mixture model which initially has a Gaussian at each time point, such that accuracy and stability criteria are adhered to. Pruning here actually means that two neighbouring Gaussians are merged. Consequently, the main contribution of the paper is the derivation and proof of the corresponding stability criteria – something that I haven't checked properly.
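For reference, merging two weighted Gaussians is usually done by moment matching; a minimal sketch below, assuming this is essentially what the binary merging step computes (the paper's accuracy and stability checks that decide *whether* to merge are not reproduced here):

```python
import numpy as np

def merge_gaussians(w1, mu1, S1, w2, mu2, S2):
    """Collapse two weighted Gaussian components into one by moment matching."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w              # weighted mean
    d1, d2 = mu1 - mu, mu2 - mu
    # weighted covariances plus the spread between the two component means
    S = (w1 * (S1 + np.outer(d1, d1)) + w2 * (S2 + np.outer(d2, d2))) / w
    return w, mu, S
```

The merged component preserves the first two moments of the pair it replaces, which is why accuracy can degrade gracefully as components are removed.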

They make a quantitative comparison between their binary merging approach, the original EM learning of GMR, learning the dynamics with LWPR, and DMPs. However, they do not report the precise procedures. I am particularly surprised by the very low accuracy of the DMPs compared to the other approaches. Unless they have done something special (such as introducing large temporal deviations, as done for Fig. 2), I don't see why the accuracy of DMPs should be so low.

They argue that the main advantages of their approach are that a minimal number of Gaussians is determined automatically while the resulting dynamics is stable at all times, that the multivariate Gaussians can capture correlations between dimensions (in contrast to DMPs), and that the computations are less costly than with Gaussian process regression. The disadvantages are that the number of parameters increases quadratically with the dimensionality (curse of dimensionality; not so crucial for their 2, 4 or 6D examples, but then?) and, in particular, that the pruning procedure is highly susceptible to local minima and its results depend on the order in which Gaussians are merged. In the extreme case, imagine that through the presence of noise none of the initial Gaussians can be merged without violating the accuracy constraint. Again, this might not be a problem for their very smooth data, but it will become problematic for more noisy data. Similar problems lead to the dependency on the order of merges (which are selected randomly).

To overcome the order dependency they suggest restarting the algorithm several times and selecting the result with the smallest number of Gaussians. Note that this compromises their computational advantage over GPs: while computing a GP mapping is cubic in the number of data points, a single pass of merging is quadratic in the number of time points T, but there are on the order of 2^T possible merge sequences, meaning that the computational costs can increase exponentially in the worst case if the best solution is really to be found (if you optimise the hyperparameters in GPs you're in a similar situation, though in a continuous space).

Encoding of Motor Skill in the Corticomuscular System of Musicians.

Gentner, R., Gorges, S., Weise, D., aufm Kampe, K., Buttmann, M., and Classen, J.
Current Biology, 20:1869–1874, 2010
DOI, Google Scholar

Abstract

How motor skills are stored in the nervous system represents a fundamental question in neuroscience. Although musical motor skills are associated with a variety of adaptations [1], [2] and [3], it remains unclear how these changes are linked to the known superior motor performance of expert musicians. Here we establish a direct and specific relationship between the functional organization of the corticomuscular system and skilled musical performance. Principal component analysis was used to identify joint correlation patterns in finger movements evoked by transcranial magnetic stimulation over the primary motor cortex while subjects were at rest. Linear combinations of a selected subset of these patterns were used to reconstruct active instrumental playing or grasping movements. Reconstruction quality of instrumental playing was superior in skilled musicians compared to musically untrained subjects, displayed taxonomic specificity for the trained movement repertoire, and correlated with the cumulated long-term training exposure, but not with the recent past training history. In violinists, the reconstruction quality of grasping movements correlated negatively with the long-term training history of violin playing. Our results indicate that experience-dependent motor skills are specifically encoded in the functional organization of the primary motor cortex and its efferent system and are consistent with a model of skill coding by a modular neuronal architecture [4].

Review

The authors use PCA on TMS-induced postures to show that the motor cortex represents building blocks of movements which adapt to everyday requirements. To be precise, the authors recorded finger movements induced by TMS over primary motor cortex and extracted, for each stimulation, the posture with the largest deviation from rest. From the resulting set of postures they computed the first 4 principal components (PCs) and examined how well a linear combination of the PCs could reconstruct postures recorded during normal behaviour of the subjects. This is made more interesting by comparing groups of subjects with different motor experience. They use highly trained violinists and pianists and a group of non-musicians, and then compare the different combinations of whose postures are used to determine the PCs and which movements are to be reconstructed (violin playing, piano playing, or grasping, where the grasping can be that of violinists or of non-musicians). The basis of comparison is a correlation (R) between the series of joint angle vectors as defined in Shadmehr1994, which can be interpreted as something like the average correlation between data points of the two sequences measured across joint angles (cf. the normalised inner product matrix in GPLVM). Don't ask me why they take exactly this measure, but probably it doesn't matter. The main finding is that the PCs from violinists are significantly better at reconstructing violin playing than either the piano PCs or the non-musician PCs. This table is missing in the text (but the data is there, showing mean R ± its standard deviation):

R        violinists   pianists     non-musicians
violin   0.69±0.09    0.63±0.11    0.64±0.09
piano    0.70±0.06    0.74±0.06    0.70±0.07
grasp    0.76±0.09    0.76±0.09    0.76±0.10

What is not discussed in the paper is that the pianists' PCs are worse at reconstructing violin playing than the PCs of non-musicians. An interesting finding is that the years of intensive training of violinists correlate significantly with the reconstruction quality of violinist PCs for violin playing, while they are anticorrelated with the reconstruction quality for grasping, indicating that the postures activated in primary motor cortex become more adapted to frequently executed tasks. However, it has to be noted that this correlation analysis is based on only 9 data points.
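To make the pipeline concrete, here is a rough sketch of the analysis as I read it; the code and my reading of the similarity measure are mine, not the authors':

```python
import numpy as np

def pcs(tms_postures, k=4):
    """PCA on TMS-evoked postures; rows are peak-deviation joint-angle vectors."""
    X = tms_postures - tms_postures.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k]                                  # (k, n_joints)

def reconstruct(trajectory, W):
    """Least-squares reconstruction of a (n_timepoints, n_joints) trajectory
    from the span of the PCs in W."""
    coeffs, *_ = np.linalg.lstsq(W.T, trajectory.T, rcond=None)
    return (W.T @ coeffs).T

def similarity(A, B):
    """One plausible reading of the Shadmehr1994-style correlation R: the
    inner product of the centred data matrices, normalised by their norms."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))
```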

At the beginning of the paper they show an analysis of the recorded behaviour which is simply meant to ensure that violin playing, piano playing, and grasping movements are sufficiently different, which we may believe, although piano playing and grasping apparently are somewhat similar.