Flow Matching Policy Gradients

Simple Online Reinforcement Learning with Flow Matching

David McAllister* Songwei Ge* Brent Yi* Chung Min Kim Ethan Weber Hongsuk Choi Haiwen Feng Angjoo Kanazawa
Paper arXiv Code

Flow models have become the go-to approach for modeling distributions in continuous space. They soak up data with a simple, scalable denoising objective and now represent the state of the art in generating images, videos, audio and, more recently, robot actions. However, they’re still not widely used for learning from rewards with reinforcement learning.

To perform RL in continuous spaces, practitioners typically train far simpler Gaussian policies, which represent a single, ellipsoidal mode of the action distribution. Flow-based policies can capture complex, multimodal action distributions, but they are primarily trained in a supervised manner with behavior cloning (BC). We show that it’s possible to train RL policies using flow matching, the framework behind modern diffusion and flow models, to benefit from its expressivity.

We introduce Flow Policy Optimization (FPO), a new algorithm to train RL policies with flow matching. It can train expressive flow policies from rewards alone. We find it particularly useful for learning under-conditioned policies, like humanoid locomotion from simple joystick commands.

We approached this project as researchers primarily familiar with diffusion models. While working on VideoMimic, we felt limited by the expressiveness of Gaussian policies and thought diffusion could help. In this blog post, we’ll explain how we connect flow matching and on-policy RL in a way that makes sense without an extensive RL background.

Flow Matching

Flow matching optimizes a model to transform a simple distribution (e.g., the Gaussian distribution) into a complex one through a multi-step mapping called the marginal flow. We expand on the marginal flow in more detail in another blog post for Decentralized Diffusion Models.

The flow smoothly directs a particle $x_t$ to the data distribution, so integrating a particle’s position across time according to the flow yields a sample from the data distribution. Equivalently, sampling is the process of solving an ordinary differential equation (the flow), which we can do deterministically or with stochastic “churn” every step.
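To make the sampling procedure concrete, here is a minimal Euler-integration sketch. It assumes a hypothetical `velocity_fn(x, t)` that returns the (learned or analytic) velocity at time t, and treats stochastic churn as simple per-step noise injection rather than an exact SDE discretization:

```python
import numpy as np

def euler_sample(velocity_fn, dim, n_steps=64, churn=0.0, rng=None):
    """Integrate a particle from noise (t = 0) to a sample (t = 1)."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(dim)            # start from the Gaussian base distribution
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + velocity_fn(x, t) * dt      # deterministic Euler step along the flow
        if churn > 0.0:
            # illustrative stochastic "churn": jitter the particle a little each step
            x = x + churn * np.sqrt(dt) * rng.standard_normal(dim)
    return x
```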

We can actually calculate the marginal flow analytically, which we do in real-time in the plot below. We added interactive control over the data distribution and sampling stochasticity, so try messing with it!

Each particle above represents a noisy latent $x_t$ that gets iteratively denoised as time is integrated from zero to one. Drag the control points of the modes on the right to see how the underlying PDF and the particle trajectories change. Notice how the probability mass flows smoothly from the initial noise to form two distinct modes. The multi-step mapping is the magic that lets flow models transform a simple, tractable distribution into one of arbitrary complexity.

While it’s possible to interactively compute this flow in 1D, it becomes intractable over large datasets in high dimensional space. Instead, we use flow matching, which compresses the marginal flow into a neural network through a simple reconstruction objective.

Flow matching perturbs a clean data sample with Gaussian noise then tasks the model with reconstructing the sample by predicting the velocity, which is the derivative of $x_t$’s position w.r.t. time. In expectation over a fixed dataset, this optimization recovers the marginal flow for any $x_t$. Integrating $x_t$’s position across time according to a well-trained model’s velocity prediction will recover a sample from the data distribution.
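A minimal sketch of that objective in PyTorch, assuming a velocity-prediction network `model(x_t, t)` and the linear path from noise at $t=0$ to data at $t=1$:

```python
import torch

def flow_matching_loss(model, x1):
    """Conditional flow matching: noise a clean batch x1, regress the velocity."""
    noise = torch.randn_like(x1)             # sample from the Gaussian base
    t = torch.rand(x1.shape[0], 1)           # one timestep per sample, in [0, 1)
    x_t = (1.0 - t) * noise + t * x1         # perturbed sample on the linear path
    target = x1 - noise                      # conditional flow u_t(x_t | x1) for this path
    pred = model(x_t, t)                     # model's velocity prediction v_t(x_t)
    return ((pred - target) ** 2).mean()
```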

Flow matching the velocity prediction $v_t(x_t)$ to the conditional flow $u_t(x_t|x)$.

Geometrically, the marginal flow points to a weighted average of the data, where the weights are a function of the timestep and the distance from $x_t$ to each data point. You can see the particles follow the marginal flow exactly in the plot above when stochasticity is turned off. At a high level, flow matching learns to point the model’s flow field, $v_t(x_t)$, toward the data distribution.
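For a small 1D dataset, that weighted average can be written down directly. A sketch under the linear-path convention above (a hypothetical helper, valid for $t < 1$):

```python
import numpy as np

def marginal_velocity(x_t, t, data, weights=None):
    """Analytic marginal flow for a discrete 1D dataset (t < 1).

    Each data point x contributes its conditional velocity (x - x_t) / (1 - t),
    weighted by q(x) * N(x_t; t * x, (1 - t)^2): nearby, high-probability data
    points pull harder, and the pull sharpens as t approaches 1.
    """
    data = np.asarray(data, dtype=float)
    q = np.ones_like(data) if weights is None else np.asarray(weights, dtype=float)
    sigma = 1.0 - t
    log_w = np.log(q) - 0.5 * ((x_t - t * data) / sigma) ** 2
    w = np.exp(log_w - log_w.max())
    w = w / w.sum()                          # posterior over which data point x_t flows toward
    return float(np.sum(w * (data - x_t) / sigma))
```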

Flow matching also has a statistical interpretation. Instead of computing exact flow likelihoods (expensive and unstable), it optimizes a lower bound on them called the Evidence Lower Bound (ELBO). Increasing the ELBO pushes the model toward higher likelihoods without computing them directly. In the limit, the flow model samples exactly from the probability distribution of the dataset. So if you’ve learned the flow function well, you’ve learned the underlying structure of the data.

TLDR: Flowing toward a data point increases its likelihood under the model.

On-Policy RL: Sample, Score, Reinforce

On-policy reinforcement learning follows a simple core loop: sample from your policy, score each action with rewards, then make high-reward actions more likely. Rinse and repeat.

This procedure climbs the policy gradient—the gradient of expected cumulative reward. Your model collects “experience” by sampling its learned distribution, sees which samples are most advantageous, and adjusts to perform similar actions more often.
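Written out for a single state $s$, this is the standard score-function policy gradient, where $\pi_\theta(a \mid s)$ is the policy and $A(s, a)$ is the advantage discussed below:

$$\nabla_\theta J(\theta) \;=\; \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\, A(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\right]$$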

On-policy RL can be cast as search iteratively distilled into a model. The policy “happens upon” good behaviors through exploration, then reinforces them. Over time, it discovers the patterns in the random successes and develops reliable strategies. You can start from a pretrained model and continue training with RL to explore within a pruned prior distribution rather than at random. This is the dominant approach to upcycle LLMs for preference alignment and mathematical reasoning.

Illustrative Example

We use the toy cartpole task from DMControl for clear illustration. The goal is to move a cart along a rail to balance an attached pole vertically. Here’s how this manifests as an RL loop:

  1. Sample an action from your model’s state-conditional distribution, then simulate a step of physics. Repeat this back-and-forth over a time horizon to collect rollouts.
  2. Score each sequence with rewards for each timestep (“how vertical is the pole?”).
  3. Train your model to boost the likelihood of actions that lead to high-reward sequences.

Sample and score rollouts:

On-policy RL samples multiple rollouts of actions then scores them according to the reward. In this case, only one (leftmost) rollout successfully balances the pole across the whole time horizon.

Calculate each advantage and estimate the policy gradient:

From the rewards, we estimate advantages. These can be viewed as the reward accumulated over time (the return), normalized with respect to the expected return. This expectation is what the critic learns in PPO, or is computed as the average of a group’s rewards in GRPO.

Advantages are lower-variance estimates of action "goodness" than raw rewards. There is a design space for estimating advantages, but one way to think of them is as normalized rewards.
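One simple option from that design space, sketched below, is a group-relative estimate in the spirit of GRPO: treat the mean return of a batch of rollouts as the expected return and normalize against it. The function name and rescaling choice here are illustrative, not FPO’s exact recipe:

```python
import numpy as np

def group_advantages(returns):
    """Group-relative advantages: center returns on the group mean (the baseline)."""
    returns = np.asarray(returns, dtype=float)
    adv = returns - returns.mean()           # positive = better than expected
    return adv / (returns.std() + 1e-8)      # optional rescaling for stable step sizes
```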

Given the advantages, we train the model on each data point with a gradient update scaled by its corresponding advantage. So if the advantage is negative, the action becomes less likely; positive advantage, more likely.

Typically, the policy gradient is computed in discrete space or using Gaussian likelihoods. Flow Policy Optimization extends the policy gradient to flow models, which introduces some important details we discuss in the following sections.
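For reference, one gradient step of that standard Gaussian-likelihood policy gradient might look like the sketch below (assuming `policy(states)` returns a `torch.distributions.Normal` and `advantages` is precomputed and detached). FPO keeps this structure but replaces the Gaussian log-likelihood with a flow matching surrogate:

```python
import torch

def policy_gradient_step(policy, optimizer, states, actions, advantages):
    """Advantage-weighted log-likelihood update for a Gaussian policy."""
    dist = policy(states)                          # Normal over actions, conditioned on states
    log_prob = dist.log_prob(actions).sum(-1)      # log-likelihood of each sampled action
    loss = -(advantages * log_prob).mean()         # scale each sample's loss by its advantage
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```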

Flow Matching Policy Gradients

To reiterate, the goal of on-policy RL is simple: increase the likelihood of high-reward actions. Meanwhile, flow matching naturally increases likelihoods by redirecting probability flow toward training samples. This makes our objective clear—redirect the flow toward high reward actions.

In the limit of perfect optimization, flow matching assigns probabilities according to the frequency of samples in your training set. Since we’re using RL, that “training set” is dynamically generated from the model each epoch.

Advantages make the connection between synthetic data generation and on-policy RL explicit. In RL, we calculate the advantage of each sampled action, a quantity that indicates how much better it was than expected. These advantages are centered around zero to reduce variance: positive for better-than-expected actions, negative for worse. Advantages then become a loss weighting in the policy gradient. As a simple example, if an action is very advantageous, the model encounters a scaled-up loss on it and learns to boost it aggressively.

The policy gradient resembles a standard log-likelihood supervised learning gradient on synthetic samples with the loss scaled by the reward or advantage (both are valid).

Zero-mean advantages are fine for RL in discrete spaces because a negative advantage simply pushes down the logit of a suboptimal action, and the softmax ensures that the resulting action probabilities remain valid and non-negative. Flow matching, however, learns a probability flow that samples from a training data distribution, and probabilities are nonnegative by construction, so negative loss weights break this clean interpretation.

There’s a simple solution: make the advantages nonnegative. Shifting advantages by a constant doesn’t change the policy gradient, since the expected score $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)]$ is zero; in fact, this shift-invariance is the mathematical property that lets us use advantages in the first place. Here’s how we can understand non-negative advantages in the flow matching framework:

The marginal flow is a linear combination of the (conditional) flows to each data point. The weighting of each path scales with the probability of drawing the data point from the dataset, $q(x)$.

Advantages manifest as loss weighting, which has an intuitive expression in the marginal flow framework. The marginal flow is the weighted average of the paths (the $u_t$’s) from the current noisy particle, $x_t$, to each data point $x$. The paths are also weighted by $q(x)$, the probability of drawing $x$ from your training set. This is typically a constant $\frac{1}{N}$ for a dataset of size $N$, assuming every data point is unique. Loss weights are equivalent to altering the frequency of the data points in your training set: if the loss for a data point is scaled by a factor of 2, it’s equivalent to that data point showing up twice in the training set.
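The hypothetical `marginal_velocity` helper sketched earlier makes this equivalence easy to check numerically: doubling a data point’s weight gives exactly the same flow as listing it twice.

```python
# Weighting a data point by 2 is the same as it appearing twice in the dataset.
v_weighted   = marginal_velocity(x_t=0.3, t=0.5, data=[-1.0, 1.0], weights=[1.0, 2.0])
v_duplicated = marginal_velocity(x_t=0.3, t=0.5, data=[-1.0, 1.0, 1.0])
assert abs(v_weighted - v_duplicated) < 1e-9
```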

Flow Policy Optimization

Now, we can get a complete picture of our algorithm that connects flow matching and reinforcement learning: Flow Policy Optimization. FPO follows a three-step loop:

1. Generate actions from your flow model using your choice of sampler

2. Score them with rewards and compute advantages

3. Flow match (add noise and reconstruct) on the actions with an advantage-weighted loss

This procedure boosts the likelihood of actions that achieve high reward while preserving the desirable properties of flow models—multimodality, expressivity and the improved exploration that stems from them. Since FPO uses flow matching as its fundamental primitive, FPO-trained policies inherit the body of techniques developed for flow and diffusion models. These include guidance for conditioning and Mean Flows for efficient sampling.
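Ignoring the trust-region machinery discussed below, one FPO epoch might look roughly like this sketch. The policy interface (`sample`, `cfm_loss`) and the environment helper `env_rollout` are hypothetical placeholders, and the advantage shift follows the nonnegativity argument from the previous section:

```python
import torch

def fpo_epoch(flow_policy, optimizer, env_rollout, num_rollouts=64):
    """Sample rollouts, score them, then flow match with advantage-weighted losses."""
    # 1. Sample: roll out the flow policy to collect (states, actions, return) tuples.
    trajs = [env_rollout(flow_policy) for _ in range(num_rollouts)]

    # 2. Score: compute zero-centered advantages, then shift them to be nonnegative.
    returns = torch.tensor([ret for _, _, ret in trajs])
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    adv = adv - adv.min()

    # 3. Reinforce: advantage-weighted conditional flow matching on the sampled actions.
    losses = [a * flow_policy.cfm_loss(states, actions)
              for (states, actions, _), a in zip(trajs, adv)]
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```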

We visualize the three-step inner loop in the following interactive plot (best viewed on desktop). The red curve on the right determines the reward for different actions along the y-axis. It’s controllable: drag the control points around to shape the reward function! The plot shows how FPO optimizes a flow-based policy to maximize the specified reward. It proceeds in three stages that line up with the labels above the plot:

First, sample actions from the flow-based policy. At the first iteration, this will be whatever the model is initialized to (or two arbitrary modes in the plot below).

Second, for each sampled data point, multiply its influence by the reward. We do a k-means approximation of the resulting distribution for illustration and display it in the blue trace between the heatmap and red reward trace.

Third, redirect the flow according to this advantage-weighted distribution. In a real model, this happens by optimizing the FPO ratio, just as standard PPO optimizes its likelihood ratio.

This represents one epoch of Flow Policy Optimization. The flow has been updated to sample higher-reward actions, and we can repeat to continue climbing the policy gradient. The plot does this automatically, and you can reset it with the amber reload button.

This is a fairly realistic analytical simulation of the FPO loop. It’s missing one major component, though: the trust region constraint, which helps the optimization remain on-policy after multiple gradient steps per epoch. We encourage you to check out the paper to see how we implement this mechanism and for a more mathematical treatment of the algorithm.
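As a rough idea of what that looks like, the paper forms an FPO ratio by exponentiating the difference of conditional flow matching losses under the old and new parameters (evaluated with the same noise and timestep draws), and then applies PPO-style clipping to it. A hedged sketch, with estimator details deferred to the paper:

```python
import torch

def fpo_clipped_loss(new_cfm_loss, old_cfm_loss, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate with the FPO ratio standing in for pi_new / pi_old.

    `new_cfm_loss` and `old_cfm_loss` are per-sample conditional flow matching
    losses (old ones detached); lower loss acts as a proxy for higher likelihood.
    """
    ratio = torch.exp(old_cfm_loss - new_cfm_loss)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```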

FPO In Action

We include a few video examples of FPO working on a range of control tasks. These demonstrate FPO’s advantage over Gaussian policies for under-conditioned humanoid control. With only root-level commands, FPO successfully trains walking policies from scratch, while standard Gaussian policies fail to discover viable behaviors:

We compare Gaussian policies (orange) with FPO-trained policies (blue) when trained with sparse conditioning (gray).

Policies trained with FPO for DeepMimic-style motion tracking are robust to rough terrain. We show a couple of examples:

Trained with terrain randomization, FPO walks stably across unseen procedurally generated rough ground.

It’s not an RL paper without half cheetah! In the main paper, we compare quantitatively against Gaussian policies and denoising MDPs across DeepMind Control tasks.

We show rollouts from our policy trained for the DeepMind Control task, CheetahRun, using FPO.

Acknowledgements

We thank Qiyang (Colin) Li, Oleg Rybkin, Lily Goli and Michael Psenka for helpful discussions and feedback on the manuscript. We thank Arthur Allshire, Tero Karras, Miika Aittala, Kevin Zakka and Seohong Park for insightful input and feedback on implementation details and the broader context of this work.

Code for the live plots on this blog is available here.