Dreaming in Latent Space: Deriving the Sequence-Level ELBO for World Models
Motivation: Why Latent World Models?
Imagine an RL agent that no longer needs to interact with the real environment at every timestep. Instead, it imagines futures inside its own learned model, a fast, compact, and differentiable simulator. This is the promise of latent world models.
But how do we train such models?
What’s the right objective to learn latent dynamics, reconstruct observations, and learn a policy, all at once?
This blog walks through the full derivation of the sequence-level Evidence Lower Bound (ELBO) used in world model training, inspired by methods like Dreamer and PlaNet.
Problem Setup
We deal with trajectories of observations, actions, and latent states:
- $x_{1:T}$: observed frames, states, or features
- $a_{1:T}$: actions taken by the agent
- $z_{1:T}$: latent states, which our model learns
The goal is to model the joint distribution of observations and actions:

$$ p_\theta(x_{1:T}, a_{1:T}) = \int p_\theta(x_{1:T}, z_{1:T}, a_{1:T}) \, dz_{1:T} $$
Since this integral is intractable (due to the latent variables), we use variational inference to approximate it.
Latent Generative Model
We define the following generative process:

$$ p_\theta(x_{1:T}, z_{1:T}, a_{1:T}) = p_\theta(z_1) \prod_{t=2}^{T} p_\theta(z_t \mid z_{t-1}, a_{t-1}) \prod_{t=1}^{T} p_\theta(x_t \mid z_t)\, \pi_\theta(a_t \mid z_t) $$
Each term corresponds to:
- $p_\theta(z_1)$: latent prior
- $p_\theta(z_t \mid z_{t-1}, a_{t-1})$: latent dynamics (transition model)
- $p_\theta(x_t \mid z_t)$: decoder (generative observation model)
- $\pi_\theta(a_t \mid z_t)$: policy in latent space (used for imagined rollouts)
This is a latent controlled Markov process.
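To make these factors concrete, here is a minimal PyTorch sketch of one possible parameterization in which every distribution is a diagonal Gaussian. All module names and sizes (`GenerativeModel`, `latent_dim`, `hidden`, etc.) are illustrative assumptions, not the architecture used by Dreamer or PlaNet.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GenerativeModel(nn.Module):
    """One possible parameterization of p_theta: prior, dynamics, decoder, policy."""

    def __init__(self, latent_dim=32, action_dim=4, obs_dim=64, hidden=200):
        super().__init__()
        # p(z_1): learned diagonal-Gaussian prior over the initial latent
        self.prior_mean = nn.Parameter(torch.zeros(latent_dim))
        self.prior_logstd = nn.Parameter(torch.zeros(latent_dim))
        # p(z_t | z_{t-1}, a_{t-1}): latent transition model
        self.dynamics = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        # p(x_t | z_t): observation decoder (mean of a Gaussian over x_t)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ELU(),
            nn.Linear(hidden, obs_dim),
        )
        # pi(a_t | z_t): latent-space policy (mean and log-std of a Gaussian over a_t)
        self.policy = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * action_dim),
        )

    def transition(self, z_prev, a_prev):
        # Returns the transition prior p(z_t | z_{t-1}, a_{t-1}) as a Normal distribution.
        mean, logstd = self.dynamics(torch.cat([z_prev, a_prev], -1)).chunk(2, -1)
        return Normal(mean, logstd.exp())
```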
Why Not Maximize the Log-Likelihood Directly?
Because the marginal log-likelihood

$$ \log p_\theta(x_{1:T}, a_{1:T}) = \log \int p_\theta(x_{1:T}, z_{1:T}, a_{1:T}) \, dz_{1:T} $$

is intractable due to the high-dimensional integration over $z_{1:T}$.
Variational Posterior (Approximate Inference)
We introduce a learned variational distribution:

$$ q_\phi(z_{1:T} \mid x_{1:T}, a_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t \mid x_{\le t}, a_{<t}) $$

This acts like a Bayesian filter: each $z_t$ is inferred from the observations and actions seen so far.
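One simple way to realize this filtering posterior is a recurrent encoder that summarizes the history $(x_{\le t}, a_{<t})$ and outputs a diagonal Gaussian over $z_t$. The sketch below is an assumed design (the GRU cell, sizes, and names are hypothetical), not the exact encoder of Dreamer or PlaNet:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class FilteringPosterior(nn.Module):
    """q_phi(z_t | x_{<=t}, a_{<t}) realized as a recurrent (filtering) encoder."""

    def __init__(self, obs_dim=64, action_dim=4, latent_dim=32, hidden=200):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + action_dim, hidden)
        self.head = nn.Linear(hidden, 2 * latent_dim)

    def step(self, x_t, a_prev, h_prev):
        # Fold the newest (x_t, a_{t-1}) into the running summary of the history.
        h_t = self.rnn(torch.cat([x_t, a_prev], -1), h_prev)
        mean, logstd = self.head(h_t).chunk(2, -1)
        # Diagonal-Gaussian posterior over z_t, plus the updated hidden state.
        return Normal(mean, logstd.exp()), h_t
```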
ELBO Derivation Step-by-Step
We insert the variational distribution $q_\phi$ into the marginal likelihood and apply Jensen's inequality:

$$ \log p_\theta(x_{1:T}, a_{1:T}) = \log \mathbb{E}_{q_\phi}\!\left[ \frac{p_\theta(x_{1:T}, z_{1:T}, a_{1:T})}{q_\phi(z_{1:T} \mid x_{1:T}, a_{1:T})} \right] \;\ge\; \mathbb{E}_{q_\phi}\!\left[ \log \frac{p_\theta(x_{1:T}, z_{1:T}, a_{1:T})}{q_\phi(z_{1:T} \mid x_{1:T}, a_{1:T})} \right] $$

Define this lower bound as $\mathcal{L}(\theta, \phi)$:

$$ \mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi}\big[ \log p_\theta(x_{1:T}, z_{1:T}, a_{1:T}) - \log q_\phi(z_{1:T} \mid x_{1:T}, a_{1:T}) \big] $$
Expand the Joint Log Terms
Now expand the joint and the posterior into their per-timestep factors:

$$ \mathcal{L} = \mathbb{E}_{q_\phi}\!\left[ \sum_{t=1}^{T} \log p_\theta(x_t \mid z_t) + \log \pi_\theta(a_t \mid z_t) + \log p_\theta(z_t \mid z_{t-1}, a_{t-1}) - \log q_\phi(z_t \mid x_{\le t}, a_{<t}) \right] $$

(with the convention $p_\theta(z_1 \mid z_0, a_0) = p_\theta(z_1)$).

Let's rearrange the transition and posterior terms into KL divergences. For each timestep,

$$ \mathbb{E}_{q_\phi}\!\big[ \log p_\theta(z_t \mid z_{t-1}, a_{t-1}) - \log q_\phi(z_t \mid x_{\le t}, a_{<t}) \big] = -\,\mathbb{E}_{q_\phi}\!\Big[ \mathrm{KL}\big( q_\phi(z_t \mid x_{\le t}, a_{<t}) \,\|\, p_\theta(z_t \mid z_{t-1}, a_{t-1}) \big) \Big] $$
Final Form of the Sequence-Level ELBO
$$ \mathcal{L}(\theta, \phi) = \sum_{t=1}^{T} \mathbb{E}_{q_\phi}\Big[ \log p_\theta(x_t \mid z_t) + \log \pi_\theta(a_t \mid z_t) \Big] - \sum_{t=1}^{T} \mathbb{E}_{q_\phi}\Big[ \mathrm{KL}\big( q_\phi(z_t \mid x_{\le t}, a_{<t}) \,\|\, p_\theta(z_t \mid z_{t-1}, a_{t-1}) \big) \Big] $$

This is our training objective: we maximize $\mathcal{L}$ jointly over the model parameters $\theta$ and the posterior parameters $\phi$.
Interpretation of Each Term
Term | Meaning
---|---
$\log p_\theta(x_t \mid z_t)$ | Latent observation decoder. Ensures the latent state can predict reality.
$\log \pi_\theta(a_t \mid z_t)$ | Optimizes the policy in latent space.
$\mathrm{KL}\big( q_\phi(z_t \mid x_{\le t}, a_{<t}) \parallel p_\theta(z_t \mid z_{t-1}, a_{t-1}) \big)$ | Forces inferred latents to match the transitions of the world model.
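When both the posterior and the transition prior are diagonal Gaussians (a common modeling choice in this family of methods), the per-timestep KL term has a closed form:

$$ \mathrm{KL}\big( \mathcal{N}(\mu_q, \sigma_q^2) \,\|\, \mathcal{N}(\mu_p, \sigma_p^2) \big) = \sum_{i} \left[ \log \frac{\sigma_{p,i}}{\sigma_{q,i}} + \frac{\sigma_{q,i}^2 + (\mu_{q,i} - \mu_{p,i})^2}{2 \sigma_{p,i}^2} - \frac{1}{2} \right] $$

This is exactly the quantity the `kl_divergence` call computes in the pseudocode below when both distributions are Gaussian.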
What This Loss Does:
- Reconstructs observations from latent states
- Optimizes the policy purely in latent space
- Keeps latent states consistent with the learned dynamics
Implementation Notes (PyTorch Pseudocode)
```python
# Posterior inference for timestep t (z_prev is the posterior sample from step t-1)
z_t = sample_posterior(x[: t + 1], a[:t])   # z_t ~ q_phi(z_t | x_{<=t}, a_{<t})
x_hat = decoder(z_t)                        # p_theta(x_t | z_t)
a_hat = policy(z_t)                         # pi_theta(a_t | z_t)
z_prior = dynamics(z_prev, a[t - 1])        # p_theta(z_t | z_{t-1}, a_{t-1}), using the logged action

# ELBO components for timestep t
log_px = log_prob(x_hat, x[t])              # reconstruction term
log_pi = log_prob(a_hat, a[t])              # policy (action) term
kl = kl_divergence(q_phi, z_prior)          # KL(q_phi || p_theta)

# Per-timestep ELBO (maximize; sum over t for the whole sequence)
elbo = log_px + log_pi - kl
```
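For a slightly more concrete version, here is a self-contained sketch of the sequence-level loss using `torch.distributions`, reusing the hypothetical `GenerativeModel` and `FilteringPosterior` modules sketched earlier. The shapes, sizes, and unit-variance likelihoods are simplifying assumptions.

```python
import torch
from torch.distributions import Normal, kl_divergence

def sequence_elbo(gen, post, x, a):
    """Sum the per-timestep ELBO terms over one trajectory (x: [T, obs_dim], a: [T, action_dim])."""
    T = x.shape[0]
    h = torch.zeros(1, post.rnn.hidden_size)          # recurrent summary of the history
    a_prev = torch.zeros(1, a.shape[-1])
    # p(z_1): learned initial prior
    prior = Normal(gen.prior_mean.unsqueeze(0), gen.prior_logstd.exp().unsqueeze(0))
    elbo = 0.0
    for t in range(T):
        q_t, h = post.step(x[t : t + 1], a_prev, h)    # q_phi(z_t | x_{<=t}, a_{<t})
        z_t = q_t.rsample()                            # reparameterized sample
        # Reconstruction and policy terms (unit-variance Gaussian likelihoods for simplicity)
        log_px = Normal(gen.decoder(z_t), 1.0).log_prob(x[t : t + 1]).sum()
        pi_mean, pi_logstd = gen.policy(z_t).chunk(2, -1)
        log_pi = Normal(pi_mean, pi_logstd.exp()).log_prob(a[t : t + 1]).sum()
        # KL between the filtering posterior and the transition prior
        kl = kl_divergence(q_t, prior).sum()
        elbo = elbo + log_px + log_pi - kl
        # The prior for the next step comes from the dynamics model and the logged action
        prior = gen.transition(z_t, a[t : t + 1])
        a_prev = a[t : t + 1]
    return elbo

# Usage sketch: maximize the ELBO by minimizing its negative
gen, post = GenerativeModel(), FilteringPosterior()
x, a = torch.randn(10, 64), torch.randn(10, 4)
loss = -sequence_elbo(gen, post, x, a)
```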
How This Enables "Imagination"
Once trained, the agent no longer needs to interact with the real environment at every step. It can simulate future rollouts inside its learned latent model:
- Sample a latent state from the posterior over recent real observations: $z_t \sim q_\phi(z_t \mid x_{\le t}, a_{<t})$
- Roll forward using the dynamics model: $z_{t+1} \sim p_\theta(z_{t+1} \mid z_t, a_t)$
- Select actions from the latent policy: $a_t \sim \pi_\theta(a_t \mid z_t)$
- Optionally decode the imagined observations: $\hat{x}_t \sim p_\theta(x_t \mid z_t)$
The agent is now dreaming futures and acting on them, enabling fast planning and efficient behavior without expensive environment interaction.
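A rough sketch of such an imagined rollout, again reusing the hypothetical modules from above (the horizon and sampling choices are illustrative):

```python
import torch
from torch.distributions import Normal

def imagine(gen, z_start, horizon=15):
    """Roll the learned model forward purely in latent space, without touching the environment."""
    z_t = z_start
    trajectory = []
    for _ in range(horizon):
        # a_t ~ pi_theta(a_t | z_t): act from the latent policy
        pi_mean, pi_logstd = gen.policy(z_t).chunk(2, -1)
        a_t = Normal(pi_mean, pi_logstd.exp()).sample()
        # z_{t+1} ~ p_theta(z_{t+1} | z_t, a_t): step the latent dynamics
        z_next = gen.transition(z_t, a_t).sample()
        # Optionally decode the imagined observation x_hat_t from p_theta(x_t | z_t)
        x_hat = gen.decoder(z_t)
        trajectory.append((z_t, a_t, x_hat))
        z_t = z_next
    return trajectory
```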
Summary
We’ve derived and interpreted the sequence-level Evidence Lower Bound (ELBO) that forms the foundation of latent world model approaches like Dreamer, PlaNet, and others.
Takeaways
The ELBO provides a single unified training objective for:
- Learning a latent dynamics model
- Optimizing a latent-space policy
- Inferring latent variables
Policies are trained entirely in the latent space, enabling cheap and scalable imagined rollouts.
This allows:
- Efficient planning
- Better generalization
- Imagination-based behavior: agents that think before they act.