Vishal Pandey | ML Research Engineer; Neuroscience

Dreaming in Latent Space: Deriving the Sequence-Level ELBO for World Models

Motivation: Why Latent World Models?

Imagine an RL agent that no longer needs to interact with the real environment at every timestep. Instead, it imagines futures inside its own learned model, a fast, compact, and differentiable simulator. This is the promise of latent world models.

But how do we train such models?
What’s the right objective to learn latent dynamics, reconstruct observations, and learn a policy, all at once?

This blog walks through the full derivation of the sequence-level Evidence Lower Bound (ELBO) used in world model training, inspired by methods like Dreamer and PlaNet.


Problem Setup

We deal with trajectories of observations $x_{1:T} = (x_1, \dots, x_T)$ and actions $a_{1:T} = (a_1, \dots, a_T)$, together with a sequence of latent states $z_{1:T}$ that is never observed directly.

The goal is to model the joint distribution:

$$
p(x_{1:T}, a_{1:T}) = \int p(x_{1:T}, a_{1:T}, z_{1:T}) \, dz_{1:T}
$$

Since this integral is intractable (due to the latent variables), we use variational inference to approximate it.


Latent Generative Model

We define the following generative process:

$$
p_\theta(x_{1:T}, a_{1:T}, z_{1:T}) = p(z_1) \prod_{t=1}^{T} p(x_t \mid z_t)\, \pi_\theta(a_t \mid z_t) \prod_{t=1}^{T-1} p(z_{t+1} \mid z_t, a_t)
$$

Each term corresponds to:

  • $p(z_1)$: the prior over the initial latent state.
  • $p(x_t \mid z_t)$: the observation decoder, which generates observations from latents.
  • $\pi_\theta(a_t \mid z_t)$: the policy, acting directly on latent states.
  • $p(z_{t+1} \mid z_t, a_t)$: the latent transition dynamics.

This is a latent controlled Markov process.
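
To make this factorization concrete, here is a minimal sketch of ancestral sampling from the generative model. The `prior`, `decoder`, `policy`, and `dynamics` callables are assumptions for illustration (each returning a `torch.distributions` object), not the API of any particular world-model library.

```python
import torch

def sample_trajectory(prior, decoder, policy, dynamics, T):
    """Ancestral sampling from p(z_1) * prod_t p(x_t|z_t) pi(a_t|z_t) p(z_{t+1}|z_t, a_t)."""
    xs, acts, zs = [], [], []
    z = prior().sample()                 # z_1 ~ p(z_1)
    for _ in range(T):
        x = decoder(z).sample()          # x_t ~ p(x_t | z_t)
        a = policy(z).sample()           # a_t ~ pi_theta(a_t | z_t)
        xs.append(x); acts.append(a); zs.append(z)
        z = dynamics(z, a).sample()      # z_{t+1} ~ p(z_{t+1} | z_t, a_t)
    return torch.stack(xs), torch.stack(acts), torch.stack(zs)
```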


Why Not Maximize the Log-Likelihood Directly?

Because:

$$
\log p(x_{1:T}, a_{1:T}) = \log \int p(x_{1:T}, a_{1:T}, z_{1:T}) \, dz_{1:T}
$$

is intractable due to the high-dimensional integration over $z_{1:T}$.


Variational Posterior (Approximate Inference)

We introduce a learned variational distribution:

$$
q_\phi(z_{1:T} \mid x_{1:T}, a_{1:T}) = \prod_{t=1}^{T} q_\phi(z_t \mid x_{\le t}, a_{<t})
$$

This acts like a Bayesian filter: each $z_t$ is inferred from the observations up to time $t$ and the actions taken before it.
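
One common way to implement such a filtering posterior is a recurrent encoder that summarizes the history $(x_{\le t}, a_{<t})$ and outputs the parameters of a diagonal Gaussian over $z_t$. The sketch below is a minimal illustration under that assumption; the module name, sizes, and architecture are hypothetical rather than a specific published design.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class FilteringPosterior(nn.Module):
    """q_phi(z_t | x_{<=t}, a_{<t}) implemented with a GRU over the history."""

    def __init__(self, obs_dim, act_dim, latent_dim=32, hidden_dim=128):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2 * latent_dim)  # mean and log-std

    def forward(self, x, a_prev):
        # x: [B, T, obs_dim]; a_prev: [B, T, act_dim], with a_prev[:, 0] = zeros
        # (there is no action before the first observation).
        h, _ = self.rnn(torch.cat([x, a_prev], dim=-1))     # h_t summarizes (x_{<=t}, a_{<t})
        mean, log_std = self.head(h).chunk(2, dim=-1)
        return Normal(mean, log_std.exp())                  # one Gaussian per timestep
```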


ELBO Derivation Step-by-Step

We rewrite the marginal likelihood as an expectation under $q$ and apply Jensen’s inequality:

$$
\log p(x_{1:T}, a_{1:T})
= \log \mathbb{E}_{q}\!\left[\frac{p(x_{1:T}, a_{1:T}, z_{1:T})}{q(z_{1:T})}\right]
\ge \mathbb{E}_{q}\!\left[\log \frac{p(x_{1:T}, a_{1:T}, z_{1:T})}{q(z_{1:T})}\right]
$$

Define this lower bound as the ELBO:

$$
\mathrm{ELBO} = \mathbb{E}_{q_\phi}\!\left[\log p_\theta(x_{1:T}, a_{1:T}, z_{1:T}) - \log q_\phi(z_{1:T} \mid x_{1:T}, a_{1:T})\right]
$$
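
For completeness, the gap between the log-likelihood and this bound is the KL divergence from $q_\phi$ to the true posterior over latents, so the bound is tight exactly when the variational posterior is exact:

$$
\log p(x_{1:T}, a_{1:T}) = \mathrm{ELBO} + \mathrm{KL}\big(q_\phi(z_{1:T} \mid x_{1:T}, a_{1:T}) \,\|\, p(z_{1:T} \mid x_{1:T}, a_{1:T})\big)
$$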


Expand the Joint Log Terms

Now expand:

$$
\begin{aligned}
\mathrm{ELBO} = \mathbb{E}_{q_\phi}\Big[
&\log p(z_1) + \sum_{t=1}^{T} \log p(x_t \mid z_t) + \sum_{t=1}^{T} \log \pi_\theta(a_t \mid z_t) \\
&+ \sum_{t=1}^{T-1} \log p(z_{t+1} \mid z_t, a_t) - \sum_{t=1}^{T} \log q_\phi(z_t \mid \cdot)
\Big]
\end{aligned}
$$

Let’s rearrange: each transition log-probability pairs with the matching posterior term, and each pair becomes a KL divergence (see the identity below).
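
Writing $p(z_1)$ as $p(z_1 \mid z_0, a_0)$ (a notational convention for the first step), each such pair collapses, in expectation under $q_\phi$, into a negative KL divergence:

$$
\mathbb{E}_{q_\phi}\big[\log p(z_t \mid z_{t-1}, a_{t-1}) - \log q_\phi(z_t \mid \cdot)\big]
= -\,\mathrm{KL}\big(q_\phi(z_t \mid \cdot) \,\|\, p(z_t \mid z_{t-1}, a_{t-1})\big)
$$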


Final Form of the Sequence-Level ELBO

$$
\mathrm{ELBO} = \mathbb{E}_{q_\phi}\!\left[\sum_{t=1}^{T} \Big( \log p(x_t \mid z_t) + \log \pi_\theta(a_t \mid z_t) - \mathrm{KL}\big(q_\phi(z_t \mid \cdot) \,\|\, p(z_t \mid z_{t-1}, a_{t-1})\big) \Big)\right]
$$

This is our training objective.


Interpretation of Each Term

  • $\log p(x_t \mid z_t)$: latent observation decoder; ensures the latent state can predict the real observation.
  • $\log \pi_\theta(a_t \mid z_t)$: optimizes the policy directly in latent space.
  • $\mathrm{KL}\big(q_\phi(z_t \mid \cdot) \,\|\, p(z_t \mid z_{t-1}, a_{t-1})\big)$: forces the inferred latents to match the world model’s transition dynamics.

What This Loss Does:

A single objective jointly trains the observation decoder, the latent transition dynamics, and the policy, end-to-end, by maximizing the ELBO over entire sequences.


Implementation Notes (PyTorch Pseudocode)

import torch  # for torch.distributions.kl_divergence

# Per-timestep ELBO terms inside a loop over t. `posterior`, `decoder`, `policy`,
# and `dynamics` are modules that return torch.distributions objects.

# Latent rollout
q_t = posterior(x[: t + 1], a[:t])                 # q_phi(z_t | x_{<=t}, a_{<t})
z_t = q_t.rsample()                                # reparameterized sample of z_t
p_t = dynamics(z_prev, a[t - 1])                   # prior p(z_t | z_{t-1}, a_{t-1}); use p(z_1) at t = 1

# ELBO components
log_px = decoder(z_t).log_prob(x[t])               # log p(x_t | z_t)
log_pi = policy(z_t).log_prob(a[t])                # log pi_theta(a_t | z_t)
kl = torch.distributions.kl_divergence(q_t, p_t)   # KL(q_phi(z_t | .) || p(z_t | z_{t-1}, a_{t-1}))

# Per-step ELBO (maximize); sum over t for the sequence-level objective
elbo = log_px + log_pi - kl
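
Putting the pieces together, here is a hedged sketch of a full sequence-level training step under the same assumptions: `posterior`, `decoder`, `policy`, `dynamics`, and `prior` are hypothetical modules that return `torch.distributions` objects. It is an illustration of the objective above, not Dreamer’s or PlaNet’s actual implementation.

```python
import torch

def sequence_elbo_loss(posterior, decoder, policy, dynamics, prior, x, a):
    """Negative sequence-level ELBO for a batch of trajectories.

    x: [B, T, obs_dim], a: [B, T, act_dim]. Each module returns a
    torch.distributions object; `prior(batch_size)` gives p(z_1).
    """
    B, T = x.shape[0], x.shape[1]
    elbo = x.new_zeros(B)
    z_prev, a_prev = None, None
    for t in range(T):
        q_t = posterior(x[:, : t + 1], a[:, :t])                  # q_phi(z_t | x_{<=t}, a_{<t})
        z_t = q_t.rsample()                                       # reparameterized sample
        p_t = prior(B) if t == 0 else dynamics(z_prev, a_prev)    # p(z_1) or p(z_t | z_{t-1}, a_{t-1})

        log_px = decoder(z_t).log_prob(x[:, t]).sum(-1)           # log p(x_t | z_t)
        log_pi = policy(z_t).log_prob(a[:, t]).sum(-1)            # log pi_theta(a_t | z_t)
        kl = torch.distributions.kl_divergence(q_t, p_t).sum(-1)  # KL(q_phi || p_theta)

        elbo = elbo + log_px + log_pi - kl
        z_prev, a_prev = z_t, a[:, t]
    return -elbo.mean()  # minimize the negative ELBO

# Usage: loss = sequence_elbo_loss(posterior, decoder, policy, dynamics, prior, x, a)
#        loss.backward(); optimizer.step()
```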

How This Enables "Imagination"

Once trained, the agent no longer needs to interact with the real environment at every step. It can simulate future rollouts inside its learned latent model:

  1. Sample a latent state:

    • $z_1 \sim q_\phi(z_1 \mid x_1)$
  2. Roll forward using the dynamics model:

    • $z_{t+1} \sim p(z_{t+1} \mid z_t, a_t)$
  3. Select actions from the latent policy:

    • $a_t \sim \pi_\theta(a_t \mid z_t)$
  4. Optionally decode the imagined observations:

    • $x_t \sim p(x_t \mid z_t)$

The agent is now dreaming futures and acting on them, enabling fast planning and efficient behavior without expensive environment interaction. A minimal sketch of such an imagined rollout follows.
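
The sketch below mirrors the four steps above, again assuming hypothetical distribution-valued callables (`encode`, `policy`, `dynamics`, `decoder`); it is an illustration, not any specific library’s API.

```python
import torch

def imagine(encode, policy, dynamics, decoder, x1, horizon, decode_obs=False):
    """Imagined rollout in latent space: encode once, then never touch the environment.

    `encode(x1)` returns q_phi(z_1 | x_1); `policy(z)`, `dynamics(z, a)`, and
    `decoder(z)` return torch.distributions objects. Wrap in torch.no_grad() for
    pure simulation, or keep gradients to train the policy through imagination.
    """
    z = encode(x1).sample()                      # 1. z_1 ~ q_phi(z_1 | x_1)
    latents, actions, imagined_obs = [z], [], []
    for _ in range(horizon):
        a = policy(z).sample()                   # 3. a_t ~ pi_theta(a_t | z_t)
        z = dynamics(z, a).sample()              # 2. z_{t+1} ~ p(z_{t+1} | z_t, a_t)
        actions.append(a); latents.append(z)
        if decode_obs:
            imagined_obs.append(decoder(z).sample())  # 4. optionally decode x_t
    return latents, actions, imagined_obs
```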


Summary

We’ve derived and interpreted the sequence-level Evidence Lower Bound (ELBO) that forms the foundation of latent world model approaches like Dreamer, PlaNet, and others.


Takeaways