TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL

Jing Wang1,2,*,  Xiangxin Zhou2,*,  Jiajun Liang3,¶,  Kaiqi Liu4,  Wanyun Pang5,
Zhenyu Xie1,  Tianyu Pang2,‡,  Xiaodan Liang1,‡
1Shenzhen Campus of Sun Yat-Sen University, 2Tencent Hunyuan, 3Tsinghua University, 4Peking University, 5USTB
*Equal contribution    ¶Project Lead    ‡Corresponding Author

Figure 1. Overview and Motivation of TempAct. Framework: Single-prompt AR generation conditions every chunk on the same global instruction, while step-prompt generation provides explicit stage-wise conditions but still relies on a fixed executor. TempAct introduces a planner–executor RL framework that jointly optimizes temporal decomposition and prompt-transition execution. Qualitative comparison: Compared with single-prompt and step-prompt baselines, TempAct produces more faithful event progression under temporally complex instructions. Training dynamics: The increasing reward curve shows that hierarchical planner–executor optimization provides effective learning signals for temporal plausibility.

Abstract

Autoregressive (AR) video diffusion models enable low-latency streaming generation by synthesizing videos chunk by chunk with cached visual context, but this chunk-wise formulation makes temporal instruction following ambiguous. A single global prompt does not specify which sub-event should be realized in each chunk, while naively switching to step-wise prompts often leads to delayed reactions, blended step semantics, and error propagation across prompt transitions. These failures are difficult to address with supervised fine-tuning or distillation alone: SFT suffers from exposure bias, while rollout-based distillation still optimizes low-level denoising or teacher-distribution matching rather than directly enforcing action ordering and prompt-transition correctness.


We address these challenges with TempAct, a planner–executor reinforcement learning framework that jointly optimizes temporal decomposition and step-conditioned execution for temporally plausible AR video generation. TempAct uses an LLM planner to explore span-aware step prompts that are executable by the video model, and trains an AR diffusion executor to follow these prompts under its own generated histories. Its key mechanism is hierarchical group exploration: candidate plans form planning groups, and each plan induces an execution group of multiple continuations from a shared visual context, enabling plan-level credit assignment for long-horizon temporal outcomes and executor-level credit assignment for prompt-switch behavior. We further design hierarchical rewards that combine plan-quality and full-video temporal feedback for the planner with local transition-level step-following rewards, aesthetic regularization, and KL constraints for the executor. Experiments on Self-Forcing and LongLive show that TempAct improves temporal consistency while preserving overall visual quality.

Autoregressive Video Generation · Reinforcement Learning · Temporal Planning · Diffusion Models · Flow-GRPO · GSPO · Prompt Engineering

Motivation

Existing AR video generators face two fundamental failures when executing temporally complex instructions:

🌀

Temporal Confusion (Single Prompt)

The model knows the full instruction but not which part should be realized now. Actions from later stages bleed into early chunks — a dog holding the ball while still sitting.

Prompt-Switch Failures (Step Prompts)

Step-wise prompts clarify the current stage but expose delayed reactions, blended semantics, and error propagation across chunks when the prompt transitions.

These failures are difficult to address with supervised fine-tuning or distillation alone: SFT suffers from exposure bias, while rollout-based distillation still optimizes low-level denoising or teacher-distribution matching rather than directly enforcing action ordering and prompt-transition correctness. This motivates reinforcement learning as a direct solution.

Method: TempAct

TempAct treats temporal video generation as a coupled decision-making problem: an LLM planner decomposes global instructions into span-aware steps, and an AR diffusion executor realizes these steps under accumulated visual context. Crucially, the LLM is not a fixed preprocessing module — both components are jointly optimized from generated trajectories.


Figure 2. TempAct Pipeline. An LLM planner samples span-aware temporal decompositions of a global instruction, while an autoregressive video executor rolls out shared contexts and multiple continuations under the corresponding step prompts. The nested planning–execution groups support hierarchical credit assignment.

Hierarchical Planner–Executor Pipeline

1. LLM-based Temporal Planning

The planner π_φ samples M candidate temporal decompositions. Each plan assigns span-aware step prompts to latent-frame intervals, covering action ordering, temporal granularity, and state descriptions.
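One way to picture a span-aware plan is as an ordered list of step prompts, each bound to a latent-frame interval. The sketch below is illustrative only: the `StepSpan` class, the `validate_plan` helper, the half-open interval convention, the requirement that spans tile the clip without gaps, and the 21-frame length are all assumptions, not details given by the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StepSpan:
    """One step prompt bound to a half-open latent-frame interval [start, end).

    Hypothetical container; the paper does not specify a plan format.
    """
    start: int
    end: int
    prompt: str


def validate_plan(spans: List[StepSpan], num_latent_frames: int) -> bool:
    """Return True if spans are non-empty, ordered, and tile [0, num_latent_frames)."""
    cursor = 0
    for span in spans:
        starts_where_previous_ended = span.start == cursor
        has_positive_length = span.end > span.start
        has_text = bool(span.prompt.strip())
        if not (starts_where_previous_ended and has_positive_length and has_text):
            return False
        cursor = span.end
    return cursor == num_latent_frames


# Example plan for the squirrel prompt, assuming a 21-latent-frame clip.
plan = [
    StepSpan(0, 7, "A squirrel picks up an acorn and examines it."),
    StepSpan(7, 14, "The squirrel digs a small hole in the soft soil."),
    StepSpan(14, 21, "The squirrel buries the acorn and pats the dirt flat."),
]
```

A check like `validate_plan(plan, 21)` could serve as a cheap filter for whether a sampled decomposition is executable at all, before any video is generated.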

2. Prompt Smoothing

Each span j uses a smoothed prompt Smooth(s_j, s_{j+1}) that exposes the executor to both the current subgoal s_j and the upcoming subgoal s_{j+1}, reducing abrupt semantic changes at transitions.
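The page does not define how the smoothing operator combines two step prompts. As a minimal sketch, assuming plain text concatenation with a hypothetical "Next:" connective, and assuming the final span keeps its own prompt because it has no successor:

```python
from typing import List


def smooth(current: str, upcoming: str) -> str:
    """Hypothetical smoothing: show the current subgoal, then hint at the next one."""
    return f"{current.strip()} Next: {upcoming.strip()}"


def smooth_plan(step_prompts: List[str]) -> List[str]:
    """Apply smoothing to every span; the last span has no upcoming subgoal."""
    smoothed = []
    for j, prompt in enumerate(step_prompts):
        if j + 1 < len(step_prompts):
            smoothed.append(smooth(prompt, step_prompts[j + 1]))
        else:
            smoothed.append(prompt.strip())
    return smoothed


steps = [
    "The dog crouches low.",
    "The dog pounces on the ball.",
    "The dog drops the ball at its owner's feet.",
]
smoothed_steps = smooth_plan(steps)
```

The actual operator may well be an LLM rewrite rather than a template; the point here is only the indexing, where span j sees both its own step and step j+1.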

3. Hierarchical Autoregressive Sampling

For each plan, a shared visual context is generated up to the prompt-switch moment, then N continuations are sampled. This creates a nested group structure that isolates planning and execution quality.
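The nested group structure can be sketched as two loops: one over the M plans, and one over the N continuations that branch from each plan's shared context. In the sketch below, `hierarchical_rollout` and its callable arguments are hypothetical stand-ins; strings take the place of video latents so the example runs without a model.

```python
from typing import Any, Callable, Dict, List


def hierarchical_rollout(
    plans: List[Any],
    rollout_context: Callable[[Any], Any],
    continue_from: Callable[[Any, Any, int], Any],
    n_continuations: int,
) -> List[Dict[str, Any]]:
    """Build one execution group (N continuations) per plan in the planning group."""
    groups = []
    for plan in plans:
        # The context up to the prompt switch is generated once and shared.
        context = rollout_context(plan)
        continuations = [
            continue_from(plan, context, k) for k in range(n_continuations)
        ]
        groups.append(
            {"plan": plan, "context": context, "continuations": continuations}
        )
    return groups


# Stand-in generators: strings replace video latents (assumption for illustration).
groups = hierarchical_rollout(
    plans=["plan0", "plan1", "plan2"],
    rollout_context=lambda plan: f"ctx({plan})",
    continue_from=lambda plan, ctx, k: f"{ctx}->cont{k}",
    n_continuations=4,
)
```

Because every continuation in a group starts from the same context and plan, reward differences within a group reflect execution after the prompt switch, while differences between groups reflect the plans.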

4. Joint RL Optimization

The planner is updated with GSPO (sequence-level policy gradients) using plan-quality and full-video rewards. The executor is updated with Flow-GRPO on the first transition chunk only, using local step-following and aesthetic rewards.
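GSPO and Flow-GRPO both belong to the GRPO family, in which rewards are commonly standardized within a sampling group to obtain advantages. The sketch below shows that normalization only; the function name, the use of the population standard deviation, and the `eps` value are assumptions, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group: (r - mean) / (std + eps).

    Population std is an assumption; implementations differ on this choice.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Planning group: one reward per candidate plan (illustrative values).
plan_advantages = group_advantages([0.2, 0.4, 0.6])

# Execution group: one reward per continuation of a single plan (illustrative values).
continuation_advantages = group_advantages([0.5, 0.5, 0.5, 0.5])
```

Under this scheme the same function would be applied twice, once across the M plans and once across the N continuations of each plan, so each level receives a signal relative to its own peers.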

Multi-level Reward Design

📋

Plan Quality Score (Planner)

Qwen3-8B judges faithfulness to the original instruction, event coverage, temporal coherence, and hallucination avoidance across candidate decompositions.

🎬

Temporal-Following Score (Planner)

Qwen3-VL-8B evaluates full-video temporal order, physical consistency, and text-video alignment — averaged over all execution continuations per plan.

🔍

Local Step-Following Score (Executor)

VLM-based reward computed only on the first transition span, directly measuring prompt-switch execution rather than coarse full-video feedback.

🎨

Aesthetic Quality Score (Executor)

PickScore on the same transition chunk prevents semantic optimization from degrading visual quality during executor RL updates.
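The four scores above feed two separate reward signals. A minimal sketch of how they might be combined, assuming simple weighted sums: the weights and function names are hypothetical, and only the averaging of the temporal-following score over a plan's continuations is taken from the description above.

```python
from statistics import mean
from typing import List


def planner_reward(
    plan_quality: float,
    temporal_scores: List[float],
    w_plan: float = 1.0,
    w_temporal: float = 1.0,
) -> float:
    """Plan-level reward: plan quality plus the temporal-following score
    averaged over all continuations executed for this plan."""
    return w_plan * plan_quality + w_temporal * mean(temporal_scores)


def executor_reward(
    step_following: float,
    aesthetic: float,
    w_step: float = 1.0,
    w_aes: float = 1.0,
) -> float:
    """Continuation-level reward on the first transition span:
    local step-following plus an aesthetic term."""
    return w_step * step_following + w_aes * aesthetic


r_plan = planner_reward(plan_quality=0.5, temporal_scores=[0.2, 0.4])
r_exec = executor_reward(step_following=0.5, aesthetic=0.25, w_aes=0.5)
```

In practice the aesthetic score (PickScore) lives on a different scale from a [0,1] judge score, so some rescaling or weighting would be needed; the values here are placeholders.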

Experiments

We evaluate on a Temporal Order Benchmark with two subsets: Simple Set (100 prompts, 1–2 ordered steps) and Hard Set (100 prompts, 3–4 steps with complex dependencies). We report scores under both an in-domain judge (Qwen3-VL-8B) and an out-of-domain judge (Gemini-3-Flash) to test generalization.

+15.5% ↑ Avg Temporal Order (Self-Forcing backbone)
+13.0% ↑ Avg Temporal Order (LongLive backbone)
+24.9% ↑ Hard Set, Qwen judge (Self-Forcing)
81% agreement with humans (Gemini-3-Flash judge)

Main Results

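| Prompting | Method | Temporal Order: Simple Set (1–2 Steps), Qwen | Temporal Order: Simple Set (1–2 Steps), Gemini | Temporal Order: Hard Set (3–4 Steps), Qwen | Temporal Order: Hard Set (3–4 Steps), Gemini | Temporal Order: Avg. | VBench Total | VBench Quality | VBench Semantic | PickScore |
|---|---|---|---|---|---|---|---|---|---|---|
| Single-prompt | Self-Forcing | 0.410 | 0.456 | 0.240 | 0.419 | 0.381 | 81.20 | 83.80 | 70.50 | 20.7 |
| Single-prompt | Casual Forcing | 0.381 | 0.447 | 0.304 | 0.452 | 0.396 | 80.85 | 84.02 | 68.18 | 20.8 |
| Single-prompt | LongLive | 0.428 | 0.505 | 0.302 | 0.463 | 0.424 | 80.36 | 82.72 | 70.91 | 21.1 |
| Step-prompt | Self-Forcing | 0.414 | 0.485 | 0.269 | 0.431 | 0.400 | 80.07 | 82.89 | 68.82 | 20.6 |
| Step-prompt | Self-Forcing + TempAct | 0.500 (+20.8%) | 0.538 (+10.9%) | 0.336 (+24.9%) | 0.473 (+9.7%) | 0.462 (+15.5%) | 79.99 | 82.71 | 69.14 | 20.6 |
| Step-prompt | LongLive | 0.411 | 0.521 | 0.314 | 0.481 | 0.432 | 79.55 | 82.10 | 69.35 | 20.8 |
| Step-prompt | LongLive + TempAct | 0.508 (+23.6%) | 0.579 (+11.1%) | 0.352 (+12.1%) | 0.512 (+6.4%) | 0.488 (+13.0%) | 79.97 | 82.61 | 69.40 | 20.8 |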

Temporal Order scores are in [0,1]. Avg. averages the four Temporal Order scores across Simple/Hard sets and both judges.
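As a quick arithmetic check, the Avg. column and the percentage gains can be reproduced from the four Temporal Order scores. The snippet below uses the Self-Forcing step-prompt rows; reading the percentages as relative gains over the step-prompt baseline with the same backbone is an inference from the numbers, not something the page states explicitly.

```python
from typing import List


def average(scores: List[float]) -> float:
    """Mean of the four Temporal Order scores (Simple/Hard x Qwen/Gemini)."""
    return sum(scores) / len(scores)


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# Self-Forcing with step prompts, without and with TempAct.
baseline = [0.414, 0.485, 0.269, 0.431]
tempact = [0.500, 0.538, 0.336, 0.473]

baseline_avg = round(average(baseline), 3)
tempact_avg = round(average(tempact), 3)
avg_gain = round(relative_gain(tempact_avg, baseline_avg), 1)
hard_qwen_gain = round(relative_gain(tempact[2], baseline[2]), 1)
```

Here `baseline_avg` and `tempact_avg` come out to 0.400 and 0.462, and the two gains to 15.5 and 24.9, matching the reported figures.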

Qualitative Comparison — Self-Forcing

Compared settings: Single Prompt, Step Prompt, TempAct (Ours)
Example 1. A chef in a white apron stands at a wooden cutting board on a kitchen counter. First, she places a ripe tomato on the board and steadies it with one hand. Next, she picks up a sharp chef's knife and slices the tomato into neat, even rounds. Finally, she arranges the slices in a careful row on a clean white plate and steps back.
Example 2. A golden retriever sits on a grassy lawn with a rubber ball lying on the grass a short distance in front of it. First, it fixes its gaze on the ball and crouches low, hindquarters raised. Next, it pounces forward, covering the short gap in two quick bounds, and snatches the ball firmly in its mouth. Finally, it trots back and drops the ball precisely at its owner's feet.
Example 3. A small brown squirrel sits on a patch of bare ground near a tree root. First, it picks up an acorn with its front paws and examines it closely, turning it over. Next, it sets the acorn down and begins digging a small hole in the soft soil with rapid scratching motions. Finally, it drops the acorn into the hole and pushes loose dirt back over it, patting the surface flat with its paws.

Figure 3. Qualitative comparison on temporally ordered prompts using the Self-Forcing backbone. Single-prompt generation blends actions across chunks; step prompts improve stage clarity but still miss state transitions (e.g., squirrel leaving acorn on ground instead of burying it); TempAct correctly realizes the intended event progression.

Qualitative Comparison — LongLive

Compared settings: Single Prompt, Step Prompt, TempAct (Ours)
Example 1. A woman sits down at a vanity table in a softly lit bedroom. First, she opens a wooden jewelry box and takes out a delicate pearl necklace. Next, she holds the necklace up to the light for a moment. Finally, she carefully fastens the necklace around her neck and closes the jewelry box.
Example 2. A woman stands at a kitchen counter preparing a salad. First, she tears large lettuce leaves and places them in a wooden bowl. Next, she slices a cucumber into thin rounds and scatters them over the lettuce. Finally, she drizzles olive oil over the salad and uses two large spoons to gently toss everything together.
Example 3. A young man is at a clean wooden desk in a modern office. First, he carefully places his closed silver laptop on the right side of the desk. Next, he picks up a small black notebook that is already on the desk. Finally, he opens the notebook to a blank page and holds it as if ready to write.

Figure 4. Qualitative comparison on temporally ordered prompts using the LongLive backbone. TempAct consistently produces videos with more faithful event ordering and cleaner prompt-switch execution than both single-prompt and step-prompt baselines.

BibTeX

If you find TempAct useful, please cite our work:

@inproceedings{wang2026tempact,
  title     = {TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation
               via Planner-Executor RL},
  author    = {Wang, Jing and Zhou, Xiangxin and Liang, Jiajun and Liu, Kaiqi
               and Pang, Wanyuan and Xie, Zhenyu and Pang, Tianyu and Liang, Xiaodan},
  year      = {2026}
}