Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Lvmin Zhang Maneesh Agrawala

Stanford University

Paper Code

FramePack

  • Diffuse thousands of frames at full 30 fps with 13B models using only 6GB of laptop GPU memory.
  • Finetune a 13B video model at batch size 64 on a single 8xA100/H100 node for personal/lab experiments.
  • A personal RTX 4090 generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with teacache).
  • No timestep distillation.
  • Video diffusion, but feels like image diffusion.

Understand FramePack in 5 seconds

A next-frame (or next-frame-section) prediction model looks like this:


So we have many input frames and want to diffuse some new frames.

The idea is that we can encode the input frames to some GPU layout like this:


This chart shows the logical GPU memory layout - the frame images are not actually stitched together.

In other words, it shows the context length of each input frame.

Each frame is encoded with a different patchifying kernel to achieve this.

For example, in HunyuanVideo, a 480p frame is typically 1536 tokens when using a (1, 2, 2) patchifying kernel.

If the kernel is changed to (2, 4, 4), the same frame becomes 192 tokens.

In this way, we can change the context length of each frame.
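
Here is a minimal sketch of how the per-frame context length falls out of the patchifying kernel. The 512x768 resolution, the 8x spatial VAE downscale, and the helper name are assumptions for illustration, not the repo's actual API.

```python
def tokens_per_frame(height, width, kernel, vae_downscale=8):
    """Token count attributed to one input frame after patchifying with kernel (kt, kh, kw)."""
    kt, kh, kw = kernel
    lat_h, lat_w = height // vae_downscale, width // vae_downscale
    # a (kt, kh, kw) kernel merges kt latent frames and kh x kw latent pixels into one token,
    # so the per-frame context shrinks by a factor of kt * kh * kw overall
    return (lat_h // kh) * (lat_w // kw) // kt

print(tokens_per_frame(512, 768, (1, 2, 2)))  # 1536 tokens (full context)
print(tokens_per_frame(512, 768, (2, 4, 4)))  # 192 tokens  (8x compressed)
```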

The "more important" frames are given more GPU resources (context length) - in this example, F0 is the most important as it is the nearest frame to the "next-frame prediction" target.

This gives O(1) computation complexity for streaming - yes, a constant, not even O(n log n) or O(n).
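
A rough sketch (not the paper's exact schedule) of why the total context is O(1): if each older frame gets geometrically fewer tokens, the sum converges to a constant no matter how long the history grows. The base token count and decay factor below are placeholder numbers.

```python
def total_context(num_history_frames, base_tokens=1536, decay=2):
    # frame 0 is the most recent; older frames are compressed geometrically harder
    return sum(base_tokens // (decay ** i) for i in range(num_history_frames))

for n in (8, 64, 1024):
    print(n, total_context(n))  # approaches ~2 * base_tokens (~3072), independent of n
```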

But wait, what if ...

The above is only a brief sketch of the idea - many questions naturally arise, such as:

What if the importance of frames does not follow this simple pattern?

What if I want a different compression rate?

If I want image-to-video, isn't the first frame most important?

What if I have some user frames and I want those frames to be more important?

...

Great - these are exactly what FramePack scheduling handles. For example:


So one can get different compression patterns.

One can even make the starting frames equally important so image-to-video will be happier.

And all those schedulings are O(1).

We have a detailed evaluation of many schedulings in the paper!
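
To make this concrete, here is one hypothetical way a schedule could be written down: a list of (frame index, kernel) pairs, most recent frame first. This is purely an illustrative data structure, not the repo's actual configuration format.

```python
# geometric schedule: the nearest frame F0 gets full context, older frames shrink
geometric_schedule = [
    (0, (1, 2, 2)),  # F0, nearest to the prediction target: full context
    (1, (2, 4, 4)),
    (2, (4, 8, 8)),
    # ... older frames keep shrinking
]

# an image-to-video-friendly variant could simply pin the user's starting frame
# at full context so it stays "equally important" to F0
i2v_schedule = geometric_schedule + [("start_frame", (1, 2, 2))]
```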

Anti-drifting Sampling

Drifting is a common problem of any next-what-what prediction model.

Drifting refers to the quality degradation as the video becomes longer.

Sometimes the problem is also called error accumulation or exposure bias.

To see an example, take any image-to-video model and try to generate long videos by repeatedly using the last generated frame as the input. The result will quickly degrade after you do this 5 or 6 times, and everything will be severely degraded after about 10 iterations.

See our paper for experiments on existing methods such as history noise augmentation, special CFG guidance, rolling diffusion timesteps, and so on. We find that, to solve drifting fundamentally, we need to break causality and make the sampling bi-directional.

Consider these sampling methods:

(the shaded squares are the frames generated in each streaming inference)

Note that only the "vanilla sampling" is causal.

Both the "anti-drifting sampling" and "inverted anti-drifting sampling" are bi-directional.

The "inverted anti-drifting sampling" is important. This method is the only one that always treats the first frame as an approximation target in all inferences. This method is very suitable for image-to-video.


Image-to-5-Seconds (30fps, 150 frames)

All results are computed on an RTX 3060 laptop GPU (6GB) with the 13B HY variant. (Videos are compressed with H.264 CRF 18 to fit in GitHub repos.)

Image-to-60-Seconds (30fps, 1800 frames)

All results are computed on an RTX 3060 laptop GPU (6GB) with the 13B HY variant. (Videos are compressed with H.264 CRF 18 to fit in GitHub repos.)

BibTeX

@article{zhang2025framepack,
    title={Packing Input Frame Contexts in Next-Frame Prediction Models for Video Generation},
    author={Lvmin Zhang and Maneesh Agrawala},
    journal={arXiv},
    year={2025}
}