Minute-level real-time video generation is here! Tencent and Nanyang Technological University jointly break the bottleneck of long video generation

Imagine wearing AR glasses while strolling down a city street, the scene before you updating in real time as you look around; or being immersed in an open-world game, your character roaming freely through a seamless virtual world while the game engine generates an endless video stream in real time.

These scenarios place unprecedented demands on AI video generation: the model must not only produce high-quality video, it must also stay coherent over long durations while running in real time.

However, this is precisely the biggest bottleneck facing AI video generation today. Existing models perform well on short clips of a few seconds, but as a video grows longer the problems snowball. The phenomenon is known as error accumulation. Like a game of telephone in which the message is distorted at every hand-off, tiny errors in each frame are inherited and amplified by the next, until the picture finally collapses: colors drift, motion turns stiff, and subjects deform.

Today, the Rolling Forcing method, jointly developed by Nanyang Technological University and Tencent ARC Lab, brings a breakthrough. It cracks the impossible triangle of real-time long video generation, achieving real-time generation of minute-long, high-quality video streams on a single GPU.

The Impossible Triangle of Real-Time Long Video

The video generation field has long faced a stubborn trade-off: high quality, consistency, and real-time speed are difficult to achieve at the same time.

Existing methods have their limitations:

  • Traditional autoregressive generation follows strict frame-by-frame causality; the model cannot go back and correct historical errors, so errors accumulate as the video lengthens
  • History-corruption approaches weaken the dependence on history by injecting noise into it, but they sacrifice inter-frame coherence, causing frame skipping and long-term drift
  • Methods that predict keyframes first and interpolate afterwards reduce error accumulation, but their out-of-order generation makes them unsuitable for real-time streaming

This dilemma has kept AI video generation trapped in the world of short clips, far from a true real-time interactive experience.

Rolling Forcing: A Revolutionary Approach to Correcting While Generating

The core idea of Rolling Forcing is to transform video generation from a strictly serial causal process into a parallel, collaborative process within a sliding window. It is like replacing a traditional serial assembly line, where each step follows the previous one and errors are magnified stage by stage, with parallel workstations that operate in tandem and calibrate one another.

1. Rolling-window joint denoising

Rolling Forcing uses a sliding window for multi-frame joint optimization. In a single forward pass, the model processes a window containing multiple frames simultaneously, and the frames within the window calibrate one another through bidirectional attention.

Each time processing completes, the window slides forward: the first frame is emitted as final output, and a new noise frame is appended at the end of the window as input, enabling continuous streaming generation. This design lets the model dynamically correct latent errors in earlier frames during generation, effectively suppressing error accumulation.
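To make the loop concrete, here is a minimal Python sketch of rolling-window generation as described above. This is not the authors' code: `ToyDenoiser`, `denoise_window`, the window size, and the latent shapes are all illustrative assumptions.

```python
from collections import deque

import torch


class ToyDenoiser:
    """Stand-in for the distilled video model (illustrative only)."""

    def denoise_window(self, frames: torch.Tensor) -> torch.Tensor:
        # A real model would run one joint denoising pass with
        # bidirectional attention across the window; here we just
        # nudge the latents so the loop below is runnable.
        return frames * 0.9


WINDOW = 4             # frames denoised jointly per step (assumed)
C, H, W = 16, 60, 104  # latent frame shape (assumed)


def rolling_generate(model, num_frames, device="cpu"):
    """Jointly refine a window, emit its head frame, append fresh noise."""
    window = deque(torch.randn(WINDOW, C, H, W, device=device).unbind(0))
    outputs = []
    while len(outputs) < num_frames:
        # One forward pass refines all frames in the window together;
        # later frames can push corrections back into earlier ones.
        refined = model.denoise_window(torch.stack(list(window)))
        window = deque(refined.unbind(0))
        outputs.append(window.popleft())                    # head is final
        window.append(torch.randn(C, H, W, device=device))  # new noise at tail
    return torch.stack(outputs)


frames = rolling_generate(ToyDenoiser(), num_frames=8)
print(frames.shape)  # torch.Size([8, 16, 60, 104])
```

The key property the sketch captures is that a frame is only finalized after several window passes have had the chance to revise it, rather than being frozen the moment it is first predicted.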

2. Attention Sink mechanism

To address the drift problem in long video generation, Rolling Forcing introduces the attention sink mechanism: the initially generated frames are persistently cached as global anchors. When generating every subsequent frame, the model can attend to these anchors, effectively preserving the video's long-term visual attributes, including consistency of color tone, lighting, and subject appearance.
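This description maps naturally onto a key/value cache that never evicts its earliest entries. Below is a hedged sketch of such an attention-sink cache; the class, token budgets, and shapes are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


class SinkKVCache:
    """Rolling KV cache whose first `num_sink_tokens` entries persist forever."""

    def __init__(self, num_sink_tokens=512, window_tokens=4096):
        self.num_sink = num_sink_tokens
        self.window = window_tokens
        self.k = None  # (tokens, dim)
        self.v = None

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new])
        self.v = v_new if self.v is None else torch.cat([self.v, v_new])
        overflow = self.k.shape[0] - (self.num_sink + self.window)
        if overflow > 0:
            # Evict the oldest *non-sink* tokens; sink tokens persist as
            # global anchors for color, lighting, and subject appearance.
            s = self.num_sink
            self.k = torch.cat([self.k[:s], self.k[s + overflow:]])
            self.v = torch.cat([self.v[:s], self.v[s + overflow:]])

    def attend(self, q):
        # Every query attends over [sink tokens + recent window].
        attn = F.softmax(q @ self.k.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ self.v


cache = SinkKVCache(num_sink_tokens=4, window_tokens=8)
for _ in range(5):
    cache.append(torch.randn(3, 32), torch.randn(3, 32))
out = cache.attend(torch.randn(2, 32))  # sees sinks + most recent tokens
```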

3. An efficient training algorithm

Rolling Forcing designs an efficient distillation training algorithm over non-overlapping windows. During training, the model conditions on its own previously generated frames rather than ground-truth data, which faithfully simulates the conditions seen at inference time and alleviates the exposure bias problem.
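As a rough illustration of training on self-generated history, the following sketch rolls a student model over several non-overlapping windows, conditioning each window on its own (detached) earlier outputs. `ToyStudent`, `generate_window`, and `toy_dmd_loss` are placeholders standing in for the real distilled model and the DMD loss; only the conditioning pattern reflects the article's description.

```python
import torch


class ToyStudent(torch.nn.Module):
    """Stand-in few-step generator (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(1))

    def generate_window(self, noise, context=None):
        out = noise * self.scale
        if context is not None:
            out = out + 0.1 * context.mean()  # toy use of history
        return out


def toy_dmd_loss(frames):
    # Placeholder for the distribution-matching distillation (DMD) loss.
    return frames.pow(2).mean()


def train_step(student, num_windows=3, win=4, C=16, H=60, W=104):
    history = []  # the model's own outputs, detached to bound memory
    total_loss = 0.0
    for _ in range(num_windows):
        noise = torch.randn(win, C, H, W)
        ctx = torch.stack(history[-win:]) if history else None
        # Condition on self-generated history, not ground-truth frames,
        # so training matches the distribution seen at inference time.
        frames = student.generate_window(noise, context=ctx)
        total_loss = total_loss + toy_dmd_loss(frames)
        history.extend(f.detach() for f in frames.unbind(0))
    return total_loss


loss = train_step(ToyStudent())
loss.backward()
```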

Performance: Minute-Long Generation at Sustained High Quality

In quantitative tests, Rolling Forcing outperforms existing mainstream methods on several key metrics. Its most prominent advantage is long-term consistency: ΔDriftQuality, a key measure of video-quality drift, is far lower than that of the comparison models, demonstrating that it effectively suppresses error accumulation in long video generation.
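The article cites ΔDriftQuality without defining it. One plausible reading, offered purely as an assumption, is the drop in a per-frame quality score between the earliest and latest frames of a long generation; the sketch below computes that difference, with all numbers made up.

```python
import numpy as np


def drift_quality_delta(per_frame_quality, head=96, tail=96):
    """Assumed formulation of a drift metric: quality lost between the
    first `head` frames and the last `tail` frames; closer to zero
    means less drift. Illustrative only."""
    q = np.asarray(per_frame_quality, dtype=float)
    return q[:head].mean() - q[-tail:].mean()


# e.g. scores from any per-frame quality model (values are fabricated):
scores = np.concatenate([np.full(960, 0.82), np.full(960, 0.79)])
print(drift_quality_delta(scores))  # ~0.03 -> mild quality drift
```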

In qualitative comparisons, the advantage of Rolling Forcing is even more obvious. Over a two-minute generation run, comparison models such as SkyReels-V2 and MAGI-1 showed obvious color shifts, detail degradation, or subject deformation, while the content generated by Rolling Forcing remained highly stable in detail, color, and motion coherence.

What's even more striking is that this quality doesn't come at the expense of speed: Rolling Forcing achieves a generation speed of 16 fps on a single GPU, truly real-time generation, laying a solid foundation for interactive applications.

Interactive video generation: dynamically guided content creation

Another breakthrough capability of Rolling Forcing is its support for interactive video stream generation. During streaming, users can change the text prompt at any time, and the model dynamically adjusts the subsequent content according to the new instruction, enabling seamless switching and steering of content.
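Here is a minimal sketch of what such mid-stream prompt switching might look like from the caller's side, assuming a streaming generator whose text conditioning can be swapped between frames. `ToyStreamer` and the prompt schedule are purely illustrative stand-ins.

```python
import torch


class ToyStreamer:
    """Stand-in streaming generator with swappable text conditioning."""

    def __init__(self):
        self.cond = 0.0

    def set_prompt(self, text: str):
        # A real system would re-encode `text` with its text encoder here;
        # we fake a conditioning signal so the loop is runnable.
        self.cond = float(len(text))

    def next_frame(self) -> torch.Tensor:
        return torch.full((3, 64, 64), self.cond)  # toy frame


# Prompt switches keyed by frame index (assumed interface).
schedule = {
    0:   "a red hot-air balloon drifting over a green valley",
    240: "the balloon descends into a snowy forest at dusk",
}

streamer = ToyStreamer()
frames = []
for t in range(480):
    if t in schedule:                  # prompt switch mid-stream
        streamer.set_prompt(schedule[t])
    frames.append(streamer.next_frame())
```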

This capability opens up new possibilities for real-time content creation. Creators can adjust storylines, scene styles, or character movements in real time during the video generation process, without having to wait for the entire video to be generated and then start over. Educators can dynamically adjust parameters in teaching presentations, medical training can respond to trainee actions in real time, and gaming experiences can be dynamically shaped by player behavior.

Future challenges and outlook

Despite Rolling Forcing's breakthrough, the research team candidly points out several directions worth further exploration:

  1. Memory mechanism optimization: the current approach retains only the context of the initial and most recent frames, discarding mid-sequence content as generation proceeds. More efficient long-range memory mechanisms are needed to dynamically preserve and recall key information from the middle of a video.
  2. Training efficiency: computing the DMD loss over large attention windows makes training expensive. Future work could reduce this computational cost without sacrificing performance, allowing the model to scale further.
  3. Interaction latency optimization: the rolling-window mechanism introduces a small amount of latency in exchange for its quality gains. Interaction scenarios that demand very low latency, such as VR/AR, will require more flexible inference strategies.

Open Source and Practice

Happily, the research team has released the full open-source code, model weights, and detailed documentation, so developers can integrate this cutting-edge technology into their projects right away.

Project Address:
