"It no longer needs 'outside help' and is finally growing into an independent artist."
In the field of AI image generation, we seem to have long been accustomed to such a division of labor:
Diffusion models "draw", CLIP "sees", VQ-VAE "compresses", and the LLM "thinks".
But today, an open-source model called NextStep-1 is trying to upend that paradigm: with nothing but a pure 14B-parameter autoregressive architecture, it delivers generation quality that rivals top diffusion models, understands everyday language, and edits images on the fly.
What "big move" has the StepFun team released this time? Let's find out.

🎨 Redefining autoregression: say goodbye to "outside help" and become a real artist!
Autoregressive models have long dominated the text domain, but for years they have struggled to find their footing on the image generation track.
Past attempts have mostly fallen into two dilemmas:
- The discretization dilemma: images must be compressed by a VQ-VAE into a finite set of discrete tokens, which inevitably loses information
- The external-aid dependency: a large diffusion model has to be bolted on as the "decoder", which makes the architecture unwieldy and training complex
NextStep-1's core breakthrough:
It generates image patches directly in continuous visual space, in a purely autoregressive manner.
The model consists of two parts (see the sketch after this list):
- A 14B-parameter Transformer backbone: understands the prompt, plans the composition, and controls the overall picture
- A 157M-parameter flow matching head: turns the Transformer's "ideas" into concrete pixels, like a "paintbrush"!
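To make this division of labor concrete, here is a minimal PyTorch-style sketch of what a continuous-patch flow matching head and its training objective could look like. Everything here (module sizes, the rectified-flow formulation, names such as `FlowMatchingHead` and `flow_matching_loss`) is an illustrative assumption rather than the released NextStep-1 code; the 14B backbone is simply assumed to supply a conditioning hidden state `cond` for each patch position.

```python
# Illustrative sketch only -- module sizes, names, and interfaces are assumptions,
# not the released NextStep-1 code.
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Small MLP mapping (noisy patch, timestep, backbone hidden state) -> velocity."""
    def __init__(self, patch_dim: int, cond_dim: int, hidden: int = 1024):
        super().__init__()
        self.patch_dim = patch_dim
        self.net = nn.Sequential(
            nn.Linear(patch_dim + 1 + cond_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, patch_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, patch_dim), t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(head: FlowMatchingHead, cond: torch.Tensor, x1: torch.Tensor):
    """Rectified-flow style objective: regress the velocity (x1 - x0) along the
    straight path x_t = (1 - t) * x0 + t * x1, conditioned on the backbone state."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.size(0), 1, device=x1.device)  # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1
    v_pred = head(x_t, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()
```

Because the head only has to regress a velocity toward the target patch, all of the heavy semantic lifting can stay inside the backbone that produces `cond`.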
This architecture brings revolutionary changes:
✅ No need for discretization: retains the full richness of the visual data
✅ End-to-end training: no longer relies on an external diffusion model to "save the day"
✅ Extremely clean architecture: a single unified system that trains more efficiently
One researcher exclaimed, "It's like watching your own child finally complete a painting on their own, without a parent there to hold their hand."

🔬 Two technical "alchemies": making autoregressive models really work for images
The StepFun team revealed two key findings in the paper, which could become the "gold standard" for autoregressive image generation:
1️⃣ The real "artist" is the Transformer!
Through experiments, the team found that scaling the flow matching head (157M → 528M) has minimal effect on final image quality.
That means:
- The Transformer backbone does 90%+ of the "creative work"
- The flow matching head acts as a lightweight "executor" that faithfully translates those ideas into images (see the sketch after this list)
- Autoregressive models can truly "think" and "create" on their own.
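One intuition for why the head can stay so small: at inference time it only integrates a velocity field that is fully conditioned on the backbone's hidden state. The sketch below illustrates this with a plain Euler sampler; the step count, function signature, and the way the backbone is queried are assumptions for illustration, not the paper's actual sampler.

```python
# Illustrative inference sketch -- the sampler, step count, and interfaces are assumptions.
import torch

@torch.no_grad()
def generate_next_patch(backbone, head, context, patch_dim: int, num_steps: int = 20):
    """The backbone does the 'thinking' (one hidden state per patch position);
    the small head merely integrates its velocity field for a few steps."""
    cond = backbone(context)[:, -1]                   # hidden state for the next position
    x = torch.randn(cond.size(0), patch_dim, device=cond.device)  # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):                        # simple Euler integration
        t = torch.full((x.size(0), 1), i * dt, device=x.device)
        x = x + dt * head(x, t, cond)
    return x                                          # continuous latent for the next patch
```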
"This proves that Transformer can not only be a language, but also an artist in the visual field." --Research Team
2️⃣ The Tokenizer's two "magic tricks"
When tokenizing images into continuous visual tokens, the team discovered two key techniques:
- Channel-wise normalization
A simple normalization effectively stabilizes the statistics of the tokens, so the model generates clean, artifact-free images even under the strongest CFG guidance.
- "More noise = better quality"
A counterintuitive finding: increasing the noise regularization while training the tokenizer actually improves final image quality significantly.
The team hypothesizes that this shapes a more robust, evenly distributed latent space, providing an ideal "canvas" for the autoregressive model.
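A hedged sketch of what these two tricks could look like in code, assuming a VAE-style tokenizer whose encoder outputs latents of shape (B, C, H, W); the normalization statistics, the noise level, and the usage pattern are all assumptions, not the paper's exact recipe.

```python
# Illustrative sketch of the two tokenizer tricks; exact statistics, noise level,
# and training-loop usage are assumptions, not the paper's recipe.
import torch

def channel_wise_normalize(z: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each latent channel to zero mean and unit variance, keeping
    token statistics stable so strong CFG guidance does not blow up into artifacts."""
    mean = z.mean(dim=(0, 2, 3), keepdim=True)
    std = z.std(dim=(0, 2, 3), keepdim=True)
    return (z - mean) / (std + eps)

def noise_regularize(z: torch.Tensor, noise_std: float = 0.3) -> torch.Tensor:
    """Perturb latents with Gaussian noise during tokenizer training; the (assumed)
    noise_std is the knob the paper reports turning *up* for better final quality."""
    return z + noise_std * torch.randn_like(z)

# Hypothetical use inside a tokenizer training step:
#   z = encoder(images)
#   z = channel_wise_normalize(z)
#   recon = decoder(noise_regularize(z))   # reconstruct from the noisy latent
```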

🖼️ Capability showcase: it doesn't just generate, it edits
NextStep-1 not only generates images out of thin air, but also understands human commands and edits them with the precision of a professional designer.
✅ High-fidelity text-to-image generation
Generate detailed, well-composed images with a single command:
"A serene lakeside at dawn, pine trees reflected in still water, mist rising from the surface, soft golden light breaking through mountain peaks in the distance, hyperrealistic photography". "A serene lakeside at dawn, pine trees reflected in still water, mist rising from the surface, soft golden light breaking through mountain peaks in the distance, hyperrealistic photography"
✅ All-around image editor
Object addition and removal:
"Add an open laptop on the coffee table with a steaming cup of coffee next to it."
Background modification:
"Change the background of this photo from the office to a beach sunset."
Motion modification:
"Make the dog in the picture go from a sitting to a jumping position."
Style transfer:
"Convert this photo into a Van Gogh-style oil painting, retaining all character and scene details"
The real-world results are impressive: it not only understands everyday language, but also maintains visual coherence between the original and the edited image, avoiding the "identity drift" that is common in traditional methods.
One designer commented, "It's like hiring an all-around assistant who can both create from scratch and make precise modifications to match your ideas."

📊 Performance data: autoregression can also challenge SOTA
On authoritative benchmarks, NextStep-1 delivers some pleasant surprises:
| Benchmark | NextStep-1 result | Significance |
|---|---|---|
| GenEval | 0.73 (with self-CoT) | Surpasses most autoregressive models, approaching diffusion models |
| GenAI-Bench | 0.67 on advanced prompts, 0.88 on basic prompts | Strong understanding of complex scenes |
| DPG-Bench | 85.28 | Strong comprehension of long prompts |
| WISE | 0.54 overall | Excellent integration of world knowledge |
| GEdit-Bench | Significantly ahead of other autoregressive models | Outstanding image-editing capability |
Even more exciting: NextStep-1 can now go head-to-head with top diffusion models on several benchmarks, an unprecedented breakthrough for an autoregressive architecture.

⚠️ Facing the Challenge: "Stumbling Blocks" to Growth
The StepFun team did not shy away from the model's limitations and candidly listed four major challenges:
1️⃣ Unstable generation process
When generating in a high-dimensional continuous latent space (16 channels), the model occasionally exhibits:
- Localized noise/block artifacts
- Global noise interference
- Grid-like artifacts (possibly related to 1D positional encoding)
2️⃣ Sequential Decoding Latency
The "nature" of autoregressive models leads to speed bottlenecks:
- 14B Parameter Transformer sequential decoding is the main bottleneck
- Multi-step sampling of the stream matching header also introduces overheads
- Single token generation takes about 47.6ms on H100
3️⃣ High Resolution Challenge
- Convergence inefficiency: more training steps required
- High-resolution techniques from diffusion models are difficult to transfer
- Lack of a 2D spatial inductive bias
4️⃣ Supervised Fine-Tuning (SFT) Difficulties
- Stable fine-tuning depends on large-scale data (millions of samples)
- Behavior on small datasets is fragile: either barely any effect or complete overfitting
- It is hard to strike a balance between "general capability" and "specific styles"
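To put the per-token latency in perspective, here is a quick back-of-the-envelope estimate; the tokens-per-image count is an assumed value for illustration, not a figure from the paper.

```python
# Back-of-the-envelope estimate from the reported ~47.6 ms per token on an H100;
# the token count per image is an assumed value for illustration only.
ms_per_token = 47.6
tokens_per_image = 1024          # e.g. a 32 x 32 grid of image patches (assumption)
total_seconds = ms_per_token * tokens_per_image / 1000
print(f"~{total_seconds:.0f} s per image with purely sequential decoding")  # ~49 s
```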
The team admits, "Being honest about these challenges is the first step in moving the field forward."


🚀 How to get started? Fully open source, one-click deployment
The StepFun team has made NextStep-1 completely open source and extremely friendly to researchers and developers; installation takes only three commands:
git clone https://github.com/stepfun-ai/NextStep-1
cd NextStep-1
pip install -r requirements.txt

The team also provides detailed tutorials covering application scenarios from basic usage to advanced customization.
🔮 Future perspectives: a new era of autoregressive image generation
The release of NextStep-1 marks a new stage in autoregressive image generation:
- Architectural Simplicity: No more complex patchwork, one unified model
- Efficient training: end-to-end training to avoid instability in multi-stage optimization
- Unified capabilities: handles both generation and editing, and understands natural-language instructions
Future directions outlined by the StepFun team:
- Optimizing the flow matching head: fewer parameters, fewer-step generation
- Accelerating autoregression: exploring techniques such as multi-token prediction
- High-resolution generation: developing 2D positional encodings designed for images
- Improving SFT: efficient fine-tuning techniques for small datasets
"This is just the first step in the exploration. We believe that this 'clean' path will provide a fresh perspective on the field of multimodal generation."

🌟 Closing thoughts
NextStep-1 is much more than a new model; it proves an important point:
A simple architecture can also deliver powerful capabilities.
When we stop obsessing over "stitching together the biggest model" and return to the essential question of "how to make the model truly understand creation", AI generation technology may take another leap forward.
"It is not meant to replace diffusion models, but to provide a new possible path for image generation." -- Step Star Team
In this era of rapidly changing AI technology, NextStep-1 reminds us:
Sometimes the most revolutionary innovations come precisely from rethinking the underlying paradigm.
Related links:
- Paper: https://arxiv.org/abs/2508.10711
- Code repository: https://github.com/stepfun-ai/NextStep-1
- Model download: https://huggingface.co/collections/stepfun-ai/nextstep-1
- Project home page: https://stepfun.ai/research/en/nextstep1