"It no longer needs 'outside help' and is finally growing into an independent artist."
In the field of AI image generation, we seem to have long been accustomed to such a division of labor:
Diffusion models "draw", CLIP "sees", VQ-VAE "compresses", and the LLM "thinks".
But today, an open-source model called NextStep-1 is trying to upend that paradigm: with nothing but a pure 14B-parameter autoregressive architecture, it delivers generation quality that rivals top diffusion models, understands everyday language, and edits images on the fly.
What "big move" has the StepFun team released this time? Let's find out.

🎨 Redefining autoregression: say goodbye to "outside help" and become a real artist!
Autoregressive models have long dominated the text domain, but for years they have struggled to find their footing on the image generation track.
Past attempts have mostly fallen into two dilemmas:
- The discretization dilemma: images must be compressed by a VQ-VAE into a finite set of discrete tokens, which inevitably loses information
- The external-aid dependency: a large diffusion model has to be bolted on as the "decoder", which makes the architecture unwieldy and training complex
NextStep-1's core breakthrough:
It generates image patches directly in continuous visual space, in a purely autoregressive manner.
The model consists of two parts (see the sketch after this list):
- A 14B-parameter Transformer backbone: understands the prompt, plans the composition, and controls the overall picture
- A 157M-parameter flow matching head: turns the Transformer's "ideas" into concrete pixels, like a "paintbrush"!
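To make this division of labor concrete, here is a minimal PyTorch-style sketch of what a continuous-patch flow matching head and its training objective could look like. Everything here (module sizes, the rectified-flow formulation, names such as `FlowMatchingHead` and `flow_matching_loss`) is an illustrative assumption rather than the released NextStep-1 code; the 14B backbone is simply assumed to supply a conditioning hidden state `cond` for each patch position.

```python
# Illustrative sketch only -- module sizes, names, and interfaces are assumptions,
# not the released NextStep-1 code.
import torch
import torch.nn as nn

class FlowMatchingHead(nn.Module):
    """Small MLP mapping (noisy patch, timestep, backbone hidden state) -> velocity."""
    def __init__(self, patch_dim: int, cond_dim: int, hidden: int = 1024):
        super().__init__()
        self.patch_dim = patch_dim
        self.net = nn.Sequential(
            nn.Linear(patch_dim + 1 + cond_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, patch_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, patch_dim), t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(head: FlowMatchingHead, cond: torch.Tensor, x1: torch.Tensor):
    """Rectified-flow style objective: regress the velocity (x1 - x0) along the
    straight path x_t = (1 - t) * x0 + t * x1, conditioned on the backbone state."""
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.size(0), 1, device=x1.device)  # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1
    v_pred = head(x_t, t, cond)
    return ((v_pred - (x1 - x0)) ** 2).mean()
```

Because the head only has to regress a velocity toward the target patch, all of the heavy semantic lifting can stay inside the backbone that produces `cond`.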
This architecture brings revolutionary changes:
✅ No need for discretization: retains the full richness of the visual data
✅ End-to-end training: no longer relies on an external diffusion model to "save the day"
✅ Extremely clean architecture: a single unified system that trains more efficiently
One researcher exclaimed, "It's like watching your own child finally complete a painting on their own, without a parent there to hold their hand."

🔬 Two technical "alchemies": making autoregressive models really work for images
The StepFun team revealed two key findings in the paper, which could become the "gold standard" for autoregressive image generation:
1️⃣ The real "artist" is the Transformer!
Through experiments, the team found that scaling the flow matching head (157M → 528M) has minimal effect on final image quality.
That means:
- The Transformer backbone does 90%+ of the "creative work"
- The flow matching head acts as a lightweight "executor" that faithfully translates those ideas into images (see the sketch after this list)
- Autoregressive models can truly "think" and "create" on their own.
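One intuition for why the head can stay so small: at inference time it only integrates a velocity field that is fully conditioned on the backbone's hidden state. The sketch below illustrates this with a plain Euler sampler; the step count, function signature, and the way the backbone is queried are assumptions for illustration, not the paper's actual sampler.

```python
# Illustrative inference sketch -- the sampler, step count, and interfaces are assumptions.
import torch

@torch.no_grad()
def generate_next_patch(backbone, head, context, patch_dim: int, num_steps: int = 20):
    """The backbone does the 'thinking' (one hidden state per patch position);
    the small head merely integrates its velocity field for a few steps."""
    cond = backbone(context)[:, -1]                   # hidden state for the next position
    x = torch.randn(cond.size(0), patch_dim, device=cond.device)  # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):                        # simple Euler integration
        t = torch.full((x.size(0), 1), i * dt, device=x.device)
        x = x + dt * head(x, t, cond)
    return x                                          # continuous latent for the next patch
```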
"This proves that Transformer can not only be a language, but also an artist in the visual field." --Research Team
2️⃣ The Tokenizer's two "magic tricks"
When tokenizing images into continuous visual tokens, the team discovered two key techniques:
- Channel-wise normalization
A simple normalization effectively stabilizes the statistics of the tokens, so the model generates clean, artifact-free images even under the strongest CFG guidance.
- "More noise = better quality"
A counterintuitive finding: increasing the noise regularization while training the tokenizer actually improves final image quality significantly.
The team hypothesizes that this shapes a more robust, evenly distributed latent space, providing an ideal "canvas" for the autoregressive model.
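A hedged sketch of what these two tricks could look like in code, assuming a VAE-style tokenizer whose encoder outputs latents of shape (B, C, H, W); the normalization statistics, the noise level, and the usage pattern are all assumptions, not the paper's exact recipe.

```python
# Illustrative sketch of the two tokenizer tricks; exact statistics, noise level,
# and training-loop usage are assumptions, not the paper's recipe.
import torch

def channel_wise_normalize(z: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each latent channel to zero mean and unit variance, keeping
    token statistics stable so strong CFG guidance does not blow up into artifacts."""
    mean = z.mean(dim=(0, 2, 3), keepdim=True)
    std = z.std(dim=(0, 2, 3), keepdim=True)
    return (z - mean) / (std + eps)

def noise_regularize(z: torch.Tensor, noise_std: float = 0.3) -> torch.Tensor:
    """Perturb latents with Gaussian noise during tokenizer training; the (assumed)
    noise_std is the knob the paper reports turning *up* for better final quality."""
    return z + noise_std * torch.randn_like(z)

# Hypothetical use inside a tokenizer training step:
#   z = encoder(images)
#   z = channel_wise_normalize(z)
#   recon = decoder(noise_regularize(z))   # reconstruct from the noisy latent
```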

🖼️ Capability showcase: it doesn't just generate, it edits
NextStep-1 not only generates images out of thin air, but also understands human commands and edits them with the precision of a professional designer.
✅ High-fidelity text-to-image generation
Generate detailed, well-composed images with a single command:
"A serene lakeside at dawn, pine trees reflected in still water, mist rising from the surface, soft golden light breaking through mountain peaks in the distance, hyperrealistic photography". "A serene lakeside at dawn, pine trees reflected in still water, mist rising from the surface, soft golden light breaking through mountain peaks in the distance, hyperrealistic photography"
✅ All-around image editor
Object addition and removal:
"Add an open laptop on the coffee table with a steaming cup of coffee next to it."
Background modification:
"Change the background of this photo from the office to a beach sunset."
Motion modification:
"Make the dog in the picture go from a sitting to a jumping position."
Style transfer:
"Convert this photo into a Van Gogh-style oil painting, retaining all character and scene details"
The real-world results are impressive: it not only understands everyday language, but also maintains visual coherence between the original and the edited image, avoiding the "identity drift" that is common in traditional methods.
One designer commented, "It's like hiring an all-around assistant who can both create from scratch and make precise modifications to match your ideas."

📊 Performance data: autoregression can also challenge SOTA
On authoritative benchmarks, NextStep-1 delivers some pleasant surprises:
| Benchmark | NextStep-1 result | Significance |
|---|---|---|
| GenEval | 0.73 (with self-CoT) | Surpasses most autoregressive models, approaching diffusion models |
| GenAI-Bench | 0.67 on advanced prompts, 0.88 on basic prompts | Strong understanding of complex scenes |
| DPG-Bench | 85.28 | Strong comprehension of long prompts |
| WISE | 0.54 overall | Excellent integration of world knowledge |
| GEdit-Bench | Significantly ahead of other autoregressive models | Outstanding image-editing capability |
Even more exciting: NextStep-1 can now go head-to-head with top diffusion models on several benchmarks, an unprecedented breakthrough for an autoregressive architecture.

⚠️ Facing the Challenge: "Stumbling Blocks" to Growth
The StepFun team did not shy away from the model's limitations and candidly listed four major challenges:
1️⃣ Unstable generation process
When generating in a high-dimensional continuous latent space (16 channels), the model occasionally exhibits:
- Localized noise/block artifacts
- Global noise interference
- Grid-like artifacts (possibly related to 1D positional encoding)
2️⃣ Sequential Decoding Latency
The "nature" of autoregressive models leads to speed bottlenecks:
- 14B Parameter Transformer sequential decoding is the main bottleneck
- Multi-step sampling of the stream matching header also introduces overheads
- Single token generation takes about 47.6ms on H100
3️⃣ High Resolution Challenge
- Convergence inefficiency: more training steps required
- High-resolution techniques from diffusion models are difficult to transfer
- Lack of a 2D spatial inductive bias
4️⃣ Supervised Fine-Tuning (SFT) Difficulties
- Stable fine-tuning depends on large-scale data (millions of samples)
- Behavior on small datasets is fragile: either barely any effect or complete overfitting
- It is hard to strike a balance between "general capability" and "specific styles"
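To put the per-token latency in perspective, here is a quick back-of-the-envelope estimate; the tokens-per-image count is an assumed value for illustration, not a figure from the paper.

```python
# Back-of-the-envelope estimate from the reported ~47.6 ms per token on an H100;
# the token count per image is an assumed value for illustration only.
ms_per_token = 47.6
tokens_per_image = 1024          # e.g. a 32 x 32 grid of image patches (assumption)
total_seconds = ms_per_token * tokens_per_image / 1000
print(f"~{total_seconds:.0f} s per image with purely sequential decoding")  # ~49 s
```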
The team admits, "Being honest about these challenges is the first step in moving the field forward."


🚀 How to get started? Fully open source, one-click deployment
The StepFun team has made NextStep-1 completely open source and extremely friendly to researchers and developers; installation takes only three commands:
git clone https://github.com/stepfun-ai/NextStep-1
cd NextStep-1
pip install -r requirements.txt

The team also provides detailed tutorials covering application scenarios from basic usage to advanced customization.
🔮 Future perspectives: a new era of autoregressive image generation
The release of NextStep-1 marks a new stage in autoregressive image generation:
- Architectural Simplicity: No more complex patchwork, one unified model
- Efficient training: end-to-end training to avoid instability in multi-stage optimization
- Unified capabilities: handles both generation and editing, and understands natural-language instructions
Future directions outlined by the StepFun team:
- Optimizing the flow matching head: fewer parameters, fewer-step generation
- Accelerating autoregression: exploring techniques such as multi-token prediction
- High-resolution generation: developing 2D positional encodings designed for images
- Improving SFT: efficient fine-tuning techniques for small datasets
"This is just the first step in the exploration. We believe that this 'clean' path will provide a fresh perspective on the field of multimodal generation."

🌟 Closing thoughts
NextStep-1 is much more than a new model; it proves an important point:
A simple architecture can also deliver powerful capabilities.
When we stop obsessing over "stitching together the biggest model" and return to the essential question of "how to make the model truly understand creation", AI generation technology may take another leap forward.
"It is not meant to replace diffusion models, but to provide a new possible path for image generation." -- Step Star Team
In this era of rapidly changing AI technology, NextStep-1 reminds us:
Sometimes the most revolutionary innovations come precisely from rethinking the underlying paradigm.
Related links:
- Paper: https://arxiv.org/abs/2508.10711
- Code repository: https://github.com/stepfun-ai/NextStep-1
- Model download: https://huggingface.co/collections/stepfun-ai/nextstep-1
- Project home page: https://stepfun.ai/research/en/nextstep1