UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Abstract
UltraFlux, a Flux-based DiT trained on a 4K dataset, addresses failures in diffusion transformers at 4K resolution through enhanced positional encoding, improved VAE compression, gradient rebalancing, and aesthetic curriculum learning, achieving superior performance compared to existing models.
Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines on fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
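To make component (iii) concrete, here is a minimal NumPy sketch of what an SNR-aware Huber loss over wavelet bands could look like. This is our own illustration, not the paper's implementation: the one-level Haar split, the per-band weights, and the clipped-SNR timestep weight (in the spirit of min-SNR reweighting) are all assumptions.

```python
import numpy as np

def haar_bands(x):
    """One-level 2D Haar split of an (H, W) array into LL/LH/HL/HH bands."""
    a, b = x[0::2, :], x[1::2, :]
    lo, hi = (a + b) / 2.0, (a - b) / 2.0
    def split_cols(y):
        c, d = y[:, 0::2], y[:, 1::2]
        return (c + d) / 2.0, (c - d) / 2.0
    LL, LH = split_cols(lo)
    HL, HH = split_cols(hi)
    return {"LL": LL, "LH": LH, "HL": HL, "HH": HH}

def huber(err, delta=1.0):
    """Elementwise Huber penalty: quadratic near zero, linear in the tails."""
    abs_e = np.abs(err)
    return np.where(abs_e <= delta, 0.5 * err**2, delta * (abs_e - 0.5 * delta))

def snr_huber_wavelet_loss(pred, target, snr, gamma=5.0,
                           band_w={"LL": 1.0, "LH": 1.0, "HL": 1.0, "HH": 1.0}):
    """Hypothetical objective: Huber distance per wavelet band, rescaled by a
    clipped-SNR timestep weight so low- and high-noise steps contribute
    comparably balanced gradients."""
    t_w = min(snr, gamma) / snr  # clipped-SNR reweighting (assumption)
    p, t = haar_bands(pred), haar_bands(target)
    loss = sum(band_w[k] * huber(p[k] - t[k]).mean() for k in p)
    return t_w * loss
```

Splitting the residual into wavelet bands lets the objective weight high-frequency detail (LH/HL/HH) separately from coarse structure (LL), which is the rebalancing idea the abstract describes.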
TL;DR: UltraFlux couples a rigorously curated 1M-image 4K dataset with a native 4K generative model, delivering sharper detail and stronger text–image consistency than existing open alternatives.
UltraFlux is our latest effort toward high-fidelity, native 4K image synthesis. Rather than scaling resolution in isolation, we approach 4K generation as a coupled system problem, where data quality, spatial representation, reconstruction fidelity, and optimization dynamics must be designed together.
Current high-resolution pipelines often improve one component at the expense of another: sharper details but unstable layouts, or stronger global structure but loss of fine texture. UltraFlux avoids this trade-off by jointly engineering the data distribution and the generative process, enabling stable behavior across wide, square, and tall aspect ratios while preserving delicate structures that typically vanish at 4K.
At the data level, we construct a large, clean, and distribution-balanced 4K corpus with broad aspect-ratio coverage, stronger aesthetic supervision, and bilingual captions. This gives the model direct exposure to the patterns that matter most for 4K realism: rich textures and consistent human-centric scenes, categories underrepresented in existing public sources.
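Multi-aspect-ratio coverage like this is typically exploited at training time by grouping images into aspect-ratio buckets so every batch shares one resolution. A minimal sketch of such AR-aware assignment follows; the bucket list and the log-space distance are our illustration, not the paper's actual sampling configuration.

```python
import math

# Hypothetical AR buckets at a ~4K pixel budget, as (width, height) pairs.
BUCKETS = [(5120, 3072), (4608, 3584), (4096, 4096), (3584, 4608), (3072, 5120)]

def nearest_bucket(w, h, buckets=BUCKETS):
    """Assign an image to the bucket whose aspect ratio is closest in log space,
    so wide and tall deviations are penalized symmetrically."""
    ar = math.log(w / h)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - ar))
```

For example, a 4000×4100 photo would land in the square 4096×4096 bucket, while a 5000×3000 landscape maps to the widest bucket.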
On the model side, UltraFlux reinforces stable spatial reasoning at large scales, improves reconstruction of high-frequency regions without sacrificing efficiency, and rebalances the training objective so that fine-detail preservation and global composition are learned in a coordinated way. A structured two-stage training schedule further sharpens aesthetic and semantic consistency where the generative prior matters most.
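The abstract's Stage-wise Aesthetic Curriculum Learning concentrates high-aesthetic supervision on high-noise steps. One simple way to realize that idea is to skew the timestep distribution for high-aesthetic images toward high noise; the sketch below (power-law skew, `alpha`, and the convention that t→1 means high noise) is our assumption, not the paper's schedule.

```python
import random

def sample_timestep(high_aesthetic, alpha=3.0):
    """Hypothetical curriculum sampler: ordinary images draw t uniformly in
    [0, 1]; high-aesthetic images are skewed toward high-noise steps (t -> 1)
    via an inverse power transform, giving density proportional to t**(alpha-1)."""
    u = random.random()
    if high_aesthetic:
        return u ** (1.0 / alpha)  # mass concentrated near t = 1
    return u
```

With `alpha=3.0` the mean timestep for high-aesthetic images rises from 0.5 to 0.75, so their supervision lands mostly where the model's prior shapes global composition.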
Together, these components produce a system that offers:
- Reliable native 4K detail, even in dense or intricate regions.
- Improved text–image alignment without over-constraint or template-like artifacts.
- Higher aesthetic quality, matching or surpassing leading closed systems when paired with a strong prompt refiner.
- Efficient 4K generation, enabled by an enhanced F16 VAE that provides high-fidelity reconstruction while maintaining practical throughput and stable large-scale sampling.
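A back-of-envelope token count makes the F16 efficiency point above concrete. We assume 16× spatial downsampling for the F16 VAE and the 2×2 latent patchify used by Flux-style DiTs; both factors are our reading of the setup, not figures stated in this post.

```python
def dit_tokens(side, vae_down=16, patch=2):
    """Sequence length a DiT sees for a square image of the given pixel side."""
    latent = side // vae_down        # VAE latent grid side
    return (latent // patch) ** 2    # tokens after patchify

tok_f16 = dit_tokens(4096, vae_down=16)  # 16,384 tokens at 4K
tok_f8  = dit_tokens(4096, vae_down=8)   # 65,536 tokens with an F8 VAE
# Self-attention cost grows roughly quadratically with sequence length,
# so halving the latent side cuts attention cost by about 16x here.
```

Under these assumptions, a native 4K forward pass with an F16 VAE sees the same 16,384-token sequence length as a 2K pass with an F8 VAE, which is what makes 4K sampling practical.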
UltraFlux represents a step toward principled, practical native 4K generation: not by adding more compute, but by treating high-resolution synthesis as an integrated design problem.
Project: https://w2genai-lab.github.io/UltraFlux/
HF Model: https://huggingface.co/Owen777/UltraFlux-v1
Inference Code and Dataset (coming soon): https://github.com/W2GenAI-Lab/UltraFlux