Condition Matters in Full-head 3D GANs
ICLR 2026
- Heyuan Li1,†,*
- Huimin Zhang1,†
- Yuda Qiu1
- Zhengwentai Sun1
- Keru Zheng1
- Lingteng Qiu2,‡
- Peihao Li2
- Qi Zuo2
- Ce Chen1
- Yujian Zheng3
- Yuming Gu4
- Zilong Dong2,✉
- Xiaoguang Han1,5,6,✉
1SSE, The Chinese University of Hong Kong, Shenzhen
2Tongyi Lab, Alibaba Inc.
3Mohamed bin Zayed University of Artificial Intelligence
4University of Southern California
5FNii-Shenzhen
6Guangdong Provincial Key Laboratory of Future Networks of Intelligence
* Work done during an internship at Tongyi Lab.
†Equal contribution.
‡ Team lead.
✉ Corresponding author.
Our full-head model, BalanceHead, introduces novel conditioning on view-invariant semantic features to generate high-fidelity and diverse 3D heads. From top to bottom, rows 1–7 show random-view renderings; rows 8–10 display multi-view renderings; rows 11–12 visualize the geometries of the corresponding 3D heads.
Abstract
Conditioning is crucial for stable training of full-head 3D-aware GANs. Without any conditioning signal, the model suffers from severe mode collapse, making it impractical to train (fig. 2(a,b)). However, a series of previous full-head 3D-aware GANs conventionally choose the view angle as the conditioning input, which biases the learned 3D full-head space along the conditional view direction. This is evident in the significant differences in generation quality and diversity between the conditional view and non-conditional views of the generated 3D heads, resulting in global incoherence across different head regions (fig. 2(d–i)). In this work, we propose to use a view-invariant semantic feature as the conditioning input, thereby decoupling the generative capability of 3D heads from the viewing direction. To construct a view-invariant semantic condition for each training image, we create a novel synthesized head image dataset: we leverage FLUX.1 Kontext to extend existing high-quality frontal face datasets to a wide range of view angles. The image CLIP feature extracted from the frontal view is then used as a shared semantic condition across all views in the extended images, ensuring semantic alignment while eliminating directional bias. This also allows supervision from different views of the same subject to be consolidated under a shared semantic condition, which accelerates training (fig. 2(c)) and enhances the global coherence of the generated 3D heads (fig. 1). Moreover, as GANs often improve more slowly in diversity once the generator has learned a few modes that fool the discriminator, our semantic conditioning encourages the generator to follow the true semantic distribution, promoting continuous learning and diverse generation. Extensive experiments on full-head synthesis and single-view GAN inversion demonstrate that our method achieves significantly higher fidelity, diversity, and generalizability.
Conditional-View Bias
(a) No conditioning leads to early mode collapse and unstable training. (b) Disabling view conditioning mid-training causes rapid collapse within 1000 kimg. (c) Semantic conditioning enables faster and more effective training. (d–i) View-conditioned models show strong directional bias and global incoherence; while conditional views are realistic, non-conditional views are distorted and inconsistent. (d,e), (f,g), and (h,i) show results from PanoHead, SphereHead, and HyPlaneHead, respectively, each conditioned on a random view.
Data Generation Pipeline
Starting from a high-quality frontal face image, FLUX.1 Kontext synthesizes the same subject across a wide range of view angles. The CLIP feature extracted from the frontal view is then attached as a shared, view-invariant semantic condition to every extended view of that subject.
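To make this concrete, the following minimal sketch shows how a single frontal CLIP embedding could be computed once and reused as the shared condition for every extended view of a subject. It assumes the public `openai/clip-vit-large-patch14` encoder from Hugging Face Transformers and a hypothetical `{subject}/frontal.png` plus `{subject}/view_*.png` directory layout; these names are illustrative, not the released pipeline.

```python
# Minimal sketch: extract one view-invariant semantic condition per subject
# from its frontal image and reuse it for every synthesized view.
# Assumptions (not the paper's released code): CLIP ViT-L/14 as the image
# encoder and a {subject}/frontal.png + {subject}/view_*.png layout.
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
).to(device).eval()

@torch.no_grad()
def frontal_condition(subject_dir: Path) -> torch.Tensor:
    """Encode the frontal image once; this embedding conditions all views."""
    image = Image.open(subject_dir / "frontal.png").convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    c_sem = encoder(**inputs).image_embeds  # (1, 768) projected CLIP feature
    return torch.nn.functional.normalize(c_sem, dim=-1)

def build_pairs(subject_dir: Path):
    """Pair every extended view with the *shared* frontal condition, so that
    supervision from all views is consolidated under one semantic label."""
    c_sem = frontal_condition(subject_dir)
    return [(view, c_sem) for view in sorted(subject_dir.glob("view_*.png"))]
```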
Semantic-conditional 3D-aware GANs
Unlike previous unconditional 3D-aware GANs and those conditioned on view angles, we propose a new class of 3D-aware GANs that use view-invariant semantic features as conditions. This design eliminates view-dependent cues from the conditioning signal, thereby breaking the correlation between generation capability and specific views and effectively addressing the prevalent directional bias in prior methods. Furthermore, it ensures that the generator learns to produce diverse outputs by aligning with the true semantic distribution of the data. By consolidating supervision across different views under a shared semantic condition, the model enforces semantic consistency across all perspectives for each generated sample, which enhances global coherence and improves training efficiency. We refer to this class of models as semantic-conditional 3D-aware GANs.
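To illustrate where the condition enters the generator, the sketch below shows a StyleGAN2-style mapping network that consumes \(c_{\text{sem}}\) in place of the camera label. The module name and layer sizes are assumptions for exposition, not the paper's implementation.

```python
# Sketch of the conditioning interface: a StyleGAN2-style mapping network
# that embeds a view-invariant semantic feature c_sem instead of camera
# parameters c_cam. Layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticConditionedMapping(nn.Module):
    def __init__(self, z_dim=512, c_dim=768, w_dim=512, num_layers=8):
        super().__init__()
        self.embed_c = nn.Linear(c_dim, w_dim)  # embeds a CLIP feature, not a pose
        layers, in_dim = [], z_dim + w_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z, c_sem):
        # The camera pose is still used for *rendering*, but it never enters
        # the latent mapping, so generation quality cannot correlate with a
        # privileged "conditional view".
        c = torch.nn.functional.normalize(self.embed_c(c_sem), dim=-1)
        return self.net(torch.cat([z, c], dim=-1))
```

Under a standard conditional-GAN objective, the discriminator would receive the same \(c_{\text{sem}}\) with both real and generated images, which is what lets supervision from different views of one subject be consolidated under a single condition.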
BalanceHead
Our BalanceHead is a semantic-conditional full-head 3D-aware GAN pipeline, shown in fig. 4. Building upon the recent state-of-the-art HyPlaneHead model, our generator employs StyleGAN2 as the backbone and uses a hy-plane representation to encode 3D head geometry. The pipeline first renders low-resolution images \(I\) via volume rendering and then applies a super-resolution module to generate high-resolution images \(I^{+}\) along with corresponding masks \(I^{m+}\). In contrast to previous methods, we condition the generator on a view-invariant semantic feature \(c_{\text{sem}}\) instead of camera view information \(c_{\text{cam}}\), and we incorporate the proposed ViCiCo loss to suppress multiple-face artifacts and ensure consistency between the generated output and the semantic condition.
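As a structural summary of this pipeline, here is a hedged sketch of the generator's forward pass. The four sub-module interfaces are assumptions chosen to mirror the description above, not the released implementation.

```python
# High-level sketch of the BalanceHead forward pass described above.
# mapping/synthesis/renderer/superres stand in for the actual modules
# (assumed interfaces, not the released implementation).
import torch.nn as nn

class BalanceHeadGenerator(nn.Module):
    def __init__(self, mapping, synthesis, renderer, superres):
        super().__init__()
        self.mapping = mapping      # semantic-conditioned mapping (see above)
        self.synthesis = synthesis  # StyleGAN2 backbone -> hy-plane features
        self.renderer = renderer    # volume rendering of the hy-plane
        self.superres = superres    # upsamples I to I+ and predicts the mask

    def forward(self, z, c_sem, cam):
        w = self.mapping(z, c_sem)        # pose never enters the latent code
        planes = self.synthesis(w)        # hy-plane 3D representation
        I_lr = self.renderer(planes, cam) # low-res render I at the query pose
        I_hr, mask = self.superres(I_lr, w)  # high-res I+ and mask I^{m+}
        return I_hr, mask, I_lr
```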
Qualitative Comparison with State-of-the-art Methods
Qualitative comparison with state-of-the-art methods. Conditioned on the front view: (a) EG3D, (b) GGHead, (c) PanoHead, (d) SphereHead. (e) HyPlaneHead conditioned on the back view. (f) Our view-conditional baseline conditioned on the back view. (g) Our view-semantic-conditional baseline conditioned on a side view. (h–n) Our BalanceHead conditioned on the view-invariant semantic condition.
Random Sampling Results
Discussion
Trained on large-scale, high-quality images, our proposed 3D full-head model exhibits strong fidelity, diversity, and generalizability. These attributes position the model not only as a potential foundation for specialized applications, such as 3D talking heads and head editing, but also as a high-fidelity 2D/3D data generator capable of supporting downstream tasks like 3D head reconstruction.
Beyond its immediate applications, this work offers a new perspective on the synergy between 2D generative priors and 3D consistency. Specifically, our results demonstrate that fully 3D-consistent representations can be effectively supervised by imperfect, 3D-inconsistent multi-view data. This suggests a potential paradigm shift: rather than relying exclusively on strictly consistent (but expensive) multi-view data or purely single-view data (which often suffers from view imbalance), researchers can leverage the vast, though slightly inconsistent, output of powerful 2D generators. A key takeaway is that future research should perhaps place greater emphasis on developing inconsistency-tolerant 3D training strategies and robust semantic-conditioned models, rather than focusing solely on the pursuit of perfect consistency in the input data. Our method serves as an initial exploration in this direction, highlighting the critical role of view-invariant semantic information in bridging the gap between 2D priors and 3D consistency.
The website template was borrowed from Michaël Gharbi and Ref-NeRF.