Generating high-fidelity upper-body 3D avatars from a single input image remains a significant challenge. Current 3D avatar generation methods built on large reconstruction models are fast and produce stable body structures, but they often suffer from artifacts such as blurry textures and stiff, unnatural motion. Generative video models, in contrast, synthesize photorealistic and dynamic results, yet frequently exhibit unstable behavior, including structural errors in the body and identity drift. To address these limitations, we propose a novel approach that combines the strengths of both paradigms. Our framework employs a 3D reconstruction model to provide robust structural and appearance priors, which in turn guide a real-time autoregressive video diffusion model that performs the rendering. This design lets the model synthesize high-frequency, photorealistic details and fluid dynamics in real time, reducing texture blur and motion stiffness while avoiding the structural inconsistencies common in pure video generation. By uniting the geometric stability of 3D reconstruction with the generative capability of video models, our method produces high-fidelity digital avatars with realistic appearance and dynamic, temporally coherent motion. Experiments demonstrate that our approach significantly reduces artifacts and substantially improves visual quality over leading methods, providing a robust and efficient solution for real-time applications such as gaming and virtual reality.
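To make the rendering stage concrete, below is a minimal sketch of the autoregressive "video shading" rollout the abstract describes: each chunk of frames is denoised in a few steps, conditioned on the 3D-aware prior and on previously generated chunks. Everything here (shade_video, denoise_step, NUM_STEPS, tensor shapes) is an illustrative assumption, not the released implementation.

```python
import torch

NUM_STEPS = 4            # few denoising steps per chunk -> real time (assumed)
CHUNK, H, W = 8, 64, 64  # latent chunk length and resolution (assumed)

@torch.no_grad()
def shade_video(avatar_prior, pose_chunks, denoise_step, n_chunks=4):
    """avatar_prior: 3D-aware features from the reconstruction stage.
    pose_chunks: driving-pose chunks, one per generated chunk.
    denoise_step: a trained network x_{t-1} = f(x_t, t, prior, pose, history)."""
    history = []  # previously generated chunks: the autoregressive context
    for c in range(n_chunks):
        x = torch.randn(1, CHUNK, 4, H, W)  # start each chunk from noise
        for t in reversed(range(NUM_STEPS)):
            # The 3D prior anchors structure and identity; the history
            # anchors temporal coherence across chunk boundaries.
            x = denoise_step(x, t, avatar_prior, pose_chunks[c], history)
        history.append(x)
    return torch.cat(history, dim=1)  # full video latent

# Toy stand-in denoiser so the sketch runs end to end.
def toy_denoiser(x, t, prior, pose, history):
    return 0.9 * x  # a real model would predict and remove noise

prior = torch.randn(1, 77, 512)                       # feature shape assumed
poses = [torch.randn(1, CHUNK, 6) for _ in range(4)]  # driving poses, assumed
print(shade_video(prior, poses, toy_denoiser).shape)  # torch.Size([1, 32, 4, 64, 64])
```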
Overview of the proposed ViSA. In the first stage, we train a feed-forward transformer that regresses a 3D Gaussian avatar in a predefined canonical space from a single image, conditioned on geometric, semantic, and low-level embeddings. In the second stage, we employ an autoregressive video model as a video renderer, conditioned on the 3D-aware features from the first stage, to generate photorealistic results in real time.
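To ground the overview, here is a minimal PyTorch-style sketch of the first stage, assuming one learnable query per Gaussian and a 14-parameter layout (position, rotation, scale, opacity, color); all names, dimensions, and counts are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class GaussianAvatarRegressor(nn.Module):
    """Stage 1 (sketch): learnable queries, one per canonical-space
    Gaussian, cross-attend to the fused conditioning tokens."""
    def __init__(self, dim=512, n_gaussians=1024, n_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_gaussians, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        # 3 position + 4 rotation (quaternion) + 3 scale + 1 opacity + 3 color.
        self.head = nn.Linear(dim, 14)

    def forward(self, geo_emb, sem_emb, low_emb):
        # Fuse geometric, semantic, and low-level streams into one sequence.
        cond = torch.cat([geo_emb, sem_emb, low_emb], dim=1)
        q = self.queries.unsqueeze(0).expand(cond.size(0), -1, -1)
        feats = self.decoder(q, cond)   # 3D-aware features, reused in stage 2
        return self.head(feats), feats  # per-Gaussian parameters + features

# Toy usage: three conditioning streams of 256 tokens each.
geo, sem, low = (torch.randn(1, 256, 512) for _ in range(3))
params, feats = GaussianAvatarRegressor()(geo, sem, low)
print(params.shape)  # torch.Size([1, 1024, 14])
```

In this sketch, the decoder's query features stand in for the 3D-aware conditioning passed to the second-stage video renderer.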
Photorealistic, consistent, and controllable character animation from a single reference image. Our method generates upper-body avatars that preserve appearance fidelity across diverse poses and expressions while maintaining temporal coherence in real-time video synthesis.
All input images are in-the-wild cases generated by Gemini.
ViSA outperforms other methods in maintaining identity consistency across a range of poses while more faithfully capturing facial expressions and fine-grained details. Please view in full screen for details.
@article{yang2025visa,
title={ViSA: 3D-Aware Video Shading for Real-Time Upper-Body Avatar Creation},
author={Yang, Fan and Li, Heyuan and Li, Peihao and Yuan, Weihao and Qiu, Lingteng and Song, Chaoyue and Chen, Cheng and He, Yisheng and Zhang, Shifeng and Han, Xiaoguang and Hoi, Steven and Lin, Guosheng},
journal={arXiv preprint arXiv:2512.07720},
year={2025}
}