Recent advances in diffusion models have enabled 3D generation from a single image.
However, current methods often produce suboptimal results for novel views, with blurred textures and deviations from the reference image,
limiting their practical applications. In this paper, we introduce HiFi-123, a method designed for high-fidelity and multi-view consistent 3D generation.
Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity
of diffusion-based zero-shot novel view synthesis methods. Second, building on RGNV, we present a novel Reference-Guided State Distillation (RGSD) loss.
When incorporated into the optimization-based image-to-3D pipeline, our method significantly improves 3D generation quality,
achieving state-of-the-art performance. Comprehensive evaluations demonstrate the effectiveness of our approach over existing methods,
both qualitatively and quantitatively. Video comparisons are available on the supplementary project page. We will release our code to the public.
Illustration of the RGNV pipeline. It performs depth-based DDIM inversion and sampling on both the reference image and the coarse novel view, and uses attention injection to transfer detailed textures from the reference image to the coarse novel view.
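To make the attention-injection step more concrete, here is a minimal sketch in plain Python. It assumes one common form of injection, in which the novel-view branch keeps its own queries but attends to keys and values taken from the reference branch; the caption does not specify which tensors are injected, so this is an assumption. The names `attention`, `injected_attention`, `novel_q`, `ref_k`, and `ref_v` are hypothetical, and the toy features stand in for real diffusion U-Net activations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    width = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return outputs


def injected_attention(novel_q, ref_k, ref_v):
    """Assumed injection: novel-view queries attend to reference keys/values."""
    return attention(novel_q, ref_k, ref_v)


# Toy features: 2 novel-view tokens, 3 reference tokens, 2 channels each.
novel_q = [[1.0, 0.0], [0.0, 1.0]]
ref_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_v = [[0.2, 0.4], [0.6, 0.8], [1.0, 0.0]]

refined = injected_attention(novel_q, ref_k, ref_v)
```

Each row of `refined` is a convex combination of the reference values, which is why this kind of injection can only carry over appearance that exists in the reference image.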
We generate high-fidelity 3D content in two stages. In the coarse stage, we optimize an Instant-NGP representation using an SDS loss, a reference-view reconstruction loss, a depth loss, and a normal loss. In the refinement stage, we export a DMTet representation and supervise training with our proposed RGSD loss.
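The coarse-stage objective can be read as a weighted sum of the four named loss terms. The sketch below illustrates only that combination. The weights are hypothetical placeholders (the text does not state them), the loss values are assumed to be scalars computed elsewhere, and `CoarseLossWeights` and `coarse_stage_loss` are illustrative names rather than part of any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoarseLossWeights:
    # Hypothetical weights; the source does not give their values.
    sds: float = 1.0
    reference: float = 1.0
    depth: float = 1.0
    normal: float = 1.0


def coarse_stage_loss(sds, reference, depth, normal, weights=CoarseLossWeights()):
    """Weighted sum of the four coarse-stage loss terms (scalar inputs)."""
    return (
        weights.sds * sds
        + weights.reference * reference
        + weights.depth * depth
        + weights.normal * normal
    )


# Example with made-up scalar loss values and a down-weighted SDS term.
total = coarse_stage_loss(
    sds=1.0,
    reference=2.0,
    depth=3.0,
    normal=4.0,
    weights=CoarseLossWeights(sds=0.5),
)
```

In the refinement stage, the same pattern would apply with the RGSD loss as the supervising term.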
@article{yu2023hifi,
title={{HiFi-123}: Towards High-Fidelity One Image to {3D} Content Generation},
author={Yu, Wangbo and Yuan, Li and Cao, Yan-Pei and Gao, Xiangjun and Li, Xiaoyu
and Hu, Wenbo and Quan, Long and Shan, Ying and Tian, Yonghong},
journal={arXiv preprint arXiv:2310.06744},
year={2023}
}