LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation


1S-Lab, Nanyang Technological University, Singapore
2Wangxuan Institute of Computer Technology, Peking University
3Shanghai AI Laboratory

LN3Diff creates high-quality 3D object meshes from text within 8 V100-seconds.

Abstract

The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Although 2D diffusion has achieved great success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and a variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D-aware latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. By training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in inference speed, requiring no per-instance optimization. LN3Diff thus represents a significant advancement in 3D generative modeling and holds promise for a variety of 3D vision and graphics applications.
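The speed claim follows from the fully feed-forward pipeline: sampling runs in the compact latent space and decoding is a single forward pass, with no per-instance optimization. Below is a minimal PyTorch sketch of this inference path; every name (DenoiserStub, TriplaneDecoderStub, text_to_3d) and all shapes are hypothetical stand-ins for the paper's components, not the released LN3Diff API.

import torch
import torch.nn as nn

LATENT_SHAPE = (4, 32, 32)  # illustrative (C, H, W) of the compact latent

class DenoiserStub(nn.Module):
    """Placeholder latent-diffusion denoiser: predicts the added noise."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(LATENT_SHAPE[0], LATENT_SHAPE[0], 3, padding=1)

    def forward(self, z_t, t, text_emb):
        # A real denoiser would also condition on timestep t and text_emb.
        return self.net(z_t)

class TriplaneDecoderStub(nn.Module):
    """Placeholder transformer decoder: latent to high-res tri-plane."""
    def __init__(self):
        super().__init__()
        self.up = nn.ConvTranspose2d(LATENT_SHAPE[0], 3 * 32, 4, stride=4)

    def forward(self, z):
        return self.up(z)  # (B, 96, 128, 128): three 32-channel planes

@torch.no_grad()
def text_to_3d(denoiser, decoder, text_emb, steps=50):
    """Feed-forward inference: ancestral DDPM sampling in the latent
    space, then one decoding pass; no per-instance optimization."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    z = torch.randn(1, *LATENT_SHAPE)
    for t in reversed(range(steps)):
        eps = denoiser(z, t, text_emb)
        z = (z - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / (1.0 - betas[t]).sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return decoder(z)  # tri-plane; mesh extraction (marching cubes) omitted

planes = text_to_3d(DenoiserStub(), TriplaneDecoderStub(), text_emb=None)
print(planes.shape)  # torch.Size([1, 96, 128, 128])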

The overall architecture of LN3Diff. We first learn a 3D-aware latent space: a monocular image is encoded into a KL-regularized latent. The encoded 3D latent is decoded by a 3D-aware DiT and up-sampled into a high-resolution tri-plane for rendering supervision. In the second stage, we perform efficient conditional diffusion learning over this compact latent space.
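A condensed sketch of the stage-1 objective, assuming a standard KL-regularized VAE trained with a differentiable renderer; encoder, decoder, and render below are simplified stand-ins for the paper's components, not its actual interfaces.

import torch
import torch.nn.functional as F

def vae_training_step(encoder, decoder, render, image, cams, target_views,
                      kl_weight=1e-4):
    """One stage-1 step. The encoder maps a monocular image to posterior
    parameters, the decoder lifts the sampled latent to a tri-plane, and
    render is a differentiable renderer (all illustrative callables)."""
    mu, logvar = encoder(image)                            # posterior q(z|x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
    planes = decoder(z)                                    # latent to tri-plane
    renders = render(planes, cams)                         # novel-view renders
    rec = F.mse_loss(renders, target_views)                # photometric loss
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_weight * kl                            # KL-regularized loss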

Text-to-3D on Objaverse

Efficient conditional diffusion training on the 3D latent space; a minimal training-step sketch follows the prompt gallery below.

An 18th century cannon.
A voxelized dog.
A UFO space aircraft.
A blue plastic chair.
A wooden worktable.
A four wheeled armored vehicle.
A standing dog.
A cute toy cat.
A sailboat with mast.
A wooden clothes case.
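As referenced above, here is a minimal sketch of one stage-2 training step, assuming a standard epsilon-prediction objective on latents produced by the frozen stage-1 encoder; the linear schedule and the conditioning interface are simplified assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, z0, text_emb, num_steps=1000):
    """One stage-2 step: epsilon-prediction loss on latents z0 from the
    frozen stage-1 encoder, conditioned on a text embedding."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, num_steps, (z0.shape[0],))        # random timesteps
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = ab.sqrt() * z0 + (1.0 - ab).sqrt() * noise       # forward diffusion
    eps_pred = denoiser(z_t, t, text_emb)                  # text-conditioned
    return F.mse_loss(eps_pred, noise)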

Monocular 3D Reconstruction on Objaverse

High-fidelity 3D reconstruction with the proposed 3D-aware VAE.


Text-to-3D on ShapeNet

State-of-the-art performance on the common ShapeNet benchmark.

A green/grey Porsche 911.
An SUV truck.
A brown wooden chair.
A sofa chair with a soft pad.
A Star Wars TIE Fighter.
A Boeing 747.

Related Links

Our work is inspired by the following works:

Stable Diffusion introduces a general diffusion framework on the VAE latent space.

LRM introduces a large-scale monocular 3D reconstruction model.

BibTeX


@misc{lan2024ln3diff,
  title={LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation},
  author={Yushi Lan and Fangzhou Hong and Shuai Yang and Shangchen Zhou and Xuyi Meng and Bo Dai and Xingang Pan and Chen Change Loy},
  year={2024},
  eprint={2403.12019},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}