Historically, 3D scene modeling has been a labor-intensive task reserved for domain experts. While a large catalog of 3D assets exists in the public domain, the chance of finding an existing 3D scene that exactly matches a user's specifications is low. As a result, 3D designers may spend hours or even days meticulously creating individual 3D objects and arranging them into a cohesive scene. Streamlining this process while retaining control over individual components could narrow the gap between expert 3D designers and laypeople.
The landscape of 3D scene modeling has evolved recently, thanks to advances in 3D generative models. 3D-aware generative adversarial networks (GANs) have made progress in 3D object synthesis, a first step toward composing generated objects into scenes. However, GANs are typically tied to a single object category, which limits output diversity and makes scene-level text-to-3D generation difficult. In contrast, text-to-3D generation with diffusion models lets users prompt the creation of 3D objects across many categories.
Current approaches typically use a single text prompt to impose global conditioning on rendered views of a differentiable scene representation, relying on robust 2D image diffusion priors learned from large-scale internet data. While these methods can produce impressive object-centric results, they fall short when generating scenes with multiple distinct elements. Restricting user input to a single text prompt further limits control, offering no way to influence the layout of the generated scene.
To address this, researchers at Stanford University have introduced an approach to compositional text-to-image generation called “locally conditioned diffusion”. Applied to 3D, their method produces more cohesive scenes while allowing control over the size and placement of individual objects, taking text prompts and 3D bounding boxes as input.
The method applies conditional diffusion steps selectively to specific image regions, using an input segmentation mask and corresponding text prompts, so that the outputs adhere to the user-specified composition. When integrated into a text-to-3D generation pipeline based on score distillation sampling, it can produce compositional text-to-3D scenes.
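The masking idea described above can be illustrated with a minimal sketch. It assumes that each denoising step blends one noise prediction per prompt, weighted by that prompt's region mask. The function names `locally_conditioned_noise` and `toy_denoiser` are hypothetical and chosen for illustration; they are not taken from the authors' code, and the toy denoiser only stands in for a real text-conditioned diffusion model.

```python
import numpy as np


def locally_conditioned_noise(x_t, masks, prompts, predict_noise):
    """Blend per-prompt noise predictions using region masks.

    x_t           -- noisy image at the current step, shape (H, W, C)
    masks         -- list of binary (H, W) arrays, assumed to partition the image
    prompts       -- one text prompt per mask
    predict_noise -- hypothetical callable (x_t, prompt) -> noise estimate
                     with the same shape as x_t
    """
    if len(masks) != len(prompts):
        raise ValueError("expected exactly one prompt per mask")
    # Assumption of this sketch: every pixel belongs to exactly one region.
    coverage = np.sum(masks, axis=0)
    if not np.all(coverage == 1):
        raise ValueError("masks must partition the image")

    eps = np.zeros_like(x_t, dtype=float)
    for mask, prompt in zip(masks, prompts):
        # Each region keeps only the prediction conditioned on its own prompt.
        eps += mask[..., None] * predict_noise(x_t, prompt)
    return eps


def toy_denoiser(x_t, prompt):
    """Stand-in for a diffusion model: a constant that depends on the prompt."""
    return np.full_like(x_t, float(len(prompt)), dtype=float)
```

In a full sampler, the blended estimate would take the place of the single globally conditioned prediction at every denoising step. Because each prediction is still computed on the whole image, regions can stay visually consistent with one another even though each follows its own prompt.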
Specifically, the Stanford team lists the following contributions:
- Locally conditioned diffusion, a method that improves the compositional capability of 2D diffusion models.
- Camera pose sampling strategies that are key to compositional 3D generation.
- A method for compositional 3D synthesis that integrates locally conditioned diffusion into a score distillation sampling-based 3D generation pipeline.