Ryan Tabrizi and Alberto Hojel
CS 280 FA23
NOTE: Full website with rendered videos can be found here. Otherwise, please click on the AWS links below. Thank you!
Neural Radiance Fields (Mildenhall et al. 2020) provide a rich opportunity to model and interact with the 3D world from 2D images.
In a step towards making these representations semantically meaningful, Language Embedded Radiance Fields (Kerr et al., 2023), or LERF, enable language grounding through the use of CLIP embeddings at varying scales and views.
However, traditional LERFs, despite their innovative strides, grapple with computational intensity and limited real-time applicability. Our research analyzes two recent advancements in this domain, seeking to enhance both efficiency and semantic precision, and benchmarks them qualitatively and quantitatively: Gaussian Splatting and AlphaCLIP. Gaussian Splatting, which replaces the multi-layer perceptron (MLP) in a NeRF with an array of anisotropic Gaussians, promises to expedite training and rendering while maintaining, if not enhancing, visual fidelity. AlphaCLIP, on the other hand, aims to refine semantic accuracy at finer scales by using an alpha mask to direct attention within images.
We explore these advancements as potential key improvements to the conventional LERF framework. We examine how Gaussian Splatting optimizes performance, delivering faster rendering times and clearer images, and scrutinize AlphaCLIP's role in achieving more focused semantic embeddings. Through comparative analysis and error evaluations, we chart a path for future development in 3D scene understanding, guided by efficient language supervision.
The Language Embedded Radiance Fields (LERF) methodology integrates CLIP (Contrastive Language–Image Pretraining) embeddings into the weights of a multi-layer perceptron (MLP) to model semantic fields in 3D space. This process begins with constructing a feature pyramid for each source image, where image tiles of varying sizes are encoded with the CLIP image encoder to capture semantic information at different scales. These feature pyramids then supervise an MLP that maps a three-dimensional spatial coordinate (x, y, z), together with a scale parameter, to a CLIP embedding.
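The pyramid construction can be sketched as follows. This is a minimal illustration, not LERF's actual preprocessing pipeline: the `encode` argument stands in for a real CLIP image encoder (e.g. from `open_clip`), and the default stub encoder, tile scales, and non-overlapping tiling are our own simplifications for demonstration.

```python
import numpy as np

def clip_feature_pyramid(image, scales=(0.25, 0.5, 1.0), encode=None):
    """Build a multiscale feature pyramid for one image.

    `encode` is a stand-in for CLIP's image encoder; it should map an
    image tile to a unit-norm embedding vector.
    """
    if encode is None:
        # Hypothetical stub encoder: mean color, normalized to unit length.
        def encode(tile):
            v = tile.reshape(-1, tile.shape[-1]).mean(axis=0)
            return v / (np.linalg.norm(v) + 1e-8)

    h, w = image.shape[:2]
    pyramid = {}
    for s in scales:
        tile = int(min(h, w) * s)          # tile side length at this scale
        feats = []
        for y in range(0, h - tile + 1, tile):
            for x in range(0, w - tile + 1, tile):
                feats.append(encode(image[y:y + tile, x:x + tile]))
        pyramid[s] = np.stack(feats)       # (num_tiles, embed_dim)
    return pyramid

# Example: a random 64x64 RGB image at three scales.
img = np.random.rand(64, 64, 3)
pyr = clip_feature_pyramid(img)
```

During training, these per-tile embeddings become the regression targets for the MLP at the corresponding 3D points and scales.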
In the rendering phase, LERF adapts the classical volume-rendering mechanism of Neural Radiance Fields (NeRFs), accumulating embeddings along camera rays to project them onto a two-dimensional plane. When a user inputs a textual prompt, its CLIP embedding is compared against this plane of vectors to compute a relevancy score at each pixel. This allows for the visualization of areas in the original 3D scene that are semantically related to the input prompt.
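The relevancy computation can be sketched as below, following the paired-softmax formulation from the LERF paper: each rendered embedding's similarity to the prompt is contrasted against its similarity to a set of canonical negative phrases ("object", "things", "stuff", "texture"), taking the minimum over negatives. The temperature scaling used in the paper is omitted here for brevity, and the toy inputs are our own.

```python
import numpy as np

def relevancy_map(embeds, query, negatives):
    """Compute a per-pixel relevancy score in (0, 1).

    embeds:    (H, W, D) unit-norm embeddings rendered onto the image plane
    query:     (D,) unit-norm CLIP embedding of the text prompt
    negatives: (N, D) unit-norm embeddings of canonical negative phrases
    """
    pos = embeds @ query            # (H, W) cosine similarity to the prompt
    neg = embeds @ negatives.T      # (H, W, N) similarity to each negative
    # Pairwise softmax of the prompt score against each negative score,
    # then the minimum over negatives (the most competitive negative wins).
    pair = np.exp(pos[..., None]) / (np.exp(pos[..., None]) + np.exp(neg))
    return pair.min(axis=-1)        # (H, W)

# Toy example: every pixel embedding aligns with the query, not the negative.
embeds = np.zeros((2, 2, 3)); embeds[..., 0] = 1.0
query = np.array([1.0, 0.0, 0.0])
negatives = np.array([[0.0, 1.0, 0.0]])
rel = relevancy_map(embeds, query, negatives)
```

Thresholding or colormapping `rel` then yields the relevancy heatmaps shown in LERF visualizations.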
The LERF methodology aims to enhance 3D scene understanding by integrating language models, thus enabling a more nuanced interpretation of scenes beyond traditional visual cues. This approach is particularly significant for applications where semantic context and detailed scene understanding are crucial.