
Adapting neural radiance fields (NeRFs) to dynamic scenes

Representing light and density fields as weighted sums over basis functions, whose weights vary over time, improves motion capture, texture, and lighting.

By Sameera Ramasinghe

February 26, 2024


One of the most intriguing challenges in computer vision is understanding dynamic scenes through the snapshots of a single moving camera. Imagine trying to digitally reconstruct, in 3-D, a lively street scene or the subtle movements of a dancer in full flow, all from a video or a series of snapshots taken from different angles. Such a reconstruction would make it possible to generate views from unseen camera angles, zoom in and out of the view, and create snapshots of the 3-D model at different time instances, unlocking a deeper understanding of the world around us in three dimensions.

Neural radiance fields (NeRFs), which use machine learning to map 3-D scenes to 3-D color and density fields, have become a central technology for producing 3-D models from 2-D images. Even NeRFs, however, struggle to model dynamic scenes, because the problem is highly underconstrained: for a given set of snapshots, multiple dynamic scenes may be mathematically plausible, although some of them may not be realistic.

In a paper presented at the annual meeting of the Association for the Advancement of Artificial Intelligence (AAAI), we introduce a novel approach, bandlimited radiance fields (BLIRF), that significantly advances our ability to capture and model scenes with complex dynamics. Our work not only addresses previous limitations but also opens doors to new applications ranging from virtual reality to digital preservation.

Our method factorizes time and space in dynamic scenes, allowing us to more efficiently model 3-D scenes with changing lighting and texture conditions. In essence, we treat dynamic 3-D scenes as high-dimensional time-varying signals and impose mathematical constraints on them to produce realistic solutions. In tests, we've seen improvements in motion localization and in the separation of light and density fields, both of which enhance the quality and fidelity of the 3-D models we can produce relative to existing technologies.

Bandlimited radiance fields

The radiance field of a 3-D scene can be decomposed into two types of lower-dimensional fields: light fields and density fields. The light field describes the direction, intensity, and energy of light at every point in the visual field. The density field describes the volumetric density of whatever is reflecting or emitting light at the relevant points. Together, the two fields amount to assigning each 3-D location in a scene a color value and a probability that an object is present there. Classical rendering techniques can then easily be used to create a 3-D model from this representation.
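To make the rendering step concrete, here is a minimal NumPy sketch of the volume-rendering quadrature commonly used with NeRF-style color and density samples. It is an illustration under assumptions, not code from the paper: the function name `render_ray` and the three toy samples are hypothetical.

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Composite per-sample densities and colors along one ray into a pixel color.

    densities: (N,) non-negative volumetric density at each sample
    colors:    (N, 3) RGB value at each sample
    deltas:    (N,) length of the ray segment covered by each sample
    """
    densities = np.asarray(densities, dtype=float)
    colors = np.asarray(colors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    # Probability that the ray is absorbed within each segment.
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability that the ray reaches sample i unabsorbed.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    weights = transmittance * alpha
    return weights @ colors, weights

# Toy ray: empty space, then an opaque red surface, then blue hidden behind it.
pixel, weights = render_ray(
    densities=[0.0, 1e3, 1e3],
    colors=[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    deltas=[1.0, 1.0, 1.0],
)
```

In this toy case the pixel comes out red: the first sample has zero density and contributes nothing, and the blue sample is fully occluded by the opaque red one in front of it.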

In essence, our approach models the light and density fields of a 3-D scene as bandlimited, high-dimensional signals, where “bandlimited” means that signal energy outside of particular bandwidths is filtered out. A bandlimited signal can be represented as a weighted sum of basis functions, or functions that describe canonical waveforms; the sinusoids of Fourier decompositions are the most familiar basis functions.
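As a small illustration of that idea, the sketch below writes a 1-D bandlimited signal as a weighted sum of Fourier basis functions and recovers the weights by least squares. The helper `fourier_basis`, the toy signal, and the cutoff frequency of 3 are assumptions chosen for the example; the paper works with high-dimensional fields and, as described below, learns its basis functions rather than fixing them.

```python
import numpy as np

def fourier_basis(x, max_freq):
    """Rows are basis functions sampled at x: a constant, then cos/sin pairs up to max_freq."""
    rows = [np.ones_like(x)]
    for k in range(1, max_freq + 1):
        rows.append(np.cos(2 * np.pi * k * x))
        rows.append(np.sin(2 * np.pi * k * x))
    return np.stack(rows)  # shape (2 * max_freq + 1, len(x))

x = np.linspace(0.0, 1.0, 200, endpoint=False)
# A signal with content at frequencies 1 and 3 only, so it is bandlimited to 3.
signal = 0.5 + 2.0 * np.sin(2 * np.pi * x) - 1.5 * np.cos(2 * np.pi * 3 * x)

basis = fourier_basis(x, max_freq=3)
# Least-squares weights such that weights @ basis approximates the signal.
weights, *_ = np.linalg.lstsq(basis.T, signal, rcond=None)
reconstruction = weights @ basis
```

Because the signal has no energy above the cutoff, seven weights (one constant plus three cosine/sine pairs) are enough to reproduce all 200 samples.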

Imagine that the state of the 3-D scene changes over time due to the dynamics of the objects within it. Each state can be reconstructed as a unique weighted sum of a particular set of basis functions. By treating the weights as functions of time, we can obtain a time-varying weighted sum, which we use to reconstruct the state of the 3-D scene.
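The toy sketch below shows what a time-varying weighted sum looks like in code: a fixed set of basis functions, weights that are functions of time, and a scene state obtained by combining the two. Everything here is a stand-in for illustration. The cosine basis, the hand-written `weights_at` trajectories, and the 1-D "scene" are assumptions, not the learned quantities from the paper.

```python
import numpy as np

num_basis, num_points = 4, 50

# Hypothetical fixed basis: each row is one basis function sampled at num_points locations.
x = np.linspace(0.0, 1.0, num_points)
basis = np.stack([np.cos(np.pi * k * x) for k in range(num_basis)])

def weights_at(t):
    """Hypothetical smooth weight trajectories, one weight per basis function."""
    return np.array([1.0, np.sin(t), np.cos(t), 0.5 * t])

def scene_state(t):
    """State of the toy 1-D field at time t: a time-varying weighted sum of the basis."""
    return weights_at(t) @ basis

times = np.linspace(0.0, 1.0, 5)
states = np.stack([scene_state(t) for t in times])  # shape (5, num_points)
```

However many time steps are sampled, every state is a combination of the same four basis functions, so the whole sequence stays inside a subspace of dimension at most four.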

In our case, we learn both the weights and the basis functions end-to-end. Another key aspect of our approach is that, rather than modeling the radiance field as a whole, as NeRFs typically do, we model the light and density fields separately. This allows us to model changes in object shapes or movements and in light or texture independently.

Three waves of different wavelengths passing to three sets of functions that decompose them into axial components before aggregating them into a single scene representation. — Our approach represents the light and density fields of a dynamic 3-D scene as the weighted sums of basis functions *(b_i(t))*, whose weights vary over time.

In our paper, we also show that traditional NeRF technology, while providing exceptional results for static scenes, often falters with dynamics, conflating aspects of the signal such as lighting and movement. Our solution draws inspiration from the established field of nonrigid structure from motion (NRSFM), which has been refining our grasp of moving scenes for decades.

A large green cube labeled "W/regularization" containing five smaller blue cubes labeled S(t1) - S(t5), with arrows connecting each S(t) in the sequence to the next. Aligned with each arrow is a two-dimensional rectangle that has been wrinkled in the third dimension. — The BLIRF model can integrate robust mathematical priors from the field of nonrigid structure from motion, such as the temporal clustering of motion, which ensures that the state of the 3-D scene changes smoothly over time, along very low-dimensional manifolds.

Specifically, we integrate robust mathematical priors from NRSFM, such as the temporal clustering of motion, to restrict the scene's motion to a low-dimensional subspace. Essentially, this ensures that the state of the 3-D scene changes smoothly over time, along very low-dimensional manifolds, instead of undergoing erratic changes unlikely to occur in real-world scenarios.
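To give a feel for what such priors do, the sketch below uses two simple stand-ins: a truncated SVD that forces a sequence of states onto a low-dimensional subspace, and a finite-difference penalty that measures how smoothly the states change. These are not the temporal-clustering prior itself, and the paper imposes its priors during training rather than as a post hoc projection. The function names and the synthetic trajectory are hypothetical.

```python
import numpy as np

def project_to_subspace(states, rank):
    """Best rank-`rank` approximation of a (num_times, state_dim) trajectory via truncated SVD."""
    u, s, vt = np.linalg.svd(states, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def smoothness_penalty(states):
    """Sum of squared differences between consecutive states; small when change is smooth."""
    return float(np.sum(np.diff(states, axis=0) ** 2))

rng = np.random.default_rng(0)
times = np.linspace(0.0, 1.0, 20)
# A smooth trajectory confined to a 2-D subspace of a 30-D state space.
directions = rng.standard_normal((2, 30))
coefficients = np.stack([np.sin(2 * np.pi * times), np.cos(2 * np.pi * times)], axis=1)
smooth = coefficients @ directions
# Erratic per-time-step perturbations of the kind the priors are meant to rule out.
noisy = smooth + 0.3 * rng.standard_normal(smooth.shape)

regularized = project_to_subspace(noisy, rank=2)
```

Comparing `smoothness_penalty(noisy)` with `smoothness_penalty(regularized)` shows the effect: constraining the states to two dimensions discards most of the erratic component and pulls the trajectory back toward the smooth one.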

In our experiments, across a variety of dynamic scenes that feature complex, long-range movements, light changes, and texture changes, our framework consistently delivered models that are rich in detail and faithful to their sources. We observed reductions in artifacts, more accurate motion capture, and an overall increase in realism, with improvements in texture and lighting representation that significantly elevate the models’ quality. We tested our model in both synthetic and real-world scenarios, as can be seen in the examples below.

Synthetic scenes

A comparison of BLIRF (Ours), ground truth (GT), and several NeRF implementations on synthetic dynamic scenes.

Real-world scene

A comparison of BLIRF (Ours) and several NeRF implementations on real-world images of a cat in motion.

A grid showing six different versions of three scenes: a cat walking across a carpet; the head and torso of a young woman with a handbag in the corner behind her; and a glass table on which a pair of glasses, a laptop, and a banana sit. The grid compares BLIRF *(Ours)*, ground truth *(GT)*, and four NeRF implementations on the task of synthesizing a new view of a 3-D scene. Notably, BLIRF handles the motion of the cat in the top scene better than its predecessors.

A 4x6 grid showing six different reconstructions of scenes involving moving, brightly colored synthetic objects, such as cubes, spheres, and cones. The grid compares BLIRF *(Ours)*, ground truth *(GT)*, and several NeRF implementations on synthetic scenes involving the motion of rudimentary geometric shapes.

As we continue to refine our approach and explore its applications, we're excited about the potential to revolutionize how we interact with digital worlds, making them more immersive, lifelike, and accessible.

About the Author

Sameera Ramasinghe

Sameera Ramasinghe is an applied scientist with Amazon’s International Emerging Stores division.
