[Paper Notes] SUDS: Scalable Urban Dynamic Scenes
1. What
To extend NeRFs to dynamic large-scale urban scenes, this paper introduces two key innovations: (a) it factorizes the scene into three separate hash-table data structures that encode static, dynamic, and far-field radiance fields, and (b) it makes use of unlabeled target signals consisting of RGB images (from N videos), sparse LiDAR depth, off-the-shelf self-supervised 2D descriptors (DINO), and 2D optical flow.
2. Why
Challenge: scaling NeRFs to urban scenes that are both large and dynamic, without relying on manually labeled data such as bounding boxes.
The result is claimed to be the largest dynamic NeRF built to date.
3. How
3.1 Inputs
RGB images from N videos + sparse LiDAR depth measurements + self-supervised 2D pixel descriptors (DINO) + 2D optical flow, all without manually labeled bounding boxes.
3.2 Representation
3.2.1 Scene composition and Hash tables
We use three branches to model the urban scene:
- A static branch containing non-moving topography consistent across videos.
- A dynamic branch to disentangle video-specific objects [19,29,56], moving or otherwise.
- A far-field environment map to represent far-away objects and the sky [41, Urban Radiance Fields].
As for the hash tables: given an input coordinate $(\mathbf{x}, \mathbf{d}, t, \mathrm{vid})$, we first find the surrounding voxels in each table, which we denote as $\mathbf{v}_{l,s}$, $\mathbf{v}_{l,d}$, $\mathbf{v}_{l,e}$.
The static branch makes use of 3D spatial voxels $\mathbf{v}_{l,s}$, the dynamic branch makes use of 4D spacetime voxels $\mathbf{v}_{l,d}$, and the far-field branch makes use of 3D voxels $\mathbf{v}_{l,e}$ (implemented via normalized 3D direction vectors) that index an environment map.
Then, we compute hash indices $\mathbf{i}_{l,s}$ (or $\mathbf{i}_{l,d}$, $\mathbf{i}_{l,e}$) for each corner with the following hash functions:

$$
\begin{aligned}
\mathbf{i}_{l,s} &= \operatorname{static\ hash}\big(space(\mathbf{v}_{l,s})\big)\\
\mathbf{i}_{l,d} &= \operatorname{dynamic\ hash}\big(space(\mathbf{v}_{l,d}),\, time(\mathbf{v}_{l,d}),\, \mathrm{vid}\big)\\
\mathbf{i}_{l,e} &= \operatorname{env\ hash}\big(dir(\mathbf{v}_{l,e}),\, \mathrm{vid}\big)
\end{aligned}
$$
Recall that the hash function is $h(\mathbf{x}) = \left(\bigoplus_{i=1}^{d} x_i \pi_i\right) \bmod T$, where $\bigoplus$ denotes bitwise XOR, the $\pi_i$ are large prime numbers, and $T$ is the hash table size.
Finally, we linearly interpolate the features stored at the surrounding voxel vertices to obtain the per-level feature vectors.
Notice: we add $\mathrm{vid}$ as an auxiliary input to the hash, but do not use it for interpolation (since averaging across distinct movers is unnatural). From this perspective, we leverage hashing to effectively index separate interpolating functions for each video, without a linear growth in memory with the number of videos.
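As a concrete illustration, here is a minimal NumPy sketch of this per-level hash indexing. The table sizes, function names, and the last two hashing constants are assumptions for illustration (the first three constants follow Instant-NGP); none of this is taken from the SUDS codebase.

```python
import numpy as np

# Per-dimension hashing constants; the first three follow Instant-NGP,
# the remaining two are arbitrary large odd constants for illustration.
PRIMES = np.array([1, 2654435761, 805459861, 3674653429, 2097192037],
                  dtype=np.uint64)

def spatial_hash(coords: np.ndarray, T: int) -> np.ndarray:
    """XOR-fold integer corner coordinates with per-dimension constants, then mod T."""
    coords = coords.astype(np.uint64)
    h = np.zeros(coords.shape[:-1], dtype=np.uint64)
    for i in range(coords.shape[-1]):
        h ^= coords[..., i] * PRIMES[i]
    return h % np.uint64(T)

def static_hash(v_s, T=2**19):
    # Static branch: hash only the 3D spatial corner coordinates.
    return spatial_hash(v_s, T)

def dynamic_hash(v_space, v_time, vid, T=2**19):
    # Dynamic branch: hash space, time, and the video id together, so each
    # video effectively gets its own interpolation function "for free".
    coords = np.concatenate(
        [v_space, v_time[..., None], np.full(v_time.shape + (1,), vid)], axis=-1)
    return spatial_hash(coords, T)

def env_hash(v_dir, vid, T=2**17):
    # Far-field branch: hash the discretized view direction and the video id.
    coords = np.concatenate(
        [v_dir, np.full(v_dir.shape[:-1] + (1,), vid)], axis=-1)
    return spatial_hash(coords, T)
```

The sketch only shows the index computation; in practice the indices address multi-resolution feature tables whose corner features are then interpolated as described above.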
3.2.2 From branches to images
We now establish how each branch maps the feature vectors obtained from the hash tables to the quantities used to render the image:
- Static branch

$$
\begin{aligned}
&\sigma_s(\mathbf{x}) \in \mathbb{R}\\
&\mathbf{c}_s(\mathbf{x}, \mathbf{d}, A_{\mathrm{vid}}\mathcal{F}(t)) \in \mathbb{R}^3
\end{aligned}
$$
We add a latent embedding $A_{\mathrm{vid}}\mathcal{F}(t)$, consisting of a video-specific matrix $A_{\mathrm{vid}}$ and a Fourier-encoded time index $\mathcal{F}(t)$, so that the static color can account for appearance changes across videos and over time.
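For intuition, below is a minimal PyTorch sketch of such a video-specific appearance code; the embedding dimension, frequency count, and class/function names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


def fourier_encode(t: torch.Tensor, n_freqs: int = 6) -> torch.Tensor:
    """Fourier-encode a scalar time index t in [0, 1] into (2 * n_freqs) features."""
    freqs = 2.0 ** torch.arange(n_freqs, dtype=t.dtype)    # 1, 2, 4, ...
    angles = 2 * torch.pi * t[..., None] * freqs           # (B, n_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


class AppearanceCode(nn.Module):
    """One learnable matrix A_vid per video, applied to the Fourier time code F(t)."""

    def __init__(self, num_videos: int, n_freqs: int = 6, out_dim: int = 16):
        super().__init__()
        # A: (num_videos, out_dim, 2 * n_freqs)
        self.A = nn.Parameter(0.01 * torch.randn(num_videos, out_dim, 2 * n_freqs))
        self.n_freqs = n_freqs

    def forward(self, t: torch.Tensor, vid: torch.Tensor) -> torch.Tensor:
        f = fourier_encode(t, self.n_freqs)                 # (B, 2 * n_freqs)
        return torch.einsum('boi,bi->bo', self.A[vid], f)   # (B, out_dim)
```

The resulting code is fed to the static color head alongside position and viewing direction, matching $\mathbf{c}_s(\mathbf{x}, \mathbf{d}, A_{\mathrm{vid}}\mathcal{F}(t))$.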
- Dynamic branch

$$
\begin{aligned}
&\sigma_d(\mathbf{x}, t, \mathrm{vid}) \in \mathbb{R}\\
&\rho_d(\mathbf{x}, t, \mathrm{vid}) \in [0, 1]\\
&\mathbf{c}_d(\mathbf{x}, t, \mathrm{vid}, \mathbf{d}) \in \mathbb{R}^3
\end{aligned}
$$
Because shadows play a crucial role in the appearance of urban scenes, we explicitly model a shadow field of scalar values $\rho_d \in [0, 1]$, which is used to scale down the static color $\mathbf{c}_s$.
- Far-field branch

$$
\mathbf{c}_e(\mathbf{d}, \mathrm{vid}) \in \mathbb{R}^3
$$

The environment map depends only on the viewing direction and the video id; it supplies the far-field color $\mathbf{c}_e(\mathbf{d}, \mathrm{vid})$ used as the background term in the rendering equation below.
3.2.3 Rendering
Firstly, we derive a single density and radiance value for any position by computing the weighted sum of the static and dynamic components, combined with the pointwise shadow reduction:
$$
\begin{aligned}
\sigma(\mathbf{x}, t, \mathrm{vid}) &= \sigma_s(\mathbf{x}) + \sigma_d(\mathbf{x}, t, \mathrm{vid})\\
\mathbf{c}(\mathbf{x}, t, \mathrm{vid}, \mathbf{d}) &= \frac{\sigma_s}{\sigma}\,(1 - \rho_d)\,\mathbf{c}_s(\mathbf{x}, \mathbf{d}, A_{\mathrm{vid}}\mathcal{F}(t)) + \frac{\sigma_d}{\sigma}\,\mathbf{c}_d(\mathbf{x}, t, \mathrm{vid}, \mathbf{d})
\end{aligned}
$$
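A minimal PyTorch sketch of this composition (tensor shapes and the epsilon guard are assumptions):

```python
import torch

def compose(sigma_s, c_s, sigma_d, c_d, rho_d, eps: float = 1e-8):
    """Blend static and dynamic fields into a single density and color.

    sigma_s, sigma_d: (N,) densities; c_s, c_d: (N, 3) colors; rho_d: (N,) shadow in [0, 1].
    """
    sigma = sigma_s + sigma_d
    w_s = (sigma_s / (sigma + eps)).unsqueeze(-1)   # static mixing weight
    w_d = (sigma_d / (sigma + eps)).unsqueeze(-1)   # dynamic mixing weight
    # The shadow field only darkens the static color.
    color = w_s * (1.0 - rho_d.unsqueeze(-1)) * c_s + w_d * c_d
    return sigma, color
```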
Then, similar to standard $\alpha$-blending in volume rendering, we calculate the final color $\hat{C}$:
$$
\begin{aligned}
\hat{C}(\mathbf{r}, t, \mathrm{vid}) &= \int_{0}^{+\infty} T(s)\,\sigma(\mathbf{r}(s), t, \mathrm{vid})\,\mathbf{c}(\mathbf{r}(s), t, \mathrm{vid}, \mathbf{d})\,ds + T(+\infty)\,\mathbf{c}_e(\mathbf{d}, \mathrm{vid}),\\
\text{where}\quad T(s) &= \exp\left(-\int_{0}^{s} \sigma(\mathbf{r}(u), t, \mathrm{vid})\,du\right).
\end{aligned}
$$
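Below is a minimal PyTorch sketch of the discretized (quadrature) form of this integral, with the far-field color weighted by the residual transmittance $T(+\infty)$; the sample layout and variable names follow the standard NeRF discretization and are assumptions, not the SUDS implementation.

```python
import torch

def render_ray(sigma, color, deltas, c_env):
    """Discretized volume rendering for a batch of rays.

    sigma:  (R, S)    composited densities at S samples per ray
    color:  (R, S, 3) composited colors
    deltas: (R, S)    distances between consecutive samples
    c_env:  (R, 3)    far-field environment color c_e(d, vid) per ray
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                      # (R, S)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)
    weights, trans_inf = trans[:, :-1] * alpha, trans[:, -1:]     # (R, S), (R, 1)
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)              # (R, 3)
    return rgb + trans_inf * c_env                                # add far-field term
```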
3.2.4 Other supervision
- Feature distillation

We add a C-dimensional output head to each of our branches to predict DINO features, which are then supervised against the features produced offline by the pretrained DINO model.
$$
\begin{aligned}
&\Phi_s(\mathbf{x}) \in \mathbb{R}^C\\
&\Phi_d(\mathbf{x}, t, \mathrm{vid}) \in \mathbb{R}^C\\
&\Phi_e(\mathbf{d}, \mathrm{vid}) \in \mathbb{R}^C
\end{aligned}
$$
These heads are computed analogously to color: $\mathbf{x}$ is passed through a grid encoding and then an MLP to obtain the final output.
However, in the code, the feature, flow, and color heads use separate grid encodings, which makes their training independent:

```python
self.encoding_feature = tcnn.Encoding(
    n_input_dims=5,
    encoding_config={
        'otype': 'SequentialGrid',
        'n_levels': num_levels,
        'n_features_per_level': features_per_level,
        'log2_hashmap_size': log2_hashmap_size,
        'base_resolution': base_resolution,
        'include_static': False,
    },
)
```
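As a rough, hedged sketch of how the distilled features could then be supervised: the feature head is volume-rendered with the same weights as color and regressed onto the precomputed DINO features. The function name, the handling of the far-field term, and the L1 choice below are assumptions.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(weights, phi_samples, trans_inf, phi_env, dino_target):
    """
    weights:     (R, S)    volume-rendering weights per ray sample
    phi_samples: (R, S, C) predicted features at the samples (static/dynamic blend)
    trans_inf:   (R, 1)    residual transmittance T(+inf) per ray
    phi_env:     (R, C)    far-field feature head output per ray
    dino_target: (R, C)    DINO features sampled at each ray's pixel (computed offline)
    """
    phi_ray = (weights.unsqueeze(-1) * phi_samples).sum(dim=1) + trans_inf * phi_env
    return F.l1_loss(phi_ray, dino_target)
```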
- Scene flow

$$
s_{t' \in [-1, 1]}(\mathbf{x}, t, \mathrm{vid}) \in \mathbb{R}^3
$$

The dynamic branch additionally predicts 3D scene flow toward the neighboring frames, which drives the flow and warping losses in the objective below.
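As a rough sketch of how this prediction can feed the warping losses, the code below advects a sample with the predicted flow and re-queries the dynamic branch at the neighboring frame; the function names and signatures are assumptions, not the SUDS API.

```python
import torch

def warp_and_requery(x, t, vid, scene_flow_fn, dynamic_fn, offset: int = 1):
    """
    x: (N, 3) sample positions, t: (N,) frame times, vid: (N,) video ids.
    scene_flow_fn(x, t, vid, offset) -> (N, 3) displacement toward frame t + offset.
    dynamic_fn(x, t, vid) -> dynamic-branch outputs queried at the warped point.
    """
    flow = scene_flow_fn(x, t, vid, offset)        # predicted 3D scene flow
    x_warped = x + flow                            # advect the sample point
    t_next = t + offset                            # neighboring frame index
    return dynamic_fn(x_warped, t_next, vid)       # radiance at the warped point
```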
3.3 Optimization
$$
\begin{aligned}
\mathcal{L} = \ &\underbrace{\left(\mathcal{L}_{c}+\lambda_{f}\mathcal{L}_{f}+\lambda_{d}\mathcal{L}_{d}+\lambda_{o}\mathcal{L}_{o}\right)}_{\text{reconstruction losses}}
+\underbrace{\left(\mathcal{L}_{c}^{w}+\lambda_{f}\mathcal{L}_{f}^{w}\right)}_{\text{warping losses}}\\
&+\lambda_{flo}\underbrace{\left(\mathcal{L}_{cyc}+\mathcal{L}_{sm}+\mathcal{L}_{slo}\right)}_{\text{flow losses}}
+\underbrace{\left(\lambda_{e}\mathcal{L}_{e}+\lambda_{d}\mathcal{L}_{d}\right)}_{\text{static-dynamic factorization}}
+\lambda_{\rho}\mathcal{L}_{\rho}.
\end{aligned}
$$
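For readability, here is a hedged sketch of how the total objective could be assembled from these terms; the dictionary keys and weight names are placeholders rather than the paper's notation or values.

```python
# Hedged sketch: assemble the total loss from individual terms.
# Keys mirror the grouping in the equation above; they are placeholders.
def total_loss(L: dict, lam: dict):
    reconstruction = (L['color'] + lam['f'] * L['feature']
                      + lam['d'] * L['depth'] + lam['o'] * L['flow2d'])
    warping = L['color_warp'] + lam['f'] * L['feature_warp']
    flow = lam['flo'] * (L['cycle'] + L['smooth'] + L['slow'])
    factorization = lam['e'] * L['entropy'] + lam['d'] * L['dynamic']
    shadow = lam['rho'] * L['shadow']
    return reconstruction + warping + flow + factorization + shadow
```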
4. Experiment
4.1 City-Scale Reconstruction
The paper does not name a public dataset for this city-scale experiment.
4.2 KITTI Benchmarks
KITTI + Virtual KITTI 2 (same evaluation setup as NSG).
4.3 Diagnostics
Flow-based warping is the single most important input, while depth is the least crucial one.