[Paper Notes] SUDS: Scalable Urban Dynamic Scenes
1. What
To extend NeRFs to dynamic large-scale urban scenes, this paper introduces two key innovations: (a) it factorizes the scene into three separate hash-table data structures that encode static, dynamic, and far-field radiance fields, and (b) it makes use of unlabeled target signals consisting of RGB images (from N videos), sparse LiDAR depth, off-the-shelf self-supervised 2D descriptors (DINO), and 2D optical flow.
2. Why
Challenge: scaling NeRFs to urban scenes that are both large and dynamic, without relying on manually labeled data such as bounding boxes.
The result is claimed to be the largest dynamic NeRF built to date.
3. How
3.1 Inputs
RGB images from N videos + sparse LiDAR depth measurements + self-supervised 2D pixel descriptors (DINO) + 2D optical flow, all without manually labeled bounding boxes.
3.2 Representation
3.2.1 Scene composition and Hash tables
We use three branches to model the urban scene:
- A static branch containing non-moving topography consistent across videos.
- A dynamic branch to disentangle video-specific objects [19,29,56], moving or otherwise.
- A far-field environment map to represent far-away objects and the sky [41, Urban Radiance Fields].
As for the hash tables: given an input coordinate $(\mathbf{x}, \mathbf{d}, t, \mathrm{vid})$, we first find the surrounding voxels in each table, which we denote as $\mathbf{v}_{l,s}$, $\mathbf{v}_{l,d}$, $\mathbf{v}_{l,e}$.
The static branch makes use of 3D spatial voxels $\mathbf{v}_{l,s}$, the dynamic branch makes use of 4D spacetime voxels $\mathbf{v}_{l,d}$, and the far-field branch makes use of 3D voxels $\mathbf{v}_{l,e}$ (implemented via normalized 3D direction vectors) that index an environment map.
Then, we compute hash indices $\mathbf{i}_{l,s}$ (or $\mathbf{i}_{l,d}$, $\mathbf{i}_{l,e}$) for each corner with the following hash functions:

$$
\begin{aligned}
\mathbf{i}_{l,s} &= \operatorname{static\ hash}\big(space(\mathbf{v}_{l,s})\big)\\
\mathbf{i}_{l,d} &= \operatorname{dynamic\ hash}\big(space(\mathbf{v}_{l,d}),\, time(\mathbf{v}_{l,d}),\, \mathrm{vid}\big)\\
\mathbf{i}_{l,e} &= \operatorname{env\ hash}\big(dir(\mathbf{v}_{l,e}),\, \mathrm{vid}\big)
\end{aligned}
$$
Recall that the hash function is $h(\mathbf{x}) = \left(\bigoplus_{i=1}^{d} x_i \pi_i\right) \bmod T$, where $\bigoplus$ denotes bitwise XOR, the $\pi_i$ are large prime numbers, and $T$ is the hash table size.
Finally, we linearly interpolate the features stored at the surrounding voxel vertices to obtain the per-level feature vectors.
Notice: we add $\mathrm{vid}$ as an auxiliary input to the hash, but do not use it for interpolation (since averaging across distinct movers is unnatural). From this perspective, we leverage hashing to effectively index separate interpolating functions for each video, without a linear growth in memory with the number of videos.
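As a concrete illustration, here is a minimal NumPy sketch of this per-level hash indexing. The table sizes, function names, and the last two hashing constants are assumptions for illustration (the first three constants follow Instant-NGP); none of this is taken from the SUDS codebase.

```python
import numpy as np

# Per-dimension hashing constants; the first three follow Instant-NGP,
# the remaining two are arbitrary large odd constants for illustration.
PRIMES = np.array([1, 2654435761, 805459861, 3674653429, 2097192037],
                  dtype=np.uint64)

def spatial_hash(coords: np.ndarray, T: int) -> np.ndarray:
    """XOR-fold integer corner coordinates with per-dimension constants, then mod T."""
    coords = coords.astype(np.uint64)
    h = np.zeros(coords.shape[:-1], dtype=np.uint64)
    for i in range(coords.shape[-1]):
        h ^= coords[..., i] * PRIMES[i]
    return h % np.uint64(T)

def static_hash(v_s, T=2**19):
    # Static branch: hash only the 3D spatial corner coordinates.
    return spatial_hash(v_s, T)

def dynamic_hash(v_space, v_time, vid, T=2**19):
    # Dynamic branch: hash space, time, and the video id together, so each
    # video effectively gets its own interpolation function "for free".
    coords = np.concatenate(
        [v_space, v_time[..., None], np.full(v_time.shape + (1,), vid)], axis=-1)
    return spatial_hash(coords, T)

def env_hash(v_dir, vid, T=2**17):
    # Far-field branch: hash the discretized view direction and the video id.
    coords = np.concatenate(
        [v_dir, np.full(v_dir.shape[:-1] + (1,), vid)], axis=-1)
    return spatial_hash(coords, T)
```

The sketch only shows the index computation; in practice the indices address multi-resolution feature tables whose corner features are then interpolated as described above.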
3.2.2 From branches to images
We now establish how each branch maps the feature vectors obtained from the hash tables to the quantities used to render the image:
- Static branch

$$
\begin{aligned}
&\sigma_s(\mathbf{x}) \in \mathbb{R}\\
&\mathbf{c}_s(\mathbf{x}, \mathbf{d}, A_{\mathrm{vid}}\mathcal{F}(t)) \in \mathbb{R}^3
\end{aligned}
$$
We add a latent embedding $A_{\mathrm{vid}}\mathcal{F}(t)$, consisting of a video-specific matrix $A_{\mathrm{vid}}$ and a Fourier-encoded time index $\mathcal{F}(t)$, so that the static color can account for appearance changes across videos and over time.
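For intuition, below is a minimal PyTorch sketch of such a video-specific appearance code; the embedding dimension, frequency count, and class/function names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


def fourier_encode(t: torch.Tensor, n_freqs: int = 6) -> torch.Tensor:
    """Fourier-encode a scalar time index t in [0, 1] into (2 * n_freqs) features."""
    freqs = 2.0 ** torch.arange(n_freqs, dtype=t.dtype)    # 1, 2, 4, ...
    angles = 2 * torch.pi * t[..., None] * freqs           # (B, n_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


class AppearanceCode(nn.Module):
    """One learnable matrix A_vid per video, applied to the Fourier time code F(t)."""

    def __init__(self, num_videos: int, n_freqs: int = 6, out_dim: int = 16):
        super().__init__()
        # A: (num_videos, out_dim, 2 * n_freqs)
        self.A = nn.Parameter(0.01 * torch.randn(num_videos, out_dim, 2 * n_freqs))
        self.n_freqs = n_freqs

    def forward(self, t: torch.Tensor, vid: torch.Tensor) -> torch.Tensor:
        f = fourier_encode(t, self.n_freqs)                 # (B, 2 * n_freqs)
        return torch.einsum('boi,bi->bo', self.A[vid], f)   # (B, out_dim)
```

The resulting code is fed to the static color head alongside position and viewing direction, matching $\mathbf{c}_s(\mathbf{x}, \mathbf{d}, A_{\mathrm{vid}}\mathcal{F}(t))$.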
- Dynamic branch

$$
\begin{aligned}
&\sigma_d(\mathbf{x}, t, \mathrm{vid}) \in \mathbb{R}\\
&\rho_d(\mathbf{x}, t, \mathrm{vid}) \in [0, 1]\\
&\mathbf{c}_d(\mathbf{x}, t, \mathrm{vid}, \mathbf{d}) \in \mathbb{R}^3
\end{aligned}
$$
Because shadows play a crucial role in the appearance of urban scenes, we explicitly model a shadow field of scalar values $\rho_d \in [0, 1]$, which is used to scale down the static color $\mathbf{c}_s$.
- Far-field branch

$$
\mathbf{c}_e(\mathbf{d}, \mathrm{vid}) \in \mathbb{R}^3
$$

The environment map depends only on the viewing direction and the video id; it supplies the far-field color $\mathbf{c}_e(\mathbf{d}, \mathrm{vid})$ used as the background term in the rendering equation below.
3.2.3 Rendering
Firstly, we derive a single density and radiance value for any position by computing the weighted sum of the static and dynamic components, combined with the pointwise shadow reduction:
$$
\begin{aligned}
\sigma(\mathbf{x}, t, \mathrm{vid}) &= \sigma_s(\mathbf{x}) + \sigma_d(\mathbf{x}, t, \mathrm{vid})\\
\mathbf{c}(\mathbf{x}, t, \mathrm{vid}, \mathbf{d}) &= \frac{\sigma_s}{\sigma}\,(1 - \rho_d)\,\mathbf{c}_s(\mathbf{x}, \mathbf{d}, A_{\mathrm{vid}}\mathcal{F}(t)) + \frac{\sigma_d}{\sigma}\,\mathbf{c}_d(\mathbf{x}, t, \mathrm{vid}, \mathbf{d})
\end{aligned}
$$
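A minimal PyTorch sketch of this composition (tensor shapes and the epsilon guard are assumptions):

```python
import torch

def compose(sigma_s, c_s, sigma_d, c_d, rho_d, eps: float = 1e-8):
    """Blend static and dynamic fields into a single density and color.

    sigma_s, sigma_d: (N,) densities; c_s, c_d: (N, 3) colors; rho_d: (N,) shadow in [0, 1].
    """
    sigma = sigma_s + sigma_d
    w_s = (sigma_s / (sigma + eps)).unsqueeze(-1)   # static mixing weight
    w_d = (sigma_d / (sigma + eps)).unsqueeze(-1)   # dynamic mixing weight
    # The shadow field only darkens the static color.
    color = w_s * (1.0 - rho_d.unsqueeze(-1)) * c_s + w_d * c_d
    return sigma, color
```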
Then, similar to standard $\alpha$-blending in volume rendering, we calculate the final color $\hat{C}$:
$$
\begin{aligned}
\hat{C}(\mathbf{r}, t, \mathrm{vid}) &= \int_{0}^{+\infty} T(s)\,\sigma(\mathbf{r}(s), t, \mathrm{vid})\,\mathbf{c}(\mathbf{r}(s), t, \mathrm{vid}, \mathbf{d})\,ds + T(+\infty)\,\mathbf{c}_e(\mathbf{d}, \mathrm{vid}),\\
\text{where}\quad T(s) &= \exp\left(-\int_{0}^{s} \sigma(\mathbf{r}(u), t, \mathrm{vid})\,du\right).
\end{aligned}
$$
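Below is a minimal PyTorch sketch of the discretized (quadrature) form of this integral, with the far-field color weighted by the residual transmittance $T(+\infty)$; the sample layout and variable names follow the standard NeRF discretization and are assumptions, not the SUDS implementation.

```python
import torch

def render_ray(sigma, color, deltas, c_env):
    """Discretized volume rendering for a batch of rays.

    sigma:  (R, S)    composited densities at S samples per ray
    color:  (R, S, 3) composited colors
    deltas: (R, S)    distances between consecutive samples
    c_env:  (R, 3)    far-field environment color c_e(d, vid) per ray
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)                      # (R, S)
    # Transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)
    weights, trans_inf = trans[:, :-1] * alpha, trans[:, -1:]     # (R, S), (R, 1)
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)              # (R, 3)
    return rgb + trans_inf * c_env                                # add far-field term
```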
3.2.4 Other supervision
- Feature distillation

We add a C-dimensional output head to each of our branches to predict DINO features, which are then supervised against the features produced offline by the pretrained DINO model.
$$
\begin{aligned}
&\Phi_s(\mathbf{x}) \in \mathbb{R}^C\\
&\Phi_d(\mathbf{x}, t, \mathrm{vid}) \in \mathbb{R}^C\\
&\Phi_e(\mathbf{d}, \mathrm{vid}) \in \mathbb{R}^C
\end{aligned}
$$
These heads are computed analogously to color: $\mathbf{x}$ is passed through a grid encoding and then an MLP to obtain the final output.
However, in the code, the feature, flow, and color heads use separate grid encodings, which makes their training independent:

```python
self.encoding_feature = tcnn.Encoding(
    n_input_dims=5,
    encoding_config={
        'otype': 'SequentialGrid',
        'n_levels': num_levels,
        'n_features_per_level': features_per_level,
        'log2_hashmap_size': log2_hashmap_size,
        'base_resolution': base_resolution,
        'include_static': False,
    },
)
```
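As a rough, hedged sketch of how the distilled features could then be supervised: the feature head is volume-rendered with the same weights as color and regressed onto the precomputed DINO features. The function name, the handling of the far-field term, and the L1 choice below are assumptions.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(weights, phi_samples, trans_inf, phi_env, dino_target):
    """
    weights:     (R, S)    volume-rendering weights per ray sample
    phi_samples: (R, S, C) predicted features at the samples (static/dynamic blend)
    trans_inf:   (R, 1)    residual transmittance T(+inf) per ray
    phi_env:     (R, C)    far-field feature head output per ray
    dino_target: (R, C)    DINO features sampled at each ray's pixel (computed offline)
    """
    phi_ray = (weights.unsqueeze(-1) * phi_samples).sum(dim=1) + trans_inf * phi_env
    return F.l1_loss(phi_ray, dino_target)
```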
- Scene flow

$$
s_{t' \in [-1, 1]}(\mathbf{x}, t, \mathrm{vid}) \in \mathbb{R}^3
$$

The dynamic branch additionally predicts 3D scene flow toward the neighboring frames, which drives the flow and warping losses in the objective below.
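As a rough sketch of how this prediction can feed the warping losses, the code below advects a sample with the predicted flow and re-queries the dynamic branch at the neighboring frame; the function names and signatures are assumptions, not the SUDS API.

```python
import torch

def warp_and_requery(x, t, vid, scene_flow_fn, dynamic_fn, offset: int = 1):
    """
    x: (N, 3) sample positions, t: (N,) frame times, vid: (N,) video ids.
    scene_flow_fn(x, t, vid, offset) -> (N, 3) displacement toward frame t + offset.
    dynamic_fn(x, t, vid) -> dynamic-branch outputs queried at the warped point.
    """
    flow = scene_flow_fn(x, t, vid, offset)        # predicted 3D scene flow
    x_warped = x + flow                            # advect the sample point
    t_next = t + offset                            # neighboring frame index
    return dynamic_fn(x_warped, t_next, vid)       # radiance at the warped point
```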
3.3 Optimization
$$
\begin{aligned}
\mathcal{L} = \ &\underbrace{\left(\mathcal{L}_{c}+\lambda_{f}\mathcal{L}_{f}+\lambda_{d}\mathcal{L}_{d}+\lambda_{o}\mathcal{L}_{o}\right)}_{\text{reconstruction losses}}
+\underbrace{\left(\mathcal{L}_{c}^{w}+\lambda_{f}\mathcal{L}_{f}^{w}\right)}_{\text{warping losses}}\\
&+\lambda_{flo}\underbrace{\left(\mathcal{L}_{cyc}+\mathcal{L}_{sm}+\mathcal{L}_{slo}\right)}_{\text{flow losses}}
+\underbrace{\left(\lambda_{e}\mathcal{L}_{e}+\lambda_{d}\mathcal{L}_{d}\right)}_{\text{static-dynamic factorization}}
+\lambda_{\rho}\mathcal{L}_{\rho}.
\end{aligned}
$$
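For readability, here is a hedged sketch of how the total objective could be assembled from these terms; the dictionary keys and weight names are placeholders rather than the paper's notation or values.

```python
# Hedged sketch: assemble the total loss from individual terms.
# Keys mirror the grouping in the equation above; they are placeholders.
def total_loss(L: dict, lam: dict):
    reconstruction = (L['color'] + lam['f'] * L['feature']
                      + lam['d'] * L['depth'] + lam['o'] * L['flow2d'])
    warping = L['color_warp'] + lam['f'] * L['feature_warp']
    flow = lam['flo'] * (L['cycle'] + L['smooth'] + L['slow'])
    factorization = lam['e'] * L['entropy'] + lam['d'] * L['dynamic']
    shadow = lam['rho'] * L['shadow']
    return reconstruction + warping + flow + factorization + shadow
```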
4. Experiment
4.1 City-Scale Reconstruction
The paper does not name a public dataset for this city-scale experiment.
4.2 KITTI Benchmarks
KITTI + Virtual KITTI 2 (same evaluation setup as NSG).
4.3 Diagnostics
Flow-based warping is the single most important input, while depth is the least crucial one.