# arXiv digest — 2026-06-27

> Satyajit Ghana — Head of Engineering @ Inkers Technology
> canonical: https://ai.thesatyajit.com/arxiv/2026-06-27
> papers: 8

## ★ SatSplatDiff: Geometry-preserving generative refinement for high-fidelity satellite Gaussian Splatting

- arXiv: 2606.27223v1 — [abs](https://arxiv.org/abs/2606.27223) · [pdf](https://arxiv.org/pdf/2606.27223) · [html](https://arxiv.org/html/2606.27223v1) · [ar5iv](https://ar5iv.labs.arxiv.org/html/2606.27223)
- authors: Jiyong Kim, Shuang Song, Rongjun Qin
- categories: cs.CV

**Abstract.** Gaussian Splatting has been recently explored for satellite 3D reconstruction, demonstrating flexibility and efficiency in representing radiometrically diverse satellite scenes. However, the limited top viewpoint of satellite imagery results in insufficient supervision on building facades, leaving surface holes and degraded visual fidelity. Generative refinement, which leverages pretrained generative priors to iteratively refine and update the rendered images used as supervision targets, has recently been investigated to improve the visual fidelity of Gaussian-rendered images. However, since these models refine each view independently, the resulting images can generate hallucinations and break photo-consistency, leading to geometric degradation. To address these limitations, we propose SatSplatDiff, which aims to minimize geometric degradation prevalent in generative refinement. Building on photogrammetric DSM initialization and 2DGS-based shadow casting established in our prior work SatSplat, we first introduce monocular depth supervision and multi-scale geometric refinement to establish a geometrically accurate and well-regularized surface representation. We then apply shadow-guided generative refinement, where geometrically calculated shadow maps guide the Gaussians to maintain consistency with the underlying geometry, improving visual fidelity while reducing geometric degradation. Extensive evaluations on the IARPA2016 and DFC2019 datasets demonstrate state-of-the-art performance, reducing geometric MAE by up to 18% and improving visual fidelity (FID-CLIP) by 28-45% over existing baselines. Our method delivers up to 5x resolution enhancement with minimal hallucination and sensor-consistent appearance, demonstrating seamless cross-tile consistency and strong scalability for large-scale reconstruction. Source code is available at https://github.com/GDAOSU/SatSplatDiff

**Take.** Satellite splatting starves on building facades: the imagery is near top-down, so the sides get almost no supervision and you get holes. Bolting on a generative prior that refines each view independently hallucinates and breaks photo-consistency; the fix here is to let geometrically computed shadow maps steer the refinement so it stays consistent with the underlying geometry. Up to 18% lower geometric MAE and 28-45% better FID-CLIP is a real lift. The shadow-as-geometric-anchor trick is the part I'd steal.
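
**Sketch.** To make "geometrically computed shadow map" concrete, here is a minimal heightfield version: march from each DSM cell toward the sun and flag the cell if any terrain rises above the sun ray. This is generic shadow casting, not the paper's 2DGS-based method. The function name, the grid, and the assumption that the sun sits along the +x axis are all mine.

```python
import math

def shadow_mask(dsm, cell_size_m, sun_elevation_deg):
    """Return a boolean grid that is True where a DSM cell is in shadow.

    Assumption (illustrative only): the sun sits toward +x, i.e. increasing
    column index, so every row can be processed independently.
    """
    rise_per_cell = cell_size_m * math.tan(math.radians(sun_elevation_deg))
    mask = []
    for row in dsm:
        row_mask = []
        for c, height in enumerate(row):
            shadowed = False
            # Walk toward the sun; the sun ray gains height with every cell.
            for k in range(1, len(row) - c):
                if row[c + k] > height + k * rise_per_cell:
                    shadowed = True
                    break
            row_mask.append(shadowed)
        mask.append(row_mask)
    return mask

# Flat ground with a 2 m block at column 12, 1 m cells, sun 30 degrees above the horizon.
dsm = [[0.0] * 12 + [2.0] + [0.0] * 3]
mask = shadow_mask(dsm, cell_size_m=1.0, sun_elevation_deg=30.0)
shadowed_columns = [c for c, s in enumerate(mask[0]) if s]
print(shadowed_columns)  # → [9, 10, 11]
```

The block casts a shadow of 2 / tan(30°) ≈ 3.46 m, so three 1 m cells on the side away from the sun are flagged. A mask like this depends only on geometry and sun angle, which is what makes it usable as an anchor for a generative refiner.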

## UAV-MapFusion: RTK-Aligned Uncertainty-Aware Coarse-to-Fine Multi-Session UAV Mapping

- arXiv: 2606.26928v1 — [abs](https://arxiv.org/abs/2606.26928) · [pdf](https://arxiv.org/pdf/2606.26928) · [html](https://arxiv.org/html/2606.26928v1) · [ar5iv](https://ar5iv.labs.arxiv.org/html/2606.26928)
- authors: Feng Pan, Chunran Zheng, Bing Xue, Yukang Cui, Jiayu Wen, Zhiyu Chen, Wei Wang
- categories: cs.RO, cs.SI

**Abstract.** Large-scale point cloud maps are essential for robotics and spatial intelligence tasks. UAVs provide an efficient means for large-scale map acquisition; however, due to limited flight endurance and onboard storage, mapping a large-scale scene within a single flight remains difficult. Existing multi-session map merging methods can extend the mapping range, yet in UAV scenarios they still struggle to simultaneously suppress long-range drift and preserve local geometric accuracy. To address this issue, an uncertainty-aware multi-session point cloud map merging and coarse-to-fine optimization system is proposed. The proposed method first performs initial multi-session map merging based on a scene graph, and then incorporates RTK observations through an RTK spatiotemporal alignment module, where temporal offsets are estimated using Dynamic Time Warping (DTW), and continuous RTK constraints are recovered using Multi-Output Gaussian Processes (MOGP) under incomplete sampling and frame dropouts. On this basis, a unified uncertainty-aware factor graph is constructed, and local geometric accuracy is further improved through iterative plane-factor refinement. Experiments on real-world datasets validate the effectiveness and robustness of the proposed method. To facilitate further research and development in the community, our code and dataset will be publicly released.

**Take.** The unglamorous production problem stated plainly: one flight can't map a site, so you stitch sessions and fight long-range drift without smearing local geometry. DTW for the RTK temporal offset, plus multi-output GPs to recover continuous RTK constraints through incomplete sampling and frame dropouts, is a sane way to treat RTK as a soft constraint instead of trusting it blindly. The uncertainty-aware factor graph over the merge is the right substrate; I'd push on how the plane-factor refinement holds up across session seams.
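
**Sketch.** The abstract says temporal offsets are estimated with DTW but does not say how the offset is read off the alignment. A minimal sketch, assuming two equally sampled 1D signals (for example along-track distance from odometry and from RTK) and taking the median index lag along the warping path as the offset. That estimator, the function names, and the toy signals are my assumptions, not the authors' pipeline.

```python
from statistics import median

def dtw_path(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping; returns the optimal path."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], 0), (cost[i - 1][j], 1), (cost[i][j - 1], 2))[1]
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path

def estimate_lag_samples(a, b):
    """Median of (j - i) along the DTW path: how many samples b trails a."""
    return median(j - i for i, j in dtw_path(a, b))

# Toy data: the second stream reports the same motion three samples late.
odom = [10.0 + 0.5 * i for i in range(40)]
rtk = [10.0 + 0.5 * (j - 3) for j in range(40)]
lag = estimate_lag_samples(odom, rtk)
print(lag)  # → 3
```

The median is used because DTW pins both endpoints of the path, so the first and last few matches carry the wrong lag. Multiply the lag by the sample period to get seconds.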

## OctoSense: Self-Supervised Learning for Multimodal Robot Perception

- arXiv: 2606.27317v1 — [abs](https://arxiv.org/abs/2606.27317) · [pdf](https://arxiv.org/pdf/2606.27317) · [html](https://arxiv.org/html/2606.27317v1) · [ar5iv](https://ar5iv.labs.arxiv.org/html/2606.27317)
- authors: Anthony Bisulco, Jeremy Wang, Kostas Daniilidis, Randall Balestriero, Pratik Chaudhari
- categories: cs.CV, cs.RO

**Abstract.** We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies and noise. Our approach, a "late-fusion" masked autoencoder, (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they come. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. See our project page for links to the dataset, code, and supplementary videos: https://abisulco.com/octosense/.

**Take.** Sensor fusion as a late-fusion masked autoencoder with per-modality tokenizers. The part that matters for shipping: it caches modality tokens at inference, so new measurements stream in instead of everything being re-encoded. 6.68 ms on a 5090, 112 ms on an Orin NX, and it keeps predicting robustly at night or when sensor data is degraded. That edge-latency number is what separates a deployable perception stack from a benchmark entry.
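
**Sketch.** The caching idea in miniature: keep the latest tokens per modality and re-tokenize only the modality that just produced a measurement. The class, the stand-in tokenizers, and the fixed modality ordering are hypothetical; the abstract does not describe the actual cache layout.

```python
class ModalityTokenCache:
    """Holds the most recent tokens per modality (illustrative, not the paper's code)."""

    def __init__(self, tokenizers):
        self.tokenizers = tokenizers                  # name -> callable(measurement) -> list of tokens
        self.tokens = {}                              # name -> latest tokens for that modality
        self.calls = {name: 0 for name in tokenizers}

    def update(self, modality, measurement):
        # Only the modality that received new data is re-tokenized.
        self.tokens[modality] = self.tokenizers[modality](measurement)
        self.calls[modality] += 1

    def fused_input(self):
        # Late fusion: concatenate whatever is cached, in a fixed modality order.
        fused = []
        for name in sorted(self.tokens):
            fused.extend(self.tokens[name])
        return fused

# Stand-in tokenizers: real ones would be learned, modality-specific encoders.
tokenizers = {
    "camera": lambda frame: [("camera", px) for px in frame],
    "imu": lambda sample: [("imu", sample)],
}
cache = ModalityTokenCache(tokenizers)
cache.update("camera", [0.2, 0.7])        # one slow camera frame
for reading in [0.01, 0.02, 0.03]:        # three fast IMU samples
    cache.update("imu", reading)
tokens = cache.fused_input()
print(cache.calls)  # → {'camera': 1, 'imu': 3}
```

The camera tokenizer runs once while the IMU tokenizer runs three times, and every fused input still contains both modalities. That is where the latency saving comes from when sensors run at different rates.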

## PanoImager: Geometry-Guided Novel View Synthesis and Reconstruction from Sparse Panoramic Views

- arXiv: 2606.27071v1 — [abs](https://arxiv.org/abs/2606.27071) · [pdf](https://arxiv.org/pdf/2606.27071) · [html](https://arxiv.org/html/2606.27071v1) · [ar5iv](https://ar5iv.labs.arxiv.org/html/2606.27071)
- authors: Zhisong Xu, Takeshi Oishi
- categories: cs.CV

**Abstract.** Panoramic sensing offers wide field-of-view coverage, yet 3D reconstruction from sparse panoramas remains challenging under rotation-dominant, weak-parallax motion. In such regimes, SfM/SLAM initialization is often ill-conditioned and unreliable. We present PanoImager, an SfM-free framework that combines feed-forward pose/depth priors, geometry-conditioned diffusion view completion, and depth-guided 3DGS optimization. Given only a few panoramic images, PanoImager decomposes them into local perspective views, synthesizes auxiliary observations to enrich sparse evidence, and stabilizes Gaussian optimization for improved cross-view consistency. Experiments on multiple benchmarks show improved stability under extreme sparsity, suggesting PanoImager as an offline/background component for map refinement when SfM/SLAM fails to initialize.

**Take.** When motion is rotation-dominant with weak parallax, SfM/SLAM initialization is ill-conditioned and just falls over — so this is SfM-free on purpose. Decompose panoramas into local perspective views, synthesize auxiliary views with a geometry-conditioned diffusion model, then stabilize 3DGS with depth guidance. Positioned honestly as an offline map-refinement fallback for exactly the regime where the online pipeline can't get off the ground.
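
**Sketch.** "Decompose panoramas into local perspective views" usually means resampling an equirectangular image through virtual pinhole cameras. A minimal sketch of that pixel mapping, assuming an equirectangular panorama, a camera with x right, y down, z forward, and positive pitch looking up. The function and its conventions are mine; the abstract does not specify the projection.

```python
import math

def perspective_to_equirect(u, v, view_w, view_h, fov_deg,
                            yaw_deg, pitch_deg, pano_w, pano_h):
    """Map a pixel of a virtual pinhole view to equirectangular pixel coordinates."""
    focal = 0.5 * view_w / math.tan(math.radians(fov_deg) / 2.0)
    # Ray through the pixel in the view's own frame (x right, y down, z forward).
    x, y, z = u - view_w / 2.0, v - view_h / 2.0, focal
    pitch, yaw = math.radians(pitch_deg), math.radians(yaw_deg)
    # Rotate about the x axis (pitch), then about the vertical axis (yaw).
    y, z = y * math.cos(pitch) - z * math.sin(pitch), y * math.sin(pitch) + z * math.cos(pitch)
    x, z = x * math.cos(yaw) + z * math.sin(yaw), -x * math.sin(yaw) + z * math.cos(yaw)
    lon = math.atan2(x, z)                     # in [-pi, pi], 0 = panorama centre
    lat = math.atan2(-y, math.hypot(x, z))     # in [-pi/2, pi/2], up is positive
    pano_x = (lon / (2.0 * math.pi) + 0.5) * pano_w
    pano_y = (0.5 - lat / math.pi) * pano_h
    return pano_x, pano_y

# Centre pixel of a 90-degree view looking 90 degrees to the right of the panorama centre.
px, py = perspective_to_equirect(256, 256, 512, 512, 90.0, 90.0, 0.0, 2048, 1024)
```

Looping this over every pixel of each virtual view, then sampling the panorama at the returned coordinates, yields the perspective crops. A yaw of 90° lands the view centre three quarters of the way across the panorama, on the horizon row.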

## ★ RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation

- arXiv: 2606.27345v1 — [abs](https://arxiv.org/abs/2606.27345) · [pdf](https://arxiv.org/pdf/2606.27345) · [html](https://arxiv.org/html/2606.27345v1) · [ar5iv](https://ar5iv.labs.arxiv.org/html/2606.27345)
- authors: Minghao Yin, Jiahao Lu, Wenbo Hu, Wang Zhao, Shan Ying, Kai Han
- categories: cs.CV

**Abstract.** Modern video diffusion transformers position their tokens through RoPE on the (u,v,t) axes -- a description of the camera's sampling grid that says nothing about the 3D structure of the scene. We observe that the geometric relation between two camera rays is captured by the Plücker reciprocal product, which is bilinear in the two rays -- the same algebraic form as the dot product in Transformer attention. Building on this analogy, we propose RayPE, a positional-encoding extension that injects per-token 6D Plücker coordinates additively into the queries and keys of self-attention, with a query/key flip arrangement under which the symmetric identity configuration coincides exactly with the reciprocal product. Because the injection is additive, the resulting attention score decomposes into a content term, a geometry term, and two content-geometry cross-terms -- all of which our experiments find individually necessary. To make the encoding stable across video data with heterogeneous camera-translation scales (SfM, deep SLAM, metric), we further decouple ray direction from moment magnitude, gate the encoding by a learned function of the log-magnitude, and apply RMSNorm to align it with the QKNorm-normalized content branch. The full module adds less than 0.1% parameters to a pretrained video DiT, is zero-initialized to start from the pretrained weights, and improves camera controllability, cross-frame 3D consistency, and overall video quality on a four-dataset training mixture.

**Take.** The clean idea of the batch. Video DiTs position tokens with RoPE over (u,v,t) — the camera's sampling grid, which says nothing about 3D structure. They notice the Plücker reciprocal product between two rays is bilinear in the rays, the same algebraic form as the attention dot product, and inject 6D Plücker coordinates additively into Q and K. Under 0.1% extra params, zero-initialized so it starts exactly from the pretrained model. Geometry-as-positional-encoding is a genuinely nice unification.
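
**Sketch.** The identity behind the analogy is easy to check by hand. A ray with origin o and direction d has Plücker coordinates (d, m) with moment m = o × d, and the reciprocal product of two rays is d1·m2 + d2·m1, which vanishes when the rays are coplanar. Swapping the two halves of one ray turns that into an ordinary 6D dot product. The code below shows only this identity; the paper's learned projections, gating, and RMSNorm are not modelled, and the helper names are mine.

```python
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def plucker(origin, direction):
    """6D Plücker coordinates (d, m) with moment m = origin x direction."""
    return tuple(direction) + cross(origin, direction)

def reciprocal_product(r1, r2):
    d1, m1 = r1[:3], r1[3:]
    d2, m2 = r2[:3], r2[3:]
    return dot(d1, m2) + dot(d2, m1)

def flip(r):
    """Swap the direction and moment halves: (d, m) -> (m, d)."""
    return r[3:] + r[:3]

ray_a = plucker((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))    # the x axis
ray_b = plucker((1.0, -1.0, 0.0), (0.0, 1.0, 0.0))   # crosses the x axis at (1, 0, 0)
ray_c = plucker((0.0, 0.0, 1.0), (0.0, 1.0, 0.0))    # skew to the x axis

print(reciprocal_product(ray_a, ray_b))  # → 0.0
print(reciprocal_product(ray_a, ray_c))  # → -1.0
print(dot(ray_a, flip(ray_c)))           # → -1.0
```

The last line is the point: a plain dot product between one ray and the flipped other ray reproduces the reciprocal product. So an attention score q·k can carry this geometric term once the ray coordinates are added to queries and keys with the halves swapped on one side.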

## Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE

- arXiv: 2606.26938v1 — [abs](https://arxiv.org/abs/2606.26938) · [pdf](https://arxiv.org/pdf/2606.26938) · [html](https://arxiv.org/html/2606.26938v1) · [ar5iv](https://ar5iv.labs.arxiv.org/html/2606.26938)
- authors: Haoyou Deng, Keyu Yan, Chaojie Mao, Xiang Wang, Yu Liu, Changxin Gao, Nong Sang
- categories: cs.CV

**Abstract.** Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling diffusion models in visual generation. Recent advancements have focused on adaptively allocating computational resources across diverse tokens to improve efficiency and performance. However, we identify a routing assignment problem in existing diffusion MoE frameworks: the router fails to accurately allocate more computational resources to salient tokens. Our analysis attributes this failure to the router's reliance on noise-corrupted latent features throughout the denoising process. Such stochastic noise obscures the critical structural and textural information, thereby preventing the router from effectively distinguishing salient tokens. To address this, we propose SharpMoE, a post-training framework with a saliency-harnessing accurate routing mechanism, which utilizes clean latent features as a noise-free guidance signal for routing. By bypassing the noise-distorted inputs, SharpMoE provides the router with clear saliency guidance, enabling the identification of salient tokens even in high-noise stages. Furthermore, we introduce a trajectory routing loss to constrain the compute allocation throughout the multi-step denoising trajectory, ensuring precise resource allocation along the generation rollout. Extensive experiments demonstrate that SharpMoE serves as a versatile, plug-and-play solution that further enhances the pretrained, converged MoE models, achieving state-of-the-art performance in visual generation.

**Take.** A specific, believable failure: a diffusion MoE router reads noise-corrupted latents during denoising, so it can't tell which tokens are salient and mis-allocates compute. The fix routes on the clean latent as a noise-free guidance signal, with a trajectory loss constraining allocation along the denoising rollout. Plug-and-play post-training on an already-converged MoE, with no retraining from scratch, is the appealing part.

## PhysiFormer: Learning to Simulate Mechanics in World Space

- arXiv: 2606.27364v1 — [abs](https://arxiv.org/abs/2606.27364) · [pdf](https://arxiv.org/pdf/2606.27364) · [html](https://arxiv.org/html/2606.27364v1) · [ar5iv](https://ar5iv.labs.arxiv.org/html/2606.27364)
- authors: Yiming Chen, Yushi Lan, Andrea Vedaldi
- categories: cs.CV

**Abstract.** We present PhysiFormer, a diffusion transformer for physically-plausible 3D object motion. Unlike video world models that operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities, as well as object material type, rigid or elastic, the model samples future vertex trajectories. While related neural physics approaches build on ad-hoc latent spaces or explicitly enforce rigidity and causality, PhysiFormer shows that excellent results can be obtained without any such inductive biases, by casting vertex trajectory prediction as a single denoising diffusion process directly in world coordinates. The probabilistic formulation captures uncertainty in the learned dynamics, enabling diverse plausible futures from initial conditions, making this framework potentially useful for applications with unobserved uncertainty. The model features attention factorised over time, space, and objects for efficiency, enabling permutation-invariant multi-object reasoning without needing explicit object encoding. Trained on over 100k simulated trajectories, PhysiFormer generates rigid and elastic mechanics, and generalises to mixed-material settings, unseen real-world geometries, and larger object counts. It substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency. Our results position coordinate-space diffusion as a promising step toward view-invariant, geometry-aware world modelling for robotics, graphics, and physical design. Visualisations, code, and models are available at https://yimingc9.github.io/physiformer.

**Take.** Predict vertex trajectories with a single denoising diffusion process directly in world coordinates, with no view-dependent pixel space and no hard-coded rigidity or causality, and with attention factorized over time, space, and objects. The bet is that you recover rigid and elastic mechanics without baking the physics in, and the probabilistic formulation gives you diverse plausible futures. World-space rather than pixel-space is the right call for anything feeding robotics or design.
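
**Sketch.** A rough count of why factorizing attention matters here, assuming the common axis-wise scheme in which each token attends along one axis at a time (time, then vertices, then objects). The abstract does not spell out the exact scheme, and the sizes below are made up for illustration.

```python
def attention_pairs(frames, vertices, objects):
    """Count query-key pairs for full vs axis-wise factorised attention."""
    tokens = frames * vertices * objects
    full = tokens * tokens                                # every token attends to every token
    factorised = tokens * (frames + vertices + objects)   # one pass per axis
    return tokens, full, factorised

# Hypothetical sizes: 32 frames, 256 vertices per object, 4 objects.
tokens, full, factorised = attention_pairs(frames=32, vertices=256, objects=4)
print(tokens, full, factorised)  # → 32768 1073741824 9568256
```

Under these assumptions the factorized version scores roughly 112 times fewer pairs per layer, and the gap widens as object count grows.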

## Sculpting NeRF Geometry: Human-Preference Fine-Tuning of a 3D-Aware Face GAN

- arXiv: 2606.27305v1 — [abs](https://arxiv.org/abs/2606.27305) · [pdf](https://arxiv.org/pdf/2606.27305) · [html](https://arxiv.org/html/2606.27305v1) · [ar5iv](https://ar5iv.labs.arxiv.org/html/2606.27305)
- authors: Archer Moore, Mingming Gong, Liam Hodgkinson
- categories: cs.CV

**Abstract.** Reinforcement learning from human feedback (RLHF) for 3D generation is now established across a number of works, but most existing pipelines optimise explicit surface representations, often by converting radiance fields into meshes and training heavily on surface-supervised data. We instead fine-tune a pretrained 3D-aware generative model directly from a learned reward over radiance-field density ($σ$) values, with no externally supplied mesh or shape prior. The reward model requires no pretraining, trains easily on a small set of preference samples, and yields robust improvement in 3D geometry. Working on an unconditional 3D-aware face GAN (EG3D), our reward reads the continuous 3D density field of the neural radiance field (NeRF) directly and supplies a geometry-only learning signal, requiring neither text conditioning, mesh extraction, nor multi-view rendering. A density-consistency constraint keeps the 2D appearance qualitatively similar while the geometry is reshaped, at a measurable but bounded distributional cost (FID-50k rises from 4.09 to 6.66): the fine-tuned generator, trained from the preferences of a single annotator as a proof of concept, produces face geometries preferred by users in 74.4% of pairwise comparisons.

**Take.** RLHF that reads the NeRF density field directly and supplies a geometry-only reward, with no mesh extraction, no multi-view rendering, and no text conditioning, and with a reward model that trains on a small set of preference samples from a single annotator. FID-50k rises from 4.09 to 6.66 (the bounded appearance cost is stated honestly) while the geometry is preferred in 74.4% of pairwise comparisons. Optimizing the continuous density field instead of a surface proxy is the move worth noting.