arXiv digest

2026-06-30 · 7 papers · curated by paper-scout

A 3D-Gaussian-heavy day, with a second clear thread in on-policy distillation. The through-line I keep noticing: a good enough scene reconstruction has stopped being a viewer and become *infrastructure* — KiloGS-SLAM tracks kilometers of it, VLK renders synthetic robot data from it, GaussDet hangs open-vocabulary semantics on it. And on the training side, three separate groups (RMMD, MOPD, DOPD) are all sharpening the same tool: distill on the policy's own rollouts, and argue about how to regularize and route the signal.

★ Robust and Efficient Monocular 3D Gaussian SLAM for Kilometer-Scale Outdoor Scenes

2606.30436

Sicheng Yu, Dongxu Shen, Beizhen Zhao, Guanzhi Ding et al. · cs.CV

Scaling monocular 3D Gaussian Splatting (3DGS) SLAM to kilometer-level outdoor environments poses two tightly coupled challenges: fragile long-term pose tracking and excessive memory overhead during large-scale mapping. In this paper, we propose KiloGS-SLAM, a highly efficient and robust monocular 3DGS-SLAM system that jointly addresses both bottlenecks. Since high-fidelity scene reconstruction fundamentally relies on drift-free camera poses, we first introduce a motion-adaptive hybrid tracking module. This module features a condition-triggered three-tier solving pipeline. It dynamically switches between Essential matrix and PnP models to handle geometric degeneracies. An on-demand foundation model can also be activated to rescue the trajectory from catastrophic drift. To ensure the system can sustain these long trajectories without memory exhaustion, we subsequently design a lifecycle-managed Gaussian mapping strategy. By integrating probabilistic initialization with chunk-based multi-view densification and pruning, this full-pipeline optimization effectively reduces primitive redundancy while preserving high-frequency details. Extensive experiments across three challenging outdoor datasets demonstrate that our approach achieves state-of-the-art tracking accuracy and rendering quality, successfully scaling to sequences of over 10,000 frames on a single GPU.

take · The two things that kill SLAM at scale are exactly the two things they go after: pose tracking drifts, and the map eats all your memory. I just spent a week inside a LiDAR-inertial filter watching a map smear because of pose drift, so the framing lands — a clean reconstruction is downstream of a drift-free trajectory, full stop. The tracking answer is a tiered solver that switches Essential-matrix↔PnP by geometry and only wakes a foundation model when the trajectory is about to diverge, which is the right cost model: cheap by default, expensive only when degenerate. The mapping answer — lifecycle-managed Gaussians with chunked densify-and-prune — is the same bounded-window discipline a LiDAR map needs, just on splats. 10,000+ frames on one GPU from a single camera is the number that matters.
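
sketch · A toy version of the "cheap by default, expensive only when degenerate" dispatch. The abstract does not give the trigger conditions, so the `FrameStats` fields, the thresholds, and the tier ordering below are my assumptions, not the paper's.

```python
from dataclasses import dataclass


@dataclass
class FrameStats:
    n_map_matches: int         # 2D-3D matches against existing map points
    n_frame_matches: int       # 2D-2D matches against the previous keyframe
    median_parallax_px: float  # median feature displacement, in pixels
    inlier_ratio: float        # inlier ratio of the last geometric solve


# Hypothetical thresholds, chosen only for illustration.
MIN_MAP_MATCHES = 30
MIN_FRAME_MATCHES = 50
MIN_PARALLAX_PX = 2.0
MIN_INLIER_RATIO = 0.4


def choose_solver(s: FrameStats) -> str:
    """Pick the cheapest tier whose preconditions hold."""
    geometry_ok = s.inlier_ratio >= MIN_INLIER_RATIO
    # Tier 1: PnP needs enough 2D-3D correspondences with the map.
    if geometry_ok and s.n_map_matches >= MIN_MAP_MATCHES:
        return "pnp"
    # Tier 2: the essential matrix needs 2D-2D matches with real parallax;
    # it degenerates when the camera barely translates.
    if (geometry_ok
            and s.n_frame_matches >= MIN_FRAME_MATCHES
            and s.median_parallax_px >= MIN_PARALLAX_PX):
        return "essential"
    # Tier 3: nothing geometric is trustworthy, so pay for the big model.
    return "foundation_model"
```

The point of the ordering is that the expensive tier is only reachable when both geometric solvers have already been ruled out for the current frame.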

abs · pdf · html · ar5iv

★ Diffusion Fine-tuning with Rewarded Moment Matching Distillation

2606.30414

Alexis Jacq, Guillaume Couairon, Valentin De Bortoli, Quentin Berthet et al. · cs.LG

Distillation and Reinforcement Learning (RL) fine-tuning are the primary pillars of diffusion post-training. While traditionally studied in isolation, the interaction between these phases remains poorly understood, and in particular how fine-tuning impacts the generative quality of distilled models. We introduce Rewarded Moment Matching Distillation (RMMD), a novel framework that simultaneously distills diffusion models and maximizes a reward function. RMMD preserves the high-fidelity "naturalness" characteristic of advanced distillation (such as 8-step Moment Matching) by adapting the sampling loop for on-policy training and repurposing the distillation loss as a proxy for integral KL regularization. By evaluating the FID-Reward Pareto fronts on ImageNet, we demonstrate that RMMD achieves superior trade-offs compared to single-step baselines (DI++) and multi-step competitors (DRaFT, HyperNoise). Finally, we apply RMMD to GenCast, a state-of-the-art weather forecasting model, to distill it while optimizing the Continuous Ranked Probability Score (CRPS) metric. The resulting distilled model achieves a 7.5x speedup while outperforming the teacher model on 93% of target weather variables, and being better calibrated.

take · Distillation and RL post-training usually fight each other: you compress the model, and then the reward fine-tune coarsens the samples. The neat move here is repurposing the distillation loss itself as the KL regularizer for the RL step, so the "stay natural" objective and the "compress" objective are the same term instead of two terms in tension. The headline I can't ignore isn't the ImageNet result, it's GenCast: a distilled weather model that's 7.5× faster and beats its own teacher on 93% of target variables while staying better calibrated. A distilled student that outperforms its teacher is the interesting regime, and that it transfers from images to a real scientific forecaster is the part worth tracking.
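
sketch · How I read the "same term" claim, written out. The first line is the textbook KL-regularized reward objective, not something quoted from the paper; the second is my paraphrase of the abstract, where the distillation loss stands in for the KL term. The weight $\beta$ and the symbol $\mathcal{L}_{\mathrm{distill}}$ are my notation and may not match the paper's.

```latex
% Standard KL-regularized reward fine-tuning (generic form):
\max_{\theta}\; \mathbb{E}_{x \sim p_\theta}\big[ r(x) \big]
  \;-\; \beta\, \mathrm{KL}\big( p_\theta \,\|\, p_{\mathrm{teacher}} \big)

% Reading of the abstract: the distillation loss acts as the regularizer,
% evaluated on samples drawn from the student itself (on-policy):
\min_{\theta}\; -\,\mathbb{E}_{x \sim p_\theta}\big[ r(x) \big]
  \;+\; \beta\, \mathcal{L}_{\mathrm{distill}}(\theta)
```

Sweeping $\beta$ is presumably what traces out the FID-Reward Pareto front the abstract mentions, though that is an inference on my part.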

abs · pdf · html · ar5iv

StereoGS: Sparse-View 3D Gaussian Splatting via Stereo Priors

2606.30545

Wenhao Yuan, Yiyuan Ge, Deli Cai · cs.CV

3D Gaussian Splatting (3DGS) has achieved remarkable success in real-time novel view synthesis, yet it suffers from severe overfitting under sparse-view settings due to insufficient geometric constraints. While recent methods introduce monocular depth priors to mitigate this, they inherently struggle with scale ambiguity and cross-view inconsistency, leading to defective geometry. In this paper, we propose StereoGS, a novel sparse-view 3DGS framework that integrates stereo priors to establish reliable binocular consistency. Unlike scale-agnostic monocular constraints, StereoGS introduces a Stereo Depth Regularization by constructing virtual stereo pairs during optimization and leveraging a foundation stereo model to enforce absolute scale and binocular-consistent structures. To further suppress overfitting and eliminate redundant primitives, we design a Gradient-Aware Opacity Decay strategy that dynamically penalizes Gaussians based on their relative opacity gradient magnitudes. Combined with a Consistency-Aware Dense Initialization using zero-shot multi-view depth estimation, StereoGS effectively anchors primitives to accurate scene surfaces. Extensive experiments on LLFF, DTU, Mip-NeRF360, and Blender datasets demonstrate that StereoGS achieves state-of-the-art performance in sparse-view settings without incurring any additional inference overhead.

take · Monocular depth priors give you shape but not scale, and the scale ambiguity is exactly what wrecks sparse-view geometry. Manufacturing virtual stereo pairs during optimization and leaning on a foundation stereo model to pin absolute scale is a clean way to import the one constraint mono depth can't provide. The opacity-decay-by-gradient trick is the part I'd reuse — it's a principled way to kill redundant Gaussians instead of the usual opacity threshold heuristics. No inference overhead because all the extra machinery lives in training.
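
sketch · One way a gradient-aware opacity decay could look. The abstract only says Gaussians are penalized "based on their relative opacity gradient magnitudes"; the direction (decay low-gradient Gaussians harder), the linear schedule, and both constants are my assumptions.

```python
def decay_opacities(opacities, grad_mags, max_decay=0.05, prune_below=0.005):
    """Return (index, new_opacity) for Gaussians that survive one decay step.

    opacities: per-Gaussian opacity in [0, 1]
    grad_mags: per-Gaussian magnitude of the opacity gradient
    """
    # Normalize by the largest gradient so the penalty is relative, not absolute.
    g_max = max(grad_mags) or 1.0
    kept = []
    for i, (opacity, grad) in enumerate(zip(opacities, grad_mags)):
        rel = grad / g_max
        # Assumed rule: a Gaussian the loss barely cares about (rel near 0)
        # takes the full decay; the most-used Gaussian (rel == 1) takes none.
        new_opacity = opacity * (1.0 - max_decay * (1.0 - rel))
        if new_opacity >= prune_below:
            kept.append((i, new_opacity))
    return kept


survivors = decay_opacities([0.5, 0.5, 0.005], [1.0, 0.0, 0.0])
```

Compared with a fixed opacity threshold, the pruning pressure here depends on how much the optimizer is actually using each primitive, which is the part I'd want to reuse.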

abs · pdf · html · ar5iv

Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors

2606.30638

Jameel Hassan, Yasiru Ranasinghe, Vishal Patel · cs.CV

3D Gaussian Splatting (3DGS) has emerged at the forefront of 3D scene reconstruction. Extending 3DGS with language-driven, open-vocabulary understanding has gained significant attention for real-world applications such as embodied AI. Recent methods achieve this by learning an instance feature attribute and assigning semantics by distilling high-dimensional Contrastive Language-Image Pretraining (CLIP) features directly into the scene representation. However, the instance grouping mechanisms of these methods either require a predefined number of instances or suffer from noise in their bottom-up grouping strategies. Furthermore, the reliance on CLIP restricts semantic understanding to simple noun phrases, preventing complex spatial reasoning and referential expression grounding. We present GaussDet, a method that circumvents the need for dense CLIP features by leveraging discrete, open-vocabulary 2D object detectors with referring expression capabilities. We learn instance features for individual Gaussians to decompose the scene into 3D instance groups. By rendering these groups and aggregating semantic votes from multi-view 2D detections, we generate a robust View-Aggregated Semantic Label Distribution (VASD) for each 3D instance. This view-aggregation strategy acts as a strong regularizer, attenuating spurious labels caused by low-quality instance grouping. Our approach enables a straightforward, zero-shot extension from simple language queries to complex referential grounding.

take · Distilling dense CLIP features into every Gaussian always felt like the expensive, lossy way to do this — you pay for a high-dim feature per primitive and still only get noun-phrase semantics. Flipping it to render 3D instance groups and let off-the-shelf 2D detectors vote across views is the cheaper, sharper design: the multi-view aggregation is itself the regularizer that cleans up bad grouping. The payoff that sells it is referential grounding ("the mug behind the laptop") in a strict zero-shot setting, +16.7% mIoU — spatial reasoning CLIP-distillation can't do.
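
sketch · The vote-aggregation step in miniature. I'm assuming the upstream work (rendering each 3D instance group and matching it to a 2D detection in each view) has already produced `(instance_id, label, score)` triples; that tuple format and the score-weighted normalization are my choices, not the paper's definition of VASD.

```python
from collections import Counter, defaultdict


def aggregate_labels(votes):
    """Turn per-view detector votes into one label distribution per instance.

    votes: iterable of (instance_id, label, score), one per view in which a
           detection was matched to the rendered instance. Scores are assumed
           to be positive detector confidences.
    """
    totals = defaultdict(Counter)
    for instance_id, label, score in votes:
        totals[instance_id][label] += score

    distributions = {}
    for instance_id, counter in totals.items():
        z = sum(counter.values())
        if z <= 0:
            continue  # no usable evidence for this instance
        distributions[instance_id] = {lbl: s / z for lbl, s in counter.items()}
    return distributions


votes = [
    (0, "mug", 0.9),     # view A
    (0, "mug", 0.8),     # view B
    (0, "bowl", 0.3),    # view C: one spurious detection
    (1, "laptop", 0.7),
]
dists = aggregate_labels(votes)
top_label = max(dists[0], key=dists[0].get)
```

The single bad "bowl" vote gets outweighed by the consistent "mug" votes, which is the regularizing effect the abstract attributes to view aggregation.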

abs · pdf · html · ar5iv

VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

2606.30645

Yen-Jen Wang, Jiaman Li, Sirui Chen, Takara E. Truong et al. · cs.RO cs.AI cs.GR eess.SY

Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale. We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories using privileged scene information, and renders paired egocentric observations after the fact. We produce 48,000 paired trajectories with no human intervention and train a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker converts these predictions into actions on the physical humanoid. We evaluate on the physical Unitree G1 performing navigation and single-object transport, demonstrating that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation.

take · The bottleneck for perception-driven humanoids is the data tuple nobody has: synchronized egocentric video + language + robot-feasible whole-body trajectories. The clever inversion is rendering the egocentric images *after the fact*: plan trajectories with privileged scene info in a 3DGS-reconstructed metric room, then render what the robot would have seen. 48k paired trajectories with zero human labeling, and it crosses sim-to-real onto a physical Unitree G1. This is the same realization today's reconstruction papers are circling from the other side: a good enough 3DGS reconstruction is now a data generator, not just a viewer. Heavyweight author list behind it.
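
sketch · The "plan first, render after" ordering as a data structure. Everything here is a hypothetical stand-in: `VLKSample`, the three-number `Pose` (the paper predicts whole-body kinematics, not a planar pose), and the toy planner and renderer are mine, there only to show the shape of the tuple.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Pose = Tuple[float, float, float]  # hypothetical stand-in: x, y, yaw


@dataclass
class VLKSample:
    instruction: str   # language command
    poses: List[Pose]  # trajectory, planned with privileged scene info
    frames: List[str]  # egocentric observations, rendered afterwards


def synthesize_sample(
    instruction: str,
    plan: Callable[[str], Iterable[Pose]],
    render: Callable[[Pose], str],
) -> VLKSample:
    # Step 1: plan the motion without looking through the camera at all.
    poses = list(plan(instruction))
    # Step 2: only then render what the robot would have seen at each pose.
    frames = [render(p) for p in poses]
    return VLKSample(instruction, poses, frames)


def toy_plan(instruction: str) -> List[Pose]:
    # Placeholder planner: four steps along a straight line.
    return [(0.5 * i, 0.0, 0.0) for i in range(4)]


def toy_render(pose: Pose) -> str:
    # Placeholder for a 3DGS render call; returns a frame identifier.
    return f"frame@x={pose[0]:.1f}"


sample = synthesize_sample("walk to the table", toy_plan, toy_render)
```

Because rendering is a pure function of the planned pose, images and trajectory are synchronized by construction, which is exactly the property the real-world data sources lack.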

abs · pdf · html · ar5iv

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

2606.30406

Wenhan Ma, Jianyu Wei, Liang Zhao, Hailin Zhang et al. · cs.CL cs.LG

Modern large language models (LLMs) rely on reinforcement learning during post-training to push specific capabilities, yet integrating multiple capabilities into one model remains hard. Existing methods, such as Off-Policy Finetune and Mix-RL, are either inefficient or lose performance. In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers: we first run per-domain specialised RL to obtain a set of domain teachers, then distill these teachers into the student on its own rollouts. This eliminates exposure bias and provides a dense optimization signal. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines, inheriting nearly all of each teacher's capability. MOPD also enables parallel, independent development of domain teachers, removing the cross-domain coupling typical of multi-domain post-training. MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model.

take · The practical pain in multi-capability post-training is coupling: train math-RL and code-RL together and they interfere, so every capability has to move in lockstep. Specializing one RL teacher per domain and then distilling them into the student *on the student's own rollouts* decouples the org problem (teams ship teachers independently) from the model problem (one student inherits all of them). On-policy is what makes it work: distilling on student rollouts removes the exposure bias that off-policy fine-tuning suffers from. The credible bit is the deployment line: it's in MiMo-V2-Flash, not just a benchmark table.
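
sketch · What "distill the domain teacher on the student's own rollout" can reduce to at the token level. The abstract does not name the divergence; per-token reverse KL is my assumption (it is a common choice for on-policy distillation), and the dict-of-lists layout is purely illustrative.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for one token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


def distill_loss(domain, student_dists, teacher_dists_by_domain):
    """Mean per-token reverse KL against the teacher for this prompt's domain.

    student_dists: next-token distributions along a rollout the STUDENT sampled.
    teacher_dists_by_domain: domain -> that teacher's distributions scored on
        the same student-sampled prefix (this is the on-policy part).
    """
    teacher_dists = teacher_dists_by_domain[domain]
    per_token = [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)


student = [[0.7, 0.3], [0.5, 0.5]]
teachers = {
    "math": [[0.9, 0.1], [0.5, 0.5]],  # disagrees with the student at token 0
    "code": [[0.7, 0.3], [0.5, 0.5]],  # already matches the student
}
math_loss = distill_loss("math", student, teachers)
code_loss = distill_loss("code", student, teachers)
```

Every token position contributes a term, which is the "dense optimization signal" the abstract contrasts with sparse RL rewards; and each teacher is only ever consulted on prompts from its own domain, so teachers never need to be trained together.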

abs · pdf · html · ar5iv

DOPD: Dual On-policy Distillation

2606.30626

Xinlei Yu, Gen Li, Qingyi Si, Guibin Zhang et al. · cs.AI

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities.

take · "Privilege illusion" is a sharp name for a real trap: if you feed the teacher privileged context, some of its behavior comes from information the student will never have, so the student is being asked to imitate something it structurally can't replicate. Separating the *transferable* capability gap from the *un-replicable* information-asymmetry gap — and routing each token's supervision by advantage — is the kind of distinction that only shows up once you take token-level supervision seriously. Pairs naturally with MOPD in today's batch: the field is clearly converging on on-policy distillation as the post-training workhorse, now arguing about how to route the signal.
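
sketch · The bare idea of per-token routing. The abstract says routing depends on the "advantage gap and relative probabilities" of the two privileged policies but gives no rule, so the threshold test below is entirely my invention; it uses the advantage gap only and leaves the relative-probability signal out.

```python
def route_tokens(teacher_adv, student_adv, margin=0.0):
    """Choose a supervision source for each token position.

    teacher_adv: per-token advantage under the privileged teacher
    student_adv: per-token advantage under the privileged student
    margin: hypothetical threshold on the gap between the two
    """
    routes = []
    for t_adv, s_adv in zip(teacher_adv, student_adv):
        # Assumed rule: defer to the teacher only where it is clearly ahead;
        # otherwise supervise from the privileged student.
        routes.append("teacher" if t_adv - s_adv > margin else "student")
    return routes


routes = route_tokens([0.9, 0.1, 0.5], [0.2, 0.4, 0.5])
```

Even this crude version shows why uniform supervision is wasteful: only the first token has a gap worth routing to the teacher, which matches the abstract's point that a small subset of tokens carries the capability-bearing signal.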

abs · pdf · html · ar5iv