arXiv digest

2026-06-23 · 6 papers · curated by paper-scout

★ Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

2606.23688

Yehonathan Litman, Xiaoxuan Ma, Manan Shah, Nicolas Ugrinovic et al. · cs.CV

Reconstructing dynamic non-rigid objects from monocular video requires integrating visual cues from direct observations with data-driven priors over geometry and appearance. Prior approaches either learn to directly predict 4D representations from visual input or initialize a 3D representation that is subsequently deformed and refined based on video evidence. However, the former are constrained by the scarcity of 4D training data, while the latter leverage priors only for the initial reconstruction and rely solely on video supervision thereafter; neither handles complex in-the-wild scenarios with large deformations and occlusions well. We present Lift4D, a test-time optimization framework that addresses both limitations. First, we adapt an existing single-view 3D reconstruction model to yield temporally consistent per-frame predictions via causal latent conditioning, providing a coherent initialization for a deformable 3D Gaussian Splatting representation. We then "sculpt" this representation to match the input video through an occlusion-aware optimization that faithfully recovers visible surface details while completing unobserved regions using a view-conditioned diffusion prior. We demonstrate that Lift4D clearly improves over prior 4D reconstruction methods, particularly on challenging in-the-wild sequences with severe occlusions and non-rigid motion.

take · The split that matters: lean on the data-driven prior for the initialization *and* throughout the deformation, not just the seed. Most 4D-from-monocular pipelines trust the prior once and then trust the video — which is exactly when in-the-wild noise wins. I'd stress it on fast non-rigid motion, where monocular depth is least reliable and the refinement has to carry the most weight.

abs · pdf · html · ar5iv

MeGAS: Thermomechanical Dynamic Gaussian Splatting for Thermophysical Scene Editing

2606.23455

Zesong Yang, Yuanhang Lei, Liyuan Cui, Yihang Chen et al. · cs.CV

Recent advances integrate physically grounded Newtonian dynamics with neural rendering frameworks, narrowing the gap between photorealistic scene reconstruction and physics-based animation. However, existing approaches focus on mechanically driven dynamics while neglecting temperature, a fundamental yet invisible physical factor underlying phenomena such as melting, solidification, and other thermomechanical processes. In this paper, we propose MeGAS, a novel framework that incorporates thermomechanical phase-change dynamics into 3D Gaussian Splatting (3DGS). Specifically, we propose a new thermomechanical dynamic Gaussian Splatting representation that augments 3DGS with temperature attributes and employs a heat advection-diffusion solver with MPM dynamics incorporating phase transitions, enabling physically plausible and visually realistic synthesis of thermophysical phenomena. Furthermore, a new topology-adaptive Gaussian rendering strategy is proposed to mitigate cracking and floaters under extreme deformation. Extensive experiments demonstrate that MeGAS produces physically consistent thermomechanical behavior while maintaining high-fidelity photorealistic rendering, advancing toward physics-integrated world models.

take · Temperature as the missing state variable in physics-grounded splatting. Melting and solidification need a thermal field, not just Newtonian forces — bolting one on is obvious only in hindsight. The real question is whether the phase-change coupling is physically calibrated or merely visually plausible; the editing demo will look great either way.
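
sketch · To make "temperature as a state variable" concrete, here is a deliberately tiny stand-in: one explicit finite-difference step of 1D heat diffusion, followed by a threshold phase label. This is not the paper's solver. It has no advection term, no MPM coupling, and no latent heat, and the names `diffuse_step`, `phase_labels`, and `melt_point` are made up for illustration.

```python
def diffuse_step(temps, alpha, dt, dx):
    """One explicit finite-difference step of 1D heat diffusion.

    The two endpoint cells are held fixed (a simple boundary assumption).
    """
    r = alpha * dt / (dx * dx)
    if r > 0.5:
        # Standard stability limit for the explicit 1D scheme.
        raise ValueError("explicit scheme is unstable for alpha*dt/dx^2 > 0.5")
    new = list(temps)
    for i in range(1, len(temps) - 1):
        new[i] = temps[i] + r * (temps[i + 1] - 2.0 * temps[i] + temps[i - 1])
    return new


def phase_labels(temps, melt_point):
    """Label each cell by comparing its temperature to one melting point (assumed)."""
    return ["liquid" if t >= melt_point else "solid" for t in temps]


# A hot spot in the middle of a cold bar spreads to its neighbours.
temps = diffuse_step([0.0, 0.0, 100.0, 0.0, 0.0], alpha=1.0, dt=0.25, dx=1.0)
labels = phase_labels(temps, melt_point=30.0)
```

Even in this toy, the phase of a cell depends on a field the renderer never sees, which is why the calibration question above matters.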

abs · pdf · html · ar5iv

Lightweight Neural Framework for Robust 3D Volume and Surface Estimation from Multi-View Images

2606.23653

Diego E. Farchione, Ramzi Idoughi, Peter Wonka · cs.CV

Accurate volume and surface area estimation is critical for diverse applications, from marine ecology to medical diagnostics. However, existing methods often suffer from high computational costs and poor performance with sparse and noisy data. We propose a fully feed-forward framework that regresses scale-normalized volume and surface area and their associated uncertainties directly from multi-view images. By fusing 3D point cloud reconstructions with view-aligned 2D features through a graph-based decoder, our model bypasses iterative optimization, ensuring exceptional scalability and rapid inference. Experimental results demonstrate that our approach outperforms state-of-the-art methods, particularly when operating with a low number of input images. Validated across coral monitoring, dietary analysis, and anthropometry, our proposed framework provides a robust, adaptable solution for quantitative shape analysis. This architecture provides a high-speed, scalable alternative for precise geometric estimation from visual data, maintaining high performance even in resource-constrained or sparse-view scenarios.

take · Feed-forward, scale-normalized volume and surface area *with uncertainty* from multi-view images — the uncertainty head is what makes this usable in an actual measurement loop rather than a demo. Fusing point-cloud reconstructions with view-aligned 2D features through a graph decoder is a sane layout; I'd go straight to the calibration of those error bars on sparse, noisy inputs.
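
sketch · A minimal version of the calibration check suggested above, assuming the model reports a point estimate and a Gaussian standard deviation per object. The abstract does not say what form the uncertainties take, so the Gaussian reading, the function name `interval_coverage`, and the numbers are all assumptions.

```python
def interval_coverage(preds, sigmas, targets, z=1.96):
    """Fraction of ground-truth values inside pred +/- z * sigma.

    For well-calibrated Gaussian error bars, z=1.96 should give roughly 0.95.
    """
    if not preds or not (len(preds) == len(sigmas) == len(targets)):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1 for p, s, t in zip(preds, sigmas, targets) if abs(t - p) <= z * s
    )
    return hits / len(preds)


# Hypothetical predicted volumes, predicted sigmas, and measured volumes.
coverage = interval_coverage(
    preds=[1.0, 2.0, 3.0, 4.0],
    sigmas=[0.1, 0.1, 0.1, 0.1],
    targets=[1.05, 2.5, 2.9, 4.0],
)
```

Coverage well below the nominal level on sparse-view inputs would mean the error bars are overconfident exactly where they are needed most.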

abs · pdf · html · ar5iv

dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models

2606.23623

Yuhao Wu, Yitian Liu, Weijie Shen, Mishuo Han et al. · cs.RO

Vision-Language-Action (VLA) models have established a powerful paradigm for generalist robotic manipulation by grounding control into the semantic reasoning of VLMs. Prevailing architectures typically model actions continuously via diffusion or flow processes, or discretely through either autoregressive generation or parallel decoding. Recently, Discrete Diffusion VLAs (dVLAs) have emerged as a distinct alternative, unifying vision, language, and action into a single discrete token space via masked generative modeling. Although dVLAs combine iterative refinement with unified representations, their training has thus far been restricted to Supervised Fine-Tuning (SFT), leaving the potential of Reinforcement Learning (RL) for further policy refinement largely unexplored. A fundamental challenge in RL for dVLAs is that the marginal probability of the final action generated by dVLAs remains intractable. To solve this problem, we propose dVLA-RL, shifting the learning objective from the marginal action probability to the joint probability of the sampled generation path. Specifically, by modeling the denoising process as a Markov Decision Process (MDP), we mathematically formulate this path probability as a product of step-wise transitions. This trajectory-level objective provides a unified formulation that natively accommodates variable denoising steps. Leveraging this intrinsic flexibility, we introduce a unified step scheduling approach for complex multi-task learning, tailoring denoising steps to specific task complexities to maximize both success rates and computational efficiency. Extensive evaluations demonstrate that our approach achieves a success rate of 99.7% on LIBERO. Furthermore, it establishes strong VLA-based results on RoboTwin 2.0 by delivering a 30.6% improvement over the SFT baseline, remaining competitive with strong World-Action Model baselines.

take · RL applied directly over the denoising trajectory of a discrete-diffusion VLA — rewarding the unmasking path, not just the final action tokens. For parallel-decoding policies that's the right surface to optimize. The open question is reward density and stability across masked generative rollouts, where credit assignment over the trajectory gets slippery.
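
sketch · The "product of step-wise transitions" is easier to see in log space, where it becomes a sum. The toy below assumes each step's transition factorizes over the tokens committed at that step and that the choice of which positions to unmask is deterministic. Both are simplifying assumptions; `path_log_prob` and the probabilities are illustrative, not the paper's code.

```python
import math


def path_log_prob(step_token_probs):
    """Log-probability of one sampled denoising path.

    step_token_probs[k] lists the policy's probabilities for the tokens
    committed (unmasked) at denoising step k. The path probability is the
    product over steps, so its log is the sum of all per-token logs.
    """
    total = 0.0
    for probs in step_token_probs:
        for p in probs:
            if not 0.0 < p <= 1.0:
                raise ValueError("probabilities must lie in (0, 1]")
            total += math.log(p)
    return total


# The same objective accepts any number of denoising steps.
# Reusing identical probabilities here is only to keep the arithmetic
# obvious; a real model would assign different values per schedule.
two_step = path_log_prob([[0.9, 0.8], [0.7, 0.6]])
four_step = path_log_prob([[0.9], [0.8], [0.7], [0.6]])
```

A policy-gradient update would then weight this path log-probability by an advantage, which is where the credit-assignment concern above comes in.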

abs · pdf · html · ar5iv

Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models

2606.23567

Jiawei Xu, Minghui Liu, Aakriti Agrawal, Yifan Chen et al. · cs.LG cs.AI

Masked diffusion language models decode by iteratively unmasking tokens, where the unmasking order defines an "order of thought" that strongly influences generation quality yet is typically chosen heuristically. We derive a tractable upper bound on the sequential decoding mismatch, measured by the Kullback-Leibler divergence and expressed in terms of the model's pathwise log-likelihood, with tightness under sufficient model expressivity. This bound induces a dense self-aware reward over ordered trajectories, casting order selection as a principled policy optimization problem with a frozen denoiser. We instantiate this idea as Self-Aware Scheduling (SAS), which learns a lightweight order policy using Group Relative Policy Optimization and applies seamlessly to both any-order and semi-autoregressive decoding. On Sudoku with 1B MDM, SAS improves puzzle accuracy from 82.0% (best heuristic schedule) to 91.8%, and reaches 97.5% with second-stage fine-tuning along learned trajectories. On mathematical reasoning with LLaDA-8B, SAS improves pass@1 on GSM8K from 64% to 76% and on MBPP from 39.5% to 41%, consistently matching or exceeding heuristic schedules across generation lengths and block sizes. Project page: https://jimmyxu123.github.io/SAS

take · A tractable KL bound on decoding-order mismatch, turned into a dense reward over unmasking trajectories — replacing the heuristic 'which token to unmask next' with a learned order of thought. The theory-to-reward move is clean, and it's the kind of change that lifts diffusion-LM quality without touching the backbone.
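
sketch · The policy-optimization half in miniature: Group Relative Policy Optimization scores each sampled unmasking order against the other orders drawn for the same prompt. The rewards below are hypothetical stand-ins for the bound-derived reward, whose exact form the abstract does not give. Using the population standard deviation is one common choice, not necessarily the authors'.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled trajectories.

    Each advantage is (reward - group mean) / (group std + eps), so the
    order policy is pushed toward orders that beat their siblings.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two trajectories per group")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three unmasking orders of the same prompt,
# e.g. pathwise log-likelihoods under the frozen denoiser.
advantages = group_relative_advantages([-3.0, -2.0, -1.0])
```

Because the denoiser is frozen, only the lightweight order policy receives these advantages, which matches the "without touching the backbone" point above.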

abs · pdf · html · ar5iv

HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory

2606.23565

Xiaolin Zhou, Liu Liu, Tingyang Xiao, Wei Feng et al. · cs.RO cs.CV

LLM agents follow a practical execution loop in digital environments: they reason over structured states, invoke tools, inspect feedback, and revise actions. Extending this loop to physical robots is difficult because physical execution is continuous, embodiment-dependent, uncertain, and constrained by safety. Existing embodied-AI systems have advanced manipulation, spatial understanding, navigation, and humanoid control, but these capabilities often remain specialized modules or loosely coupled decision loops. In this work, we introduce HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. We deploy HoloAgent-0 on real hardware and evaluate its spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.

take · Porting the digital-agent loop — reason, call a tool, inspect feedback, revise — onto robots, with a persistent 3D spatial memory as the shared state. The hard part is always the continuous, embodiment-dependent, safety-constrained execution gap; a unified 3D memory is a reasonable substrate for it. Read it for how the loop grounds against uncertainty, not for the framework diagram.

abs · pdf · html · ar5iv