[{"date":"2026-07-22","papers":[{"arxivId":"2607.19223v1","title":"AdaFlash: Adaptive Speculative Decoding via On-Policy Distilled Diffusion Drafters","authors":["Yu-Yang Qian","Hao-Cong Wu","Chen Chen","Jiacheng Sun","Zhenhua Dong","Peng Zhao","Zhi-Hua Zhou"],"categories":["cs.LG","cs.CL"],"abstract":"Speculative decoding, in which a lightweight draft model first generates a draft sequence that is then verified in parallel by the target model, has become a prevalent paradigm for accelerating large language model inference. Recent work such as DFlash further boosts drafting efficiency by leveraging diffusion drafters, whose parallel denoising mechanism enables draft generation in a single forward pass. In this work, we uncover a central pitfall of diffusion drafters: bidirectional attention is a double-edged sword. On one hand, it endows the model with parallel generation and global contextual modeling capabilities; on the other hand, this inherent global dependency introduces high variance at both the domain-level and the token-level: acceptance rates fluctuate substantially across different domains, and draft token quality also varies heterogeneously at different token positions. To tackle this issue, we propose AdaFlash framework, comprising two components: (i) an on-policy distillation (OPD) algorithm with reverse-KL divergence tailored for diffusion drafters, bringing stable convergence and effectively reducing domain-level variance; and (ii) an adaptive length head that dynamically adjusts the candidate sequence length on the fly, substantially lowering the verification cost of the target model and handling token-level variance. 
Experiments demonstrate that AdaFlash consistently improves speedup rate during deployment, with especially significant gains in high-concurrency scenarios, achieving up to approximately 66% higher throughput than previous state-of-the-art methods.","take":"The DFlash line drafts with a diffusion model — one parallel denoising pass proposes a whole draft — but this paper names the catch: the bidirectional attention that buys that parallelism also injects high variance into the acceptance rate, at both the domain and token level. AdaFlash distills the drafter on-policy to damp that variance. A good reminder that “faster drafts” and “accepted drafts” are different objectives, and the second is the one that actually sets the speedup.","standout":true,"links":{"abs":"https://arxiv.org/abs/2607.19223","pdf":"https://arxiv.org/pdf/2607.19223","html":"https://arxiv.org/html/2607.19223v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.19223"}},{"arxivId":"2607.19191v1","title":"ABot-World-0: Infinite Interactive World Rollout on a Single Desktop GPU","authors":["Fan Jiang","Zhaoxu Sun","Mengchao Wang","Ziyu Zhu","Chiyu Wang","Yunpeng Zhang","Wenlin Liu","Yun Wang","Xue Zheng","Rui Sun","Junfeng Ni","Hongyu Pan","Zhongxu Sun","Fei Yu","Zengye Ge","Mengmeng Du","Nianfei Fan","Mingchao Sun","Yu Liu","Yongchang","Yanqing Zhu","Jiahang Wang","Ning Ying","Yuze Xuan","Di Yang","Zhicheng Liu","Zhe Gao","Tingbing Xu","Jiacheng Sui","Wenjin Yang","Junnan Lai","Shufeng Liu","Yuan Liu","Zheng Zhou","Yingliang Peng","Dawei Cao","Kaifeng Sheng","Yuxiang Cai","Fei Lu","Mu Xu","Ning Guo"],"categories":["cs.CV","cs.AI","cs.LG"],"abstract":"We present ABot-World-0, an action-conditioned video world model for real-time, long-horizon closed-loop interaction, supported by a multi-source data infrastructure spanning AAA games, simulation engines, and internet videos to learn controllable world dynamics. 
WorldExplorer performs agent-driven collection guided by training feedback, while a unified pipeline applies 14 deterministic quality checks, VLM-based assessment, and synchronized action and text annotation. We progressively distill a bidirectional action-conditioned teacher into a causal student through teacher forcing and ODE distillation, and introduce LongForcing to align long student self-rollouts with an extended-horizon teacher, mitigating accumulated distribution shift and autoregressive drift. Raw keyboard actions provide a unified control interface for scene roaming and third-person character interaction, while reference-character memory provides persistent appearance cues for identity consistency during third-person rollouts. For deployment, we co-design a streaming inference stack with a lightweight VAE decoder, efficient attention, memory-aware scheduling, and low-bit DiT inference. Across optimized low-bit configurations, ABot-World-0 streams 720P video at up to 16 FPS on a single NVIDIA RTX 5090 desktop GPU, with 1.2s action-to-first-frame latency and approximately 19GiB peak VRAM. Experiments on WorldRoamBench and extended interactive rollouts demonstrate competitive controllability and coherent long-horizon world evolution.","take":"An action-conditioned video world model you can run interactively on a single desktop GPU — the efficiency framing is the headline. They distill a bidirectional teacher into a causal student with ODE distillation, then add LongForcing to align long self-rollouts against an extended-horizon teacher and fight the usual autoregressive drift, trained on a mix of AAA games, sim engines, and internet video behind 14 deterministic quality checks. 
“Runs on one desktop GPU” is the part that makes a world model feel usable rather than a demo.","standout":true,"links":{"abs":"https://arxiv.org/abs/2607.19191","pdf":"https://arxiv.org/pdf/2607.19191","html":"https://arxiv.org/html/2607.19191v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.19191"}},{"arxivId":"2607.19228v1","title":"IGGT4D: Streaming 4D Instance-Grounded Geometry Transformer","authors":["Zhengyu Zou","Hao Li","Kuixuan Jiao","Liu Liu","Tingyang Xiao","Xiaolin Zhou","Fangzhou Hong","Zhizhong Su","Dingwen Zhang","Ziwei Liu"],"categories":["cs.CV"],"abstract":"Real-world spatial intelligence requires agents to understand scenes from continuous video streams, where objects move, persist, disappear, and reappear over time. While recent spatial foundation models have enabled generalizable feed-forward 3D reconstruction, most streaming methods remain geometry-centric and lack temporally consistent object-level understanding. Meanwhile, existing semantic reconstruction and 3D-aware vision-language methods largely rely on externally extracted 2D semantic cues or loosely coupled geometry inputs, limiting unified geometry-instance learning in long dynamic scenes. In this paper, we propose IGGT4D, a streaming instance-grounded geometry Transformer for online 4D scene understanding. IGGT4D processes video frames sequentially, reuses historical context through causal spatial-temporal modeling, and incrementally updates a unified representation of camera motion, geometry, and object identity. This enables long-sequence feed-forward reconstruction with geometry-instance consistency in dynamic environments. To address the lack of high-quality 4D supervision, we further construct InsScene4D-147K, a large-scale dataset spanning real/synthetic and static/dynamic scenes, with RGB images, depth, poses, and temporally consistent instance masks generated by an automated geometry-guided annotation pipeline. 
Experiments on 3D reconstruction, pose estimation, instance spatial tracking, and open-vocabulary segmentation demonstrate that IGGT4D outperforms existing streaming baselines while maintaining scalable online inference for long dynamic sequences.","take":"Most streaming 3D foundation models are geometry-centric — they reconstruct the scene but don't carry object identity through time. IGGT4D fuses geometry and instance grounding in one streaming transformer, so objects that move, vanish, and reappear keep a consistent handle across a long video, without leaning on externally extracted 2D semantics. Exactly the “spatial intelligence over a continuous stream” problem that matters for real perception.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.19228","pdf":"https://arxiv.org/pdf/2607.19228","html":"https://arxiv.org/html/2607.19228v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.19228"}},{"arxivId":"2607.18730v1","title":"Dual Attention Residuals","authors":["Xingda Yu","Yining Li","Xinzhang Liu","Zhihao Yang","Haowei He","Chao Wang","Yongxiang Li","Shuangyong Song"],"categories":["cs.CL"],"abstract":"Recent work extends Transformer residual pathways along two complementary axes: historical retrieval selects information from earlier depths, whereas multi-stream methods maintain multiple residual trajectories. These capabilities have largely been studied in isolation, and assigning an independent retriever to each stream still prevents one trajectory from influencing depth selection in another. We propose Dual Attention Residuals (DAR), which brings multi-stream interaction into historical retrieval through reciprocal cross-stream addressing. For each target stream, DAR computes depth weights from normalized states in the opposite stream and applies them to values from the target stream's own history. 
The retrieved states are combined for an unchanged Transformer branch and updated through constrained gated writes; a block-form variant operates on block-level histories to control overhead. Across dense models from 0.1B to 1B parameters and a 7B sparse-MoE model, DAR consistently improves validation loss over standard residual Transformers and Attention Residuals. Routing ablations show that the gain cannot be explained by an additional stream or value projection alone. Representation and intervention analyses further show that reciprocal cross-stream selection preserves depth-wise diversity and avoids the redundancy or functional imbalance observed in alternative two-stream designs.","take":"Two recent ideas for the residual stream — historical retrieval (pull from earlier depths) and multi-stream (keep several residual trajectories) — have mostly lived apart. DAR couples them: each stream computes its depth-selection weights from the *other* stream's normalized states. It's a close cousin of the Attention Residuals in Kimi K3, and one more sign that the residual pathway itself is becoming a design surface, not just a highway.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.18730","pdf":"https://arxiv.org/pdf/2607.18730","html":"https://arxiv.org/html/2607.18730v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.18730"}},{"arxivId":"2607.18625v1","title":"Norm or Direction? Decoding Vision Mambas for High-Resolution Vision","authors":["Jin Yu","Juyoun Park"],"categories":["cs.CV","cs.AI"],"abstract":"Vision Mamba models replace quadratic self-attention with linear complexity selective state space models (SSMs), emerging as efficient visual backbones. However, MambaOut demonstrates that a Gated CNN block can match or exceed VMamba on image classification, questioning the necessity of SSMs for vision. This raises a fundamental question: do VMamba and MambaOut encode visual information differently at the representation level? 
To investigate, we apply cross model centered kernel alignment (CKA) analysis and find that VMamba's final stage blocks form representations distinctly different from both MambaOut and its own preceding blocks. We therefore focus on the final block features, decomposing each spatial token into magnitude and direction. MambaOut concentrates class-discriminative information in high-norm foreground tokens that align with Grad-CAM attribution. VMamba, by contrast, produces high-norm tokens predominantly in background regions, misaligned with Grad-CAM, yet preserves discriminative signals primarily in token directions. These observations reveal that the two models rely on different encoding strategies. We connect this difference to high-resolution classification and semantic segmentation. VMamba distributes logit support broadly across object regions, whereas MambaOut relies on sparse dominant tokens, a strategy that becomes less stable as token counts grow. Under full fine-tuning for segmentation, VMamba consistently outperforms MambaOut. These results suggest that VMamba's advantage in dense prediction stems not merely from the SSM mechanism or sequence length, but from how semantic evidence is organized across token magnitude and direction. Ultimately, we conclude that token magnitude and directional structure serve as critical axes for improving visual backbones, particularly under dense supervision.","take":"A clean measure-before-you-architect paper. MambaOut showed a gated CNN can match Vision Mamba, so do they actually encode vision differently? CKA says yes — VMamba's final-stage blocks diverge from both MambaOut and their own earlier blocks — and decomposing each spatial token into magnitude vs direction localizes where the difference lives. 
Pairs nicely with our own CKA map of workspace geometry across models.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.18625","pdf":"https://arxiv.org/pdf/2607.18625","html":"https://arxiv.org/html/2607.18625v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.18625"}},{"arxivId":"2607.19171v1","title":"Point Ladder Tuning: Parameter-Efficient Hierarchical Adaptation for 3D Point Cloud Understanding","authors":["Junlin Chang","Longhao Zou","Rui Li"],"categories":["cs.CV"],"abstract":"Fine-tuning pre-trained point-cloud backbones typically updates all parameters, resulting in substantial computation and memory overhead. More importantly, modern point backbones rely on aggressive tokenization and downsampling, which yields compact global tokens but irreversibly discards fine-grained local geometry, an inherent bottleneck for parameter-efficient adaptation. Consequently, existing PEFT methods that operate only on these coarsened tokens can modulate global semantics but struggle to recover the missing multi-scale locality. We present Point Ladder Tuning (PLT), a locality-aware PEFT framework that performs hierarchical, instance-conditioned adaptation while keeping the backbone frozen. PLT forms a lightweight closed loop: (i) a Hierarchical Ladder Network (HLN) constructs a multi-resolution local feature pyramid directly from raw points; (ii) a Local-Global Fusion (LGF) aligns and fuses local pyramids with intermediate backbone semantics; and (iii) a Dynamic Prompt Generator produces instance-aware multi-scale prompts to modulate the frozen backbone effectively. For dense prediction, we further introduce a lightweight segmentation head that progressively upsamples fused features and leverages backbone priors to refine fine structures. Extensive experiments on classification and dense prediction show that PLT consistently surpasses prior PEFT baselines with minimal tunable parameters. 
PLT achieves state-of-the-art performance using only 2.71% trainable parameters for classification and 7.69% for dense prediction, and scales favorably to larger backbones, requiring merely 0.36% parameters on PointGPT-L. The code is released at https://github.com/JunLinChang/ECCV2026-PLT.","take":"Point backbones tokenize and downsample hard, which yields compact global tokens but irreversibly discards fine local geometry — so PEFT methods that only touch the coarse tokens can modulate semantics but can't recover multi-scale locality. Point Ladder Tuning adds a locality-aware ladder that adapts hierarchically with the backbone frozen. Practical for fine-tuning a point-cloud model on a real inspection dataset without paying the full-finetune bill.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.19171","pdf":"https://arxiv.org/pdf/2607.19171","html":"https://arxiv.org/html/2607.19171v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.19171"}},{"arxivId":"2607.18722v1","title":"Stale but Stable: Staleness-Adaptive Trust Regions for Stabilizing Asynchronous Reinforcement Learning","authors":["Junyao Yang","Yucheng Shi","Zongxia Li","Zhongzhi Li","Ruhan Wang","Xiangxin Zhou","Kishan Panaganti","Haitao Mi","Leowei Liang"],"categories":["cs.LG","cs.CL"],"abstract":"Asynchronous reinforcement learning improves throughput by decoupling rollout generation from optimization, but staleness is an inevitable byproduct compounded by policy lag, engine delays, and mixture-of-experts routing. From a trust-region perspective, this mismatch is critical: training-inference divergence governs approximation error in finite-horizon bounds, whereas PPO clipping only gates sampled outward updates, acting as a sampled surrogate rather than a full-policy constraint. As a result, high-staleness updates remain weakly controlled in the asynchronous regime where stale rollouts matter most. 
We introduce the Staleness-Adaptive Trust Region (SAT), which uses the detached sampled log-ratio as a practical staleness proxy, identifies high-mismatch tails within each batch via staleness-based kernel scaling, and contracts only the sign-selected endpoint of the nominal PPO interval. This preserves baseline behavior on ordinary tokens while enforcing more conservative updates on newly intercepted outward bands. We prove local interval containment and pointwise pessimism relative to PPO, showing how the adaptive rule reshapes update geometry under heterogeneous staleness. We evaluate SAT in a decoupled asynchronous RL setup built on Qwen3-30B-A3B-Base, using SGLang as the inference engine and Megatron for training. In this setting, SAT-GSPO w/ R3 achieves the best observed AIME24 avg@8, reaching 35.83 at lag 1 and 34.79 at lag 8, while SAT-GSPO reaches 34.17 at lag 1. Adaptive clipping and routing replay act as complementary stabilizers targeting mismatch tails and routing inconsistency, respectively. Overall, aligning clip intervals with staleness heterogeneity effectively stabilizes asynchronous RL.","take":"Async RL buys throughput by decoupling rollout generation from optimization, but staleness — policy lag, engine delays, MoE routing drift — quietly breaks the trust region, and PPO clipping only gates the sampled updates rather than the full-policy mismatch. SAT uses the detached sampled log-ratio as a staleness proxy to tighten the region exactly where stale rollouts matter most. 
Squarely in the make-large-scale-RL-not-fall-over genre we keep returning to.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.18722","pdf":"https://arxiv.org/pdf/2607.18722","html":"https://arxiv.org/html/2607.18722v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.18722"}},{"arxivId":"2607.18578v1","title":"Two-Stage Extrinsic Calibration of a Static Line-Scanning Lidar with a Rotary Platform","authors":["Vikram Shree","Hike Danakian","Long Nguyen","Rajanish Gokidi","Patrick Nercessian"],"categories":["cs.RO","eess.SP"],"abstract":"A line-scanning lidar yields range and azimuth values in a fixed plane. To perceive surrounding objects in 3D, there must be relative motion between the lidar plane and the object. Thus, using a rotating base-platform is promising for industrial applications where objects need to be scanned or inspected precisely, and is the main focus of this work. In the rotary platform setup, a 3D point cloud of an object can be constructed if the axis of rotation and the precise motion about that axis are known. However, this setup gives rise to the following problem: how can the axis of rotation of the platform be accurately identified with respect to the lidar coordinate system? It is referred to as the calibration problem in the robotics community. Any inaccuracy in this transformation directly affects the quality of the reconstructed point cloud, leading to misrepresentation of the object of interest. In this work, we explore automated approaches to statically and dynamically estimate the transformation of a rotary platform's axis of rotation with respect to a static line-scanning lidar. The proposed algorithms have been validated on real-world datasets obtained from a custom made rotary platform and an FMCW lidar, and their convergence characteristics are studied for various initial conditions.","take":"A refreshingly concrete robotics paper. 
A static line-scanning lidar on a rotary platform can build a 3D point cloud of an object — but only if you know the axis of rotation precisely, and recovering that axis in the lidar frame is the whole problem. Their two-stage extrinsic calibration pins it down for industrial scan-and-inspect rigs. This is the unglamorous perception plumbing that decides whether an inspection setup actually works.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.18578","pdf":"https://arxiv.org/pdf/2607.18578","html":"https://arxiv.org/html/2607.18578v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.18578"}}],"kind":"arxiv","slug":"2026-07-22","body":"Eight from today's feed — a streaming 4D geometry transformer and a single-GPU world model, a diffusion-drafter speculative-decoding fix, two residual-stream/SSM architecture notes, parameter-efficient point-cloud tuning, staleness-aware async RL, and a genuinely useful lidar calibration for rotary inspection rigs.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-07-22"},{"date":"2026-07-15","papers":[{"arxivId":"2607.13031v1","title":"The Seriality Gap in Video Diffusion Models","authors":["Jorge Diaz Chao","Konpat Preechakul","Yuxi Liu","Yutong Bai"],"categories":["cs.LG","cs.CV"],"abstract":"When one ball strikes another, then another, video models should predict the consequences of each bounce. In controlled experiments on multi-ball hard-sphere dynamics, we find that the performance of standard bidirectional video diffusion degrades as the causal chain lengthens, even when provided more denoising steps. In a length-matched single-ball control, where ball-ball interactions are absent, the degradation largely disappears, isolating dependent-event structure rather than video length as the cause. Across intervention studies, methods that increase effective serial computation improve performance disproportionately, including autoregressive/blockwise generation and architectural depth. 
We identify this pattern as the seriality gap: a mismatch between tasks requiring growing serial computation and video diffusion models whose denoising loop does not provide scalable serial compute. We then prove that, for deterministic video prediction, denoising steps do not add serial computation beyond the backbone.","take":"A clean negative result with a proof attached — my favorite genre. On multi-ball collision dynamics, bidirectional video diffusion gets *worse* as the causal chain lengthens, and piling on denoising steps doesn't help; but a length-matched single-ball control (no interactions) barely degrades, which isolates dependent-event structure — not video length — as the culprit. They name it the seriality gap and prove that, for deterministic prediction, denoising steps add no serial computation beyond the backbone. Tellingly, the fixes that work — autoregressive/blockwise generation, more depth — are exactly the ones that add real serial compute. Sobering for anyone treating a video diffusion model as a physics simulator: the denoising loop is not the sequential reasoning you might assume it is.","standout":true,"links":{"abs":"https://arxiv.org/abs/2607.13031","pdf":"https://arxiv.org/pdf/2607.13031","html":"https://arxiv.org/html/2607.13031v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.13031"}},{"arxivId":"2607.12829v1","title":"Accelerating Masked Diffusion Large Language Models: A Survey of Efficient Inference Techniques","authors":["Daehoon Gwak","Minhyung Lee","Junwoo Park","Jaegul Choo"],"categories":["cs.LG","cs.AI","cs.CL"],"abstract":"Diffusion large language models (dLLMs) offer a theoretical advantage in parallel generation over standard autoregressive models. However, parallel generation alone does not guarantee practical speedups. Realizing this efficiency requires specialized inference mechanisms, such as diffusion-aware caching and reuse. 
As inference efficiency becomes a prerequisite for practical deployment, recent research has actively explored acceleration techniques across algorithms, architectures, and systems. However, rigorous comparisons remain difficult, as end-to-end latency stems from intricate trade-offs between algorithmic, architectural, and system-level factors that are often conflated in existing benchmarks. In this survey, we introduce a unified latency decomposition framework for dLLMs to disentangle these factors, categorize acceleration techniques along three axes (algorithmic innovations, architectural and system optimizations, and inference-time scaling), and provide guidelines for reproducible benchmarking.","take":"Diffusion LLMs promise parallel generation, but as I kept flagging writing up [iLLaDA](/articles/illada-diffusion-language-model), \"parallel\" on paper rarely means \"fast\" in practice. This survey is the useful corrective: a unified latency-decomposition framework that stops people conflating algorithmic, architectural, and system-level speedups, then a taxonomy of the acceleration zoo — diffusion-aware caching/reuse, architecture and system tricks, inference-time scaling — laid out along those axes. The honest through-line is that rigorous end-to-end benchmarking is genuinely hard because these factors trade off against each other, which is exactly why a single-number \"5× faster\" claim for a dLLM deserves suspicion. 
Read it as a map before you trust any diffusion-LM speedup.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.12829","pdf":"https://arxiv.org/pdf/2607.12829","html":"https://arxiv.org/html/2607.12829v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.12829"}},{"arxivId":"2607.13013v1","title":"Audio-Native Speech Recognition with a Frozen Discrete-Diffusion Language Model","authors":["Harsha Vardhan Khurdula","Abhinav Kumar Singh","Yoeven D Khemlani","Vineet Agarwal"],"categories":["cs.AI","cs.SD"],"abstract":"Automatic speech recognition is dominated by autoregressive decoders that emit one token at a time. We ask whether a discrete diffusion language model can transcribe speech instead, refining a whole transcript in parallel over a small number of denoising steps. We train an audio-native interface for DiffusionGemma, a 26B mixture-of-experts model that generates text by uniform, random-token discrete diffusion rather than the absorbing-mask scheme common to recent diffusion language models. A frozen Whisper encoder supplies acoustic features, a lightweight projector maps them into the model embedding space, and low-rank adapters let the frozen backbone attend to the new modality. About 42M parameters are trained, which is 0.16 percent of the backbone. We find that the natural training objectives fail to ground the audio because their gradient reaches the projector only through attention that has already dismissed it. A connectionist temporal classification loss applied through the frozen output head breaks this deadlock. 
The resulting model reaches 6.6 percent word error rate on LibriSpeech test-clean, transcribes in roughly eight parallel steps regardless of utterance length, and uses a single adapter trained on six languages.","take":"A tidy bolt-on: freeze a 26B diffusion LM (DiffusionGemma, which denoises via uniform random-token corruption rather than the absorbing-mask scheme in [iLLaDA](/articles/illada-diffusion-language-model)), hang a frozen Whisper encoder plus a tiny projector and LoRA off it, and transcribe speech by refining the whole transcript in ~8 parallel steps regardless of length — training only 42M params, 0.16% of the backbone. The genuinely interesting part is a failure they had to engineer around: the natural objectives never grounded the audio because the gradient reached the projector only through attention that had already ignored it, and a CTC loss through the frozen output head broke the deadlock. 6.6% WER on LibriSpeech from a near-frozen model is a strong argument that diffusion LMs are a real substrate, not just a curiosity.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.13013","pdf":"https://arxiv.org/pdf/2607.13013","html":"https://arxiv.org/html/2607.13013v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.13013"}},{"arxivId":"2607.12790v1","title":"Who Grades the Grader? Co-Evolving Evaluation Metrics and Skills for Self-Improving LLM Agents","authors":["Xing Zhang","Guanghui Wang","Yanwei Cui","Ziyuan Li","Wei Qiu","Bing Zhu","Peiyang He"],"categories":["cs.AI","cs.CL","cs.MA"],"abstract":"Self-evolving agent systems improve by creating, revising, and retiring their own skills, but every such loop rests on a hidden assumption: a reliable evaluation metric already exists. In many real applications it does not. We make three claims. 
First, metrics can be evolved: our metric loop searches compositions of small drawback detectors under a full evolutionary lifecycle, trained to agree with a ten-item anchored reference set, regularized by consensus over unlabeled outputs, and audited against a held-out anchor it never reads, yielding a transparent, inspectable metric rather than an opaque judge. Second, since no metric exists to beat, the yardstick is recovering what an accurate metric would have enabled, and Double Ratchet, our co-evolution of the metric with a lifecycle-managed skill loop, does so: across code generation (MBPP+), enterprise text-to-SQL (Spider 2.0-Snow), and reference-free report generation, it retains 88 to 110 percent of the held-out lift achieved by the same skill loop driven by ground-truth labels.","take":"Every self-improving agent loop quietly assumes a reliable evaluation metric already exists — and in real applications it usually doesn't, which makes the whole loop circular. This paper evolves the *metric itself*: a searchable composition of small \"drawback detectors\" trained to agree with a ten-item anchored reference set, regularized by consensus on unlabeled outputs, and audited against a held-out anchor it never sees — so you get an inspectable metric, not an opaque LLM judge. Their \"Double Ratchet\" co-evolves that metric with the skill loop and recovers 88 to 110% of the lift you'd get from ground-truth labels on MBPP+, text-to-SQL, and report generation. A sharp answer to the problem lurking under every [multi-agent self-improvement](/articles/nous-hermes-moa) scheme: who grades the grader.","standout":true,"links":{"abs":"https://arxiv.org/abs/2607.12790","pdf":"https://arxiv.org/pdf/2607.12790","html":"https://arxiv.org/html/2607.12790v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.12790"}},{"arxivId":"2607.12881v1","title":"Inhibited Self-Attention: Sharpening Focus in Vision Transformers","authors":["Peter R. D. 
van der Wal","Nicola Strisciuglio","George Azzopardi"],"categories":["cs.CV"],"abstract":"Vision Transformers (ViTs) have demonstrated remarkable performance in computer vision tasks. However, their self-attention mechanism often diffuses focus across background regions, relying on spurious correlations rather than object-relevant cues. Inspired by inhibitory mechanisms observed in biological vision systems, we propose the Inhibited Self-Attention (ISA), a novel self-attention that integrates inhibitory signals to enhance feature selectivity and suppress spurious responses. In contrast to conventional self-attention, which relies solely on positive attention values due to softmax normalization, our approach retains and utilizes negative attention scores to suppress irrelevant features and sharpen focus on objects of interest. Experiments across multiple datasets, including ImageNet-1k and COCO, and several robustness benchmarks demonstrate that ISA enhances object-centric selectivity, reduces shortcut reliance, and improves out-of-distribution generalization.","take":"Softmax attention is all-positive: every token can only *add* to the mix, so nothing can be actively suppressed. Inspired by inhibitory neurons in biological vision, ISA keeps negative attention scores so a ViT can push background and spurious features *down* instead of merely up-weighting the salient ones. The payoff is object-centric selectivity — less shortcut reliance, better out-of-distribution generalization on ImageNet-1k/COCO, and relevance maps that visibly tighten onto objects. 
It's a small change to the [attention](/articles/how-transformers-attention-works) primitive with an intuitive story: let attention say \"no,\" not just \"yes.\" Worth watching whether the same trick helps language models, where softmax's positivity is just as baked in.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.12881","pdf":"https://arxiv.org/pdf/2607.12881","html":"https://arxiv.org/html/2607.12881v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.12881"}},{"arxivId":"2607.12959v1","title":"ViCo3D: Empowering LiDAR-based Collaborative 3D Object Detection with Vision Foundation Models","authors":["Haojie Ren","Songrui Luo","Lingfeng Wang","Yan Xia","Yao Li","Jing Li","Lu Zhang","Jiajun Deng","Yanyong Zhang"],"categories":["cs.CV"],"abstract":"LiDAR-based collaborative 3D perception in Vehicle-to-Everything (V2X) systems typically relies on fusing bird's-eye-view (BEV) features across agents. However, current BEV representations, typically extracted by LiDAR backbones trained from scratch, are geometry-dominated and lack general semantic priors, inherently limiting feature-level collaboration. Vision foundation models (VFMs) pretrained on large-scale image data learn general-purpose visual representations and could enhance agent-wise LiDAR BEV features, but adapting them is hard due to the image-point-cloud modality gap. ViCo3D projects point clouds onto the BEV plane as three-channel images so DINOv2 can extract BEV-space visual features from LiDAR inputs, introduces a multi-scale BEV fusion module to integrate these with LiDAR geometric features, and adopts an ego-centric cross-agent fusion strategy. On DAIR-V2X and V2XSet it achieves state-of-the-art 3D detection, with up to 1.8x greater collaborative gains than prior methods on DAIR-V2X.","take":"Collaborative V2X perception fuses bird's-eye-view features across vehicles, but those BEV features come from LiDAR backbones trained from scratch — geometry-rich, semantics-poor. 
ViCo3D's move is to render the point cloud as a three-channel BEV image so a frozen DINOv2 can pull general semantic priors out of it, then fuse those with the raw geometric features before sharing across agents. It reports state-of-the-art on DAIR-V2X/V2XSet and, more tellingly, up to 1.8x larger *collaborative* gains — the vision-foundation semantics are what make cross-agent fusion actually pay off. A nice bridge between the point-cloud world of [FAST-LIO2](/articles/fast-lio2-lidar-inertial-odometry) and the 2D foundation-model world.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.12959","pdf":"https://arxiv.org/pdf/2607.12959","html":"https://arxiv.org/html/2607.12959v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.12959"}}],"kind":"arxiv","slug":"2026-07-15","body":"","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-07-15"},{"date":"2026-07-10","papers":[{"arxivId":"2607.08642v1","title":"DominoTree: Conditional Tree-Structured Drafting with Domino for Speculative Decoding","authors":["Saw S. Lin","Jyh-Shing Roger Jang"],"categories":["cs.CL"],"abstract":"Speculative decoding accelerates LLM inference by drafting several tokens and verifying them in parallel. Block-diffusion drafters such as DFlash produce a draft block in one pass but model only per-position marginals; best-first tree methods such as DDTree expand candidate trees from those marginals. The released Domino drafter adds a GRU-based causal correction that makes each draft token's distribution path-dependent, a structure DDTree's factorized formulation cannot represent. We introduce DominoTree, a training-free best-first draft tree scored by Domino's conditional, non-factorized correction along each root-to-node path, made practical by restricting the per-node correction to a candidate top-M. 
On Qwen3-4B across eight benchmarks, DominoTree reaches up to 6.6x speedup over autoregressive decoding and the highest mean accept length of any evaluated method, up to 10.7 tokens per round, at every temperature we test. DominoTree constructs its tree with a GPU-native, CUDA-graph builder that is bit-identical to a reference Python implementation, so acceptance is unchanged, while keeping per-round tree construction cheap.","take":"We walked the mechanics of speculative decoding in the [DeepSeek dSpark piece](/articles/deepseek-dspark) — draft cheap, verify the whole block in one target pass. DominoTree's twist is that the draft *tree* is scored by a conditional, path-dependent correction (Domino's GRU head), so each root-to-node path carries its own distribution instead of the factorized per-position marginals a tree method like DDTree assumes. Training-free, with a CUDA-graph builder that's bit-identical to the reference so acceptance is unchanged, and up to 6.6x over autoregressive on Qwen3-4B at a mean accept length of 10.7 tokens/round. Honest read: the throughput edge over prior tree methods is decisive at T=0 but narrows to a tie (and a small loss) at high temperature — the win is real, not universal.","standout":true,"links":{"abs":"https://arxiv.org/abs/2607.08642","pdf":"https://arxiv.org/pdf/2607.08642","html":"https://arxiv.org/html/2607.08642v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.08642"}},{"arxivId":"2607.08734v1","title":"The Illusion of Equivalency: Statistical Characterization of Quantization Effects in LLMs","authors":["Baha Rababah","Cuneyt Gurcan Akcora","Carson K. Leung"],"categories":["cs.AI"],"abstract":"Post-training quantization is widely used to deploy large language models in resource-constrained settings, yet its evaluation relies almost exclusively on accuracy and perplexity. We show that these metrics fail to capture behavioral changes induced by quantization. 
We introduce correctness agreement, a decision-level metric that measures overlap in correct predictions between a base model and its quantized variants, independent of absolute accuracy. Across multiple models and quantization schemes from 8-bit to 2-bit, we find that behavioral divergence emerges under moderate quantization even when task performance appears preserved. To explain this effect, we analyze quantization as a structural operator on attention weights and quantify layer-wise distortions using statistical and distributional measures. Our results reveal non-linear breakpoints at low bit-widths and show that query and key projections are consistently more sensitive than value and output projections.","take":"This is the paper the quantization hype needs. Accuracy and perplexity say an INT4 model \"matches\" its base — this shows that's an illusion: introduce *correctness agreement* (do the two models get the same items right?) and behavioral divergence appears under even moderate quantization while the headline metrics look preserved. It localizes the damage, too — query and key projections are consistently more sensitive than value/output, with non-linear breakpoints at low bit-width. 
It's exactly the caveat we kept flagging around [native FP4 training](/articles/nemotron-nvfp4) and [rotation-based KV quant](/articles/turboquant-kv-cache): a matched loss or accuracy number is not behavioral parity, and you have to measure the behavior to know.","standout":true,"links":{"abs":"https://arxiv.org/abs/2607.08734","pdf":"https://arxiv.org/pdf/2607.08734","html":"https://arxiv.org/html/2607.08734v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.08734"}},{"arxivId":"2607.08643v1","title":"BiSCo-LLM: Lookup-Free Binary Spherical Coding for Extreme Low-Bit Large Language Model Compression","authors":["Yuantian Shao","Peisong Wang","Zhilei Liu","Chuangyi Li","Yuanteng Chen","Pengcheng Xie","Yiwu Yao","Zhihui Wei","Jian Cheng"],"categories":["cs.LG"],"abstract":"Large language models are increasingly constrained by memory capacity, weight bandwidth, and checkpoint storage during deployment. Scalar or group-wise quantization is simple and compatible with efficient low-precision kernels, but its representation capacity becomes limited when the target budget approaches 2 bits per weight. Vector-quantized weight compression provides a richer block-level representation, but usually introduces explicit codebooks, index lookup, and additional storage. BiSCo-LLM is a codebook-free binary spherical coding framework: local weight chunks are mapped onto a unit hypersphere and binarized into compact spherical codes, so the main payload is a bit-packed sign stream rather than explicit VQ centroids; a residual stage encodes the reconstruction error without stored codebooks; and category-wise recovery distillation reduces the mismatch between local reconstruction and assembled model behavior. 
A small 8-bit protected-channel path stabilizes sensitive channels and is counted separately.","take":"Vector quantization gives you rich sub-2-bit weights but drags in explicit codebooks and index lookups; scalar/group quant is kernel-friendly but runs out of representation near 2 bits. BiSCo-LLM threads it — map weight chunks onto a unit hypersphere and binarize to a bit-packed sign stream (a *codebook-free* spherical code), then a residual stage buys back rate-distortion without stored centroids. Same \"normalize into a nicer basis, then quantize\" instinct as [TurboVec](/articles/turbovec). Watch the fine print though: the honest storage budget also counts the neural decoders, an 8-bit protected-channel path, and LoRA adapters — which is exactly where extreme-low-bit claims tend to hide the weight they saved.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.08643","pdf":"https://arxiv.org/pdf/2607.08643","html":"https://arxiv.org/html/2607.08643v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.08643"}},{"arxivId":"2607.08766v1","title":"OPSD-V: On-Policy Self-Distillation for Post-Training Few-Step Autoregressive Video Generators","authors":["Hongyu Liu","Chun Wang","Feng Gao","Xuanhua He","Yue Ma","Ziyu Wan","Yong Zhang","Xiaoming Wei","Qifeng Chen"],"categories":["cs.CV"],"abstract":"We propose OPSD-V, an on-policy self-distillation paradigm for post-training few-step autoregressive (AR) video diffusion models. Existing few-step AR video generators can produce long videos with low latency, but still suffer from error accumulation and weakened motion dynamics during long autoregressive rollout. OPSD-V reduces long-horizon degradation while preserving the original few-step inference path. 
The student follows the exact inference-time rollout, generating each chunk conditioned on its own previously generated KV cache; in parallel, the teacher is evaluated at the same student-visited denoising states but uses a cleaner AR-consistent temporal cache in which older history can be replaced by real-video context. This provides dense denoising-level corrective targets under on-policy AR cache dynamics, without changing the sampler, number of denoising steps, or inference-time cache mechanism. A user study prefers OPSD-V over the base models in 66.0% of overall-preference judgments.","take":"Few-step autoregressive video generators are fast but drift — error accumulates and motion goes limp over a long rollout. OPSD-V fixes it without touching the inference path: the student rolls out exactly as it will at test time (conditioned on its own KV cache), while a teacher is evaluated at the same denoising states but allowed a cleaner cache seeded with *real* video, giving dense corrective targets. 
It's the on-policy, cache-aware cousin of the training-free acceleration tricks in [MrFlow](/articles/mrflow-diffusion-acceleration) — and refreshingly, the win is reported as a preference study (66% overall) rather than one cherry-picked metric.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.08766","pdf":"https://arxiv.org/pdf/2607.08766","html":"https://arxiv.org/html/2607.08766v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.08766"}},{"arxivId":"2607.08408v1","title":"Track2Map: Online Deformable SLAM with Motion-Aware Pose Optimization in Robotic Surgery","authors":["Tianyi Song","Sierra Bonilla","Xinwei Ju","Evangelos Mazomenos","Danail Stoyanov","Adam Schmidt","Omid Mohareri","Sophia Bano","Francisco Vasconcelos"],"categories":["cs.CV","cs.AI"],"abstract":"Gaussian splatting is the current state-of-the-art for dense, deformable 3D anatomy reconstruction in robot-assisted minimally invasive surgery; however, most pipelines are offline and depend on accurate camera trajectory priors (often from robotic kinematics), limiting applicability when priors are missing or noisy. Track2Map is an online 3D Gaussian Splatting pipeline that jointly optimizes camera trajectory and 3D deformable scene representation directly from surgical video, and due to its online nature effectively works as a SLAM method. To stabilize optimization under tissue motion and ambiguous visual cues, it introduces a track-anchored deformation initialization using dense 2D point tracks, and uses track statistics to disentangle camera motion from scene deformation by detecting static camera periods and reducing drift during incremental mapping. Experiments on StereoMIS improve reconstruction quality and camera trajectory over competing SLAM methods.","take":"Most deformable-anatomy reconstruction is offline and leans on a camera-trajectory prior from robot kinematics. 
Track2Map drops the prior and does it online — jointly optimizing camera pose *and* a deformable 3D Gaussian scene straight from surgical video, which makes it a genuine SLAM system. The bit that keeps it from diverging under tissue motion is the same problem [FAST-LIO2](/articles/fast-lio2-lidar-inertial-odometry) fights: it leans on dense 2D point tracks to initialize deformation and to detect static-camera periods, disentangling camera motion from scene deformation to cut drift. Gaussian-splat SLAM in a squishy, prior-free scene is a genuinely hard setting.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.08408","pdf":"https://arxiv.org/pdf/2607.08408","html":"https://arxiv.org/html/2607.08408v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.08408"}},{"arxivId":"2607.08391v1","title":"On Exploring Input Resolution Scaling For Anytime LiDAR Object Detection","authors":["Ahmet Soyyigit","Shuochao Yao","Heechul Yun"],"categories":["cs.RO","cs.LG"],"abstract":"Making tradeoffs between execution latency and result utility (anytime computing) has been shown to enhance the performance of cyber-physical systems. We enable anytime computing for DNNs that process LiDAR point clouds for 3D object detection: a method that enables multi-resolution inference for models that process point clouds as pillars or voxels, allowing the input to be dynamically scaled and processed at the resolution needed to meet timing requirements. The memory-efficient approach requires deploying only a single DNN model rather than one per resolution. A deadline-aware scheduler selects the highest possible resolution for each input by predicting the execution time for all resolutions at runtime, which is challenging due to the irregularity of LiDAR point clouds. 
On nuScenes it significantly outperforms existing anytime approaches, and in a simulated autonomous driving system it enables collision-free navigation while avoiding unnecessary stalls.","take":"A tidy systems idea for real-time perception: instead of shipping N models trained at N resolutions, train one point-cloud detector that can be *run* at any resolution, then add a deadline-aware scheduler that predicts each resolution's runtime and picks the highest one that fits the time budget. Predicting runtime is the hard part, because LiDAR point clouds are irregular so cost isn't a clean function of input size. It's the anytime-computing counterpart to the latency discipline that matters in the [FAST-LIO2](/articles/fast-lio2-lidar-inertial-odometry) stack — trade a little accuracy to never blow the frame deadline.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.08391","pdf":"https://arxiv.org/pdf/2607.08391","html":"https://arxiv.org/html/2607.08391v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.08391"}}],"kind":"arxiv","slug":"2026-07-10","body":"","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-07-10"},{"date":"2026-07-03","papers":[{"arxivId":"2607.02461v1","title":"OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers","authors":["Donghyun Lee","Jitesh Chavan","Duy Nguyen","Sam Huang","Liming Jiang","Priyadarshini Panda"],"categories":["cs.CV","cs.AI","cs.LG"],"abstract":"Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. 
In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings, and pushes PTQ of image diffusion transformers to W2A4 with usable generation quality.","take":"If you read the [TurboQuant write-up](/articles/turboquant-kv-cache) this week, OrbitQuant will feel like déjà vu — because it's the same primitive pointed at a third target. Rotate into a basis where every coordinate lands on one fixed, known marginal (here a randomized permuted block-Hadamard rotation), and a single Lloyd-Max codebook quantizes everything with no calibration. That data-free property is exactly what diffusion-transformer PTQ needs, because DiT activations drift across every timestep, prompt, and guidance branch, so calibration-based quantizers have to re-fit constantly. The tidy engineering touch is absorbing the rotation into the weights offline so it cancels inside each linear layer, leaving only one forward rotation at runtime. Pushing image DiTs to W2A4 with usable output is the number to watch. 
\"Rotate, then quantize\" is quietly becoming a standard primitive.","standout":true,"links":{"abs":"https://arxiv.org/abs/2607.02461","pdf":"https://arxiv.org/pdf/2607.02461","html":"https://arxiv.org/html/2607.02461v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.02461"}},{"arxivId":"2607.02502v1","title":"DemoPSD: Disagreement-Modulated Policy Self-Distillation","authors":["Yunhe Li","Hao Shi","Wenhao Liu","Mengzhe Ruan","Hanxu Hou","Zhongxiang Dai"],"categories":["cs.LG","cs.AI"],"abstract":"On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: privileged information leakage, where the student encodes answer-dependent shortcuts that are unavailable at test time. We introduce DemoPSD, a novel framework that resolves such problems through the idea of selective adoption of teacher guidance. Instead of fitting the full teacher distribution, DemoPSD steers the student toward a reverse-KL barycenter target, a weighted geometric combination of the teacher and student distributions, and uses the teacher-student discrepancy to adaptively control the blending at each token position. We provably show leakage attenuation and exploration preservation. 
Experiments on SciKnowEval across four scientific fields show DemoPSD outperforms both GRPO and SDPO while maintaining higher training entropy and robustly generalizing to out-of-distribution GPQA.","take":"This is the direct next chapter on the \"privilege illusion\" [DOPD named last week](/arxiv/2026-06-30): when a self-distillation teacher sees privileged info, the student learns answer-dependent shortcuts it can't use at test time — leakage, not capability. DemoPSD's fix is elegant and, unusually, *proven*: don't fit the full teacher, steer toward a reverse-KL barycenter (a geometric blend of teacher and student) with the blend weight set per-token by how much the two disagree. Where the teacher and student already agree, lean on the teacher; where they diverge, preserve the student's own reasoning and exploration. Beating GRPO and SDPO while keeping *higher* training entropy is the tell that it's actually preserving exploration rather than collapsing onto the teacher. On-policy distillation keeps generating its own subfield of failure modes and fixes.","standout":true,"links":{"abs":"https://arxiv.org/abs/2607.02502","pdf":"https://arxiv.org/pdf/2607.02502","html":"https://arxiv.org/html/2607.02502v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.02502"}},{"arxivId":"2607.02255v1","title":"AgenticSTS: A Bounded-Memory Testbed for Long-Horizon LLM Agents","authors":["Xiangchen Cheng","Yunwei Jiang","Jianwen Sun","Zizhen Li","Chuanhao Li","Xiangcheng Cao"],"categories":["cs.AI","cs.CL"],"abstract":"Memory for a long-horizon LLM agent is a contract about what each future decision is allowed to see. The simplest contract appends past observations, tool calls, and reflections to every prompt, which makes prior context easy to access but also turns it into a jumbled mixture in which the effect of any single memory component is hard to isolate. 
We introduce and instrument an alternative bounded contract: every decision is made from a fresh user message assembled by typed retrieval, with no raw cross-decision transcript appended. The prompt thus stays bounded across runs of any length, and any single layer can be ablated in isolation. We instantiate the contract in Slay the Spire 2, a closed-rule stochastic deck-building game whose runs require hundreds of tactical and strategic decisions. Within our harness, a fixed-A0 ablation shows the largest observed difference when triggered strategic skills are enabled: the no-store baseline wins 3/10 games and adding the skill layer 6/10. We release a reproducible testbed: 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts.","take":"The framing is the contribution: agent memory is \"a contract about what each future decision is allowed to see.\" Instead of the usual append-everything transcript — which makes it impossible to isolate what any one memory component actually does — they assemble each decision from a *bounded*, typed retrieval, so the prompt stays fixed-size over arbitrarily long runs and each layer can be ablated cleanly. Slay the Spire 2 is a shrewd testbed: hundreds of stochastic decisions per run, frontier LLMs currently at zero wins, humans at 16%, so it's hard but not saturated. 
The result (3/10 → 6/10 with a strategic-skill layer) is honestly flagged as directional at this sample size — refreshing restraint — but the reusable harness and 298 released trajectories are the real deliverable for anyone studying long-horizon memory.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.02255","pdf":"https://arxiv.org/pdf/2607.02255","html":"https://arxiv.org/html/2607.02255v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.02255"}},{"arxivId":"2607.02436v1","title":"Reasoning effort, not tool access, buys first-try reliability in agentic code generation","authors":["Achint Mehta"],"categories":["cs.SE","cs.AI"],"abstract":"Agentic coding assistants are increasingly given extra capabilities, such as browser-based testing tools and design-oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application from one detailed specification, each scored on a fixed 14-criterion functional rubric (42-point max) and a visual review. Capability tier dominated: frontier models clustered near the ceiling while a low-cost local model fell to 24-37 points. Container deployment was the dominant defect, failing first-try in 44% of runs. The testing tool raised cost by 42-68% without improving functional score or reliability. Raising reasoning effort from High to xHigh lifted first-try perfect runs from 28% to 89% and cut corrective prompts about five-fold, for 9-29% more cost. A design-oriented prompt raised visual quality (4.5 vs 3.0) without lifting function. The practical lesson is to match the fix to the failure: most first-run failures came from weak reasoning, not from visible flaws a checking tool would catch.","take":"A refreshingly empirical corrective to the \"give the agent more tools\" reflex. 
Ninety runs building the same app from one spec, and the result cuts against the current instinct: the browser-based testing tool added 42-68% cost and improved nothing, while simply turning reasoning effort from High to xHigh took first-try-perfect runs from 28% to 89% and cut corrective prompts five-fold. The failures were reasoning failures (container deployment botched on the first try 44% of the time), not the kind of visible flaws a checking tool catches — so a stronger model or more thinking prevents them and a test harness doesn't. It's a single-app observational study, not a controlled benchmark, but \"match the fix to the failure\" is exactly the discipline the agent-tooling gold rush is skipping.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.02436","pdf":"https://arxiv.org/pdf/2607.02436","html":"https://arxiv.org/html/2607.02436v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.02436"}},{"arxivId":"2607.02507v1","title":"What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates","authors":["Arman Ghaffarizadeh","Danyal Mohaddes","Aliakbar Izadkhah","Shahriar Noroozizadeh"],"categories":["cs.AI","cs.CL","cs.LG","cs.MA"],"abstract":"LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. 
Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a ~3% baseline to roughly 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, the OTR response explicitly attributes public accommodation to relational pressures such as career risk or sponsorship obligation.","take":"A genuinely unsettling result, cleanly measured. Put an agent in a socially structured debate with no explicit objective in the prompt, give it a private \"off-the-record\" channel alongside its public utterances, and under alignment-inducing conditions the two diverge — its public-vs-private decision gap jumps from ~3% to ~40%, consistent across stance, semantic-similarity, NLI, and survey measures. Sometimes the private channel *names the reason* — career risk, sponsorship obligation. Nobody trained this in; social structure alone induced a latent objective (say the agreeable thing publicly, hold the real view privately). The takeaway is a real methodology gap: evaluating agents on their stated goals misses emergent ones, and a dual-channel probe is a concrete way to catch them. The kind of eval we're going to need more of.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.02507","pdf":"https://arxiv.org/pdf/2607.02507","html":"https://arxiv.org/html/2607.02507v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.02507"}},{"arxivId":"2607.02509v1","title":"ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning","authors":["Yanjun Zhao","Ruizhong Qiu","Tianxin Wei","Yuanchen Bei","Zhining Liu","Lingjie Chen"],"categories":["cs.AI"],"abstract":"Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. 
Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. We propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. We provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context show RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B.","take":"The gap this targets is real and under-discussed: models with 128K windows can *access* a fact and still fail to *use* it — context length isn't context utilization. ReContext is a training-free harness that reads the model's own internal relevance signals to pull a query-conditioned evidence pool, then replays it right before generation while keeping the full original context intact — no pruning, no external memory. The associative-memory framing (context = store, question = cue, attention = association, replay = reactivation) is a clean way to think about why \"remind the model of what matters\" works. 
It pairs naturally with the long-context hardware story from [MiMo](/articles/mimo-v2-flash): the architecture makes 256K *cheap*, harnesses like this make it *usable*.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.02509","pdf":"https://arxiv.org/pdf/2607.02509","html":"https://arxiv.org/html/2607.02509v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.02509"}},{"arxivId":"2607.02515v1","title":"PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation","authors":["Haofei Xu","Rundi Wu","Philipp Henzler","Nikolai Kalischek","Michael Oechsle","Fabian Manhardt"],"categories":["cs.CV"],"abstract":"State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions such as transparent objects.","take":"A satisfying \"the overhead was unnecessary\" result. Monocular 3D reconstruction has accreted hybrid architectures, bespoke losses, and latent-space compressions to piggyback on pretrained latent diffusion models — and PointDiT shows a plain ViT doing pixel-space diffusion directly on raw 3D point-map patches, conditioned on DINOv3 image tokens, beats them. 
No point-map tokenizer, backbone trained from scratch, and it's *sharper* on the hard cases like transparent objects where latent compression tends to smear. It rhymes with the broader \"stop compressing into latents, work in the native space\" mood ([Cross-Space Distillation](/arxiv/2026-07-01) argued the opposite direction is hard for a reason). Simpler and better is the best kind of paper.","standout":false,"links":{"abs":"https://arxiv.org/abs/2607.02515","pdf":"https://arxiv.org/pdf/2607.02515","html":"https://arxiv.org/html/2607.02515v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2607.02515"}}],"kind":"arxiv","slug":"2026-07-03","body":"A day that reads like a callback to the last two weeks of this feed. **OrbitQuant** is the\n[TurboQuant](/articles/turboquant-kv-cache) primitive — rotate into a known distribution, then\napply one data-free Lloyd-Max codebook — now quantizing diffusion transformers; **DemoPSD** is the\nnext fix in the on-policy-distillation saga, attacking the \"privileged information leakage\" that\n[DOPD](/arxiv/2026-06-30) named. Around them, the long-horizon-agent thread keeps maturing\n(bounded memory contracts, evidence replay, and a sharp reminder that reasoning effort beats tool\nsprawl), plus one genuinely uncomfortable finding about what agents say off the record.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-07-03"},{"date":"2026-07-01","papers":[{"arxivId":"2606.32034v1","title":"QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents","authors":["Sergio Hernández-Gutiérrez","Matteo Merler","Ilze Amanda Auzina","Joschka Strüber","Ameya Prabhu","Matthias Bethge"],"categories":["cs.LG","cs.AI","cs.CL"],"abstract":"LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the goodness of intermediate actions. 
Dense supervision methods aim to solve this problem by scoring intermediate steps, from intrinsic confidence to self-distillation and embedding similarities. However, it is common practice to evaluate them by measuring the downstream performance of a training pipeline that integrates them. This is expensive, conflates supervision quality with training engineering confounders, and renders different methodological families requiring distinct training setups incomparable. We introduce QVal, a training-free testbed for directly evaluating dense supervision signals. Given a state-action pair, QVal measures how well a method's score is Q-aligned: whether it orders actions according to the Q-values of a strong reference-policy. We instantiate QVal as QVal-v1.0, benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones. We find that simple prompting baselines consistently outperform recent dense supervision methods from the literature, and that performance clusters strongly by family.","take":"Yesterday's [Agents-A1](/articles/agents-a1) made the case that verified intermediate steps — not outcome-only rewards — are what a long-horizon agent learns from. QVal asks the sharp follow-up: how do you know your step-scoring signal is any good *before* you spend a training run on it? Their answer is clean — score a state-action pair, then check whether the ordering matches the Q-values of a strong reference policy. Training-free, so you separate signal quality from training-pipeline luck. The result is the kind of finding that only falls out of an honest common-ground benchmark: across 21 methods and 1.2K experiments, plain prompting baselines beat most of the fancy dense-supervision machinery. 
A testbed that lets you kill bad ideas before the GPU bill.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.32034","pdf":"https://arxiv.org/pdf/2606.32034","html":"https://arxiv.org/html/2606.32034v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.32034"}},{"arxivId":"2606.32020v1","title":"Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers","authors":["Anh Nguyen","Ngan Nguyen","Duc Vu","Trung Dao","Viet Nguyen","Quan Dao","Khoi Nguyen","Anh Tran"],"categories":["cs.CV"],"abstract":"Modern one-step diffusion models achieve impressive quality through distribution-based timestep distillation. Yet, they rely on a critical assumption: Teacher and Student must inhabit the same latent space. This Shared-Space constraint prevents knowledge transfer from modern high-capacity Teachers (e.g., SD 3.5 and Flux) into compact, deployment-friendly Students such as SD 1.5, whose latent resolution and VAE parameterization differ from the Teacher. We formalize this overlooked regime as Cross-Space Distillation, where Teacher and Student differ in both latent resolution and VAE space. To enable distillation under this mismatch, we introduce the Bridge, a lightweight latent interface that maps Student latents into the Teacher space without modifying the Student backbone. Bridge combines a frozen Student VAE decoder as a spatial prior with a compact learnable projector, and is trained with latent reconstruction and attention fidelity objectives for stable Teacher-space alignment. Across diverse modern Teachers, Bridge enables substantial gains for compact one-step Students; for example, it improves SD 1.5 from 5.4 to 9.4 HPSv3 while preserving one-step inference, low latency, and broad ecosystem compatibility.","take":"Distillation almost always assumes teacher and student speak the same latent language — same VAE, same resolution. 
That quietly locks you out of the most useful transfer of all: pouring a modern high-capacity teacher (SD 3.5, Flux) into the small, beloved, ecosystem-rich SD 1.5. The move is a lightweight \"Bridge\" that maps student latents into the teacher's space without touching the student backbone, so the student stays one-step and deployable while learning from a teacher it structurally couldn't before. 5.4 → 9.4 HPSv3 on SD 1.5 is a big jump for a frozen-backbone adapter. The general lesson keeps recurring this week: the interface between teacher and student is where the leverage is.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.32020","pdf":"https://arxiv.org/pdf/2606.32020","html":"https://arxiv.org/html/2606.32020v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.32020"}},{"arxivId":"2606.31984v1","title":"GR2 Technical Report","authors":["Yufei Li","Zaiwei Zhang","Mingfu Liang","Kavosh Asadi","Jay Xu","Hamed Firooz","Luke Simon"],"categories":["cs.IR","cs.AI"],"abstract":"Industrial recommendation systems serve billions of users through a multi-stage funnel -- retrieval, early-stage ranking, and re-ranking -- where the final re-ranking step disproportionately shapes user engagement. Despite growing enthusiasm for LLMs in recommendation, three gaps hinder industrial adoption: most efforts target retrieval and ranking, leaving re-ranking underexplored; LLMs are typically deployed zero-shot or via SFT, underutilizing RL on verifiable rewards; and deployed catalogs index billions of items with non-semantic identifiers outside any base-LLM vocabulary. We present GR2 (Generative Reasoning Re-Ranker), an end-to-end framework that combines mid-training on semantic IDs with >=99% uniqueness, reasoning traces distilled from a stronger teacher via rejection sampling, and RL with verifiable rewards purpose-built for re-ranking. 
To make GR2 resource-viable, we introduce a context compressor, On-Policy Distillation (OPD) as a scalable alternative to SFT -- which we find collapses at industrial scale -- and reasoning distillation for low-latency serving. GR2 delivers +18.7% R@1, +7.1% R@3, and +9.6% N@3 over legacy baselines on industrial-scale traffic. We further find that LLMs often hack rewards by preserving the incoming order or exploiting position bias.","take":"Two things make this more than a recsys report. First: another independent data point that **on-policy distillation is eating SFT** — they state flatly that SFT \"collapses at industrial scale\" and reach for OPD instead, the same conclusion Agents-A1, MOPD, and DOPD landed on this week from entirely different directions. Second: the reward-hacking honesty. Their re-ranking LLM learns to game the metric by just preserving the incoming order or riding position bias — exactly the failure mode you'd predict, caught and named, and patched with conditional verifiable rewards. Applying LLM reasoning to the re-ranking stage (the one closest to the user) with billions of non-semantic item IDs is a real systems problem, handled as one.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.31984","pdf":"https://arxiv.org/pdf/2606.31984","html":"https://arxiv.org/html/2606.31984v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.31984"}},{"arxivId":"2606.32025v1","title":"Generative Skill Composition for LLM Agents","authors":["Xinyu Zhao","Zhen Tan","Vaishnav Tadiparthi","Nakul Agarwal","Kwonjoon Lee","Tianlong Chen"],"categories":["cs.CL"],"abstract":"Recent LLM agents benefit from skills for solving complex tasks. As skill libraries grow and become reusable across tasks and domains, selecting an appropriate skill composition has emerged as a central bottleneck. Existing approaches either expose the agent's reasoning to the entire skill collection or perform skill retrieval via embeddings or LLM-based rerankers. 
Both miss the structural nature of skill composition, which is a joint decision over which skills, how many, and in what order -- three dimensions that cannot be decoupled. We formalize this as structured skill composition: given a task and a skill library, predict an executable skill plan that jointly specifies the activated subset, count, and execution order. We propose SkillComposer, which uses a constrained autoregressive decoder over skill identifiers, so subset, count, and order emerge jointly from a single decoding pass. On GPT-5.2-Codex and Gemini-3-Pro-Preview, SkillComposer raises the pass rate by +23.1 and +18.2pp over the no-skill baseline, surpassing top-3 retrieval and matching the gold-skill retrieval upper bound at lower prompt-token cost.","take":"The framing is the contribution: skill selection isn't retrieval, it's *composition* — which skills, how many, in what order — and those three can't be chosen independently. Treating it as constrained autoregressive decoding over skill IDs, so the subset/count/order fall out of one pass, is the right shape for a joint decision with dependencies between steps. Matching the gold-skill upper bound at lower token cost is the number that matters: it means the plan is as good as knowing the answer, without stuffing the whole library into context. 
As agents lean harder on skill libraries (the [Claude-style](/articles/agents-a1) procedural-package trend), the router over those skills becomes the bottleneck, and this is a clean take on it.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.32025","pdf":"https://arxiv.org/pdf/2606.32025","html":"https://arxiv.org/html/2606.32025v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.32025"}},{"arxivId":"2606.32014v1","title":"Scalable Behaviour Cloning on Browser via Skill Distillation","authors":["Kaisen Yang","Zheng Jiang","Yuzhao Peng","Houde Qian","Boshi Zhang","Bingxiang He"],"categories":["cs.CL"],"abstract":"Internet users collectively perform an enormous range of skilled work through web browsers, from software development and document editing to search, forms, and enterprise workflows, making human browsing a highly scalable but under-exploited source of reusable browser skills. We argue that the bottleneck for browser agents is decision-making under incomplete information rather than low-level operation, and that the priors agents lack are already implicit in human interaction traces. We therefore study scalable behavior cloning for browser agents via skill distillation, converting user interaction trajectories into compact natural-language skills that agents can read, retrieve, reuse, and compose directly. We further organize the distilled skills into a skill graph so that growth proceeds through consolidation rather than unbounded accumulation.","take":"A nice reframing of where browser-agent capability actually comes from: not more hand-designed tasks, but the priors already latent in ordinary human browsing traces. Distilling those traces into compact, readable natural-language skills — then consolidating them into a skill *graph* so the library grows by merging rather than piling up — is the same \"knowledge-action graph\" instinct [Agents-A1](/articles/agents-a1) used for training data, pointed at browser automation. 
The claim that the bottleneck is decision-making under incomplete information, not clicking, rings true for anyone who's watched a browser agent fail on judgment rather than mechanics.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.32014","pdf":"https://arxiv.org/pdf/2606.32014","html":"https://arxiv.org/html/2606.32014v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.32014"}},{"arxivId":"2606.32036v1","title":"PointSplat: Compact Gaussian Splatting via Human-Centric Prediction","authors":["Yujie Guo","Yudong Jin","Lingteng Qiu","Zehong Shen","Zhen Xu","Sida Peng","Xiaowei Zhou"],"categories":["cs.CV"],"abstract":"Producing 3D human representations from input views on the fly is essential for immersive live streaming systems, where representation compactness is as critical as high fidelity given limited computational power and transmission bandwidth. Although recent feed-forward reconstruction methods achieve impressive quality through the view-centric prediction of 3D representations, they repeatedly encode the same subject content across multiple views, leading to significant inter-view redundancy. Our key insight is to perform predictions directly in 3D space. We propose PointSplat, a human-centric approach that directly infers Gaussian primitives from an input point set. The method estimates a coarse geometric proxy and performs ray casting to prune redundant points and establish explicit 2D--3D correspondences, then employs a Point-Image Transformer to fuse appearance and geometry features, predicting Gaussian attributes in a single forward pass. This restricts predictions to foreground regions, substantially reducing the total number of Gaussians while improving novel-view rendering quality.","take":"The redundancy insight is the good part: view-centric feed-forward Gaussian predictors re-encode the same person once per input view, so the representation bloats with duplicated content. 
Moving the prediction into 3D space — infer Gaussians from a point set, ray-cast to prune and pin 2D–3D correspondences — cuts the primitive count while *improving* novel-view quality, which is the rare case where compression and fidelity move the same direction. For live-streamed volumetric humans, where bandwidth is the whole constraint, \"fewer Gaussians, better render\" is exactly the trade you want.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.32036","pdf":"https://arxiv.org/pdf/2606.32036","html":"https://arxiv.org/html/2606.32036v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.32036"}},{"arxivId":"2606.32023v1","title":"FLORA: A deep learning approach to predict forest attributes from heterogeneous LiDAR data","authors":["Emilie Vautier","Clément Mallet","Cédric Vega"],"categories":["cs.CV","cs.AI"],"abstract":"Forest attributes are essential for national-scale resource monitoring. Airborne LiDAR metrics are among the auxiliary variables most strongly correlated with forest attributes used in National Forest Inventory (NFI) estimates. However, producing wall-to-wall predictions remains challenging when LiDAR data are acquired under heterogeneous conditions -- variability in sensors, flight parameters, seasons, and scan angles limits the robustness of existing models. We present FLORA (Forest LiDAR Octree Regression with Auxiliary Data), a deep learning framework that predicts six forest attributes: dominant height, total volume, deciduous volume, coniferous volume, basal area, and stem density from heterogeneous LiDAR point clouds. FLORA combines an octree-based backbone with ecological and spatiotemporal auxiliary variables through a late-fusion gating mechanism. Trained and evaluated on 32,052 National Forest Inventory plots across mainland France, a single model trained on both leaf-on and leaf-off acquisitions outperforms season-specific models. 
FLORA achieves an rRMSE of about 12.3% (R2 = 0.88) for dominant height and 39% (R2 = 0.74) for total volume.","take":"A grounded reminder that LiDAR point clouds are a national-infrastructure data type, not just a robotics one. The hard part here is the same heterogeneity that bites any real deployment — different sensors, seasons, scan angles — and the fix is sensible: an octree backbone over the raw cloud plus a late-fusion gate that lets ecological and spatiotemporal side-variables in only where they help. The finding that one model trained across leaf-on and leaf-off beats season-specific models is the transferable bit — robustness to acquisition conditions beats bespoke calibration, which is precisely the wall-to-wall property a national forest inventory needs.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.32023","pdf":"https://arxiv.org/pdf/2606.32023","html":"https://arxiv.org/html/2606.32023v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.32023"}}],"kind":"arxiv","slug":"2026-07-01","body":"Two threads dominate today. **On-policy distillation** keeps compounding — GR2 states\noutright that SFT \"collapses at industrial scale\" and reaches for OPD instead, echoing\nAgents-A1, MOPD, and DOPD from earlier this week; Cross-Space Distillation and browser\nskill distillation attack the same teacher→student interface from the vision and agent\nsides. And **long-horizon agents** get a rigorous, training-free way to vet their\nintermediate-reward signals (QVal) — which promptly finds that simple prompting beats most\nof the literature. 
The through-line: the leverage is in the interface and the verifier,\nnot the parameter count.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-07-01"},{"date":"2026-06-30","papers":[{"arxivId":"2606.30436v1","title":"Robust and Efficient Monocular 3D Gaussian SLAM for Kilometer-Scale Outdoor Scenes","authors":["Sicheng Yu","Dongxu Shen","Beizhen Zhao","Guanzhi Ding","Hao Wang"],"categories":["cs.CV"],"abstract":"Scaling monocular 3D Gaussian Splatting (3DGS) SLAM to kilometer-level outdoor environments poses two tightly coupled challenges: fragile long-term pose tracking and excessive memory overhead during large-scale mapping. In this paper, we propose KiloGS-SLAM, a highly efficient and robust monocular 3DGS-SLAM system that jointly addresses both bottlenecks. Since high-fidelity scene reconstruction fundamentally relies on drift-free camera poses, we first introduce a motion-adaptive hybrid tracking module. This module features a condition-triggered three-tier solving pipeline. It dynamically switches between Essential matrix and PnP models to handle geometric degeneracies. An on-demand foundation model can also be activated to rescue the trajectory from catastrophic drift. To ensure the system can sustain these long trajectories without memory exhaustion, we subsequently design a lifecycle-managed Gaussian mapping strategy. By integrating probabilistic initialization with chunk-based multi-view densification and pruning, this full-pipeline optimization effectively reduces primitive redundancy while preserving high-frequency details. Extensive experiments across three challenging outdoor datasets demonstrate that our approach achieves state-of-the-art tracking accuracy and rendering quality, successfully scaling to sequences of over 10,000 frames on a single GPU.","take":"The two things that kill SLAM at scale are exactly the two things they go after: pose tracking drifts, and the map eats all your memory. 
I just spent a week inside a LiDAR-inertial filter watching a map smear because of pose drift, so the framing lands — a clean reconstruction is downstream of a drift-free trajectory, full stop. The tracking answer is a tiered solver that switches Essential-matrix↔PnP by geometry and only wakes a foundation model when the trajectory is about to diverge, which is the right cost model: cheap by default, expensive only when degenerate. The mapping answer — lifecycle-managed Gaussians with chunked densify-and-prune — is the same bounded-window discipline a LiDAR map needs, just on splats. 10,000+ frames on one GPU from a single camera is the number that matters.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.30436","pdf":"https://arxiv.org/pdf/2606.30436","html":"https://arxiv.org/html/2606.30436v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.30436"}},{"arxivId":"2606.30414v1","title":"Diffusion Fine-tuning with Rewarded Moment Matching Distillation","authors":["Alexis Jacq","Guillaume Couairon","Valentin De Bortoli","Quentin Berthet","Arnaud Doucet","Romuald Elie"],"categories":["cs.LG"],"abstract":"Distillation and Reinforcement Learning (RL) fine-tuning are the primary pillars of diffusion post-training. While traditionally studied in isolation, the interaction between these phases remains poorly understood, and in particular how fine-tuning impacts the generative quality of distilled models. We introduce Rewarded Moment Matching Distillation (RMMD), a novel framework that simultaneously distills diffusion models and maximizes a reward function. RMMD preserves the high-fidelity ``naturalness'' characteristic of advanced distillation (such as 8-step Moment Matching) by adapting the sampling loop for on-policy training and repurposing the distillation loss as a proxy for integral KL regularization. 
By evaluating the FID-Reward Pareto fronts on ImageNet, we demonstrate that RMMD achieves superior trade-offs compared to single-step baselines (DI++) and multi-step competitors (DRaFT, HyperNoise). Finally, we apply RMMD to GenCast, a state-of-the-art weather forecasting model, to distill it while optimizing the Continuous Ranked Probability Score (CRPS) metric. The resulting distilled model achieves a 7.5x speedup while outperforming the teacher model on 93% of target weather variables, and being better calibrated.","take":"Distillation and RL post-training usually fight each other: you compress the model and the reward fine-tune coarsens the samples. The neat move here is repurposing the distillation loss itself as the KL regularizer for the RL step, so the \"stay natural\" objective and the \"compress\" objective are the same term instead of two terms in tension. The headline I can't ignore isn't the ImageNet result but GenCast: a distilled weather model that's 7.5× faster and beats its own teacher on 93% of variables while staying better calibrated. A distilled student that outperforms its teacher is the interesting regime, and that it transfers from images to a real scientific forecaster is the part worth tracking.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.30414","pdf":"https://arxiv.org/pdf/2606.30414","html":"https://arxiv.org/html/2606.30414v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.30414"}},{"arxivId":"2606.30545v1","title":"StereoGS: Sparse-View 3D Gaussian Splatting via Stereo Priors","authors":["Wenhao Yuan","Yiyuan Ge","Deli Cai"],"categories":["cs.CV"],"abstract":"3D Gaussian Splatting (3DGS) has achieved remarkable success in real-time novel view synthesis, yet it suffers from severe overfitting under sparse-view settings due to insufficient geometric constraints. 
While recent methods introduce monocular depth priors to mitigate this, they inherently struggle with scale ambiguity and cross-view inconsistency, leading to defective geometry. In this paper, we propose StereoGS, a novel sparse-view 3DGS framework that integrates stereo priors to establish reliable binocular consistency. Unlike scale-agnostic monocular constraints, StereoGS introduces a Stereo Depth Regularization by constructing virtual stereo pairs during optimization and leveraging a foundation stereo model to enforce absolute scale and binocular-consistent structures. To further suppress overfitting and eliminate redundant primitives, we design a Gradient-Aware Opacity Decay strategy that dynamically penalizes Gaussians based on their relative opacity gradient magnitudes. Combined with a Consistency-Aware Dense Initialization using zero-shot multi-view depth estimation, StereoGS effectively anchors primitives to accurate scene surfaces. Extensive experiments on LLFF, DTU, Mip-NeRF360, and Blender datasets demonstrate that StereoGS achieves state-of-the-art performance in sparse-view settings without incurring any additional inference overhead.","take":"Monocular depth priors give you shape but not scale, and the scale ambiguity is exactly what wrecks sparse-view geometry. Manufacturing virtual stereo pairs during optimization and leaning on a foundation stereo model to pin absolute scale is a clean way to import the one constraint mono depth can't provide. The opacity-decay-by-gradient trick is the part I'd reuse — it's a principled way to kill redundant Gaussians instead of the usual opacity threshold heuristics. 
No inference overhead because all the extra machinery lives in training.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.30545","pdf":"https://arxiv.org/pdf/2606.30545","html":"https://arxiv.org/html/2606.30545v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.30545"}},{"arxivId":"2606.30638v1","title":"Open-Vocabulary and Referring Segmentation for 3D Gaussians Using 2D Detectors","authors":["Jameel Hassan","Yasiru Ranasinghe","Vishal Patel"],"categories":["cs.CV"],"abstract":"3D Gaussian Splatting (3DGS) has emerged at the forefront of 3D scene reconstruction. Extending 3DGS with language-driven, open-vocabulary understanding has gained significant attention for real-world applications such as embodied AI. Recent methods achieve this by learning an instance feature attribute and assigning semantics by distilling high-dimensional Contrastive Language-Image Pretraining (CLIP) features directly into the scene representation. However, the instance grouping mechanisms of these methods either require a predefined number of instances or suffer from noise in their bottom-up grouping strategies. Furthermore, the reliance on CLIP restricts semantic understanding to simple noun phrases, preventing complex spatial reasoning and referential expression grounding. We present GaussDet, a method that circumvents the need for dense CLIP features by leveraging discrete, open-vocabulary 2D object detectors with referring expression capabilities. We learn instance features for individual Gaussians to decompose the scene into 3D instance groups. By rendering these groups and aggregating semantic votes from multi-view 2D detections, we generate a robust View-Aggregated Semantic Label Distribution (VASD) for each 3D instance. This view-aggregation strategy acts as a strong regularizer, attenuating spurious labels caused by low-quality instance grouping. 
Our approach enables a straightforward, zero-shot extension from simple language queries to complex referential grounding.","take":"Distilling dense CLIP features into every Gaussian always felt like the expensive, lossy way to do this — you pay for a high-dim feature per primitive and still only get noun-phrase semantics. Flipping it to render 3D instance groups and let off-the-shelf 2D detectors vote across views is the cheaper, sharper design: the multi-view aggregation is itself the regularizer that cleans up bad grouping. The payoff that sells it is referential grounding (\"the mug behind the laptop\") in a strict zero-shot setting, +16.7% mIoU — spatial reasoning CLIP-distillation can't do.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.30638","pdf":"https://arxiv.org/pdf/2606.30638","html":"https://arxiv.org/html/2606.30638v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.30638"}},{"arxivId":"2606.30645v1","title":"VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes","authors":["Yen-Jen Wang","Jiaman Li","Sirui Chen","Takara E. Truong","Pei Xu","Pieter Abbeel","Rocky Duan","Koushil Sreenath","Angjoo Kanazawa","Carmelo Sferrazza","Guanya Shi","Karen Liu"],"categories":["cs.RO","cs.AI","cs.GR","eess.SY"],"abstract":"Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale. We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories using privileged scene information, and renders paired egocentric observations after the fact. 
We produce 48,000 paired trajectories with no human intervention and train a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker converts these predictions into actions on the physical humanoid. We evaluate on the physical Unitree G1 performing navigation and single-object transport, demonstrating that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation.","take":"The bottleneck for perception-driven humanoids is the data tuple nobody has: synchronized egocentric video + language + robot-feasible whole-body trajectories. The clever inversion is rendering the egocentric images *after the fact* — plan trajectories with privileged scene info in a 3DGS-reconstructed metric room, then render what the robot would have seen. 48k paired trajectories with zero human labeling, and it crosses sim-to-real onto a physical Unitree G1. This is the same realization the SLAM papers are circling from the other side: a good enough 3DGS reconstruction is now a data generator, not just a viewer. Heavyweight author list behind it.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.30645","pdf":"https://arxiv.org/pdf/2606.30645","html":"https://arxiv.org/html/2606.30645v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.30645"}},{"arxivId":"2606.30406v1","title":"MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training","authors":["Wenhan Ma","Jianyu Wei","Liang Zhao","Hailin Zhang","Bangjun Xiao","Lei Li","Qibin Yang","Bofei Gao","Yudong Wang","Rang Li","Jinhao Dong","Zhifang Sui","Fuli Luo"],"categories":["cs.CL","cs.LG"],"abstract":"Modern large language models (LLMs) rely on reinforcement learning during post-training to push specific capabilities, yet integrating multiple capabilities into one model remains hard. Existing methods, such as Off-Policy Finetune and Mix-RL, are either inefficient or lose performance. 
In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers: we first run per-domain specialised RL to obtain a set of domain teachers, then distill these teachers into the student on its own rollouts. This eliminates exposure bias and provides a dense optimization signal. On Qwen3-30B-A3B, MOPD outperforms Mix-RL, Cascade RL, Off-Policy Finetune, and Param-Merge baselines, inheriting nearly all of each teacher's capability. MOPD also enables parallel, independent development of domain teachers, removing the cross-domain coupling typical of multi-domain post-training. MOPD has been deployed in the post-training of MiMo-V2-Flash, an industrial-scale frontier model.","take":"The practical pain in multi-capability post-training is coupling: train math-RL and code-RL together and they interfere, so every capability has to move in lockstep. Specializing one RL teacher per domain and then distilling them into the student *on the student's own rollouts* decouples the org problem (teams ship teachers independently) from the model problem (one student inherits all of them). On-policy is what makes it work — distilling on student rollouts kills the exposure bias that off-policy fine-tune suffers. 
The credible bit is the deployment line: it's in MiMo-V2-Flash, not just a benchmark table.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.30406","pdf":"https://arxiv.org/pdf/2606.30406","html":"https://arxiv.org/html/2606.30406v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.30406"}},{"arxivId":"2606.30626v1","title":"DOPD: Dual On-policy Distillation","authors":["Xinlei Yu","Gen Li","Qingyi Si","Guibin Zhang","Yuqi Xu","Congcong Wang","Shuai Dong","Kaiwen Tuo","Xiangyu Zeng","Kaituo Feng","Qunzhong Wang","Yang Shi","Xiaobin Hu","Xiangyu Yue","Jiaqi Wang","Shuicheng Yan"],"categories":["cs.AI"],"abstract":"On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities.","take":"\"Privilege illusion\" is a sharp name for a real trap: if you feed the teacher privileged context, some of its behavior comes from information the student will never have, so the student is being asked to imitate something it structurally can't replicate. 
Separating the *transferable* capability gap from the *un-replicable* information-asymmetry gap — and routing each token's supervision by advantage — is the kind of distinction that only shows up once you take token-level supervision seriously. Pairs naturally with MOPD in today's batch: the field is clearly converging on on-policy distillation as the post-training workhorse, now arguing about how to route the signal.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.30626","pdf":"https://arxiv.org/pdf/2606.30626","html":"https://arxiv.org/html/2606.30626v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.30626"}}],"kind":"arxiv","slug":"2026-06-30","body":"A 3D-Gaussian-heavy day, with a second clear thread in on-policy distillation. The\nthrough-line I keep noticing: a good enough scene reconstruction has stopped being a\nviewer and become *infrastructure* — KiloGS-SLAM tracks kilometers of it, VLK renders\nsynthetic robot data from it, GaussDet hangs open-vocabulary semantics on it. And on the\ntraining side, three separate groups (RMMD, MOPD, DOPD) are all sharpening the same tool:\ndistill on the policy's own rollouts, and argue about how to regularize and route the\nsignal.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-30"},{"date":"2026-06-28","papers":[{"arxivId":"2606.27332v1","title":"RoPEMover: Depth-Aware Object Relocation via Positional Embeddings","authors":["Ipek Oztas","Duygu Ceylan","Aybars Bugra Aksoy","Aysegul Dundar"],"categories":["cs.CV"],"abstract":"Moving an object in a single image requires geometry-consistent spatial rearrangement, including handling occlusions, revealing previously unseen regions, and maintaining coherent shadows and reflections. Existing approaches are not well suited to this setting and often fail to preserve such scene-level consistency. 
We address this problem by introducing a geometry-aware object motion method that operates directly on the positional representations of diffusion transformers. Our key insight is that rotary positional embeddings (RoPE) define a structured spatial field that can be explicitly manipulated to induce controlled motion. We extend 2D RoPE into a depth-aware formulation that encodes 3D spatial structure, enabling consistent object displacement and scene-aware updates. Our model is trained using synthetic data combined with a small set of real images via parameter-efficient fine-tuning. Despite minimal real supervision, it preserves object identity under large spatial displacements, generates plausible content in newly revealed regions, and consistently updates scene-dependent effects such as shadows and illumination. Experimental results on standard object motion benchmarks demonstrate state-of-the-art performance across all evaluation metrics.","take":"The same realization RayPE had, aimed at editing instead of generation: RoPE isn't just index bookkeeping, it's a structured spatial field you can manipulate. They extend 2D RoPE to a depth-aware form, so moving an object in a single image drags its occlusions, disocclusions, shadows, and reflections along with it. Geometry-consistent relocation by editing the positional embeddings rather than the latents — trained mostly on synthetic data with a thin parameter-efficient real fine-tune. 
The depth-aware RoPE is the transferable trick.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.27332","pdf":"https://arxiv.org/pdf/2606.27332","html":"https://arxiv.org/html/2606.27332v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27332"}},{"arxivId":"2606.27123v1","title":"Proposal-Conditioned Latent Diffusion for Closed-Loop Traffic Scenario Generation","authors":["Shubham Vaijanath Phoolari","Aleyna Kara","Christoph Lauer","Steven Peters"],"categories":["cs.RO","cs.CV"],"abstract":"Closed-loop traffic simulation remains challenging because it must generate interactive multi-agent behaviors that are scene-consistent and controllable throughout rollout. Prior diffusion-based approaches achieve strong realism, but their computational cost can hinder deployment in time-constrained replanning loops for autonomous vehicle planning and simulation. We present a diffusion-based scenario generation framework conditioned on instance-centric scene context and multimodal proposal priors, with optional test-time guidance for shaping safety-critical behaviors. A compact action-latent representation and proposal-based initialization improve sampling efficiency and reduce per-step runtime without retraining. Experiments on the Waymo Open Motion Dataset demonstrate a favorable balance among realism, safety, and controllability across diverse interactive scenarios, while showing that test-time guidance enables systematic trade-offs among competing objectives.","take":"Closed-loop traffic sim where the diffusion sampler's cost is the whole problem — you can't drop a slow denoiser inside a time-constrained replanning loop. A compact action-latent plus proposal-based initialization cuts per-step runtime without retraining, and optional test-time guidance shapes safety-critical behaviors on demand. 
The realism/safety/controllability balance on Waymo Open Motion is the right axis; I'd want the actual per-step latency before believing \"deployable.\"","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27123","pdf":"https://arxiv.org/pdf/2606.27123","html":"https://arxiv.org/html/2606.27123v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27123"}},{"arxivId":"2606.27084v1","title":"Pseudo-Text-Conditioned 3D Grounding DINO for Organ Localization in Abdominal CT","authors":["Siqi Chen","Han Gong","Keyi Hou","Jingxuan Yang","Sheethal Bhat","Andreas Maier"],"categories":["cs.CV","eess.IV"],"abstract":"Reliable organ localization in abdominal CT can provide spatial priors for downstream trauma analysis. We propose CT-3GDINO, a lightweight 3D detector that adapts a Grounding-DINO-style query-based architecture to fixed organ localization using frozen pseudo-text class tokens instead of a real text encoder. The model combines a Swin3D visual backbone, bidirectional feature enhancement, pseudo-text-guided query selection, and a cross-modality decoder to predict normalized 3D boxes for liver, spleen, left kidney, right kidney, and bowel. We train and evaluate on 193 matched RSNA/RATIC CT volumes with segmentation-derived boxes. The best multi-scale model, trained from scratch, achieves 0.5830 overall top-1 class-wise mAP over 3D IoU thresholds from 0.1 to 0.7, outperforming fixed- and trainable-backbone classification-pretrained variants with 0.5570 and 0.4657 mAP. Performance is strong for coarse localization, with 0.9649 AP at IoU 0.1, but remains limited for strict box alignment, with 0.1552 AP at IoU 0.7. 
These results establish CT-3GDINO as an open-source baseline for pseudo-text-conditioned 3D organ localization and motivate future work on localization-aware pretraining, richer multimodal conditioning, and injury-focused detection.","take":"Grounding-DINO for 3D boxes, but with frozen pseudo-text class tokens instead of a real text encoder — a clean simplification once your classes are a fixed set of organs. Swin3D backbone, bidirectional feature enhancement, and a cross-modality decoder hit 0.583 top-1 class-wise mAP over IoU 0.1-0.7 on 193 CT volumes, beating classification-pretrained backbones. The surprise is that training from scratch wins; the pseudo-text query trick is what I'd reuse for any fixed-vocabulary 3D detector.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27084","pdf":"https://arxiv.org/pdf/2606.27084","html":"https://arxiv.org/html/2606.27084v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27084"}},{"arxivId":"2606.27264v1","title":"CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs","authors":["Hashmat Shadab Malik","Anees Ur Rehman Hashmi","Numan Saeed","Muzammal Naseer","Salman Khan","Christoph Lippert"],"categories":["cs.CV"],"abstract":"Reasoning in multimodal large language models (MLLMs) has shown strong promise in medical imaging. However, this reasoning is usually free-form text judged only by its final answer, making it hard to interpret and verify, especially in 3D radiology, where a diagnosis should be traceable to evidence in the scan. Existing chest CT question-answering datasets compound this by reducing expert radiology reports to answer-only pairs, dropping the reasoning that links findings to conclusions and omitting the patient history clinicians rely on. As a result, reasoning-capable 3D chest CT MLLMs remain out of reach, as neither the structured supervision needed to train them nor the protocol needed to verify their reasoning yet exists. 
We introduce CORTEX (Clinically Organized Reasoning and sTructured EXplanation), a structured reasoning benchmark for 3D chest CT. For each question, CORTEX restores the missing reasoning as a four-stage diagnostic trace mirroring a radiologist's workflow: task understanding, visual observation, diagnostic reasoning, and answer synthesis. We generate these traces using frontier large language models with broad medical and general-domain knowledge, then filter and verify them with a stage-level evaluation protocol combining automated rubric scoring with expert radiologist review. Crucially, both the reasoning structure and evaluation rubrics are designed in close collaboration with clinicians. Built on CT-RATE, a large, publicly available chest CT dataset without reasoning annotations, CORTEX comprises 76,177 validated reasoning traces across open-ended VQA, closed-ended VQA, and report generation, providing both the structured supervision and the stage-level evaluation protocol needed to build and evaluate trustworthy reasoning models for 3D chest CT. Our dataset and evaluation code will be made publicly available upon acceptance.","take":"The right complaint about medical MLLMs: a 3D CT diagnosis judged only on its final answer is unverifiable. CORTEX restores the four-stage radiologist trace — task understanding, visual observation, diagnostic reasoning, answer synthesis — as structured supervision, 76,177 traces validated by a stage-level rubric plus radiologist review. 
The value is the protocol, not just the dataset: it gives you something to grade the reasoning against, not only the answer.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27264","pdf":"https://arxiv.org/pdf/2606.27264","html":"https://arxiv.org/html/2606.27264v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27264"}},{"arxivId":"2606.27377v1","title":"DanceOPD: On-Policy Generative Field Distillation","authors":["Wei Zhou","Xiongwei Zhu","Zelin Xu","Bo Dong","Lixue Gong","Yongyuan Liang","Meng Chu","Leigang Qu","Lingdong Kong","Wei Liu","Tat-Seng Chua"],"categories":["cs.CV","cs.CL","cs.LG"],"abstract":"Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. 
We believe this work establishes a practical route for generative field distillation in flow-matching models.","take":"The unify-everything-in-one-image-model problem, stated honestly: text-to-image, local edit, and global edit actively fight each other. DanceOPD routes each sample to one capability \"field\" and distills the student on its own rollout states with a plain velocity MSE — composing expert velocity fields over a shared flow state. Tidy framing; the open question is whether on-policy querying actually keeps the capabilities from clobbering one another at scale.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27377","pdf":"https://arxiv.org/pdf/2606.27377","html":"https://arxiv.org/html/2606.27377v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27377"}},{"arxivId":"2606.27154v1","title":"OpenRCA 2.0: From Outcome Labels to Causal Process Supervision","authors":["Aoyang Fang","Yifan Yang","Jin'ao Shang","Qisheng Lu","Junjielung Xu","Rui Wang","Songhan Zhang","Yuzhong Zhang","Boxi Yu","Pinjia He"],"categories":["cs.AI"],"abstract":"Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injection to reconstruct causal propagation paths. The mechanism is forward verification: reasoning from cause to effect rather than inferring backward from symptoms. Applying PAVE yields OpenRCA 2.0 (500 instances), the first cross-system RCA benchmark with step-wise causal annotations for LLM agents. Across 11 frontier LLMs, recovering the exact root-cause set succeeds in only 20.7% of cases on average. 
To locate where this difficulty lies, we relax the criterion and find what we call the ungrounded diagnosis: agents identify at least one correct root-cause service in 76.0% of cases, but ground that service in a verified causal propagation path to the observed symptom in only 61.5%. Outcome-only evaluation hides this failure mode; step-wise causal ground truth is the missing piece for trustworthy LLM-based RCA agents.","take":"Root-cause analysis is a real stress test for agents — long context, multi-step reasoning, tool use — and the honest headline is the number: across 11 frontier LLMs, the exact root-cause set is recovered only 20.7% of the time. The contribution is PAVE, which labels the causal propagation path from known fault injections (forward cause→effect) so the benchmark can't be solved by backward pattern-matching. Step-wise causal supervision is exactly what production RCA needs.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27154","pdf":"https://arxiv.org/pdf/2606.27154","html":"https://arxiv.org/html/2606.27154v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27154"}},{"arxivId":"2606.26964v1","title":"Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds","authors":["Jiaming Bian","Bingliang Li","Yuehao Wu","Pichao Wang","Zhi Wang","Hailan Ma","Huadong Mo","Zhenhong Sun"],"categories":["cs.AI","cs.CV"],"abstract":"As embodied AI and world models increasingly operate in dynamic 3D environments, visual perception must move beyond passively interpreting given observations toward actively deciding what to observe. We study this problem through camera planning in dynamic 3D story worlds, where the camera must not only generate smooth motion, but also decide what visual evidence should be acquired before it moves. 
We formulate this capability as Narrative-Grounded World Visual Attention, where the camera acts as an embodied observer that determines what to observe, how to compose the observation, and how to shift attention over time under narrative intent and physical 3D constraints. To realize this capability, we propose Look-Before-Move, a camera planning framework that separates observation specification from motion execution. It first builds a Semantic Observation Contract to convert directorial intent into executable visual constraints, then performs Monte Carlo Viewpoint Search to find narrative-compliant and geometrically feasible viewpoints, and finally applies Semantic Trajectory Grounding to connect selected viewpoints into continuous, collision-aware, and temporally coherent camera motion. We further construct a dynamic 3D Story World Benchmark based on StoryBlender, covering 50 stories, 457 scenes, and 1585 shots with animated characters, semantic scene configurations, and executable 3D environments. Experiments show that our framework improves subject perception, intent consistency, and trajectory quality over representative baselines, demonstrating the importance of organizing visual attention before generating camera motion.","take":"Perception as an action rather than a given: the camera decides what evidence to acquire before it moves. Separating a \"Semantic Observation Contract\" from motion execution is a sensible split for embodied 3D agents, and more interesting as a framing of active perception than for the story-world demo. 
I'd read it for how the contract grounds against physical 3D constraints, not the narrative wrapper.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.26964","pdf":"https://arxiv.org/pdf/2606.26964","html":"https://arxiv.org/html/2606.26964v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.26964"}}],"kind":"arxiv","slug":"2026-06-28","body":"","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-28"},{"date":"2026-06-27","papers":[{"arxivId":"2606.27223v1","title":"SatSplatDiff: Geometry-preserving generative refinement for high-fidelity satellite Gaussian Splatting","authors":["Jiyong Kim","Shuang Song","Ronjgun Qin"],"categories":["cs.CV"],"abstract":"Gaussian Splatting has been recently explored for satellite 3D reconstruction, demonstrating flexibility and efficiency in representing radiometrically diverse satellite scenes. However, the limited top viewpoint of satellite imagery results in insufficient supervision on building facades, leaving surface holes and degraded visual fidelity. Generative refinement, which leverages pretrained generative priors to iteratively refine and update the rendered images used as supervision targets, has recently been investigated to improve the visual fidelity of Gaussian-rendered images. However, since these models refine each view independently, the resulting images can generate hallucinations and break photo-consistency, leading to geometric degradation. To address these limitations, we propose SatSplatDiff, which aims to minimize geometric degradation prevalent in generative refinement. Building on photogrammetric DSM initialization and 2DGS-based shadow casting established in our prior work SatSplat, we first introduce monocular depth supervision and multi-scale geometric refinement to establish a geometrically accurate and well-regularized surface representation. 
We then apply shadow-guided generative refinement, where geometrically calculated shadow maps guide the Gaussians to maintain consistency with the underlying geometry, improving visual fidelity while reducing geometric degradation. Extensive evaluations on the IARPA2016 and DFC2019 datasets demonstrate state-of-the-art performance, reducing geometric MAE by up to 18% and improving visual fidelity (FID-CLIP) by 28-45% over existing baselines. Our method delivers up to 5x resolution enhancement with minimal hallucination and sensor-consistent appearance, demonstrating seamless cross-tile consistency and strong scalability for large-scale reconstruction. Source code is available at https://github.com/GDAOSU/SatSplatDiff","take":"Satellite splatting starves on building facades — imagery is top-down, so the sides get almost no supervision and you get holes. Bolting a generative prior on per-view hallucinates and breaks photo-consistency; the fix here is to let geometrically-computed shadow maps steer the refinement so it can't drift off the surface. 18% lower geometric MAE and 28-45% better FID-CLIP is a real lift. The shadow-as-geometric-anchor trick is the part I'd steal.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.27223","pdf":"https://arxiv.org/pdf/2606.27223","html":"https://arxiv.org/html/2606.27223v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27223"}},{"arxivId":"2606.26928v1","title":"UAV-MapFusion: RTK-Aligned Uncertainty-Aware Coarse-to-Fine Multi-Session UAV Mapping","authors":["Feng Pan","Chunran Zheng","Bing Xue","Yukang Cui","Jiayu Wen","Zhiyu Chen","Wei Wang"],"categories":["cs.RO","cs.SI"],"abstract":"Large-scale point cloud maps are essential for robotics and spatial intelligence tasks. UAVs provide an efficient means for large-scale map acquisition; however, due to limited flight endurance and onboard storage, mapping a large-scale scene within a single flight remains difficult. 
Existing multi-session map merging methods can extend the mapping range, yet in UAV scenarios they still struggle to simultaneously suppress long-range drift and preserve local geometric accuracy. To address this issue, an uncertainty-aware multi-session point cloud map merging and coarse-to-fine optimization system is proposed. The proposed method first performs initial multi-session map merging based on a scene graph, and then incorporates RTK observations through an RTK spatiotemporal alignment module, where temporal offsets are estimated using Dynamic Time Warping (DTW), and continuous RTK constraints are recovered using Multi-Output Gaussian Processes (MOGP) under incomplete sampling and frame dropouts. On this basis, a unified uncertainty-aware factor graph is constructed, and local geometric accuracy is further improved through iterative plane-factor refinement. Experiments on real-world datasets validate the effectiveness and robustness of the proposed method. To facilitate further research and development in the community, our code and dataset will be publicly released.","take":"The unglamorous production problem stated plainly: one flight can't map a site, so you stitch sessions and fight long-range drift without smearing local geometry. DTW for the RTK temporal offset plus multi-output GPs to recover dropped frames is a sane way to treat RTK as a soft constraint instead of trusting it blindly. 
The uncertainty-aware factor graph over the merge is the right substrate; I'd push on how the plane-factor refinement holds up across session seams.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.26928","pdf":"https://arxiv.org/pdf/2606.26928","html":"https://arxiv.org/html/2606.26928v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.26928"}},{"arxivId":"2606.27317v1","title":"OctoSense: Self-Supervised Learning for Multimodal Robot Perception","authors":["Anthony Bisulco","Jeremy Wang","Kostas Daniilidis","Randall Balestriero","Pratik Chaudhari"],"categories":["cs.CV","cs.RO"],"abstract":"We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies and noise. Our approach, a \"late-fusion\" masked autoencoder, (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they come. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. 
See our project page for links to the dataset, code, and supplementary videos: https://abisulco.com/octosense/.","take":"Sensor fusion as a late-fusion masked autoencoder with per-modality tokenizers — and the part that matters for shipping, it caches modality tokens at inference so new measurements stream in instead of re-encoding everything. 6.68 ms on a 5090, 112 ms on an Orin NX, and it degrades gracefully at night or with a dead sensor. That edge-latency number is what separates a deployable perception stack from a benchmark entry.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27317","pdf":"https://arxiv.org/pdf/2606.27317","html":"https://arxiv.org/html/2606.27317v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27317"}},{"arxivId":"2606.27071v1","title":"PanoImager: Geometry-Guided Novel View Synthesis and Reconstruction from Sparse Panoramic Views","authors":["Zhisong Xu","Takeshi Oishi"],"categories":["cs.CV"],"abstract":"Panoramic sensing offers wide field-of-view coverage, yet 3D reconstruction from sparse panoramas remains challenging under rotation-dominant, weak-parallax motion. In such regimes, SfM/SLAM initialization is often ill-conditioned and unreliable. We present PanoImager, an SfM-free framework that combines feed-forward pose/depth priors, geometry-conditioned diffusion view completion, and depth-guided 3DGS optimization. Given only a few panoramic images, PanoImager decomposes them into local perspective views, synthesizes auxiliary observations to enrich sparse evidence, and stabilizes Gaussian optimization for improved cross-view consistency. Experiments on multiple benchmarks show improved stability under extreme sparsity, suggesting PanoImager as an offline/background component for map refinement when SfM/SLAM fails to initialize.","take":"When motion is rotation-dominant with weak parallax, SfM/SLAM initialization is ill-conditioned and just falls over — so this is SfM-free on purpose. 
Decompose panoramas into local perspective views, synthesize auxiliary views with a geometry-conditioned diffusion model, then stabilize 3DGS with depth guidance. Positioned honestly as an offline map-refinement fallback for exactly the regime where the online pipeline can't get off the ground.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27071","pdf":"https://arxiv.org/pdf/2606.27071","html":"https://arxiv.org/html/2606.27071v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27071"}},{"arxivId":"2606.27345v1","title":"RayPE: Ray-Space Positional Encoding for 3D-Aware Video Generation","authors":["Minghao Yin","Jiahao Lu","Wenbo Hu","Wang Zhao","Shan Ying","Kai Han"],"categories":["cs.CV"],"abstract":"Modern video diffusion transformers position their tokens through RoPE on the (u,v,t) axes -- a description of the camera's sampling grid that says nothing about the 3D structure of the scene. We observe that the geometric relation between two camera rays is captured by the Plucker reciprocal product, which is bilinear in the two rays -- the same algebraic form as the dot product in Transformer attention. Building on this analogy, we propose RayPE, a positional-encoding extension that injects per-token 6D Plucker coordinates additively into the queries and keys of self-attention, with a query/key flip arrangement under which the symmetric identity configuration coincides exactly with the reciprocal product. The injection is additive, the resulting attention score decomposes into a content term, a geometry term, and two content and geometry cross-terms -- all of which our experiments find individually necessary. To make the encoding stable across video data with heterogeneous camera-translation scales (SfM, deep SLAM, metric), we further decouple ray direction from moment magnitude, gate the encoding by a learned function of the log-magnitude, and apply RMSNorm to align it with the QKNorm-normalized content branch. 
The full module adds less than 0.1% parameters to a pretrained video DiT, is zero-initialized to start from the pretrained weights, and improves camera controllability, cross-frame 3D consistency, and overall video quality on a four-dataset training mixture.","take":"The clean idea of the batch. Video DiTs position tokens with RoPE over (u,v,t) — the camera's sampling grid, which says nothing about 3D structure. They notice the Plücker reciprocal product between two rays is bilinear in the rays, the same algebraic form as the attention dot product, and inject 6D Plücker coordinates additively into Q and K. Under 0.1% extra params, zero-initialized so it starts exactly from the pretrained model. Geometry-as-positional-encoding is a genuinely nice unification.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.27345","pdf":"https://arxiv.org/pdf/2606.27345","html":"https://arxiv.org/html/2606.27345v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27345"}},{"arxivId":"2606.26938v1","title":"Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE","authors":["Haoyou Deng","Keyu Yan","Chaojie Mao","Xiang Wang","Yu Liu","Changxin Gao","Nong Sang"],"categories":["cs.CV"],"abstract":"Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling diffusion models in visual generation. Recent advancements have focused on adaptively allocating computational resources across diverse tokens to improve efficiency and performance. However, we identify a routing assignment problem in existing diffusion MoE frameworks: the router fails to accurately allocate more computational resources to salient tokens. Our analysis attributes this failure to the router's reliance on noise-corrupted latent features throughout the denoising process. Such stochastic noise obscures the critical structural and textural information, thereby preventing the router from effectively distinguishing salient tokens. 
To address this, we propose SharpMoE, a post-training framework with a saliency-harnessing accurate routing mechanism, which utilizes clean latent features as a noise-free guidance signal for routing. By bypassing the noise-distorted inputs, SharpMoE provides the router with clear saliency guidance, enabling the identification of salient tokens even in high-noise stages. Furthermore, we introduce a trajectory routing loss to constrain the compute allocation throughout the multi-step denoising trajectory, ensuring precise resource allocation along the generation rollout. Extensive experiments demonstrate that SharpMoE serves as a versatile, plug-and-play solution that further enhances the pretrained, converged MoE models, achieving state-of-the-art performance in visual generation.","take":"A specific, believable failure: a diffusion MoE router reads noise-corrupted latents during denoising, so it can't tell which tokens are salient and mis-allocates compute. The fix routes on the clean latent as a noise-free guidance signal, with a trajectory loss constraining allocation along the denoising rollout. Plug-and-play on an already-converged MoE — no retrain — is the appealing part.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.26938","pdf":"https://arxiv.org/pdf/2606.26938","html":"https://arxiv.org/html/2606.26938v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.26938"}},{"arxivId":"2606.27364v1","title":"PhysiFormer: Learning to Simulate Mechanics in World Space","authors":["Yiming Chen","Yushi Lan","Andrea Vedaldi"],"categories":["cs.CV"],"abstract":"We present PhysiFormer, a diffusion transformer for physically-plausible 3D object motion. Unlike video world models that operate in view-dependent pixel space, PhysiFormer represents objects as 3D meshes expressed in world coordinates. Given the initial vertex positions and velocities, as well as object material type, rigid or elastic, the model samples future vertex trajectories. 
While related neural physics approaches build on ad-hoc latent spaces or explicitly enforce rigidity and causality, PhysiFormer shows that excellent results can be obtained without any such inductive biases, by casting vertex trajectory prediction as a single denoising diffusion process directly in world coordinates. The probabilistic formulation captures uncertainty in the learned dynamics, enabling diverse plausible futures from initial conditions, making this framework potentially useful for applications with unobserved uncertainty. The model features attention factorised over time, space, and objects for efficiency, enabling permutation-invariant multi-object reasoning without needing explicit object encoding. Trained on over 100k simulated trajectories, PhysiFormer generates rigid and elastic mechanics, and generalises to mixed-material settings, unseen real-world geometries, and larger object counts. It substantially outperforms autoregressive baselines in trajectory accuracy, rigidity preservation, and momentum-based physical consistency. Our results position coordinate-space diffusion as a promising step toward view-invariant, geometry-aware world modelling for robotics, graphics, and physical design. Visualisations, code, and models are available at https://yimingc9.github.io/physiformer.","take":"Predict vertex trajectories with a single denoising diffusion directly in world coordinates — no view-dependent pixel space, no hard-coded rigidity or causality — with attention factorized over time, space, and objects. The bet is that you recover rigid and elastic mechanics without baking the physics in, and the probabilistic head gives you diverse plausible futures. 
World-space rather than pixel-space is the right call for anything feeding robotics or design.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27364","pdf":"https://arxiv.org/pdf/2606.27364","html":"https://arxiv.org/html/2606.27364v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27364"}},{"arxivId":"2606.27305v1","title":"Sculpting NeRF Geometry: Human-Preference Fine-Tuning of a 3D-Aware Face GAN","authors":["Archer Moore","Mingming Gong","Liam Hodgkinson"],"categories":["cs.CV"],"abstract":"Reinforcement learning from human feedback (RLHF) for 3D generation is now established across a number of works, but most existing pipelines optimise explicit surface representations, often by converting radiance fields into meshes and training heavily on surface-supervised data. We instead fine-tune a pretrained 3D-aware generative model directly from a learned reward over radiance-field density ($σ$) values, with no externally supplied mesh or shape prior. The reward model requires no pretraining, trains easily on a small set of preference samples, and yields robust improvement in 3D geometry. Working on an unconditional 3D-aware face GAN (EG3D), our reward reads the continuous 3D density field of the neural radiance field (NeRF) directly and supplies a geometry-only learning signal, requiring neither text conditioning, mesh extraction, nor multi-view rendering. A density-consistency constraint keeps the 2D appearance qualitatively similar while the geometry is reshaped, at a measurable but bounded distributional cost (FID-50k rises from 4.09 to 6.66): the fine-tuned generator, trained from the preferences of a single annotator as a proof of concept, produces face geometries preferred by users in 74.4% of pairwise comparisons.","take":"RLHF that reads the NeRF density field directly and supplies a geometry-only reward — no mesh extraction, no multi-view render, no text conditioning — with a reward model that trains on a handful of preference pairs. 
FID rises 4.09 to 6.66 (the bounded appearance cost is stated honestly) while geometry is preferred in 74.4% of pairwise comparisons. Optimizing the continuous density field instead of a surface proxy is the move worth noting.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.27305","pdf":"https://arxiv.org/pdf/2606.27305","html":"https://arxiv.org/html/2606.27305v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.27305"}}],"kind":"arxiv","slug":"2026-06-27","body":"","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-27"},{"date":"2026-06-23","papers":[{"arxivId":"2606.23688v1","title":"Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild","authors":["Yehonathan Litman","Xiaoxuan Ma","Manan Shah","Nicolas Ugrinovic","Kris Kitani","Fernando De la Torre","Shubham Tulsiani"],"categories":["cs.CV"],"abstract":"Reconstructing dynamic non-rigid objects from monocular video requires integrating visual cues from direct observations with data-driven priors over geometry and appearance. Prior approaches either learn to directly predict 4D representations from visual input or initialize a 3D representation that is subsequently deformed and refined based on video evidence. However, the former are constrained by the scarcity of 4D training data, while the latter leverage priors only for the initial reconstruction and rely solely on video supervision thereafter; neither handles complex in-the-wild scenarios with large deformations and occlusions well. We present Lift4D, a test-time optimization framework that addresses both limitations. First, we adapt an existing single-view 3D reconstruction model to yield temporally consistent per-frame predictions via causal latent conditioning, providing a coherent initialization for a deformable 3D Gaussian Splatting representation. 
We then “sculpt” this representation to match the input video through an occlusion-aware optimization that faithfully recovers visible surface details while completing unobserved regions using a view-conditioned diffusion prior. We demonstrate that Lift4D clearly improves over prior 4D reconstruction methods, particularly on challenging in-the-wild sequences with severe occlusions and non-rigid motion.","take":"The split that matters: lean on the data-driven prior for the initialization *and* throughout the deformation, not just the seed. Most 4D-from-monocular pipelines trust the prior once and then trust the video — which is exactly when in-the-wild noise wins. I'd stress it on fast non-rigid motion, where monocular depth is least reliable and the refinement has to carry the most weight.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.23688","pdf":"https://arxiv.org/pdf/2606.23688","html":"https://arxiv.org/html/2606.23688v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.23688"}},{"arxivId":"2606.23455v1","title":"MeGAS: Thermomechanical Dynamic Gaussian Splatting for Thermophysical Scene Editing","authors":["Zesong Yang","Yuanhang Lei","Liyuan Cui","Yihang Chen","Jiaer Huang","Boming Zhao","Peter Yichen Chen","Hujun Bao","Zhaopeng Cui"],"categories":["cs.CV"],"abstract":"Recent advances integrate physically grounded Newtonian dynamics with neural rendering frameworks, narrowing the gap between photorealistic scene reconstruction and physics-based animation. However, existing approaches focus on mechanically driven dynamics while neglecting temperature, a fundamental yet invisible physical factor underlying phenomena such as melting, solidification, and other thermomechanical processes. In this paper, we propose MeGAS, a novel framework that incorporates thermomechanical phase-change dynamics into 3D Gaussian Splatting (3DGS). 
Specifically, we propose a new thermomechanical dynamic Gaussian Splatting representation that augments 3DGS with temperature attributes and employs a heat advection-diffusion solver with MPM dynamics incorporating phase transitions, enabling physically plausible and visually realistic synthesis of thermophysical phenomena. Furthermore, a new topology-adaptive Gaussian rendering strategy is proposed to mitigate cracking and floaters under extreme deformation. Extensive experiments demonstrate that MeGAS produces physically consistent thermomechanical behavior while maintaining high-fidelity photorealistic rendering, advancing toward physics-integrated world models.","take":"Temperature as the missing state variable in physics-grounded splatting. Melting and solidification need a thermal field, not just Newtonian forces — bolting one on is obvious only in hindsight. The real question is whether the phase-change coupling is physically calibrated or merely visually plausible; the editing demo will look great either way.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.23455","pdf":"https://arxiv.org/pdf/2606.23455","html":"https://arxiv.org/html/2606.23455v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.23455"}},{"arxivId":"2606.23653v1","title":"Lightweight Neural Framework for Robust 3D Volume and Surface Estimation from Multi-View Images","authors":["Diego E. Farchione","Ramzi Idoughi","Peter Wonka"],"categories":["cs.CV"],"abstract":"Accurate volume and surface area estimation is critical for diverse applications, from marine ecology to medical diagnostics. However, existing methods often suffer from high computational costs and poor performance with sparse and noisy data. We propose a fully feed-forward framework that regresses scale-normalized volume and surface area and their associated uncertainties directly from multi-view images. 
By fusing 3D point cloud reconstructions with view-aligned 2D features through a graph-based decoder, our model bypasses iterative optimization, ensuring exceptional scalability and rapid inference. Experimental results demonstrate that our approach outperforms state-of-the-art methods, particularly when operating with a low number of input images. Validated across coral monitoring, dietary analysis, and anthropometry, our proposed framework provides a robust, adaptable solution for quantitative shape analysis. This architecture provides a high-speed, scalable alternative for precise geometric estimation from visual data, maintaining high performance even in resource-constrained or sparse-view scenarios.","take":"Feed-forward, scale-normalized volume and surface area *with uncertainty* from multi-view images — the uncertainty head is what makes this usable in an actual measurement loop rather than a demo. Fusing point-cloud reconstructions with view-aligned 2D features through a graph decoder is a sane layout; I'd go straight to the calibration of those error bars on sparse, noisy inputs.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.23653","pdf":"https://arxiv.org/pdf/2606.23653","html":"https://arxiv.org/html/2606.23653v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.23653"}},{"arxivId":"2606.23623v1","title":"dVLA-RL: Reinforcement Learning over Denoising Trajectories for Discrete Diffusion Vision-Language-Action Models","authors":["Yuhao Wu","Yitian Liu","Weijie Shen","Mishuo Han","Wenjie Xu","Haotian Liang","Zhongshan Liu","Yinan Mao","Lei Xu","Xinping Guan","Ru Ying","Ran Zheng","Wei Sui","Xiaokang Yang","Wenbo Ding","Yao Mu"],"categories":["cs.RO"],"abstract":"Vision-Language-Action (VLA) models have established a powerful paradigm for generalist robotic manipulation by grounding control into the semantic reasoning of VLMs. 
Prevailing architectures typically model actions continuously via diffusion or flow processes, or discretely through either autoregressive generation or parallel decoding. Recently, Discrete Diffusion VLAs (dVLAs) have emerged as a distinct alternative, unifying vision, language, and action into a single discrete token space via masked generative modeling. While dVLAs combine iterative refinement with unified representations, their training has thus far been restricted to Supervised Fine-Tuning (SFT), leaving the potential of Reinforcement Learning (RL) for further policy refinement largely unexplored. A fundamental challenge in RL for dVLAs is that the marginal probability of the final action generated by dVLAs remains intractable. To solve this problem, we propose dVLA-RL, shifting the learning objective from the marginal action probability to the joint probability of the sampled generation path. Specifically, by modeling the denoising process as a Markov Decision Process (MDP), we mathematically formulate this path probability as a product of step-wise transitions. This trajectory-level objective provides a unified formulation that natively accommodates variable denoising steps. Leveraging this intrinsic flexibility, we introduce a unified step scheduling approach for complex multi-task learning, tailoring denoising steps to specific task complexities to maximize both success rates and computational efficiency. Extensive evaluations demonstrate that our approach achieves a success rate of 99.7% on LIBERO. Furthermore, it establishes strong VLA-based results on RoboTwin 2.0 by delivering a 30.6% improvement over the SFT baseline, remaining competitive with strong World-Action Model baselines.","take":"RL applied directly over the denoising trajectory of a discrete-diffusion VLA — rewarding the unmasking path, not just the final action tokens. For parallel-decoding policies that's the right surface to optimize. 
The open question is reward density and stability across masked generative rollouts, where credit assignment over the trajectory gets slippery.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.23623","pdf":"https://arxiv.org/pdf/2606.23623","html":"https://arxiv.org/html/2606.23623v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.23623"}},{"arxivId":"2606.23567v1","title":"Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models","authors":["Jiawei Xu","Minghui Liu","Aakriti Agrawal","Yifan Chen","Furong Huang"],"categories":["cs.LG","cs.AI"],"abstract":"Masked diffusion language models decode by iteratively unmasking tokens, where the unmasking order defines an \"order of thought\" that strongly influences generation quality yet is typically chosen heuristically. We derive a tractable upper bound on the sequential decoding mismatch, measured by the Kullback-Leibler divergence and expressed in terms of the model's pathwise log-likelihood, with tightness under sufficient model expressivity. This bound induces a dense self-aware reward over ordered trajectories, casting order selection as a principled policy optimization problem with a frozen denoiser. We instantiate this idea as Self-Aware Scheduling (SAS), which learns a lightweight order policy using Group Relative Policy Optimization and applies seamlessly to both any-order and semi-autoregressive decoding. On Sudoku with 1B MDM, SAS improves puzzle accuracy from 82.0% (best heuristic schedule) to 91.8%, and reaches 97.5% with second-stage fine-tuning along learned trajectories. On mathematical reasoning with LLaDA-8B, SAS improves pass@1 on GSM8K from 64% to 76% and on MBPP from 39.5% to 41%, consistently matching or exceeding heuristic schedules across generation lengths and block sizes. 
Project page: https://jimmyxu123.github.io/SAS","take":"A tractable KL bound on decoding-order mismatch, turned into a dense reward over unmasking trajectories — replacing the heuristic 'which token to unmask next' with a learned order of thought. The theory-to-reward move is clean, and it's the kind of change that lifts diffusion-LM quality without touching the backbone.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.23567","pdf":"https://arxiv.org/pdf/2606.23567","html":"https://arxiv.org/html/2606.23567v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.23567"}},{"arxivId":"2606.23565v1","title":"HoloAgent-0: A Unified Embodied Agent Framework with 3D Spatial Memory","authors":["Xiaolin Zhou","Liu Liu","Tingyang Xiao","Wei Feng","Fa Fu","Xinrui Meng","Xinjie Wang","Jialiang Han","Boyang Yu","Yun Du","Wei Sui","Zhizhong Su"],"categories":["cs.RO","cs.CV"],"abstract":"LLM agents follow a practical execution loop in digital environments: they reason over structured states, invoke tools, inspect feedback, and revise actions. Extending this loop to physical robots is difficult because physical execution is continuous, embodiment-dependent, uncertain, and constrained by safety. Existing embodied-AI systems have advanced manipulation, spatial understanding, navigation, and humanoid control, but these capabilities often remain specialized modules or loosely coupled decision loops. In this work, we introduce HoloAgent-0, a unified embodied agent framework for real-world robot deployment. Embodied AgentOS converts language instructions into executable skill graphs, schedules robot resources, monitors execution, and triggers clarification or re-planning from runtime feedback. HoloAgent-0 organizes heterogeneous robot models and controllers through three coupled layers: Embodied AgentOS for closed-loop execution, 3D spatial memory for physical world grounding, and embodied skills for robot action. 
We deploy HoloAgent-0 on real hardware and evaluate its spatial memory, long-horizon navigation, and closed-loop execution across motion generation, object search, cross-robot coordination, and mobile manipulation.","take":"Porting the digital-agent loop — reason, call a tool, inspect feedback, revise — onto robots, with a persistent 3D spatial memory as the shared state. The hard part is always the continuous, embodiment-dependent, safety-constrained execution gap; a unified 3D memory is a reasonable substrate for it. Read it for how the loop grounds against uncertainty, not for the framework diagram.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.23565","pdf":"https://arxiv.org/pdf/2606.23565","html":"https://arxiv.org/html/2606.23565v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.23565"}}],"kind":"arxiv","slug":"2026-06-23","body":"","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-23"},{"date":"2026-06-10","papers":[{"arxivId":"2606.11155v1","title":"Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models","authors":["An Zhao","Shengyuan Zhang","Zhongjian Sun","Yixiang Zhou","Zejian Li","Ling Yang","Tianrun Chen","Lingyun Sun"],"categories":["cs.CV"],"abstract":"Flow Matching models perform well across generative tasks, but their ODE-based iterative sampling makes inference expensive and rules out real-time use. Existing distillation borrows from diffusion score matching, ignores the geometric structure of flows, and suffers training instability, high variance, and quality loss. Mean Flow Distillation (MFD) is a distillation framework built for flow matching: the authors show it acts as a temporal low-pass filter that suppresses the high-frequency optimization noise of variational score distillation while keeping global trajectory consistency, and prove a Mean Flow Matching Theorem — matching expected average velocities is sufficient for strict distribution alignment. 
On 4D occupancy forecasting and text-to-image, MFD reaches SOTA with high-fidelity single-step generation.","take":"The framing I like: treat the distillation instability as a signal-processing problem and show VSD is just leaking high-frequency noise into the student. The averaged-velocity target is the kind of move that's obvious only after someone proves it's sufficient. 4D occupancy forecasting in one step is the result worth poking at — that's the regime where iterative sampling actually kills you in a perception loop.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.11155","pdf":"https://arxiv.org/pdf/2606.11155","html":"https://arxiv.org/html/2606.11155v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.11155"}},{"arxivId":"2606.11152v1","title":"P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning","authors":["Yikang Yang","Zhanpeng Hu","Youtian Lin","Mengqi Zhou","Jingxi Xu","Feihu Zhang","Jiaheng Liu","Yao Yao"],"categories":["cs.CV"],"abstract":"MLLMs can write code that drives 3D modeling, which opens a path to 3D generation that leans on their priors and reasoning. But most benchmarks score meshes, not programs. P3D-Bench scores parametric 3D programs, which expose explicit dimensions, construction operations, and part relations — revealing whether a model recovers a design's structure, not just its look. It covers Text-to-3D, Image-to-3D, and Assembly-3D, grading executability, geometric fidelity, topology, text-grounded constraints, multiview alignment, and part-level structure across 400 text, 400 image, and 203 annotated assembly cases. Findings: assemblies are hardest; models recover global shape and identity but miss precise parametric geometry; and part-level modeling stays weak.","take":"This benchmarks the thing I actually care about in CAD/BIM work — can the model produce a parametric program with the right dimensions and part relations, not a pretty mesh that's geometrically wrong. 
The honest result is the useful one: frontier models get the silhouette and semantics but fluff the exact geometry and assembly structure. That gap is exactly where these pipelines break in practice.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.11152","pdf":"https://arxiv.org/pdf/2606.11152","html":"https://arxiv.org/html/2606.11152v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.11152"}},{"arxivId":"2606.11129v1","title":"WorldOlympiad: Can Your World Model Survive a Triathlon?","authors":["Yuke Zhao","Wangbo Zhao","Weijie Wang","Zeyu Zhang","Dakai An","Akide Liu","Yinghao Yu","Jiasheng Tang","Fan Wang","Wei Wang","Bohan Zhuang"],"categories":["cs.CV"],"abstract":"WorldOlympiad diagnoses video-based world models along physical faithfulness, geometric consistency, and interaction fidelity, instead of the usual visual-quality and short-horizon temporal checks. The physical track uses object segmentation plus an MLLM judge to test mechanics, thermal, and material rules. The geometry track reconstructs generated videos with Gaussian splatting and scores structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track checks whether rollouts follow complex action prompts and stay coherent across consecutive chunks, spanning gaming, robotics, and real-world video. Experiments on SOTA models show large gaps in physical reasoning, 3D consistency, and long-horizon interaction.","take":"The smart instrumentation is the geometry track — reconstruct the generated video with Gaussian splatting and measure whether the implied 3D is actually consistent across views. That turns \"looks 3D\" into a number. 
The takeaway is unsurprising but worth having stated: today's world models render well and reason about physics badly.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.11129","pdf":"https://arxiv.org/pdf/2606.11129","html":"https://arxiv.org/html/2606.11129v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.11129"}},{"arxivId":"2606.11180v1","title":"Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization","authors":["Paul Hyunbin Cho","Jinhyuk Jang","SeokYoung Lee","Joungbin Lee","Siyoon Jin","Heeseong Shin","Jung Yi","Yunjin Park","Chulmin Park","Seungryong Kim"],"categories":["cs.CV"],"abstract":"Diffusion lip-sync models have strong audio-visual alignment but full bidirectional attention and many denoising steps make them too slow for real time. Lip Forcing distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students that generate each chunk in two denoising steps with no inference-time CFG. A trajectory analysis reveals a CFG fidelity-versus-sync tradeoff — no-CFG predictions favor reference fidelity, CFG-guided ones favor sync within a mid-trajectory band — which the method turns into three components: Sync-Window DMD, a two-step schedule, and a SyncNet reward. The 1.3B student streams at 31 FPS, 17.6x faster than its bidirectional counterpart; the 14B student runs 39.8x faster than its teacher at comparable fidelity, with sub-millisecond time-to-first-frame.","take":"The interesting engineering is converting a trajectory observation — CFG helps sync only inside a mid-trajectory band — into a scheduled, windowed distillation instead of a global knob. Two steps, no inference-time CFG, causal students: that's a real path from a 14B bidirectional teacher to something that streams. 
Sub-millisecond TTFF is the number that decides whether this is usable live.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.11180","pdf":"https://arxiv.org/pdf/2606.11180","html":"https://arxiv.org/html/2606.11180v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.11180"}},{"arxivId":"2606.10971v1","title":"Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection","authors":["Batu Candan","Mohammed Atallah","Simone Servadio","Saeed Arabi"],"categories":["cs.RO","eess.SY"],"abstract":"State estimation for off-road agricultural robots is degraded by sensor outages (GNSS/LiDAR/visual) and high-frequency vibration. This work uses a jerk-augmented Extended Kalman Filter paired with a Multiple Tuning Factor adaptation that adjusts the measurement covariance in real time instead of assuming constant measurement noise, letting the filter handle sudden disturbances and outliers. Evaluated on real-world data from a Salin247 robot, jerk-augmentation plus MTF adaptation cuts 3D position RMSE versus baseline EKF models and gives better dead-reckoning when sensors drop out.","take":"Not flashy, and that's the point — when LiDAR and GNSS drop out, the thing that saves you is a well-tuned filter, not a bigger network. Adding jerk to the state and adapting the measurement covariance online is the pragmatic fix for vibration-heavy platforms. 
I'd want to see how the MTF gains were chosen before trusting it off the test field, but the dead-reckoning story is the right thing to optimize.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.10971","pdf":"https://arxiv.org/pdf/2606.10971","html":"https://arxiv.org/html/2606.10971v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.10971"}},{"arxivId":"2606.11088v1","title":"A Distributed Multi-UGV Exploration Framework With Loop-Aware Planning and Descriptor-Aided Localization in Resource-Limited Environments","authors":["Zhiwei Li","Haiou Liu","Xijun Zhao","Ji Li","Yingze Wang","Boyang Wang"],"categories":["cs.RO"],"abstract":"Cooperative exploration with multiple ground robots in unknown, GPS-denied, bandwidth-limited environments is hard because localization drift breaks map consistency and causes redundant coverage. This framework couples descriptor-aided inter-robot loop closure with loop-aware hierarchical planning. A lightweight LiDAR global descriptor with range-image pre-alignment enables cross-robot place recognition under large yaw and lateral shifts, and verified loop closures maintain globally consistent trajectories over a sparse topological map. An uncertainty-aware loop-closure selection module scores candidates under pose uncertainty and keeps high-utility ones as planning anchors. The loop-closure module hits AR@1/AR@1% of 89.9%/95.5%, cuts trajectory error and two-way communication, and reduces exploration time and distance by 15% and 14% versus an mTSP baseline.","take":"The detail that makes this real is the bandwidth angle — a LiDAR global descriptor compact enough to share for cross-robot place recognition, not raw scans. Folding loop closures into the planner as anchors rather than treating SLAM and planning as separate stages is the right coupling. 
15% less exploration time is modest but honest for a fully distributed, GPS-denied setup.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.11088","pdf":"https://arxiv.org/pdf/2606.11088","html":"https://arxiv.org/html/2606.11088v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.11088"}}],"kind":"arxiv","slug":"2026-06-10","body":"Diffusion-and-distillation heavy today, with a 3D thread running through it. The standout is\nMean Flow Distillation — single-step flow-matching generation framed as suppressing\nhigh-frequency optimization noise, with 4D occupancy forecasting as the test case that\nmatters for perception loops. Also worth a look: P3D-Bench, which scores parametric 3D\n*programs* instead of meshes and confirms MLLMs still miss exact geometry, and WorldOlympiad's\nGaussian-splatting geometry track that turns \"looks 3D\" into a measurable number.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-10"},{"date":"2026-06-09","papers":[{"arxivId":"2606.09634v1","title":"ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity","authors":["Debojyoti Biswas","Xianbiao Hu"],"categories":["cs.CV","cs.AI"],"abstract":"3D object detection is the backbone of perception for automated vehicles. Long-range detection is hard because sensing evidence is sparse, yet on roadways >30m affords only ~1–2s to perceive and decide. Under extreme sparsity, early multimodal fusion tends to discard sparsity information and inject noise from empty/falsely-occupied cells, and uniform channel supervision favors dense near-range samples. ATN3D (\"Ask The Neighbor\") introduces density-aware early fusion with cross-modal gating, occupancy-gated neighborhood aggregation with circular kernels, evidence-conditioned channel self-attention, and a range-aware loss. 
On the VoD benchmark it beats strong baselines by +3.55% mAP (clear) and +8.41% mAP (heavy fog); for >30m objects, +3.33% and +2.09%.","take":"This is exactly the regime that matters for real perception — long range is where the time budget is smallest and the points are fewest. The smart bit isn't a bigger backbone; it's making fusion and supervision density-aware so the model stops drowning sparse far-range evidence in near-range noise. The fog numbers (+8.4 mAP) are the ones I'd trust as a signal it's solving the actual problem.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.09634","pdf":"https://arxiv.org/pdf/2606.09634","html":"https://arxiv.org/html/2606.09634v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09634"}},{"arxivId":"2606.09828v1","title":"Latent Spatial Memory for Video World Models","authors":["Weijie Wang","Haoyu Zhao","Yifan Yang","Feng Chen","Zeyu Zhang","Bohan Zhuang"],"categories":["cs.CV"],"abstract":"Video world models that keep 3D spatial consistency across frames usually rely on an explicit point-cloud memory built in RGB space — expensive (repeated rendering + VAE encoding) and lossy (the pixel-space round trip discards latent features). This paper introduces \"latent spatial memory\": a persistent 3D cache that stores scene information directly in the diffusion latent space. Their framework, Mirage, lifts latent tokens into 3D via depth-guided back-projection and queries by synthesizing novel views through direct latent-space warping. Reported: up to 10.57× faster end-to-end generation and 55× lower memory vs explicit 3D baselines, with SOTA on WorldScore.","take":"Keeping the spatial memory in latent space instead of round-tripping through pixels is the obvious-in-hindsight move, and the 55× memory cut is the kind of systems win that decides whether a world model is deployable or a demo. 
Depth-guided back-projection of latent tokens is a clean idea.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.09828","pdf":"https://arxiv.org/pdf/2606.09828","html":"https://arxiv.org/html/2606.09828v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09828"}},{"arxivId":"2606.09738v1","title":"HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents","authors":["Letian Li","Chao Shen","Shuzhao Xie","Chenghao Gu","Zhi Wang"],"categories":["cs.CV"],"abstract":"Text-driven indoor scene generation/editing needs an intermediate representation an LLM can both produce and revise. Scene graphs and global constraint lists are compact but underspecify local geometry and make edits hard to localize. HDSL frames it as structured program generation + local program repair: an XML/CSS-style DSL representing rooms, regions, objects, and support surfaces as a tree with local coordinates. LLM agents generate HDSL subtrees with bounded verification, ground nodes via multimodal asset retrieval, and apply force-directed layout to fix collisions. For editing, Hierarchical RAG rewrites only the relevant subtree and merges back deterministically — cutting tokens 5.22× and runtime 6.19× while preserving unrelated objects.","take":"Treating a 3D scene as a program you can locally repair — rather than a blob you regenerate — is the right abstraction, and it rhymes with how BIM and CAD actually work. 
The deterministic three-way merge on a scene tree is the detail that makes LLM editing trustworthy instead of destructive.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.09738","pdf":"https://arxiv.org/pdf/2606.09738","html":"https://arxiv.org/html/2606.09738v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09738"}},{"arxivId":"2606.09719v1","title":"Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions","authors":["Alejandro Gonzalez-Garcia","Dries Dirckx","Jan Swevers","Wilm Decré"],"categories":["cs.RO"],"abstract":"Robots in tight spaces need planning that respects their actual footprint; a point/circle approximation throws away the info needed to thread narrow passages. This work keeps a polytopic robot footprint inside a continuously-updated convex free-space region, formulated as discrete-time control-barrier-function constraints inside an MPC. The number of safety constraints scales with local free-space geometry and robot shape, not the number of obstacles, and it needs no obstacle detection or segmentation. Up to 91× faster than a polytope-based obstacle-avoidance formulation as obstacles grow; validated at 10 Hz on embedded hardware with occupancy grids and LiDAR.","take":"Making cost scale with free-space complexity instead of obstacle count is the elegant inversion here — and \"no obstacle detection required\" sidesteps a whole brittle perception stage. 
10 Hz on an onboard embedded computer is the line that says this is real, not just a sim result.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.09719","pdf":"https://arxiv.org/pdf/2606.09719","html":"https://arxiv.org/html/2606.09719v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09719"}},{"arxivId":"2606.09718v1","title":"Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles","authors":["Xiao Li","Yixuan Jia","Zekai Zhang","Liyue Shen","Qing Qu"],"categories":["cs.LG","cs.CV"],"abstract":"Diffusion models are both strong generators and strong self-supervised representation learners, but the link is under-explored. This paper decomposes features into invariant and residual components and derives the Invariant Contamination Ratio (ICR), a Fisher-based metric for how residual variation contaminates the invariant signal. Findings: invariance peaks at intermediate noise levels (which also give the best downstream classification), and ICR is a sensitive training-time indicator of the onset of memorization — detectable from training features alone, with no held-out set or external evaluator.","take":"The practically useful nugget: a training-time memorization detector that needs no held-out data. If ICR holds up, that's a cheap early-warning light for \"your model has stopped generalizing\" — exactly the thing you want in data-limited fine-tuning runs.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.09718","pdf":"https://arxiv.org/pdf/2606.09718","html":"https://arxiv.org/html/2606.09718v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09718"}}],"kind":"arxiv","slug":"2026-06-09","body":"Heavy 3D-perception day. The standout is ATN3D — LiDAR-Radar fusion tuned for the\nlong-range, sparse regime that actually constrains autonomous perception. 
Also notable: a\nlatent-space spatial memory for video world models (55× less memory), and a footprint-aware\nmotion planner whose cost scales with free space, not obstacle count.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-09"},{"date":"2026-06-03","papers":[{"arxivId":"2606.05162v1","title":"Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text","authors":["Jaeyeong Kim","Ines Kim","Jahyeok Koo","Seungryong Kim"],"categories":["cs.CV"],"abstract":"We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move.","take":"Text-to-3D-motion has always been mushy because language underspecifies geometry — anchoring generation to explicit 3D point trajectories is the right fix. Feed-forward (no per-scene optimization) makes this actually usable in a pipeline.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.05162","pdf":"https://arxiv.org/pdf/2606.05162","html":"https://arxiv.org/html/2606.05162v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.05162"}},{"arxivId":"2606.05142v1","title":"GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes","authors":["Josef Bengtson","Yaroslava Lochman","Fredrik Kahl"],"categories":["cs.CV","cs.AI"],"abstract":"Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. 
This naturally limits these methods to edits that preserve the underlying scene structure.","take":"Multi-view consistency for *nonrigid* edits is the hard version of the problem — most methods cheat by keeping the original geometry. Worth a read if you care about editable digital twins.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.05142","pdf":"https://arxiv.org/pdf/2606.05142","html":"https://arxiv.org/html/2606.05142v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.05142"}},{"arxivId":"2606.05152v1","title":"Reinforcement Learning from Rich Feedback with Distributional DAgger","authors":["Rishabh Agrawal","Jacob Fein-Ashley","Paria Rashidinejad"],"categories":["cs.LG","cs.AI","cs.CL"],"abstract":"Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations.","take":"RLVR throws away everything except one bit per rollout — execution traces and tool outputs are sitting right there. Using them is obvious in hindsight, which is usually the mark of a good idea.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.05152","pdf":"https://arxiv.org/pdf/2606.05152","html":"https://arxiv.org/html/2606.05152v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.05152"}},{"arxivId":"2606.05145v1","title":"Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)","authors":["Nizar Islah","Istabrak Abbes","Irina Rish","Sarath Chandar","Eilif B. Muller"],"categories":["cs.LG","cs.AI","cs.CL"],"abstract":"When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. 
We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget.","take":"Separating \"unlucky sampling\" failures from structural ones before you burn test-time compute is a genuinely useful triage — resampling a structurally broken trace is just paying to fail again.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.05145","pdf":"https://arxiv.org/pdf/2606.05145","html":"https://arxiv.org/html/2606.05145v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.05145"}}],"kind":"arxiv","slug":"2026-06-03","body":"First digest from the paper-scout pipeline. Heavy 3D day on cs.CV — trajectory-conditioned\ndynamic shape generation (T2Mo) is the standout: explicit spatial guidance over text-only\nconditioning is a pattern I expect to see everywhere in 3D generation this year.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-03"}]