arXiv digest

2026-06-09 · 5 papers · curated by paper-scout

Heavy 3D-perception day. The standout is ATN3D: LiDAR-Radar fusion tuned for the long-range, sparse regime that actually constrains autonomous perception. Also notable: a latent-space spatial memory for video world models (55× less memory), and a footprint-aware motion planner whose safety-constraint count scales with free-space geometry, not obstacle count.

★ ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity

2606.09634

Debojyoti Biswas, Xianbiao Hu · cs.CV cs.AI

3D object detection is the backbone of perception for automated vehicles. Long-range detection is hard because sensing evidence is sparse, yet on roadways even a range beyond 30 m affords only ~1–2 s to perceive and decide. Under extreme sparsity, early multimodal fusion tends to discard sparsity information and inject noise from empty or falsely occupied cells, and uniform channel supervision favors dense near-range samples. ATN3D ("Ask The Neighbor") introduces density-aware early fusion with cross-modal gating, occupancy-gated neighborhood aggregation with circular kernels, evidence-conditioned channel self-attention, and a range-aware loss. On the VoD benchmark it beats strong baselines by +3.55% mAP (clear) and +8.41% mAP (heavy fog); for objects beyond 30 m, the gains are +3.33% and +2.09%.
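The ~1–2 s figure is plain kinematics: time to reach an object is range divided by closing speed. A minimal sketch, assuming a stationary object and constant ego speed (the function name and the two speeds are illustrative assumptions, not taken from the paper):

```python
def time_budget_s(range_m: float, closing_speed_mps: float) -> float:
    """Seconds until the ego vehicle reaches an object at range_m."""
    if closing_speed_mps <= 0:
        raise ValueError("closing speed must be positive")
    return range_m / closing_speed_mps


# Illustrative speeds (assumed): 54 km/h and 108 km/h, object 30 m ahead.
for kmh in (54, 108):
    mps = kmh / 3.6  # convert km/h to m/s
    print(f"{kmh} km/h: {time_budget_s(30.0, mps):.1f} s to cover 30 m")
```

At those two speeds the budget lands at roughly 2 s and 1 s, and that is before subtracting braking distance.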

take · This is exactly the regime that matters for real perception: long range is where the time budget is smallest and the points are fewest. The smart bit isn't a bigger backbone; it's making fusion and supervision density-aware so the model stops drowning sparse far-range evidence in near-range noise. The heavy-fog gain (+8.41% mAP) is the number I'd trust as a signal it's solving the actual problem.

abs · pdf · html · ar5iv

Latent Spatial Memory for Video World Models

2606.09828

Weijie Wang, Haoyu Zhao, Yifan Yang, Feng Chen et al. · cs.CV

Video world models that keep 3D spatial consistency across frames usually rely on an explicit point-cloud memory built in RGB space — expensive (repeated rendering + VAE encoding) and lossy (the pixel-space round trip discards latent features). This paper introduces "latent spatial memory": a persistent 3D cache that stores scene information directly in the diffusion latent space. Their framework, Mirage, lifts latent tokens into 3D via depth-guided back-projection and queries by synthesizing novel views through direct latent-space warping. Reported: up to 10.57× faster end-to-end generation and 55× lower memory vs explicit 3D baselines, with SOTA on WorldScore.
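Depth-guided back-projection is ordinary pinhole geometry applied on the latent grid instead of the pixel grid. A minimal sketch, assuming a pinhole camera and a latent grid downsampled by a stride of 8 (the `Intrinsics` class, the function names, and every number are illustrative assumptions, not Mirage's code):

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    fx: float
    fy: float
    cx: float
    cy: float


def scale_intrinsics(k: Intrinsics, stride: int) -> Intrinsics:
    """Rescale pixel-space intrinsics to a grid downsampled by `stride`.

    Simplification: ignores the half-pixel offset some conventions apply
    to the principal point.
    """
    s = 1.0 / stride
    return Intrinsics(k.fx * s, k.fy * s, k.cx * s, k.cy * s)


def back_project(u: float, v: float, depth: float, k: Intrinsics):
    """Lift latent-grid cell (u, v) at `depth` to a camera-frame 3D point."""
    x = (u - k.cx) * depth / k.fx
    y = (v - k.cy) * depth / k.fy
    return (x, y, depth)


k_pixel = Intrinsics(fx=800.0, fy=800.0, cx=320.0, cy=240.0)  # assumed camera
k_latent = scale_intrinsics(k_pixel, stride=8)                # assumed VAE stride

# Each latent token would be stored in the 3D cache at its lifted position.
point = back_project(u=50.0, v=30.0, depth=5.0, k=k_latent)
print(point)
```

Querying a new view would run the same geometry in reverse: project cached points through the target camera and gather their latent vectors, with no decode to RGB in between.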

take · Keeping the spatial memory in latent space instead of round-tripping through pixels is the obvious-in-hindsight move, and the 55× memory cut is the kind of systems win that decides whether a world model is deployable or a demo. Depth-guided back-projection of latent tokens is a clean idea.

abs · pdf · html · ar5iv

HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents

2606.09738

Letian Li, Chao Shen, Shuzhao Xie, Chenghao Gu et al. · cs.CV

Text-driven indoor scene generation/editing needs an intermediate representation an LLM can both produce and revise. Scene graphs and global constraint lists are compact but underspecify local geometry and make edits hard to localize. HDSL frames it as structured program generation + local program repair: an XML/CSS-style DSL representing rooms, regions, objects, and support surfaces as a tree with local coordinates. LLM agents generate HDSL subtrees with bounded verification, ground nodes via multimodal asset retrieval, and apply force-directed layout to fix collisions. For editing, Hierarchical RAG rewrites only the relevant subtree and merges back deterministically — cutting tokens 5.22× and runtime 6.19× while preserving unrelated objects.
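The "rewrite one subtree, merge back deterministically" idea can be sketched with a plain XML tree. This is only an illustration: the tag names, attributes, and `replace_subtree` helper are hypothetical, not HDSL's actual schema or API, and the edited subtree is hand-written where an LLM agent would produce it.

```python
import xml.etree.ElementTree as ET

# Hypothetical scene tree; HDSL's real schema is not shown in the digest.
SCENE = """
<room id="living">
  <region id="seating">
    <object id="sofa" x="0.0" y="1.2"/>
    <object id="lamp" x="1.5" y="1.0"/>
  </region>
  <region id="dining">
    <object id="table" x="3.0" y="2.0"/>
  </region>
</room>
""".strip()


def replace_subtree(root: ET.Element, node_id: str, new_subtree: ET.Element) -> bool:
    """Swap the element whose id equals node_id for new_subtree, in place.

    Every other subtree is left untouched, so the same inputs always
    produce the same merged tree.
    """
    for parent in root.iter():
        for i, child in enumerate(parent):
            if child.get("id") == node_id:
                parent[i] = new_subtree
                return True
    return False


root = ET.fromstring(SCENE)
# Stand-in for the rewritten subtree: move the sofa, drop the lamp.
edited = ET.fromstring('<region id="seating"><object id="sofa" x="0.5" y="1.2"/></region>')
merged = replace_subtree(root, "seating", edited)
print(merged, root.find(".//object[@id='table']").attrib)
```

Only the `seating` region changes; the `dining` subtree never enters the edit, which is where the token and runtime savings would come from.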

take · Treating a 3D scene as a program you can locally repair — rather than a blob you regenerate — is the right abstraction, and it rhymes with how BIM and CAD actually work. The deterministic three-way merge on a scene tree is the detail that makes LLM editing trustworthy instead of destructive.

abs · pdf · html · ar5iv

Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions

2606.09719

Alejandro Gonzalez-Garcia, Dries Dirckx, Jan Swevers, Wilm Decré · cs.RO

Robots in tight spaces need planning that respects their actual footprint; a point/circle approximation throws away the info needed to thread narrow passages. This work keeps a polytopic robot footprint inside a continuously-updated convex free-space region, formulated as discrete-time control-barrier-function constraints inside an MPC. The number of safety constraints scales with local free-space geometry and robot shape, not the number of obstacles, and it needs no obstacle detection or segmentation. Up to 91× faster than a polytope-based obstacle-avoidance formulation as the number of obstacles grows; validated at 10 Hz on embedded hardware with occupancy grids and LiDAR.
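Why the constraint count ignores obstacles: containment of a convex footprint in a convex region {p : a·p ≤ b} only needs every footprint vertex to satisfy every face, so the count is vertices × faces. A minimal 2D sketch, assuming a fixed free-space region and the common discrete-time CBF condition h(x_{k+1}) ≥ (1 − γ)·h(x_k); the function names, γ value, and geometry are illustrative assumptions, and the paper's formulation (with a continuously updated region) may differ:

```python
import math


def world_vertices(pose, body_vertices):
    """Transform footprint vertices from body frame to world frame."""
    x, y, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(x + c * vx - s * vy, y + s * vx + c * vy) for vx, vy in body_vertices]


def barrier_values(pose, body_vertices, halfspaces):
    """h = b - a.v for each (vertex, face) pair; all h >= 0 means contained."""
    return [b - (ax * px + ay * py)
            for (px, py) in world_vertices(pose, body_vertices)
            for (ax, ay, b) in halfspaces]


def cbf_step_ok(pose_k, pose_next, body_vertices, halfspaces, gamma=0.5):
    """Check h(x_{k+1}) >= (1 - gamma) * h(x_k) for every barrier."""
    h_k = barrier_values(pose_k, body_vertices, halfspaces)
    h_next = barrier_values(pose_next, body_vertices, halfspaces)
    return all(hn >= (1.0 - gamma) * hk for hk, hn in zip(h_k, h_next))


# Free space: the box 0 <= x <= 4, 0 <= y <= 2, as (a_x, a_y, b) rows.
free_space = [(1, 0, 4), (-1, 0, 0), (0, 1, 2), (0, -1, 0)]
# Footprint: a 1.0 m x 0.6 m rectangle centred on the robot.
footprint = [(0.5, 0.3), (0.5, -0.3), (-0.5, -0.3), (-0.5, 0.3)]

start = (1.0, 1.0, 0.0)
n_constraints = len(barrier_values(start, footprint, free_space))
gentle = cbf_step_ok(start, (1.2, 1.0, 0.0), footprint, free_space)
abrupt = cbf_step_ok(start, (1.0, 1.6, 0.0), footprint, free_space)
print(n_constraints, gentle, abrupt)
```

The second step keeps the footprint inside the box but closes the gap to the wall too fast for the chosen γ, so the barrier condition rejects it; that rate limit is what the CBF adds over a plain containment check.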

take · Making the constraint count scale with free-space geometry instead of obstacle count is the elegant inversion here, and "no obstacle detection required" sidesteps a whole brittle perception stage. 10 Hz on an onboard embedded computer is the line that says this is real, not just a sim result.

abs · pdf · html · ar5iv

Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles

2606.09718

Xiao Li, Yixuan Jia, Zekai Zhang, Liyue Shen et al. · cs.LG cs.CV

Diffusion models are both strong generators and strong self-supervised representation learners, but the link is under-explored. This paper decomposes features into invariant and residual components and derives the Invariant Contamination Ratio (ICR), a Fisher-based metric for how residual variation contaminates the invariant signal. Findings: invariance peaks at intermediate noise levels (which also give the best downstream classification), and ICR is a sensitive training-time indicator of the onset of memorization — detectable from training features alone, with no held-out set or external evaluator.

take · The practically useful nugget: a training-time memorization detector that needs no held-out data. If ICR holds up, that's a cheap early-warning light for "your model has stopped generalizing" — exactly the thing you want in data-limited fine-tuning runs.

abs · pdf · html · ar5iv