[{"date":"2026-06-09","papers":[{"arxivId":"2606.09634v1","title":"ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity","authors":["Debojyoti Biswas","Xianbiao Hu"],"categories":["cs.CV","cs.AI"],"abstract":"3D object detection is the backbone of perception for automated vehicles. Long-range detection is hard because sensing evidence is sparse, yet on roadways a range of >30m affords only ~1–2s to perceive and decide. Under extreme sparsity, early multimodal fusion tends to discard sparsity information and inject noise from empty/falsely-occupied cells, and uniform channel supervision favors dense near-range samples. ATN3D (\"Ask The Neighbor\") introduces density-aware early fusion with cross-modal gating, occupancy-gated neighborhood aggregation with circular kernels, evidence-conditioned channel self-attention, and a range-aware loss. On the VoD benchmark it beats strong baselines by +3.55% mAP (clear) and +8.41% mAP (heavy fog); for >30m objects, +3.33% and +2.09%.","take":"This is exactly the regime that matters for real perception — long range is where the time budget is smallest and the points are fewest. The smart bit isn't a bigger backbone; it's making fusion and supervision density-aware so the model stops drowning sparse far-range evidence in near-range noise. 
The fog numbers (+8.4 mAP) are the ones I'd trust as a signal it's solving the actual problem.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.09634","pdf":"https://arxiv.org/pdf/2606.09634","html":"https://arxiv.org/html/2606.09634v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09634"}},{"arxivId":"2606.09828v1","title":"Latent Spatial Memory for Video World Models","authors":["Weijie Wang","Haoyu Zhao","Yifan Yang","Feng Chen","Zeyu Zhang","Bohan Zhuang"],"categories":["cs.CV"],"abstract":"Video world models that keep 3D spatial consistency across frames usually rely on an explicit point-cloud memory built in RGB space — expensive (repeated rendering + VAE encoding) and lossy (the pixel-space round trip discards latent features). This paper introduces \"latent spatial memory\": a persistent 3D cache that stores scene information directly in the diffusion latent space. Their framework, Mirage, lifts latent tokens into 3D via depth-guided back-projection and queries by synthesizing novel views through direct latent-space warping. Reported: up to 10.57× faster end-to-end generation and 55× lower memory vs explicit 3D baselines, with SOTA on WorldScore.","take":"Keeping the spatial memory in latent space instead of round-tripping through pixels is the obvious-in-hindsight move, and the 55× memory cut is the kind of systems win that decides whether a world model is deployable or a demo. 
Depth-guided back-projection of latent tokens is a clean idea.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.09828","pdf":"https://arxiv.org/pdf/2606.09828","html":"https://arxiv.org/html/2606.09828v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09828"}},{"arxivId":"2606.09738v1","title":"HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents","authors":["Letian Li","Chao Shen","Shuzhao Xie","Chenghao Gu","Zhi Wang"],"categories":["cs.CV"],"abstract":"Text-driven indoor scene generation/editing needs an intermediate representation an LLM can both produce and revise. Scene graphs and global constraint lists are compact but underspecify local geometry and make edits hard to localize. HDSL frames it as structured program generation + local program repair: an XML/CSS-style DSL representing rooms, regions, objects, and support surfaces as a tree with local coordinates. LLM agents generate HDSL subtrees with bounded verification, ground nodes via multimodal asset retrieval, and apply force-directed layout to fix collisions. For editing, Hierarchical RAG rewrites only the relevant subtree and merges back deterministically — cutting tokens 5.22× and runtime 6.19× while preserving unrelated objects.","take":"Treating a 3D scene as a program you can locally repair — rather than a blob you regenerate — is the right abstraction, and it rhymes with how BIM and CAD actually work. 
The deterministic three-way merge on a scene tree is the detail that makes LLM editing trustworthy instead of destructive.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.09738","pdf":"https://arxiv.org/pdf/2606.09738","html":"https://arxiv.org/html/2606.09738v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09738"}},{"arxivId":"2606.09719v1","title":"Safe Polytope-in-Polytope Motion Planning and Control with Control Barrier Functions","authors":["Alejandro Gonzalez-Garcia","Dries Dirckx","Jan Swevers","Wilm Decré"],"categories":["cs.RO"],"abstract":"Robots in tight spaces need planning that respects their actual footprint; a point/circle approximation throws away the info needed to thread narrow passages. This work keeps a polytopic robot footprint inside a continuously-updated convex free-space region, formulated as discrete-time control-barrier-function constraints inside an MPC. The number of safety constraints scales with local free-space geometry and robot shape, not the number of obstacles, and it needs no obstacle detection or segmentation. Up to 91× faster than a polytope-based obstacle-avoidance formulation as obstacles grow; validated at 10 Hz on embedded hardware with occupancy grids and LiDAR.","take":"Making cost scale with free-space complexity instead of obstacle count is the elegant inversion here — and \"no obstacle detection required\" sidesteps a whole brittle perception stage. 
10 Hz on an onboard embedded computer is the line that says this is real, not just a sim result.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.09719","pdf":"https://arxiv.org/pdf/2606.09719","html":"https://arxiv.org/html/2606.09719v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09719"}},{"arxivId":"2606.09718v1","title":"Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles","authors":["Xiao Li","Yixuan Jia","Zekai Zhang","Liyue Shen","Qing Qu"],"categories":["cs.LG","cs.CV"],"abstract":"Diffusion models are both strong generators and strong self-supervised representation learners, but the link is under-explored. This paper decomposes features into invariant and residual components and derives the Invariant Contamination Ratio (ICR), a Fisher-based metric for how residual variation contaminates the invariant signal. Findings: invariance peaks at intermediate noise levels (which also give the best downstream classification), and ICR is a sensitive training-time indicator of the onset of memorization — detectable from training features alone, with no held-out set or external evaluator.","take":"The practically useful nugget: a training-time memorization detector that needs no held-out data. If ICR holds up, that's a cheap early-warning light for \"your model has stopped generalizing\" — exactly the thing you want in data-limited fine-tuning runs.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.09718","pdf":"https://arxiv.org/pdf/2606.09718","html":"https://arxiv.org/html/2606.09718v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.09718"}}],"kind":"arxiv","slug":"2026-06-09","body":"\nHeavy 3D-perception day. The standout is ATN3D — LiDAR-Radar fusion tuned for the\nlong-range, sparse regime that actually constrains autonomous perception. 
Also notable: a\nlatent-space spatial memory for video world models (55× less memory), and a footprint-aware\nmotion planner whose cost scales with free space, not obstacle count.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-09"},{"date":"2026-06-03","papers":[{"arxivId":"2606.05162v1","title":"Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text","authors":["Jaeyeong Kim","Ines Kim","Jahyeok Koo","Seungryong Kim"],"categories":["cs.CV"],"abstract":"We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as controllable spatial guidance, specifying the exact paths along which selected points should move.","take":"Text-to-3D-motion has always been mushy because language underspecifies geometry — anchoring generation to explicit 3D point trajectories is the right fix. Feed-forward (no per-scene optimization) makes this actually usable in a pipeline.","standout":true,"links":{"abs":"https://arxiv.org/abs/2606.05162","pdf":"https://arxiv.org/pdf/2606.05162","html":"https://arxiv.org/html/2606.05162v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.05162"}},{"arxivId":"2606.05142v1","title":"GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes","authors":["Josef Bengtson","Yaroslava Lochman","Fredrik Kahl"],"categories":["cs.CV","cs.AI"],"abstract":"Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. 
This naturally limits these methods to edits that preserve the underlying scene structure.","take":"Multi-view consistency for *nonrigid* edits is the hard version of the problem — most methods cheat by keeping the original geometry. Worth a read if you care about editable digital twins.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.05142","pdf":"https://arxiv.org/pdf/2606.05142","html":"https://arxiv.org/html/2606.05142v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.05142"}},{"arxivId":"2606.05152v1","title":"Reinforcement Learning from Rich Feedback with Distributional DAgger","authors":["Rishabh Agrawal","Jacob Fein-Ashley","Paria Rashidinejad"],"categories":["cs.LG","cs.AI","cs.CL"],"abstract":"Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations.","take":"RLVR throws away everything except one bit per rollout — execution traces and tool outputs are sitting right there. Using them is obvious in hindsight, which is usually the mark of a good idea.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.05152","pdf":"https://arxiv.org/pdf/2606.05152","html":"https://arxiv.org/html/2606.05152v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.05152"}},{"arxivId":"2606.05145v1","title":"Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)","authors":["Nizar Islah","Istabrak Abbes","Irina Rish","Sarath Chandar","Eilif B. Muller"],"categories":["cs.LG","cs.AI","cs.CL"],"abstract":"When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. 
We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget.","take":"Separating \"unlucky sampling\" failures from structural ones before you burn test-time compute is a genuinely useful triage — resampling a structurally-broken trace is just paying to fail again.","standout":false,"links":{"abs":"https://arxiv.org/abs/2606.05145","pdf":"https://arxiv.org/pdf/2606.05145","html":"https://arxiv.org/html/2606.05145v1","ar5iv":"https://ar5iv.labs.arxiv.org/html/2606.05145"}}],"kind":"arxiv","slug":"2026-06-03","body":"\nFirst digest from the paper-scout pipeline. Heavy 3D day on cs.CV — trajectory-conditioned\ndynamic shape generation (T2Mo) is the standout: explicit spatial guidance over text-only\nconditioning is a pattern I expect to see everywhere in 3D generation this year.\n","readingTimeMins":1,"url":"https://ai.thesatyajit.com/arxiv/2026-06-03"}]