arXiv digest

2026-06-28 · 7 papers · curated by paper-scout

★ RoPEMover: Depth-Aware Object Relocation via Positional Embeddings

2606.27332

Ipek Oztas, Duygu Ceylan, Aybars Bugra Aksoy, Aysegul Dundar · cs.CV

Moving an object in a single image requires geometry-consistent spatial rearrangement, including handling occlusions, revealing previously unseen regions, and maintaining coherent shadows and reflections. Existing approaches are not well suited to this setting and often fail to preserve such scene-level consistency. We address this problem by introducing a geometry-aware object motion method that operates directly on the positional representations of diffusion transformers. Our key insight is that rotary positional embeddings (RoPE) define a structured spatial field that can be explicitly manipulated to induce controlled motion. We extend 2D RoPE into a depth-aware formulation that encodes 3D spatial structure, enabling consistent object displacement and scene-aware updates. Our model is trained using synthetic data combined with a small set of real images via parameter-efficient fine-tuning. Despite minimal real supervision, it preserves object identity under large spatial displacements, generates plausible content in newly revealed regions, and consistently updates scene-dependent effects such as shadows and illumination. Experimental results on standard object motion benchmarks demonstrate state-of-the-art performance across all evaluation metrics.

take · The same realization RayPE had, aimed at editing instead of generation: RoPE isn't just index bookkeeping, it's a structured spatial field you can manipulate. They extend 2D RoPE to a depth-aware form, so moving an object in a single image drags its occlusions, disocclusions, shadows, and reflections along with it. The result is geometry-consistent relocation by editing the positional embeddings rather than the latents, trained mostly on synthetic data with a thin parameter-efficient fine-tune on real images. The depth-aware RoPE is the transferable trick.
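
To make the "structured spatial field" point concrete, here is a minimal pure-Python sketch of axial RoPE with a third, depth axis. It is an illustration under assumptions, not the paper's formulation: the equal three-way feature split, the helper names `rope_rotate`, `rope_3d`, and `score`, and the toy vectors are all hypothetical. It shows the property such an edit would lean on: attention scores depend only on relative position, so tokens displaced together keep their mutual scores while their relation to the rest of the scene changes.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`."""
    n_pairs = len(vec) // 2
    out = []
    for i in range(n_pairs):
        theta = pos * base ** (-i / n_pairs)  # one frequency per feature pair
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def rope_3d(vec, xyz):
    """Axial RoPE (assumed layout): one equal feature chunk per axis (x, y, depth)."""
    chunk = len(vec) // 3  # len(vec) must be divisible by 6
    out = []
    for axis, p in enumerate(xyz):
        out += rope_rotate(vec[axis * chunk:(axis + 1) * chunk], p)
    return out

def score(q, k, q_pos, k_pos):
    """Dot-product attention logit between a rotated query and a rotated key."""
    qr, kr = rope_3d(q, q_pos), rope_3d(k, k_pos)
    return sum(a * b for a, b in zip(qr, kr))

# Toy query/key features and positions (hypothetical values).
q = [0.3, -1.2, 0.7, 0.1, 0.9, -0.4, 1.1, 0.2, -0.6, 0.5, 0.8, -0.3]
k = [-0.5, 0.4, 1.0, -0.7, 0.2, 0.6, -0.9, 0.3, 0.1, 1.3, -0.2, 0.7]
a_pos, b_pos = (1.0, 2.0, 0.5), (3.0, 1.0, 0.7)
shift = (4.0, -2.0, 0.3)  # move in x, y, and depth

def moved(p):
    return tuple(pi + si for pi, si in zip(p, shift))

both_moved = score(q, k, moved(a_pos), moved(b_pos))  # two tokens of the same object
one_moved = score(q, k, moved(a_pos), b_pos)          # object token vs. background token
print(abs(score(q, k, a_pos, b_pos) - both_moved) < 1e-9)
print(one_moved)
```

The first print checks that two tokens displaced together keep their score; the second shows the logit between a moved token and a stationary one, which is where the scene-level changes would have to come from.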

abs · pdf · html · ar5iv

Proposal-Conditioned Latent Diffusion for Closed-Loop Traffic Scenario Generation

2606.27123

Shubham Vaijanath Phoolari, Aleyna Kara, Christoph Lauer, Steven Peters · cs.RO cs.CV

Closed-loop traffic simulation remains challenging because it must generate interactive multi-agent behaviors that are scene-consistent and controllable throughout rollout. Prior diffusion-based approaches achieve strong realism, but their computational cost can hinder deployment in time-constrained replanning loops for autonomous vehicle planning and simulation. We present a diffusion-based scenario generation framework conditioned on instance-centric scene context and multimodal proposal priors, with optional test-time guidance for shaping safety-critical behaviors. A compact action-latent representation and proposal-based initialization improve sampling efficiency and reduce per-step runtime without retraining. Experiments on the Waymo Open Motion Dataset demonstrate a favorable balance among realism, safety, and controllability across diverse interactive scenarios, while showing that test-time guidance enables systematic trade-offs among competing objectives.

take · Closed-loop traffic sim where the diffusion sampler's cost is the whole problem: you can't drop a slow denoiser inside a time-constrained replanning loop. A compact action-latent representation plus proposal-based initialization cuts per-step runtime without retraining, and optional test-time guidance shapes safety-critical behaviors on demand. The realism/safety/controllability balance on Waymo Open Motion is the right trade-off to report; I'd want the actual per-step latency before believing "deployable."

abs · pdf · html · ar5iv

Pseudo-Text-Conditioned 3D Grounding DINO for Organ Localization in Abdominal CT

2606.27084

Siqi Chen, Han Gong, Keyi Hou, Jingxuan Yang et al. · cs.CV eess.IV

Reliable organ localization in abdominal CT can provide spatial priors for downstream trauma analysis. We propose CT-3GDINO, a lightweight 3D detector that adapts a Grounding-DINO-style query-based architecture to fixed organ localization using frozen pseudo-text class tokens instead of a real text encoder. The model combines a Swin3D visual backbone, bidirectional feature enhancement, pseudo-text-guided query selection, and a cross-modality decoder to predict normalized 3D boxes for liver, spleen, left kidney, right kidney, and bowel. We train and evaluate on 193 matched RSNA/RATIC CT volumes with segmentation-derived boxes. The best multi-scale model, trained from scratch, achieves 0.5830 overall top-1 class-wise mAP over 3D IoU thresholds from 0.1 to 0.7, outperforming fixed- and trainable-backbone classification-pretrained variants with 0.5570 and 0.4657 mAP. Performance is strong for coarse localization, with 0.9649 AP at IoU 0.1, but remains limited for strict box alignment, with 0.1552 AP at IoU 0.7. These results establish CT-3GDINO as an open-source baseline for pseudo-text-conditioned 3D organ localization and motivate future work on localization-aware pretraining, richer multimodal conditioning, and injury-focused detection.

take · Grounding-DINO for 3D boxes, but with frozen pseudo-text class tokens instead of a real text encoder — a clean simplification once your classes are a fixed set of organs. Swin3D backbone, bidirectional feature enhancement, and a cross-modality decoder hit 0.583 top-1 class-wise mAP over IoU 0.1-0.7 on 193 CT volumes, beating classification-pretrained backbones. The surprise is that training from scratch wins; the pseudo-text query trick is what I'd reuse for any fixed-vocabulary 3D detector.
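
For a feel of why the abstract's AP is so high at IoU 0.1 and so low at IoU 0.7, here is a small sketch of axis-aligned 3D IoU. The box layout `(x1, y1, z1, x2, y2, z2)` and the example boxes are assumptions for illustration; this is not the paper's evaluation code.

```python
def volume(box):
    """Volume of an axis-aligned box given as (x1, y1, z1, x2, y2, z2)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

def iou_3d(a, b):
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0  # boxes do not overlap along this axis
        inter *= overlap
    return inter / (volume(a) + volume(b) - inter)

# Hypothetical unit-cube ground truth and a prediction shifted by 0.2 per axis.
gt = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
pred = (0.2, 0.2, 0.2, 1.2, 1.2, 1.2)
iou = iou_3d(gt, pred)
print(iou >= 0.1, iou >= 0.7)
```

The overlap here is 0.8 per axis, so the IoU is 0.512 / 1.488, roughly 0.34: a modest offset on each axis passes the coarse threshold and fails the strict one, because errors compound across three dimensions.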

abs · pdf · html · ar5iv

CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs

2606.27264

Hashmat Shadab Malik, Anees Ur Rehman Hashmi, Numan Saeed, Muzammal Naseer et al. · cs.CV

Reasoning in multimodal large language models (MLLMs) has shown strong promise in medical imaging. However, this reasoning is usually free-form text judged only by its final answer, making it hard to interpret and verify, especially in 3D radiology, where a diagnosis should be traceable to evidence in the scan. Existing chest CT question-answering datasets compound this by reducing expert radiology reports to answer-only pairs, dropping the reasoning that links findings to conclusions and omitting the patient history clinicians rely on. As a result, reasoning-capable 3D chest CT MLLMs remain out of reach, as neither the structured supervision needed to train them nor the protocol needed to verify their reasoning yet exists. We introduce CORTEX (Clinically Organized Reasoning and sTructured EXplanation), a structured reasoning benchmark for 3D chest CT. For each question, CORTEX restores the missing reasoning as a four-stage diagnostic trace mirroring a radiologist's workflow: task understanding, visual observation, diagnostic reasoning, and answer synthesis. We generate these traces using frontier large language models with broad medical and general-domain knowledge, then filter and verify them with a stage-level evaluation protocol combining automated rubric scoring with expert radiologist review. Crucially, both the reasoning structure and evaluation rubrics are designed in close collaboration with clinicians. Built on CT-RATE, a large, publicly available chest CT dataset without reasoning annotations, CORTEX comprises 76,177 validated reasoning traces across open-ended VQA, closed-ended VQA, and report generation, providing both the structured supervision and the stage-level evaluation protocol needed to build and evaluate trustworthy reasoning models for 3D chest CT. Our dataset and evaluation code will be made publicly available upon acceptance.

take · The right complaint about medical MLLMs: a 3D CT diagnosis judged only on its final answer is unverifiable. CORTEX restores the four-stage radiologist trace — task understanding, visual observation, diagnostic reasoning, answer synthesis — as structured supervision, 76,177 traces validated by a stage-level rubric plus radiologist review. The value is the protocol, not just the dataset: it gives you something to grade the reasoning against, not only the answer.

abs · pdf · html · ar5iv

DanceOPD: On-Policy Generative Field Distillation

2606.27377

Wei Zhou, Xiongwei Zhu, Zelin Xu, Bo Dong et al. · cs.CV cs.CL cs.LG

Modern image generation demands a single model that unifies diverse capabilities, including text-to-image (T2I), local editing, and global editing. However, these capabilities are rarely naturally aligned and often conflict. For instance, editing tends to degrade T2I performance, while global and local editing interfere with each other. Consequently, effectively composing these capabilities has become a central challenge for image generation model training. To tackle this, we introduce DanceOPD, an on-policy generative field distillation framework for flow-matching models that routes each sample to one capability field, queries one low-noise student-induced state, and trains with a simple velocity MSE objective. With each capability source defined as a velocity field over the shared flow state space, the student learns from fields queried on its own rollout states to compose expert capabilities. This formulation also absorbs operator-defined fields such as classifier-free guidance. Comprehensive experiments on T2I, editing, realism-field absorption, and CFG absorption show that our approach improves multi-capability composition, strengthening target capabilities while preserving anchor generation quality. We believe this work establishes a practical route for generative field distillation in flow-matching models.

take · The unify-everything-in-one-image-model problem, stated honestly: text-to-image, local edit, and global edit actively fight each other. DanceOPD routes each sample to one capability "field" and distills the student on its own rollout states with a plain velocity MSE — composing expert velocity fields over a shared flow state. Tidy framing; the open question is whether on-policy querying actually keeps the capabilities from clobbering one another at scale.
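
A toy sketch of the loss shape described above, under heavy assumptions: states are scalars, time runs from 0 (noise) to 1 (data) so "low-noise" means `t_query` near 1, and the fields, the Euler rollout, and every name here (`student_rollout`, `opd_loss`, `t2i_field`, `edit_field`) are hypothetical. It only computes the routed velocity MSE on student-induced states; it takes no gradient step and is not the paper's implementation.

```python
def student_rollout(student_v, x, t_end, steps=8):
    """Euler-integrate the student's own velocity field from t=0 to t_end."""
    dt = t_end / steps
    t = 0.0
    for _ in range(steps):
        x = x + dt * student_v(x, t)
        t += dt
    return x

def opd_loss(student_v, teacher_fields, batch, t_query=0.9):
    """Mean squared velocity gap, each sample routed to exactly one teacher field."""
    total = 0.0
    for capability, x0 in batch:
        teacher_v = teacher_fields[capability]         # route the sample to one field
        x_t = student_rollout(student_v, x0, t_query)  # state the student itself reaches
        diff = student_v(x_t, t_query) - teacher_v(x_t, t_query)
        total += diff * diff
    return total / len(batch)

# Hypothetical capability fields over a scalar state.
def t2i_field(x, t):
    return 1.0 - x

def edit_field(x, t):
    return -x

def student(x, t):
    return 1.0 - x  # this toy student currently matches the t2i field only

teachers = {"t2i": t2i_field, "edit": edit_field}
print(opd_loss(student, teachers, [("t2i", 0.0), ("t2i", 2.0)]))
print(opd_loss(student, teachers, [("t2i", 0.0), ("edit", 0.5)]))
```

The point of the on-policy part is the `student_rollout` call: the teacher is queried at states the student actually visits, not at states drawn from the teacher's own trajectories.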

abs · pdf · html · ar5iv

OpenRCA 2.0: From Outcome Labels to Causal Process Supervision

2606.27154

Aoyang Fang, Yifan Yang, Jin'ao Shang, Qisheng Lu et al. · cs.AI

Root cause analysis (RCA) poses a holistic test of LLM agentic capabilities, such as long-context understanding, multi-step reasoning, and tool use. However, existing datasets suffer from a fundamental gap: they label only the root cause, not the propagation path connecting it to the observed symptom, which largely simplifies the task to naive pattern matching. To support rigorous evaluation, we introduce PAVE, a step-wise labeling protocol that leverages known interventions from fault injection to reconstruct causal propagation paths. The mechanism is forward verification: reasoning from cause to effect rather than inferring backward from symptoms. Applying PAVE yields OpenRCA 2.0 (500 instances), the first cross-system RCA benchmark with step-wise causal annotations for LLM agents. Across 11 frontier LLMs, recovering the exact root-cause set succeeds in only 20.7% of cases on average. To locate where this difficulty lies, we relax the criterion and find what we call the ungrounded diagnosis: agents identify at least one correct root-cause service in 76.0% of cases, but ground that service in a verified causal propagation path to the observed symptom in only 61.5%. Outcome-only evaluation hides this failure mode; step-wise causal ground truth is the missing piece for trustworthy LLM-based RCA agents.

take · Root-cause analysis is a real stress test for agents — long context, multi-step reasoning, tool use — and the honest headline is the number: across 11 frontier LLMs, the exact root-cause set is recovered only 20.7% of the time. The contribution is PAVE, which labels the causal propagation path from known fault injections (forward cause→effect) so the benchmark can't be solved by backward pattern-matching. Step-wise causal supervision is exactly what production RCA needs.
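
The gap between the abstract's three numbers (exact root-cause set, at least one correct root-cause service, and a diagnosis grounded in a verified path) is easier to see as a scoring function. This is a sketch under assumptions, not PAVE's actual scorer: the function name, the edge-set representation of the propagation path, and the service names are all hypothetical.

```python
def score_diagnosis(pred_roots, pred_path, true_roots, true_edges, symptom):
    """Return (exact, any_hit, grounded) for one RCA instance."""
    pred_roots, true_roots = set(pred_roots), set(true_roots)
    exact = pred_roots == true_roots
    hits = pred_roots & true_roots
    any_hit = bool(hits)
    hops = list(zip(pred_path, pred_path[1:]))
    # Grounded: the path starts at a correct root, ends at the observed
    # symptom, and every hop is a verified causal edge.
    grounded = (
        any_hit
        and len(pred_path) >= 2
        and pred_path[0] in hits
        and pred_path[-1] == symptom
        and all(hop in true_edges for hop in hops)
    )
    return exact, any_hit, grounded

# Hypothetical fault injection: db fails, propagates through api to frontend.
true_roots = {"db"}
true_edges = {("db", "api"), ("api", "frontend")}

full = score_diagnosis(["db"], ["db", "api", "frontend"],
                       true_roots, true_edges, "frontend")
ungrounded = score_diagnosis(["db", "cache"], ["db", "frontend"],
                             true_roots, true_edges, "frontend")
print(full)        # (True, True, True)
print(ungrounded)  # (False, True, False)
```

The second case is the "ungrounded diagnosis": the agent names a correct service but cannot connect it to the symptom through verified hops, which outcome-only scoring would still count as a partial hit.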

abs · pdf · html · ar5iv

Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds

2606.26964

Jiaming Bian, Bingliang Li, Yuehao Wu, Pichao Wang et al. · cs.AI cs.CV

As embodied AI and world models increasingly operate in dynamic 3D environments, visual perception must move beyond passively interpreting given observations toward actively deciding what to observe. We study this problem through camera planning in dynamic 3D story worlds, where the camera must not only generate smooth motion, but also decide what visual evidence should be acquired before it moves. We formulate this capability as Narrative-Grounded World Visual Attention, where the camera acts as an embodied observer that determines what to observe, how to compose the observation, and how to shift attention over time under narrative intent and physical 3D constraints. To realize this capability, we propose Look-Before-Move, a camera planning framework that separates observation specification from motion execution. It first builds a Semantic Observation Contract to convert directorial intent into executable visual constraints, then performs Monte Carlo Viewpoint Search to find narrative-compliant and geometrically feasible viewpoints, and finally applies Semantic Trajectory Grounding to connect selected viewpoints into continuous, collision-aware, and temporally coherent camera motion. We further construct a dynamic 3D Story World Benchmark based on StoryBlender, covering 50 stories, 457 scenes, and 1585 shots with animated characters, semantic scene configurations, and executable 3D environments. Experiments show that our framework improves subject perception, intent consistency, and trajectory quality over representative baselines, demonstrating the importance of organizing visual attention before generating camera motion.

take · Perception as an action rather than a given: the camera decides what evidence to acquire before it moves. Separating a "Semantic Observation Contract" from motion execution is a sensible split for embodied 3D agents, and the paper is more interesting as a framing of active perception than as a story-world demo. I'd read it for how the contract is grounded in physical 3D constraints, not for the narrative wrapper.

abs · pdf · html · ar5iv