arXiv digest

2026-06-10 · 6 papers · curated by paper-scout

Diffusion-and-distillation heavy today, with a 3D thread running through it. The standout is Mean Flow Distillation — single-step flow-matching generation framed as suppressing high-frequency optimization noise, with 4D occupancy forecasting as the test case that matters for perception loops. Also worth a look: P3D-Bench, which scores parametric 3D *programs* instead of meshes and confirms MLLMs still miss exact geometry, and WorldOlympiad's Gaussian-splatting geometry track that turns "looks 3D" into a measurable number.

★ Mean Flow Distillation: Robust and Stable Distillation for Flow Matching Models

2606.11155

An Zhao, Shengyuan Zhang, Zhongjian Sun, Yixiang Zhou et al. · cs.CV

Flow Matching models perform well across generative tasks, but their ODE-based iterative sampling makes inference expensive and rules out real-time use. Existing distillation methods borrow from diffusion score matching, ignore the geometric structure of flows, and suffer from training instability, high variance, and quality loss. Mean Flow Distillation (MFD) is a distillation framework built for flow matching. The authors show that it acts as a temporal low-pass filter, suppressing the high-frequency optimization noise of variational score distillation (VSD) while keeping global trajectory consistency, and they prove a Mean Flow Matching Theorem: matching expected average velocities is sufficient for strict distribution alignment. On 4D occupancy forecasting and text-to-image generation, MFD reaches SOTA with high-fidelity single-step generation.

take · The framing I like: treat the distillation instability as a signal-processing problem and show VSD is just leaking high-frequency noise into the student. The averaged-velocity target is the kind of move that's obvious only after someone proves it's sufficient. 4D occupancy forecasting in one step is the result worth poking at — that's the regime where iterative sampling actually kills you in a perception loop.
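
sketch · To make the averaged-velocity idea concrete, here is a toy 1D illustration of why one step along the average velocity lands where many small ODE steps would. It assumes the usual definition of average velocity as displacement over the interval divided by its length; `velocity` is a made-up stand-in for a teacher's field, and none of this is the paper's training objective.

```python
def velocity(z, t):
    # Hypothetical instantaneous velocity field; a stand-in for a teacher model.
    return -z + t


def integrate(z_start, t_start, t_end, steps):
    """Euler-integrate dz/dt = velocity(z, t) from t_start to t_end."""
    z, t = z_start, t_start
    dt = (t_end - t_start) / steps
    for _ in range(steps):
        z += velocity(z, t) * dt
        t += dt
    return z


def average_velocity(z_start, t_start, t_end, steps=1000):
    """Displacement over the interval divided by the interval length."""
    z_end = integrate(z_start, t_start, t_end, steps)
    return (z_end - z_start) / (t_end - t_start)


z1 = 2.0                                    # sample at the noise end, t = 1
u = average_velocity(z1, 1.0, 0.0)          # what a student would try to predict
one_step = z1 + (0.0 - 1.0) * u             # single jump from t = 1 to t = 0
multi_step = integrate(z1, 1.0, 0.0, 1000)  # iterative sampler for comparison
```

Here the average velocity is computed by running the iterative sampler, so the match is by construction; the point of distillation is to have a student predict `u` directly so the iterative run is never needed at inference.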

abs · pdf · html · ar5iv

P3D-Bench: Benchmarking MLLMs for Parametric 3D Generation and Structural Reasoning

2606.11152

Yikang Yang, Zhanpeng Hu, Youtian Lin, Mengqi Zhou et al. · cs.CV

MLLMs can write code that drives 3D modeling, which opens a path to 3D generation that leans on their priors and reasoning. But most benchmarks score meshes, not programs. P3D-Bench scores parametric 3D programs, which expose explicit dimensions, construction operations, and part relations — revealing whether a model recovers a design's structure, not just its look. It covers Text-to-3D, Image-to-3D, and Assembly-3D, grading executability, geometric fidelity, topology, text-grounded constraints, multiview alignment, and part-level structure across 400 text, 400 image, and 203 annotated assembly cases. Findings: assemblies are hardest; models recover global shape and identity but miss precise parametric geometry; and part-level modeling stays weak.

take · This benchmarks the thing I actually care about in CAD/BIM work — can the model produce a parametric program with the right dimensions and part relations, not a pretty mesh that's geometrically wrong. The honest result is the useful one: frontier models get the silhouette and semantics but fluff the exact geometry and assembly structure. That gap is exactly where these pipelines break in practice.
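
sketch · A minimal illustration of why scoring programs is stricter than scoring meshes: once a design exposes named parameters, each one can be checked against the reference. The parameter names, the 1% tolerance, and the `dimension_errors` helper are all hypothetical; the benchmark's actual metrics are not spelled out in this digest.

```python
def dimension_errors(predicted, reference, rel_tol=0.01):
    """Return parameters that are missing or off by more than rel_tol."""
    errors = {}
    for name, ref in reference.items():
        pred = predicted.get(name)
        if pred is None or abs(pred - ref) > rel_tol * abs(ref):
            errors[name] = (pred, ref)
    return errors


# Hypothetical bracket: overall size is right, the hole is not.
reference = {"width_mm": 40.0, "height_mm": 25.0, "hole_diameter_mm": 6.0}
predicted = {"width_mm": 40.0, "height_mm": 25.0, "hole_diameter_mm": 8.0}
bad = dimension_errors(predicted, reference)
```

A rendered mesh of the predicted part would look nearly identical to the reference, while the parameter check flags the hole immediately.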

abs · pdf · html · ar5iv

WorldOlympiad: Can Your World Model Survive a Triathlon?

2606.11129

Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang et al. · cs.CV

WorldOlympiad diagnoses video-based world models along physical faithfulness, geometric consistency, and interaction fidelity, instead of the usual visual-quality and short-horizon temporal checks. The physical track uses object segmentation plus an MLLM judge to test mechanical, thermal, and material rules. The geometry track reconstructs generated videos with Gaussian splatting and scores structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track checks whether rollouts follow complex action prompts and stay coherent across consecutive chunks, spanning gaming, robotics, and real-world video. Experiments on SOTA models show large gaps in physical reasoning, 3D consistency, and long-horizon interaction.

take · The smart instrumentation is the geometry track — reconstruct the generated video with Gaussian splatting and measure whether the implied 3D is actually consistent across views. That turns "looks 3D" into a number. The takeaway is unsurprising but worth having stated: today's world models render well and reason about physics badly.

abs · pdf · html · ar5iv

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

2606.11180

Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee et al. · cs.CV

Diffusion lip-sync models have strong audio-visual alignment, but full bidirectional attention and many denoising steps make them too slow for real time. Lip Forcing distills a 14B audio-conditioned bidirectional video diffusion teacher into causal students that generate each chunk in two denoising steps with no inference-time classifier-free guidance (CFG). A trajectory analysis reveals a CFG fidelity-versus-sync tradeoff (no-CFG predictions favor reference fidelity, while CFG-guided ones favor sync within a mid-trajectory band), which the method turns into three components: Sync-Window DMD, a two-step schedule, and a SyncNet reward. The 1.3B student streams at 31 FPS, 17.6x faster than its bidirectional counterpart; the 14B student runs 39.8x faster than its teacher at comparable fidelity, with sub-millisecond time-to-first-frame.

take · The interesting engineering is converting a trajectory observation — CFG helps sync only inside a mid-trajectory band — into a scheduled, windowed distillation instead of a global knob. Two steps, no inference-time CFG, causal students: that's a real path from a 14B bidirectional teacher to something that streams. Sub-millisecond TTFF is the number that decides whether this is usable live.
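
sketch · One way to picture "windowed instead of global": apply the standard classifier-free guidance mix only when the timestep falls inside a band, and fall back to the plain conditional prediction elsewhere. The band `(0.3, 0.7)`, the scale `3.0`, and the function name are placeholders, not values from the paper, and how the band enters the distillation loss is not covered here.

```python
def guided_prediction(cond, uncond, t, scale=3.0, window=(0.3, 0.7)):
    """Mix conditional and unconditional predictions only inside the window.

    cond, uncond: per-element predictions as equal-length sequences.
    t: normalized trajectory position in [0, 1].
    """
    lo, hi = window
    if lo <= t <= hi:
        # Standard CFG combination: uncond + scale * (cond - uncond).
        return [u + scale * (c - u) for c, u in zip(cond, uncond)]
    # Outside the band, keep the unguided conditional prediction.
    return list(cond)


inside = guided_prediction([1.0, 2.0], [0.0, 1.0], t=0.5)
outside = guided_prediction([1.0, 2.0], [0.0, 1.0], t=0.9)
```

If guidance is only needed to build training targets, the student can absorb its effect during distillation, which is consistent with the summary's note that inference runs with no CFG.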

abs · pdf · html · ar5iv

Resilient Navigation for Autonomous Farm Robots by Leveraging Jerk-Augmented Models with IMU-Only Disturbance Rejection

2606.10971

Batu Candan, Mohammed Atallah, Simone Servadio, Saeed Arabi · cs.RO eess.SY

State estimation for off-road agricultural robots is degraded by sensor outages (GNSS/LiDAR/visual) and high-frequency vibration. This work uses a jerk-augmented Extended Kalman Filter paired with a Multiple Tuning Factor adaptation that adjusts the measurement covariance in real time instead of assuming constant measurement noise, letting the filter handle sudden disturbances and outliers. Evaluated on real-world data from a Salin247 robot, jerk-augmentation plus MTF adaptation cuts 3D position RMSE versus baseline EKF models and gives better dead-reckoning when sensors drop out.

take · Not flashy, and that's the point — when LiDAR and GNSS drop out, the thing that saves you is a well-tuned filter, not a bigger network. Adding jerk to the state and adapting the measurement covariance online is the pragmatic fix for vibration-heavy platforms. I'd want to see how the MTF gains were chosen before trusting it off the test field, but the dead-reckoning story is the right thing to optimize.
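
sketch · The two ingredients can be shown on one axis: a constant-jerk prediction step, and a measurement variance that inflates when the innovation looks like an outlier. The single scalar factor and the threshold of 3 are simplifications chosen for illustration; the paper's Multiple Tuning Factor scheme is not detailed in this digest and is not reproduced here.

```python
def predict_constant_jerk(state, dt):
    """Propagate (position, velocity, acceleration, jerk) assuming constant jerk."""
    p, v, a, j = state
    return (
        p + v * dt + 0.5 * a * dt**2 + j * dt**3 / 6.0,
        v + a * dt + 0.5 * j * dt**2,
        a + j * dt,
        j,
    )


def adapt_measurement_variance(r_nominal, innovation, innovation_variance,
                               threshold=3.0):
    """Inflate the measurement variance when the innovation is unexpectedly large.

    The normalized innovation is compared with a threshold (hypothetical
    value); beyond it, the variance grows quadratically so the filter trusts
    the measurement less.
    """
    normalized = abs(innovation) / innovation_variance ** 0.5
    if normalized <= threshold:
        return r_nominal
    return r_nominal * (normalized / threshold) ** 2


state = predict_constant_jerk((0.0, 1.0, 2.0, 6.0), dt=1.0)
r_quiet = adapt_measurement_variance(1.0, innovation=1.0, innovation_variance=1.0)
r_spike = adapt_measurement_variance(1.0, innovation=6.0, innovation_variance=1.0)
```

With sensors dropped, only the prediction step runs, so carrying jerk in the state is what lets dead reckoning follow changing acceleration instead of assuming it is constant.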

abs · pdf · html · ar5iv

A Distributed Multi-UGV Exploration Framework With Loop-Aware Planning and Descriptor-Aided Localization in Resource-Limited Environments

2606.11088

Zhiwei Li, Haiou Liu, Xijun Zhao, Ji Li et al. · cs.RO

Cooperative exploration with multiple ground robots in unknown, GPS-denied, bandwidth-limited environments is hard because localization drift breaks map consistency and causes redundant coverage. This framework couples descriptor-aided inter-robot loop closure with loop-aware hierarchical planning. A lightweight LiDAR global descriptor with range-image pre-alignment enables cross-robot place recognition under large yaw and lateral shifts, and verified loop closures maintain globally consistent trajectories over a sparse topological map. An uncertainty-aware loop-closure selection module scores candidates under pose uncertainty and keeps high-utility ones as planning anchors. The loop-closure module hits AR@1/AR@1% of 89.9%/95.5%, cuts trajectory error and two-way communication, and reduces exploration time and distance by 15% and 14% versus an mTSP baseline.

take · The detail that makes this real is the bandwidth angle — a LiDAR global descriptor compact enough to share for cross-robot place recognition, not raw scans. Folding loop closures into the planner as anchors rather than treating SLAM and planning as separate stages is the right coupling. 15% less exploration time is modest but honest for a fully distributed, GPS-denied setup.
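
sketch · For readers unfamiliar with the metric, this is how recall at k is typically computed for place recognition, assuming "AR" here means average recall: a query counts as a hit if any true match appears among its top-k retrieved candidates. The data layout and the round-up rule for the top 1% are assumptions, not the paper's evaluation code.

```python
def recall_at_k(rankings, true_matches, k):
    """Fraction of queries with at least one true match in their top-k."""
    hits = 0
    for query, ranked in rankings.items():
        if any(candidate in true_matches[query] for candidate in ranked[:k]):
            hits += 1
    return hits / len(rankings)


def top_one_percent(database_size):
    """k for the '@1%' variant: 1% of the database, rounded up, at least 1."""
    return max(1, (database_size + 99) // 100)


# Hypothetical retrieval results: candidates sorted by descriptor similarity.
rankings = {"q1": ["a", "b", "c"], "q2": ["c", "a", "b"]}
true_matches = {"q1": {"a"}, "q2": {"b"}}
r1 = recall_at_k(rankings, true_matches, k=1)
r3 = recall_at_k(rankings, true_matches, k=3)
```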

abs · pdf · html · ar5iv