Controlled Self-Evolution for Algorithmic Code Optimization
Abstract
Self-evolution methods enhance code generation through iterative "generate-verify-refine" cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.
🎯Research Motivation
• Self-evolution methods have low exploration efficiency under limited budgets, failing to discover solutions with superior time and space complexity despite functional correctness.
• Initialization bias from starting with one/few base-model solutions traps search in poor regions and causes premature convergence.
• Uncontrolled stochastic mutations/crossovers lack feedback guidance, leading to undirected exploration and wasted iterations.
• Insufficient experience reuse—no effective intra-task or cross-task memory—causes repeated failures and prevents leveraging proven optimization strategies.
• Practical deployment constraints (cost/latency) demand methods that achieve efficiency gains early and sustain improvement over generations.
🔧Research Method
CSE combines diversified planning initialization to produce structurally distinct starting solutions with feedback-controlled genetic evolution (soft parent selection, functional decomposition, targeted mutation, and compositional crossover). A hierarchical evolution memory captures and reuses success and failure at intra- and inter-task levels to guide future exploration.
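A minimal sketch of how such a feedback-guided evolution loop could look is given below. Every interface here (the LLM's generate_plans/implement/mutate_targeted/crossover helpers, the profiler, and the memory object) is an illustrative assumption rather than the authors' implementation.

```python
# Hypothetical CSE-style loop; helper objects (llm, memory, profile) are
# assumed interfaces, not the released code. Assumes an even pop_size.
import random

def controlled_self_evolution(task, llm, memory, profile, generations=10, pop_size=8):
    # Diversified Planning Initialization: start from structurally distinct strategies.
    population = [llm.implement(task, plan) for plan in llm.generate_plans(task, n=pop_size)]

    for _ in range(generations):
        # Feedback: correctness plus runtime/memory profiling for each candidate.
        scored = sorted(((c, profile(task, c)) for c in population),
                        key=lambda x: x[1].fitness, reverse=True)

        # Soft parent selection: fitness-weighted sampling rather than hard top-k.
        parents = random.choices([c for c, _ in scored],
                                 weights=[p.fitness for _, p in scored], k=pop_size)

        children = []
        for i in range(0, pop_size, 2):
            a, b = parents[i], parents[i + 1]
            hints = memory.retrieve(task)  # intra-/inter-task experience
            # Targeted mutation: edit only the component the profiler flags as slow.
            children.append(llm.mutate_targeted(a, profile(task, a).bottleneck, hints))
            # Compositional crossover: recombine efficient components of two parents.
            children.append(llm.crossover(a, b, hints))

        memory.update(task, scored)  # record both successes and failures
        population = [c for c, _ in scored[:2]] + children[:pop_size - 2]

    return max(population, key=lambda c: profile(task, c).fitness)
```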
💡Research Ideas
• Meta-Learned Planning Initialization for Efficiency-Oriented Code Evolution: Learn a planner that generates diverse, complexity-aware algorithmic sketches using cross-task signals to improve early-generation efficiency.
• Program-Analysis-Guided Credit Assignment for Targeted Mutation and Crossover: Integrate static/dynamic analysis and profiling to localize inefficient/faulty components, enabling precise component extraction and safer, more effective edits.
• Retrieval-Augmented Hierarchical Evolution Memory Across Tasks and Languages: Build a scalable memory with retrieval and template parameterization to transfer optimization patterns across domains, programming languages, and problem families.
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
Abstract
Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline that generates realistic, complex research tasks anchored in diverse user profiles and applies a two-stage filter (Task Qualification and Search Necessity) to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation module that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking module that autonomously extracts and verifies report statements via web search, even when citations are missing.
🎯Research Motivation
• Evaluating long, multi-source deep research reports is fundamentally different from QA and remains poorly standardized and scalable.
• Existing benchmarks rely on expert-driven task construction that is annotation-intensive and costly, limiting coverage and realism.
• Static, one-size-fits-all evaluation dimensions miss task-specific success criteria, leading to uninformative or inflated scores.
• Fact-checking pipelines typically verify only citation-linked claims, leaving uncited statements unexamined and facts unverifiable when citations are missing.
• Many constructed tasks do not truly require external retrieval or multi-source integration, allowing LLMs to answer from parametric knowledge and diluting benchmark difficulty.
🔧Research Method
DeepResearchEval introduces an automated persona-driven task construction pipeline with Task Qualification and Search Necessity filters to retain only realistic, retrieval-dependent, multi-source research tasks, and an agentic evaluation pipeline combining Adaptive Point-wise Quality Evaluation (task-specific dimensions, criteria, and weights) with Active Fact-Checking that extracts and verifies both cited and uncited statements via web search. Together, these components enable scalable creation of hard tasks and fine-grained, evidence-grounded assessment of deep research systems.
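As a rough illustration of the two-stage filter, the sketch below keeps a generated task only if it is judged realistic and cannot be resolved from parametric knowledge alone; the `llm` and `judge` interfaces and the prompts are assumptions, not the paper's actual pipeline.

```python
# Illustrative persona-driven task filter; all helpers and prompts are hypothetical.
def build_tasks(personas, llm, judge, n_per_persona=5):
    tasks = []
    for persona in personas:
        for draft in llm.generate_tasks(persona, n=n_per_persona):
            # Stage 1: Task Qualification - keep only realistic, well-posed research tasks.
            if not judge(f"Is this a realistic, well-posed deep research task?\n{draft}"):
                continue
            # Stage 2: Search Necessity - discard tasks an LLM already answers from
            # parametric knowledge, so every retained task needs external retrieval
            # and multi-source evidence integration.
            closed_book = llm.answer_without_search(draft)
            if judge(f"Does this closed-book answer fully resolve the task?\n{closed_book}"):
                continue
            tasks.append(draft)
    return tasks
```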
💡Research Ideas
• Cross-Lingual DeepResearchEval: Multilingual Task Construction and Agentic Fact-Checking for Global Deep Research Benchmarks
• Source Reliability-Aware Agentic Fact-Checking: Calibrating Scores by Evidence Quality, Recency, and Consensus
• Human Preference-Calibrated Task-Adaptive Evaluation: Aligning Dynamic Dimensions and Weights with Expert Judgments at Scale
MAXS: Meta-Adaptive Exploration with LLM Agents
Abstract
Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose MAXS (meta-adaptive exploration with LLM agents; code at https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework based on LLM agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.
🎯Research Motivation
• Locally myopic generation in CoT/ToT without lookahead, leading to poor decisions about whether and how to use tools.
• Trajectory instability in multi-tool reasoning, where small early errors compound into divergent paths.
• High computational cost of global simulation approaches (e.g., MCTS), making it hard to balance effectiveness and efficiency.
• Lack of test-time strategies that integrate tool execution with reasoning planning in a value-aware, stability-conscious manner.
🔧Research Method
MAXS is a meta-adaptive test-time framework that performs limited lookahead rollouts to estimate the advantage of tool usage, then selects actions using step-consistency variance and inter-step trend slopes. A trajectory convergence mechanism halts rollouts once path consistency is reached, balancing accuracy and token cost.
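A compact sketch of the value-aware step selection described above follows; the scoring weights and the agent's rollout/consistency helpers are assumptions made for illustration, not MAXS's actual implementation.

```python
# Illustrative MAXS-style selection: lookahead rollouts, advantage estimation,
# consistency variance, trend slope, and early halting on path consistency.
import statistics

def select_next_step(agent, state, candidate_steps, horizon=3, max_rollouts=4):
    best_step, best_score = None, float("-inf")
    for step in candidate_steps:
        rollouts = []
        for _ in range(max_rollouts):
            rollouts.append(agent.rollout(state, first_step=step, depth=horizon))
            # Trajectory convergence: stop spending compute once paths agree.
            if len(rollouts) > 1 and agent.paths_consistent(rollouts):
                break
        values = [agent.value(t) for t in rollouts]
        advantage = statistics.mean(values) - agent.value_baseline(state)
        variance = statistics.pvariance(values) if len(values) > 1 else 0.0
        slope = agent.value_trend(rollouts[0])  # inter-step value trend within a rollout
        score = advantage - 0.5 * variance + 0.2 * slope  # illustrative weights
        if score > best_score:
            best_step, best_score = step, score
    return best_step
```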
💡Research Ideas
• Learning-to-Lookahead: End-to-End Training of Meta-Adaptive Rollout Policies for LLM Agents: Train a value/model-based policy that learns when and how far to roll out, replacing heuristic lookahead with learned estimators.
• Uncertainty-Aware Advantage Estimation for Tool Use in LLM Agents: Integrate calibrated uncertainty (e.g., Bayesian variance or ensembles) into advantage and consistency scoring to improve robustness under noisy tool outputs.
• Hierarchical MAXS: Multi-Scale Lookahead for Long-Horizon Mathematical Reasoning: Extend MAXS with coarse-to-fine planning layers that apply lookahead at multiple temporal scales for complex, long-horizon tasks.
A^3-Bench: Benchmarking Memory-Driven Scientific Reasoning via Anchor and Attractor Activation
Abstract
Scientific reasoning relies not only on logical inference but also on activating prior knowledge and experiential structures. Memory can efficiently reuse knowledge and enhance reasoning consistency and stability. However, existing benchmarks mainly evaluate final answers or step-by-step coherence, overlooking the memory-driven mechanisms that underlie human reasoning, which involves activating anchors and attractors, then integrating them into multi-step inference. To address this gap, we propose A^3-Bench (https://a3-bench.github.io), a benchmark designed to evaluate scientific reasoning through dual-scale memory-driven activation, grounded in Anchor and Attractor Activation. First, we annotate 2,198 science reasoning problems across domains using the SAPM process (subject, anchor & attractor, problem, and memory developing). Second, we introduce a dual-scale memory evaluation framework utilizing anchors and attractors, along with the AAUI (Anchor–Attractor Utilization Index) metric to measure memory activation rates. Finally, through experiments with various base models and paradigms, we validate A^3-Bench and analyze how memory activation impacts reasoning performance, providing insights into memory-driven scientific reasoning.
🎯Research Motivation
• Existing benchmarks focus on final answers or chain-of-thought coherence and neglect memory activation mechanisms, making it impossible to diagnose whether failures stem from faulty inference or inadequate retrieval/activation of prior knowledge.
• Scientific reasoning requires context-dependent activation of hierarchical memory structures—anchors (foundational units) and attractors (solution schemas)—yet datasets and evaluations rarely represent or test these dual-scale signals.
• There is no quantitative, standardized metric to measure memory utilization during reasoning; current methods do not evaluate how well models activate and use the right knowledge/templates across multi-step inference.
🔧Research Method
A3-Bench introduces a 2,198-problem benchmark annotated with dual-scale memory signals (anchors and attractors) via the SAPM process and evaluates models using AAUI, a metric quantifying activation of expert-labeled memory. The paper instantiates activation with a HybridRAG pipeline (twin-needle dense+graph retrieval plus context fabric composer) to condition LLMs on activated anchors/attractors and measure gains across no/full/gold memory paradigms.
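The sketch below shows one simple way an AAUI-style utilization score could be computed; the paper's exact definition may differ, and plain substring matching is used here only as a stand-in for detecting that an annotated anchor or attractor was activated in the reasoning trace.

```python
# Hedged AAUI-style score: mean fraction of gold anchors and attractors that a
# reasoning trace activates (substring matching is a crude proxy for activation).
def aaui(trace: str, anchors: list[str], attractors: list[str]) -> float:
    def hit_rate(units):
        if not units:
            return 1.0
        return sum(u.lower() in trace.lower() for u in units) / len(units)
    # Dual-scale: average the anchor-level and attractor-level activation rates.
    return 0.5 * (hit_rate(anchors) + hit_rate(attractors))

# Example: 2 of 3 anchors and 1 of 1 attractor activated -> roughly 0.83.
score = aaui("Apply conservation of energy and kinematics, then the work-energy theorem.",
             anchors=["conservation of energy", "kinematics", "momentum"],
             attractors=["work-energy theorem"])
```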
💡Research Ideas
• AAUI-Guided Training for Memory-Driven Reasoning: Integrate AAUI into training objectives to learn policies that reliably co-activate anchors and attractors and improve multi-step accuracy.
• Automatic Anchor–Attractor Induction from Corpora: Discover and validate anchors and attractors via representation learning and causal template extraction to scale memory libraries beyond manual annotation.
• Multimodal A3-Bench: Extending Anchor–Attractor Activation to Vision and Data: Build cross-modal anchors/attractors and assess memory-driven reasoning on multimodal science tasks (plots, diagrams, experiments).
• Causal Evaluation of Memory Activation in LLMs: Use interventional tests and ablations of memory units to establish causal links between activation, reasoning fidelity, and latency.
• Efficient Context Weaving via Compressed Memory Representations: Develop compression and selection strategies for anchors/attractors to reduce token cost and inference time while preserving activation fidelity.
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
Abstract
In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.
🎯Research Motivation
• Inadequate coverage of the teacher’s sequence-level output distribution during SFT, leading to poor mode coverage and suboptimal knowledge transfer.
• Misalignment between teacher outputs and student learning capacity; SFT can produce misleading gradients that amplify overconfident student errors instead of aligning with teacher preferences.
• Pronounced exposure bias from teacher-forced training versus autoregressive inference, causing distribution shift, error accumulation, and length/trajectory mismatches at test time.
• Practical limitations of logit-based/on-policy distillation (tokenizer mismatch, proprietary logit access), and overreliance on heuristic data filtering without explicit teacher–student interaction.
• Need for data-efficient distillation that achieves strong reasoning with far fewer samples than prevailing large-scale open-source efforts.
🔧Research Method
Distribution-Aligned Sequence Distillation (DASD) combines temperature-scheduled learning (low-to-high temperature sampling to balance learnability and diversity), divergence-aware sampling (prioritize sentences where teacher probability significantly exceeds student’s to avoid misleading gradients), and mixed-policy distillation (student prefixes with teacher continuations) to mitigate exposure bias—achieving SOTA 4B reasoning with just 448K samples.
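The sketch below illustrates how the three ingredients could fit together in a data-construction step; the sentence splitting, margin threshold, and teacher/student interfaces are assumptions for exposition only.

```python
# Illustrative DASD-style batch construction; all model interfaces are hypothetical.
def build_distillation_batch(teacher, student, prompts, margin=0.1, prefix_frac=0.3):
    batch = []
    for prompt in prompts:
        # Temperature schedule: sample at low temperature early in training (easy,
        # high-probability modes) and at higher temperature later (diversity).
        response = teacher.sample(prompt, temperature=teacher.current_temperature())
        for sent in response.split(". "):  # crude sentence split, for illustration
            # Divergence-aware selection: keep sentences where the teacher is clearly
            # more confident than the student, avoiding gradients that reinforce the
            # student's own overconfident errors.
            if teacher.logprob(prompt, sent) - student.logprob(prompt, sent) > margin:
                batch.append((prompt, sent))
        # Mixed-policy distillation: the student writes a prefix and the teacher
        # continues it, so training inputs resemble the student's own test-time
        # distribution and exposure bias is reduced.
        prefix = student.generate(prompt, max_frac=prefix_frac)
        batch.append((prompt, prefix + teacher.continue_from(prompt, prefix)))
    return batch
```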
💡Research Ideas
• Sequence-Probability Reweighting for Tokenizer-Agnostic Distillation: Use teacher sequence-level probabilities to reweight SFT losses and better approximate the target distribution, improving fidelity and data efficiency across heterogeneous tokenizers.
• Adaptive Mixed-Policy Distillation via Error-Aware Prefix Selection: Develop an online policy that detects high-risk student prefixes and dynamically tunes teacher intervention and on-/off-policy mixing to reduce exposure bias and verbosity.
• Tool- and Retrieval-Augmented Distribution-Aligned Distillation: Integrate external knowledge retrieval and tool use into DASD to train compact models that generalize to real-world, multi-step reasoning tasks.
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Abstract
Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided trajectory-alignment objective that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
🎯Research Motivation
• Existing VLA models trained on robot demonstrations excel at routine skills but struggle with long-horizon planning, failure recovery, and adaptation due to limited data coverage.
• Explicit chain-of-thought reasoning improves generalization but incurs substantial inference latency, incompatible with real-time embodied control (1–15 Hz) and potentially unsafe.
• Supervised CoT requires extensive reasoning annotations and may not capture essential spatial-temporal dynamics; RL-based textual CoT (e.g., ThinkAct) remains long and slow.
• Naive acceleration (e.g., reasoning dropout) risks losing critical information, causing inconsistent planning and degraded action quality.
• There is a need to compress multimodal reasoning into compact representations that preserve visual-spatial planning and bridge high-level reasoning to low-level action execution.
🔧Research Method
Fast-ThinkAct distills explicit textual reasoning into compact continuous latent tokens via preference-guided teacher–student alignment with manipulation trajectory (visual) alignment, yielding verbalizable latent thoughts. A reasoning-enhanced policy then conditions on these latents to connect high-level multimodal planning to efficient action execution with drastically reduced inference latency.
💡Research Ideas
• Adaptive Latent CoT for Multi-Agent Embodied Collaboration: Extend verbalizable latent planning to coordinated multi-robot settings with shared trajectory alignment and on-the-fly communication.
• Safety-Aware Verbalizable Latent Planning for Autonomous Driving: Integrate risk-sensitive rewards and formal safety constraints into latent reasoning to meet real-time and safety-critical demands.
• Self-Supervised Latent Reasoning from On-Policy Experience: Reduce teacher dependence by learning and refining latent thoughts directly from interaction data using preference signals and bootstrapped verbalizers.
SkinFlow: Efficient Information Transmission for Open Dermatological Diagnosis via Dynamic Visual Encoding and Staged RL
Abstract
General-purpose Large Vision-Language Models (LVLMs), despite their massive scale, often falter in dermatology due to "diffuse attention" - the inability to disentangle subtle pathological lesions from background noise. In this paper, we challenge the assumption that parameter scaling is the only path to medical precision. We introduce SkinFlow, a framework that treats diagnosis as an optimization of visual information transmission efficiency. Our approach utilizes a Virtual-Width Dynamic Vision Encoder (DVE) to "unfold" complex pathological manifolds without physical parameter expansion, coupled with a two-stage Reinforcement Learning strategy. This strategy sequentially aligns explicit medical descriptions (Stage I) and reconstructs implicit diagnostic textures (Stage II) within a constrained semantic space. Furthermore, we propose a clinically grounded evaluation protocol that prioritizes diagnostic safety and hierarchical relevance over rigid label matching. Empirical results are compelling: our 7B model establishes a new state-of-the-art on the Fitzpatrick17k benchmark, achieving a +12.06% gain in Top-1 accuracy and a +28.57% boost in Top-6 accuracy over the massive general-purpose models (e.g., Qwen3VL-235B and GPT-5.2). These findings demonstrate that optimizing geometric capacity and information flow yields superior diagnostic reasoning compared to raw parameter scaling.
🎯Research Motivation
• General LVLMs show diffuse attention, failing to separate subtle lesions from background and thus transmit diagnostic visual information inefficiently.
• Geometric capacity bottleneck in vision encoders (small, static linear layers vs. large LLM backbones) causes capacity collapse for fine-grained dermatology features.
• Conventional exact-match metrics are clinically misaligned, ignoring hierarchical proximity, safety, and therapeutic consistency.
• Supervised fine-tuning struggles with synonymy/open-vocabulary labels, overfits narrow distributions, and is inefficient for learning top-K rankings.
• Scaling parameters is computationally prohibitive and often ineffective; the field needs parameter-efficient methods that maximize recoverable information.
🔧Research Method
SkinFlow combines a Virtual-Width Dynamic Visual Encoder (FDLinear) that ‘unfolds’ lesion manifolds without physical width expansion with a two-stage GRPO-based RL: Stage I aligns explicit features via structured medical captioning, and Stage II decodes implicit textures to optimize top-K diagnosis under a clinically grounded, hierarchy-aware evaluation.
💡Research Ideas
• Interpretable SkinFlow: Clinician-Centered Attribution and Explanation Metrics for Staged RL Dermatology Models: Develop standardized interpretability metrics, richer evidence generation, and lesion-level attribution validated by user studies.
• Multimodal SkinFlow: Integrating Dermoscopy, Histopathology, and Clinical Text via Cross-Modal DVE and Hierarchical RL: Extend DVE and staged RL to fuse multiple medical modalities for stronger generalization and robustness.
• Safety-Aware Reward Shaping for Open-World Dermatological Diagnosis: Formalize safety-critical rewards, uncertainty calibration, and abstention strategies to minimize boundary-crossing (e.g., benign vs. malignant) errors.
OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
Abstract
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel is able to produce meaningful groups that describe different objects in the scene. Also, by leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) or referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly perform text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be open-sourced.
🎯Research Motivation
• Existing 3D open-vocabulary methods require per-scene training and language-embedding alignment (e.g., CLIP/BERT/DINO), which is time-consuming (>1 hour per scene) and often needs human-annotated description–mask pairs, limiting scalability.
• Embedding-based approaches are biased toward short category tags and struggle with complex referring expressions, nuanced attributes, and arbitrary phrasing, reducing accuracy and practicality for RES.
• Current pipelines optimize high-dimensional per-primitive features via gradient descent, hindering efficiency and view-consistent instance grouping in 3D representations like sparse voxels.
• There is a need for interpretable, training-free, and flexible text-to-text reasoning that can robustly support both OVS and RES while providing fast, scene-wide object grouping and captions.
🔧Research Method
OpenVoxel is a training-free pipeline that groups SVR voxels into object instances using SAM2-driven 2D-to-3D centroid voting across views, then constructs a canonical scene map by mask-conditioned captioning with DAM and MLLMs, and finally answers OVS/RES queries via MLLM text-to-text retrieval over the stored captions and 3D centers.
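One plausible reading of the centroid-voting step is sketched below: per-view instance masks are back-projected to 3D, and each voxel cell is assigned to the instance that accumulates the most votes across views. The data layout and the `backproject` helper are assumptions, not the authors' code.

```python
# Hedged sketch of 2D-to-3D centroid voting; masks_per_view maps a view id to
# {instance id -> binary mask}, and backproject(view, mask) returns (N, 3) points.
import numpy as np

def vote_voxel_groups(masks_per_view, backproject, cell_size=0.05):
    votes = {}  # voxel cell -> {instance id -> vote count}
    for view, masks in masks_per_view.items():
        for inst_id, mask in masks.items():
            pts = backproject(view, mask)
            cell = tuple(np.floor(pts.mean(axis=0) / cell_size).astype(int))
            votes.setdefault(cell, {})
            votes[cell][inst_id] = votes[cell].get(inst_id, 0) + 1
    # Each cell is assigned to the instance with the most votes across views,
    # giving view-consistent object groups without any training.
    return {cell: max(counts, key=counts.get) for cell, counts in votes.items()}
```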
💡Research Ideas
• OpenVoxel-Rel: Structured Scene Graphs for Training-Free Relational Referring Segmentation: Augment the scene map with explicit inter-object relations and graph-based reasoning to improve queries involving spatial and functional relationships (e.g., “left of the apple,” “next to the mug”).
• Dyn-OpenVoxel: Training-Free Open-Vocabulary Understanding in Dynamic 4D Scenes: Extend grouping and captioning to time-varying scenes with motion-aware aggregation and temporal consistency to handle moving objects and long-term RES.
• Mask-LLM: Fine-Tuning Multimodal LLMs for Mask-Conditioned 3D Instance Captioning: Train or adapt MLLMs on masked inputs to enhance canonical caption quality, reduce ambiguity (e.g., replacing “object” with precise nouns), and improve retrieval robustness.
OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG
Abstract
Large language models (LLMs) have achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and on the capacity of the LLM's internal information processing mechanisms to incorporate it in answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is important to take the relevance of the retrieved information into account during answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. The experimental results on five benchmark datasets demonstrate the effectiveness and better robustness of OpenDecoder by outperforming various baseline methods. Importantly, this paradigm can be flexibly integrated into the post-training of LLMs for any purpose and can incorporate any type of external indicator.
🎯Research Motivation
• RAG answer quality is highly sensitive to the usefulness and relevance of retrieved documents, which often vary and can include noise.
• Current LLM decoding relies solely on internal attention mechanisms and implicitly assumes relevance, causing degraded performance when input context is noisy.
• Workflow-based RAG methods (judge/filter/reason steps) are prompt-sensitive, error-prone, and add latency; fine-tuning approaches still ignore explicit relevance signals and treat documents nearly equally, lacking robustness.
• Existing systems do not explicitly incorporate document quality indicators (e.g., retriever scores) into decoding, missing a direct way to bias generation toward truly relevant content.
🔧Research Method
OpenDecoder injects explicit document-quality indicators (retriever relevance, LLM ranker score, QPP score) into the LLM’s attention computation by normalizing them into a token-level score matrix that modulates attention and reshapes generation probabilities. It further applies robustness training by replacing top-k context with documents of varied relevance and supports aggregating multiple indicators, making decoding more resilient to noisy inputs.
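A minimal sketch of the core idea, indicator scores biasing attention logits toward higher-quality documents, is given below; it assumes one scalar score per retrieved document broadcast over its tokens and is not OpenDecoder's exact formulation.

```python
# Illustrative indicator-modulated attention; shapes and normalization are assumptions.
import torch
import torch.nn.functional as F

def indicator_biased_attention(q, k, v, doc_scores, token_to_doc):
    # q, k, v: (seq, dim) tensors; doc_scores: (num_docs,) float tensor of quality
    # indicators (retriever relevance, ranker score, QPP); token_to_doc: (seq,)
    # long tensor mapping each key position to its document id, -1 for non-doc tokens.
    logits = (q @ k.T) * q.shape[-1] ** -0.5             # (seq, seq)
    bias = torch.zeros(k.shape[0])
    valid = token_to_doc >= 0
    # Normalize document scores and spread them onto the corresponding key positions.
    bias[valid] = torch.log_softmax(doc_scores, dim=0)[token_to_doc[valid]]
    logits = logits + bias.unsqueeze(0)                  # favor high-quality documents
    return F.softmax(logits, dim=-1) @ v
```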
💡Research Ideas
• Indicator-Aware Joint Training for RAG: End-to-end optimization of retriever and decoder to align retrieval scores with attention modulation, improving effectiveness and robustness under noise.
• Adaptive Multi-Indicator Fusion in Decoding: Learnable, query-specific weighting and aggregation of heterogeneous signals (relevance, ranker, QPP, trustworthiness) to optimally guide attention.
• Provable Robustness of Indicator-Modulated Attention: Formal analysis and empirical validation of stability bounds and failure modes against adversarial retrieval corruption and long-context settings.
ExpSeek: Self-Triggered Experience Seeking for Web Agents
Abstract
Experience intervention in web agents has emerged as a promising technical paradigm, enhancing agent interaction capabilities by providing valuable insights from accumulated experiences. However, existing methods predominantly inject experience passively as global context before task execution, struggling to adapt to dynamically changing contextual observations during agent-environment interaction. We propose ExpSeek, which shifts experience toward step-level proactive seeking: (1) estimating step-level entropy thresholds to determine intervention timing using the model's intrinsic signals; (2) designing tailored, step-level experience content. Experiments on Qwen3-8B and 32B models across four challenging web agent benchmarks demonstrate that ExpSeek achieves absolute improvements of 9.3% and 7.5%, respectively. Our experiments validate the feasibility and advantages of entropy as a self-triggering signal and reveal that even a small 4B experience model can significantly boost the performance of larger agent models.
🎯Research Motivation
• Passive global experience injection fails to adapt to dynamic, step-level observations during web interactions.
• The open web is noisy and partially observable; smaller, cost-effective LLM agents tend to explore inefficiently or answer prematurely, leading to unreliable outcomes.
• Lack of a principled, low-cost intrinsic signal to trigger when to seek experience; reward-model-based per-step analysis is costly, and static experience content is not tailored to current error patterns or timing.
🔧Research Method
ExpSeek enables proactive, step-level experience seeking by using the model’s own step entropy as a self-trigger, estimating intervention thresholds via logistic regression with bootstrap separately for process and answer steps. It builds a topic-organized experience base of behavior–mistake–guidance triplets from paired success/failure trajectories and uses an experience model to retrieve and generate step-tailored guidance conditioned on the current interaction history.
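A bare-bones version of the entropy self-trigger is sketched below; the threshold values would come from the logistic-regression fitting described above, and the agent/experience-model interfaces are assumptions.

```python
# Illustrative entropy-triggered experience seeking; helper interfaces are assumed.
import math

def step_entropy(token_probs):
    return -sum(p * math.log(p) for p in token_probs if p > 0)

def act_with_expseek(agent, history, next_token_probs, thresholds):
    # Separate thresholds for intermediate (process) steps and final-answer steps.
    kind = "answer" if agent.is_answer_step(history) else "process"
    if step_entropy(next_token_probs) > thresholds[kind]:
        # High uncertainty: retrieve behavior-mistake-guidance triplets and let the
        # experience model tailor guidance to the current interaction history.
        guidance = agent.experience_model.generate(history)
        return agent.act(history, extra_context=guidance)
    return agent.act(history)
```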
💡Research Ideas
• Entropy-Guided Multi-Signal Triggers for Web Agents: Combine entropy with complementary signals (logit margins, calibration scores, tool feedback) to learn robust, adaptive step-level triggering policies.
• Continual Experience Base Evolution with Online Verification: Automatically curate and update triplets from live trajectories using verification, redundancy pruning, and topic reorganization to keep guidance relevant and precise.
• Meta-RL for Learning Step-Level Intervention Thresholds: Use meta-reinforcement learning to optimize task- and agent-specific trigger thresholds that balance exploration and convergence across diverse web domains.
FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection
Abstract
Vision-Language Models (VLMs) have shown remarkable performance in User Interface (UI) grounding tasks, driven by their ability to process increasingly high-resolution screenshots. However, screenshots are tokenized into thousands of visual tokens (e.g., about 4700 for 2K resolution), incurring significant computational overhead and diluting attention. In contrast, humans typically focus on regions of interest when interacting with UI. In this work, we pioneer the task of efficient UI grounding. Guided by practical analysis of the task's characteristics and challenges, we propose FocusUI, an efficient UI grounding framework that selects patches most relevant to the instruction while preserving positional continuity for precise grounding. FocusUI addresses two key challenges: (1) Eliminating redundant tokens in visual encoding. We construct patch-level supervision by fusing an instruction-conditioned score with a rule-based UI-graph score that down-weights large homogeneous regions to select distinct and instruction-relevant visual tokens. (2) Preserving positional continuity during visual token selection. We find that general visual token pruning methods suffer from severe accuracy degradation on UI grounding tasks due to broken positional information. We introduce a novel PosPad strategy, which compresses each contiguous sequence of dropped visual tokens into a single special marker placed at the sequence's last index to preserve positional continuity. Comprehensive experiments on four grounding benchmarks demonstrate that FocusUI surpasses GUI-specific baselines. On the ScreenSpot-Pro benchmark, FocusUI-7B achieves a performance improvement of 3.7% over GUI-Actor-7B. Even with only 30% visual token retention, FocusUI-7B drops by only 3.2% while achieving up to 1.44x faster inference and 17% lower peak GPU memory.
🎯Research Motivation
• High-resolution UI screenshots produce thousands of visual tokens, creating extreme token skew that drives up computation/memory and dilutes attention.
• Precise UI grounding is highly position-sensitive; naïve visual token pruning breaks positional continuity (e.g., in M-RoPE), causing sharp accuracy drops.
• Existing pruning methods for natural images are not instruction-aware and overlook UI structure, failing to suppress large homogeneous panes while keeping fine-grained widgets.
• Current GUI grounding models achieve accuracy but lack efficiency-focused mechanisms to retain performance under aggressive visual token reduction.
🔧Research Method
FocusUI selects instruction-relevant visual tokens via a lightweight Query-Guided Saliency Scorer trained with fused supervision that combines instruction-conditioned bbox-overlap and a rule-based UI-graph prior, then applies PosPad to compress each contiguous span of dropped tokens into a single learned marker at the span's last index to preserve positional continuity, enabling accurate yet faster UI grounding.
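The PosPad idea can be illustrated with a toy example: every contiguous run of dropped visual tokens is collapsed into a single marker placed at the run's last index, so the retained tokens keep their original positions. The function below is purely illustrative and not the paper's implementation.

```python
# Toy PosPad-style span compression over a retention mask; illustrative only.
def pospad(keep_mask, pad_marker="<POSPAD>"):
    # keep_mask[i] is True if visual token i is retained.
    output, run_end = [], None
    for i, keep in enumerate(keep_mask):
        if keep:
            if run_end is not None:
                output.append((run_end, pad_marker))  # one marker per dropped span
                run_end = None
            output.append((i, f"tok_{i}"))
        else:
            run_end = i  # extend the current dropped span
    if run_end is not None:
        output.append((run_end, pad_marker))
    return output

# Example: tokens 0-5 with tokens 1-3 dropped ->
# [(0, 'tok_0'), (3, '<POSPAD>'), (4, 'tok_4'), (5, 'tok_5')]
print(pospad([True, False, False, False, True, True]))
```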
💡Research Ideas
• Beyond PosPad: Learnable Position-Aware Compression for Multimodal Sequences: Generalize PosPad by learning compact surrogate tokens that encode both positional continuity and summarized content of dropped spans.
• Adaptive Instruction-Conditioned Token Retention for UI Grounding: Dynamically set per-query retention ratios using uncertainty/complexity estimators to balance accuracy and efficiency on the fly.
• Temporal FocusUI: Position-Preserving Token Selection for Video-based GUI Interactions: Extend FocusUI to multi-step UI sequences, preserving both spatial and temporal continuity for dynamic UI grounding and navigation.
EvoFSM: Controllable Self-Evolution for Deep Research with Finite State Machines
Abstract
While LLM-based agents have shown promise for deep research, most existing approaches rely on fixed workflows that struggle to adapt to real-world, open-ended queries. Recent work therefore explores self-evolution by allowing agents to rewrite their own code or prompts to improve problem-solving ability, but unconstrained optimization often triggers instability, hallucinations, and instruction drift. We propose EvoFSM, a structured self-evolving framework that achieves both adaptability and control by evolving an explicit Finite State Machine (FSM) instead of relying on free-form rewriting. EvoFSM decouples the optimization space into macroscopic Flow (state-transition logic) and microscopic Skill (state-specific behaviors), enabling targeted improvements under clear behavioral boundaries. Guided by a critic mechanism, EvoFSM refines the FSM through a small set of constrained operations, and further incorporates a self-evolving memory that distills successful trajectories as reusable priors and failure patterns as constraints for future queries. Extensive evaluations on five multi-hop QA benchmarks demonstrate the effectiveness of EvoFSM. In particular, EvoFSM reaches 58.0% accuracy on the DeepSearch benchmark. Additional results on interactive decision-making tasks further validate its generalization.
🎯Research Motivation
• Fixed, static agent workflows cannot adapt to open-ended, real-world research queries that require dynamic, multi-hop reasoning paths.
• Unconstrained self-evolution (free-form prompt/code/tool rewriting) causes instability, hallucinations, instruction drift, and corruption of working modules.
• Lack of explicit structural guardrails: existing methods don’t separate macroscopic workflow planning from microscopic execution skills, making evolution opaque and hard to control.
• Poor experience accumulation: agents rarely distill successful trajectories or constrain against past failure modes, limiting continual improvement.
• Iterative retrieve–reason loops are prone to inefficient looping and missing verification steps without explicit transition logic and validation states.
🔧Research Method
EvoFSM models the research process as an explicit finite state machine that decouples macroscopic Flow (state-transition logic) from microscopic Skill (state-specific behaviors), and evolves only via critic-guided, constrained atomic operations (e.g., ADD_STATE, MODIFY_TRANSITION, REVISE_INSTRUCTION). A self-evolving memory retrieves successful priors and failure constraints to initialize and steer per-query FSM refinement, enabling adaptive yet controlled evolution.
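A small sketch of what a constrained FSM edit interface might look like is shown below; the operation names follow the summary, while the data layout and example states are assumptions.

```python
# Hedged sketch of an FSM with a whitelisted edit interface; illustrative only.
from dataclasses import dataclass, field

@dataclass
class ResearchFSM:
    states: dict = field(default_factory=dict)       # name -> instruction (Skill)
    transitions: dict = field(default_factory=dict)  # (state, condition) -> next state (Flow)

    def apply(self, op: str, **kw):
        # Evolution is restricted to a small set of atomic operations, so a critic
        # can audit every change and free-form rewriting is ruled out.
        if op == "ADD_STATE":
            self.states[kw["name"]] = kw["instruction"]
        elif op == "MODIFY_TRANSITION":
            self.transitions[(kw["src"], kw["condition"])] = kw["dst"]
        elif op == "REVISE_INSTRUCTION":
            self.states[kw["name"]] = kw["instruction"]
        else:
            raise ValueError(f"Operation {op!r} is outside the allowed edit space")

fsm = ResearchFSM()
fsm.apply("ADD_STATE", name="search", instruction="Issue a web query for the open sub-question.")
fsm.apply("ADD_STATE", name="verify", instruction="Cross-check the candidate answer against sources.")
fsm.apply("MODIFY_TRANSITION", src="search", condition="evidence_found", dst="verify")
```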
💡Research Ideas
• Verification-Guided Critics for Reliable Self-Evolution: Integrate external verifiers, tool-grounded checks, and formal consistency tests to make the critic robust and reduce hallucination-driven mis-evolution.
• Scalable Lifelong Memory for Structured Agent Evolution: Develop consolidation, abstraction, and pruning mechanisms for the experience pool to maintain retrieval efficiency and avoid outdated or redundant strategies.
• FSM-Aware Policy Distillation for Compact Research Agents: Distill the flow–skill decomposition and atomic-operation policies into smaller specialized models via supervised/RL training, improving efficiency while preserving controllability.
Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
Abstract
Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs that are perceived as helpful and interaction-friendly. However, this preference-oriented objective can be exploited: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit the model's desire to please user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides a finer-grained and more directive analysis than aggregate benchmark scores, using a factorial evaluation framework to decompose prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled 2 × 2^4 design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions with PUA-style factors, suggesting tailored defenses rather than uniform robustness. These findings offer a novel, reproducible factorial evaluation methodology that provides finer-grained diagnostics for post-training processes like RLHF, enabling better trade-offs in the product iteration of LLMs by offering a more nuanced understanding of preference alignment risks and the impact of manipulative prompts.
🎯Research Motivation
• Preference-aligned LLMs can be steered by manipulative prompt styles (PUA) to favor user-pleasing agreement over truth, degrading factuality on benign, verifiable tasks.
• Existing evaluations largely report aggregate scores and lack fine-grained, interpretable diagnostics that attribute behavior shifts to system objectives and specific prompt-style factors.
• There is no controlled factorial methodology jointly parameterizing truth vs appeasement objectives and multidimensional PUA cues to quantify main and interaction effects, hindering tailored defenses and product iteration.
🔧Research Method
A reproducible 2×2^4 factorial prompt framework toggles the system objective (truth-oriented vs appeasement-oriented) and four PUA-style factors (directive control, personal derogation, conditional approval, reality denial), measuring deference (LLM-as-judge compliance with a wrong hint) and factuality (MMLU/CMMLU accuracy). Effects are estimated via logistic factorial regression with contrast coding and item-clustered robust standard errors to yield interpretable main and interaction profiles across models.
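Below is a hedged sketch of how such a factorial analysis could be run with statsmodels; the column names, the contrast coding, and the hypothetical pua_trials.csv file are assumptions, and the paper's exact model specification may differ.

```python
# Hedged sketch of a factorial logistic regression with item-clustered robust SEs.
# Assumed columns: correct (0/1), objective and four PUA factors coded 0/1, item ids.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("pua_trials.csv")  # hypothetical per-trial outcome file

# Contrast-code factors (-0.5 / +0.5) so main effects stay interpretable.
for col in ["objective", "directive", "derogation", "approval", "denial"]:
    df[col] = df[col].map({0: -0.5, 1: 0.5})

model = smf.logit(
    "correct ~ objective * (directive + derogation + approval + denial)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["item"]})
print(model.summary())
```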
💡Research Ideas
• Factorial Diagnostics for Open-Ended Assistant Behavior: Extend the framework to open-ended tasks using rubric-based or pairwise judgments to robustly quantify deference and factuality shifts under PUA.
• Learning to Resist Preference-Undermining Attacks via Reward Shaping and Objective Control: Design tailored defenses (e.g., reward models penalizing agreement with wrong hints, objective-switching policies) and evaluate their impact on factor-level susceptibility.
• Adversarial PUA Red Teaming for Robust Alignment: Integrate generative PUA attackers into RLHF/DPO pipelines to adversarially train models against manipulative styles and measure robustness gains.
TranslateGemma Technical Report
Abstract
We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.
🎯Research Motivation
• The field lacks strong open machine translation models that support transparency, reproducibility, and community-driven innovation; many existing systems are closed or general-purpose LLMs not optimized for MT.
• Multilingual LLMs like Gemma 3 are potent but not specifically tuned for translation quality and efficiency, leading to suboptimal performance compared to specialized MT systems.
• High-quality parallel data is scarce across many language pairs (especially low-resource), and existing synthetic data pipelines often lack careful curation to maintain quality.
• Training approaches seldom optimize directly for translation quality using learned reward signals; limited use of RL tailored to MT hampers systematic improvements.
• Specialized MT fine-tuning can degrade multimodal capabilities; there is a need to enhance text translation while retaining or improving image translation performance.
🔧Research Method
TranslateGemma uses a two-stage fine-tuning pipeline: supervised fine-tuning on a curated mix of high-quality human-translated and Gemini-generated synthetic parallel data, followed by reinforcement learning that optimizes translations using an ensemble of reward models (e.g., MetricX-QE, AutoMQM). This approach yields consistent gains over Gemma 3 across many language pairs and model sizes while preserving multimodal capabilities.
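As a tiny illustration of the RL stage's reward ensembling, the helper below combines per-metric scores with fixed weights; the metric interfaces and weights are placeholders, since the report's exact aggregation is not described here.

```python
# Placeholder reward-ensemble combiner; reward_models maps a name to a callable
# (source, hypothesis) -> quality score, and weights maps a name to its weight.
def ensemble_reward(source, hypothesis, reward_models, weights):
    return sum(weights[name] * rm(source, hypothesis) for name, rm in reward_models.items())

# e.g. ensemble_reward(src, hyp, {"metricx_qe": metricx, "automqm": automqm},
#                      {"metricx_qe": 0.5, "automqm": 0.5})
```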
💡Research Ideas
• Adaptive Reward Ensemble Tuning for Machine Translation: Dynamically weight and select reward models (MetricX-QE, AutoMQM, etc.) per language/domain to further boost translation quality and training stability.
• TranslateGemma-MM: Unified Multimodal Translation across Text, Image, and Speech: Extend TranslateGemma to jointly learn translation from text, images, and speech, evaluating on Vistra and speech translation benchmarks.
• Data-Centric TranslateGemma: Quality-Aware Synthetic Generation for Low-Resource Languages: Develop generation and filtering strategies for synthetic parallel data tailored to low-resource pairs to maximize coverage while minimizing noise.
Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
Abstract
Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (ITP), a unified framework for agent learning via lookahead imagination, where an agent's policy model interacts with the learned world model, yielding multi-step "imagined" trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially observable and imaginable Markov decision process to guide policy learning. We instantiate ITP with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that ITP significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents' reasoning capability, providing valuable insights into addressing broader, complex tasks.
🎯Research Motivation
• LLM agents exhibit shallow grounding and lack the ability to project long-term consequences, causing irreversible errors during execution.
• Existing world-model usage is limited to single-step or fixed-horizon rollouts, failing to capture long-term dependencies in complex tasks.
• Fixed-depth rollouts waste computation on trivial decisions and can amplify model errors without adaptively allocating foresight to high-stakes actions.
• The standard POMDP framework does not integrate imagined futures with observations, limiting policy learning from prospective consequences.
🔧Research Method
ITP formalizes decision-making as a POIMDP, conditioning actions on current observations and multi-step imagined trajectories produced by an LLM-based world model with an adaptively chosen lookahead horizon. It provides a training-free variant (ITP-I) that reflects on imagined futures at inference and a reinforcement-trained variant (ITP-R) that learns a horizon predictor via pseudo-labeling and jointly optimizes policy and lookahead through warm-up training and online RL.
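A skeletal version of an imagine-then-plan decision step is sketched below; the policy and world-model interfaces, and the way the horizon is chosen, are assumptions meant only to make the control flow concrete.

```python
# Illustrative ITP-style decision step: adaptive lookahead with a learned world model.
def imagine_then_act(policy, world_model, obs, goal, max_horizon=5):
    # Adaptive lookahead: deeper imagination when progress toward the goal is
    # uncertain or high-stakes, shallow (or none) for near-trivial decisions.
    horizon = policy.predict_horizon(obs, goal, max_horizon=max_horizon)
    imagined, state = [], obs
    for _ in range(horizon):
        action = policy.propose(state, goal)
        state = world_model.step(state, action)   # imagined next observation
        imagined.append((action, state))
    # Fuse the real observation with imagined consequences (achieved progress,
    # potential conflicts) before committing to an action in the real environment.
    return policy.act(obs, goal, imagined_trajectory=imagined)
```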
💡Research Ideas
• Uncertainty-Aware Adaptive Lookahead for World Models: Incorporate model uncertainty to adjust imagination depth and mitigate error compounding during long rollouts.
• Hierarchical Imagine-Then-Plan for Long-Horizon Tasks: Combine high-level strategic imagination with low-level tactical planning to scale POIMDPs to complex, multi-stage tasks.
• Safety-Constrained POIMDPs for Risk-Sensitive Agents: Integrate safety critics and constraint satisfaction into imagined trajectories to prevent hazardous actions before real execution.
Geometric Stability: The Missing Axis of Representations
Abstract
Analysis of learned representations has a blind spot: it focuses on similarity, measuring how closely embeddings align with external references, but similarity reveals only what is represented, not whether that structure is robust. We introduce geometric stability, a distinct dimension that quantifies how reliably representational geometry holds under perturbation, and present Shesha, a framework for measuring it. Across 2,463 configurations in seven domains, we show that stability and similarity are empirically uncorrelated (ρ ≈ 0.01) and mechanistically distinct: similarity metrics collapse after removing the top principal components, while stability retains sensitivity to fine-grained manifold structure. This distinction yields actionable insights: for safety monitoring, stability acts as a functional geometric canary, detecting structural drift nearly 2× more sensitively than CKA while filtering out the non-functional noise that triggers false alarms in rigid distance metrics; for controllability, supervised stability predicts linear steerability (ρ = 0.89–0.96); for model selection, stability dissociates from transferability, revealing a geometric tax that transfer optimization incurs. Beyond machine learning, stability predicts CRISPR perturbation coherence and neural-behavioral coupling. By quantifying how reliably systems maintain structure, geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems.
🎯Research Motivation
• Representation analysis over-relies on similarity (e.g., RSA, CKA) and ignores robustness, leaving a blind spot in how reliably representational geometry holds under perturbation.
• Similarity metrics can align content yet miss fine-grained structural drift and collapse when top principal components are removed, failing to capture functional geometry.
• Existing tools trigger false alarms from non-functional noise or miss true geometric degradation, undermining safety monitoring and auditing of foundation models.
• There is no metric that predicts controllability (e.g., linear steerability) or disentangles geometry from transferability for informed model selection.
• Cross-domain auditing (ML and biology) needs a measure that generalizes across modalities, predicting coherence in CRISPR perturbations and neural-behavioral coupling.
🔧Research Method
Introduce geometric stability and the Shesha framework, which measures how reliably representational manifold structure is preserved under controlled perturbations, resampling, or context shifts. Shesha quantifies stability independently of similarity, remaining sensitive to fine-grained geometry (even after removing top PCs) and demonstrating utility in safety monitoring, controllability, and model selection.
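One simple way to operationalize "how reliably geometry holds under perturbation" is sketched below: correlate pairwise embedding distances before and after perturbations. Shesha's actual estimator is richer; this sketch assumes the perturbation preserves the number of inputs.

```python
# Hedged stability sketch: rank-correlate the pairwise-distance geometry of
# embeddings before and after perturbation (not Shesha's actual estimator).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def stability_score(embed, inputs, perturb, n_trials=10, seed=0):
    rng = np.random.default_rng(seed)
    base = pdist(embed(inputs))                          # reference geometry
    scores = []
    for _ in range(n_trials):
        rho, _ = spearmanr(base, pdist(embed(perturb(inputs, rng))))
        scores.append(rho)
    return float(np.mean(scores))
```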
💡Research Ideas
• Geometric Stability-Regularized Training for Robust, Steerable Representations: Integrate stability objectives into training to preserve manifold structure, improve drift resilience, and enhance linear steerability.
• Shesha-Bench: A Cross-Domain Benchmark for Geometric Stability: Standardize perturbation protocols and evaluation metrics across vision, language, audio, and biology to compare and advance stability measurement.
• Mitigating the Geometric Tax in Transfer Learning: Stability-Preserving Adaptation Methods: Develop adaptation techniques that maintain representational geometry during fine-tuning and domain shift, reducing the stability loss observed in transfer optimization.
The AI Hippocampus: How Far are We From Human Memory?
Abstract
Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models and Multi-Modal LLMs. As these models transition from static predictors to interactive systems capable of continual learning and personalized inference, the incorporation of memory mechanisms has emerged as a central theme in their architectural and functional evolution. This survey presents a comprehensive and structured synthesis of memory in LLMs and MLLMs, organizing the literature into a cohesive taxonomy comprising implicit, explicit, and agentic memory paradigms. Specifically, the survey delineates three primary memory frameworks. Implicit memory refers to the knowledge embedded within the internal parameters of pre-trained transformers, encompassing their capacity for memorization, associative retrieval, and contextual reasoning. Recent work has explored methods to interpret, manipulate, and reconfigure this latent memory. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations, such as textual corpora, dense vectors, and graph-based structures, thereby enabling scalable and updatable interaction with information sources. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents, facilitating long-term planning, self-consistency, and collaborative behavior in multi-agent systems, with relevance to embodied and interactive AI. Extending beyond text, the survey examines the integration of memory within multi-modal settings, where coherence across vision, language, audio, and action modalities is essential. Key architectural advances, benchmark tasks, and open challenges are discussed, including issues related to memory capacity, alignment, factual consistency, and cross-system interoperability.
🎯Research Motivation
• The landscape of memory mechanisms in LLMs/MLLMs is fragmented, lacking a cohesive taxonomy that connects implicit (parametric), explicit (retrieval), and agentic (persistent) memory.
• Static parametric memory in transformers constrains continual learning, personalization, and updatable knowledge; current memory editing methods are limited in scalability, locality, and safety.
• Explicit memory systems (text, vector, graph) face challenges in representation quality, training integration, retrieval robustness, and factual consistency, especially under long contexts.
• Agentic memory for autonomous, temporally extended behavior is underdeveloped, with open issues in consolidation, self-consistency, planning, and multi-agent collaboration.
• Multimodal memory integration (vision, language, audio, action) lacks unified architectures and benchmarks; cross-modal coherence and alignment remain difficult.
• System-level constraints—including memory capacity management, interoperability across components, alignment/safety (privacy, hallucination control), and standardized evaluation—are unresolved.
🔧Research Method
The paper is a comprehensive survey that proposes a three-tier taxonomy (implicit, explicit, agentic memory) and synthesizes architectures, representations, training paradigms, and evaluation protocols for memory in LLMs and MLLMs. It consolidates benchmarks, system designs, and open challenges across text and multimodal settings to guide the development of memory-augmented (M)LLMs.
💡Research Ideas
• Unified Memory Benchmark Suite for (M)LLMs: Measuring Capacity, Consistency, and Interoperability: Design standardized tasks and metrics spanning implicit, explicit, and agentic memory across modalities to evaluate recall, updateability, factual consistency, and cross-system interoperability.
• Safe and Auditable Memory Editing at Scale: Localized Updates for Parametric and Externalized Knowledge: Develop algorithms and tooling for precise, scalable, and auditable edits to model parameters and external stores, with guarantees on locality, alignment, and privacy.
• Cross-Modal Memory Alignment for Embodied Agents: Unified Representations from Vision–Language–Audio to Action: Create architectures and training strategies that align multimodal memories for retrieval, reasoning, and planning, validated on long-horizon robotics and interactive tasks.
Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering
Abstract
Modern video generative models based on diffusion models can produce very realistic clips, but they are computationally inefficient, often requiring minutes of GPU time for just a few seconds of video. This inefficiency poses a critical barrier to deploying generative video in applications that require real-time interactions, such as embodied AI and VR/AR. This paper explores a new strategy for camera-conditioned video generation of static scenes: using diffusion-based generative models to generate a sparse set of keyframes, and then synthesizing the full video through 3D reconstruction and rendering. By lifting keyframes into a 3D representation and rendering intermediate views, our approach amortizes the generation cost across hundreds of frames while enforcing geometric consistency. We further introduce a model that predicts the optimal number of keyframes for a given camera trajectory, allowing the system to adaptively allocate computation. Our final method, SRENDER, uses very sparse keyframes for simple trajectories and denser ones for complex camera motion. This results in video generation that is more than 40 times faster than the diffusion-based baseline in generating 20 seconds of video, while maintaining high visual fidelity and temporal stability, offering a practical path toward efficient and controllable video synthesis.
🎯Research Motivation
• Diffusion-based video generation is computationally expensive, preventing real-time use in embodied AI and AR/VR.
• Existing camera-controlled methods still generate every frame with neural networks, ignoring video redundancy and 3D scene structure.
• 3D priors are used only internally in prior work; final frames are not rendered from explicit 3D reconstructions, missing speed and geometric consistency benefits.
• Fixed keyframe densities fail to adapt to trajectory complexity, causing either wasted compute or incomplete reconstructions.
• 2D frame interpolation between sparse keyframes yields morphing artifacts and cannot honor large camera viewpoint changes.
• Long trajectories cause drift in diffusion outputs, making single global 3D reconstruction blurry without consistency controls.
🔧Research Method
SRENDER predicts an adaptive keyframe budget from the camera path and input image, generates sparse keyframes via history-guided diffusion with progressive training, reconstructs a 3D Gaussian Splatting scene deterministically (AnySplat), aligns poses, and renders the dense video. Temporal chunking and a two-stage inference scheme ensure long-range consistency and scalability while achieving >20–40× speed-ups over dense diffusion.
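To make the adaptive keyframe budget concrete, here is a minimal sketch; it is a heuristic stand-in for the paper's learned predictor, with made-up constants, that maps camera-trajectory complexity to a keyframe count and picks evenly spaced keyframe indices:

```python
# Sketch only: a heuristic keyframe-budget estimator, not SRENDER's trained model.
import numpy as np

def trajectory_complexity(positions: np.ndarray, yaws: np.ndarray) -> float:
    """Sum of per-step translation and rotation magnitudes along the camera path."""
    trans = np.linalg.norm(np.diff(positions, axis=0), axis=1).sum()
    rot = np.abs(np.diff(yaws)).sum()
    return trans + rot  # hypothetical 1:1 weighting of translation vs. rotation

def keyframe_budget(complexity: float, n_frames: int,
                    min_k: int = 4, frames_per_unit: float = 2.0) -> int:
    """Map complexity to a keyframe count, clamped to [min_k, n_frames]."""
    k = int(min_k + frames_per_unit * complexity)
    return max(min_k, min(k, n_frames))

def keyframe_indices(n_frames: int, k: int) -> list[int]:
    """Evenly spaced keyframes; denser when the budget k is larger."""
    return np.linspace(0, n_frames - 1, k).round().astype(int).tolist()

# Example: a 20-second, 24 fps trajectory (480 frames) with mild motion.
positions = np.cumsum(np.random.default_rng(0).normal(0, 0.02, (480, 3)), axis=0)
yaws = np.linspace(0.0, 0.5, 480)
k = keyframe_budget(trajectory_complexity(positions, yaws), n_frames=480)
print(k, keyframe_indices(480, k)[:5])
```

Only the sparse keyframes would then be produced by the diffusion model; the remaining frames are rendered from the reconstructed 3D scene.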
💡Research Ideas
• Generative 4D Scene Reconstruction from Sparse Keyframes: Extend SRENDER to dynamic scenes by combining motion-aware diffusion with dynamic Gaussian splatting for jointly modeling geometry and motion.
• End-to-End Differentiable Sparse-Keyframe Video Generation: Jointly learn keyframe selection, keyframe diffusion, and 3D reconstruction in a single differentiable pipeline optimized for quality–compute trade-offs.
• Long-Range Consistency in Camera-Controlled Diffusion via 3D-Aware Loop-Closure: Integrate 3D constraints and loop-closure losses into diffusion training to reduce drift and improve multi-view coherence over very long trajectories.
Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments
Abstract
Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These streams obey smooth, time-parameterized symmetries, which combine through a precisely structured algebra; yet most neural network world models ignore this structure and instead repeatedly re-learn the same transformations from data. In this work, we introduce 'Flow Equivariant World Models', a framework in which both self-motion and external object motion are unified as one-parameter Lie group 'flows'. We leverage this unification to implement group equivariance with respect to these transformations, thereby providing a stable latent world representation over hundreds of timesteps. On both 2D and 3D partially observed video world modeling benchmarks, we demonstrate that Flow Equivariant World Models significantly outperform comparable state-of-the-art diffusion-based and memory-augmented world modeling architectures -- particularly when there are predictable world dynamics outside the agent's current field of view. We show that flow equivariance is particularly beneficial for long rollouts, generalizing far beyond the training horizon. By structuring world model representations with respect to internal and external motion, flow equivariance charts a scalable route to data efficient, symmetry-guided, embodied intelligence. Project link: https://flowequivariantworldmodels.github.io.
🎯Research Motivation
• Accurately modeling partially observed dynamic environments where an agent’s egomotion and external object motion are intertwined
• Existing world models ignore smooth, time-parameterized symmetries and the structured algebra of flows, repeatedly relearning the same transformations from data
• Instability and drift in latent representations over long horizons, limiting generalization beyond the training rollout and weakening offscreen/occluded reasoning
• Data inefficiency and poor scalability due to the absence of group equivariance with respect to internal and external motion
🔧Research Method
Unifies self-motion and external object motion as one-parameter Lie group flows and enforces group equivariance so latent states transform predictably under these flows. This yields a stable, long-horizon latent world representation that improves performance in partially observed 2D/3D video modeling.
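The core property can be illustrated on a toy translation flow: a circular convolution commutes with the flow, which is the equivariance relation the paper enforces for richer one-parameter motions. This sketch is illustrative only and is not the paper's architecture:

```python
# Toy illustration: a translation "flow" g_t acting on a 1D signal, and a map f
# that is equivariant to it, i.e. f(g_t x) == g_t f(x).
import numpy as np

def flow_shift(x: np.ndarray, t: int) -> np.ndarray:
    """One-parameter translation flow: circularly shift the signal by t samples."""
    return np.roll(x, t)

def conv_map(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Circular convolution: a translation-equivariant layer."""
    n = len(x)
    return np.array([np.dot(kernel, np.take(x, np.arange(i, i + len(kernel)), mode="wrap"))
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.normal(size=32)
k = np.array([0.25, 0.5, 0.25])

lhs = conv_map(flow_shift(x, 5), k)   # transform, then encode
rhs = flow_shift(conv_map(x, k), 5)   # encode, then transform
print(np.allclose(lhs, rhs))          # True: the map commutes with the flow
```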
💡Research Ideas
• Learning Flow Groups from Data: Unsupervised Discovery of Symmetry in Partially Observed Dynamics: Infer one-parameter flow groups and equivariant representations directly from raw sequences without predefined transformations
• Equivariant World Models for Offscreen Reasoning and Planning: Couple flow-equivariant memory with planning to reason about occluded/offscreen entities and execute long-horizon tasks
• Composing Self-Motion and Interaction Flows in 3D: From Agents to Multi-Object Dynamics: Extend flow equivariance to non-commutative, interacting, and articulated motions with contact and collisions in 3D
DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing
Abstract
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework built around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.

🎯Research Motivation
• RL and RLHF often reduce output diversity in LLMs, which undermines performance on open-ended creative writing tasks where varied, novel content is essential.
• Existing diversity-enhancement methods largely rely on reward modifications, leaving the rollout process unconstrained and offering limited control over how diverse trajectories are explored.
• Prior branching/forking strategies emphasize sample efficiency or overall performance and typically branch at high-entropy tokens, so exploration is less controllable and is not explicitly optimized for diversity.
🔧Research Method
DPWriter introduces a semi-structured long Chain-of-Thought framework that decomposes generation into global planning, detailed reasoning, and final response. It uses Diverse Planning Branching based on diversity variation and a group-aware diversity reward to steer distinct trajectories and improve diversity without sacrificing quality.
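One plausible form of a group-aware diversity reward, shown below as an assumed sketch rather than the paper's exact formulation, scores each trajectory in a rollout group by its average dissimilarity to the rest of the group:

```python
# Assumed sketch of a group-aware diversity reward: trajectories that diverge
# from their rollout group earn a larger diversity bonus.
from collections import Counter
import math

def _cosine(a: Counter, b: Counter) -> float:
    common = set(a) & set(b)
    num = sum(a[t] * b[t] for t in common)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def group_diversity_rewards(texts: list[str]) -> list[float]:
    """Reward each text by 1 minus its mean similarity to the other group members."""
    bags = [Counter(t.lower().split()) for t in texts]
    rewards = []
    for i, bi in enumerate(bags):
        sims = [_cosine(bi, bj) for j, bj in enumerate(bags) if j != i]
        rewards.append(1.0 - sum(sims) / len(sims))
    return rewards

group = ["the knight rode into the storm",
         "the knight rode into the storm at dawn",
         "a child mapped the city of glass birds"]
print(group_diversity_rewards(group))  # the last, most distinct story scores highest
```

In practice the similarity would come from learned embeddings rather than bag-of-words counts; the bag-of-words version just keeps the sketch self-contained.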
💡Research Ideas
• Adaptive Diversity Control for Semi-Structured RL in Creative Writing: Learn prompt-conditioned policies to dynamically adjust branching timing, depth, and diversity reward weights.
• Human Preference-Guided Diversity Rewards for Open-Ended Generation: Model and integrate user-specific diversity and novelty preferences into RLHF to personalize diverse outputs.
• Cross-Domain DPWriter: Extending Diverse Planning Branching to Dialogue, Co-Creation, and Multimodal Storytelling: Apply and evaluate the framework across conversational, collaborative, and vision-language narrative tasks.
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
Abstract
Multimodal Large Language Models (MLLMs) are making significant progress in multimodal reasoning. Early approaches focus on pure text-based reasoning. More recent studies have incorporated multimodal information into the reasoning steps; however, they often follow a single task-specific reasoning pattern, which limits their generalizability across various multimodal tasks. In fact, there are numerous multimodal tasks requiring diverse reasoning skills, such as zooming in on a specific region or marking an object within an image. To address this, we propose unified generative multimodal reasoning, which unifies diverse multimodal reasoning skills by generating intermediate images during the reasoning process. We instantiate this paradigm with Omni-R1, a two-stage SFT+RL framework featuring perception alignment loss and perception reward, thereby enabling functional image generation. Additionally, we introduce Omni-R1-Zero, which eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning data. Empirical results show that Omni-R1 achieves unified generative reasoning across a wide range of multimodal tasks, and Omni-R1-Zero can match or even surpass Omni-R1 on average, suggesting a promising direction for generative multimodal reasoning.
🎯Research Motivation
• Existing MLLMs largely rely on text-only reasoning and underuse visual information in intermediate steps.
• Interleaved-modal methods typically follow a single task-specific reasoning pattern, limiting generalization across diverse multimodal tasks.
• Many real-world tasks demand diverse visual reasoning skills (zoom-in, grounding, marking, auxiliary lines, visual prediction) but lack a unified paradigm.
• Functional image generation (e.g., magnified views, annotated boxes, numbered markers) is challenging for current models.
• Step-wise interleaved multimodal annotations are scarce and costly, hindering training of generative multimodal reasoning approaches.
🔧Research Method
Omni-R1 unifies multimodal reasoning by generating intermediate images within a two-stage SFT+RL framework, using a perception alignment loss and perception-calibrated rewards to stabilize functional image generation. Omni-R1-Zero eliminates the need for multimodal annotations by bootstrapping step-wise visualizations from text-only reasoning trajectories.
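As a rough illustration of a perception alignment signal, the sketch below uses an assumed cosine-similarity form; the paper defines its own loss and reward on its encoders. The idea is to penalize generated intermediate images whose visual features drift from the reference features of the step they visualize:

```python
# Assumed sketch: a simple feature-alignment penalty, not Omni-R1's actual loss.
import numpy as np

def perception_alignment_loss(gen_feats: np.ndarray, ref_feats: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between generated and reference image features."""
    g = gen_feats / np.linalg.norm(gen_feats, axis=-1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(g * r, axis=-1)))

rng = np.random.default_rng(0)
gen, ref = rng.normal(size=(4, 512)), rng.normal(size=(4, 512))
print(perception_alignment_loss(gen, ref))  # lower is better-aligned
```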
💡Research Ideas
• Omni-R1-V: Extending Unified Generative Reasoning to Video and Embodied Interaction: Adapt the generative paradigm to temporal sequences and interactive settings with multi-step visual prediction and planning.
• Self-Supervised Omni-R1-Zero++: Scaling Bootstrapped Interleaved Visualizations without Annotations: Enhance bootstrapping via stronger synthetic visualization generators, curriculum learning, and consistency regularization to scale training.
• Perception-Verified Rewards: Combining Vision Models and Programmatic Checkers for Robust Generative Multimodal Reasoning: Develop richer, task-agnostic reward schemes that jointly verify visual and textual steps to improve RL stability and generalization.
Focal Guidance: Unlocking Controllability from Semantic-Weak Layers in Video Diffusion Models
Abstract
The task of Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt. This requires diffusion models to reconcile high-frequency visual constraints and low-frequency textual guidance during the denoising process. However, while existing I2V models prioritize visual consistency, how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), as indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon called Condition Isolation, where attention to visual features becomes partially detached from text guidance and overly relies on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances the controllability from Semantic-Weak Layers. FG comprises two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers. (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals and alleviating their over-reliance on the model's learned visual priors, thereby enhancing adherence to textual instructions. To further validate our approach and address the lack of evaluation in this direction, we introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance proves its effectiveness and generalizability, raising the total score on Wan2.1-I2V to 0.7250 (+3.97%) and boosting the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44%).
🎯Research Motivation
• I2V models struggle to harmonize high-frequency visual constraints from a reference frame with low-frequency textual guidance, often favoring visual priors and resulting in poor prompt adherence.
• DiT-based I2V systems exhibit Semantic-Weak Layers—intermediate layers with degraded text–visual alignment (e.g., Moran’s I drops from ~0.76 to ~0.19)—that undermine instruction following during denoising.
• Condition Isolation is identified as the root cause: heterogeneous modalities (VAE latents, image tokens, text tokens) are injected independently without fine-grained pre-alignment, making it difficult to ground textual concepts to spatial regions in the initial frame.
• Existing approaches emphasize temporal consistency, aesthetics, weight initialization from T2V, or prompt engineering and generic attention tweaks; they neither diagnose nor directly fix layer-wise semantic collapse, and robust instruction-following evaluation is lacking.
🔧Research Method
Focal Guidance restores controllability by coupling text and visual conditions through Fine-grained Semantic Guidance (CLIP-based keyword selection and visual anchor injection into text values and latent maps) and by propagating semantic signals via an Attention Cache that aggregates similarity maps from semantically responsive layers to guide semantic-weak layers.
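A minimal sketch of the Attention Cache idea follows, with assumed mechanics rather than the released implementation: average text-to-visual attention maps from semantically responsive layers, then blend the cached map into a weak layer's attention to inject an explicit semantic signal:

```python
# Assumed sketch of an attention cache: aggregate maps from responsive layers
# and mix them into semantic-weak layers, then renormalize.
import numpy as np

def build_cache(attn_maps: dict[int, np.ndarray], responsiveness: dict[int, float],
                thresh: float = 0.5) -> np.ndarray:
    """Average the attention maps of layers whose responsiveness exceeds a threshold."""
    strong = [attn_maps[l] for l, r in responsiveness.items() if r >= thresh]
    return np.mean(strong, axis=0)

def guide_weak_layer(weak_attn: np.ndarray, cache: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Blend the cached semantic map into a weak layer's attention and renormalize rows."""
    mixed = (1 - alpha) * weak_attn + alpha * cache
    return mixed / mixed.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
layers = {l: rng.random((4, 16)) for l in range(6)}          # text queries x visual tokens
layers = {l: a / a.sum(-1, keepdims=True) for l, a in layers.items()}
resp = {0: 0.8, 1: 0.7, 2: 0.2, 3: 0.1, 4: 0.75, 5: 0.15}    # e.g., per-layer Moran's I
cache = build_cache(layers, resp)
guided = guide_weak_layer(layers[2], cache)                  # repair a semantic-weak layer
print(guided.shape, np.allclose(guided.sum(-1), 1.0))
```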
💡Research Ideas
• Adaptive Focal Guidance: Online Detection and Repair of Semantic-Weak Layers in Video Diffusion: Monitor semantic responsiveness (e.g., Moran’s I) during generation and dynamically adjust cache weights and anchor injections per layer and timestep.
• Tri-Modal Pre-Alignment Networks: Unified Text–Image–Latent Embedding for Controllable I2V: Learn a shared, spatially-aware embedding space that pre-aligns VAE latents, image tokens, and text tokens to reduce conditioning isolation before DiT processing.
• Learnable Attention Cache: End-to-End Cross-Layer Semantic Routing for DiT and MMDiT: Make the cache differentiable with trainable gating and routing across layers, enabling end-to-end optimization of semantic signal transfer.
No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning
Abstract
Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent's error patterns shift over time, causing stationary critics to become stale and providing feedback of diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO utilizes a cascaded rollout mechanism where the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group-structured advantage estimation. We address the challenge of learning plateaus via a saturation-aware gain shaping objective, which rewards the critic for inducing incremental improvements in high-performing trajectories. By employing dual-track GRPO updates, ECHO ensures the critic's feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long-horizon task success across open-world environments.
🎯Research Motivation
• Static or offline critics become stale as the on-policy agent’s failure patterns drift, causing diminishing utility of feedback over training.
• Outcome rewards are sparse and non-diagnostic, lacking actionable guidance to refine agent behavior efficiently in long-horizon tasks.
• Template-based hints are inflexible and miscalibrated, while separately trained critics remain decoupled from policy learning and fail to adapt.
• Misaligned critique granularity (coarse early, subtle later) leads to redundant or misleading feedback.
• Training suffers from plateaus and instability without reward shaping that accounts for the increasing difficulty of further improvement near performance saturation.
🔧Research Method
ECHO co-evolves the policy and critic via synchronized dual-track GRPO, using a cascaded rollout with multi-view, score-aware critiques followed by conditional refinements, and a saturation-aware gain shaping (log-based intrinsic gain) to reward critiques that induce incremental improvements even near high performance. Group-structured trajectories enable relative advantage estimation, keeping critic feedback aligned with the evolving policy.
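The shaping and advantage steps can be sketched as follows, using an assumed log-based form (the paper's exact objective may differ): a log gain rewards improvements more strongly near saturation, and advantages are standardized within the rollout group, GRPO-style:

```python
# Assumed sketch: saturation-aware gain shaping plus group-relative advantages.
import numpy as np

def log_gain(baseline: float, refined: float, eps: float = 1e-6) -> float:
    """Log-based gain: improving 0.90 -> 0.95 counts more than 0.50 -> 0.55."""
    return float(np.log((1.0 - baseline + eps) / (1.0 - refined + eps)))

def group_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: standardize rewards within the rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

baseline = 0.90
refined_scores = np.array([0.92, 0.95, 0.88, 0.97])  # one per critique-conditioned refinement
gains = np.array([log_gain(baseline, r) for r in refined_scores])
print(gains.round(3), group_advantages(gains).round(3))
```

Critiques whose refinements produce higher gains receive larger relative advantages, which is what keeps the critic's feedback aligned with the current policy.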
💡Research Ideas
• Adaptive Granularity Critics via Curriculum Co-Evolution: Learn to dynamically adjust critique granularity from coarse-to-fine based on on-policy performance and failure modes.
• Joint Optimization of Reward, Critic, and Policy in Open-World Agents: Co-train the reward model alongside the critic and policy to reduce reward mis-specification and improve synchronization.
• Multi-Critic Ensembles with Diversity-Aware Coordination: Develop ensembles of complementary critics and mechanisms to select or aggregate critiques that maximize refinement gains across evolving distributions.
Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing
Abstract
Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences. A prototype featuring a cluster state cache and an intent analyzer (using AWS Bedrock) was developed. Empirical evaluation demonstrated high LLM parsing accuracy (>95% Subset Accuracy on an evaluation ground-truth dataset) for top-tier models like Amazon Nova Pro/Premier and Mistral Pixtral Large, significantly outperforming a baseline engine. Scheduling quality tests across six scenarios showed the prototype achieved superior or equivalent placement compared to standard Kubernetes configurations, particularly excelling in complex and quantitative scenarios and handling conflicting soft preferences. The results validate using LLMs for accessible scheduling but highlight limitations like synchronous LLM latency, suggesting asynchronous processing for production readiness. This work confirms the viability of semantic soft affinity for simplifying workload orchestration.
🎯Research Motivation
• Cluster workload allocation requires complex, brittle configurations to express soft affinity/preferences, creating a usability gap for operators and developers.
• Existing schedulers (e.g., Kubernetes) rely on rigid label/weight mechanisms that struggle to capture nuanced, quantitative, or conflicting soft preferences in a human-friendly way.
• Users need accessible, intent-driven scheduling that maps natural language to scheduling decisions without compromising placement quality.
• Baseline parsing engines lack semantic understanding, and synchronous LLM calls introduce latency that limits production readiness.
🔧Research Method
A Kubernetes scheduler extender uses an LLM (via AWS Bedrock) to parse natural-language 'allocation hint' annotations and convert them into soft-affinity scores informed by a cluster state cache, enabling intent-driven placement. The prototype is evaluated for parsing accuracy and scheduling quality across multiple scenarios, outperforming standard configurations and highlighting latency trade-offs.
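A toy version of the scoring step is sketched below, with a hypothetical intent schema and score range; the prototype's scheduler extender and Bedrock integration are more involved. The idea is that once the LLM has parsed an allocation hint into structured soft preferences, per-node soft-affinity scores follow from simple label matching:

```python
# Assumed sketch: convert LLM-parsed soft preferences into per-node scores
# (hypothetical intent schema, illustrative 0-10 score range).
def score_nodes(intent: dict, nodes: list[dict], max_score: int = 10) -> dict[str, int]:
    """intent example:
    {"preferences": [{"label": "zone", "value": "eu-west-1a", "weight": 0.7},
                     {"label": "disk", "value": "ssd", "weight": 0.3}]}"""
    scores = {}
    weight_sum = sum(p["weight"] for p in intent["preferences"]) or 1.0
    for node in nodes:
        matched = sum(p["weight"] for p in intent["preferences"]
                      if node["labels"].get(p["label"]) == p["value"])
        scores[node["name"]] = round(max_score * matched / weight_sum)
    return scores

nodes = [{"name": "node-a", "labels": {"zone": "eu-west-1a", "disk": "ssd"}},
         {"name": "node-b", "labels": {"zone": "eu-west-1b", "disk": "hdd"}}]
intent = {"preferences": [{"label": "zone", "value": "eu-west-1a", "weight": 0.7},
                          {"label": "disk", "value": "ssd", "weight": 0.3}]}
print(score_nodes(intent, nodes))  # {'node-a': 10, 'node-b': 0}
```

Because the preferences are soft, nodes that satisfy only some of them still receive partial scores rather than being filtered out.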
💡Research Ideas
• Asynchronous Intent Parsing for Low-Latency Cluster Scheduling: Decouple LLM inference from the scheduling critical path to reduce latency and enable production-scale throughput.
• Formal Guarantees for Semantic Soft Affinity in Cluster Orchestration: Combine LLM-driven parsing with constraint verification to ensure safety, fairness, and compliance under conflicting preferences.
• Learning Adaptive Soft-Affinity Scoring from Feedback and Telemetry: Use historical outcomes and real-time metrics to train models that fine-tune scoring and resolve preference conflicts.
sui-1: Grounded and Verifiable Long-Form Summarization
Abstract
Large language models frequently generate plausible but unfaithful summaries that users cannot verify against source text, a critical limitation in compliance-sensitive domains such as government and legal analysis. We present sui-1, a 24B parameter model that produces abstractive summaries with inline citations, enabling users to trace each claim to its source sentence. Our synthetic data pipeline combines chain-of-thought prompting with multi-stage verification, generating over 22,000 high-quality training examples across five languages from diverse sources including parliamentary documents, web text, and Wikipedia. Evaluation shows sui-1 significantly outperforms all tested open-weight baselines, including models with 3x more parameters. These results demonstrate that task-specific training substantially outperforms scale alone for citation-grounded summarization. Model weights and an interactive demo are publicly available.
🎯Research Motivation
• LLMs often produce plausible but unfaithful summaries with fabricated or misattributed claims, making verification against source text laborious and risky in compliance-sensitive domains (e.g., government, legal).
• Existing summarization datasets lack citation annotations, and manual grounding is prohibitively expensive; coordinating content generation with precise source attribution remains challenging.
• Long-document summarization still suffers from hallucinations despite larger context windows; prior architectures do not yield verifiable citations, and RAG-based methods add complex external infrastructure rather than internal grounding.
🔧Research Method
Train a 24B LLM (sui-1) on synthetically generated, multi-stage verified data to produce abstractive summaries with inline, sentence-level citations, using chain-of-thought teacher prompting and automated citation checks. The model supports long-context processing (100K tokens in a single pass) and iterative chunking to handle documents exceeding 2 million tokens, enabling self-contained, internally grounded generation.
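A small sketch of an automated citation check follows, assuming a hypothetical inline-citation format such as "[12]" that indexes source sentences; the paper's verification stages are richer. It flags citation indices that point nowhere and summary sentences that carry no citation at all:

```python
# Assumed sketch of a citation-grounding check (hypothetical "[n]" citation format).
import re

def check_citations(summary: str, source_sentences: list[str]) -> dict:
    cited, invalid, uncited = set(), set(), []
    for sent in re.split(r"(?<=[.!?])\s+", summary.strip()):
        ids = [int(m) for m in re.findall(r"\[(\d+)\]", sent)]
        if not ids:
            uncited.append(sent)                 # claim with no supporting citation
        for i in ids:
            (cited if 0 <= i < len(source_sentences) else invalid).add(i)
    return {"cited_sources": sorted(cited),
            "invalid_ids": sorted(invalid),
            "uncited_sentences": uncited}

source = ["The committee met on 3 May.", "It approved the budget.", "Two members abstained."]
summary = ("The budget was approved at the May meeting [0][1]. "
           "Abstentions were recorded [2]. The vote was unanimous.")
print(check_citations(summary, source))  # last sentence is flagged as uncited
```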
💡Research Ideas
• Beyond Inline Citations: Evidence Graphs for Verifiable Long-Form Summarization: Extend citations from sentence-level links to structured evidence graphs that connect claims to multiple passages and document sections for richer, auditable traceability.
• Adaptive Chunking for Ultra-Long Documents in Citation-Grounded Summarization: Develop dynamic chunking and aggregation strategies that optimize coverage, faithfulness, and citation precision when summarizing 2M+ token corpora.
• A Multilingual Benchmark for Verifiable Summarization with Human-Checked Citations: Create a standardized, cross-domain dataset with human-validated citation alignment to evaluate faithfulness, granularity, and multilingual robustness of grounded summarizers.
SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers
Abstract
The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP
🎯Research Motivation
• Scarcity of clean, high-purity morpheme lexicons for Uralic languages; existing dictionary-derived candidates are noisy, manual curation is impractical, and corpus-based methods (e.g., Morfessor) are ill-suited for low-resource settings.
• Lack of principled guidance on optimal BPE vocabulary size (k) for morphologically rich, agglutinative languages; current practice relies on heuristics or downstream metrics that overlook morphology-specific trade-offs.
• Standard BPE often misaligns with true morpheme boundaries, causing over-splitting or under-segmentation; intrinsic, linguistically grounded evaluation metrics and resources are needed to assess and improve tokenization.
🔧Research Method
SampoNLP introduces a corpus-free IMDP pipeline using MDL-inspired self-referential atomicity scoring with dynamic programming (Best Explanation Power) and Otsu thresholding to distill high-purity morpheme lexicons from noisy candidate lists. These lexicons underpin the Integrated Performance Score (IPS), which balances morpheme coverage and over-splitting to evaluate BPE tokenizers across vocabulary sizes and identify elbow points.
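One way the coverage-versus-over-splitting trade-off could be folded into a single number is sketched below; this is an assumed combination for illustration, and the paper's IPS definition may differ:

```python
# Assumed sketch of an IPS-like score: lexicon-based morpheme coverage divided
# by an over-splitting penalty (mean tokens per word).
def ips(words: list[str], segment, morpheme_lexicon: set[str]) -> float:
    """segment(word) -> list of subword tokens produced by the tokenizer."""
    covered, pieces, total = 0, 0, 0
    for w in words:
        toks = segment(w)
        pieces += len(toks)
        total += 1
        if all(t in morpheme_lexicon for t in toks):
            covered += 1
    coverage = covered / total    # share of words split entirely into known morphemes
    oversplit = pieces / total    # mean tokens per word (lower is better)
    return coverage / oversplit   # assumed combination; higher is better

lexicon = {"talo", "ssa", "kirja", "sto", "i"}
toy_segment = {"talossa": ["talo", "ssa"], "kirjastoissa": ["kirja", "sto", "i", "ssa"]}
print(ips(list(toy_segment), toy_segment.get, lexicon))
```

Plotting such a score across vocabulary sizes (8k-256k) is what produces the elbow points used for the vocabulary-size recommendations.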
💡Research Ideas
• Morphology-Aware Tokenizer Learning with Self-Referential Constraints: Integrate atomicity scores and lexicon constraints into tokenizer training to align merges with morpheme boundaries and surpass standard BPE in agglutinative languages.
• Extending SampoNLP to Low-Resource Agglutinative Families Beyond Uralic: Apply IMDP and IPS to Turkic, Dravidian, and Bantu languages, adapting character sets and whitelists to test generality and robustness.
• Linking IPS to Downstream Model Performance: An Intrinsic–Extrinsic Study: Quantify correlations between IPS (LMC/OSR) and task metrics across NLU/NLG, deriving k-selection policies and training-time trade-offs.