TurboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
Abstract
We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at https://github.com/thu-ml/TurboDiffusion.
🎯Research Motivation
• End-to-end video diffusion inference is prohibitively slow (minutes to hours per 5-second clip on a single GPU), preventing interactive and real-time applications.
• Quadratic full attention dominates compute and memory; naive sparsity or caching typically hurts quality or fails to scale with long sequences/resolutions.
• Reducing sampling steps is necessary for speed, but prior distillation often degrades visual fidelity/temporal consistency or is hard to integrate with other accelerations.
• Large video models at 480P–720P strain memory/throughput; existing quantization often targets weights only and misses activation/Tensor Core efficiency.
• Existing systems (e.g., FastVideo) achieve smaller speedups and show quality regressions; there is a lack of an end-to-end, co-designed pipeline that composes low-bit attention, trainable sparsity, step distillation, and INT8 linear acceleration.
🔧Research Method
TurboDiffusion co-trains Sparse-Linear Attention (SLA) and rCM step distillation, merges their updates, and at inference uses SageSLA (low-bit SageAttention2++ fused with SLA), W8A8 block-wise INT8 quantization for both weights and activations in Linear layers, plus fused CUDA/Triton norms to maximize Tensor Core utilization. This algorithm–system co-design yields 100–200× end-to-end speedups on a single RTX 5090 while maintaining video quality.
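To make the W8A8 component concrete, below is a minimal PyTorch sketch of block-wise symmetric INT8 quantization applied to a linear layer. The block size, the symmetric per-block scales, and the float emulation of the INT8 GEMM are illustrative assumptions; the paper's actual CUDA/Triton kernels are not reproduced here.

```python
# Minimal sketch of block-wise symmetric W8A8 quantization for a Linear layer.
# Illustrative only: block size and float emulation of INT8 Tensor Core GEMMs
# are assumptions, not the authors' kernel implementation.
import torch

def quantize_blockwise_int8(x: torch.Tensor, block: int = 128):
    """Symmetric per-block INT8 quantization along the last dimension."""
    shape = x.shape
    x = x.reshape(-1, shape[-1] // block, block)
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor, block: int = 128):
    """Emulate an INT8 x INT8 GEMM with per-block dequantization."""
    qx, sx = quantize_blockwise_int8(x, block)       # (B, K//block, block)
    qw, sw = quantize_blockwise_int8(weight, block)  # (N, K//block, block)
    out = torch.zeros(x.shape[0], weight.shape[0])
    # A real kernel accumulates in INT32 and applies scales once per block.
    for b in range(qx.shape[1]):
        xb = qx[:, b].float() * sx[:, b]
        wb = qw[:, b].float() * sw[:, b]
        out += xb @ wb.t()
    return out

x = torch.randn(4, 256)   # activations
w = torch.randn(8, 256)   # weights
print((x @ w.t() - w8a8_linear(x, w)).abs().max())  # small quantization error
```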
💡Research Ideas
• Adaptive SageSLA: Dynamic Quality-Aware Sparse–Dense Switching for Video Diffusion: Learn to adjust Top‑K sparsity per layer/frame conditioned on noise level and scene complexity to optimize speed–quality trade-offs.
• Ultra-Low-Bit Video Diffusion with Error-Compensated W4A8/W4A4: Push quantization beyond W8A8 using per-block/per-channel scaling and calibration to preserve quality at even higher throughput.
• rCM-XL: Content-Adaptive Step Scheduling for Long-Form, High-Resolution Video Generation: Extend step distillation with variable step counts across timesteps and shots to sustain quality on longer clips and 1080P+ outputs.
• Cross-Modal TurboDiffusion: A Unified Acceleration Stack for Audio-Video and 3D Diffusion Models: Generalize SageSLA + rCM + INT8 pipeline to multimodal and 3D tasks with modality-specific sparsity patterns and kernels.
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Abstract
Vision-language models (VLMs) excel at general understanding yet remain weak at dynamic spatial reasoning (DSR), i.e., reasoning about how object geometry and relationships evolve in 3D space over time, largely due to the scarcity of scalable 4D-aware training resources. To bridge this gap across datasets, benchmarks, and models, we introduce DSR Suite. First, we propose an automated pipeline that generates multiple-choice question-answer pairs from in-the-wild videos for DSR. By leveraging modern vision foundation models, the pipeline extracts rich geometric and motion information, including camera poses, local point clouds, object masks, orientations, and 3D trajectories. These geometric cues enable the construction of DSR-Train for learning and the further human-refined DSR-Bench for evaluation. Compared with previous works, our data emphasize (i) in-the-wild video sources, (ii) object- and scene-level 3D requirements, (iii) viewpoint transformations, (iv) multi-object interactions, and (v) fine-grained, procedural answers. Beyond data, we propose a lightweight Geometry Selection Module (GSM) to seamlessly integrate geometric priors into VLMs, which condenses question semantics and extracts question-relevant knowledge from pretrained 4D reconstruction priors into a compact set of geometry tokens. This targeted extraction avoids overwhelming the model with irrelevant knowledge. Experiments show that integrating DSR-Train and GSM into Qwen2.5-VL-7B significantly enhances its dynamic spatial reasoning capability, while maintaining accuracy on general video understanding benchmarks.
🎯Research Motivation
• Vision-language models (VLMs) are weak at dynamic spatial reasoning (DSR)—understanding how object geometry and relationships evolve in 3D over time—due to scarce, scalable 4D-aware training resources.
• Existing data/benchmarks emphasize static scenes or short-horizon motion, narrow domains (e.g., driving/HOI), lack viewpoint transformations and multi-object interactions, and provide coarse, non-procedural answers; training data for DSR is largely missing.
• Naive fusion of large 3D priors into VLMs (e.g., direct cross-attention/addition) introduces noisy, task-specific signals that degrade general video understanding; there is no targeted, question-guided selection of relevant geometry.
• Realistic, in-the-wild videos require egocentric–allocentric transformations and fine-grained temporal reasoning; monocular 3D lacks metric scale, complicating faithful supervision without qualitative, trend-based formulations.
🔧Research Method
DSR Suite couples an automated pipeline that converts in-the-wild videos into 4D-grounded multiple-choice QA (DSR-Train and the human-refined DSR-Bench) using camera poses, point clouds, masks, orientations, and 3D trajectories, with a lightweight Geometry Selection Module (GSM). GSM stacks two Q-Formers to condense the question and retrieve only question-relevant geometry tokens from pretrained 4D priors, then fuses them with vision tokens to boost DSR while preserving general video understanding.
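A minimal sketch of the question-guided selection idea behind GSM is shown below, assuming single-layer Q-Former-style blocks: learnable queries first condense the question, then serve as queries that cross-attend into frozen 4D-prior features to produce a compact set of geometry tokens. Dimensions, depth, and token counts are illustrative, not the paper's configuration.

```python
# Sketch of a two-stage, question-guided geometry selector in the spirit of GSM.
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    def __init__(self, dim: int, n_queries: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, context, queries=None):
        q = self.queries.expand(context.size(0), -1, -1) if queries is None else queries
        out, _ = self.attn(q, context, context)  # cross-attend into the context
        q = self.norm1(q + out)
        return self.norm2(q + self.ffn(q))

dim, n_geo = 512, 16
condense = MiniQFormer(dim, n_queries=n_geo)  # question -> compact query tokens
select = MiniQFormer(dim, n_queries=n_geo)    # queries -> geometry tokens

question_tokens = torch.randn(2, 24, dim)     # encoded question
geometry_feats = torch.randn(2, 1024, dim)    # frozen 4D-prior features

q = condense(question_tokens)                 # condensed question semantics
geo_tokens = select(geometry_feats, queries=q)  # question-relevant geometry
print(geo_tokens.shape)  # (2, 16, 512) -> fused with the VLM's vision tokens
```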
💡Research Ideas
• GSM++: End-to-End Question-Guided 4D Geometry Selection for Video VLMs: Jointly train the selector, video encoder, and 3D backbone to learn adaptive, compact geometry tokens and further reduce noise from in-the-wild priors.
• DSR-Agent: Embodied 4D Spatial Reasoning for Real-World Robotics and AR: Deploy DSR-enhanced VLMs in closed-loop agents for dynamic navigation/manipulation; study action-conditioned, safety-aware reasoning in changing scenes.
• Metric-Scale DSR: Toward Quantitative 4D Reasoning from In-the-Wild Videos: Integrate self-calibration or auxiliary sensors to recover metric scale, enabling numeric distance/speed judgments and longer-horizon prediction and counterfactuals.
DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
Abstract
The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
🎯Research Motivation
• Existing approaches stitch independently generated clips or rely only on first/last-frame conditioning, often causing disjointed transitions and weak temporal coherence; precise temporal control over intermediate moments is missing.
• 3D VideoVAE encoders are causal with temporal downsampling, so an intermediate condition latent encodes multiple neighboring frames, breaking exact frame-level correspondence and hindering arbitrary-frame control.
• Super-resolution stages amplify mismatches between conditions and generated content, yielding flicker and cross-frame color shifts, especially under intermediate conditioning.
• Large semantic/visual gaps between conditioning points trigger abrupt cuts and physically implausible subject motion; base models lack mechanisms to prefer smooth transitions and realistic dynamics.
• Long one-shot videos exceed the memory/compute envelope of DiT models, making single-pass generation impractical without specialized, memory-efficient inference.
• Addressing these issues enables controllable, coherent, and cinematic one-shot videos from fragmented visual inputs—valuable for filmmaking, pre-viz, advertising, and creative storytelling.
🔧Research Method
DreaMontage adds lightweight arbitrary frame/video conditioning to a DiT-based video generator via channel-wise latent concatenation, an Interm-Cond Adaptive Tuning scheme to align training with inference, and a Shared-RoPE sequence-wise conditioning in the SR DiT to suppress flicker. It further applies a progressive pipeline—Visual Expression SFT and Tailored DPO (for cut avoidance and realistic motion)—and uses a Segment-wise Auto-Regressive inference strategy to generate long, seamless one-shot videos efficiently.
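Below is a minimal sketch of arbitrary-frame conditioning via channel-wise latent concatenation. The exact layout (an extra mask channel, zero-filled unconditioned positions) is an assumption about how such conditioning is commonly wired, not DreaMontage's verified implementation.

```python
# Sketch: inject condition frames into a DiT input via channel-wise concat.
import torch

def build_conditioned_input(latents, cond_latents, cond_t_idx):
    """latents: (B, C, T, H, W) noisy video latents.
    cond_latents: (B, C, K, H, W) clean latents for K condition frames.
    cond_t_idx: K temporal indices to condition on."""
    B, C, T, H, W = latents.shape
    cond = torch.zeros_like(latents)       # zeros where no condition exists
    mask = torch.zeros(B, 1, T, H, W)      # 1 marks conditioned frames
    for k, t in enumerate(cond_t_idx):
        cond[:, :, t] = cond_latents[:, :, k]
        mask[:, :, t] = 1.0
    # Channel-wise concat: noisy latents + condition latents + mask.
    return torch.cat([latents, cond, mask], dim=1)  # (B, 2C+1, T, H, W)

x = torch.randn(1, 16, 8, 32, 32)
c = torch.randn(1, 16, 2, 32, 32)
inp = build_conditioned_input(x, c, cond_t_idx=[0, 5])
print(inp.shape)  # torch.Size([1, 33, 8, 32, 32]) -> fed to the DiT patchifier
```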
💡Research Ideas
• Non-Causal VideoVAE for Precise Arbitrary-Frame Conditioning: Replace causal, temporally downsampled encoders with bidirectional or localized VAEs that produce single-frame-consistent latents, eliminating re-encode/resample approximations.
• Planning-Guided Transition Synthesis Between Heterogeneous Conditions: Introduce a high-level planner that predicts camera/object trajectories and transition styles between conditions to improve narrative coherence and controllability.
• Cross-Scale Consistent Super-Resolution for Arbitrary-Conditioned Video Diffusion: Train SR with shared positional priors and cross-scale temporal alignment losses to further reduce flicker and color drift under intermediate conditioning.
• Interactive Streaming One-Shot Generation with Online Condition Insertion: Enable real-time keyframe/clip insertion and prompt edits during generation using causal decoding plus SAR for low-latency, controllable outputs.
• 3D-Aware One-Shot Video Generation with View-Consistent Camera Control: Integrate 3D scene representations (e.g., NeRF/Gaussians) to maintain geometric consistency across large camera motions and enable novel-view one-shots.
• Scalable Preference Alignment for Video via Multimodal RLHF: Generalize Tailored DPO to large-scale reward modeling and RLHF for smoothness, physics realism, and style adherence while minimizing human labeling.
• Script-to-Timeline Compiler for Arbitrary-Frame Conditioning: Use LLM/VLM tools to convert scripts and storyboards into optimal timed conditions and prompts, jointly optimizing conditioning schedules and transitions.
T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Abstract
Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, instruction following, etc. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.
🎯Research Motivation
• Evaluation of text-to-audio-video (T2AV) generation is fragmented and often unimodal, failing to assess cross-modal semantic alignment and temporal synchronization.
• Existing benchmarks have limited coverage of fine-grained audiovisual coupling (e.g., multi-source mixing, off-screen sound, physical causality), struggle with long and compositional prompts, and rely on narrow metric sets without interpretable diagnostics.
• There is no unified, scalable protocol that combines objective signal-level metrics with high-level instruction following and perceptual realism assessment.
• Realistic T2AV requires simultaneous success in video/audio quality, cross-modal alignment, temporal sync, and physically grounded realism; without comprehensive evaluation, progress is hard to measure and compare.
• State-of-the-art models exhibit persistent failures—especially an “Audio Realism Bottleneck,” weak fine-grained synchronization, and imperfect instruction following—highlighting the need for a challenging, diagnostic benchmark.
🔧Research Method
T2AV-Compass proposes a taxonomy-driven benchmark of 500 complex prompts (curated via semantic clustering, LLM rewriting, and real-video inversion) plus a dual-level evaluation suite. It fuses objective metrics for video/audio quality and cross-modal alignment/synchrony with a reasoning-first MLLM-as-a-Judge that scores instruction following and perceptual realism using granular checklists and violation checks.
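As a toy illustration of the dual-level idea, the sketch below fuses normalized objective metrics with an MLLM judge's checklist verdicts into one score. The metric names, equal weighting, and averaging scheme are assumptions for illustration, not the benchmark's exact aggregation.

```python
# Sketch: combine signal-level metrics with judge checklist results.
def dual_level_score(objective: dict, checklist: list) -> float:
    """objective: metric name -> score in [0, 1] (higher is better).
    checklist: (criterion, passed) pairs from the MLLM judge."""
    obj = sum(objective.values()) / len(objective)
    subj = sum(passed for _, passed in checklist) / len(checklist)
    return 0.5 * obj + 0.5 * subj  # equal-weight fusion (assumed)

score = dual_level_score(
    {"video_quality": 0.82, "audio_quality": 0.64, "av_alignment": 0.71},
    [("sound source appears on screen", True),
     ("event order follows the prompt", False)],
)
print(round(score, 3))
```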
💡Research Ideas
• Breaking the Audio Realism Bottleneck: Joint Audiovisual Diffusion for Physically Grounded Foley and Synchrony: Develop end-to-end architectures that learn cross-modal physical correlations to improve material–timbre consistency and event-level A–V sync.
• Long-Form T2AV-Compass: Benchmarking and Modeling Minute-Scale Narratives with Stable Cross-Modal Coherence: Extend evaluation and modeling to long-duration (>10s) videos with multi-event audio mixing, narrative continuity, and robust temporal consistency.
• LightJudge: Distilled and Debiased MLLM-as-a-Judge for Scalable T2AV Evaluation: Train compact, interpretable evaluators from reasoning-first judges with human-in-the-loop feedback to reduce cost and bias while preserving diagnostic power.
Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Abstract
We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
🎯Research Motivation
• Lack of a global, open, continuous-label benchmark for building-age estimation; prior datasets are geographically narrow, temporally shallow, lack photos, or are closed-source (Table 1, page 2).
• Accurate dating matters for sustainability audits, heritage preservation, and disaster assessment, yet ages for most of the world’s buildings are unknown (Introduction and Figure 1, page 1).
• Prior methods often cast age prediction as classification, ignoring temporal ordinality; licensing issues hinder reproducibility (pages 1–2).
• SOTA VLMs may memorize famous landmarks rather than reason architecturally; strong popularity bias observed (e.g., Gemini2.0-Flash gains +34.18% IA5 on high-pageview buildings; Table 2, page 6).
• Models show geographic and temporal biases, with large errors in early periods and uneven performance across continents (Table 3, page 7).
🔧Research Method
They introduce YearGuessr (55,546 CC BY-SA Wikipedia-sourced facade images with GPS and pageviews, 1001–2024) and YearCLIP, a CLIP-based ordinal regression model that fuses image and GPS embeddings via a learnable zero-convolution and uses style tokens plus reasoning prompts to output year estimates and human-verifiable rationales (Figure 4, page 5). They also define popularity-aware evaluation (Interval Accuracy and stratified IA5/MAE) to quantify memorization bias.
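A minimal sketch of the zero-initialized fusion plus an ordinal head, in the spirit of YearCLIP, follows. The embedding sizes, the threshold-based ordinal decomposition, and the simple GPS encoder are illustrative assumptions.

```python
# Sketch: zero-init GPS fusion + ordinal regression over construction years.
import torch
import torch.nn as nn

class OrdinalYearHead(nn.Module):
    def __init__(self, dim=512, year_min=1001, year_max=2024):
        super().__init__()
        self.gps_proj = nn.Linear(2, dim)      # (lat, lon) -> embedding
        self.zero_fuse = nn.Linear(dim, dim)   # "zero conv": starts as a no-op
        nn.init.zeros_(self.zero_fuse.weight)
        nn.init.zeros_(self.zero_fuse.bias)
        self.thresholds = torch.arange(year_min, year_max)  # ordinal cuts
        self.head = nn.Linear(dim, len(self.thresholds))

    def forward(self, img_emb, gps):
        fused = img_emb + self.zero_fuse(self.gps_proj(gps))
        p_gt = torch.sigmoid(self.head(fused))  # P(year > threshold_k)
        # Expected year under the ordinal decomposition:
        year = self.thresholds[0] + p_gt.sum(dim=-1)
        return year, p_gt

head = OrdinalYearHead()
img_emb = torch.randn(4, 512)            # e.g., CLIP image features
gps = torch.tensor([[48.86, 2.35]] * 4)  # latitude / longitude
year, _ = head(img_emb, gps)
print(year)                              # continuous year estimates
```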
💡Research Ideas
• Popularity-Debiasing for VLMs on YearGuessr: Counterfactual prompts, reweighting by pageviews, and adversarial training to reduce landmark memorization.
• Renovation-Aware Building Dating: Add temporal-segmentation/renovation labels and a multi-stage ordinal regressor that separates original vs. rebuilt phases.
• Geo-Diverse Expansion of YearGuessr: Active learning and diffusion-based synthesis to boost pre-1600 and underrepresented regions’ coverage.
• Uncertainty-Aware OrdinalCLIP: Probabilistic ordinal losses with calibrated prediction intervals and risk-sensitive evaluation.
• Multi-View and 3D Cues for Architectural Age: Fuse street-level, aerial, and open-vocabulary 3D features for robust dating across periods.
• Explainable Age Prediction with Human-in-the-Loop: Curate and refine reasoning prompts using expert feedback for trustworthy rationales.
• Metadata-as-Language Fusion for Dating: Integrate EXIF, address, and climate priors (e.g., AddressCLIP/EXIF-as-language) to improve fairness and robustness.
• Stress-Testing Memorization in VLMs: A standardized suite to benchmark closed- vs. open-source VLMs under popularity-controlled splits.
Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
Abstract
We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine-tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.
🎯Research Motivation
• Improve inference efficiency without sacrificing accuracy by activating only a small fraction of parameters per token (sparse MoE) to outperform similarly sized open models in throughput-heavy regimes.
• Enable reliable ultra-long context (up to 1M tokens) without degrading short-context performance, which prior long-context training often harms.
• Avoid capability regressions from single-environment RL by training across multiple verifiable environments simultaneously with a principled curriculum.
• Replace brittle Bradley–Terry preference models susceptible to reward hacking with a stronger, reasoning-based generative reward model for RLHF.
• Provide controllable reasoning (on/off and token-budget control) to reduce unnecessary chain-of-thought verbosity and inference cost, which most models lack.
• Preserve accuracy under aggressive post-training quantization; naive full-model quantization can significantly degrade performance.
• Expand and decontaminate pretraining data (code, STEM, high-quality web text) to overcome scarcity/duplication and improve downstream reasoning.
🔧Research Method
A 31.6B-parameter Mixture-of-Experts hybrid Mamba–Transformer (GQA) model that activates ~3.2B params per token, pretrained on 25T tokens with a two-phase curriculum and long-context CPT to 1M, then post-trained via unified multi-environment RL from verifiable rewards and RLHF using a large generative reward model with group-relative length control; finally, selective FP8 PTQ keeps attention/preceding Mamba layers in BF16 to retain accuracy while boosting throughput.
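The selective-precision step can be illustrated with a short sketch that quantizes Linear weights to FP8 except in layers kept at higher precision. The name-based filter and the plain dtype cast are assumptions for illustration (and `torch.float8_e4m3fn` requires a recent PyTorch); the real recipe uses calibrated FP8 kernels, not an emulated dequant.

```python
# Sketch: selective FP8 PTQ that keeps sensitive layers in BF16.
import torch
import torch.nn as nn

def selectively_quantize(model: nn.Module, keep_bf16=("attn", "mamba")):
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(s in name for s in keep_bf16):
            module.to(torch.bfloat16)               # sensitive layers stay BF16
        else:
            w = module.weight.data
            scale = w.abs().amax().clamp(min=1e-12) / 448.0  # e4m3 max normal
            q = (w / scale).to(torch.float8_e4m3fn)          # FP8 storage
            module.weight.data = q.to(torch.bfloat16) * scale  # emulate dequant
    return model

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn_proj = nn.Linear(64, 64)  # name matches the keep-filter
        self.mlp_up = nn.Linear(64, 256)    # gets quantized

model = selectively_quantize(nn.Sequential(Block(), Block()))
```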
💡Research Ideas
• Off-Policy Multi-Environment RL for Generalist LLMs: Scale beyond synchronous GRPO by adding replay, masked importance corrections, and adaptive curricula to further improve stability, sample-efficiency, and cross-domain retention.
• Learning to Budget: Dynamic Reasoning-Length Control via Reward Shaping: Generalize group-relative length control to instance-adaptive, tool-aware budgets that jointly optimize answer quality, latency, and cost across tasks.
• Beyond One Million Tokens: Mixture-of-Length CPT with Retrieval for Extreme Contexts: Combine mixture-of-length continuous pretraining, retrieval augmentation, and stability regularizers to push reliable context windows to >1M without short-context regressions.
HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
Abstract
High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising than the Wan2.1 baseline with negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.
🎯Research Motivation
• High-resolution video diffusion has quadratic spatio-temporal complexity, making 1080p and long-duration generation impractical in latency and memory.
• Sampling acceleration (e.g., timestep distillation) reduces the number of steps but per-step cost at high resolution remains prohibitively high.
• Sparse/sliding attention reduces quadratic cost, yet KV caches and memory still grow with video length, causing unstable speed and memory blow-up.
• There is inherent redundancy across axes: early denoising steps only set coarse structure (spatial redundancy), far-past frames contribute little (temporal redundancy), and later chunks need fewer steps (timestep redundancy).
• Maintaining temporal coherence without attending to the full history is challenging; current autoregressive designs suffer from growing context and drift.
• Two-stage pipelines with external super-resolution are cheap but often lose fine details; DiT-based models also face high-res positional encoding misalignment, yielding blur.
🔧Research Method
HiStream eliminates spatial, temporal, and timestep redundancy via Dual-Resolution Caching (early steps at low resolution followed by high-resolution refinement with aligned dual KV caches) and an Anchor-Guided Sliding Window (a persistent first-frame anchor plus a small neighbor cache for fixed-size context). Optionally, Asymmetric Denoising uses fewer steps for subsequent chunks, achieving up to 76.2×–107.5× speedups at 1080p with minimal quality loss.
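A minimal sketch of the anchor-guided fixed-size context follows: the first chunk's KV entries are pinned as a persistent anchor, and only the most recent neighbor chunks are retained, so the attention context never grows. Chunk granularity and the eviction policy are assumptions for illustration.

```python
# Sketch: anchor-pinned sliding-window KV cache with bounded size.
from collections import deque
import torch

class AnchorSlidingCache:
    def __init__(self, max_neighbors: int = 2):
        self.anchor = None                            # persistent first-chunk KV
        self.neighbors = deque(maxlen=max_neighbors)  # recent-chunk KVs

    def append(self, k: torch.Tensor, v: torch.Tensor):
        if self.anchor is None:
            self.anchor = (k, v)           # pin the first chunk
        else:
            self.neighbors.append((k, v))  # deque evicts the oldest neighbor

    def context(self):
        """Fixed-size KV context: anchor + recent neighbors."""
        chunks = [self.anchor] + list(self.neighbors)
        ks = torch.cat([k for k, _ in chunks], dim=1)
        vs = torch.cat([v for _, v in chunks], dim=1)
        return ks, vs

cache = AnchorSlidingCache(max_neighbors=2)
for _ in range(5):                         # five generated chunks
    k = torch.randn(1, 16, 64)             # (batch, tokens, dim)
    cache.append(k, k.clone())
ks, _ = cache.context()
print(ks.shape)  # anchor + 2 neighbors = 48 tokens, no matter how many chunks
```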
💡Research Ideas
• Adaptive HiStream: Content-Aware Scheduling for Streaming Video Diffusion: Dynamically allocate resolution, chunk size, and denoising steps based on motion/texture complexity to further reduce compute while preserving fidelity.
• Multi-Anchor Streaming Attention for Long-Form and Scene-Change Robustness: Learn anchor selection/refresh and use multi-anchor caches or anchor compression to handle scene transitions and prevent drift without growing memory.
• HiStream-4K: Hierarchical Dual-Resolution Caching for Ultra-High-Resolution, Long-Duration Video: Extend to multi-scale pyramids with quantized/low-rank KV caches and joint distillation to scale to 4K+ resolution and hour-long videos with stable latency.
NVIDIA Nemotron 3: Efficient and Open Intelligence
Abstract
We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning, which enables reasoning and multi-step tool use, and they support granular reasoning-budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.
🎯Research Motivation
• High inference latency and poor throughput in Transformer-based MoE for reasoning workloads due to attention KV-cache growth and all-to-all expert routing bottlenecks (see Fig. 2, p.2).
• Limited long-context capability; RoPE-based Transformers degrade or fail beyond training length, hindering 512k–1M token tasks like code repositories and RAG (Table 3, p.8).
• Existing MoE layers are memory-bandwidth–bound at low batch and communication–bound at high throughput, reducing accuracy per byte under fixed latency/compute (§2.2).
• Training cost/efficiency constraints for trillion-token pretraining; current FP8/BF16 regimes underutilize new hardware throughput (NVFP4 offers up to 3× FP8 peak on GB300; §2.4).
• Post-training fragility: staged RL can cause reward hacking and capability regressions across tasks (§2.6), and models lack controllable reasoning budgets at inference (§2.7).
• Fragmented or closed releases (weights/recipes/data) limit reproducibility and adoption; the paper aims to provide open, end-to-end assets.
🔧Research Method
Nemotron 3 uses a hybrid Mamba–Transformer Mixture-of-Experts with LatentMoE (expert routing/compute in a lower-dimensional latent space) and minimal attention to cut communication and KV-cache costs while improving accuracy per byte. It trains with NVFP4 and Multi-Token Prediction, extends context to 1M tokens without RoPE, and applies multi-environment RL plus inference-time reasoning-budget control.
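The LatentMoE idea can be sketched as follows: down-project tokens to a smaller latent space, run top-k expert FFNs there, and project back, so expert compute and all-to-all traffic scale with the latent width. The latent size, expert shapes, and softmax-weighted combination are assumptions, not the published architecture.

```python
# Sketch: MoE routing and expert compute in a lower-dimensional latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoE(nn.Module):
    def __init__(self, dim=1024, latent=256, n_experts=8, top_k=2):
        super().__init__()
        self.down = nn.Linear(dim, latent)   # shared compression
        self.up = nn.Linear(latent, dim)     # shared expansion
        self.router = nn.Linear(latent, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(latent, 4 * latent), nn.GELU(),
                          nn.Linear(4 * latent, latent))
            for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                    # x: (tokens, dim)
        z = self.down(x)                     # route & compute in latent space
        weights, idx = self.router(z).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(z)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(z[mask])
        return self.up(out)                  # back to model width

moe = LatentMoE()
y = moe(torch.randn(10, 1024))
print(y.shape)  # (10, 1024); expert FLOPs scale with `latent`, not `dim`
```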
💡Research Ideas
• Adaptive LatentMoE: Learning Per-Layer Latent Dimensions and Top-K for Hardware-Aware Expert Routing: Jointly optimize latent size and active experts per token/layer to maximize accuracy-throughput under bandwidth constraints.
• Auto-Hybridization of Mamba, Attention, and MoE for 1M-Token Contexts: Neural architecture search that places attention vs. Mamba vs. MoE under target latency, memory, and long-context retrieval objectives.
• Policy-Driven Reasoning Budgets for Agentic LLMs: Reinforcement Learning of Dynamic Think-Token Allocation: Train controllers to allocate thinking tokens online to meet accuracy/latency SLAs across tasks and users.
TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
Abstract
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
🎯Research Motivation
• It is hard to isolate the impact of tokenization because existing models confound tokenizer effects with differences in architecture, training data, and budgets—there is no open suite of models that are identical except for the tokenizer.
• Current benchmarks largely evaluate on clean text and underrepresent real-world, tokenizer-sensitive perturbations (e.g., Unicode styling, OCR noise, romanization, diacritics, keyboard constraints, morphology, LaTeX/STEM formatting); see Figure 1 on page 2 for illustrative perturbation examples.
• Multilingual settings expose tokenization inefficiency and unfairness (high subword fertility, poor parity, high continued-word rates), leading to cost and performance gaps across languages (Appendix C, Table 7).
• Preprocessing choices (normalization, whitespace handling, contraction/number rules) and OOV strategies vary widely across tokenizers, but their downstream effects on robustness are poorly understood.
• Universal vulnerabilities persist—particularly Unicode styling and structural formatting—while simply scaling model size or training longer provides only modest robustness gains (Tables 1, 18, 21, 22).
🔧Research Method
TokSuite trains 14 language models that are identical in architecture, data, training budget, and initialization but differ only in their tokenizer; a super-vocabulary creates bijective mappings so shared tokens receive identical embedding initializations. It introduces a multilingual robustness benchmark (~5k examples across EN/TR/IT/FA/ZH) with real-world perturbation families, and evaluates models via byte-length–normalized log-likelihood and relative accuracy drop, complemented by intrinsic tokenizer efficiency metrics (fertility, parity, PCW).
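The intrinsic metrics named above are simple to state in code; the sketch below assumes a generic `encode` callable, whitespace word splitting, and parity measured against an English parallel, which are simplifications of the benchmark's definitions.

```python
# Sketch: intrinsic tokenizer metrics (fertility, parity, byte-normalized NLL).
def fertility(encode, text: str) -> float:
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    return len(encode(text)) / max(len(words), 1)

def parity(encode, text_lang: str, text_en: str) -> float:
    """Token-count ratio of a translation vs. its English parallel;
    values > 1 mean the language pays more tokens for the same content."""
    return len(encode(text_lang)) / max(len(encode(text_en)), 1)

def byte_normalized_nll(total_log_likelihood: float, text: str) -> float:
    """Log-likelihood per UTF-8 byte, comparable across tokenizers."""
    return total_log_likelihood / len(text.encode("utf-8"))

toy_encode = lambda s: s.split()  # stand-in for a real tokenizer's encode()
print(fertility(toy_encode, "the cat sat"))                       # 1.0
print(parity(toy_encode, "el gato se sentó", "the cat sat down"))  # 1.0
```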
💡Research Ideas
• Boundless Tokenization for Robust Multilingual LMs: Allow cross pre-tokenization merges and morphology-aware segmentation to reduce fragmentation under noise and agglutinative morphology; evaluate on TokSuite.
• Unicode-Native, Reversible Normalization Pipelines for LLMs: Design lossless normalization and tokenizer vocabularies that preserve styled/compatibility characters while eliminating performance cliffs from Unicode styling and homoglyphs.
• Decoupling Input and Output Vocabularies in Technical Domains: Use byte-level or STEM-specialized input tokenization with compact output vocabularies to improve robustness on LaTeX, diagrams, and structured ASCII without sacrificing efficiency.
• Robustness Scaling Laws with Controlled Tokenization: Systematically scale parameters and data with fixed tokenizers (and vice versa) to quantify how robustness curves change, isolating effects beyond what Table 22 suggests.
• Inference-Time Tokenization Repair via Exact Byte-Level Probabilities: Integrate exact byte-level probability conversion and token healing into decoding to mitigate perturbation-induced segmentation errors without retraining; benchmark on TokSuite.
• Vocabulary Optimization under Fixed Token Budgets: Jointly optimize vocabulary size, composition, and normalization to minimize fertility/parity gaps across languages while preserving downstream accuracy.
• Adversarial Tokenization and Defenses in Safety Pipelines: Build non-canonical segmentation attacks (e.g., styled Unicode, zero-width, diacritics) and develop training-time and inference-time defenses for alignment-critical applications.
Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
Abstract
Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, resulting in weak semantic representations. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.
🎯Research Motivation
• Visual generative pretraining is dominated by BERT-style masked modeling that underutilizes temporal dependencies critical for video understanding (pages 1–2; Fig. 1a).
• Existing visual autoregressive (AR) pretraining often embeds semantics deep in intermediate layers and needs layer-wise probing; semantic localization is inaccurate (page 2).
• Direct regression objectives for generating patches/frames struggle with diversity, leading to blurry or averaged outputs and weak semantics (page 2).
• Video AR models can trivially copy previous frames due to temporal redundancy; masked next-frame prediction is needed to make the task non-trivial (page 3, Sec. 3.1).
• Representation and decoding are entangled in end-to-end AR models, causing encoder features to be altered by decoding dynamics; decoupling is needed for stable, strong semantics (pages 3–4; Fig. 2).
• Typical conditioning injection (e.g., AdaLN or sequence concat used in text-to-image) is mismatched for dense, spatially structured conditions; spatially aligned conditioning is required (page 2).
• Preventing information leakage across frames during generation is crucial; custom attention masks are needed (page 4; Fig. 3).
🔧Research Method
NExT-Vid pretrains via masked next-frame prediction with a context-isolated autoregressive predictor that forecasts next-frame latent features from ViT encoder outputs, and a conditioned flow-matching decoder that generates VAE latents using spatially aligned concatenation. An EMA-updated reference encoder provides alignment regularization, while custom attention masks (frame-wise causal, autoregressive, frame-isolated) prevent leakage and decouple representation from decoding.
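The frame-wise causal mask is the simplest of the three masks to illustrate: every token may attend to all tokens in its own and earlier frames, never to future frames. The token-per-frame count and the boolean convention (True = attend) are assumptions in this sketch.

```python
# Sketch: frame-wise causal attention mask for video tokens.
import torch

def framewise_causal_mask(n_frames: int, tokens_per_frame: int) -> torch.Tensor:
    n = n_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame
    # Allow attention where the key's frame is not after the query's frame.
    return frame_id[:, None] >= frame_id[None, :]

mask = framewise_causal_mask(n_frames=3, tokens_per_frame=2)
print(mask.int())
# Rows 0-1 (frame 0) see only frame 0; rows 4-5 (frame 2) see all frames.
# A frame-isolated variant for decoding would additionally zero out
# cross-frame blocks so generation cannot leak across target frames.
```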
💡Research Ideas
• Mask-Free Autoregressive Video Pretraining at GPT Efficiency: Design objectives/architectures that avoid trivial copying without masking to recover GPT-like training efficiency for video.
• Joint Generation-and-Representation Optimization for Video Foundation Models: Multi-objective or curriculum strategies to simultaneously reach high-fidelity generation and strong downstream representations, resolving the noted trade-off (page 15, Limitations).
• Long-Context Autoregressive Pretraining for Minute-Scale Video Understanding: Hierarchical predictors and memory-efficient attention to extend frame-isolated decoding to long videos and complex temporal reasoning.
• Unified Multimodal Next-Frame Prediction with Audio/Text Conditioning: Incorporate audio and language conditions into the flow-matching decoder to enrich temporal semantics and cross-modal grounding.
• Video-Native Tokenizers for Flow-Matching Targets: Develop improved video VAEs or tokenizers (beyond image VAEs and current video VAEs) that yield stable, precise latents and better pretraining signals.
From Word to World: Can Large Language Models be Implicit Text-based World Models?
Abstract
Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundaries for when world modeling effectively supports agent learning.
🎯Research Motivation
• Agentic RL faces an experience bottleneck: real environments are non-adaptive, limited in coverage, costly, and hard to scale, hindering experience-driven scaling.
• It is unclear whether and when LLMs can reliably serve as world models that maintain coherent latent state and improve agents’ learning and safety.
• Prior LLM-as-simulator approaches often rely on domain-specific, structured outputs or zero/few-shot prompting, yielding limited accuracy and poor transfer to open-ended settings.
• Existing evaluations focus on single-step prediction; they rarely test long-horizon rollout consistency, distribution shift robustness, or practical agent utility.
• There is no unified framework connecting next-token prediction to next-state prediction for multi-turn interaction, nor clear data/model scaling laws for text-based world models.
🔧Research Method
Reformulate world modeling as multi-turn next-state prediction in text and train LLMs via supervised fine-tuning on large interaction trajectories across five environments, evaluating short-term fidelity (EM/F1) and long-horizon rollout fidelity using a WM→Real consistency protocol (Real/WM/W2R/CR). Use the trained world models for pre-execution action verification, synthetic trajectory generation, and early-experience warm-start RL, while analyzing scalability (data/model size), OOD generalization, cross-environment joint training, and mixed-agent behavioral coverage.
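The core reformulation is mechanical: interaction logs become (history, next state) SFT pairs, and single-step fidelity reduces to string matching. The sketch below assumes a flat (state, action, next_state) log format and a simple prompt template, which are illustrative rather than the paper's exact protocol.

```python
# Sketch: cast interaction logs into next-state-prediction pairs and score EM.
def to_sft_pairs(trajectory):
    """trajectory: list of dicts with 'state', 'action', 'next_state' strings."""
    pairs, history = [], []
    for step in trajectory:
        history.append(f"STATE: {step['state']}\nACTION: {step['action']}")
        prompt = "\n".join(history) + "\nNEXT STATE:"
        pairs.append({"prompt": prompt, "target": step["next_state"]})
    return pairs

def exact_match(predict, pairs) -> float:
    """predict: callable mapping a prompt to a predicted next-state string."""
    hits = sum(predict(p["prompt"]).strip() == p["target"].strip() for p in pairs)
    return hits / len(pairs)

traj = [{"state": "door closed", "action": "open door", "next_state": "door open"},
        {"state": "door open", "action": "walk through", "next_state": "in hallway"}]
pairs = to_sft_pairs(traj)
print(exact_match(lambda _: "door open", pairs))  # 0.5 on this toy log
```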
💡Research Ideas
• Grounded Text World Models: Reducing Simulation Drift with Partial Real Observations: Incorporate real-environment anchors (e.g., initial search results) during rollouts to stabilize open-ended domains like WebShop.
• Generalist World Models via Cross-Environment Pretraining: Jointly pretrain on diverse text environments to discover transferable dynamics and serve multiple tasks with a single model.
• Behaviorally Diverse Training for Robust WM-to-Real Transfer: Curate mixed-agent trajectory corpora to broaden behavioral coverage and improve consistency under policy shift.
• Uncertainty-Aware Action Verification for Safe Agents: Calibrate world models (ensembles/MC-dropout) to gate irreversible actions with confidence-aware pass/fail decisions.
• Retrieval- and Tool-Augmented World Models for Open-Ended Tasks: Fuse retrieval or structured tool signals to handle long-tail, compositional dynamics beyond fixed schemas.
• From Text to Multimodal World Models: Extend next-state prediction from text to vision and embodied domains for richer, grounded dynamics.
• Co-Training Agents and World Models with Shared Objectives: Jointly optimize planning agents and simulators to reduce simulator-agent mismatch and improve sample efficiency.
• Benchmarks and Metrics for Long-Horizon Consistency: Develop standardized datasets and measures beyond EM (e.g., executable rollouts, CR variants, repairability) for robust evaluation.
Streaming Video Instruction Tuning
Abstract
We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
🎯Research Motivation
• Offline video LLMs process complete, bounded clips and cannot handle continuous, unbounded video streams or decide when to respond under real-time latency constraints.
• Existing streaming approaches split perception/decision with external controllers, creating accuracy–efficiency trade-offs, higher latency, and weak coupling between perception and response.
• Prior streaming methods often focus narrowly on real-time narration using special tokens, failing to balance silence/standby/response states or generalize to diverse tasks (grounding, event/action captioning, time-sensitive QA).
• Heterogeneous, temporally inconsistent datasets hinder precise temporal alignment and unified instruction-following across tasks.
• Severe class imbalance in streaming supervision (Silence dominates) causes models to over-predict silence and miss correct response timing.
• Current streaming benchmarks are predominantly QA-style and do not adequately test broader instruction-following across mixed task types.
🔧Research Method
Streamo is an end-to-end streaming video LLM that predicts per-timestep response states as special tokens (silence, standby, and response), coupling perception with response timing so the model itself decides when to remain silent and when to answer, without an external controller. Trained on Streamo-Instruct-465K with unified multi-task supervision, it handles real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive QA in a single model.
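The motivation notes that silence labels dominate streaming supervision; one common remedy is inverse-frequency class weighting over the state tokens, sketched below. The three-token vocabulary and the weighting scheme are assumptions, not the paper's confirmed recipe.

```python
# Sketch: class-weighted supervision over per-timestep response-state tokens.
import torch
import torch.nn.functional as F

STATES = ["<silence>", "<standby>", "<response>"]
labels = torch.tensor([0, 0, 0, 0, 1, 2, 0, 0])  # mostly silence
logits = torch.randn(8, len(STATES))             # one prediction per timestep

counts = torch.bincount(labels, minlength=len(STATES)).float()
weights = counts.sum() / (len(STATES) * counts.clamp(min=1))  # inverse frequency
loss = F.cross_entropy(logits, labels, weight=weights)
print(weights, loss)  # rare standby/response states get up-weighted
```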
💡Research Ideas
• Infinite-Horizon Streamo: KV-Cache Management and Token Pruning for Real-Time Video LLMs: Integrate sliding-window attention, cache eviction, and visual token pruning to scale to unbounded streams with low latency and memory.
• Learning When to Speak: Reinforcement Learning for Response Timing and Proactive Streaming Interaction: Optimize silence/standby/response decisions and proactive alerts with task- and latency-aware rewards beyond supervised state labeling.
• Label-Efficient Temporal Supervision: Self-/Weakly-Supervised Streaming Instruction Tuning for Event Boundaries: Automatically mine temporal boundaries and response states from large unlabeled streams to expand multi-task streaming datasets.
Multi-hop Reasoning via Early Knowledge Alignment
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for Large Language Models (LLMs) to address knowledge-intensive queries requiring domain-specific or up-to-date information. To handle complex multi-hop questions that are challenging for single-step retrieval, iterative RAG approaches incorporating reinforcement learning have been proposed. However, existing iterative RAG systems typically plan to decompose questions without leveraging information about the available retrieval corpus, leading to inefficient retrieval and reasoning chains that cascade into suboptimal performance. In this paper, we introduce Early Knowledge Alignment (EKA), a simple but effective module that aligns LLMs with the retrieval corpus before planning in iterative RAG systems by providing contextually relevant retrieved knowledge. Extensive experiments on six standard RAG datasets demonstrate that by establishing a stronger reasoning foundation, EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency. Our analysis from an entropy perspective demonstrates that incorporating early knowledge reduces unnecessary exploration during the reasoning process, enabling the model to focus more effectively on relevant information subsets. Moreover, EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models. Generalization tests across diverse datasets and retrieval corpora confirm the robustness of our approach. Overall, EKA advances the state-of-the-art in iterative RAG systems while illuminating the critical interplay between structured reasoning and efficient exploration in reinforcement learning-augmented frameworks. The code is released at https://github.com/yxzwang/EarlyKnowledgeAlignment.
🎯Research Motivation
• Single-step RAG struggles with multi-hop questions because relevant evidence is rarely retrievable in one shot, leading to frequent failures on knowledge-intensive queries.
• Iterative RAG often plans first without awareness of what the retriever can actually fetch, causing plan failure in the initial think step and cascading retrieval errors.
• RL-based iterative RAG wastes budget on high-entropy, unfocused exploration when initial reasoning is ungrounded, degrading both efficiency and final answer quality.
• Many RL pipelines rely heavily on innate LLM reasoning or SFT data quality; misalignment with the retrieval corpus yields redundant or suboptimal reasoning paths.
• Lack of early contextual grounding amplifies compounding errors during multi-hop reasoning, hurting both retrieval precision and downstream accuracy and efficiency.
🔧Research Method
Early Knowledge Alignment (EKA) injects a small set of top-k retrieved passages before the first planning step, grounding the initial think so the agent iteratively Think-Search-Answer with lower-entropy, retrieval-aware trajectories. It is training-free or RL-compatible (GRPO/PPO), retriever-agnostic, and empirically improves retrieval precision, reduces turns, and raises end-to-end accuracy.
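At inference time the mechanism amounts to a prompting change: retrieve for the raw question and prepend the passages before the first planning step. The template and retriever interface below are illustrative assumptions.

```python
# Sketch: Early Knowledge Alignment as a prompt-construction step.
def eka_prompt(question: str, retrieve, k: int = 3) -> str:
    passages = retrieve(question, k)  # grounding happens BEFORE planning
    knowledge = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return (
        f"Early knowledge (top-{k} retrieved passages):\n{knowledge}\n\n"
        f"Question: {question}\n"
        "Plan your reasoning using the passages above, then iterate "
        "Think -> Search -> Answer.\nThink:"
    )

toy_corpus = ["Paris is the capital of France.",
              "The Seine flows through Paris.",
              "France is in Europe."]
toy_retrieve = lambda q, k: toy_corpus[:k]  # stand-in for a real retriever
print(eka_prompt("Which river flows through the capital of France?", toy_retrieve))
```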
💡Research Ideas
• EKA for Deep Research Agents in Open Web Environments: Extend early knowledge alignment to long-horizon, browsing-based research with budgeted, multi-session planning and dynamic web sources.
• Entropy-Guided RL Objectives for Iterative RAG: Design RL losses that explicitly regulate exploration entropy conditioned on retrieved context to focus search and minimize cascading plan errors.
• Co-Training Retrievers and Policies with Early Knowledge Signals: Jointly optimize retriever and generator using EKA-derived signals to align retrieval distributions with policy needs and maximize information gain per token.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Abstract
Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or implementing a small feature. However, real-world software engineering is fundamentally a long-horizon endeavor: developers must interpret high-level requirements, plan coordinated changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark that evaluates agents on this long-horizon software evolution challenge. Constructed from release notes and version histories of seven mature open-source Python projects, SWE-EVO comprises 48 evolution tasks that require agents to implement multi-step modifications spanning an average of 21 files, validated against comprehensive test suites averaging 874 tests per instance. Experiments with state-of-the-art models reveal a striking capability gap: even GPT-5 with OpenHands achieves only a 21 percent resolution rate on SWE-EVO, compared to 65 percent on the single-issue SWE-Bench Verified. This demonstrates that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a fine-grained metric that captures partial progress toward solving these complex, long-horizon tasks.
🎯Research Motivation
• Existing benchmarks emphasize isolated, single-issue fixes and underrepresent long-horizon software evolution that spans many files, commits, and versions (Figure 2 p.3; Section 1).
• Real-world SE requires interpreting high-level requirements (SRS), planning multi-step modifications, coordinating cross-module changes, and preserving functionality under heavy regression testing; maintenance dominates industry work and AI adoption is high (Section 1).
• SWE-Bench shows signs of saturation and may inflate performance via incomplete fixes, limited test coverage, or data contamination; its binary scoring obscures partial progress (Sections 1, 3.2).
• Current agents struggle with sustained, multi-file reasoning and instruction following; strong models drop from ~65% on SWE-Bench Verified to ~21% on SWE-EVO (Tables 2–3, pp.10–11).
• There is a need for a benchmark with longer specifications and broader changes: SWE-EVO tasks average 2390-word specs, 21 files edited, 610 lines changed, and 874 tests per instance across 48 tasks and 7 repos (Table 1 p.6; Figure 4 p.9), with difficulty correlated to multiple PRs per instance (Figure 5 p.10; Figure 7 p.14).
🔧Research Method
Introduce SWE-EVO, a benchmark built from release notes and versioned snapshots of 7 Python projects where agents must evolve a codebase from one release tag to the next, validated by large FAIL_TO_PASS and PASS_TO_PASS test suites while remaining SWE-Bench-compatible. Propose Fix Rate, a soft metric that credits the fraction of failing tests fixed without regressions, complementing Resolved Rate and Patch Apply Rate (Section 3).
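A minimal sketch of the Fix Rate idea follows: credit the fraction of FAIL_TO_PASS tests that now pass, conditioned on no PASS_TO_PASS regressions. The zero-on-regression rule is one plausible reading of "fixed without regressions", not the paper's verified formula.

```python
# Sketch: a soft Fix Rate metric over per-test outcomes.
def fix_rate(fail_to_pass_results: dict, pass_to_pass_results: dict) -> float:
    """Each dict maps test id -> True (passes after the patch) / False."""
    if not all(pass_to_pass_results.values()):
        return 0.0  # any regression voids credit (assumed rule)
    fixed = sum(fail_to_pass_results.values())
    return fixed / max(len(fail_to_pass_results), 1)

f2p = {"test_new_api": True, "test_new_cli": False, "test_migration": True}
p2p = {"test_existing_core": True, "test_existing_io": True}
print(fix_rate(f2p, p2p))  # 0.666...: two of three target tests fixed, no regressions
```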
💡Research Ideas
• Soft-Score RL for Regression-Safe Code Evolution: Train agents to optimize Fix Rate with strict no-regression constraints using verifier feedback and trajectory signals on SWE-EVO.
• SRS-Grounded Planning for Multi-PR Codebase Upgrades: Parse release notes into structured intermediate plans that map requirements to code changes across files and commits to improve instruction following.
• Difficulty-Aware Multi-Agent Orchestration for Long-Horizon SE: Adapt agent roles and search depth based on difficulty estimators (e.g., PR count) to better coordinate navigation, patching, and verification at scale.
PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation
Abstract
In this work, we introduce PhononBench, the first large-scale benchmark for dynamical stability in AI-generated crystals. Leveraging the recently developed MatterSim interatomic potential, which achieves DFT-level accuracy in phonon predictions across more than 10,000 materials, PhononBench enables efficient large-scale phonon calculations and dynamical-stability analysis for 108,843 crystal structures generated by six leading crystal generation models. PhononBench reveals a widespread limitation of current generative models in ensuring dynamical stability: the average dynamical-stability rate across all generated structures is only 25.83%, with the top-performing model, MatterGen, reaching just 41.0%. Further case studies show that in property-targeted generation (illustrated here by band-gap conditioning with MatterGen) the dynamical-stability rate remains as low as 23.5% even at the optimal band-gap condition of 0.5 eV. In space-group-controlled generation, higher-symmetry crystals exhibit better stability (e.g., cubic systems achieve rates up to 49.2%), yet the average stability across all controlled generations is still only 34.4%. An important additional outcome of this study is the identification of 28,119 crystal structures that are phonon-stable across the entire Brillouin zone, providing a substantial pool of reliable candidates for future materials exploration. By establishing the first large-scale dynamical-stability benchmark, this work systematically highlights the current limitations of crystal generation models and offers essential evaluation criteria and guidance for their future development toward the design and discovery of physically viable materials. All model-generated crystal structures, phonon calculation results, and the high-throughput evaluation workflows developed in PhononBench will be openly released at https://github.com/xqh19970407/PhononBench.
🎯Research Motivation
• Existing evaluations of generative crystal models emphasize thermodynamic stability (e.g., Ehull) but neglect dynamical stability, which determines whether a structure can physically exist (imaginary phonon modes indicate instability)
• First-principles phonon calculations (DFPT/supercell) are too costly for large-scale assessment, so models are rarely tested for phonon stability at scale
• Current models frequently generate structures that appear stable thermodynamically but are dynamically unstable, undermining reliability and synthesizability
• Lack of a standardized, scalable benchmark and metric to compare models on dynamical stability across architectures, datasets, and conditioning settings
• Limited understanding of how symmetry constraints and property conditioning (e.g., band gap) affect dynamical stability and novelty
🔧Research Method
PhononBench conducts large-scale, standardized phonon-based dynamical stability evaluation of AI-generated crystals by coupling the MatterSim uMLIP (DFT-level phonon accuracy) with Phonopy to screen 108,843 relaxed structures from six generative models, using absence of imaginary modes as the stability criterion. It reports a unified dynamical-stability rate metric, analyzes property- and symmetry-conditioned generation, and releases 28,119 phonon-stable structures and all workflows for reproducibility.
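The stability criterion itself is compact: a structure is kept only if no phonon branch has an imaginary frequency (conventionally reported as negative) anywhere on a sampled Brillouin-zone mesh. The sketch below uses the public Phonopy API; the file name, mesh density, and tolerance are assumptions, and the force constants would come from MatterSim rather than DFT.

```python
# Sketch: phonon-based dynamical-stability check with Phonopy.
import numpy as np
import phonopy

def is_dynamically_stable(yaml_path: str, mesh=(20, 20, 20), tol=-0.05) -> bool:
    ph = phonopy.load(yaml_path)      # structure + force constants
    ph.run_mesh(list(mesh))           # sample the Brillouin zone
    freqs = ph.get_mesh_dict()["frequencies"]  # THz; imaginary -> negative
    return bool(np.min(freqs) > tol)  # small tolerance for numerical noise

# print(is_dynamically_stable("phonopy_params.yaml"))  # needs a real input file
```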
💡Research Ideas
• Phonon-Guided Diffusion Models for Crystal Generation: Integrate differentiable phonon predictors or guidance into the generative loop to directly enforce dynamical stability during sampling
• Soft-Mode-Aware Auto-Relaxation for AI-Generated Crystals: Follow unstable phonon modes to symmetry-broken minima to systematically convert dynamically unstable outputs into stable polymorphs
• Finite-Temperature PhononBench: Benchmarking Dynamical Stability under Temperature and Pressure with MLIPs: Extend the benchmark to quasi-harmonic/anharmonic regimes to assess stability landscapes beyond 0 K and ambient conditions
LLM Swiss Round: Aggregating Multi-Benchmark Performance via Competitive Swiss-System Dynamics
Abstract
The rapid proliferation of Large Language Models (LLMs) and diverse specialized benchmarks necessitates a shift from fragmented, task-specific metrics to a holistic, competitive ranking system that effectively aggregates performance across multiple ability dimensions. Primarily using static scoring, current evaluation methods are fundamentally limited. They struggle to determine the proper mix ratio across diverse benchmarks, and critically, they fail to capture a model's dynamic competitive fitness or its vulnerability when confronted with sequential, high-stakes tasks. To address this, we introduce the novel Competitive Swiss-System Dynamics (CSD) framework. CSD simulates a multi-round, sequential contest where models are dynamically paired across a curated sequence of benchmarks based on their accumulated win-loss record. Monte Carlo simulation (N=100,000 iterations) is then used to approximate the statistically robust Expected Win Score (E[S_m]), which eliminates the noise of random pairing and early-round luck. Furthermore, we implement a Failure Sensitivity Analysis by parameterizing the per-round elimination quantity (T_k), which allows us to profile models based on their risk appetite, distinguishing between robust generalists and aggressive specialists. We demonstrate that CSD provides a more nuanced and context-aware ranking than traditional aggregate scoring and static pairwise models, representing a vital step towards risk-informed, next-generation LLM evaluation.
🎯Research Motivation
• Fragmented multi-benchmark landscape makes model selection hard; practitioners need a single, holistic ranking for deployment while manual inspection is infeasible (pp. 1–2).
• Arbitrary weighting in static aggregation: rankings depend on heuristic task weights across diverse benchmarks, lacking an objective mix ratio (pp. 2–3).
• Static averages hide path dependence; early foundational failures block downstream capabilities (illustrated in Figure 2), so "excel later" cannot compensate for "fail early".
• Static pairwise models (Elo/Bradley–Terry) ignore sequential dynamics and failure risk; they do not model elimination pressure or path-dependent pairing (Sec. 3).
• Tournament/pairing luck introduces high variance; need a statistically robust estimate that removes random pairing effects (Abstract; Sec. 2.4).
• Lack of risk profiling: current methods cannot distinguish robust generalists from aggressive specialists under varying failure penalties (Abstract; Sec. 2.5).
🔧Research Method
Competitive Swiss-System Dynamics (CSD): simulate a multi-round Swiss tournament over a sequenced set of benchmarks using a precomputed pairwise win‑rate tensor, with zero-point byes and per‑round elimination to enforce path dependence and failure penalties. Use Monte Carlo (many runs) to estimate each model’s Expected Win Score and vary the elimination parameter to conduct Failure Sensitivity Analysis that profiles robustness versus fragility.
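A minimal sketch of the CSD estimator follows: Monte Carlo simulation of a Swiss tournament driven by a pairwise win-rate matrix, pairing by accumulated score each round and averaging final win counts. Pairing tie-breaks, byes, and the absence of elimination (T_k = 0) are simplifying assumptions.

```python
# Sketch: Monte Carlo estimate of Expected Win Scores in a Swiss tournament.
import random

def simulate_swiss(win_rate, n_rounds=5, n_sims=10_000, seed=0):
    """win_rate[i][j] = P(model i beats model j). Returns E[S_m] per model."""
    rng = random.Random(seed)
    n = len(win_rate)
    totals = [0.0] * n
    for _ in range(n_sims):
        scores = [0] * n
        for _ in range(n_rounds):
            order = sorted(range(n), key=lambda m: (scores[m], rng.random()),
                           reverse=True)           # pair adjacent standings
            for a, b in zip(order[::2], order[1::2]):
                winner = a if rng.random() < win_rate[a][b] else b
                scores[winner] += 1
        for m in range(n):
            totals[m] += scores[m]
    return [t / n_sims for t in totals]

wr = [[0.5, 0.6, 0.7, 0.8],
      [0.4, 0.5, 0.6, 0.7],
      [0.3, 0.4, 0.5, 0.6],
      [0.2, 0.3, 0.4, 0.5]]
print(simulate_swiss(wr))  # expected win scores, strongest model first
```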
💡Research Ideas
• Agentic CSD: Predicting Real-World Agent Performance via Sequential Benchmark Curricula: Map multi-step agent workflows to ordered benchmark sequences and test how CSD rankings predict agent task success while controlling for contamination.
• Risk‑Calibrated CSD: Learning Elimination Schedules and Benchmark Orders from Deployment Logs: Fit elimination intensity (T_k) and sequencing/weights from real failure data to align Expected Win Scores with observed operational reliability.
• Uncertainty‑Aware CSD: Bayesian Estimation of Win‑Rate Tensors and Confidence Intervals for E[Sm]: Model uncertainty in pairwise win rates, propagate it through the Monte Carlo, and report calibrated intervals and variance‑reduced estimators.