Latent Implicit Visual Reasoning
Abstract
While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.
Research Motivation
• LMMs are overly text-centric, projecting images once and then reasoning only in language space, which limits expressive, spatial visual reasoning on perception-heavy tasks.
• Textual chain-of-thought intermediates struggle to capture rich, spatially structured abstractions that are hard to verbalize.
• Prior interleaved approaches depend on explicit visual supervision (e.g., boxes, crops, depth/helper images), imposing human biases, incurring heavy annotation cost, and generalizing poorly across tasks.
• Many tasks lack clear, well-defined intermediate visual targets, making supervised intermediates hard to specify and non-scalable.
• Existing models exhibit language bias and cannot adaptively re-encode visual information for task-specific reasoning.
• There is a need for a task-agnostic mechanism that discovers useful visual abstractions without extra data or explicit intermediate labels.
Research Method
LIVR introduces trainable latent visual tokens and a visual bottleneck attention mask that forces all visual information to flow through these tokens. Training proceeds in two stages: (1) masked bottleneck to make latents encode task-relevant visual features, and (2) standard masking to jointly use image and latent tokens, with loss computed only on answer tokens and lightweight LoRA tuning of the language backbone.
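A minimal sketch of how such a visual bottleneck attention mask could be constructed, assuming a sequence layout of [image tokens | latent tokens | text tokens]; the layout, the helper name `bottleneck_attention_mask`, and the two-stage switch are illustrative assumptions, not the paper's released code.

```python
import torch

def bottleneck_attention_mask(n_img, n_latent, n_text, stage="bottleneck"):
    """Boolean attention mask (True = may attend), sequence laid out as
    [image tokens | latent tokens | text tokens].

    Stage 1 ("bottleneck"): text tokens cannot see image tokens directly,
    so all visual information must flow through the latent tokens, which
    attend globally over the image.
    Stage 2 ("standard"): text tokens may attend to both image and latents.
    """
    n = n_img + n_latent + n_text
    img = slice(0, n_img)
    lat = slice(n_img, n_img + n_latent)
    txt = slice(n_img + n_latent, n)

    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[img, img] = True                      # image tokens see each other
    mask[lat, img] = True                      # latents read the whole image
    mask[lat, lat] = True                      # latents see each other
    mask[txt, lat] = True                      # text reads the latents
    mask[txt, txt] = torch.ones(n_text, n_text).tril().bool()   # causal text
    if stage == "standard":
        mask[txt, img] = True                  # stage 2: text also sees the image
    return mask

print(bottleneck_attention_mask(4, 2, 3).int())
```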
Research Ideas
• Interpretable Latent Visual Reasoning: Probing and Editing Visual Latents for Human-Understandable Concepts: Develop methods to visualize, cluster, and steer latent tokens so their visual abstractions become interpretable and editable.
• Scaling LIVR: Larger Models, More Latents, and Longer Visual Compute for Complex Perception Tasks: Study scaling laws for latent capacity, number of tokens, and compute allocation across diverse, harder benchmarks.
• Dynamic Latent Allocation: Adaptive Token Routing and Iterative Refinement for Visual Reasoning: Introduce mechanisms that allocate/iterate latent tokens on-demand (e.g., pause or recurrent passes) to match task difficulty.
• Semi-Supervised Visual Bottlenecks: Combining Implicit Latents with Optional Sparse Visual Hints: Explore hybrid training that remains task-agnostic but can incorporate occasional, cheap visual hints when available.
• Video-LIVR: Extending Latent Bottlenecks to Temporal and Action-Centric Reasoning: Generalize latent bottlenecking to video and vision-language-action settings for spatiotemporal reasoning and planning.
Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning
Abstract
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
Research Motivation
• Token-by-token exploration in autoregressive models is inefficient for long-horizon, sparse-reward tasks; success requires multiple correct tokens before any reward, making learning impractically slow (Fig. 8).
• Existing hierarchical RL struggles to discover reusable, temporally-abstract actions (options) from unlabeled data; methods like option-critic/CompILE can degenerate and need careful switching regularization, often failing in practice (baseline failures in Fig. 8).
• Standard RL post-training of pretrained models (e.g., GRPO-like) provides too little signal because successful trajectories are astronomically rare under token-level sampling; credit assignment over long horizons is poor (Fig. A5).
• Current activation steering methods typically require supervision (e.g., truthfulness/persona labels) and do not discover abstract actions unsupervised; they also do not target temporal abstraction.
• Although pretrained models contain rich internal structure, directly doing RL in residual-stream space is a high-dimensional continuous control problem and is extremely hard to optimize.
• Real-world problems demand compositional generalization (recombining subgoals in novel orders and lengths), which token-level policies and non-hierarchical RL do not exploit (Fig. 2, Fig. 4).
Research Method
Freeze a pretrained autoregressive model and attach a non-causal metacontroller (a recurrent hypernetwork) that reads residual-stream activations and emits mid-depth linear residual controllers from latent codes with a learned switching unit, trained via a variational next-action objective with a KL prior. Then perform "internal RL" by treating the base model as the environment and learning a causal policy over the latent controller codes and switches, reducing action dimensionality and horizon to enable efficient exploration and credit assignment (Fig. 1, 5, 8).
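A minimal sketch of what such a non-causal metacontroller over residual-stream activations could look like, assuming a bidirectional GRU encoder, a reparameterized latent code per chunk, a rank-1 linear controller, and a sigmoid switching unit; all names, shapes, and the rank-1 form are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class Metacontroller(nn.Module):
    """Non-causal hypernetwork that reads residual activations and emits a
    latent code, a linear residual controller, and a termination (switch) signal."""

    def __init__(self, d_model=64, d_latent=8):
        super().__init__()
        self.encoder = nn.GRU(d_model, d_model, batch_first=True, bidirectional=True)
        self.to_latent = nn.Linear(2 * d_model, 2 * d_latent)    # mean and log-variance
        self.hyper = nn.Linear(d_latent, 2 * d_model)            # rank-1 controller params (u, v)
        self.switch = nn.Linear(2 * d_model, 1)                  # learned termination unit

    def forward(self, resid):                                    # resid: (B, T, d_model)
        h, _ = self.encoder(resid)
        mu, logvar = self.to_latent(h.mean(dim=1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterized latent code
        u, v = self.hyper(z).chunk(2, dim=-1)                    # controller parameters
        stop = torch.sigmoid(self.switch(h)).squeeze(-1)         # per-step switch probability
        return z, (u, v), stop, (mu, logvar)

    @staticmethod
    def apply_controller(resid_t, u, v):
        """Steer one residual-stream state with the rank-1 linear controller."""
        return resid_t + u * (resid_t * v).sum(-1, keepdim=True)

mc = Metacontroller()
resid = torch.randn(2, 10, 64)
z, (u, v), stop, _ = mc(resid)
steered = Metacontroller.apply_controller(resid[:, -1], u, v)
print(z.shape, stop.shape, steered.shape)
```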
Research Ideas
• Latent-Thought Internal RL for LLM Reasoning: Apply internal RL in the latent controller space to mathematical and program-synthesis reasoning, compressing chains-of-thought into temporally abstract latent actions for efficient credit assignment.
• Stacked Metacontrollers for Multi-Level Temporal Abstraction: Learn multi-tier metacontrollers that discover and coordinate abstractions across several timescales (skills → subroutines → strategies) for complex compositional tasks.
• Safe and Interpretable Internal Control via Sparse Autoencoders and JEPA: Combine metacontrollers with sparse/structured activation models to yield human-interpretable, safe, and verifiable long-horizon interventions in foundation models.
• Co-Training Without Collapse: Regularization and Optimization Schemes that Preserve Emergent Abstractions: Develop priors, constraints, or training curricula that retain subgoal-aligned switching when co-training the base model and controller (motivated by the rate-distortion gap in Fig. 7).
• Theory of Internal RL Variance Reduction and Credit Assignment: Formalize the variance and sample-complexity benefits of acting in latent controller spaces versus token- or residual-level RL, extending the analysis in Appendix E.2.
• From Sim to Real: Internal RL for Vision-Language-Action Robotics: Integrate perception and language conditioning to transfer temporally abstract latent actions to real robotic manipulation and navigation under sparse rewards.
Spatia: Video Generation with Updatable Spatial Memory
Abstract
Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
Research Motivation
• Long-term spatial and temporal consistency in video generation is hard because video tokens are dense/high-dimensional, preventing models from attending to long histories the way LLMs can (page 2)
• Existing bidirectional diffusion models are limited to short clips; autoregressive methods mitigate the length limit but still lack explicit spatial memory and accumulate errors over time (pages 2-3)
• Prior "memory" approaches often assume static scenes, failing to generate realistic dynamic entities while preserving scene consistency (pages 2-3, Figure 1a)
• Camera control in current systems is indirect (latent conditioning), leading to inaccurate/unstable control; a geometrically grounded mechanism is needed (page 3, Figure 1c)
• There is no scalable mechanism for revisiting locations with coherent geometry across iterations, limiting applications like world models, game generation, and embodied AI (pages 2-3)
• Lack of 3D-aware interactive editing: users cannot reliably edit scene structure and see consistent effects in generated videos (page 3, Figure 1d)
Research Method
Spatia maintains an explicit, updatable 3D scene point cloud as persistent spatial memory and iteratively generates video clips conditioned on this memory, prior frames, and a rendered view-specific point-cloud projection along a user-specified camera path; after each clip, the memory is updated via visual SLAM (MapAnything) while masking dynamics to keep the static scene. A Wan2.2-based multi-modal diffusion transformer with ControlNet and flow matching fuses text, retrieved spatially overlapping reference frames, preceding clips, and scene projections to achieve dynamic-static disentanglement, spatially consistent generation, and explicit camera control.
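The outer generation loop could look roughly like the sketch below, with placeholder stand-ins for the clip generator, the point-cloud renderer, and the SLAM update (MapAnything in the paper); every function name here is a hypothetical stand-in, not the released API.

```python
import numpy as np

def render_memory(points, camera_pose, hw=(8, 8)):
    """Stand-in: project the 3D memory into the requested view."""
    return np.zeros(hw)                      # a real renderer would rasterize `points`

def generate_clip(text, prev_frames, memory_render):
    """Stand-in for the Wan2.2-based diffusion transformer with ControlNet."""
    return [memory_render + np.random.randn(*memory_render.shape) * 0.01
            for _ in range(4)]               # four dummy frames

def slam_update(memory_points, clip, dynamic_mask):
    """Stand-in for MapAnything: add newly observed *static* geometry only."""
    new_points = np.random.rand(16, 3)       # pretend-reconstructed points
    return np.vstack([memory_points, new_points[~dynamic_mask[:16]]])

memory = np.empty((0, 3))                    # persistent static point-cloud memory
frames = []
camera_path = [np.eye(4) for _ in range(3)]  # user-specified poses, one per clip

for pose in camera_path:
    view = render_memory(memory, pose)                          # view-specific projection
    clip = generate_clip("a cat walks through the room", frames[-4:], view)
    dyn_mask = np.zeros(16, dtype=bool)                         # mask out dynamic entities
    memory = slam_update(memory, clip, dyn_mask)                # grow the static memory
    frames.extend(clip)

print(len(frames), memory.shape)
```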
Research Ideas
• End-to-End Learned Spatial Memory for Video Generation: Replace external SLAM with a differentiable, jointly trained memory module that learns to build, update, and compress 3D scene memory during generation.
• Unified 4D Dynamic-Static World Memory for Consistent Video: Extend Spatia's static point-cloud memory to a factorized 4D representation that co-models persistent scene structure and dynamic entities with motion priors.
• Differentiable Camera Control and Path Planning in Memory-Aware Video Models: Learn camera path optimization jointly with generation to maximize spatial consistency and content coverage, using gradients through the rendering/conditioning pipeline.
Schoenfeld's Anatomy of Mathematical Reasoning by Language Models
Abstract
Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.
Research Motivation
• Outcome-centric evaluations (accuracy/length) obscure how LLMs structure their reasoning traces, limiting interpretability and diagnosis.
• No standardized, scalable mapping from chain-of-thought text to functional steps (analysis, exploration, execution, verification, monitoring), making it unclear which parts do what.
• Overthinking persists: longer or more elaborate reasoning often fails to improve correctness, yet existing metrics cannot explain when/why this happens.
• Prior analyses are token-level or small/siloed (single model/dataset), lacking an intermediate, theory-grounded representation to compare models and reasoning styles at scale.
• Efficiency methods are assessed by length, not by which evaluative/feedback behaviors they remove, hindering principled optimization of reasoning cost.
Research Method
ThinkARM: a scalable, sentence-level annotation framework grounded in Schoenfeld's Episode Theory that labels reasoning into eight episodes (Read, Analyze, Plan, Implement, Explore, Verify, Monitor, Answer) using an LLM annotator guided by a human-verified guidebook. It enables large-scale lexical, temporal, and transition analyses (e.g., MI-based episode n-grams) and diagnostics such as correctness prediction from episode ratios/transitions and profiling of efficiency methods.
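Once sentences are tagged with episodes, the downstream analyses reduce to simple statistics over label sequences; a minimal sketch, assuming a hand-written example trace in place of the LLM annotator's output:

```python
from collections import Counter

EPISODES = ["Read", "Analyze", "Plan", "Implement", "Explore", "Verify", "Monitor", "Answer"]

# Hypothetical sentence-level labels for one reasoning trace; in ThinkARM these
# come from an LLM annotator guided by the human-verified guidebook.
trace = ["Read", "Analyze", "Plan", "Implement", "Monitor", "Explore",
         "Implement", "Verify", "Answer"]

def episode_ratios(labels):
    """Fraction of sentences spent in each episode (a per-trace feature)."""
    counts = Counter(labels)
    return {e: counts.get(e, 0) / len(labels) for e in EPISODES}

def transition_counts(labels):
    """First-order episode-to-episode transition counts, the basis for n-gram / MI analyses."""
    return Counter(zip(labels, labels[1:]))

print(episode_ratios(trace))
print(transition_counts(trace).most_common(3))
```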
Research Ideas
• Episode-Aware RL for Reasoning LMs: Optimize exploration → monitor/analysis routing and verification loops with episode-level rewards to boost correctness and reduce overthinking.
• Beyond Math: Generalizing Episode-Level Reasoning to Multimodal and Agentic Tasks: Extend the taxonomy and ThinkARM analyses to science, coding, tool-use, and vision-language settings, testing the "cognitive heartbeat" across domains.
• Mechanistic Mapping of Episodes to Internal States: Align episode tags with activations/attention to identify circuits for analysis/execution/verification and enable episode-conditioned decoding or early-exit controllers.
How Much 3D Do Video Foundation Models Encode?
Abstract
Videos are continuous 2D projections of 3D worlds. After training on large video data, will global 3D understanding naturally emerge? We study this by quantifying the 3D understanding of existing Video Foundation Models (VidFMs) pretrained on vast video data. We propose the first model-agnostic framework that measures the 3D awareness of various VidFMs by estimating multiple 3D properties from their features via shallow read-outs. Our study presents meaningful findings regarding the 3D awareness of VidFMs on multiple axes. In particular, we show that state-of-the-art video generation models exhibit a strong understanding of 3D objects and scenes, despite not being trained on any 3D data. Such understanding can even surpass that of large expert models specifically trained for 3D tasks. Our findings, together with the 3D benchmarking of major VidFMs, provide valuable observations for building scalable 3D models.
Research Motivation
• 3D data scarcity limits scaling of 3D foundation models; it is unclear whether large video-only models (VidFMs) naturally learn usable 3D structure and egomotion from 2D videos.
• Existing approaches that leverage video models for 3D rely on 3D fine-tuning, explicit 3D memory, post-optimization, or task-specific engineering, leading to 3D-inconsistency artifacts and poor out-of-domain generalization.
• Prior evaluations mainly target image models or use indirect proxies (e.g., relative depth, multi-view feature consistency, generator benchmarks), lacking a model-agnostic, quantitative probe of global 3D awareness.
• It is unknown where 3D information concentrates within video diffusion models (which layers/timesteps) and how factors like temporal reasoning, model scale, and 3D fine-tuning affect 3D awareness.
• In practice, it is important to know whether VidFM features can improve feedforward 3D reconstruction under limited 3D supervision compared to standard image features (e.g., DINO).
Research Method
Freeze a video foundation model, extract per-frame spatial features (for diffusion models: mid-layer activations at early-but-not-first denoising timesteps), and train a shallow alternating-attention transformer probe to directly predict dense 3D point maps, depth maps, and camera poses from four sampled frames. Evaluate 3D awareness via point/pose/depth errors on CO3Dv2 and DL3DV with controls (per-frame DINO, native 3D Fast3R), plus ablations over temporal reasoning, model scale, and layer-timestep choices.
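A minimal sketch of a shallow alternating-attention probe over frozen features, assuming inputs of shape (batch, frames, tokens, dim) and illustrative layer counts, dimensions, and output heads; this is not the paper's released probe.

```python
import torch
import torch.nn as nn

class AlternatingAttentionProbe(nn.Module):
    """Shallow read-out that alternates within-frame (spatial) and across-frame
    (temporal) attention, then predicts point maps, depth, and per-frame pose."""

    def __init__(self, d=256, depth=2, heads=4):
        super().__init__()
        self.spatial = nn.ModuleList(
            [nn.TransformerEncoderLayer(d, heads, batch_first=True) for _ in range(depth)])
        self.temporal = nn.ModuleList(
            [nn.TransformerEncoderLayer(d, heads, batch_first=True) for _ in range(depth)])
        self.point_head = nn.Linear(d, 3)     # dense 3D point map (per token)
        self.depth_head = nn.Linear(d, 1)     # dense depth (per token)
        self.pose_head = nn.Linear(d, 7)      # per-frame pose (quaternion + translation)

    def forward(self, feats):                 # feats: (B, F, N, D), frozen VidFM features
        B, F, N, D = feats.shape
        x = feats
        for sp, tp in zip(self.spatial, self.temporal):
            x = sp(x.reshape(B * F, N, D)).reshape(B, F, N, D)      # attend within a frame
            x = tp(x.permute(0, 2, 1, 3).reshape(B * N, F, D))      # attend across frames
            x = x.reshape(B, N, F, D).permute(0, 2, 1, 3)
        points = self.point_head(x)                                  # (B, F, N, 3)
        depth = self.depth_head(x).squeeze(-1)                       # (B, F, N)
        pose = self.pose_head(x.mean(dim=2))                         # (B, F, 7)
        return points, depth, pose

probe = AlternatingAttentionProbe()
frozen_feats = torch.randn(1, 4, 196, 256)    # e.g. 4 frames of 14x14 tokens
pts, dep, pose = probe(frozen_feats)
print(pts.shape, dep.shape, pose.shape)
```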
Research Ideas
• VidFM-VGGT++: Scaling Feedforward 3D Reconstruction with Frozen Video Generator Features: Train large VGGT-style models on diverse 3D datasets using frozen VidFM features to study data efficiency, generalization to dynamic scenes, and performance ceilings versus DINO-based pipelines.
• Time Matters: Learning Optimal Layer-Timestep Readouts for 3D-Aware Diffusion Features: Meta-learn or distill readout schedules that select layers/timesteps per scene to maximize 3D awareness, with self-supervised geometric objectives that preserve synthesis quality.
• Domain-Adaptive 3D Fine-Tuning for Video Diffusion: Design 3D-aware fine-tuning objectives and regularizers (e.g., cross-domain distillation, anti-forgetting constraints) that improve geometry while maintaining out-of-domain generalization beyond the fine-tuning data.
VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
Abstract
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment means generated token sequences may decode into low-quality images, without direct supervision from the pixel space. We propose VA-π, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-π formulates generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize over the discrete token space, VA-π introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward measures how well the predicted token sequences reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The prior term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-π enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains on the GenEval text-to-image benchmark for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.
Research Motivation
• Tokenizer-generator misalignment: AR models are trained only for token likelihood while tokenizers are trained for pixel reconstruction, causing high-likelihood token sequences to decode into low-quality, off-manifold images.
• Lack of pixel-space supervision: Directly optimizing pixel-level likelihood is intractable; current AR training provides no direct pixel guidance, leading to artifacts and degraded perceptual quality.
• Limits of existing fixes: Noisy-context or order-randomization (generator-centric) and tokenizer post-training (tokenizer-centric) do not directly align to pixel space, can require costly retraining, and may oversmooth reconstructions.
• STE shortcomings: The Straight-Through Estimator only updates along ground-truth paths under teacher forcing, limiting exploration and leaving free-running sampling behavior misaligned.
• RL overhead today: AR-GRPO-style methods need external reward models, reference policies, and expensive free-running rollouts, increasing cost and instability.
Research Method
VA-π casts discrete token sequences as latent variables and derives an ELBO that unifies pixel-space reconstruction (decoder likelihood under teacher forcing) with a prior regularization to preserve token modeling, then optimizes it via RL by treating reconstruction quality (MSE + LPIPS) as the intrinsic reward and using a lightweight next-token prediction regularizer with contextual noise. Implemented with GRPO, it avoids free-running rollouts and external reward models, enabling fast, data-efficient post-training while keeping the tokenizer frozen.
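A minimal sketch of the intrinsic reward and a GRPO-style group advantage, assuming a stand-in tokenizer decoder and omitting the LPIPS term (which would come from a perceptual-loss package); names and shapes are illustrative, not the released implementation.

```python
import torch

def decode(tokens, hw=(3, 32, 32)):
    """Stand-in for the frozen tokenizer decoder (maps token ids to an image)."""
    g = torch.Generator().manual_seed(int(tokens.sum()) % 2**31)
    return torch.rand(*hw, generator=g)

def intrinsic_reward(pred_tokens, image):
    """Pixel-space reward: how well the predicted tokens reconstruct the image."""
    recon = decode(pred_tokens)
    mse = torch.mean((recon - image) ** 2)
    return -mse                          # higher is better; a real setup would add -LPIPS

def group_advantages(rewards):
    """GRPO-style advantage: normalize rewards within a group of samples."""
    r = torch.stack(rewards)
    return (r - r.mean()) / (r.std() + 1e-6)

image = torch.rand(3, 32, 32)
group = [torch.randint(0, 16384, (256,)) for _ in range(4)]   # 4 sampled token sequences
rewards = [intrinsic_reward(t, image) for t in group]
print(group_advantages(rewards))
```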
Research Ideas
• Joint Variational Alignment of Tokenizer and AR Generator: Co-train tokenizer and AR under a stabilized pixel-aware ELBO with anti-oversmoothing constraints and adaptive posteriors to further shrink the generator-tokenizer gap.
• Preference-Augmented VA-π: Blend pixel reconstruction rewards with human-aligned objectives (e.g., CLIP/HPS or RLHF) to jointly optimize perceptual fidelity and semantic/prompt alignment without sacrificing diversity.
• VA-π for Video and 3D Autoregression: Extend pixel-aware policy alignment to spatiotemporal and multi-view settings with rewards enforcing temporal consistency, geometry coherence, and cross-view reconstruction fidelity.
GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Abstract
Multi-turn reinforcement learning (RL) for multi-modal agents built upon vision-language models (VLMs) is hampered by sparse rewards and long-horizon credit assignment. Recent methods densify the reward by querying a teacher that provides step-level feedback, e.g., Guided Thought Reinforcement (GTR) and On-Policy Distillation, but rely on costly, often privileged models as the teacher, limiting practicality and reproducibility. We introduce GTR-Turbo, a highly efficient upgrade to GTR that matches GTR's performance without training or querying an expensive teacher model. Specifically, GTR-Turbo merges the weights of checkpoints produced during the ongoing RL training, and then uses this merged model as a "free" teacher to guide the subsequent RL via supervised fine-tuning or soft logit distillation. This design removes dependence on privileged VLMs (e.g., GPT or Gemini), mitigates the "entropy collapse" observed in prior work, and keeps training stable. Across diverse visual agentic tasks, GTR-Turbo improves the accuracy of the baseline model by 10-30% while reducing wall-clock training time by 50% and compute cost by 60% relative to GTR.
Research Motivation
• Multi-turn RL for VLM agents faces sparse rewards, long-horizon credit assignment, and noisy environments, causing thought/entropy collapse and unstable training.
• Existing dense-process guidance methods (e.g., GTR) rely on large external teacher models (GPT/Gemini), incurring high API cost, latency, and potential inaccessibility, and making performance hinge on teacher quality.
• There is a need for an efficient, scalable, and reproducible approach that provides step-level guidance without external models, reduces wall-clock time and compute, and preserves exploration in complex visual agent tasks (e.g., Points24, ALFWorld).
Research Method
GTR-Turbo merges historical RL checkpoints via TIES-merging (with SMA/EMA weighting) to form a stronger "free" teacher, then trains the student with PPO plus either online SFT of teacher thoughts or a reverse-KL token-level distillation reward from teacher logits, eliminating external models while stabilizing learning.
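A minimal sketch of the two key ingredients, under simplifying assumptions: a plain EMA merge over checkpoint state dicts stands in for TIES-merging, and a reverse-KL token-level reward is computed from teacher logits; names are illustrative, not the released code.

```python
import torch
import torch.nn.functional as F

def ema_merge(checkpoints, decay=0.5):
    """Exponential moving average over a list of state dicts (oldest first).
    A stand-in for TIES-merging with SMA/EMA weighting."""
    merged = {k: v.clone() for k, v in checkpoints[0].items()}
    for ckpt in checkpoints[1:]:
        for k in merged:
            merged[k] = decay * merged[k] + (1 - decay) * ckpt[k]
    return merged

def reverse_kl_reward(student_logits, teacher_logits):
    """Per-token reward = -KL(student || teacher), computed from logits."""
    log_p = F.log_softmax(student_logits, dim=-1)     # student policy
    log_q = F.log_softmax(teacher_logits, dim=-1)     # merged "free" teacher
    kl = (log_p.exp() * (log_p - log_q)).sum(-1)      # reverse KL per token
    return -kl                                        # higher when student matches teacher

ckpts = [{"w": torch.randn(4, 4)} for _ in range(3)]  # toy stand-ins for RL checkpoints
teacher_weights = ema_merge(ckpts)
r = reverse_kl_reward(torch.randn(2, 5, 10), torch.randn(2, 5, 10))
print(teacher_weights["w"].shape, r.shape)
```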
Research Ideas
• Adaptive Checkpoint Merging for Self-Evolving VLM Agents: Learn task- or uncertainty-aware selection/weighting of checkpoints (beyond fixed SMA/EMA) and layer-wise pruning to further boost teacher quality and stability.
• Cold-Start GTR-Turbo via Hybrid Teachers: Combine brief external supervision or process reward models with merged-teacher guidance to overcome low initial capability, then anneal to fully self-contained training.
• Theoretical Guarantees for Reverse-KL Thought Guidance: Develop analysis of optimization/exploration dynamics and convergence for PPO with merged-teacher reverse-KL rewards, including low-variance, unbiased estimators.