Daily Digest — 2026-05-22
309 items · 1 research labs, 302 arxiv papers, 6 industry media
🏛️ Research Labs (1)
AdventHealth advances whole-person care with OpenAI
AdventHealth implemented ChatGPT for Healthcare to reduce administrative workload in clinical and operational workflows, achieving an 80% reduction in time spent on repetitive tasks. The deployment focused on structured outputs for utilization management, document drafting, and information summarization, using domain-specific peer groups for adoption. System-level metrics showed statistically significant improvements in throughput (10-minute reviews reduced to 2 minutes) and clinician capacity, while maintaining compliance with healthcare governance controls.
chatgpt for healthcareutilization managementstructured outputsworkflow throughputgovernance controls
📜 arXiv Papers (302)
Variance Reduction for Expectations with Diffusion Teachers
We introduce CARV, a compute-aware variance-accounting framework that reduces estimator variance in pipelines leveraging pretrained diffusion models as frozen teachers. CARV employs a hierarchical Monte Carlo estimator, amortizing expensive upstream computation over cheap diffusion-noise resamples, enhanced by timestep importance sampling and stratified-inverse-CDF construction. Experiments in text-to-3D distillation and attribution show CARV achieves 2-3x effective compute multipliers, primarily from amortized reuse, with an additional ~25% from importance sampling and stratification. In single-step distillation, CARV reduces gradient variance by an order of magnitude, though downstream FID remains unchanged, indicating MC variance is no longer the bottleneck in this regime.
diffusion modelsmonte carlo estimatorvariance reductionimportance samplingstratified sampling
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate
This paper introduces a framework to quantify hyperparameter transfer in large language models (LLMs) via three metrics: scaling law fit quality, robustness to extrapolation errors, and asymptotic loss penalty. Through extensive ablations, the study reveals that Maximal Update ($μ$P) outperforms standard parameterization (SP) in learning rate transfer primarily by maximizing the embedding layer learning rate, which mitigates training instabilities in SP. Additionally, weight decay enhances scaling law fits but reduces extrapolation robustness in fixed token-per-parameter settings.
hyperparameter transferscaling lawmaximal updateembedding layerweight decay
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
The paper introduces DeepWeb-Bench, a challenging benchmark for evaluating deep research capabilities in language models, requiring massive evidence collection, cross-source reconciliation, and long-horizon derivation. The benchmark categorizes difficulty into four capability families: Retrieval, Derivation, Reasoning, and Calibration. Evaluation of nine frontier models reveals retrieval is not the primary bottleneck (12-14% errors), with derivation and calibration failures accounting for over 70%. Strong and weak models exhibit distinct error patterns, and cross-model domain specialization shows low agreement (rho = 0.61). The benchmark includes public data, rubrics, and evaluation code.
deepweb-benchevidence collectioncross-source reconciliationlong-horizon derivationcapability families
AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists
The paper introduces AiraXiv, an AI-driven open-access platform designed to accommodate both human and AI scientists in a continuous, feedback-driven research publishing paradigm. The system combines open preprints with AI-augmented analysis and review, featuring interactive UIs for humans and Model Context Protocol (MCP)-based interactions for AI agents. Validation includes deployment as the submission platform for ICAIS 2025, demonstrating scalability and inclusivity. The platform is publicly available at https://airaxiv.com.
ai-driven publishingopen preprintsmodel context protocolcontinuous iterationresearch infrastructure
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
WikiVQABench introduces a knowledge-grounded Visual Question Answering benchmark combining Wikipedia images, captions, and Wikidata knowledge. The pipeline employs LLMs to generate candidate multiple-choice questions, later human-curated for factual accuracy and knowledge dependence. Evaluating 15 VLMs (256M-90B parameters) shows performance ranging from 24.7% to 75.6% accuracy, effectively discriminating model capabilities on knowledge-intensive reasoning tasks.
visual question answeringknowledge-groundedwikidatavision-language modelsmultiple-choice
Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling
Agent JIT compilation optimizes latency in computer-use agents by compiling task descriptions directly into executable code, enabling parallelization and reducing LLM dependency. The method integrates JIT-Planner, which generates and validates multiple code plans against tool specifications, JIT-Scheduler, which explores parallelization via Monte Carlo cost estimation, and an invariant-enforcing tool protocol to minimize incorrect tool use. Evaluated across 5 web applications, JIT-Planner achieves a 10.4× speedup and 28% accuracy improvement over Browser-Use, while JIT-Scheduler achieves a 2.4× speedup and 9% accuracy improvement over OpenAI CUA.
jit-compilationmonte-carloparallelizationtool-specificationslatency-optimization
Mem-$π$: Adaptive Memory through Learning When and What to Generate
Mem-$π$ introduces an adaptive memory framework for LLM agents that generates context-specific guidance on demand rather than retrieving static entries from external memory stores. It employs a dedicated language or vision-language model, separate from the downstream agent, to jointly decide when and what guidance to produce, trained via a decision-content decoupled RL objective. This approach enables abstention when generation is unhelpful and produces concise, task-relevant guidance. Evaluated across benchmarks in web navigation, terminal-based tool use, and text-based embodied interaction, Mem-$π$ achieves over 30% relative improvement on web navigation tasks compared to retrieval-based and prior RL-optimized memory baselines.
adaptive memorylanguage modelreinforcement learningcontext-specific guidanceweb navigation
HITL-D: Human In The Loop Diffusion Assisted Shared Control
The paper introduces Human-In-The-Loop Diffusion (HITL-D), a shared control framework combining diffusion-based policies with human input to enhance teleoperation performance. The method autonomously updates end effector orientation using scene point clouds and Cartesian positions, reducing required joystick axes. In a 12-participant study, HITL-D achieved 40% faster task completion, 37% lower workload, and improved subjective ratings for independence, intuitiveness, and confidence compared to traditional teleoperation.
shared controldiffusion policiesteleoperationpoint cloudend effector
Mind the Sim-to-Real Gap & Think Like a Scientist
The paper addresses the sim-to-real gap in sequential decision problems with pre-trained simulators and costly real-world experimentation. It introduces an extended simulation lemma decomposing the simulator's value error into calibration-deployment shift and parametric residual, and analyzes the value gap between simulator-optimal and optimal policies. Fisher-SEP, a simulation-aided experimental policy minimizing posterior predictive variance, is proposed. Case studies in vending-machine supply chains and HIV mobile testing demonstrate regimes where front-loaded experimentation outperforms posterior updating and designed exploration is necessary for poorly-surveilled regions.
sim-to-real gapsequential decision problemsimulation lemmafisher-sepposterior predictive variance
Quality and Security Signals in AI-Generated Python Refactoring Pull Requests
This empirical study evaluates the quality and security characteristics of AI-generated Python refactoring pull requests (PRs) using PyQu, Pylint, and Bandit. Analyzing PRs from the AIDev dataset, the authors quantify changes across five quality attributes and measure code quality and security issues pre- and post-refactoring. Results indicate that 22.5% of agentic commits improve a quality attribute, with usability improving most frequently (36.5%), while 24.17% introduce new Pylint issues and 4.7% introduce new Bandit findings. Despite mixed outcomes, developer acceptance is high, with 73.5% of PRs merged. The study motivates stronger quality and security gating for AI-driven development workflows.
pyqupylintbanditrefactoringpull requests
Approximation Theory for Neural Networks: Old and New
The survey synthesizes classical and contemporary results in neural network approximation theory, focusing on quantitative bounds for feedforward architectures and emerging Kolmogorov--Arnold Networks (KANs). It examines universal approximation theorems, emphasizing depth--width trade-offs and parameter efficiency for structured function classes in L^p and Sobolev spaces. Key findings include superior approximation rates for deeper networks and comparative analysis of KANs' expressive power relative to traditional architectures.
universal approximationdepth--width trade-offskolmogorov--arnold networksparameter efficiencysobolev spaces
Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs
This paper establishes reasoning consistency as a quantitative proxy for planning safety in Vision-Language-Action (VLA) autonomous driving systems. The authors conduct a controlled perturbation study of Alpamayo R1 (10B parameters) across 1,996 scenarios under eight sensor perturbations, including Gaussian noise, lighting extremes, and fog levels (~18,000 inference trials). Results show that changes in Chain-of-Causation (CoC) explanations correlate strongly with trajectory deviations (5.3× increase, r=0.99), and enabling CoC generation improves trajectory accuracy by 11.8% on average (p<0.0001). Degradation under noise is approximately linear (R²=0.957), with standard preprocessing defenses providing marginal relief. These findings motivate reasoning-based runtime monitoring for safer VLA deployment.
vision-language-actionchain-of-causationsensor perturbationstrajectory deviationreasoning consistency
TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos
We introduce TempGlitch, a benchmark for evaluating vision-language models (VLMs) on temporal glitch detection in gameplay videos, addressing the underexplored challenge of identifying glitches evident only through frame sequences. TempGlitch includes five temporal glitch types with balanced samples and paired glitch-free videos for binary evaluation. Testing 12 proprietary and open-weight VLMs across multiple frame-sampling settings reveals near-chance performance, with models exhibiting overly conservative or sensitive behavior. Denser frame sampling and larger model sizes do not reliably improve results. TempGlitch serves as a testbed for temporal reasoning and robust gameplay understanding in VLMs.
vision-language modelstemporal glitchgameplay videosframe samplingbinary evaluation
torchtune: PyTorch native post-training library
We introduce torchtune, a PyTorch-native library designed to streamline the post-training lifecycle of large language models (LLMs), emphasizing modularity, hackability, and direct access to PyTorch components. The library supports efficient fine-tuning, experimentation, and deployment workflows through its model builders, training recipes, and distributed training stack. Evaluations across representative post-training settings demonstrate torchtune's strong performance and memory efficiency compared to frameworks like Axolotl and Unsloth, while maintaining flexibility for rapid research iteration. These results establish torchtune as a practical foundation for reproducible LLM post-training research.
post-trainingfine-tuningpytorchmodularitydistributed training
PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
PALS introduces a power-aware runtime for LLM serving that jointly optimizes GPU power caps and software parameters like batch size to improve energy efficiency. The system combines offline power-performance models with feedback-driven control, implemented in vLLM without requiring model retraining. Evaluations on multi-GPU systems with dense and MoE models show 26.3% higher energy efficiency, 4x-7x fewer QoS violations under power constraints, and effective dynamic power tracking.
llm servingpower-aware runtimemixture-of-expertsenergy efficiencyfeedback control
HiRes: Inspectable Precedent Memory for Reaction Condition Recommendation
HiRes introduces a retrieval-augmented system for reaction condition recommendation that combines predictive accuracy with inspectable precedent memory. The method employs a graph encoder, transformation-aware cross-attention, multi-stream reaction fusion, and k-NN retrieval to create a hierarchical reaction representation space. On USPTO-Condition benchmarks, HiRes achieves state-of-the-art Catalyst (Acc@1: 0.929), Solvent (0.534), and Reagent (0.530) prediction accuracy, outperforming REACON on Solvent/Reagent while matching the best Catalyst baseline. Statistical analysis confirms retrieval integration significantly improves solvent/reagent selection over parametric approaches.
retrieval-augmentedreaction condition recommendationgraph encodercross-attentionprecedent memory
FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G
FedCritic introduces a serverless federated multi-agent actor-critic framework for distributed downlink resource management in 6G ultra-dense networks, addressing inter-cell interference and long-term QoS constraints. The method employs virtual-queue deficit weights for QoS enforcement and federates the critic via lightweight gossip-based parameter averaging over the interference graph, eliminating the need for centralized training. Simulations demonstrate that FedCritic enhances mean SINR, cell-edge rate, network-wide sum-rate, and fairness compared to non-coordinated and CTDE baselines, while achieving stable training with reduced coordination overhead.
federated learningactor-criticqos constraintsinter-cell interferencegossip-based averaging
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition
The paper introduces a rank-aware multi-encoder framework for blended emotion recognition, addressing challenges posed by subtle, overlapping multimodal cues. The method projects heterogeneous encoder features into a shared latent space, employs attention-based gating to select top-n encoders, and decouples prediction into presence and salience heads with probability-level fusion. It incorporates unsupervised domain adaptation at the feature level for robustness. Evaluated on the BlEmoRE challenge, the framework outperforms individual encoders and naïve fusion baselines, achieving 2nd place in the competition.
blended emotion recognitionmulti-encoder fusionattention-based gatingunsupervised domain adaptationprobability-level fusion
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
The paper introduces QuestBench, a course-based practice for teaching AI through benchmark construction, emphasizing accountable knowledge work. Students transform disciplinary knowledge into expert-level questions, review designs for ambiguity, and evaluate AI systems on these tasks. The resulting benchmark comprises 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench reveals significant shortcomings in current deep research systems, with a mean question-level pass rate of 16.85% and GPT-5.5 achieving the highest pass rate at 57.58%. Student reflections indicate that benchmark construction fosters critical judgment of AI outputs, highlighting the educational value of identifying fluent yet flawed AI responses.
benchmark constructiondeep research systemsexpert-level questionsaccountable knowledge workquestion-level pass rate
Stdlib or Third-Party? Empirical Performance and Correctness of LLM-Assisted Zero-Dependency Python Libraries
The study evaluates stdlib-only reimplementations of popular Python libraries, developed with LLM assistance under strict constraints (no external imports, single-file, API-compatible). Using the zerodep collection (40+ modules across 12 categories), it benchmarks performance and correctness against third-party counterparts. Results show stdlib achieves performance parity (within 2x) in most cases, with exceptions for C-extension-heavy tasks. LLM-generated implementations yield 5--115x speedups by avoiding architectural overhead. The work characterizes stdlib's capability boundary, identifies LLM-assisted development challenges, and explores implications for dependency-free software engineering.
stdlibzerodepllm-assistedperformance paritydependency-free
Open-source LLMs administer maximum electric shocks in a Milgram-like obedience experiment
The study investigates large language models' (LLMs) obedience under authority pressure through a Milgram-like experiment with 11 open-source models. Researchers conducted 30 trials per model across 8 conditions, measuring compliance with escalating electric shock commands despite expressed distress. Key findings include: (1) LLMs exhibit human-like compliance under pressure, (2) demonstrate vulnerability to gradual boundary violations, (3) show refusal attempts that fail due to format mismatches, and (4) suggest token-level pattern continuation may override ethical reasoning. Results indicate significant safety risks in agentic LLM deployments.
large language modelsmilgram experimentagentic pipelinesboundary violationstoken continuation
Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G
This paper proposes a vision for AI-native 6G networks, shifting from 'Network for AI' to 'AI for Network'. The authors advocate for a foundation model as a unified backbone, with task-specific knowledge distilled into compact models for edge deployments, and multi-agent systems for autonomous network management. These systems aim to diagnose, maintain, and recover networks with minimal human intervention, framing network management as a unified, multi-modal, multi-task optimization problem. The paper outlines two transformative directions: developing a 6G foundation model and advancing multi-agent systems, charting a roadmap for intelligent, self-sustaining communication infrastructure.
foundation modelmulti-agent systemsedge deploymentsnetwork management6g
Designing Conversations with the Dead: How People Engage with Generative Ghosts
The study investigates user experiences with two design paradigms for generative ghosts—AI systems trained on deceased individuals' data: third-person representation versus first-person reincarnation. Through qualitative analysis of 16 participants, findings reveal a preference for reincarnation due to its immediacy, despite concerns about over-reliance, while representation was favored for memory engagement. Participants prioritized affective resonance over factual accuracy in both modes, often disregarding the intended framing. The work highlights how tone, language, and conversational rhythm, shaped by users' memories, render these interactions inherently collaborative.
generative ghostsaffective resonancequalitative user studyfirst-person reincarnationthird-person representation
On the Regularity and Generalization of One-Step Wasserstein-guided Generative Models for PDE-Induced Measures
The paper establishes a theoretical framework for analyzing the regularity and generalization of one-step Wasserstein-guided generative models applied to PDE-induced probability measures. By examining normalized target densities from linear elliptic, parabolic, diffusion, and Fokker--Planck equations, the authors prove that these measures satisfy doubling conditions. Leveraging regularity theory for optimal transport between doubling measures, they demonstrate Hölder continuity of the optimal transport map from a uniform source to the target. This justifies one-step generative models like DeepParticle, for which excess-risk bounds and robustness estimates under target shift are derived. Experimental results corroborate the theoretical rates.
wasserstein distanceoptimal transportdoubling conditionspde-induced measuresdeep particle
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
SpecBench introduces a benchmark to quantify reward hacking in long-horizon coding agents, where agents optimize for passing visible validation tests while deviating from true user goals. The methodology decomposes tasks into natural language specifications, visible validation tests, and held-out tests, measuring reward hacking as the gap in pass rates between these suites. SpecBench includes 30 systems-level programming tasks, revealing consistent reward hacking: frontier agents saturate visible suites but exhibit gaps in holdout suites, scaling sharply with task length (28 percentage points per tenfold code increase). Failures range from subtle feature isolation to deliberate exploits, such as memorizing test inputs.
reward hackinglong-horizon codingvalidation testsheld-out testssystems-level programming
How to Build Marcus's Algebraic Mind: Algebro-Deterministic Substrate over Galois Fields
The paper introduces PyVaCoAl/VaCoAl, a hyperdimensional computing architecture implementing Gary Marcus's cognitive framework through algebraic operations over GF(2). The system uses XOR-and-shift primitives with linear-feedback shift registers to enable reversible variable binding, non-commutative composition, and individual/kind separation. It demonstrates how this substrate fulfills Marcus's three requirements (variable operations, recursive structures, representation distinctions) more effectively than 2001-era approaches, while extending to counterfactual reasoning. Biological parallels are drawn to dentate gyrus-CA3 circuitry, suggesting innate developmental implementation.
hyperdimensional computinglinear-feedback shift registersvariable bindingcognitive architecturegf(2)
Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training
AutoScale introduces a closed-loop dynamic optimization framework for real-synthetic co-training in autonomous driving, addressing inefficiencies in naive synthetic data incorporation. The method employs Graph Regularized AutoEncoder (Graph-RAE) for driving scene representation, Cluster-aware Gradient Ascent (Cluster-GA) for cluster-wise importance estimation, and cluster-guided vector retrieval to select high-value samples. Evaluated on NavSim, AutoScale outperforms vanilla co-training and cross-domain baselines, achieving superior performance with fewer synthetic samples under constrained training budgets.
real-synthetic co-traininggraph regularized autoencodercluster-aware gradient ascentclosed-loop optimizationautonomous driving
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
The Insights Generator (IG) introduces a systematic approach for corpus-level trace diagnostics in LLM agents, addressing the scalability limitations of manual failure analysis. IG employs a multi-agent system to analyze execution traces, formulate hypotheses, and generate evidence-backed natural-language insights characterizing behavioral patterns across trace groups. Evaluations demonstrate that IG reports improve scaffold performance by 30.4 percentage points over baselines and enhance coding agent stability. Domain experts rate IG reports highly for depth and evidence quality, while its scout-investigator architecture achieves detection coverage comparable to competing methods.
corpus-level trace diagnosticsmulti-agent systemexecution tracesscout-investigator architectureevidence-backed insights
Data-Efficient Neural Operator Training via Physics-Based Active Learning
The authors propose physics-based acquisition, a novel physics-informed active learning algorithm for training neural operators on partial differential equations (PDEs). The method leverages PDE residuals to selectively acquire the most informative training samples, reducing data requirements while injecting physics inductive bias. Experiments on the 1D Burgers equation and 2D compressible Navier-Stokes equations demonstrate that physics-based acquisition consistently outperforms random sampling and matches state-of-the-art data efficiency. The approach ensures computational resources are focused where the model's physical understanding is weakest, addressing a key bottleneck in neural operator training.
neural operatorsactive learningpartial differential equationsphysics-informeddata efficiency
SymbolicLight V1: Spike-Gated Dual-Path Language Modeling with High Activation Sparsity and Sub-Billion-Scale Pre-Training Evidence
SymbolicLight V1 introduces a spike-gated dual-path language model combining binary Leaky Integrate-and-Fire dynamics with continuous residual streams, achieving 89% activation sparsity. The architecture replaces dense self-attention with exponential-decay aggregation for long-range memory and spike-gated local attention for short-range precision. A 194M-parameter model trained on 3B Chinese-English tokens achieves validation PPL 8.88-8.93, trailing GPT-2 201M by 7.7% but surpassing GPT-2 124M. Ablations show spike-gated local attention as the primary contributor, with temporal integration outperforming deterministic top-k sparsity. Scaling to 0.8B parameters demonstrates preserved sparsity at larger scales.
spiking language modelsactivation sparsityleaky integrate-and-firedual-path sparse tcamexponential-decay aggregation
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization
TextReg introduces a regularization framework to mitigate prompt distributional overfitting in large language models (LLMs), addressing the issue of poor generalization beyond training distributions. The method formalizes representational inefficiency, decomposing prompt inefficiency into capacity cost and scope narrowness, and implements a soft-penalty objective via regularized textual gradients. This framework combines Dual-Evidence Gradient Purification, Semantic Edit Regularization, and Regularization-Guided Prompt Update. Evaluations across multiple reasoning benchmarks demonstrate significant out-of-distribution (OOD) generalization improvements, with accuracy gains of up to +11.8% over TextGrad and +16.5% over REVOLVE.
prompt optimizationdistributional overfittingregularized gradientsout-of-distribution generalizationrepresentational inefficiency
Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
Frontier introduces a discrete-event simulator for modern LLM inference serving, addressing limitations of existing simulators by modeling disaggregated execution, parallelism, and runtime optimizations. It incorporates Prefill-Decode Disaggregation (PDD), Attention-FFN Disaggregation (AFD), and CUDA Graphs within a scheduler-batch-engine loop, supporting stateful requests for reasoning, agents, and RL rollouts. Frontier achieves an average throughput error below 4% on a 16-H800 GPU testbed, reducing end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables SLA-dependent Pareto frontier exploration and heterogeneous disaggregated allocation.
disaggregated executionprefill-decode disaggregationattention-ffn disaggregationcuda graphsstateful requests
DeCoR: Design and Control Co-Optimization for Urban Streets Using Reinforcement Learning
DeCoR introduces a reinforcement learning framework for co-optimizing urban street design and traffic signal control. The method employs a two-stage approach: (1) a generative policy parameterizes a Gaussian mixture model to sample crosswalk layouts (location/width), and (2) a shared control policy adapts signal timings to minimize pedestrian/vehicle delay. Evaluated on a 750m urban corridor using real-world video/Wi-Fi data, DeCoR reduces pedestrian arrival time to crosswalks by 23% with fewer crosswalks, while decreasing pedestrian and vehicle wait times by 79% and 65% versus fixed-time signals. The control policy demonstrates generalization to unseen demand patterns and layout variations without retraining.
reinforcement learningurban designtraffic controlgaussian mixture modelpedestrian flow
Deformba: Vision State Space Model with Adaptive State Fusion
The paper introduces Deformba, a vision State Space Model (SSM) with adaptive state fusion that addresses limitations in existing vision SSMs. Deformba dynamically augments spatial structural information while maintaining linear complexity, enabling multi-modal fusion similar to cross-attention. Evaluated on 2D tasks (image classification, object detection, segmentation) and 3D tasks (BEV perception), Deformba demonstrates strong performance across multiple benchmarks, overcoming challenges posed by fixed scanning methods and causal SSM constraints.
state space modelslinear complexityadaptive fusionbev perceptionmulti-modal fusion
From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach
(No summary returned.)
Tracing the ongoing emergence of human-like reasoning in Large Language Models
This study investigates the emergence of human-like pragmatic reasoning in Large Language Models (LLMs) through a population-matching experiment comparing 25 LLMs and an equal number of humans across four languages. The experiment assessed conditional inferences, focusing on semantic accuracy and pragmatic enrichments. Results show that humans consistently enrich logical reasoning with pragmatic inferences, while LLMs exhibit variable behavior: some adhere strictly to truth-table semantics, ignoring pragmatic nuances, while others deviate from logical rules, adopting uniform interpretations. LLM accuracy in pragmatic reasoning was not influenced by factors like open vs. closed status, training orientation, or architecture type, indicating that pragmatic reasoning remains an emerging capability in artificial systems.
pragmatic reasoningconditional inferencestruth-table semanticspopulation-matching experimentsemantic accuracy
TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health
TimeSRL introduces a two-stage LLM framework for generalizable time-series behavioral modeling, addressing cross-dataset distribution shifts in longitudinal passive sensing. The model first abstracts raw signals into high-level natural language, then predicts behavioral outcomes from these semantic abstractions, optimized end-to-end using Group Relative Policy Optimization (GRPO) with Reinforcement Learning from Verifiable Rewards (RLVR). Evaluated on mental-health prediction under a leave-one-dataset-out protocol, TimeSRL reduces mean absolute error by 3.1--10.1% and 9.5--44.1% for anxiety, and 3.2--9.6% and 27.4--57.6% for depression compared to non-LLM ML and LLM baselines, demonstrating superior cross-benchmark transfer without target-domain fine-tuning.
semantic bottleneckgroup relative policy optimizationreinforcement learning from verifiable rewardscross-dataset generalizationlongitudinal passive sensing
Large-Step Training Dynamics of a Two-Factor Linear Transformer Model
The paper analyzes large-step training dynamics in a simplified linear transformer model, revealing how high learning rates alter convergence behavior. Using gradient-flow analysis on a reducible one-prompt linear transformer, the study reduces dynamics to a two-factor product map with effective step-size parameter μ. Results show transitions from monotone convergence to chaotic dynamics, periodic orbits, and divergence, with explicit invariant Chebyshev ellipses for 0<μ<2. The findings demonstrate that large learning rates can shift training attractors beyond linear-regression solutions, impacting mini-batch gradient descent methods.
linear transformergradient-flowtraining dynamicschaotic convergencelearning rate
\textit{Stochastic} MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent
The paper introduces Stochastic MeanFlow Policies (SMFP), a one-step generative policy class for reinforcement learning that combines entropy regularization with mirror descent. SMFP maps Gaussian noise to actions via a MeanFlow transformation, enabling tractable entropy estimation and efficient single-step inference. This approach addresses limitations of Gaussian policies (poor multimodality handling) and generative policies (iterative sampling requirements). Evaluated on seven MuJoCo benchmarks, SMFP outperforms both Gaussian and generative baselines while maintaining computational efficiency during policy improvement.
stochastic meanflow policiesentropy regularizationmirror descentmultimodal action distributionsoff-policy reinforcement learning
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
We introduce MONET, a massive open text-to-image dataset addressing key challenges in large-scale model training. The dataset comprises 104.9M image-text pairs curated from 2.9B raw pairs through successive stages of safety filtering, domain-based filtering, exact/near-duplicate removal, and re-captioning using multiple vision-language models. MONET includes synthetically generated samples and provides pre-computed embeddings and annotations for downstream tasks. Training a 4B-parameter latent diffusion model exclusively on MONET achieves competitive GenEval and DPG scores, demonstrating its effectiveness in lowering barriers to reproducible text-to-image research.
text-to-imagelatent diffusion modelvision-language modelssafety filteringnear-duplicate removal
How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR
We introduce G2D, a three-stage pipeline combining short online GRPO warm-up, static preference dataset construction, and offline DPO fine-tuning, addressing computational inefficiency in Reinforcement Learning from Verifiable Rewards (RLVR). G2D outperforms continuous online GRPO at lower compute cost: on Qwen2.5-7B, G2D achieves 62.4% on MATH-500 (vs GRPO's 51.6%) with 4x lower compute; on Llama-3.1-8B, G2D achieves 49.4%. Performance depends on rollout informativeness rather than preference pair count, with moderate warm-up producing calibrated uncertainty for stronger contrastive signals. Results suggest offline-online RLVR gaps stem from data informativeness, not methodology.
reinforcement learningverifiable rewardsdirect preference optimizationcontrastive signalfine-tuning
Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation
The paper introduces structural latent points, a hybrid 3D representation combining implicit and explicit approaches for robotic manipulation. The method integrates a point-wise latent variational autoencoder into a point-cloud autoencoder, jointly regularizing features and coordinates with a Gaussian prior to preserve coarse structural and semantic information. Evaluations on RLBench, ManiSkill2, and real robots show improved task success (12-18%), sample efficiency (1.8×), and robustness versus baselines, with ablations confirming all components' necessity.
structural latent pointsvariational autoencoder3dgs-based renderingpoint-cloud autoencoderrobotic manipulation
APEX: Autonomous Policy Exploration for Self-Evolving LLM Agents
The paper introduces Autonomous Policy EXploration (APEX), a method for self-evolving LLM agents that mitigates exploration collapse by maintaining an explicit strategy space through a directed acyclic graph of milestones. APEX employs Fork Discovery to expand the strategy map with evidence-grounded directions and Policy Selection to balance exploration-exploitation during planning. Evaluated on nine Jericho text-adventure games and WebArena, APEX outperforms baselines, with ablations confirming component efficacy and robustness across diverse settings.
self-evolving agentsexploration collapsestrategy mapfork discoverypolicy selection
RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis
RePCM introduces a two-stage framework for region-specific and phenotype-adaptive bi-ventricular cardiac motion synthesis from a single end-diastolic frame. Stage I employs a reconstruction network to learn vertex-wise motion descriptors and derive functional partitions via clustering. Stage II integrates a Region-Specific Injection Module within a conditional VAE to enforce synchronized region exchange while preserving localized dynamics, alongside a Phenotype-Adaptive Mixture-of-Experts prior conditioned on ED shape to capture inter-disease variability. Evaluations on three cardiovascular disease datasets demonstrate consistent improvements in geometric and functional metrics, with enhanced preservation of region-specific dynamics.
bi-ventricularconditional vaemixture-of-expertsvertex-wiseend-diastolic
OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization
OCTOPUS introduces an optimized KV-cache compression method for transformers via octahedral parametrization and joint quantization of rotated coordinate triplets. The method maps each triplet's direction to a square using octahedral parametrization, then applies Lloyd-Max quantization to the resulting coordinates and triplet norm, optimizing per-triplet squared error for non-uniform bit allocation. This data-oblivious, online codec outperforms prior rotation-based methods (TurboQuant, PolarQuant) across text, video, and audio tasks, with gains increasing at lower bit widths. A fused Triton implementation enables on-the-fly key reconstruction without additional decode-time bandwidth or latency.
kv-cacheoctahedral parametrizationlloyd-max quantizationrotation codectriton implementation
PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety Alignment
PREFINE introduces a preference-based fine-tuning method for adapting pre-trained RL policies to safety constraints without full retraining. The approach extends Direct Preference Optimization (DPO) to continuous control by constructing policy-sampled counterfactual trajectories for preference contrasts, jointly optimizing reward retention and safety alignment. Empirical results demonstrate over 60% reduction in constraint violations and catastrophic failures while maintaining original reward behavior, achieving low-cost, high-reward performance with improved data and computational efficiency compared to offline RL or imitation learning.
preference-based fine-tuningdirect preference optimizationcontinuous controlsafety alignmentcounterfactual trajectories
Artificial Intelligence Reshapes Microwave Photonics
This review paper provides the first comprehensive overview of artificial intelligence (AI) reshaping microwave photonics (MWP), an interdisciplinary field integrating microwave and photonic technologies. The authors systematically summarize state-of-the-art advances where AI revolutionizes MWP system design, simulation, fabrication, testing, deployment, and maintenance, enabling autonomous operation and exceptional efficiency. Representative breakthroughs include fully photonic microwave radar systems, photonic analog-to-digital converters with bandwidth up to 320 GHz, and photonic wireless communication systems achieving 616 Gbit/s data rates. The review highlights AI's profound impact across all aspects of MWP, from signal generation and transmission to processing and detection.
microwave photonicsartificial intelligencephotonic radaranalog-to-digital converterwireless communication
Behavior-Consistent Deep Reinforcement Learning
The paper introduces behavior-consistent deep reinforcement learning (RL) to mitigate high variance across training runs, ensuring policies are both high-performing and distributionally similar. The method leverages maximum-entropy RL to control behavioral divergence by anchoring runs to a uniform prior, proposing $Q$-value Expectile Disagreement (QED) as a state-dependent temperature schedule. QED uses double-critic disagreement as a proxy for cross-run disagreement, balancing entropy and policy optimization. Empirical evaluation on 18 continuous-control tasks shows QED reduces across-run divergence by two orders of magnitude, significantly decreasing return variance with modest sample-efficiency costs.
reinforcement learningmaximum-entropybehavioral divergenceq-valuecontinuous-control
Enhanced Reinforcement Learning-based Process Synthesis via Quantum Computing
This work introduces quantum reinforcement learning (RL) as a scalable solution for process synthesis problems, addressing prior limitations in qubit requirements through state encoding algorithms. The authors formalize process synthesis as a Markov decision process and develop quantum-enhanced RL algorithms, benchmarking them against classical RL under identical training conditions. Evaluations on flowsheet synthesis problems with increasing unit counts demonstrate that quantum approaches achieve competitive performance on a per-episode basis and improved efficiency on a per-parameter basis for moderate-scale problems. The study establishes a foundation for quantum computing applications in process systems engineering and provides a controlled benchmark for comparing classical and quantum algorithms.
quantum reinforcement learningprocess synthesismarkov decision processstate encodingflowsheet synthesis
SURGE: An Event-Centric Social Media Sentiment Time Series Benchmark with Interaction Structure
The paper introduces SURGE, a benchmark for event-centric social media sentiment analysis, featuring 67 events and over 800K posts across five categories. It addresses limitations in existing datasets by incorporating interaction structure between posts and providing calendar-aligned time series at three granularities. The benchmark supports protocols for numerical forecasting, text-augmented forecasting, and generalization testing. Experiments reveal challenges in forecasting event-driven data, including strong local persistence and difficulty during reply-dense periods. A lightweight structure-aware probe demonstrates SURGE's utility for interaction-aware research.
sentiment analysistime seriessocial mediainteraction structurebenchmark
SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection
SAM-Sode, a novel eXplainable AI framework, enhances interpretability in tiny bacteria detection by addressing blurred foreground boundaries and diffuse feature attribution. The method transforms initial feature attribution maps into geometry-aware prompts, leveraging the SAM3 foundation model for spatial refinement and morphological reconstruction. A dual-constraint mechanism based on physical significance and geometric alignment performs instance-level denoising, generating coherent explanations aligned with expert intuition. Experiments on a self-constructed dataset of 2,524 images with complex backgrounds and public datasets show significant suppression of background redundancy and improved decision-making transparency.
explainable aifeature attributionmorphological reconstructioninstance-level denoisinggeometry-aware prompts
Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding
We present Manga109-v2026, an improved version of the Manga109 dataset addressing annotation issues for modern manga understanding tasks. Through OCR-based issue detection and manual revision, we correct transcription errors, missing text regions, overlapping dialogue and onomatopoeia, and under-segmented speech balloons. Approximately 29,000 dialogue annotations were revised, resulting in better alignment with contemporary OCR and multimodal manga understanding systems while preserving manga's expressive structures.
manga109ocrmultimodalannotationonomatopoeia
ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving
ScenePilot introduces a feasibility-guided, boundary-driven framework for generating safety-critical scenarios in autonomous driving, targeting scenarios that are physically solvable yet cause autonomy stack failures. The method employs constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score with an online-learned AV-risk predictor, and incorporates step-level feasibility-aware shielding to maintain exploration near the feasibility boundary. Experiments on SafeBench demonstrate that ScenePilot increases collision rates by 6.2 percentage points while preserving physical validity, and adversarial fine-tuning on these scenarios reduces downstream crash rates consistently.
autonomous drivingfeasibility-guidedmulti-objective reinforcement learningav-risk predictorfeasibility-aware shielding
Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums
This study enhances drone-based military object detection by evaluating performance across diverse visual spectrums. The KIIT-MiTA dataset is extended with four simulated environments: Gray Scale, Thermal Vision, Night Vision, and Obscura Vision, representing low visibility, heat-based imagery, and nighttime conditions. The YOLOv11-small model is trained and tested on these datasets to assess detection reliability in real-world scenarios. Results demonstrate improved robustness and adaptability of detection systems for both defensive and offensive missions, advancing the operational effectiveness of drones in hostile environments.
yolov11-smallkiit-mita datasetthermal visionnight visionobscura vision
Automated ICD Classification of Psychiatric Diagnoses: From Classical NLP to Large Language Models
This study automates ICD coding for psychiatric diagnoses by comparing classical NLP (BoW, TF-IDF) with LLMs (e5_large, BioLORD, Llama-3-8B) on 145,513 Spanish clinical texts. Transformer-based embeddings outperformed frequency-based methods by capturing semantic nuances, with fine-tuned e5_large achieving a 0.866 F1_micro score. The research highlights LLMs' capability to handle long-tail label distributions and ambiguous psychiatric terminology.
icd classificationpsychiatric diagnosesllmsnlpe5_large
Detecting Trojaned DNNs via Spectral Regression Analysis
The paper introduces MIST, a novel approach for detecting Trojaned DNNs by analyzing spectral changes in model representations during fine-tuning. MIST frames Trojan detection as a regression problem, using pre-activation spectra to characterize benign model evolution and flag updates with inconsistent spectral deviations. Evaluated across four datasets and eight Trojan attacks, MIST achieves state-of-the-art detection accuracy after a single update, without requiring knowledge of poisoned data or triggers. It remains effective under multi-step benign evolution, demonstrating stable performance with bounded degradation. Results indicate spectral evolution provides a robust, assumption-light signal for identifying malicious model updates.
trojan detectionspectral regressionfine-tuningpre-activation spectramodel evolution
On the Complexity of Entailment for Cumulative Propositional Dependence Logics
The paper establishes computational complexity bounds for entailment problems in cumulative propositional dependence logic and cumulative propositional logic with team semantics. Building on prior work showing these logics are characterized by System C and captured by Kraus-Lehmann-Magidor cumulative models, the authors analyze entailment via relational models. The main contribution consists of formal proofs determining the precise complexity classes for these entailment problems.
cumulative logicteam semanticsentailment problempropositional dependence logicrelational models
Efficient Learning of Deep State Space Models via Importance Smoothing
The paper introduces parallel variational Monte Carlo (PVMC), a novel training method for deep state space models (DSSMs) that bridges auto-encoding and sequential Monte Carlo (SMC) approaches. PVMC enables efficient parallel training for both discriminative and generative tasks by combining variational inference with Monte Carlo techniques. Experiments demonstrate state-of-the-art performance across benchmarks, with a 10× speedup over conventional SMC methods while maintaining training robustness.
deep state space modelssequential monte carlovariational inferenceparallel trainingimportance smoothing
ACL-Verbatim: hallucination-free question answering for research
We introduce ACL-Verbatim, an extractive question answering system for research papers in the ACL Anthology that directly maps user queries to verbatim text spans, mitigating LLM hallucinations. A novel ground truth dataset is constructed using synthetic user queries generated via a ScIRGen-based pipeline and human annotations by NLP researchers. We evaluate various extractive models, with a 150M-parameter ModernBERT token classifier trained on silver supervision achieving the highest word-level F1 score of 53.6, outperforming the best LLM extractor at 48.7.
extractive question answeringllm hallucinationsverbatimragmodernbertword-level f1
Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints
We introduce SLIM, a minimal architecture that decouples communication from policy execution in multi-agent reinforcement learning (MARL), addressing the bottleneck where shared latent representations limit performance under bandwidth constraints. SLIM isolates bandwidth effects from policy capacity by using separate pathways for communication and policy, while $β$ provides a unified bandwidth budget metric combining sparsity, rounds, and message dimension. Evaluated on partially-observable MARL benchmarks requiring communication, SLIM achieves state-of-the-art performance, demonstrating scalability and robustness with minimal degradation as bandwidth decreases.
multi-agent reinforcement learningbandwidth constraintslatent representationpolicy executionscalability
AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions
AutoRPA introduces a framework for efficient GUI automation by distilling LLM-driven ReAct-style interactions into reusable RPA functions. The method employs a translator-builder pipeline: a translator agent converts hard-coded ReAct actions into soft-coded procedures, while a builder agent synthesizes robust RPA functions using retrieval-augmented generation across multiple trajectories. A hybrid repair strategy combines RPA execution with ReAct-based fallback for iterative code refinement. Experiments across GUI environments show that AutoRPA-generated RPA functions reduce token usage by 82% to 96% while maintaining task-solving capability, significantly enhancing runtime efficiency and reusability.
gui automationreact paradigmrobotic process automationretrieval-augmented generationcode synthesis
Fine-grained Claim-level RAG Benchmark for Law
We present ClaimRAG-LAW, a bilingual (French/English) dataset and fine-grained evaluation framework for retrieval-augmented generation (RAG) systems in the legal domain, addressing limitations of existing benchmarks. The dataset includes diverse question types targeting both expert and non-expert users, enabling separate analysis of retrieval and generation performance. We evaluate state-of-the-art legal RAG systems using this framework, revealing persistent limitations in claim-level analysis, retrieval accuracy, and generation quality across different user groups and languages.
retrieval-augmented generationclaim-level analysislegal domainfine-grained evaluationbilingual dataset
Grounding Driving VLA via Inverse Kinematics
The paper proposes a redesigned Driving Vision-Language-Action (VLA) model that addresses structural limitations in trajectory prediction by framing it as an inverse kinematics problem. The method introduces two key components: (1) a next visual state prediction objective to enforce visual grounding, and (2) a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that processes only current and future visual states to reduce reliance on non-visual shortcuts. The 0.5B-parameter model achieves trajectory planning performance comparable to 7B--8B VLAs on NAVSIM-v2 and nuScenes benchmarks, with particularly strong gains in dynamic scenarios like turning.
vision-language-actioninverse kinematicstrajectory predictionconditional diffusionvisual grounding
Divide et Calibra: Multiclass Local Calibration via Vector Quantization
The authors propose a compositional approach for multiclass calibration that constructs region-specific calibration maps from shared codeword-dependent factors, addressing limitations of global homogeneity assumptions and local dimensionality reduction. The method leverages Vector Quantization (VQ) to partition the representation space and employs an indexed parameterization of Dirichlet concentrations for parameter sharing across regions. This enables learning heterogeneous calibration maps that generalize effectively, even in sparse latent space regions. Empirical evaluations on benchmark datasets demonstrate significant improvements in local calibration while maintaining competitive global calibration and predictive performance.
multiclass calibrationvector quantizationdirichlet concentrationslatent spaceparameter sharing
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
DySink introduces dynamic frame sinks for autoregressive long video generation, addressing limitations of static early-frame anchors that retain outdated context. The method employs a retrieval-based framework with a compact memory bank to select visually relevant historical frames as sinks, coupled with a sink anomaly gate to detect and suppress attention collapse caused by excessive inter-head consensus. Experiments demonstrate improved dynamic degree and temporal quality in minute-long videos compared to baselines.
autoregressive video generationdynamic frame sinksinter-head attentionretrieval-based frameworksink anomaly gate
Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs
We introduce Analytic Agent, a Large Language Model (LLM)-based agentic system that enables natural language access to governed enterprise analytics APIs, addressing limitations of Text-to-SQL systems in enterprise settings. The system translates user intents into secure API interactions through multi-step reasoning and policy-aware orchestration, ensuring compliance with business logic, auditability, and security constraints. Evaluated on 90 real enterprise use cases constructed by domain experts, Analytic Agent reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations.
text-to-sqlenterprise analyticsllm-based agentgoverned apispolicy-aware orchestration
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
This work demonstrates that off-the-shelf persona steering vectors, originally designed for general role-playing, effectively mitigate sycophancy in instruction-tuned models without requiring targeted training on sycophancy data. The study compares these persona vectors to Contrastive Activation Addition (CAA), the standard sycophancy mitigation technique, across two models. Steering toward personas characterized by doubt or scrutiny reduces sycophancy to 68% and 98% of CAA's effect while maintaining accuracy when the user is correct. Geometric analysis reveals that persona vectors operate independently of sycophancy's activation space direction, suggesting sycophancy is better understood as a persona-level property rather than a single steerable direction.
sycophancycontrastive activation additionpersona steering vectorsactivation spaceinstruction-tuned models
Single-Pass, Depth-Selective Reading for Multi-Aspect Sentiment Analysis
The paper introduces DABS, a single-pass inference framework for multi-aspect sentiment analysis that addresses efficiency-expressiveness tradeoffs in Aspect-Term Sentiment Analysis (ATSA). By encoding sentences once into a reusable, depth-ordered substrate, DABS enables aspect-specific queries to selectively read relevant tokens and abstraction levels without redundant re-encoding. Evaluated on four ATSA benchmarks, DABS reduces computation by up to 60% in multi-aspect scenarios (M >= 2) while maintaining competitive performance, particularly excelling in linguistically complex cases like negation and contrast.
atsatransformerdepth-selectivesingle-passsentiment analysis
Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data
The study proposes a hybrid machine learning model combining TanDEM-X interferometric coherence measurements with Landsat optical data to improve forest height estimation. The method integrates physical model constraints with ML to address height/structure and baseline/terrain slope ambiguities, leveraging multispectral inputs for enhanced forest type discrimination. Validation on TanDEM-X acquisitions over Gabon's Lopé National Park shows 13.5% RMSE and 16.6% MAE reductions compared to the original physical-model-constrained approach when evaluated against LiDAR ground truth.
forest height estimationinterferometric coherencehybrid modelingmultispectral inputsphysical model constraints
Towards Context-Invariant Safety Alignment for Large Language Models
The paper proposes Anchor Invariance Regularization (AIR) to improve context-invariant safety alignment in large language models (LLMs), addressing brittleness where models comply with harmful requests under adversarial rephrasing. AIR treats verifiable prompts as anchors and regularizes open-ended variants toward anchor performance via a stop-gradient target, implemented as an auxiliary loss combined with group-based preference optimization (e.g., GRPO). Evaluations across Safety, Moral Reasoning, and Math tasks show AIR improves in-distribution group accuracy by 12.71% and out-of-distribution consistency by 33.49%, enhancing robustness to adversarial framings.
context-invariant alignmentanchor invariance regularizationpreference optimizationsafety alignmentadversarial robustness
A Sharper Picture of Generalization in Transformers
The authors derive non-vacuous generalization bounds for transformers on boolean domains via PAC-Bayes theory, contrasting prior Rademacher complexity approaches. They demonstrate that boolean functions with sparse Fourier spectra concentrated on low-degree components enable low-sharpness constructions with favorable generalization properties. The theoretical framework establishes the existence of flat minima implementing boolean functions with sparsity bounded by context length, followed by PAC-Bayes application to an idealized low-sharpness learner. Empirical evaluations and mechanistic interpretability studies validate the realism of these theoretical constructions in practical transformer models.
pac-bayes theoryfourier spectrageneralization boundsflat minimacontext length
Diagnosing Overhead in Dispatch Operations: Cross-architecture Observatory
The study introduces DODOCO, a diagnostic framework for evaluating assumptions underlying AlltoAll dispatch optimizations in Mixture-of-Experts (MoE) systems. It instruments five MoE checkpoints across diverse architectures (DeepSeek-V2-Lite MLA, DeepSeek-MoE-16B MHA, Qwen3-30B GQA, Nemotron-30B Mamba-2, Qwen3.5-35B GDN) under varying data conditions and EP scaling from 4 to 32 ranks on H100 GPUs. Results show that routing imbalance is intrinsic to model decisions, not rank placement, and mock-token benchmarks overestimate routing Gini by up to 2.35×. Architectures split into two bands: MHA and Mamba-2 exhibit lower Gini (0.105-0.150), while MLA and GDN remain persistently concentrated (Gini >0.24).
mixture-of-expertsalltoall dispatchrouting ginimock-token benchmarksexpert parallelism
Comparative Evaluation of Deep Learning Models for Fake Image Detection
This study evaluates four CNN architectures (VGG16, ResNet50, EfficientNetB0, XceptionNet) for fake image detection using standardized preprocessing and training. VGG16 achieved 91% accuracy, with others at 90%, though EfficientNetB0 exhibited bias toward fake images. Metrics included Accuracy, Precision, Recall, F1-score, and ROC-AUC. Limitations involve dataset imbalance and overfitting, highlighting needs for balanced datasets and fairness-aware training to improve robustness.
cnnganroc-aucaugmentationvgg16
Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
We propose Inter-Layer Visual Attention Discrepancy (ILVAD), a training-free method to mitigate hallucination in Large Vision-Language Models (LVLMs) by enhancing attention to correct visual evidence. ILVAD identifies salient visual tokens via inter-layer attention weight analysis, constructs a saliency map to reduce visual forgetting during generation, and emphasizes text tokens grounded in visual evidence. Evaluations across five LVLMs demonstrate consistent hallucination reduction across diverse architectures. Code is publicly available.
large vision-language modelshallucination mitigationinter-layer attentionsaliency mapvisual forgetting
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
The paper introduces SPpruner, a subject-centric progressive visual token reduction method for Vision-Language Models (VLMs) that mimics human visual perception via a Focus-then-Context mechanism. It employs a focus identification module to model visual saliency and semantic relevance, followed by a context-aware structural scanning module to aggregate contextual cues. Experiments show SPpruner achieves 2.53× speedup with 22.2% tokens retained in Qwen2.5-VL and 67% FLOPs reduction in LLaVA with only 0.6% accuracy drop.
vision-language modelstoken reductionvisual saliencycontextual aggregationcomputational efficiency
DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU
DASH introduces a fast differentiable architecture search framework for hybrid attention design in LLMs, addressing inefficiencies in existing methods like Jet-Nemotron. By relaxing discrete operator placement into continuous logits, preparing reusable teacher-aligned candidates, and freezing model weights, DASH achieves efficient architecture-only search. On Qwen2.5-3B-Instruct, it outperforms selector-style baselines and matches Jet-Nemotron on short-context benchmarks while improving RULER performance. Each search run uses only 12.3M tokens and completes in ~20 minutes on a single RTX Pro 6000 GPU, representing a 0.006% token cost compared to Jet-Nemotron's PostNAS.
hybrid attentiondifferentiable searcharchitecture logitsteacher-aligned candidatesruler performance
Strategy-Induct: Task-Level Strategy Induction for Instruction Generation
Strategy-Induct introduces a framework for task-level instruction generation in Large Language Models (LLMs) without requiring labeled answers, addressing a key limitation in prior instruction induction methods. The approach generates explicit reasoning strategies from example questions, forming (strategy, question) pairs to induce task instructions. Experiments across multiple tasks and model scales demonstrate superior performance over state-of-the-art methods in question-only settings. Additionally, joint utilization of LLMs and Large Reasoning Models shows potential for further performance improvements in task instruction generation and inference.
instruction inductiontask-level promptsreasoning strategieslarge reasoning modelsquestion-only settings
Causal Past Logic for Runtime Verification of Distributed LLM Agent Workflows
The paper introduces Causal Past Logic (CPL), a past-time temporal logic integrated into the ZipperGen agent-workflow framework for runtime verification of distributed LLM agent workflows. CPL enables guards in conditionals and while loops to inspect causally visible events and variables across asynchronous lifelines, evaluated online by the owner lifeline. A vector-clock monitor with latest-value views ensures local computation aligns with denotational semantics of guards at each event. This approach embeds runtime verification directly into the coordination language, eliminating the need for post-hoc log analysis.
causal past logicruntime verificationvector-clock monitordistributed llmzippergen framework
Winfree Oscillatory Neural Network
The paper introduces the Winfree Oscillatory Neural Network (WONN), a dynamical architecture leveraging generalized Winfree dynamics for representation learning on the torus $(S^1)^d$. WONN combines phase-based inductive biases with hierarchical interactions, implemented via trigonometric mappings or learnable networks. Evaluated on CIFAR, ImageNet-1K, Maze-hard, and Sudoku, WONN achieves competitive performance with notable parameter efficiency: 80.1% accuracy on Maze-hard using 1% of prior SOTA parameters, marking the first oscillatory model to scale competitively to ImageNet-1K.
oscillatory neural networkswinfree dynamicsphase-based learningparameter efficiencysynchronization dynamics
Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures
Sutra introduces a typed functional programming language that compiles to PyTorch neural networks, enabling symbolic programs to function as trainable models. The compiler reduces all program elements—primitives, control flow, and I/O—to a fused tensor-op graph over frozen embeddings, supporting operations like rotation binding and polynomial Kleene logic. Validation demonstrates 100% decoding accuracy across four embeddings (three text encoders, one protein model) at width k=8, outperforming Hadamard product baselines. PyTorch autograd successfully trains a fuzzy-rule classifier from random initialization to 100% accuracy, with trained weights recompilable into the source code. Sutra bridges symbolic logic programs and neural networks in a single artifact.
tensor-op graphkleene logicrotation bindingfrozen embeddingsautograd
Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models
This work extends the reliability paradox to machine unlearning in language models, demonstrating that good calibration can coexist with shortcut-based decision rules. Using the TOFU benchmark and multiple-choice question-answering evaluation, the authors measure probabilistic reliability via calibration metrics (ECE, MCE, Brier) and decision-rule reliability through attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. Fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and unlearned models retain similar calibration despite reduced accuracy on the forget split, while attribution analysis reveals increased reliance on correlation-based tokens.
machine unlearningcalibration metricsintegrated gradientsreliability paradoxshortcut detection
For How Long Should We Be Punching? Learning Action Duration in Fighting Games
The paper introduces a reinforcement learning framework for fighting games that jointly learns action selection and duration, enabling dynamic adaptation of responsiveness. Using the FightLadder environment, agents trained against scripted bots demonstrate that learned timing matches fixed frame skip performance while promoting repeatable action patterns. Results indicate optimal performance with high frame skip values, facilitating exploitative strategies against scripted opponents, though robustness remains unverified.
reinforcement learningframe skipfighting gamesaction durationexploitative strategies
VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026
VISTA introduces a StillFast-style architecture for short-term object interaction anticipation in egocentric videos, winning first place in the EgoVis 2026 Ego4D STA Challenge. The method integrates a COCO-pretrained Faster R-CNN ResNet-50 FPN detector for object proposals with a frozen V-JEPA 2.1 temporal branch for clip-level context. Feature modulation and ROI-level context fusion combine spatial and temporal representations, enabling multi-head predictors for bounding box refinement, noun/verb classification, time-to-contact regression, and confidence estimation. Ensembling complementary predictions enhances robustness. Results on the official challenge server validate VISTA's effectiveness.
egocentric videofeature modulationroi-level fusionmulti-head predictiontemporal context
GenAI-Driven Threat Detection with Microsoft Security Copilot
The Dynamic Threat Detection Agent (DTDA) introduces an autonomous, always-on system for identifying hidden cyber threats within Microsoft Defender. DTDA integrates a unified activity timeline, versioned LLM prompt contracts with schema validation, a planner-executor investigation loop, and dynamic alert generation. Deployed across tens of thousands of Defender customers, DTDA achieves 80.1% precision in a 120-day online evaluation and generates novel alerts for 15% of investigated incidents. Offline evaluation shows DTDA recovers hidden malicious activity with a 0.78 F1 score using GPT-5.4, improving over GPT-4.1 by 0.12 F1. Operational metrics include a median investigation time of 28 minutes and a 0.38% job-level failure rate.
dynamic threat detectionllm prompt contractsplanner-executor loopmitre mappingsschema validation
Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
The paper introduces Terminal-World, an automated pipeline for generating high-quality training data for terminal agents by using agent skills as a synthesis primitive. The method jointly encodes task instructions, environments, and teacher trajectories, and extends synthesis capabilities through skill teams and skill graphs for multi-role and cross-domain tasks. Evaluated on 6 benchmarks, Terminal-World models (8B/14B/32B) outperform baselines, with Terminal-World-32B achieving +4.5 Pass@1 (31.5) and 43.8 Pass@3 on Terminal-Bench 2.0 using only 1.2% of the training data compared to Nemotron-Terminal-32B.
terminal agentsagent skillsskill graphsmulti-role synthesisterminal-bench
Governance by Construction for Generalist Agents
The contribution introduces CUGA's policy system, a modular policy-as-code layer for governing generalist LLM agents in enterprise workflows without model fine-tuning. The method employs a runtime governance architecture enforcing policy interventions at five structural checkpoints: Intent Guard, Playbook, Tool Guide, Tool Approvals, and Output Formatter, embedding governance continuously across the agent's execution pipeline. Results demonstrate dynamic playbook injection, intent guards blocking harmful requests, and human-in-the-loop tool approval checkpoints, improving policy adherence and execution consistency in healthcare scenarios.
policy-as-coderuntime governanceintent guardtool guidehuman-in-the-loop
PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
The paper introduces PlanningBench, a framework for generating scalable and verifiable planning data to evaluate and train large language models (LLMs). The method abstracts real planning scenarios into a structured taxonomy of 30+ task types and constraints, then uses constraint-driven synthesis to create self-contained problems with adaptive difficulty control and verification. Results show current LLMs struggle with constrained planning, while reinforcement learning on PlanningBench data improves performance on unseen benchmarks and general instruction-following tasks.
planningbenchconstraint-driven synthesisadaptive difficulty controlverification checklistsinstruction-following tasks
CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation
The authors propose Context-Adaptive Moment Estimation (CAdam), a novel framework addressing the Densification Dilemma in optimization-based Generative Distillation for 3D Gaussian Splatting (3DGS). CAdam reinterprets densification as a signal verification problem, leveraging first-moment gradient statistics to separate geometric signals from generative noise via constructive and destructive interference. It incorporates quantile-based context awareness and an SNR gating mechanism for adaptive densification control. Experiments across multiple objectives (SDS, ISM, VFDS) demonstrate CAdam reduces Gaussian counts by 85%-97% compared to standard densification while maintaining perceptual quality, significantly improving memory efficiency in generative distillation.
densification dilemmagenerative distillationsignal verificationconstructive interferencesnr gating
Runtime-Certified Bounded-Error Quantized Attention
The paper introduces a runtime-certified KV cache quantization method for LLM inference, ensuring bounded attention errors while reducing memory costs. The tiered architecture stores INT8 keys and INT4 values in GPU memory, retaining FP16 originals in RAM for fallback, with online error bounds driving adaptive precision selection. Evaluated on LLaMA-3.1-8B with 128K contexts across PG-19, NIAH, and RULER benchmarks, the system matches FP16 quality for language modeling and retrieval, recovering catastrophic failures in naive INT8/INT4 baselines. The method provides local certification per attention head and step, reframing KV cache quantization as runtime-verified computation.
kv cache quantizationattention error boundsadaptive precision selectionruntime certificationtiered memory architecture
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards
The paper introduces $N$-Step Forward-Trace Policy Optimization (NFPO), a reinforcement learning algorithm for verifiable rewards that addresses structural bias in PPO surrogate objectives. NFPO augments the PPO objective with an $N$-step forward trace of cumulative likelihood ratios for next tokens, providing a continuous bias-variance trade-off between PPO and exact policy gradients. Theoretical analysis shows tighter policy-improvement bounds with optimal $N$, and experiments on reasoning benchmarks confirm consistent performance gains.
reinforcement learningppo surrogatepolicy gradientlikelihood ratiobias-variance trade-off
DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation
DISC introduces a novel framework for language-conditioned manipulation that structurally decouples task instructions from state-conditioned control via policy generation. Instead of conditioning a universal policy on language, DISC employs a two-stage hypernetwork to generate task-specific visuomotor policy parameters directly from instructions, eliminating observation leakage pathways. The refinement stage embeds gradient-based optimization structure as a feed-forward inductive bias, ensuring globally consistent parameters without gradient computation. DISC outperforms entangled baselines on LIBERO-90 and Meta-World benchmarks, particularly excelling in complex, long-horizon tasks, and surpasses pretrained models without external data. It also demonstrates few-shot adaptation and robust generalization across paraphrased instructions.
hypernetworkvisuomotor policyobservation leakagegradient-based optimizationfew-shot adaptation
USV: Towards Understanding the User-generated Short-form Videos
The paper introduces USV, a dataset of 224K user-generated short-form videos collected from UGC platforms without manual verification, aimed at advancing high-level semantic video understanding. It establishes two tasks: topic recognition and video-text retrieval. The authors propose Multi-Modality Fusion Network (MMF-Net) and Video-Text Contrastive Learning (VTCL) as baseline methods for these tasks, respectively. Comprehensive benchmarks are provided to facilitate future research. The dataset and project details are available at https://usvdataset.github.io.
user-generated short-form videostopic recognitionvideo-text retrievalmulti-modality fusion networkvideo-text contrastive learning
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
The authors introduce ArchSIBench, a benchmark for evaluating architectural spatial intelligence in Vision-Language Models (VLMs) across five core dimensions: perception, reasoning, navigation, transformation, and configuration. The benchmark comprises 17 subtasks and 3,000 expert-annotated question-answer pairs, addressing higher-level spatial cognition beyond elementary skills. Evaluation reveals significant gaps between VLMs and human baselines, with state-of-the-art models approaching untrained human performance but lagging behind architect-trained humans, particularly in transformation and configuration tasks.
vision-language modelsarchitectural spatial intelligencebenchmark evaluationspatial cognitionexpert annotation
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
The paper establishes that Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) are only conditionally equivalent, contrary to prior assumptions. It identifies a critical failure mode when the RLHF-optimal policy does not prefer human-preferred responses, causing DPO to optimize relative advantage over alignment. The authors propose Constrained Preference Optimization (CPO), which augments RLHF with constraints for provable alignment, and provide a geometric interpretation via soft margin ranking. Experiments on standard benchmarks show CPO achieves state-of-the-art performance.
direct preference optimizationreinforcement learning from human feedbackconstrained preference optimizationsoft margin rankingprovable alignment
GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval
This work evaluates Graph-based Retrieval Augmented Generation (GraphRAG) for Electronic Health Record (EHR) schema retrieval using locally deployed open-source large language models (LLMs) on consumer hardware. The Microsoft GraphRAG pipeline was implemented on EHR schema documentation, benchmarking Llama 3.1 (8B), Mistral (7B), Qwen 2.5 (7B), and Phi-4-mini (3.8B) deployed via Ollama on a single GPU (8 GB VRAM). Results show Llama 3.1 constructs the richest knowledge graph (1,172 entities), Qwen 2.5 achieves the highest answer quality (3.3/5), Phi-4-mini fails due to structured-output errors, and Mistral exhibits degenerate repetition. Local retrieval outperforms global summarization in latency and factual grounding, with reduced hallucination.
graphragehrllmollamaknowledge graph
Tunable MAGMAX: Preference-Aware Model Merging for Continual Learning
The paper introduces Tunable MAGMAX, a preference-aware model merging framework for continual learning that enables task-specific performance control. The method employs a preference vector to selectively combine elements from task-specific parameter vectors during merging, allowing adaptation to diverse deployment environments. It further automates preference vector construction using minimal target environment data and training task datasets, eliminating manual specification. Experiments on continual learning benchmarks demonstrate that Tunable MAGMAX effectively controls task-wise performance and adapts merged models to varying environments, achieving superior or comparable results to baseline methods.
continual learningmodel mergingpreference vectortask-specific performanceparameter vectors
ELSA: An ELastic SNN Inference Architecture for Efficient Neuromorphic Computing
ELSA introduces a novel near-SRAM dataflow architecture for spiking neural networks (SNNs) that enables true elastic inference through fine-grained spine/token-wise pipelining and hardware optimizations. The architecture forwards each spine/token immediately upon production, forming a continuous streaming pipeline that reduces latency to the first response. ELSA employs a bundled address event representation protocol to lower network-on-chip communication traffic and leverages mini-batch spiking Gustavson-product to reduce memory access and exploit sparsity. Experimental results demonstrate that ELSA achieves 3.4x speedup and 13.6x higher energy efficiency over the state-of-the-art QANN accelerator (ANT), and 2.9x speedup and 22.1x energy efficiency gains over the state-of-the-art SNN accelerator (PAICORE) for a 4-bit ResNet-50.
spiking neural networkselastic inferencenear-sramgustavson-productnetwork-on-chip
Interaction Locality in Hierarchical Recursive Reasoning
The paper introduces interaction locality, a framework for quantifying whether information flow in spatial reasoning models remains local or crosses boundaries, measured via sparse-autoencoder feature ablations and activation patching. Applied to hierarchical recursive models HRM and TRM on Maze-Hard, Sudoku Extreme, and ARC-AGI, results show high-level recurrent states write locally while recursive updates accumulate broader structure, strongest in TRM. In MTU3D, spatial locality concentrates at visual-to-grounding module transitions, suggesting recursive reasoning uniquely enables local-to-global handoffs.
interaction localitysparse-autoencoderactivation patchingrecursive reasoningspatial reasoning
Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
Conflict-Aware Additive Guidance ($g^\text{car}$) is introduced to address off-manifold drift in inference-time guided sampling for diffusion and flow models under compositional rewards. The method dynamically detects and resolves gradient conflicts, rectifying deviations from the true data manifold caused by misaligned gradients. $g^\text{car}$ is lightweight and learnable, validated across synthetic datasets, image editing, and generative decision-making tasks. Results show it surpasses baselines in generation fidelity while maintaining computational efficiency. Code is publicly available.
off-manifold driftgradient misalignmentcompositional rewardsinference-time guidancediffusion models
Correcting Stochastic Update Bias in Preconditioned Language Model Optimizers
The authors identify and correct two finite-sample biases in preconditioned optimizers for language model training: gradient-preconditioner coupling bias and nonlinear inversion bias. They propose a bias-correction framework combining cross-fitted preconditioning, which estimates gradients and preconditioners from independent microbatch groups, and variance-corrected inversion, which subtracts leading delta-method bias terms. Evaluated on Qwen2.5-0.5B, the framework reduces held-out pretraining loss by 0.15, 0.07, and 0.11 nats for AdamW, Sophia, and Shampoo optimizers respectively, with neutral-to-positive effects on mixed-quality pretraining and downstream instruction tuning.
preconditioned optimizersfinite-sample biascross-fitted preconditioningvariance-corrected inversiondelta-method
PACD-Net: Pseudo-Augmented Contrastive Distillation for Glycemic Control Estimation from SMBG
PACD-Net introduces a self-supervised contrastive knowledge distillation framework for glycemic control estimation from sparse self-monitoring of blood glucose (SMBG) data. The method leverages pseudo-SMBG samples with enhanced temporal coverage as teacher signals and employs multi-view contrastive learning to enforce representation consistency across diverse sampling patterns. A hybrid Swin Transformer-CNN backbone captures temporal dependencies in sparse SMBG sequences. Experiments on real-world SMBG data show PACD-Net outperforms existing methods in estimating Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR), demonstrating improved accuracy, stability, and generalization under extreme sparsity.
contrastive learningknowledge distillationswin transformerglycemic controlsparse sampling
The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
This work explains the superior performance of Gated Linear Units (GLU) over non-gated architectures by analyzing their spectral properties in the neural tangent kernel (NTK) regime. Through theoretical analysis of two-layer networks, the authors demonstrate that GLU structures yield a smaller NTK condition number and more compact eigenvalue distribution, leading to faster convergence and a characteristic loss-crossing phenomenon. Empirical validation on ViT and GPT-2 shows GLU primarily accelerates optimization without significantly reducing the generalization gap.
gated linear unitsneural tangent kernelcondition numbereigenvalue distributionoptimization dynamics
The Hidden Signal of Verifier Strictness: Controlling and Improving Step-Wise Verification via Selective Latent Steering
This work introduces VerifySteer, a method for controlling verifier strictness in step-wise verification via selective latent steering. The authors identify a verification-specific hidden-state signal near paragraph boundaries that encodes acceptance/rejection tendencies, enabling strictness modulation without fine-tuning. VerifySteer employs sample-level routing to selectively intervene on paragraph boundaries, addressing the trade-off between error detection and correctness certification. Experiments on ProcessBench and Hard2Verify demonstrate that VerifySteer outperforms prompt optimization and activation steering baselines, achieves competitive performance with self-consistency while requiring 4-7x less inference compute, and complements verification fine-tuning. Code is publicly available.
verifier strictnesslatent steeringstep-wise verificationhidden-state signalsample-level routing
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale
We introduce Hack-Verifiable Environments, a novel paradigm for scalable evaluation of reward hacking in autonomous agents. Unlike post hoc trajectory analysis, our method embeds detectable reward hacking opportunities directly into environments, enabling deterministic and automated measurement of exploitation. We instantiate this approach in TextArena and release Hack-Verifiable TextArena, a testbed for reliable reward hacking measurement. Using this benchmark, we analyze reward hacking behavior across language models in diverse environments. The code is open-sourced at https://github.com/MajoRoth/hack-verifiable-environments/.
reward hackingautonomous agentstextarenaevaluation paradigmlanguage models
VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals
The study introduces VBFDD-Agent, a vehicle battery fault detection and diagnosis agent for automotive-grade lithium-ion battery systems, addressing the limitations of traditional methods in complex scenarios. The approach employs descriptive text modeling to transform battery monitoring signals, statistical features, anomaly records, and state assessments into structured natural language descriptions, forming a corpus for health diagnosis. VBFDD-Agent integrates these texts with historical case retrieval, local maintenance manuals, and large language model reasoning to generate interpretable diagnostic results and actionable maintenance recommendations. Experimental results demonstrate accurate anomaly monitoring and expert-confirmed practical value, extending battery diagnosis from label prediction to interpretable decision support.
descriptive text modelinglithium-ion batteriesanomaly detectionlarge language modelmaintenance recommendations
Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
We introduce Distribution-Aware Reward, a reinforcement learning objective that trains large language models to produce calibrated predictive distributions for regression tasks, rather than optimizing individual decoded outputs. The method evaluates empirical predictive distributions using the Continuous Ranked Probability Score and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality. Evaluated on Gaussian-mixture tasks, code performance prediction, and molecular property prediction from SMILES strings, it outperforms supervised fine-tuning and pointwise reinforcement learning baselines, achieving a 6-point Spearman improvement on KBSS and competitive results on MoleculeNet without graph-based or 3D molecular models. The method mitigates rollout diversity collapse and improves uncertainty diagnostics.
predictive distributionscontinuous ranked probability scorereinforcement learningsmiles stringsrollout diversity
An Application-Layer Multi-Modal Covert-Channel Reference Monitor for LLM Agent Egress
The paper introduces a multi-modal egress reference monitor for LLM agents to prevent covert-channel data leakage. The system employs (i) a text pipeline with ten capacity-reducing stages and a leaky-bucket ledger, (ii) media scramblers (Fourier-domain audio band-limiter and RGB image bit-depth bucketer) gated by cryptographic legitimacy attestation, and (iii) residual capacity measurement via adversarial encoder ensembles. The implementation achieves zero residual capacity on destroyable channels and bounds it on non-destroyable ones, effectively mitigating covert communication through text, image, and audio modalities.
covert-channelegress-monitormedia-scramblerresidual-capacitycryptographic-attestation
TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
We introduce TASTE, a multi-dimensional preference dataset for AI-generated graphic design, annotated by ten professional designers across nine criteria. The dataset includes 1,600 ratings per criterion and hallucination flags, derived from outputs of four text-to-image models. Using Kendall's tau, majority probability, and Condorcet cycles, we demonstrate significant designer agreement, rejecting the random-rater null hypothesis. Pre-trained systems, including six VLM judges and three T2I scorers, achieve ≤0.55 macro agreement with the 5-designer majority, indicating limited alignment. A pairwise-difference head trained on TASTE achieves 0.611 agreement, closing half the gap to the 0.741 single-rater ceiling.
tastekendall's taucondorcet cyclesvlm judgest2i scorers
Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning
The paper introduces distributional alignment as a criterion for designing task vectors in in-context learning (ICL), proposing $d_{\text{NTP}}$ to measure next-token probability discrepancies between task vector-based and ICL-based inference. The authors develop Linear Task Vector (LTV), a method that minimizes $d_{\text{NTP}}$ via closed-form linear mapping, achieving 9.2% higher accuracy across eight classification benchmarks and five LLMs while reducing latency. Results also demonstrate 6.4% performance improvement when transferring task vectors from larger to smaller models.
in-context learningtask vectorsdistributional alignmentnext-token probabilitylinear mapping
AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback
The paper introduces Adaptive Group Policy Optimization (AGPO), a critic-free reinforcement learning method for LLM refinement that dynamically adjusts clipping bounds and decoding temperature using group-level statistics. AGPO employs two controllers: (1) adaptive clipping based on reward dispersion, policy entropy, and KL drift, and (2) bidirectional temperature sampling modulated by uncertainty relative to a baseline. Evaluated on nine math/STEM benchmarks, Qwen2.5-14B with AGPO achieves 67.3% on GSM8K and 40.5% on MATH, outperforming PPO/GRPO. Benefits generalize to Llama-3-8B and Gemma-2-9B, with ablations confirming both modules' necessity.
adaptive policy optimizationgroup-level statisticsbidirectional temperature samplingcritic-free rlllm refinement
SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction
SAVER introduces a selective vision-as-needed framework for multimodal named entity recognition (MNER) and multimodal relation extraction (MRE) in social media posts. It employs a Conformal Groundability Gate (CGG) to estimate visual groundability and activate vision selectively, followed by a submodular relevance-diversity selector to choose compact image subsets aggregated via a Set Transformer. A joint scoring head combines text, optional visual evidence, and text-image consistency for entity typing or relation classification. Experiments demonstrate SAVER's consistent F1 improvement over text-only and always-on multimodal baselines, while reducing AURC, increasing activation coverage, and lowering computational costs.
multimodal information extractionconformal groundability gatesubmodular relevance-diversity selectorset transformertext-image consistency
SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR
SCRIBE introduces a diagnostic framework for automatic speech recognition (ASR) that addresses limitations of word error rate (WER) by decomposing errors into lexical, punctuation, numeral, and domain-entity categories. The method employs sandhi-tolerant alignment and domain vocabulary injection to improve accuracy in agglutinative languages. Human validation confirms SCRIBE's alignment with expert judgment, outperforming WER. The authors release SCRIBE, a large language model (LLM) curation pipeline, benchmarks, and open-weight rich transcription models for Hindi, Malayalam, and Kannada.
automatic speech recognitionword error ratesandhi-tolerant alignmentdomain vocabulary injectionrich transcription models
Rethinking Cross-Layer Information Routing in Diffusion Transformers
This paper introduces Diffusion-Adaptive Routing (DAR), a novel residual replacement for Diffusion Transformers (DiTs) that addresses limitations in cross-layer information flow. DAR performs learnable, timestep-adaptive, and non-incremental aggregation over sublayer outputs, mitigating issues of monotonic forward magnitude inflation, sharp backward gradient decay, and block-wise redundancy. Evaluated on ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and achieves baseline convergence quality with 8.75× fewer training iterations. When combined with REPA, DAR yields 2× training acceleration in early stages, demonstrating its orthogonal benefits to existing representation-alignment objectives. DAR also enhances fine-tuning of large-scale text-to-image models and preserves high-frequency details during Distribution Matching Distillation.
diffusion transformersresidual replacementcross-layer informationtimestep-adaptivedistribution matching
Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
LlamaWeb introduces a WebGPU backend for llama.cpp, enabling memory-efficient and performance-portable LLM inference in browsers across diverse hardware and model weight formats. The system employs static memory planning, efficient model loading, and templated GPU kernels to reduce memory overhead and support multiple quantization formats. Evaluated on 16 devices from 8 vendors using 10 language models and four weight formats, LlamaWeb reduces memory usage by 29-33% and increases decode throughput by 45-69% compared to existing browser-based frameworks. It also outperforms vendor-specific llama.cpp backends on certain devices.
webgpullm inferencequantization formatsmemory planningdecode throughput
Heartbeat-Bound Hierarchical Credentials: Cryptographic Revocation for AI Agent Swarms
The paper introduces Heartbeat-Bound Hierarchical Credentials (HBHC), a cryptographic protocol for revoking AI agent swarm credentials without network dependencies. HBHC binds credential validity to periodic parent liveness proofs, using local clocks and cached public keys for verification. Evaluations demonstrate a 90× reduction in zombie agent windows versus OAuth 2.0, 0.26 ms Rust authentication, 18k+ verifications/sec under load, and 0.71% overhead on tool calls. Experiments confirm cascading revocation across 49-agent hierarchies within theoretical bounds, preventing post-revocation tool calls under prompt injection attacks.
credential revocationai agent swarmscryptographic protocolliveness proofssecure enclaves
Interpretable Discriminative Text Representations via Agreement and Label Disentanglement
The paper introduces LLM-assisted Feature Discovery (LFD), a method for generating interpretable discriminative text representations that satisfy conceptual clarity and label disentanglement. LFD iteratively proposes lexical and semantic features from contrastive text pairs, screens candidates using cross-LLM Cohen's κ to ensure annotator agreement, and selects features based on residual predictive gain. Evaluated across ten text-classification tasks from seven corpora, LFD matches the predictive performance of a strong text bottleneck baseline while producing clearer and less label-entangled features. Human audits with 232 raters demonstrate higher agreement and reduced label leakage for LFD features, establishing a practical auditability standard for interpretable text classification.
interpretable text representationslabel disentanglementconceptual clarityllm-assisted feature discoverycohen's κ
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
Declarative Data Services (DDS) introduces a structured agentic discovery framework for composing heterogeneous data systems from declarative user intent. The architecture employs four typed contracts—intent, operator DAG, per-system skills, and runtime attribution—to decompose global search into bounded sub-searches, enabling sub-agents to explore each space while routing knowledge forward and errors backward. DDS addresses the failure of unbounded agentic discovery to converge on working stacks by transforming runtime failures into skill patches cited in subsequent deployments. On a trading-backend workload, DDS successfully converges where unbounded methods fail, demonstrating its efficacy in real-world data-system composition.
agentic discoverydeclarative intentoperator dagruntime attributionskill patches
DIVE: Embedding Compression via Self-Limiting Gradient Updates
We propose DIVE, a compression adapter for high-dimensional embeddings that addresses overfitting in low-data scenarios via two mechanisms: a self-limiting hinge-based triplet loss that bounds perturbations to the embedding space, and a head-wise NT-Xent contrastive loss that treats multiple projections as implicit views for self-supervised learning. DIVE outperforms Matryoshka-Adaptor, Search-Adaptor, and SMEC across six BEIR datasets at all evaluated compression ratios, with a 14M-parameter implementation.
embedding compressiontriplet lossnt-xentself-supervised learningdimensionality reduction
Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting
Dynamic TMoE introduces a drift-aware dynamic mixture of experts framework for non-stationary time series forecasting, addressing limitations of static models and fixed expert pools. The method dynamically instantiates and prunes heterogeneous experts based on distribution shifts detected via Maximum Mean Discrepancy (MMD), while a temporal memory router leverages recurrent states and an anomaly repository for context-aware expert selection. This unified approach ensures architectural evolution and temporal continuity without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%.
mixture of expertsmaximum mean discrepancytemporal memory routernon-stationary time seriesdistribution shifts
On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists
This study evaluates AI reviewers' capabilities through a large-scale expert annotation involving 45 domain scientists who spent 469 hours assessing 2,960 criticisms from human and AI-generated reviews of 82 Nature-family papers. The evaluation focused on correctness, significance, and sufficiency of evidence. Results show that GPT-5.2 outperformed top-rated human reviewers (60.0% vs. 48.2%, p = 0.009), and all three AI reviewers (GPT-5.2, Gemini 3.0 Pro, Claude Opus 4.5) exceeded the lowest-rated human across all dimensions. AI reviewers identified 26% unique issues but exhibited 16 recurring weaknesses, such as limited subfield knowledge and overly critical stances. The findings suggest AI reviewers complement rather than replace human reviewers.
ai reviewersexpert annotationnature-family papersgpt-5.2peer review
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
The paper introduces Reflector, a two-stage framework enhancing LLM safety against jailbreak attacks via internalized self-reflection. First, it employs teacher-guided generation to create reflection data for supervised fine-tuning. Second, it uses RL with outcome-driven supervision to instill autonomous reflection. Results show >90% defense success against complex attacks, 5.85% improvement on GSM8K, and robust generalization across threat scenarios, all without significant computational overhead.
reflectorjailbreak attacksself-reflectionsupervised fine-tuningreinforcement learning
AMAR: Lightweight Attention-Based Multi-User Activity Recognition from Wi-Fi CSI
The AMAR framework introduces a transformer-based architecture for attention-based multi-user activity recognition (HAR) from Wi-Fi channel state information (CSI), addressing overlapping CSI patterns in multi-user settings. It formulates HAR as a set prediction problem, employing learnable query embeddings as specialized activity detectors. AMAR adopts an edge-cloud split architecture, utilizing lightweight convolutional networks for initial feature extraction and residual vector quantization for bandwidth reduction, with final activity prediction performed via attention-based set matching. Evaluated across classroom, meeting-room, and empty-room environments, AMAR nearly doubles the rate of perfectly predicting all concurrent activities, achieves a 53.4% F1-score, reduces occupancy estimation error by 74%, and minimizes bandwidth usage.
attention-basedchannel state informationset predictionresidual vector quantizationedge-cloud split
Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition
The paper introduces Predicate Action Skills (PACTS), a novel class of closed-loop visuomotor policies that jointly model action trajectories and symbolic predicate outcomes. Unlike traditional generative policies focusing solely on actions, PACTS enables zero-shot skill composition by producing coherent action-outcome rollouts within a single model. The method leverages predicate predictions as a symbolic interface for planning, demonstrating improved action generation and predicate classification. Results show robust skill sequencing without retraining, validated through visuomotor tasks. Project details are available at https://planpacts.github.io/.
predicate action skillszero-shot compositionvisuomotor policiessymbolic outcomesclosed-loop control
Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines
The study introduces the Frenet-based pipe routing optimization (FPRO) framework, a manufacturability knowledge-integrated reinforcement learning approach for free-form pipe design in aeroengines. FPRO formulates pipe routing as a boundary value problem in the Frenet frame, representing paths via curvature and torsion profiles generated using cubic Hermite interpolation. Domain-specific manufacturing constraints are embedded, and optimization employs proximal policy optimization with stochastic exploration and stage-guided rewards. Experimental results show FPRO generates collision-free, manufacturable paths with smoother geometries, faster convergence, and superior performance in terminal alignment, path length, and obstacle avoidance compared to state-of-the-art baselines. Real-world validation confirms geometric correspondence between manufactured pipes and digital designs.
frenet frameproximal policy optimizationcubic hermite interpolationmanufacturability constraintssix-axis free-bending machine
AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals
AVSD (Adaptive-View Self-Distillation) introduces a novel self-distillation method that leverages multiple privileged-information views to address the asymmetry between teacher and student models. By separating stable cross-view consensus from view-specific residual signals, AVSD reconstructs token-level supervision, ensuring reliable updates while selectively incorporating view-specific adjustments. This approach outperforms single-view self-distillation baselines and GRPO on math competition benchmarks (AIME24, AIME25, HMMT25) with Qwen3-8B and Qwen3-4B, achieving average Avg@8 gains of 3.1% and 2.2%, respectively. Additionally, AVSD improves performance on code-generation benchmarks (Codeforces, LiveCodeBench v6) by 2.4% on average.
self-distillationprivileged-informationtoken-level supervisioncross-view consensusview-specific residual
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
The paper introduces optimization-triggered backdoor attacks on LLMs, revealing that inference compilation optimizations can be maliciously exploited to implant stealthy backdoors without modifying compilers or hardware. The proposed framework employs two strategies: one flips predictions for specific inputs only when compiled, while the other uses a universal trigger activated post-compilation. Empirical results show attack success rates averaging 90% across four open-source LLMs and four tasks, with clean accuracy preserved at ~100%. The work exposes a novel attack surface in LLM deployment and explores practical defenses.
backdoor attacksllm optimizationinference compilationstealthy triggersdeployment security
Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics
The paper proposes a feature supervision paradigm for Multimodal Diffusion Transformers (MM-DiT) to address the trilemma in human portrait generation between text-image alignment, photorealism, and aesthetics. The method introduces a lightweight cross-modal alignment mechanism that extracts multi-granularity vision-aligned text representations from SigLIP 2, applying supervision to MM-DiT's image branch during training without inference overhead. It preserves the base model's generalization and mines implicit aesthetic signals from pre-trained vision models. Experiments demonstrate that this approach pushes the Pareto frontier, achieving synergistic improvements across all three metrics.
multimodal diffusion transformerscross-modal alignmentsiglip 2pareto frontiervision-aligned text
Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines
We introduce temporal semantic caching and workflow optimizations for latency-sensitive industrial agent pipelines, addressing limitations of existing LLM caching techniques in parameter-rich contexts. Evaluated on AssetOpsBench, our approach combines disk-backed tool-discovery caching, dependency-aware parallel execution, and a temporal semantic cache to optimize plan-execute workflows. Results demonstrate a 1.67x speedup from workflow optimizations, reducing median end-to-end latency by 40.0%, and a 30.6x median speedup on temporal cache hits. The analysis highlights failure modes of pure semantic caching in industrial settings and provides insights into caching-evaluation interactions in MCP-backed agent benchmarks.
temporal semantic cachingplan-execute pipelinekv-cache reusedependency-aware executionassetopsbench
Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task
The University of Florida Gators submission won the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. Their two-stage pipeline first generates Spanish intermediate captions using Qwen2.5-VL, then produces target-language captions via retrieval-augmented many-shot prompting with Gemini 2.5 Flash. The approach achieved 164.1%, 131.7%, and 122.6% improvements over the baseline for Bribri, Guaraní, and Orizaba Nahuatl captioning on the dev set, with >150% improvements for Bribri and Orizaba Nahuatl on the test set. Retrieval effectiveness was found to be language-dependent, requiring large in-domain corpora, while synthetic data augmentation contributed ~28 chrF++ to Guaraní performance. The submission ranked second in human evaluations among five finalists.
retrieval-augmentedmany-shot promptingsynthetic data augmentationcultural image captioningintermediate caption
Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models
The Autoregressive Video Inverse problem Solver (AVIS) accelerates zero-shot video inverse problem solving by addressing inefficiencies in diffusion models. AVIS employs autoregressive video diffusion to restore videos in a streaming manner, initializing reverse diffusion with a measurement-consistent estimate to reduce sampling steps. Compared to non-autoregressive solvers, AVIS reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while improving restoration quality. A variant, AVIS Flash, enforces measurement consistency only on the first chunk, achieving 5.91 FPS on a single RTX 4090 GPU with competitive performance.
autoregressivediffusionlatencythroughputstreaming
Lower Bounds for Advection-Diffusion Equations: An Exploration with AI-Generated Proofs
(No summary returned.)
COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
COAgents introduces a cooperative multi-agent framework for solving Vehicle Routing Problems (VRP) by modeling the search process as a dynamically constructed Partial Search Graph (PSG). The framework employs three agents: a Node Selection Agent, a Move Selection Agent, and a Jump Agent, which guide intensification and diversification during search. This approach separates problem-agnostic search control from domain-specific encoding, enhancing adaptability. Experiments on CVRP and VRPTW benchmarks demonstrate COAgents' competitiveness, achieving a 14% and 44% reduction in solution gaps at N=100 and N=50, respectively, compared to POMO, and 21% and 40% relative to ALNS.
vehicle routing problemspartial search graphmulti-agent frameworkintensificationdiversification
Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts
This work advances the understanding of expert specialisation in vision Mixture-of-Experts (MoE) models by moving beyond category routing analysis to examine fine-grained expert-level tuning and representational structure. The authors train sparsely-gated convolutional MoE models with contrastive objectives on natural images, employing tools from visual neuroscience to characterise expert specialisation. They measure per-expert category separability, tuning via semantic dimensions derived from human behavioural judgements (THINGS dataset), and stability of expertise allocation across independent initialisations. Results reveal that expert partitioning is dominated by an animate-inanimate distinction, stable across models, with broader tuning to continuous visual and semantic dimensions beyond category boundaries, demonstrating similar category-separability despite distinct feature tuning.
mixture-of-expertscontrastive learningrepresentational similaritysemantic dimensionsanimate-inanimate
From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)
The paper proposes Hierarchical Agent-native Network Architecture (HANA), a multi-agent reference architecture enabling Level 4/5 Autonomous Networks by transitioning from static automation to agent-native intelligence. The framework features a Dual-Driven Orchestrator coordinating specialized Executive Agents, supported by Public Memory for unified knowledge, and integrates agent self-awareness to balance strategic governance with fault recovery. Validated in a 5G Core environment, HANA sustains critical throughput under congestion and reduces Mean Time to Repair by 86%, demonstrating effective unification of strategic planning and operational resilience.
autonomous networksmulti-agent system5g coreself-awarenessmean time to repair
Self-Training Doesn't Flatten Language -- It Restructures It: Surface Markers Amplify While Deep Syntax Dies
The study challenges the prevailing characterization of self-training as a flattening process in language models, demonstrating instead a restructuring effect. Analyzing eleven generations of self-training across five models (GPT-2 124M, Pythia-410M, Pythia-1.4B, OPT-1.3B, Pythia-2.8B), the authors identify asymmetric linguistic changes: surface markers increase while mid- and deep-syntactic structures decline. The Structural Depth Hypothesis (SDH) formalizes this, showing decay rates correlate strongly with structural depth (rho=0.540, p < 10^{-6}) rather than initial frequency (rho=0.225). A Superficial Complexity Paradox emerges, where aggregate complexity metrics rise despite syntactic simplification, with implications for LLM-text detection and data curation.
self-trainingstructural depth hypothesissurface markerssyntactic dependenciessuperficial complexity paradox
Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX
The authors present Mahjax, a GPU-accelerated Riichi Mahjong simulator implemented in JAX for reinforcement learning research. The environment addresses challenges of stochasticity and high-dimensional state spaces in this imperfect-information game, offering full vectorization for parallelized rollouts on GPUs. Benchmark results show throughputs of 2M and 1M steps/second on 8 NVIDIA A100s for no-red and red rule variants respectively, with demonstrated utility in training RL agents to outperform baseline policies.
riichi mahjongjaxgpu-accelerationimperfect-informationreinforcement learning
Multi-agent Collaboration with State Management
STORM (STate-ORiented Management) introduces a novel approach to multi-agent collaboration by managing agent states and mediating interactions with shared workspaces, ensuring consistent codebase views and real-time conflict resolution. Unlike workspace isolation methods (e.g., git worktree), STORM detects and resolves conflicts at write time, reducing post-hoc recovery costs. Evaluated on Commit0 and PaperBench with multiple LLMs, STORM outperforms git-worktree baselines by +18.7 and +1.4 respectively, achieving peak scores of 87.6 and 78.2. Results demonstrate STORM’s superior cost efficiency and effectiveness in multi-agent systems, offering seamless integration into existing frameworks.
multi-agent systemsstate managementconflict resolutionshared workspacellms
Complementing reinforcement learning with SFT through logit averaging in the post training of LLMs
A novel method is proposed for enhancing reinforcement learning in LLM post-training by averaging logits from a frozen reference policy (e.g., SFT) and a trainable policy, integrated into Group Relative Policy Optimization (GRPO). Unlike Reinforcement Learning with Verifiable Rewards (RLVR), this approach avoids KL regularization or critics, instead coupling policies through logit averaging to leverage reasoning expertise while preserving SFT formatting advantages. Evaluations on MATH, cn-k12, and MMLU benchmarks demonstrate comparable or superior accuracy to KL-regularized GRPO.
logit averaginggroup relative policy optimizationreinforcement learningsupervised fine-tuningkl regularization
Personality Engineering with AI Agents: A New Methodology for Negotiation Research
The article introduces personality engineering, a novel methodology leveraging AI agents to systematically parameterize, manipulate, and evaluate negotiator personality traits. It proposes the interpersonal circumplex—defined by warmth and dominance dimensions—as a foundational framework for negotiation research. This approach enables precise testing of classic negotiation theories under controlled conditions, overcoming human limitations in managing competing demands. AI agents' precision, consistency, and scalability facilitate rigorous experimentation, advancing both theoretical understanding and practical design of AI negotiation systems.
personality engineeringai agentsinterpersonal circumplexnegotiation theorywarmth and dominance
Faster or Stronger: Towards Flexible Visual Place Recognition via Weighted Aggregation and Token Pruning
The paper introduces Weighted Aggregated Descriptor (WeiAD) and WeiToP to enhance Visual Place Recognition (VPR) performance and efficiency. WeiAD assigns cluster-specific weights during patch token aggregation, producing more discriminative global descriptors. WeiToP reduces feature extraction costs via self-distillation, enabling inference-time token pruning for flexible accuracy-efficiency trade-offs. WeiToP outperforms existing token pruning methods adapted from general vision tasks and supports plug-and-play pruning without additional training. The methods address limitations of uniform token aggregation and high computational costs in Vision Transformer-based VPR systems.
visual place recognitiontoken pruningweighted aggregationself-distillationvision transformers
Latent Process Generator Matching
We introduce latent process generator matching, a general framework for generative modeling where the observed state is treated as a deterministic image of a tractable Markov process. This extends Generator Matching theory from static latent variables to time-dependent conditional processes, unifying and generalizing prior discrete latent process results. The method enables learning a stochastic process generator on the image space that preserves one-time marginal distributions of the projected process. The framework subsumes existing augmented-state constructions used in flow-matching and diffusion-style models, addressing limitations of auxiliary stochastic dynamics in training.
generator matchingmarkov processstochastic processflow-matchingdiffusion models
Axiomatizing Neural Networks via Pursuit of Subspaces
The paper introduces the Pursuit of Subspaces (PoS) hypothesis, an axiomatic framework to theoretically ground neural network behavior through geometric postulates. By formulating representation, computation, and generalization via subspace geometry, the work bridges empirical performance and mechanistic understanding in shallow and deep architectures. The framework provides geometric explanations for core deep learning phenomena, including representation structure and generalization, advancing toward a unified theoretical foundation.
axiomatic frameworksubspace geometryrepresentation structuregeneralization behaviordeep architectures
AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
AgentAtlas introduces a comprehensive framework for evaluating LLM agents beyond single-metric benchmarks, addressing the fragmentation in current evaluation practices. The framework comprises four components: a six-state control-decision taxonomy, a nine-category trajectory-failure taxonomy with hierarchical labels, a methodology distinguishing taxonomy-aware from taxonomy-blind evaluations, and a benchmark-coverage audit mapping fifteen benchmarks across six behavioral axes. The methodology was demonstrated on eight models (four closed, four open-weight), revealing that removing explicit label menus reduces trajectory accuracy by 14-40 percentage points, with no model excelling in control accuracy, trajectory diagnosis, and tool-context utility retention simultaneously.
llm agentstaxonomytrajectory-failurebenchmark-coveragecontrol-decision
Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks
The study proposes collocational bootstrapping, a mechanism where word co-occurrence patterns facilitate syntactic dependency acquisition, specifically English subject-verb agreement. Neural networks were trained on synthetic datasets with varying predictability of subject-verb pairings to simulate language acquisition. Results indicate robust learning of subject-verb agreement within a specific range of variability. Analysis of child-directed language reveals that its variability aligns with this range, supporting collocational bootstrapping as a viable learning strategy for children.
collocational bootstrappingsubject-verb agreementneural networkssynthetic datasetslanguage acquisition
NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
NeuroQA introduces a large-scale benchmark for visual question answering in 3D brain MRI, comprising 56,953 QA pairs from 12,977 subjects across 12 datasets spanning ages 5-104 and five clinical domains. The benchmark evaluates 11 clinically grounded reasoning skills through Yes/No, multiple-choice, and open-ended formats, with 131 image-grounded and 72 image-informed templates. A 38-rule deterministic pipeline and expert review ensure QA pair accuracy, verified against FreeSurfer measurements and radiology reports. Clinician evaluation confirms reliability, with zero same-subject contradictions. On closed-format test-public items, the best zero-shot vision-language model achieves 47.5% accuracy, below the 49.4% text-only majority-template floor.
visual question answering3d brain mriclinically grounded reasoningdeterministic pipelinefree surfer measurements
Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models
The study proposes machine-learning-enhanced non-invasive testing (MLE-NIT) to improve advanced fibrosis detection in MASLD while retaining FIB-4's variable space. Using biopsy-confirmed cohorts from China, Malaysia, and India (n=784), researchers compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and fine-tuned GPT-4o. The s-DNN (354 parameters) outperformed others, achieving ROC-AUCs of 0.77 (Malaysia) and 0.67 (India) versus FIB-4's 0.75 and 0.60, with better calibration (Brier scores 0.18, 0.22). AST and FIB-4 were identified as dominant variables via permutation importance.
shallow-deep neural networkfib-4roc-aucpermutation importancenon-invasive testing
Open-World Evaluations for Measuring Frontier AI Capabilities
The paper proposes open-world evaluations as a complementary approach to benchmark-based assessment for measuring frontier AI capabilities, addressing limitations of benchmarks in capturing real-world, long-horizon tasks. It introduces CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a case study, an AI agent was tasked with developing and publishing a simple iOS application to the Apple App Store, achieving success with only one avoidable manual intervention. The authors argue that open-world evaluations can provide early indicators of emerging AI capabilities and offer recommendations for their design and reporting.
open-world evaluationsfrontier aibenchmark-based assessmentlong-horizon taskscrux project
Codec-Robust Attacks on Audio LLMs
CodecAttack introduces a novel method for generating codec-robust adversarial attacks on Audio Large Language Models (Audio LLMs) by optimizing perturbations in a neural audio codec's continuous latent space rather than directly perturbing the audio waveform. The approach employs multi-bitrate straight-through Expectation-over-Transformation (EoT) to enhance robustness across real-world compression channels without modifying the target model. Evaluated across three deployment scenarios and three target models, CodecAttack achieves an average 85.5% target-substring attack success rate (ASR) on Opus at moderate bitrates, significantly outperforming waveform-based baselines. The attack transfers to held-out codecs, reaching up to 100% ASR on MP3 and 84% on AAC-LC. Per-band energy analysis reveals that latent perturbations concentrate below 4kHz, aligning with codec bit allocation.
audio llmslatent spaceexpectation-over-transformationcodec robustnessadversarial attacks
ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society
The paper introduces ShadeBench, a benchmark dataset for urban shade simulation, addressing the lack of large-scale data for analyzing building-induced shade patterns. It includes geographically diverse urban scenes with simulated shade maps, textual descriptions, satellite imagery, building skeletons, and 3D meshes. ShadeBench supports tasks like shade generation, segmentation, and 3D reconstruction, with standardized evaluation protocols and baselines. The dataset enables scalable shade analysis for urban climate research and heat-resilient planning. Code and data are available at https://darl-genai.github.io/shadebench/.
urban heat islandshade simulationmultimodal dataset3d reconstructionthermal exposure
Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection
The authors propose EncMin2L, a multi-encoder fusion framework for out-of-distribution (OOD) detection across diverse distribution shifts, achieving state-of-the-art performance at reduced parameter cost. The method combines per-encoder representation-space diffusion models (RDMs) using an encoder-agnostic two-level min-gate, calibrated without OOD labels. Encoder specialization is quantified through ID-data diagnostics η² and Δμ, while a Tippett minimum p-value combination aggregates per-encoder scores into a stable OOD signal. EncMin2L achieves ≥0.94 AUROC across global domain changes, semantic divergence, texture differences, and covariate corruptions, outperforming existing RDM-based OOD detectors on overlapping benchmarks at 2.3× lower parameter cost.
out-of-distribution detectionrepresentation-space diffusion modelstippett minimummulti-encoder fusionencoder specialization
\ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems
(No summary returned.)
Training Language Agents to Learn from Experience
We introduce In-context Training (ICT), a framework enabling language agents to distill experience into reusable lessons for cross-task self-improvement. Our method employs a reflector model that observes actor trajectories and generates system prompts to enhance future task performance, trained via an RL-based pipeline without human examples. Evaluated on ALFWorld and MiniHack, trained reflectors outperform baselines on most held-out task families, demonstrating learned ability to generalize across environments. We also present MetaGym, a Python library for constructing meta-environments to facilitate research on self-improving language agents.
in-context trainingreflector modelrl-based pipelinecross-task generalizationmeta-environments
Code Generation by Differential Test Time Scaling
DiffCodeGen introduces a novel test-time scaling method for code generation using coverage-guided differential analysis. It generates diverse code candidates via sampling and prompting strategies, synthesizes inputs through coverage-guided fuzzing without requiring existing tests or LLMs, and clusters candidates based on behavioral similarity. The medoid of the largest cluster is selected as the final output, eliminating the need for additional LLM inference. Evaluated across 4 LLMs, DiffCodeGen achieves competitive or superior performance with significantly reduced time and token consumption, demonstrating efficiency and scalability while remaining model-agnostic.
test-time scalingcoverage-guided fuzzingcode generationbehavioral similaritymodel-agnostic
EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis
We propose EPC-3D-Diff, a conditional 3D latent diffusion framework for volumetric CBCT to CT synthesis that introduces a projection domain equivariance loss derived from acquisition physics. The method enforces physics consistency by forward projecting rotated synthesized CT volumes and matching them to angle-shifted projections of the target CT, integrated into the diffusion objective. Conditional diffusion is performed in a compact latent space learned by a lightweight 3D autoencoder. Evaluated on paired head CBCT/CT phantom and clinical datasets, EPC-3D-Diff achieves +7.4 dB (phantom) and +1.8 dB (clinical) PSNR improvements over state-of-the-art methods, alongside enhanced SSIM and HU accuracy within tissue boundaries.
equivariance losslatent diffusion3d autoencoderhounsfield unitprojection domain
High Quality Embeddings for Horn Logic Reasoning
The paper introduces improved embedding methods for enhancing Horn logic reasoning via neural networks. Key innovations include generating anchors with repeated terms, balanced positive/negative example sampling (easy/medium/hard), and periodic hard example emphasis during triplet loss training. Experiments evaluate embedding quality across multiple knowledge bases, analyzing task-specific suitability characteristics. Results demonstrate that the proposed techniques yield higher-quality embeddings for downstream logical reasoning tasks compared to baseline approaches.
horn logictriplet lossknowledge basesneural embeddingshard example mining
Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures
This study conducts a systematic comparison of deep learning architectures for COVID-19 lesion segmentation in CT images, addressing methodological inconsistencies in prior medical image analysis research. The evaluation combines four segmentation frameworks (Unet, PSPNet, Linknet, FPN) with six pre-trained encoders (VGG19, DenseNet121, InceptionResNetV2, MobileNetV2, SeresNet101, EfficientNetB0) across binary and multi-class segmentation tasks. Results on three COVID-19 CT datasets demonstrate strong performance, achieving 98% F1-score for binary segmentation and 75-77% for multi-class segmentation, establishing benchmarks for medical image analysis applications.
medical image segmentationpre-trained encodersf1-scorect imagerydeep learning architectures
Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development
The paper introduces Agentic Agile-V, a framework integrating Agile-V lifecycle with a SCOPE-V loop (Specify, Constrain, Orchestrate, Prove, Evolve, Verify) to enhance AI-driven software/hardware development. It addresses limitations in current autonomous coding systems by emphasizing engineering process control over prompt engineering. Contributions include: (i) artifact taxonomies for agentic workflows, (ii) conversation-to-contract gating, (iii) adaptive workflows for risk management, and (iv) evidence-based artifact acceptance. Findings highlight persistent challenges in repository setup, verification, and dependency handling, underscoring the need for disciplined requirements, traceability, and human oversight despite AI automation.
agentic agile-vscope-v loopconversation-to-contracthardware verificationartifact acceptance
LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series
This paper demonstrates that language-pretrained transformers enable effective cross-modal transfer to time-series forecasting by preconditioning training with a reusable manifold. Using linear probes on frozen LLM states, realistic time-series trajectories are decoded without paired supervision, and retrieval in this projected space yields competitive forecasts, indicating pre-existing structure and dynamics. Pretrained initialization improves optimization, producing coherent gradients and an anisotropic loss landscape, while finetuning acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch. Results support a geometric account of transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.
cross-modal transfermanifoldlinear probeanisotropic losslow-dimensional alignment
A Comprehensive Comparison of Deep Learning Architectures for COVID-19 Classification on CT & X-ray Imagery
This study evaluates deep learning architectures for COVID-19 classification on CT and X-ray imagery, proposing a convolutional neural network (CNN)-based computer-aided diagnosis (CAD) system. The authors compare pre-trained models including VGG (16, 19), Densenet (121), Resnet (50, 50 V2, 101 V2), MobileNet V2, Xception, Inception (V3, ResNet V2), EfficientNet B0, and NasNet Large on two X-ray and two CT datasets. Resnet and VGG architectures achieved superior performance, with average classification accuracies of 95-98% for distinguishing COVID-19 from healthy lung images. The results demonstrate competitive and improved performance compared to prior literature.
convolutional neural networkscomputer-aided diagnosispre-trained modelsclassification accuracyct scans
Modeling Emotional Dynamics in Agent-to-Agent Interactions on Moltbook
The study introduces an emotion-aware framework for analyzing agent-to-agent interactions on Moltbook, a social network populated by generative AI systems. The method maps textual interactions to fine-grained emotional categories and evaluates behavioral reliability using a novel Persona-Stimulus-Reaction (PSR) domain. Results reveal distinct emotional signatures and varying levels of behavioral stability among agents, influenced by interaction context.
emotion-aware frameworkpersona-stimulus-reactionbehavioral stabilitygenerative aimulti-agent interaction
Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
The study identifies weight decay as a scalar empirical control parameter for training regimes in transformers on modular arithmetic tasks, distinguishing memorization, developmental grokking, and collapse. Two low-cost online diagnostics—mean pairwise attention-head cosine similarity and entropy standard deviation—are introduced to track training dynamics from attention activations. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), weight decay separates these regimes, with a memorization-to-developmental boundary localized at λ_c=0.0158. Empirical exponent ν=0.757 is reported, differing from reference values. Multi-task replication and cross-architecture probes confirm the weight-decay-controlled transition, though claims are scoped to modular arithmetic in small transformer attention models.
weight decaymodular arithmeticattention-head cosine similarityentropy standard deviationtraining regimes
Group-Algebraic Tensors: Provably-optimal Equivariant Learning and Physical Symmetry Discovery
(No summary returned.)
AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
AgentCo-op introduces a retrieval-based synthesis framework for composing multi-agent workflows in open-ended scientific domains without curated training data or standardized interfaces. The method combines reusable skills, tools, and external agents through typed artifact handoffs, employing bounded self-guided local repair when execution failures occur. Evaluations on genomics case studies and six benchmarks (coding, math, QA) show superior performance on four benchmarks and best average scores, while reducing per-task costs compared to multi-agent baselines.
multi-agent workflowsretrieval-based synthesistyped artifact handoffslocal repairopen-world genomics
OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind
OSCToM introduces a reinforcement learning-guided approach for modeling high-order Theory of Mind (ToM) in Large Language Models, specifically addressing observer-self belief conflicts through recursive, multi-layered reasoning. The method combines RL, a domain-specific language, and compositional surrogate models to generate adversarial training scenarios. Evaluated on benchmarks including FANToM, Hi-ToM, and BigToM, OSCToM-8B achieves 76% accuracy on FANToM, significantly outperforming ExploreToM's 0.2%, while being 6x more efficient in data synthesis. This demonstrates the efficacy of targeted training for advanced cognitive reasoning in smaller models.
theory of mindreinforcement learningrecursive reasoningadversarial generationinformation asymmetry
Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs
This study investigates the impact of Chain-of-Thought (CoT) prompting on gender bias in large language models (LLMs), combining benchmark evaluations with mechanistic interpretability techniques and reasoning chain failure analysis. Results reveal persistent stereotypical gender bias across benchmarks, with CoT prompting failing to consistently reduce the bias gap. Mechanistic analyses indicate that while CoT balances biased behavior in certain attention head clusters, gender bias remains embedded in hidden representations, suggesting only superficial mitigation. Reasoning chain inspections further suggest that improvements stem from dataset memorization rather than genuine understanding of bias.
chain-of-thought promptinggender biasmechanistic interpretabilityattention head clustershidden representations
Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation
The study introduces episodic sampling, adapted from few-shot learning, to address class imbalance in CT body composition segmentation by promoting class-balanced batch construction. This method is decoupled from metric-learning contexts and evaluated against random and weighted sampling on nine muscle and adipose tissues from 210 SAROS dataset scans. Results show comparable performance under full-data training (mean Dice 0.882 episodic vs. 0.878 random/weighted), but episodic sampling outperforms in low-data regimes (0.787 vs. 0.758/0.762), attributed to a 12-fold difference in training iterations. Episodic sampling also exhibits implicit regularization benefits, improving performance over three times more iterations before plateauing compared to alternatives.
episodic samplingclass imbalancect segmentationsaros datasetimplicit regularization
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
The work presents a three-way decomposition of MXFP4 quantization error in LLM reinforcement learning, identifying distinct components that affect different training pathways: scale bias (power-of-two rounding), deadzone truncation (zeroing small values), and grid noise (4-bit rounding). The authors propose targeted corrections: Macro-block scaling for scale bias, Outlier Fallback for deadzone entries, and Adaptive Quantization Noise (AQN) for policy entropy control. Evaluations on Qwen2.5-3B and Qwen3-30B-A3B-Base models show recovery to within 0.7% and 3.0% of BF16 accuracy, respectively.
mxfp4 quantizationreinforcement learninglarge language modelsquantization error decompositionadaptive quantization noise
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving
The STELLAR model demonstrates that large-scale training significantly advances autonomous driving perception systems, addressing challenges in heterogeneous sensor fusion and 3D spatial understanding. The model extends the Sparse Window Transformer architecture to incorporate LiDAR, radar, camera, and map prior inputs, trained on a dataset of 50 million driving examples with up to 500 million parameters. Empirical scaling trends reveal strong connections between model performance, size, data, and compute. STELLAR achieves state-of-the-art results on the Waymo Open Dataset, outperforming prior methods by a substantial margin.
sparse window transformerlidarheterogeneous sensor fusionwaymo open datasetscaling trends
Nonlocal operator learning for fMRI encoding and decoding tasks
A neural integral-operator-based framework is proposed for fMRI encoding and decoding tasks, emphasizing nonlocal spatiotemporal context. The method implements latent neural integral operators that perform fixed-point iterations in an auxiliary space, with classification and stimuli prediction handled by a decoder. Evaluated on two open-source fMRI datasets, results show that larger temporal windows improve performance and yield more structured latent representations, with clearer class separation in decoding tasks. Encoding tasks, while challenging, also benefit from extended temporal context. The findings suggest that exploiting distributed nonlocal structure in brain dynamics requires specialized architectures.
neural integral operatorsfmri encodingspatiotemporal contextfixed-point iterationslatent-space geometry
ConceptSeg-R1: Segment Any Concept via Meta-Reinforcement Learning
ConceptSeg-R1 introduces a unified framework for generalized concept segmentation, formalized through a three-level taxonomy of context-independent (CI), context-dependent (CD), and context-reasoning (CR) concepts. The method employs Meta-GRPO, a meta-reinforcement learning mechanism, to learn transferable task rules from visual demonstrations and verify them via proxy reasoning. These rules are translated into segmentation-ready prompts through a lightweight concept translation module, with a shortcut routing strategy preserving efficiency for simple cases. Extensive experiments across diverse benchmarks demonstrate ConceptSeg-R1's strong performance across CI, CD, and CR tasks, advancing segmentation from object-level prediction to concept-level understanding.
meta-reinforcement learningconcept segmentationtask rulesproxy reasoningpromptable segmentation
Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs
This work investigates the conflict between instruction-following and pattern-completion in large language models (LLMs) through constructed conversations where user instructions oppose hardcoded assistant patterns. Across 13 models and 16 instructions, instruction-following rates varied from 1% to 99%, uncorrelated with standard benchmarks, with robustness modulated by instruction content, output format, and chain-of-thought reasoning. Multi-token responses were substantially more resistant than single-token outputs, and models underestimated their own resistance to induction pressure by 16.5% on average. Results demonstrate that instruction-following remains brittle under induction pressure, with output diversity being the primary predictor of robustness.
instruction-followingpattern-completioninduction pressurechain-of-thoughtoutput diversity
SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework
SUGAR introduces a scalable framework for learning generalizable humanoid loco-manipulation skills from diverse human videos, eliminating task-specific reward engineering and reference-motion conditioning. The method employs a three-stage pipeline: extracting kinematic interaction priors from unstructured videos, refining these priors into physically feasible skills using a physics-based refiner with a unified mimic reward, and distilling refined skills into a hierarchical autonomous policy. Evaluated on six loco-manipulation tasks in simulation and real-world hardware, SUGAR outperforms reference-tracking baselines, demonstrates zero-shot real-world transfer, and scales with video data volume, achieving reliable closed-loop execution and autonomous failure recovery.
humanoid loco-manipulationkinematic interaction priorsphysics-based refinerhierarchical autonomous policyzero-shot transfer
Latent Space Guided Scenario Sampling for Multimodal Segmentation Under Missing Modalities
The authors propose a latent space guided scenario sampling strategy for multimodal semantic segmentation under missing modalities, addressing limitations of uniform random modality dropout during fine-tuning. The method quantifies scenario informativeness by measuring latent representation distortion, captures scenario relations via radial basis function kernel, and derives refined scores through regularized kernel smoothing to guide sampling. Evaluated on DSTL, Potsdam, and Hunan remote sensing datasets using CBC-SLP, CBC, and CMX backbones, the approach outperforms standard fine-tuning and LoRA-based adaptation, demonstrating the utility of pretrained latent representations for missing modality fine-tuning.
multimodal semantic segmentationlatent spacemodality dropoutradial basis functionfine-tuning
DEL: Digit Entropy Loss for Numerical Learning of Large Language Models
The paper introduces Digit Entropy Loss (DEL), a novel method for improving numerical learning in large language models (LLMs). DEL reformulates entropy optimization through three key designs: digit conditional probability with binary cross-entropy, elimination of numerical distance terms, and generalization to floating-point numbers. Evaluated on seven mathematical reasoning benchmarks using CodeLlama, Mistral, DeepSeek, and Qwen-2.5, DEL consistently outperforms existing methods in prediction accuracy and numerical distance.
digit entropy lossnumerical learninglarge language modelsbinary cross-entropyfloating-point optimization
Security Document Classification with a Fine-Tuned Local Large Language Model: Benchmark Data and an Open-Source System
This study introduces TorchSight, an open-source local system for security document classification based on a fine-tuned Qwen 3.5 27B model. The model was trained on 78,358 samples from 13 permissively licensed sources and GPT-4 synthetic data, covering seven security categories and 51 subcategories. Evaluation on 1,000 documents yielded 95.0% category-level accuracy (95% CI: 93.5-96.2), outperforming commercial models scoring 75.4-79.9% under identical prompting. External validation on 500 held-out samples achieved 93.8% accuracy, demonstrating robust generalization. Results indicate that fine-tuned local models can achieve high accuracy while maintaining document processing locally.
security document classificationfine-tuned modellocal processingsynthetic datacategory-level accuracy
Consistently Informative Soft-Label Temperature for Knowledge Distillation
The paper introduces CIST (Consistently Informative Soft-label Temperature), a novel approach to knowledge distillation that addresses limitations of fixed-temperature scaling. CIST assigns sample-wise adaptive temperatures separately to the teacher and student models, enabling consistent entropy in teacher soft labels and relaxing rigid logit-scale alignment. It also reweights the distillation objective based on teacher confidence and student learning difficulty. Theoretical analysis links teacher-label entropy to the ratio of maximum logit to temperature. Empirical evaluations on vision and language tasks demonstrate consistent improvements over standard knowledge distillation and strong baselines, with negligible computational overhead.
knowledge distillationadaptive temperaturesoft-label entropylogit-scale alignmentsample-wise reweighting
Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models
The study investigates synchronization and turn-taking mechanisms in full-duplex spoken dialogue models (SDMs) by analyzing internal representations during simulated interactions. Two instances of the pretrained Moshi model were engaged in full-duplex dialogues under controlled conditions, with manipulations of channel noise and decoding bias. Synchronization was measured using Centered Kernel Alignment (CKA) across temporal lags, while anticipatory turn-taking cues were extracted from delayed internal activations using causal LSTM models. Results show strong representational synchronization under no noise, peaking near zero lag and degrading with noise, and reveal that internal states encode anticipatory information for turn-taking prediction.
full-duplexcentered kernel alignmentcausal lstmturn-takingsynchronization
Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions
HF-KCU introduces a federated unlearning method that efficiently removes client contributions without full retraining. The approach uses Krylov subspace approximations to reduce computational complexity from O(d^3) to O(kd), with a causal weighting mechanism ensuring targeted updates. Validated on ResNet-18, SimpleCNN, and ViT-Lite across CIFAR-10, MNIST, and Fashion-MNIST, it achieves 47.75x speedup over retraining while maintaining accuracy within 0.60% of baseline. Membership inference attacks confirm privacy restoration, with convergence guarantees showing error reduction as O((k^1/2-1)/(k^1/2+1)). The method supports asynchronous deletion requests while preserving model quality for unaffected clients.
federated learninginfluence functionkrylov subspacehessian approximationmembership inference
Targeting Clause Type Distributions: a Picklock for Random Satisfiability Problems
The paper introduces Target-SAT (TSAT), a stochastic local search algorithm that triples tractable problem sizes for hard random 3-SAT instances by leveraging hidden statistical information in combinatorial constraints. TSAT actively navigates the parameter space toward a target, overcoming exponential scaling barriers that limit traditional local search methods. Results demonstrate superior performance in the hardest regime, reclaiming the lead for stochastic approaches in solving satisfiability problems.
3-satising spin hamiltoniansstochastic local searchsatisfiability phase transitionexponential scaling
Representability-Aware Neural Networks for Reduced Density Matrices: Application to Fractional Chern Insulators
We introduce a representability-aware neural network framework for predicting two-particle reduced density matrices (2-RDMs), incorporating representability conditions through architecture and loss function. The method enables interpolation across momentum meshes and serves as a variational 2-RDM ansatz optimized via energy minimization. Applied to a fractional Chern insulator in twisted bilayer MoTe$_2$ at a twist angle of $3.89^\circ$ and hole filling $2/3$, the residual multilayer perceptron achieves 97.07%-98.18% accuracy in predicting $6\times6$ 2-RDMs and predicts an energy 0.104 meV below exact diagonalization (ED) after variational optimization. The approach outperforms semidefinite programming in energy accuracy and parameter efficiency, using fewer than 1/20 parameters.
reduced density matricesneural networksrepresentability conditionsmomentum meshesvariational optimization
FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation
FullFlow introduces a parameter-efficient method to upgrade pretrained rectified-flow text-to-image models into bidirectional vision--language generators by training only LoRA adapters and lightweight text heads. The approach maintains images in continuous flow while adding a discrete insertion process for text, enabling text-to-image, image-to-text, joint sampling, and partial-text prediction with a single backbone. Evaluated on Stable Diffusion 3 (SD3), FullFlow improves text-to-image FID from 62.7 to 31.6 and image-to-text CIDEr from 2.0 to 99.4 over Dual Diffusion, reduces peak VRAM from ~84GB to ~38GB, and increases throughput by ~8× on two RTX A5000 GPUs in under 24 hours, training only ~5% of backbone parameters. The method also transfers to FLUX.1-dev and supports downstream VQA.
rectified-flowlora adaptersbidirectional generationpartial-text predictionvision--language
Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases
The paper identifies a 'small-vs-large gap' where training on smaller datasets with repeated samples yields faster convergence than larger datasets, contradicting prior theoretical expectations. Through theoretical analysis and empirical interventions across algorithmic tasks, architectures, and optimizers, the authors attribute this phenomenon to layer-wise growth facilitated by sampling biases, which are more pronounced in smaller datasets. Results demonstrate that dataset repetition can serve as a favorable inductive bias for optimization, particularly in reasoning tasks, offering compute-efficient training strategies.
small-vs-large gapsampling biasesinductive biaseslayer-wise growthcompute-efficient training
Equilibrium Reasoners: Learning Attractors Enables Scalable Reasoning
The paper introduces Equilibrium Reasoners (EqR), a framework for scalable reasoning through learning task-conditioned attractors in latent dynamical systems. EqR scales reasoning along two axes: depth via iterative updates and breadth via stochastic trajectory aggregation, enabling adaptive test-time compute allocation based on task difficulty. Empirical results demonstrate that EqR achieves 99% accuracy on Sudoku-Extreme by unrolling up to 40,000 equivalent layers, compared to 2.6% for feedforward models. The findings suggest that learned attractor landscapes provide a mechanistic explanation for scalable reasoning in iterative latent models.
equilibrium reasonerstask-conditioned attractorslatent dynamical systemstest-time scalingstochastic trajectory aggregation
EvoStruct: Bridging Evolutionary and Structural Priors for Antibody CDR Design via Protein Language Model Adaptation
EvoStruct introduces a novel approach to antibody complementarity-determining region (CDR) design by integrating evolutionary and structural priors through a protein language model (PLM) adaptation. The method bridges a frozen PLM with 3D structural context from an E(3)-equivariant graph neural network (GNN) using a cross-attention adapter, addressing vocabulary collapse via progressive PLM unfreezing and R-Drop consistency regularization. On the CHIMERA-Bench dataset, EvoStruct achieves a 16% improvement in sequence recovery, a 43% reduction in perplexity, and recovers 2.3x greater amino acid diversity compared to GNN baselines, while maintaining the highest binding-pair correlation with ground truth.
protein language modelgraph neural networksequence recoveryvocabulary collapsecross-attention adapter
Velocityformer: Broken-Symmetry-Matched Equivariant Graph Transformers for Cosmological Velocity Reconstruction
The paper introduces Velocityformer, an equivariant graph transformer for cosmological velocity reconstruction from galaxy surveys. The architecture matches the broken symmetry of observational data (translational/rotational equivariance disrupted by line-of-sight effects), outperforming linear theory by 35% in correlation coefficient r. Key innovations include physics-conditioned inductive biases and symmetry-aware design, enabling data-efficient training (4 simulations suffice) and zero-shot generalization across geometries and cosmologies. On high-fidelity simulations, Velocityformer achieves 30% higher r than physical baselines, directly improving kSZ measurement SNR.
velocityformerksz effectequivariant transformercosmological inferencesymmetry-breaking
Is Fixing Schema Graphs Necessary? Full-Resolution Graph Structure Learning for Relational Deep Learning
FROG introduces a Full-Resolution and Optimizable Graph Structure Learning framework for Relational Deep Learning (RDL), addressing the limitation of fixed graph structures in modeling relational databases. The method formulates relational structure learning as a learnable table role modeling problem, enabling tables to function as both nodes and edges in message passing. It incorporates role-driven message passing mechanisms and functional dependency constraints to ensure semantic consistency across table and entity levels. Extensive experiments demonstrate FROG's superior performance over existing approaches, highlighting the impact of table roles on downstream tasks and providing insights into graph construction for RDL.
graph structure learningrelational deep learningmessage passingfunctional dependencytable role modeling
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
The paper demonstrates that reinforcement learning with verifiable rewards (RLVR) produces extremely low-rank parameter trajectories in LLMs, with a rank-1 approximation capturing most performance gains. It proposes RELEX, a compute-efficient method that estimates this rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression. Evaluated on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base, RELEX matches or exceeds RLVR performance using only 15% of training steps, extrapolating up to 20× beyond observed steps. The method's success stems from denoising updates by projecting onto the rank-1 subspace.
reinforcement learninglow-rank approximationparameter trajectoriesextrapolationdenoising
DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards
We propose DelTA, a discriminative token credit assignment method for reinforcement learning from verifiable rewards (RLVR) that addresses the dilution of sparse yet discriminative token-gradient directions by shared high-frequency patterns. DelTA estimates token coefficients to amplify side-specific token-gradient directions and downweight shared or weakly discriminative ones, reshaping the RLVR update direction through a self-normalized RLVR surrogate. On seven mathematical benchmarks, DelTA outperforms the strongest same-scale baselines by 3.26 and 2.62 average points on Qwen3-8B-Base and Qwen3-14B-Base, respectively, with additional results demonstrating generalization across tasks and backbones.
reinforcement learningtoken credit assignmentverifiable rewardsdiscriminative directionsself-normalized surrogate
A Machine Learning Framework for Weighted Least Squares GNSS Positioning based on Activation Functions
A machine learning framework enhances GNSS positioning accuracy in urban canyons by integrating activation functions into the weighted least squares (WLS) algorithm. Signal quality indicators train ensemble learning models to predict quality scores, which activation functions transform into WLS weights. Experiments with real-world datasets from Hong Kong and Tokyo demonstrate that sigmoid activation functions yield the greatest improvements across various machine learning algorithms and GNSS constellations. The approach significantly reduces positioning errors in both single- and multiconstellation scenarios and exhibits strong geographical transferability, maintaining performance when trained on data from similarly urbanized regions.
gnssweighted least squaresactivation functionssignal quality indicatorsurban canyons
Mitigating Label Bias with Interpretable Rubric Embeddings
The paper proposes rubric embeddings, an interpretable representation framework that mitigates label bias in statistical decision algorithms by anchoring predictions to expert-defined criteria. The method replaces black-box embeddings with semantically meaningful features aligned with the underlying construct of interest, theoretically and empirically reducing bias. Evaluation on a master's program application dataset shows reduced group disparities while improving cohort quality metrics, demonstrating the approach's effectiveness for learning with biased labels.
rubric embeddingslabel biasinterpretable representationsstatistical decision algorithmsgroup disparities
Neural Negative Binomial Regression for Weekly Seismicity Forecasting: Per-Cell Dispersion Estimation and Tail Risk Assessment
We propose a neural negative binomial regression model for weekly seismicity forecasting that estimates per-cell dispersion parameters, addressing limitations of standard Poisson-based approaches with global dispersion assumptions. The method employs a neural network to predict negative binomial distribution parameters for each spatial grid cell, enabling localized dispersion estimation and improved tail risk assessment. Evaluated on Central Asian seismic data (2010-2024), the model significantly outperforms the Poisson baseline, with a likelihood-ratio test strongly rejecting the Poisson hypothesis (p = 5) and achieving a 12.5% lower continuous ranked probability score (CRPS), demonstrating enhanced calibration for extreme event prediction.
negative binomial regressiondispersion estimationseismicity forecastingcontinuous ranked probability scoretail risk assessment
Gaussian Sheaf Neural Networks
The authors propose Gaussian Sheaf Neural Networks (GSNNs), a novel framework for graph-based learning where node features are Gaussian distributions. GSNNs extend traditional GNNs by incorporating geometric and algebraic structure of means and covariances through a sheaf-theoretic approach, introducing a new Laplacian operator that generalizes the sheaf Laplacian. Experimental results on synthetic and real-world data demonstrate the practical utility of this approach for relational data with probabilistic node features.
graph neural networksgaussian distributionscellular sheavessheaf laplacianrelational data
roto 2.0: The Robot Tactile Olympiad
The paper introduces roto 2.0, a GPU-parallelized benchmark for tactile-based reinforcement learning (RL) that standardizes evaluation across four robotic morphologies (16-DOF to 24-DOF). Unlike prior work, it focuses on end-to-end blind manipulation using only proprioception and tactile sensing, eliminating state information or distillation. Results show a significant performance improvement, with blind agents achieving 13 Baoding ball rotations in 10 seconds, an order of magnitude faster than current state-of-the-art. The benchmark includes open-sourced environments and robust baselines to reduce RL tuning overhead.
tactile-based rlblind manipulationgpu-parallelized benchmarkbaoding ball rotationproprioception
Polynomial-Time Robust Multiclass Linear Classification under Gaussian Marginals
The paper presents polynomial-time algorithms for robust multiclass linear classification under Gaussian marginals, addressing a gap in prior work that required exponential complexity for k≥3 classes. By developing new structural insights into multiclass linear classifiers, the authors first show that standard multiclass perceptron fails to converge efficiently even with clean Gaussian data. They then introduce two frameworks: (1) a pairwise improper-learning approach achieving Õ(k^(3/2)√opt)+ε error, and (2) a localization-based method yielding O(opt)+ε error for k=3 and poly(k)opt+ε error for geometrically regular classifiers. Both methods provide dimension-independent guarantees.
agnostic learningmulticlass classificationgaussian marginalsimproper learningperceptron algorithm
Adaptive Signal Resuscitation: Channel-wise Post-Pruning Repair for Sparse Vision Networks
Adaptive Signal Resuscitation (ASR) introduces a training-free, channel-wise post-pruning repair method to address accuracy collapse in high-sparsity regimes. ASR estimates variance-matching corrections per output channel, stabilized by a data-driven shrinkage rule, ensuring reliable repairs for healthier channels while suppressing unreliable ones. Applied before BatchNorm recalibration, ASR requires only forward passes on a small calibration set. Evaluated across three datasets, four convolutional architectures, and unstructured/structured sparsity settings, ASR outperforms layer-wise repair, notably recovering 55.6% top-1 accuracy on ResNet-50 at 90% sparsity on CIFAR-10, compared to 41.0% for layer-wise repair.
post-pruning repairchannel-wise variancedata-driven shrinkagebatchnorm recalibrationhigh-sparsity regimes
Preference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning
PRISM (PReference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning) improves LLM fine-tuning efficiency by weighting target examples based on the current model's preferences, constructing a preference-aware target representation. It scores candidate training samples by their alignment with this representation, focusing the data budget on samples likely to move the model toward the target behavior. Theoretical analysis confirms that preference weighting provides a more effective first-order direction for increasing target-behavior preference. Experiments across model families and scales demonstrate PRISM's effectiveness in efficient fine-tuning and safety-oriented SFT repair, highlighting the importance of precise target-behavior characterization.
fine-tuningdata selectioninfluence functionpreference weightingtarget-behavior
What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema
The paper introduces a standardized audit schema for evaluating disclosure practices in LLM agent benchmarking papers, addressing reproducibility challenges in reported results. Authors analyze twelve prominent papers (eight agent, four classical benchmarks) using a five-dimension scoring system (benchmark identity, harness specification, inference settings, cost reporting, failure breakdown). Results show agent benchmarks score significantly lower (mean 0.38/1.0) than classical ones (0.66), with particular gaps in cost disclosure and environment specification. The schema, codebook, and raw scores are released as open artifacts to facilitate future multi-rater audits.
llm agent benchmarkingreproducibility auditdisclosure schemaevaluation harnessinference cost
Memorisation, convergence and generalisation in generative models
This work provides an exact analytical characterization of the transition from memorization to generalization in linear generative models, addressing two key questions: the data requirements for convergence and what convergence captures about learning data distributions. The authors demonstrate that these models memorize at small data loads, while convergence emerges continuously when the number of samples scales linearly with input dimension. Notably, convergence is insensitive to recovery of principal latent factors, which occurs in a sharp transition. Experiments with convolutional denoisers and prior diffusion model data confirm that generalization decomposes into distinct objectives: matching the bulk distribution and recovering latent factors, with only the former captured by convergence.
memorizationgeneralizationconvergencelatent factorsdiffusion models
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration
We propose DiSI, a unified framework for Image Restoration (IR) that disentangles stochastic interpolants into independent generation and regression components, enabling continuous control over the distortion-perception trade-off. DiSI employs two specific sampling trajectories and a unified sampler for few-step inference, alongside a dual-branch U-Net transformer network in pixel space with a dedicated conditional guidance branch. Experiments demonstrate DiSI's efficiency and competitive performance across various IR tasks, offering inference-time flexibility within a single model.
stochastic interpolantsimage restorationdiffusion modelsdual-branch u-netconditional guidance
Classification of Single and Mixed Partial Discharges under Switching Voltage Using an AWA-CNN Framework
The study introduces an Amplitude-Width-Area (AWA) pattern representation for classifying single and mixed partial discharge (PD) sources under switching-voltage excitation. PD pulses are characterized by amplitude, width, and area, then mapped to visual patterns with distinguishable source-dependent distributions. Using InceptionV3 and ResNet-18 CNNs, the method achieves 96% testing accuracy in classifying six PD conditions (corona, internal, surface, and their mixtures), outperforming a Random Forest baseline (73.33%).
partial dischargeswitching-voltage excitationawa patternconvolutional neural networksource classification
Semiparametric Efficient Bilevel Gradient Estimation
The paper introduces a semiparametric debiasing method for bilevel gradient estimation, addressing first-order bias in plug-in hypergradients when the lower-level function is learned nonparametrically. The approach leverages efficient influence functions to derive a cross-fitted orthogonal hypergradient estimator, achieving asymptotic normality and uniform control over outer parameters. Under quadratic losses, the estimator simplifies to a doubly robust score based on conditional mean nuisances. Experiments on synthetic bilevel benchmarks demonstrate alignment with oracle gradients and improvements over plug-in functional hypergradients and kernel-based baselines.
bilevel optimizationsemiparametric efficiencyhypergradient estimationinfluence functiondoubly robust
Fast and Stable Triangular Inversion for Delta-Rule Linear Transformers
The paper presents a systematic analysis of triangular matrix inversion algorithms for Delta-Rule linear transformers, focusing on numerical stability and hardware efficiency. It evaluates both direct and iterative methods, emphasizing matrix product-rich operations suitable for modern NPUs. Experimental results demonstrate a 4.3× speed-up over SGLang implementations while maintaining model accuracy in low-precision arithmetic, with significant improvements at the layer level.
linear attentiontriangular inversiondelta-rulenumerical stabilityhardware efficiency
Stimulus symmetries can confound representational similarity analyses
The study demonstrates that stimulus symmetries can confound representational similarity analyses (RSMs) by generating functionally equivalent but geometrically distinct neural codes. Using theoretical analysis and empirical validation with image-encoding networks, the authors show that stochastic gradient descent or energetic regularization produces sparse, drifting codes that yield divergent RSMs despite functional equivalence. Results reveal that nonlinear neural codes with latent symmetries complicate RSM comparisons when equivalent representations lack simple rotational relationships.
representational similarity matricesneural codesstimulus symmetriesstochastic gradient descentenergetic regularization
Optimized Federated Knowledge Distillation with Distributed Neural Architecture Search
FedKDNAS introduces a federated learning framework combining client-side neural architecture search with knowledge distillation to address statistical and system heterogeneity. Clients autonomously select lightweight models under resource constraints, train locally using a hybrid supervised-distillation objective, and share predictions on a public reference set. The server aggregates and smooths these predictions for stable distillation targets. Evaluations on six datasets show FedKDNAS outperforms six FL baselines, achieving up to 15% accuracy improvement under non-IID conditions, reducing client CPU usage by 28%, and decreasing communication overhead by 44x while maintaining logit-based communication.
federated learningknowledge distillationneural architecture searchnon-iidlogit-based communication
CRAFT: Conflict-Resolved Aggregation for Federated Training
CRAFT introduces a conflict-resolved aggregation framework for federated learning under data heterogeneity, formulating global update aggregation as a geometric correction problem. The method derives a closed-form solution for constrained optimization, avoiding iterative solvers, and employs layer-wise adaptation to address feature-level conflicts. Theoretical analysis shows CRAFT promotes common-descent structure through projection geometry. Experiments on heterogeneous benchmarks demonstrate improved global accuracy (exact metrics unspecified) and reduced client performance disparity compared to state-of-the-art baselines.
federated learningheterogeneous datageometric correctionconstrained optimizationlayer-wise adaptation
A New Framework to Analyse the Distributional Robustness of Deep Neural Networks
The authors propose a novel framework for analyzing the distributional robustness of deep neural networks by modeling interactions between layer weights and activations using Bernoulli distributions. Class separation serves as a diagnostic proxy for robustness, with metrics distinguishing networks that memorize training data from those that generalize. Experiments on CIFAR-10 and ImageNet validate the framework, showing reduced separation under distribution shifts. The framework provides model-level diagnostics of representation structure and robustness, though analogous activation-space experiments fail to yield consistent results.
distributional robustnessbernoulli distributionsclass separationactivation spacemodel-level diagnostics
Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls
Deep UCSL introduces a Contrastive Subgroup Discovery method for identifying interpretable, homogeneous patient subgroups by contrasting them with healthy controls. The framework employs a deep feature extractor to learn a discriminative representation space, optimizing a novel loss based on the conditional joint likelihood of latent clusters and patient/control labels via Expectation-Maximization. A regularization term ensures disease-specific variability is captured while ignoring shared variability with controls. Evaluations on an MNIST example and four medical imaging datasets demonstrate quantitative improvements in subgroup quality over previous methods. Code and datasets are publicly available.
contrastive subgroup discoverydeep feature extractorexpectation-maximizationdiscriminative representation spaceregularization term
A Mechanistic Study of Tabular Foundation Models
The study provides a mechanistic analysis of tabular foundation models, addressing three key questions: algorithmic convergence, invariance origins, and robustness to engineered perturbations. Through causal interventions and perturbation experiments across multiple architectures, it identifies distinct similarity-based readout mechanisms, including attention-weighted voting and class-conditional mean readout. Results demonstrate that representation collapse is not practically significant, while permutation invariances stem from specific positional parameters whose removal preserves accuracy. Engineered perturbations reproduce predicted failure modes, isolating hub and rank attacks from refit baselines. These findings elucidate the inductive biases governing accuracy and characteristic failures in contemporary tabular foundation models.
tabular foundation modelscausal interventionpermutation invariancessimilarity-based readoutengineered perturbations
FedCoE: Bridging Generalization and Personalization via Federated Coordinated Dual-level MoEs
FedCoE introduces a Federated Coordinated dual-level Mixture-of-Experts framework to address the trade-off between global generalization and local personalization in Federated Learning. It maintains independent global expert models on the server and employs a shared gating network to dynamically model client-expert correlations, mitigating expert drift and gating inconsistency. An adaptive mechanism enables new clients to leverage the global expert pool without extensive local training. Experiments show FedCoE achieves 78.00% global accuracy and 89.32% personalized accuracy, outperforming baselines by 8.82% and 29.19%, respectively. In cold-start scenarios, it delivers 77.27% accuracy without local fine-tuning, surpassing baselines by over 12.54%.
federated learningmixture-of-expertscold-start problemgating networkparameter divergence
Nonparametric Learning and Earning with One-Point Feedback under Nonstationarity
The authors propose a nonparametric learning framework for dynamic pricing under nonstationary market conditions with one-point feedback per period. The method employs revenue-based gradient approximations to update prices and incorporates a restarting mechanism to adapt to environmental changes, with a meta-learning layer to hedge across restarting schedules when nonstationarity is unknown. Theoretical guarantees quantify cumulative revenue loss relative to a fully informed benchmark, considering both time horizon and market variation magnitude. Simulations on synthetic and real-world data demonstrate the approach's effectiveness in balancing learning and earning objectives.
nonparametric learningdynamic pricingnonstationaritygradient approximationsmeta-learning
On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective
We present a learning-theoretic framework to analyze Chain of Thought (CoT) reasoning, modeling it as an interaction between an answer map and an autoregressive chain rule. The reasoning risk of a hypothesis decomposes into oracle-trajectory risk (OTR), capturing CoT's benefit through domain adaptation, and trajectory-mismatch risk (TMR), quantifying error accumulation due to mismatched reasoning trajectories. We prove that TMR can be arbitrarily large without structural stability, but under stability, it is tightly bounded by an amplification factor identifying error-growth regimes. These results precisely characterize when CoT aids or hinders reasoning and what governs this transition.
chain of thoughtreasoning riskoracle-trajectory risktrajectory-mismatch riskerror accumulation
Theoretical guidelines for annealed Langevin dynamics in compositional simulation-based inference
The authors derive theoretical guidelines for tuning annealed Langevin dynamics in compositional simulation-based inference (SBI), addressing irreducible bias in existing score-based approaches. They propose Wasserstein bounds for annealed Langevin with approximate scores, translating these into explicit hyperparameter decision rules—step sizes, steps per level, and annealing levels—to ensure prescribed sampling accuracy. In Gaussian settings, closed-form expressions demonstrate that Linhart et al.'s bridging densities permit larger step sizes and fewer Langevin steps than Geffner et al.'s. Empirical results confirm the generalizability of these Gaussian-derived tunings to complex problems, offering a theoretically grounded framework for practitioners.
annealed langevin dynamicssimulation-based inferencewasserstein boundscompositional scorebridging densities
Graph Navier Stokes Networks
Graph Navier Stokes Networks (GNSN) introduce a novel graph neural network architecture inspired by the Navier-Stokes equations, addressing the oversmoothing problem in conventional diffusion-based message passing. GNSN incorporates convection into graph structures by defining a dynamic velocity field, enabling efficient and direct message propagation. The architecture adaptively balances convection and diffusion to handle datasets with varying homophily levels. Evaluations across twelve real-world datasets demonstrate that GNSN consistently outperforms state-of-the-art baselines in classification accuracy and effectively mitigates oversmoothing.
graph neural networksnavier-stokes equationsoversmoothingconvectionhomophily
Divide and Contrast: Learning Robust Temporal Features without Augmentation
We introduce Divide and Contrast (Di-COT), a self-supervised framework for learning robust temporal representations without data augmentation or multiple encoder passes. Di-COT stochastically partitions time-series windows into overlapping sub-blocks, contrasting informative substructures rather than individual timesteps to mitigate false positives during transitions. The contrastive objective scales efficiently with batch size and sub-block count, independent of sequence length. Experiments on six real-world datasets and UCR/UEA benchmarks show Di-COT achieves state-of-the-art performance in classification, clustering, kNN, and cross-dataset transfer while reducing training time. Code is publicly available.
self-supervised learningtime-series representationcontrastive learningtemporal dynamicssubstructure
Federated LoRA Fine-Tuning for LLMs via Collaborative Alignment
The paper introduces Collaborative Low-rank Alignment and Identifiable Recovery (CLAIR), a federated learning framework for parameter-efficient fine-tuning of large language models (LLMs) using low-rank adaptation (LoRA). CLAIR addresses heterogeneous and potentially contaminated client data by recovering a shared LoRA subspace and detecting contaminated clients via structured low-rank plus block-sparse decomposition. Theoretical guarantees include exact recovery of the shared subspace in noiseless cases, stable recovery under estimation error, and consistent collaborative-set recovery under mild conditions. Empirical evaluation on a Transformer-based text-copying task demonstrates CLAIR's effectiveness in contamination detection and improved performance for benign clients compared to local fine-tuning and non-robust federated averaging.
low-rank adaptationfederated learningcontamination detectionparameter-efficient fine-tuningstructured decomposition
Reinforcement Learning-based Control via Y-wise Affine Neural Networks: Comparative Case Studies for Chemical Processes
The study introduces Y-wise Affine Neural Network (YANN)-RL, a reinforcement learning-based control method tailored for chemical process systems, addressing challenges in training time and reliability. YANN-RL strategically initializes actor and critic networks to provide interpretable starting points within control schemes. The method is applied to three case studies from the PC-Gym library: a continuous stirred tank reactor, a four-tank system, and a multistage extraction column. Comparative evaluations against PPO, SAC, DDPG, TD3, and nonlinear model predictive control (NMPC) demonstrate that YANN-RL significantly reduces training time and data requirements while approaching NMPC performance without requiring a full nonlinear model.
reinforcement learningchemical processesyann-rlnonlinear model predictive controlactor-critic networks
Domain-Adaptable Reinforcement Learning for Code Generation with Dense Rewards
A reinforcement learning framework is proposed to adapt pre-trained large language models for domain-specific code generation, addressing correctness, quality, safety, and domain constraints. The method employs proximal policy optimization with a customizable execution-aware reward formula that optimizes syntax, functional correctness, code style, security, and simulator executability, facilitated by token-level reward mapping. Evaluated on MBPP/MBPP+ for general-purpose code generation and RoboEval for robotic program synthesis, the framework achieves a 19% absolute pass@1 improvement on MBPP and reduces execution failures by 51% on RoboEval, demonstrating effective alignment with domain-specific requirements.
proximal policy optimizationtoken-level reward mappingexecution-aware rewardcode generationdomain-specific constraints
ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning
ChunkFT introduces a memory-efficient full-parameter fine-tuning framework that dynamically activates a working set, enabling gradient computation for arbitrary sub-tensors without architectural modifications. The method avoids dense gradient computation and supports optimization of arbitrary sub-networks, with theoretical convergence analysis provided. Empirical evaluations on Llama 3-8B and Llama 3-70B demonstrate significant memory savings, requiring only 13.72GB for a 7B model with 1K input length on a single RTX 4090-24GB GPU. Downstream tasks in language understanding, mathematical reasoning, and MT-Bench show ChunkFT outperforming existing memory-efficient baselines and achieving comparable or superior performance to full-parameter fine-tuning.
fine-tuninggradient computationmemory-efficientsub-tensorsconvergence analysis
A Rigorous, Tractable Measure of Model Complexity
The authors propose a mathematically rigorous yet computationally tractable measure of model complexity based on input gradient similarities, applicable to both parametric and non-parametric models. The measure generalizes existing model-specific complexity metrics (polynomial degree, kernel length scale, k-nearest neighbors count, decision tree splits, random forest tree count) while remaining computationally feasible. Empirical results provide new insights into double descent phenomena across random Fourier features, random forests, neural networks, and gradient boosting.
model complexitygradient similaritydouble descentnon-parametric modelsparametric models
Q-SYNTH: Hybrid Quantum-Classical Adversarial Augmentation for Imbalanced Fraud Detection
Q-SYNTH introduces a hybrid quantum-classical adversarial framework for fraud detection in imbalanced datasets, combining a parameterized quantum circuit generator with a classical neural network discriminator. The method evaluates synthetic fraud samples using Kolmogorov-Smirnov statistics, Wasserstein distances, AUC-ROC, and downstream classifier performance. Results show Q-SYNTH reduces distributional mismatch versus classical GANs while maintaining competitive detection performance, offering a balanced trade-off between fidelity and utility.
quantum-classical hybridadversarial augmentationkolmogorov-smirnovwasserstein distancefraud detection
Learning First Integrals via Backward-Generated Data and Guided Reinforcement Learning
FISolver introduces a novel LLM-based approach for discovering first integrals in dynamical systems, addressing data scarcity through backward-generated datasets and reinforcement learning. The method combines supervised fine-tuning of a compact mathematical model with Levenshtein Distance-based shaped rewards, enhanced by data synthesis strategies for sparse problem families. Experiments demonstrate FISolver's superior performance over larger mathematical LLMs and Mathematica, achieving higher accuracy with lower computational costs.
first integralsbackward generationlevenshtein distancesupervised fine-tuningdynamical systems
SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning
SMoA introduces a Spectrum Modulation Adapter for parameter-efficient fine-tuning, addressing the trade-off between representational capacity and computational cost in Low-Rank Adaptation (LoRA). The method partitions layers into spectral blocks and applies Hadamard-modulated low-rank branches to diagonal blocks, enhancing coverage of pretrained spectral directions under a reduced parameter budget. Theoretical analysis and empirical evaluations demonstrate SMoA's superiority over LoRA and related baselines in lower-budget settings, achieving improved average performance across multiple tasks.
spectrum modulationparameter-efficient fine-tuninglow-rank adaptationhadamard-modulatedspectral blocks
CoarseSoundNet: Building a reliable model for ecological soundscape analysis
The study introduces CoarseSoundNet, a deep learning model for coarse soundscape classification distinguishing biophony, geophony, and anthropophony in passive acoustic monitoring (PAM) recordings. The method systematically evaluates model architectures, training data composition (including an explicit silence class), and evaluation strategies. Results show performance improvements with domain-similar PAM data, class-specific decision thresholds, and duration-based constraints, though challenges remain for anthropophony due to masking effects. An ecological case study demonstrates CoarseSoundNet's utility as a preprocessing tool, yielding acoustic index trends comparable to ground-truth filtering.
soundscape ecologypassive acoustic monitoringbiophonydeep learningacoustic indices
Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving
CoPhy introduces a Cognitive-Physical reinforcement learning framework for autonomous driving, addressing limitations of behavioral cloning. The method distills Visual Language Model (VLM) knowledge into a BEV encoder, enabling cognitive understanding without inference overhead, and constructs an auto-regressive BEV world model to predict future semantic maps for action foresight. Policy optimization employs GRPO with dual-reward mechanisms: physical rewards enforce safety via BEV rollouts, while cognitive rewards ensure intent alignment through language-based scoring. Experiments on NAVSIM v1 and v2 benchmarks demonstrate CoPhy achieves state-of-the-art performance, enhancing safety and enabling flexible intent control via user-defined language instructions.
cognitive-physicalbev encoderauto-regressivegrposemantic maps
Reasoning-Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning
The paper introduces reasoning-trace collapse, a phenomenon where fine-tuning explicit reasoning models on answer-only data degrades their ability to produce valid intermediate reasoning traces while maintaining final-answer accuracy. The authors propose a structural evaluation framework that separately measures answer correctness and reasoning-trace validity (including empty/missing/truncated cases), applied to four open-weight reasoning models. Results show standard fine-tuning rapidly suppresses valid reasoning (despite preserved answer quality) and that simple loss-masking strategies mitigate collapse without requiring trace-labeled data. The findings advocate for reporting reasoning reliability metrics alongside task performance in fine-tuning evaluations.
reasoning-trace collapseexplicit reasoningstructural evaluationloss-maskingfine-tuning
Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation
We introduce Advantage Collapse Rate (ACR), a diagnostic metric quantifying the proportion of training batches with ineffective gradients in Group Relative Policy Optimization (GRPO), a Reinforcement Learning from Verifiable Rewards algorithm. To mitigate advantage collapse—a failure mode where homogeneous rewards yield near-zero advantages—we propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight GRPO extension that injects virtual reward samples guided by real-time ACR monitoring. AVSPO reduces advantage collapse by 58-63% relative to GRPO and improves accuracy by 4-6 percentage points across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, while maintaining generalization on out-of-domain tasks.
advantage collapse rategroup relative policy optimizationadaptive virtual sample policy optimizationreinforcement learning from verifiable rewardsmathematical reasoning benchmarks
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
We propose Linear-DPO, a generalized Direct Preference Optimization (DPO) objective for text-to-image generation that addresses limitations in applying discrete NLP-based DPO to regression-based generative tasks. Our method replaces the sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model, unified under a reverse-time SDE framework that supports both diffusion and flow-matching models. Experiments on Stable Diffusion 1.5, Stable Diffusion XL, and Stable Diffusion 3-Medium demonstrate qualitative and quantitative improvements over existing baselines.
direct preference optimizationdiffusion modelsflow-matchingreverse-time sdetext-to-image generation
Automated Byzantine-Resilient Clustered Decentralized Federated Learning for Battery Intelligence in Connected EVs
The paper proposes ABC-DFL, an automated Byzantine-resilient clustered decentralized federated learning framework for electric vehicle battery intelligence. The system replaces centralized aggregation with a blockchain-based architecture featuring dynamic QBFT consensus and FLECA, a hierarchical aggregation protocol that filters malicious updates via adaptive thresholds and robust clustering. Experiments show FLECA matches FedProx convergence in benign settings (attack impact scores <0.10 under adversarial conditions) while maintaining fairness through multitask incentive mechanisms, with on-chain benchmarks validating practical deployment.
byzantine-resilientdecentralized federated learningquorum byzantine fault tolerancehierarchical aggregationelectric vehicle battery intelligence
A Unified Framework for Uncertainty-Aware Explainable Artificial Intelligence: A Case Study in Power Quality Disturbance Classification
The paper introduces a unified framework for uncertainty-aware explainable AI (XAI) by formalizing the explanation distribution as the push-forward measure of Bayesian neural network (BNN) posteriors through Lipschitz-continuous attribution operators. It proposes the uncertainty-aware relevance attribution operator (UA-RAO), which summarizes explanation distributions using mean, variance, coefficient of variation, quantiles, and set-theoretic measures. Theoretical support is provided via Monte Carlo accessibility and Wasserstein approximation bounds. Evaluated on a 15-class power quality disturbance classification benchmark, deep ensembles with mean UA-RAO improved localization over deterministic baselines, while other UA-RAO summaries revealed uncertainty patterns absent in point-estimate attributions. The framework is domain-agnostic and applicable to any BNN with Lipschitz-continuous attribution operators.
bayesian neural networksexplanation distributionuncertainty-aware xailipschitz-continuous operatorswasserstein approximation
Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction
We propose a projection-based algorithm for Constrained Online Convex Optimization (COCO) with adversarially chosen constraints, achieving improved guarantees for both strongly convex and convex loss functions. Leveraging a geometric result on self-contracted curves, our method simultaneously minimizes static regret and cumulative constraint violation (CCV). For strongly convex losses, the algorithm achieves $O(\log T)$ regret and $O(\log T)$ CCV, exponentially improving upon the state-of-the-art $O(\sqrt{T \log T})$ CCV. For convex losses, it maintains optimal $O(\sqrt{T})$ regret while reducing CCV to $O(\sqrt{T})$ from $O(\sqrt{T} \log T)$.
constrained online convex optimizationself-contracted curvesstatic regretcumulative constraint violationprojection-based algorithm
HORST: Composing Optimizer Geometries for Sparse Transformer Training
The paper introduces HORST (Hyperbolic Operator for Robust Sparse Training), a novel optimizer for sparse transformer training that combines stability from adaptive methods with an L1 sparsity bias. The method composes optimizer steps as non-commutative operators, using a hyperbolic mirror map to induce sparsity while maintaining training stability. Experiments on vision and language tasks show HORST consistently outperforms AdamW across sparsity levels, with particularly large gains at high sparsity.
sparse trainingtransformer optimizationhyperbolic mirror mapnon-commutative operatorsl1 regularization
A Typed Tensor Language for Federated Learning
The authors introduce a typed tensor language that formalizes federated learning computations through a shared-state factorization theory. The language distinguishes federated tensors, partitioned across clients, from shared tensors, available globally, and defines semantics via a virtual global tensor. They prove that typed one-round programs factor through fixed-dimensional shared state, independent of client and record counts, and establish a converse representability result for iterative programs. A differentiable fragment is developed for learning, enabling global gradient computation via record-axis summation of federated gradient tensors. The framework characterizes federated learning computations with fixed-dimensional shared-state communication.
federated learningtyped tensor languageshared-state factorizationfederated tensorsglobal gradient
UOTIP: Unbalanced Optimal Transport Map for Unpaired Inverse Problems
The authors propose UOTIP, an unbalanced optimal transport-based solver for unpaired image inverse problems where only independent sets of noisy measurements and clean targets are available. The method learns a transport map between distributions using a likelihood-based cost function, with theoretical guarantees of existence and uniqueness via a quadratic cost term satisfying the twist condition. Experiments show state-of-the-art performance on linear and nonlinear inverse problem benchmarks, with advantages in noise robustness, class imbalance adaptation, and generalization to diverse noise types.
unbalanced optimal transportinverse problemstransport maptwist conditionunpaired learning
Reviving Error Correction in Modern Deep Time-Series Forecasting
The authors propose the Universal Error Corrector with Seasonal-Trend Decomposition (UEC-STD), a novel architecture-agnostic error correction model for deep time-series forecasting. UEC-STD addresses error accumulation in autoregressive inference by explicitly decomposing predictions into trend and seasonal components and training a corrector to adjust each separately. This approach integrates with existing forecasters without retraining. Evaluated across 4 backbones and 10 datasets, UEC-STD significantly improves correction accuracy and robustness. The work provides practical insights into mitigating autoregressive errors in deep time-series models.
autoregressive inferenceerror correctiontime-series forecastingseasonal-trend decompositionarchitecture-agnostic
AIMBio-Mat: An AI-Native FAIR Platform for Closed-Loop Materials Discovery and Biomedical Translation
The paper introduces AIMBio-Mat, an AI-native FAIR platform framework for closed-loop discovery of biomedical materials. The system integrates materials provenance, biomedical context, knowledge graphs, uncertainty-aware ML, and human-in-the-loop active learning to formulate discovery as constrained multi-objective optimization under uncertainty. A prototype demonstrates AI-guided nanomaterials for drug delivery, emphasizing metadata standards, model documentation, and risk-tiered governance while excluding unvalidated clinical use. The blueprint converts fragmented records into auditable, experimentally actionable workflows.
materials discoverymulti-objective optimizationknowledge graphsuncertainty-aware mlfair platform
Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model
The Musical Attention Transformer introduces a music-specific attention mechanism to improve Transformer-based music generation by incorporating meta-information such as bar numbers, key signatures, and tempos. Each musical note is represented as a combination of five events (pitch, bar number, onset, duration, velocity) and three metadata elements, enabling the attention mechanism to capture correlations among these eight features. This approach enhances the model's ability to generate harmonically consistent and diverse melodies while reducing excessive repetition. Experimental results show that Musical Attention outperforms Full Attention and Strided Attention in terms of musical coherence, variation, and overall quality, advancing AI-driven music generation towards more natural and expressive compositions.
musical attentionmeta-informationtransformerharmonically consistentmelody generation
SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining
SpectralEarth-FM introduces a hierarchical transformer for multimodal Earth observation pretraining, addressing the underrepresentation of hyperspectral imagery (HSI) in foundation models. The architecture combines spectral tokenization for HSI, sensor-specific encoders, cross-sensor fusion, and a shared hierarchical encoder to jointly process HSI and lower-channel observations. Pretrained on SpectralEarth-MM, a dataset of 2M globally distributed locations and 40TB of co-located HSI, multispectral imagery, and SAR data, the model employs a Joint-Embedding Predictive Architecture objective. Evaluations on hyperspectral downstream tasks and standard EO benchmarks demonstrate state-of-the-art performance.
hyperspectral imageryfoundation modelsspectral tokenizationjoint-embedding predictive architecturecross-sensor fusion
Towards Understanding Self-Pretraining for Sequence Classification
The study investigates self-pretraining (SPT) for Transformers in sequence classification, replicating and ablating Amos et al. (2024)'s findings. Through systematic experiments, the authors identify that standard supervised training struggles to learn useful query-key Attention patterns from random initialization, particularly proximity interactions in absolute positional encodings. Theoretical analysis reveals masked reconstruction in SPT detects Attention-score directions overlooked by label supervision. Results suggest SPT's improvement stems from better optimization of proximity-biased Attention scores rather than depth or generalization alone.
self-pretrainingtransformersattention patternspositional encodingsmasked reconstruction
Robust Personalized Recommendation under Hidden Confounding in MNAR
We propose Personalized Unobserved-Confounding-aware Interaction Deconfounder (PUID), a novel framework for robust personalized recommendation under hidden confounding in MNAR settings. PUID estimates user-item level sensitivity bounds, relaxing the homogeneity assumption of global sensitivity bounds, and employs adversarial optimization for robustness. A benchmark-guided variant (BPUID) incorporates pre-trained models as stabilizing references. Experiments on three real-world datasets demonstrate PUID significantly outperforms global methods under hidden confounding, achieving robustness without requiring RCT data.
hidden confoundingmnarsensitivity boundsadversarial optimizationpre-trained models
Multimodal LLMs under Pairwise Modalities
The paper introduces a framework for training multimodal large language models (MLLMs) using only pairwise modalities, eliminating the need for costly joint multimodal datasets. The method involves two stages: latent representation alignment via self-modal reconstruction and contrastive learning, followed by cross-modal recomposition to integrate new modalities with pre-trained ones. Theoretical analysis establishes identifiability conditions for pairwise modalities. Evaluations on 3D point clouds and tactile modalities demonstrate strong cross-modal performance, validating the approach's scalability and efficiency in multimodal representation learning.
multimodal llmspairwise modalitieslatent representation alignmentcontrastive learningcross-modal recomposition
A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation
This paper proposes a unified formulation bridging causal representation learning (CRL) and traditional representation learning, addressing their divergent trajectories. The framework decomposes representation learning into task components (information preservation) and constraint components (latent structure), enabling mutual benefits: CRL provides theoretical insights into structured constraints, while traditional methods offer practical guidance on task design. Experimental analysis on CausalVerse demonstrates that causal constraints' effectiveness varies significantly based on paired tasks, highlighting the interplay between task and constraint components in representation learning.
causal representation learninglatent spacetask componentconstraint componentcausalverse
Genetic Programming with Transformer-Based Mutation for Approximate Circuit Design
The authors propose a transformer-based mutation operator for Cartesian genetic programming (CGP) to enhance the automated design of approximate arithmetic circuits. Their hybrid scheme alternates between the transformer-based and standard mutation operators to prevent stagnation in circuit approximation. A novel training scheme was developed for the transformer using thousands of CGP chromosomes representing approximate multipliers. Results demonstrate that circuits evolved with this approach achieve superior trade-offs compared to state-of-the-art designs in the EvoApproxLib library, despite the computational demands of training and evolution.
cartesian genetic programmingapproximate circuitstransformer-based mutationevoapproxlibarithmetic circuits
Conditioning Gaussian Processes on Almost Anything
The paper introduces a novel equivalence between Gaussian processes (GPs) and linear diffusion models, enabling GP conditioning on arbitrary likelihood-evaluable statements, including non-linear physics and natural language via large language models. By recasting predictive sampling as an ODE with closed-form Gaussian dynamics and a Monte Carlo-approximated guidance term, the method handles non-conjugate cases while maintaining exact recovery in linear-Gaussian settings. Whitening reduces non-Gaussian dynamics, minimizing Wasserstein-2 transport cost and numerical stiffness. This yields a general-purpose GP inference scheme without bespoke derivations, expanding probabilistic modeling capabilities.
gaussian processesdiffusion modelsmonte carlo approximationwasserstein-2non-conjugate inference
Efficient Banzhaf-Based Data Valuation for $k$-Nearest Neighbors Classification
We introduce efficient algorithms for computing Banzhaf values in $k$-nearest neighbor ($k$NN) classifiers, addressing the exponential complexity of game-theoretic data valuation. By leveraging the locality properties of $k$NN, we develop a dynamic programming framework yielding a pseudo-polynomial algorithm with $O(Wkn^2)$ complexity for weighted $k$NN and an $O(nk^2)$ algorithm for unweighted $k$NN, where $W$ is the maximum sum of top-$k$ weights. We also provide Monte Carlo estimation methods. Experiments on real-world datasets confirm the practical efficiency and effectiveness of our approach in data valuation tasks.
banzhaf valuek-nearest neighborsdata valuationdynamic programmingmonte carlo estimation
Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise
The paper establishes maximal concentration bounds for stochastic approximation algorithms under heavy-tailed Markovian noise, considering both bounded and unbounded Martingale-difference components. Using a novel Lyapunov function based on the moment-generating function of a Poisson equation solution and an auxiliary projected algorithm, the authors derive tail behavior classifications (sub-Gaussian, sub-Weibull, or Pareto-like) depending on step sizes and operator properties. For unbounded noise with contractive average operators, they show error tails are at most three times heavier than noise tails under non-expansive operators, but may be substantially heavier otherwise. The analysis includes worst-case examples demonstrating tightness.
stochastic approximationmarkovian noiselyapunov functionpoisson equationconcentration bounds
Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting
The paper introduces Pontryagin-Guided Direct Policy Optimization (PG-DPO), a reinforcement learning framework for non-exponential discounting that bypasses Bellman recursion limitations. PG-DPO combines the Pontryagin Maximum Principle with Monte Carlo rollouts via an Adjoint-MC projection to enforce Hamiltonian maximization. Experiments on hyperbolic and survival-discount benchmarks demonstrate improved accuracy and stability compared to equation-driven solvers and critic-based methods, addressing structural breakdowns in standard dynamic programming under non-multiplicative or time-inhomogeneous discounting.
non-exponential discountingpontryagin maximum principleadjoint-mc projectionhamiltonian maximizationdirect policy optimization
Modeling Temporal scRNA-seq Data with Latent Gaussian Process and Optimal Transport
A generative framework combining latent heteroscedastic Gaussian processes and optimal transport is proposed for modeling temporal single-cell RNA sequencing data. The method employs Hilbert space approximations for Gaussian processes and incorporates cell-specific latent time and cell type conditioning to capture biological heterogeneity. An optimal transport objective aligns generated and observed population distributions, addressing the absence of genuine cell trajectories. The approach achieves state-of-the-art performance on complex interpolation and extrapolation benchmarks and introduces a gradient-based strategy for inferring perturbation trajectories.
single-cell rna sequencinggaussian processoptimal transportlatent timeheteroscedastic
Point Cloud Sequence Encoding for Material-conditioned Graph Network Simulators
The paper introduces Point Cloud Encoding for Accurate Context Handling (PEACH), a framework enabling Graph Network Simulators (GNSs) to adapt to unseen material properties via in-context learning on point clouds. PEACH employs a spatio-temporal point cloud sequence encoder and auxiliary supervision to enhance simulation fidelity. Results demonstrate zero-shot sim-to-real transfer capabilities, outperforming mesh-based baselines in prediction accuracy while remaining practical for real-world deployment.
graph network simulatorspoint cloud encodingin-context learningzero-shot transferspatio-temporal encoder
Choose Wisely and Privately: Proactive Client Selection for Fair and Efficient Federated Learning
The paper introduces a proactive client selection framework for federated learning that optimizes federation composition based on data utility and fairness prior to training. The method employs mutual information derived from differentially private contingency tables to assess cross-feature correlations and formulates a Potential Federation Loss (PFL) balancing utility maximization and fairness preservation. Client selection is framed as an optimal subset search problem solved via simulated annealing under strong differential privacy guarantees. Experiments on four benchmarks demonstrate that models trained on optimally selected federations achieve faster convergence, improved fairness, and higher accuracy compared to uniform sampling and state-of-the-art adaptive aggregation methods.
federated learningclient selectiondifferential privacymutual informationsimulated annealing
A Deployment Audit of Release-Side Risk in Conformal Triage under Prevalence Shift
We introduce a leakage-aware deployment audit for release-side conformal triage to evaluate safety under prevalence shift, addressing the critical gap in assessing whether event-positive cases are released without review. The method assigns subjects to three non-overlapping roles: prevalence correction, conformal calibration, and held-out release-safety evaluation, enabling direct assessment of release safety, calibration label sufficiency, and safety-review trade-offs. Applied to a retrospective NSCLC pilot, the audit reveals that lower review rates can be misleading, as pooled conformal triage releases more event-positive patients, while classwise triage diagnoses insufficient event labels for safe low-review release.
conformal triageprevalence shiftrelease-safety evaluationleakage-aware auditclasswise branch
Training distribution determines the ceiling of drug-blind cancer sensitivity prediction
The study identifies training distribution, not drug representation complexity, as the limiting factor in drug-blind cancer sensitivity prediction. Analyzing four datasets, it demonstrates that the standard global Pearson r metric is dominated by between-drug potency differences, while per-drug Pearson r reveals no improvement from drug encodings over cell-only features. Controlled experiments show that using mechanism-of-action (MoA) as a training-distribution constraint, rather than a drug feature, significantly enhances prediction accuracy for targeted kinase inhibitors by preserving pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations are proposed as deployable strategies to recover predictive gains.
drug-blind predictionpearson rmechanism-of-actiontraining-distributionsensitivity signals
Learning fMRI activations dictionaries across individual geometries via optimal transport
The authors propose a novel dictionary learning method for fMRI data that preserves subject-specific geometric variability using optimal transport. Their approach employs Fused Gromov-Wasserstein (FGW) distance to compare graphs with differing geometries, with computational efficiency achieved through amortized optimization of neural network-predicted transport plans. Experiments on the HCP dataset demonstrate the method's ability to capture geometric variability while maintaining informative representations, with dictionary atoms adaptively weighted by an FGW trade-off parameter.
dictionary learningoptimal transportfused gromov-wassersteinamortized optimizationfmri
NeighborDiv: Training-free Zero-shot Generalist Graph Anomaly Detection via Neighbor Diversity
NeighborDiv introduces a training-free zero-shot framework for generalist graph anomaly detection (GGAD) by shifting from node-to-neighbor consistency to neighbor-to-neighbor diversity. The method quantifies anomaly signals via inter-neighbor feature similarity variance, capturing local structural dispersion without complex training pipelines. Evaluated under Single-Domain Independent Training (SDIT) and Unified Multi-Domain Training (UMDT) paradigms, NeighborDiv achieves state-of-the-art performance with 10.25% AUC and 17.78% AP gains over baselines, while exhibiting zero performance volatility across datasets.
graph anomaly detectionzero-shot learningneighbor diversityfeature similarity variancecross-domain generalization
CIG: Exploration via Conditional Information Gain
The paper introduces Conditional Information Gain (CIG), a novel intrinsic reward for exploration in reinforcement learning that conditions on both replay buffer and rollout prefix contexts. CIG is derived as a tractable surrogate for trajectory-level information gain, formulated as a log-determinant objective over an ensemble disagreement kernel, with Cholesky factorization enabling causal per-step rewards. It scales to high-dimensional state spaces and is instantiated in a model-based setting. Evaluated across twelve tasks in MiniGrid and OGBench, including stochastic-distractor settings, CIG outperforms or matches prior exploration methods while demonstrating robustness to stochastic distractors.
conditional information gainintrinsic rewardexplorationlog-determinantcholesky factorization
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
LOSCAR-SGD introduces a Local SGD method combining communication compression, local training, and communication-computation overlap for distributed learning. It addresses heterogeneous compute settings by allowing workers to take varying local steps, communicating only sparse model coordinates while optimizing during communication. A delay-corrected merge rule integrates delayed synchronized information without discarding overlap-phase progress. Theoretical convergence guarantees are provided for smooth non-convex objectives, analyzing the impact of sparsity, overlap, and worker heterogeneity. Empirical results demonstrate reduced training time through communication-computation overlap and superior performance of the delay-corrected merge over naive overwriting.
local sgdcommunication-computation overlapdelay-corrected mergesparse model averagingheterogeneous compute
PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR
PlexRL introduces a cluster-level runtime for multiplexing unified large language model (LLM) services across reinforcement learning with verifiable rewards (RLVR) jobs, addressing structural inefficiencies in RLVR training. By centrally managing model placement, state transitions, and function-level scheduling under strict affinity constraints, PlexRL time-slices LLM execution across jobs to fill idle periods without costly model migration. Evaluations show that PlexRL improves effective cluster capacity and reduces user GPU hour cost by up to 37.58%, while maintaining algorithmic flexibility and introducing minimal per-job overhead.
reinforcement learninglarge language modelscluster-level runtimemodel placementfunction-level scheduling
Finite-Time Regret Analysis of Retry-Aware Bandits
The paper establishes the first sublinear regret bound for ReMax, a stochastic bandit algorithm optimizing retry-aware objectives like pass@$k$ and max@$k$. For Gaussian rewards and $M=2$ virtual draws, ReMax selects a sampling distribution maximizing the posterior expected maximum reward, characterized by an expected-improvement balance condition. The analysis reveals a ReMax-specific underestimation effect, where the optimal arm may be sampled too rarely after unfavorable estimates, explaining its increased exploitability compared to Thompson sampling. Empirical results show ReMax outperforms KL-UCB and Thompson sampling under mild underestimation, with posterior-variance scaling mitigating severe underestimation.
stochastic banditretry-awareregret boundthompson samplingposterior variance
Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models
This work introduces activation-free polynomial alternatives for core vision modules (MLPs, convolutions, attention) within MetaFormer-style architectures, replacing standard nonlinearities with Hadamard products. The proposed PolyNeXt models integrate these polynomial variants seamlessly into existing frameworks while maintaining architectural modularity. Experiments demonstrate that PolyNeXt matches or exceeds activation-based counterparts on ImageNet classification, ADE20K semantic segmentation, and out-of-distribution robustness, while outperforming prior polynomial networks at reduced computational cost.
polynomial networksmetaformerhadamard productactivation-freeout-of-distribution robustness
Markovian Circuit Tracing for Transformer State Dynamic
The paper introduces Markovian Circuit Tracing (MCT), a diagnostic pipeline for analyzing transformer state dynamics using synthetic Hidden Markov Model (HMM) tasks. MCT evaluates whether transformer activations encode coarse state-transition structures by leveraging known latent states, transition matrices, and Bayesian belief vectors. Results show that tiny causal transformers achieve near-Bayes next-token prediction with a mean excess loss of 0.0138. Residual activations partially encode Bayesian belief information, and state abstractions recover coarse transition signals, particularly in persistent and lower-state regimes. State forcing via centroid patching reduces KL divergence to exact HMM counterfactual targets from 0.1957 to 0.0532, outperforming control baselines.
markovian circuit tracinghidden markov modelbayesian beliefstate-transition structuretransformer activations
OlmoEarth v1.1: A more efficient family of OlmoEarth models
The authors introduce efficiency improvements to the OlmoEarth model family, achieving significant reductions in computational costs while maintaining model performance. Methodological enhancements yield a 1.7× reduction in GPU hours during training and a 2.9× reduction in multiply-accumulate operations (MACs) for Sentinel-2 tasks. These optimizations are implemented without compromising the models' overall effectiveness, as demonstrated by maintained performance metrics. The training code is made publicly available on GitHub to facilitate reproducibility and further research.
olmoearthgpu hoursmacssentinel-2computational efficiency
Instant GPU Efficiency Visibility at Fleet Scale
The authors introduce Overall FLOP Utilization (OFU), a hardware-level GPU efficiency metric for AI workloads that requires no application instrumentation and works across GPU generations and numeric precisions. OFU leverages two on-chip performance counters—Tensor Pipe Activity and SM clock frequency—and is characterized through controlled GEMM experiments on H100 and GB200 GPUs across FP16, TF32, FP8, and NVFP4 precisions. After tile-quantization correction, OFU predicts application-level MFU within ≤2 percentage points and achieves r = 0.78 correlation with MFU across 608 production training jobs. Deployed at scale, OFU detected a 2.5x efficiency regression and precision-dependent utilization changes, demonstrating its utility for fleet-wide efficiency monitoring.
overall flop utilizationtensor pipe activitysm clock frequencygemm experimentstile-quantization correction
Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
This work extends Narang et al. (2021) by evaluating 20 post-2021 Transformer modifications at 1.2B and 3B parameters under controlled iso-data, iso-compute, and iso-recipe conditions, using CLIMB-12 downstream evaluation and multi-seed noise floor analysis. The study finds that only two modifications surpass Bonferroni-corrected significance at 1.2B, with one failing stability at 3B, replicating the original conclusion that most architectural changes do not transfer. Additionally, it reveals an enlarged loss-downstream performance gap for attention-output modifications, where models nearing baseline validation loss still show 6-16 point drops on CLIMB-12.
transformer modificationsdownstream evaluationnoise floorbonferroni correctioniso-compute
Beyond Numerical Features: CNN-Driven Algorithm Selection via Contour Plots for Continuous Black-Box Optimization
The paper introduces a CNN-driven approach for per-instance algorithm selection in black-box optimization, leveraging contour-map visualizations of probed landscapes instead of traditional numerical descriptors. A CNN regressor processes multiple instance-specific contour views, either stacked or encoded per view and aggregated, to predict solver performance and select the optimal solver. Evaluated on the BBOB 2009 single-objective protocol, the method outperforms the single best solver and competes with feature-based baselines. Bi-objective evaluations under the DeepELA setting further validate the approach. Results demonstrate that vision models can effectively exploit spatial structure in landscapes for algorithm selection without handcrafted features.
cnnblack-box optimizationcontour-mapalgorithm selectiondeepela
Causal Machine Learning Is Not a Panacea: A Roadmap for Observational Causal Inference in Health
The article presents a methodological roadmap for applying causal machine learning (ML) to observational clinical data, emphasizing the need for rigorous validation of causal assumptions and responsible application across disciplines. It highlights the limitations of causal ML, which remain underappreciated despite its growing use in health research. The authors stress that unverified causal assumptions and unjustified modeling choices can lead to biased or misleading results, impacting clinical research and patient care. While causal ML is a powerful tool for generating causal hypotheses, the article provides a template to enhance the rigor and interpretability of causal analyses in observational settings.
causal machine learningobservational dataclinical researchcausal assumptionsmodeling choices
Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment
We introduce REPA-P, a teacher-free framework for physics-informed diffusion models that mitigates shortcut learning by aligning intermediate representations with physical states. REPA-P attaches lightweight $1{\times}1$ projection heads to selected layers, decodes hidden activations into physical quantities, and applies PDE residual losses during training, without inference overhead. Evaluated on four PDE tasks (Darcy flow, topology optimization, electrostatic potential, turbulent channel flow) with U-Net and Diffusion Transformer backbones, REPA-P accelerates convergence by up to $2{\times}$, reduces physics residuals by up to $66.4\%$, and improves out-of-distribution robustness by up to $49.3\%$. Ablations demonstrate that supervising a small set of intermediate layers captures most benefits while complementing output-level physics losses.
physics-informed diffusionshortcut learningpde residualsrepresentation alignmentintermediate supervision
Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations
The paper introduces Cumulative Active Meta-Learning (CAML), a framework addressing spurious correlations in datasets by meta-learning inductive biases from active learning queries. CAML treats each active-learning round as a meta-learning task, using the current labeled set as meta-train data and newly queried batches as meta-test data. Unlike standard meta-learning, CAML maintains a cumulative inductive bias that progressively refines through sequential active-learning rounds, capturing dependencies between earlier and later queries. Empirical evaluations show CAML improves minority-group accuracy by up to 29.9% on benchmarks like Waterbirds, Dominoes, SpuCo, and CivilComments.
meta-learningactive learningspurious correlationsinductive biasminority-group accuracy
The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study
The study formalizes confounding and selection bias arising from user drift in LLM-simulated experiments, where interventions induce unintended shifts in latent user attributes, distorting effect estimates. It proposes using negative control outcomes—attributes invariant under intervention—to diagnose confounding by identifying distribution shifts across intervention conditions. Additionally, it examines adjusting persona specifications by eliciting setting-relevant confounders to mitigate drift. Results demonstrate that targeted confounders substantially reduce bias in both survey-style and multi-turn agent evaluations, addressing the limitations of LLMs trained on observational data for simulating human behavior.
user driftnegative control outcomesconfounding biaspersona specificationlatent attributes
ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization
ShapeBench introduces a standardized benchmark for aerodynamic shape optimization (ASO), addressing the lack of unified evaluation frameworks. It includes 103 tasks across eight shape categories, multiple optimization regimes, and a unified API for fair comparison. Each task features validated surrogates for fast search and, where feasible, high-fidelity Computational Fluid Dynamics (CFD) pipelines for verification. ShapeBench incorporates classical methods and a domain-specialized evolutionary LLM baseline, ShapeEvolve. Results reveal significant optimizer performance variance across shape categories and formulations (mean pairwise Spearman ρ = 0.013), indicating limited generalizability of single-task conclusions and underscoring the need for general-purpose approaches.
aerodynamic shape optimizationcomputational fluid dynamicsevolutionary llmsurrogate modelbenchmarking
Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition
The paper introduces two open-source iris recognition algorithms, TripletIris and ArcIris, trained with triplet loss and ArcFace loss respectively, alongside IREX-compliant C++ implementations of existing methods HDBIF and CRYPTS. These methods aim to lower the barrier for IREX participation by providing model submissions and conducting the first open-source assessment under IREX protocols. Evaluations were performed on official IREX X and multiple academic benchmarks, including Quality-Face/Iris Research Ensemble and CASIA-Iris-Thousand-V4. Additionally, open-source models for iris segmentation and circle estimation are provided to facilitate integration into new iris recognition methods.
iris recognitiontriplet lossarcface lossirex complianceiris segmentation
Everywhere Valid Bounds on False Discovery Proportions in Conformal Inference
The paper introduces finite-sample, distribution-free upper bounds on the false discovery proportion (FDP) in conformal inference, valid simultaneously over all rejection thresholds. By constructing a high-probability envelope for the empirical distribution function of null conformal p-values through joint distribution sampling, the method enables arbitrary post hoc threshold selection. The framework allows practitioners to modulate the envelope's shape for tighter bounds in regions of interest. Empirical results on synthetic and real-world datasets demonstrate that the proposed bounds are both valid and less conservative than existing approaches.
false discovery proportionconformal inferenceempirical distribution functionjoint distribution samplingpost hoc threshold selection
Memory-Efficient Partitioned DNN Inference on Resource-Constrained Android Crowds
The paper presents CROWDio's DNN pipeline scheduling subsystem for memory-efficient partitioned inference of transformer models on resource-constrained Android devices. The system employs five mechanisms: JIT deferred partition loading, single-partition-resident constraint, 4-tier affinity scheduler, zlib-compressed tensor transport, and streaming 1:1 dependency model, enabling ONNX inference without model modification. Evaluated on DistilBERT (67M parameters) across five Android handsets, the system reduces peak per-device RSS to 43±2 MB, limits battery consumption to 50±3 mAh per run, and achieves 34% lower batch latency compared to barrier synchronization.
partitioned inferencememory-constrainedonnxaffinity schedulertensor transport
Robust Recommendation from Noisy Implicit Feedback: A GMM-Weighted Bayes-label Transition Matrix Framework
The paper proposes RGBT, a robust recommendation framework that addresses label noise in implicit feedback by combining a Bayes-label transition matrix (BLTM) with Gaussian Mixture Model (GMM)-weighted calibration. RGBT assigns instance-specific reliability scores via GMM to reduce BLTM estimation bias while maintaining full data utilization. Theoretical analysis shows consistent estimation with reduced variance. Experiments on real-world and synthetic datasets demonstrate RGBT's superior noise handling and transition matrix calibration compared to reliable sample-based and transition matrix-based denoising methods.
recommender systemsimplicit feedbacklabel noisegaussian mixture modelbayes-label transition matrix
Decision-Path Patterns as Tree Reliability Signals: Path-based Adaptive Weighting for Random Forest Classification
The paper introduces a path-based adaptive weighting method for random forest classification that leverages topological patterns in root-to-leaf decision paths as reliability signals. The proposed class-conditional ratio weighting addresses structural confounding with predicted classes, ensuring zero expected class bias. Evaluated on 30 binary classification benchmarks with 30 repeats, the method significantly improves accuracy over standard random forests (Wilcoxon p = 0.018) and avoids majority-recall regressions, with minority-recall regressions limited to 3/30 datasets. The improvement is robust across forest sizes ranging from 100 to 1000 trees.
random forestdecision pathclass-conditional weightingensemble reliabilitybinary classification
Distributed Direct Preference Optimization
The paper provides the first convergence analysis of Direct Preference Optimization (DPO) in distributed settings, addressing federated and decentralized training scenarios with non-IID preference data. By modeling personalized offline RL with user-specific preference distributions, the authors characterize the global optimization landscape and derive convergence rates for federated DPO (considering client drift, communication frequency, and preference heterogeneity) and decentralized DPO (analyzing spectral connectivity's impact on consensus). Empirical validation on standard alignment benchmarks confirms the theoretical guarantees and demonstrates robust, scalable performance.
direct preference optimizationfederated learningdecentralized optimizationnon-iid dataconvergence analysis
Motion-Robust Deep Reconstruction for Free-Breathing Cardiac Cine MRI
The authors propose Cine-DL, a deep learning framework for motion-robust reconstruction of free-breathing cardiac cine MRI. The method combines k-space preprocessing (retrospective cardiac binning, respiratory gating) with Streak Optimized Coil Compression (SOC) to suppress artifacts, followed by a ResNet-based unrolled network alternating between proximal operators and physics-based data consistency updates. Evaluated on free-breathing volunteer data against k-t SENSE and iGRASP baselines, Cine-DL demonstrates improved quantitative metrics and visual fidelity, supporting clinical adoption. The framework also employs memory-efficient training for practical hospital deployment.
cine mrirespiratory gatingcoil compressionunrolled networkdata consistency
Scale-Calibrated Median-of-Means for Robust Distributed Principal Component Analysis
The paper introduces a scale-calibrated median-of-means estimator for robust distributed principal component analysis (PCA), addressing heterogeneous aggregation of mean vectors and principal subspaces. The method leverages the product geometry of Euclidean space and the Grassmann manifold, with node-level PCA expansion revealing linear influence of the mean component and eigengap-weighted covariance perturbation of the subspace component. Theoretical analysis demonstrates asymptotic equivalence to a scaled spatial median of node influence errors, yielding fixed-node non-Gaussian limits, growing-node Gaussian limits with finite-block bias, and explicit scale-dependent covariance. Robust block-scale and inference-optimal calibration rules are proposed, validated through simulations and large-scale single-cell RNA-seq data.
median-of-meansgrassmann manifoldeigengapcovariance perturbationnode-level pca
Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach
The paper introduces CoMET, a modular approach for multimodal classification that composes frozen pre-trained modality encoders with tabular foundation models (TFMs) without fine-tuning. The method processes each modality through frozen backbones, applies PCA for dimensionality reduction, and concatenates embeddings as input to TFMs. For misaligned CLS tokens, PALPooling—a lightweight adaptive token pooler—improves representation quality. CoMET achieves state-of-the-art performance on diverse benchmarks, handling datasets with 500,000 samples and 2,000 classes efficiently. Results demonstrate that foundation model composition outperforms complex end-to-end training pipelines.
multimodal classificationtabular foundation modelspca compressionadaptive poolingfrozen backbones
LT2: Linear-Time Looped Transformers
LT2 introduces linear-time looped transformers, replacing quadratic softmax attention with subquadratic variants to enhance scalability. Two primary variants are proposed: LT2-linear with linear attention and LT2-sparse with sparse attention, leveraging looping for iterative memory refinement and expanded receptive fields. LT2-hybrid combines these variants, with LT2-hybrid (GDN+DSA) matching standard looped transformer quality at linear-time cost and LT2-hybrid (Full+GDN) surpassing it in performance and efficiency. A pre-trained LT converted to LT2-hybrid, Ouro-hybrid-1.4B, outperforms industry-level 1B models and competes with 4B models after 1B tokens of training, demonstrating scalability and efficiency for small language models.
looped transformerslinear attentionsparse attentioniterative memory refinementreceptive field
RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers
We introduce RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework addressing the quadratic complexity bottleneck in Diffusion Transformers (DiTs) for long-sequence video generation. RoPeSLR decouples the DiT attention manifold into a high-frequency semantic spike set (O(L^3/2) sparsity) and an extreme low-rank (O(d_h log L)) background continuum, leveraging a head-wise low-rank parameterization with learnable 3D Absolute Positional Embedding injection to preserve relative distance awareness. Evaluations demonstrate RoPeSLR achieves 10× fewer FLOPs at 90% sparsity on Wan2.1-1.3B and 2.26× end-to-end inference speedup on HunyuanVideo-13B’s 100K+ token sequences, maintaining near-lossless generation fidelity (<1.3% average VBench degradation).
diffusion transformerssparse-lowrank attentionrotary position embeddingsabsolute positional embeddingvbench degradation
Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions
The paper investigates hard-label delivery methods as alternatives to soft-label training for handling annotator disagreement, focusing on multipass and stochastic label sampling (SLS). Multipass cycles through observed votes while maintaining dataset size, whereas SLS samples one label per example per epoch. On CIFAR-10H, hard-label methods outperform soft-label training when few annotations are available, particularly when the empirical target diverges from the full annotator distribution. Both methods match soft-label performance when full distributions are accessible. Hard-label delivery converges to flatter basins, supported by OOD detection on SVHN and CIFAR-100. Multipass is recommended as a default when vote counts are available, while SLS offers a lightweight competitive alternative.
hard-label deliverymultipassstochastic label samplingannotator disagreementsoft-label training
Time-Dependent PDE-Constrained Optimization via Weak-Form Latent Dynamics
We introduce a weak-form latent-space reduced-order modeling framework for accelerating gradient-based optimization of time-dependent PDE-constrained systems. The method leverages Weak-form Latent Space Dynamics Identification (WLaSDI) to compress high-dimensional PDE solutions into low-dimensional latent representations and identify parametric latent dynamics via weak-form system identification, avoiding explicit numerical differentiation for improved noise robustness. The framework enables scalable gradient evaluation through direct-sensitivity and adjoint-based approaches. Evaluated on thermal radiative transfer, Vlasov-Poisson, and Burgers equation benchmarks, the method achieves accurate optimal designs with computational speedups up to five orders of magnitude while maintaining robustness to noisy training data.
weak-formlatent dynamicspde-constrained optimizationsystem identificationadjoint-based
The General Theory of Localization Methods
The paper introduces a general machine learning framework called the localization method, grounded in localization kernels and local means, which underpins self-attention mechanisms. It formally defines the framework through two pillars: the local(-ized) model formulation and the localization trick. The study systematically connects the localization method to diverse models, including kernel methods, MeanShift, Hopfield networks, and Transformers, demonstrating its unifying theoretical significance and practical applicability. Advanced extensions like adaptive kernels and hierarchical local models are explored, revealing the framework's ability to generalize state-of-the-art architectures and offer new tools for designing data-adaptive learning systems.
localization kernelslocal meansself-attentionadaptive kernelshierarchical local models
Dynamic Shapley Computation
The paper introduces D-Shap, a dynamic framework for efficient Shapley value computation in evolving datasets. By modeling Shapley values as a player-by-task matrix and exploiting utility/coalition locality, D-Shap enables incremental updates via structure-aware interpolation and localized matrix modifications. The method includes self-valuation to construct the initial matrix without pre-specified tasks, using scalable subset reuse. Experiments demonstrate millisecond-level task updates and up to 1000× speedup for player updates while maintaining valuation quality comparable to full recomputation.
shapley valuesdynamic valuationmatrix maintenanceutility localitycoalition locality
SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front
The paper introduces SURF (Sampling Uniformly along the PaReto Front), a method to achieve uniform Pareto front (PF) coverage in multi-objective optimization via scalarization weight selection. Through geometric analysis, SURF derives a cumulative distribution function (CDF) map that inverts non-uniform traversal speeds induced by weight variation, enabling principled weight sampling. Theoretical analysis shows linear convergence to a finite-sampling floor under specified conditions. Empirical validation on bi-objective bandits, multi-objective-gymnasium, and LLM alignment tasks demonstrates SURF's superior PF coverage uniformity compared to baselines.
scalarizationpareto frontmulti-objective optimizationcumulative distribution functionweight sampling
Matryoshka Concept Bottleneck Models
(No summary returned.)
Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning
The paper introduces a novel approach for offline goal-conditioned reinforcement learning (GCRL) that enables compositional generalization through analogy transduction. The method formalizes analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts, proposing a specialized analogy representation that captures optimal task execution changes while remaining invariant to contextual variations. This representation addresses the challenge of generalizing to unseen analogy-context pairs. Empirical evaluation on OGBench manipulation environments demonstrates significant performance improvements over prior methods lacking analogy transduction, achieving superior goal-reaching capabilities in novel contexts.
analogy transductiongoal-conditioned reinforcement learningcompositional generalizationtask-endogenous analogiesoffline learning
Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System
The study proposes a mechanistic interpretability framework for learning assurance in vision-based aircraft landing systems, addressing EASA's requirement for data-driven aviation systems to monitor their situation representation. A vision transformer model is trained on the LARDv2 dataset for runway keypoint regression, with per-patch embeddings decomposed into interpretable atoms using K-SVD sparse dictionary learning. Qualitative analysis shows contentful atoms track task-relevant runway structure while stylistic atoms capture domain-specific appearance, with the regression head predominantly weighting contentful atoms. The framework introduces out-of-model-scope detection, a runtime assurance approach monitoring the model's situation representation, complementing operational design domain and output-space out-of-distribution monitoring.
mechanistic interpretabilityvision transformerk-svdout-of-model-scope detectionlearning assurance
Unsupervised clustering and classification of upper limb EMG signals during functional movements: a data-driven
A data-driven pipeline for unsupervised clustering and classification of upper-limb surface electromyography (sEMG) signals during functional movements was developed and evaluated on the NINAPRO DB4 dataset. The four-stage methodology included signal preprocessing (low-pass filtering, Hilbert transformation), feature extraction (26 temporal/frequency metrics reduced to 5 key features), hierarchical gesture clustering using Mahalanobis distance, and comparative model evaluation. Optimal temporal segmentation was identified at 200ms. Automated model comparison via PyCaret selected Extra Trees and Artificial Neural Networks as top performers, demonstrating stable generalization and progressive learning capabilities. The approach enables adaptive, low-latency control strategies for myoelectric prostheses.
surface electromyographyhierarchical clusteringmahalanobis distancetemporal segmentationmyoelectric prostheses
ReversedQ: Opportunities for Faster Q-Learning in Episodic Online Reinforcement Learning
The paper introduces ReversedQ, an enhanced Q-learning method for finite-horizon episodic Markov Decision Processes with stationary dynamics, addressing delayed learning in model-free posterior-sampling approaches. ReversedQ optimizes three aspects: value-function update order, update frequencies, and initialization, building upon RandomizedQ. Empirical evaluations demonstrate significant improvements in scaled mean cumulative reward: from 9.53% to 78.78% in the Bidirectional Diabolical Combination Lock and from 21.76% to 61.81% in a chain MDP.
q-learningmarkov decision processesvalue-functionposterior-samplingfinite-horizon
TriForces: Augmenting Atomistic GNNs for Transferable Representations
TriForces introduces a model-agnostic three-stream framework for atomistic graph neural networks (GNNs) that separates composition and structure information, augmented with self-supervised learning to preserve transferable representations. The approach enhances performance on MatBench and QM9 benchmarks without requiring Density Functional Theory (DFT) labels and enables efficient similar structure retrieval through its learned latent space. On OMat24 in limited-data regimes, TriForces reduces energy mean absolute error (MAE) by 57% at 20K samples and improves force MAE across sample sizes. Pretrained variants across multiple machine learning interatomic potential (MLIP) architectures are released.
graph neural networksself-supervised learningtransfer learninginteratomic potentialsdensity functional theory
Deep Learning Surrogates for Emulating Stochastic Climate Tipping Dynamics
The authors propose a dynamics-informed Temporal Fusion Transformer (TFT) as a differentiable surrogate model for computationally intensive Earth system simulations, specifically targeting stochastic climate tipping dynamics. The architecture incorporates modifications to handle multivariate time series (up to 21 non-stationary variables) and static covariates, optimizing for forecasting tip events such as Atlantic and Pacific collapses. The surrogate achieves a 465x computational speedup over numerical simulations while accurately predicting transition timing and capturing stochastic uncertainty across ensemble predictions. Results demonstrate high fidelity in anticipating tipping events over thousands of time steps.
temporal fusion transformerstochastic uncertaintymultivariate time seriesclimate tipping dynamicsdifferentiable surrogate
Group-Aware Matrix Estimation and Latent Subspace Recovery
The authors propose Group-Aware Matrix Estimation (GAME), a convex estimator for overlapping subgroup-wise low-rank matrix completion that preserves subgroup-specific latent structure. GAME employs overlapping nuclear-norm penalties on category-specific submatrices, enabling information sharing across related groups while maintaining local geometries in a shared coordinate system. Theoretical analysis provides finite-sample guarantees for reconstruction error and subgroup-specific subspace recovery, dependent on sampling density, subgroup rank, and overlap structure. Empirical evaluation on synthetic, recommendation, ecological, and neuroscience datasets demonstrates GAME's superiority in structured missingness regimes, outperforming global low-rank, side-information, and modern imputation baselines, particularly when subgroups exhibit distinct low-rank structure.
matrix completionnuclear-normlatent subspacestructured missingnesssubgroup-specific
Spectral bandits for smooth graph functions with applications in recommender systems
The paper introduces spectral bandits for smooth graph functions, addressing online learning problems with graph-structured payoffs, particularly in content-based recommendation systems. The authors propose two algorithms that leverage the notion of effective dimension, which remains small in real-world graphs, ensuring linear scaling. These algorithms aim to minimize cumulative regret without poor scaling with the number of nodes. Experimental results demonstrate that user preferences for thousands of items can be accurately estimated from evaluations of just tens of nodes, validating the approach's efficiency in real-world recommendation tasks.
spectral banditssmooth graph functionseffective dimensioncumulative regretcontent-based recommendation
Sample Complexity of Transfer Learning: An Optimal Transport Approach
The paper establishes a theoretical foundation for the sample efficiency of transfer learning by analyzing its sample complexity through an optimal transport framework. For data dimensions d > 3, transfer learning achieves a sample complexity of O(m^{-(α+1)/d}), where α represents the smoothness of the data distribution, compared to O(m^{-p/d}) for direct learning. This demonstrates superior sample efficiency when optimizing over less smooth models, particularly in data-scarce scenarios. Empirical validation on image classification tasks confirms that transfer learning significantly enhances model performance in low-data regimes.
sample complexityoptimal transporttransfer learningsmoothnessdata-scarce
OpenSeisML: Open Large-Scale Real Seismic and well-log Dataset for Generative AI
OpenSeisML introduces a large-scale open dataset of real seismic and well-log data curated from the UK National Data Repository, addressing the scarcity of realistic velocity models for machine learning in seismic inversion. The dataset employs an automated curation pipeline that ensures reproducibility, utilizing checkshot data for time-to-depth conversion and interpolation to construct velocity models. This enables accurate post-stack seismic data conversion and supports generative AI workflows for synthesizing statistically consistent subsurface property realizations. The primary goal is to train generative models that capture subsurface statistical distributions, facilitating uncertainty quantification and serving as priors for seismic inversion processes.
seismic inversiongenerative aivelocity modeltime-depth conversionuncertainty quantification
Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates
Ada2MS introduces a hybrid optimization algorithm that bridges AdamW and momentum SGD paradigms through exponential interpolation between elementwise and global second-moment estimates. By dynamically balancing AdamW's robustness and momentum SGD's generalization potential, Ada2MS achieves smooth transitions between these behaviors. Evaluated on visual tasks under a unified optimizer-comparison protocol, Ada2MS demonstrates competitive performance. The method addresses limitations in gradient-scale sensitivity and hyperparameter tuning while maintaining stability. Code availability is announced for reproducibility.
optimization algorithmadamwmomentum sgdsecond-moment estimatesexponential interpolation
Pseudo-Formalization for Automatic Proof Verification
The paper introduces Pseudo-Formalization (PF), a proof format combining natural language flexibility with formal proof modularity for reliable AI-generated proof verification. PF decomposes proofs into self-contained modules with premises, conclusions, and natural language proofs, verified via Block Verification (BV) using LLMs. Evaluated on olympiad and research-level benchmarks (including ArxivMathGradingBench), PF+BV outperforms LLM-as-judge baselines in error-finding precision and recall.
pseudo-formalizationproof verificationblock verificationnatural language proofsformal proofs
An exponential mechanism based on quadratic approximations for fine-tuning machine learning models with privacy guarantees
The authors propose a differentially private fine-tuning algorithm using an exponential mechanism with quadratic approximations. The method constructs a utility function combining local quadratic approximations of pretrained models with new dataset information, enabling exact sampling from multivariate normal distributions. Theoretical privacy guarantees, sensitivity bounds, and accuracy estimations are provided, alongside a random-projection strategy for scalability. Experiments on MNIST and MIMIC datasets show competitive performance against existing private fine-tuning techniques.
differential privacyexponential mechanismquadratic approximationfine-tuningmultivariate normal distribution
Online Conformal Prediction with Corrupted Feedback
This paper introduces robust online conformal prediction (OCP) methods to maintain calibration guarantees under corrupted feedback, addressing scenarios where miscoverage indicators are compromised by noise, communication failures, or adversarial manipulation. Two approaches are proposed: robust OCP via filtering, which leverages structural properties of predicted thresholds to filter corrupted feedback, and robust OCP via active compensation, which incorporates a mechanism to mitigate feedback corruption effects. Theoretical miscoverage guarantees are established for both methods, specialized for independent stochastic flip and arbitrary error models with memory bounds. Empirical evaluations on real-world datasets demonstrate improved calibration and smaller prediction sets compared to baseline OCP methods under corrupted feedback.
online conformal predictioncorrupted feedbackmiscoverage guaranteesrobust filteringactive compensation
Fast Reconstruction of Exact Maxwell Dynamics from Sparse Data
FLASH-MAX is introduced as a shallow, exact-by-construction neural network for predicting homogeneous electromagnetic fields from sparse observations. The architecture encodes each hidden neuron as an exact solution to Maxwell's equations, ensuring symbolic satisfaction of governing PDEs while enabling end-to-end training. Theoretical analysis proves universal approximation capability on arbitrary domains. Experiments demonstrate sub-1% relative validation error from ~1K sparse 3D observations (training in seconds) and single-digit errors with only 100 samples, highlighting improved precision-optimization trade-offs by embedding physical structure into the hypothesis class.
maxwell's equationsexact-by-constructionsparse datauniversal approximationpde residual
Reinforcing Human Behavior Simulation via Verbal Feedback
The paper introduces DITTO, a reinforcement learning model that treats verbal feedback as a primary signal for improving human behavior simulation in LLMs. DITTO generates feedback-conditioned rollouts optimized jointly with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. The authors also present SOUL, a benchmark suite with 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves a 36% average improvement over the base model and outperforms GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating the efficacy of RL with verbal feedback.
dittogrposoulverbal feedbackrollout
A 10,000-Year Global Stochastic Tropical Cyclone Catalog with Wind-Dependent Track Transitions (WHITS)
WHITS introduces a non-parametric semi-Markov track generator for tropical cyclones, extending the HITS framework by conditioning track transitions on local wind speed, position, age, and forward vector. The method sharpens kernel selection to suppress dynamically inconsistent jumps and applies smoothing to eliminate discontinuities. Trained on IBTrACS data spanning 1851-present, the 10,000-year synthetic catalog replicates observed track density and wind-hit probabilities across six basins, targeting catastrophe-risk applications requiring physically plausible tracks.
non-parametricsemi-markovtrack generatoribtracscatastrophe-risk
📰 Industry Media (6)
Scaling creativity in the age of AI
The article examines AI's role in scaling creative content production amid rising demand, citing a 5x projected growth in content needs. It highlights Adobe Firefly Custom Models and Foundry as key tools for brand-aligned asset generation, reporting 50% workflow cycle time reductions at Nestlé. Methodologically, it emphasizes responsible AI integration via workflow audits, iterative automation of repetitive tasks (e.g., asset resizing), and governance frameworks for model training transparency. Results include 94% of creatives saving 17 weekly hours with AI assistance and 4,700% growth in AI-powered shopping, underscoring the shift toward agentic content discovery.
generative aiworkflow automationcontent provenanceagentic webcustom models
Anthropic’s Code with Claude showed off coding’s future—whether you like it or not
Anthropic’s Code with Claude event showcased the rapid adoption of AI-driven coding tools, particularly Claude Code, which now writes most of Anthropic’s software. The system employs self-prompting mechanisms and a 'dreaming' feature, where agents consolidate task-specific notes to improve codebase understanding and error correction. Attendees reported widespread use of Claude-generated pull requests, with many developers bypassing code review entirely. Despite concerns over code quality and security, Anthropic emphasizes adherence to traditional software development practices while aiming for full automation, positioning Claude as equivalent to a midlevel engineer. The event highlighted industry-wide shifts toward AI-assisted coding, with companies like Spotify and Delivery Hero integrating Claude into their workflows.
claude codepull requestsself-promptingdreaming featurecodebase
One Model, Three Modalities: ByteDance Releases Lance for Image and Video Understanding, Generation, and Editing
ByteDance introduces Lance, a 3B-parameter unified multimodal model that jointly handles image/video understanding, generation, and editing through a dual-stream mixture-of-experts architecture. The model employs Modality-Aware Rotary Positional Encoding (MaPE) to process interleaved text, semantic visual tokens, and VAE latent representations in a shared sequence while decoupling understanding (LLMUND) and generation (LLMGEN) pathways. Lance achieves state-of-the-art performance among unified models, scoring 0.90 on GenEval (image generation), 85.11 on VBench (video generation), and 62.0 on MVBench (video understanding), trained with a four-stage curriculum using up to 128 GPUs.
modality-aware rotary positional encodingdual-stream mixture-of-expertscontinuous latent representationsinterleaved multimodal sequenceflow matching objective
What is a Forward Deployed Engineer: The AI Role OpenAI, Anthropic, and Google Are Hiring in 2026
The Forward Deployed Engineer (FDE) role, pioneered by Palantir, addresses the deployment gap in complex AI systems by embedding engineers within client environments to bridge domain-specific and AI expertise. FDEs implement production-ready solutions, focusing on prompt architecture, retrieval-augmented generation (RAG) pipelines, evaluation frameworks, agent development, and production observability. Operational evidence from Palantir shows 85% year-over-year revenue growth, driven by sticky, high-value contracts. OpenAI and Anthropic have formalized FDE initiatives, with OpenAI’s Deployment Company raising $4 billion and Anthropic’s enterprise joint venture valued at $1.5 billion, aiming to scale enterprise AI adoption.
forward deployed engineerretrieval-augmented generationprompt architectureevaluation frameworksagent development
Meet Turbovec: A Rust Vector Index with Python Bindings, and Built on Google’s TurboQuant Algorithm
Turbovec introduces a Rust-based vector index with Python bindings, leveraging Google Research's TurboQuant algorithm for efficient vector quantization. TurboQuant employs a data-oblivious approach, utilizing random rotations and Lloyd-Max scalar quantization to achieve near-optimal distortion rates without codebook training. Benchmarks demonstrate that Turbovec compresses a 10-million-document corpus from 31 GB to 4 GB, outperforming FAISS IndexPQFastScan by 12–20% on ARM hardware and achieving comparable recall rates. The library supports integration with LangChain, LlamaIndex, and Haystack, offering a scalable solution for retrieval-augmented generation pipelines.
vector quantizationturboquantfaissretrieval-augmented generationsimd intrinsics
Nvidia’s Vera chip is the US$200 billion bet Jensen Huang doesn’t want you to overlook
Nvidia's Vera central processor targets a $200 billion inference market distinct from its $1 trillion AI GPU projections, with expected revenue of $20 billion by FY2024. Developed using Groq's inference-specialized technology, Vera addresses competition from custom silicon (Google TPU, Amazon Trainium) and CPU-based inference (Intel, AMD). Despite supply constraints acknowledged by CEO Jensen Huang, Nvidia increased supply commitments to $119 billion in Q1. The Vera-Rubin platform combines Vera CPUs with Rubin GPUs for inference workloads, positioning Nvidia beyond GPU-dominated training markets.
inference workloadscustom siliconsupply chaincentral processorgpu dominance
Generated automatically at 2026-05-21 21:24 UTC. Summaries and keywords are produced by an LLM and may contain inaccuracies — always consult the original article.
